Python Content Scraper for OneManga.com
I spent a couple of hours today writing a Python screen scraper for OneManga.com, a website that aggregates manga scans online. My connection is slow at times, so I wanted a convenient way to grab the images without a lot of work. It was also a fun chance to work with BeautifulSoup and httplib2. The implementation focuses on two ideas: usability and kindness. Usability as in giving users an easy handle on the available data, and kindness as in not pounding the server with unnecessary requests.
Let's start with an example use case:
>>> from onemanga_reader import OMReader
>>> x = OMReader()
>>> x
OMReader(http://onemanga.com/)
>>> pluto = x['Pluto']
>>> pluto
OMSeries(Pluto)
>>> len(pluto)
55
>>> pluto
OMSeries(Pluto, 55 episodes)
>>> ep = pluto[-1]
>>> ep
OMEpisode(Pluto 55)
>>> urls = [ page.url for page in ep[:5] ]
>>> ep
OMEpisode(Pluto 55, 26 pages)
>>> ep[0].show() # open 0th page in webbrowser
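The show() call at the end is just a convenience: a page object that knows its image URL can hand it straight to Python's standard webbrowser module. A minimal sketch, with attribute names that are my own rather than the file's:
import webbrowser

class OMPage(object):
    def __init__(self, url):
        self.url = url             # in the real reader this is resolved lazily, one request per page

    def show(self):
        webbrowser.open(self.url)  # open the page image in the default browser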
All the objects in the scraper work with the Python list API. However, the OMReader object also responds to dict-like queries:
>>> x["Pluto"]
OMSeries(Pluto, 55 episodes)
>>> x[155]
OMSeries(D.Gray-Man)
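Getting both behaviours out of one object only takes a __getitem__ that dispatches on the type of the key. A minimal sketch of the idea (the attribute and helper names are illustrative, not the ones in the actual file):
class OMReader(object):
    def __getitem__(self, key):
        self._load_series_list()            # lazy: fetch the front page only on first access
        if isinstance(key, basestring):     # dict-like: look the series up by name
            return self._series_by_name[key]
        return self._series_list[key]       # list-like: index or slice into the series list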
This allows a great deal of flexibility in accessing the data. You can iterate through the list of all available manga, or retrieve a specific one by name. You can also iterate through a manga's list of episodes, and through an episode's list of pages. For example, if you wanted the URL of every picture of every manga on the website (although please don't actually do this):
scraper = OMReader()
series = scraper[:]                                   # every series on the site
episodes = reduce(lambda a, b: a[:] + b[:], series)   # flatten series into episodes
pages = reduce(lambda a, b: a[:] + b[:], episodes)    # flatten episodes into pages
urls = map(lambda a: a.url, pages)
Or you might compress that down to...
urls = map(lambda a: a.url, reduce(lambda a, b: a[:] + b[:], reduce(lambda a, b: a[:] + b[:], OMReader()[:])))
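The compressed reduce form is admittedly hard on the eyes; under the same API assumptions it can also be written as a single nested list comprehension that does the same walk:
# same three-level walk: every series, every episode, every page
urls = [ page.url
         for series in OMReader()[:]
         for episode in series[:]
         for page in episode[:] ]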
Either way, you get the idea: it's pretty easy to get hold of the data here. However, the implementation is still fairly kind. When you first create an OMReader it doesn't know anything, nor does it make any HTTP requests. Instead it lazily retrieves the data it needs.
Let's look at this exchange:
>>> x = OMReader()
>>> x
OMReader(http://onemanga.com/)
>>> x[0]
OMSeries(+Anima)
>>> x
OMReader(http://onemanga.com/, 1514 series)
When we first look at x it is a blank slate. Then we retrieve the 0th manga, and to do that it has to fetch the list of available manga. Thus, when we inspect x the second time it knows how many series are available. Fortunately, the list of available manga is present in a select element on every page, so simply retrieving the front page (which is hopefully heavily cached, since it doesn't change often) is sufficient. After that, we know all the available manga and can iterate through them or retrieve them by name:
>>> names = [ a.name for a in x[100:103]]
>>> names
[u'Boku ni Natta Watashi ', u'Boku no Hatsukoi wo Kimi ni Sasagu', u'Boku no Watashi no Yusha Gaku']
>>> x['Bleach']
OMSeries(Bleach)
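A minimal sketch of that front-page fetch, assuming the series list really does live in a single select element (the parsing details and function name here are my assumptions, not lifted from the actual file):
import httplib2
from BeautifulSoup import BeautifulSoup

def fetch_series_list(base_url='http://onemanga.com/'):
    # one cached request for the front page is enough to learn every series
    http = httplib2.Http('.cache')
    response, content = http.request(base_url)
    soup = BeautifulSoup(content)
    select = soup.find('select')          # assumed: the series list is the page's <select>
    return [ (option.string, option['value'])
             for option in select.findAll('option') ]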
So far, all of this has required only one HTTP request. However, let's investigate the Pluto manga a bit (yes, I might be recommending it).
>>> p = x['Pluto']
>>> p
OMSeries(Pluto)
>>> len(p)
55
>>> p
OMSeries(Pluto, 55 episodes)
Whenever new information pops up, that's an indication that an additional HTTP request has been made. Here we had to retrieve Pluto's index page to find its episode list. Now, let's take a look at an actual episode:
>>> a = p[0]
>>> a
OMEpisode(Pluto 1)
>>> len(a)
32
>>> a
OMEpisode(Pluto 1, 32 pages)
>>> a[:4]
[OMPage(Page 01 of Pluto 1), OMPage(Page 02 of Pluto 1), OMPage(Page 03 of Pluto 1), OMPage(Page 04 of Pluto 1)]
Once again new information has appeared, indicating another HTTP request has occurred. Drilling down from the overview to this list of pages for the first episode has taken only three HTTP requests, but from here on out fetching each page requires its own request, so it's important to be considerate. Retrieving the entire first episode of Pluto would require an additional 32 requests. Additionally, the library is naive about HTTP requests and does not issue them in parallel, so fetching a large number of pages can be rather slow.
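If you do want a whole episode on disk, the considerate approach is to fetch the pages one at a time with a pause between requests. A small usage sketch (the filename pattern and the one-second delay are my choices, not anything the library enforces):
import time
import urllib
from onemanga_reader import OMReader

episode = OMReader()['Pluto'][0]
for i, page in enumerate(episode[:]):
    urllib.urlretrieve(page.url, 'pluto-01-%02d.jpg' % (i + 1))
    time.sleep(1)   # be kind: one image request per second is plenty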
And that's all there is to it. It was a fun exercise in (sort of) socially responsible screen scraping. I'm not certain it will get any use, but the code has some mildly interesting points, and at about 120 lines it won't wear you out. You can grab the file here.