November 24, 2012.
I needed to extract titles, canonical URLs, descriptions and images from HTML pages,
and decided to split that functionality out into its own library.
Then I decided to actually go through the process of uploading it to PyPI
(a first for me), and the result is extraction.
Simple content extraction is pretty routine, but it comes up frequently, and it often explodes into a hard-to-maintain mess when it evolves from scratch instead of being designed around a framework for processing multiple extraction techniques.
extraction's framework turns out to be a bit useful
(the current techniques themselves are quite bare-bones, but should become
more robust as I use the library a bit more).
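To make the "framework for multiple techniques" idea concrete, here is a minimal sketch using only the standard library. It is not extraction's actual code: the names `HeadCollector`, `open_graph_technique`, `head_tags_technique` and `run_techniques`, and the dict-of-lists return shape, are all assumptions made up for illustration.

```python
from html.parser import HTMLParser


class HeadCollector(HTMLParser):
    """Collect <meta> attributes and <title> text in a single pass."""

    def __init__(self):
        super().__init__()
        self.metas = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            self.metas.append(dict(attrs))
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse_head(html):
    collector = HeadCollector()
    collector.feed(html)
    collector.close()
    return collector


def open_graph_technique(html):
    """Hypothetical technique: read og:title and og:description meta tags."""
    found = {"titles": [], "descriptions": []}
    for meta in parse_head(html).metas:
        content = meta.get("content")
        if not content:
            continue
        if meta.get("property") == "og:title":
            found["titles"].append(content)
        elif meta.get("property") == "og:description":
            found["descriptions"].append(content)
    return found


def head_tags_technique(html):
    """Hypothetical fallback: <title> and <meta name="description">."""
    head = parse_head(html)
    found = {"titles": [], "descriptions": []}
    if head.title.strip():
        found["titles"].append(head.title.strip())
    for meta in head.metas:
        if meta.get("name") == "description" and meta.get("content"):
            found["descriptions"].append(meta["content"])
    return found


def run_techniques(html, techniques):
    """Run each technique in order and merge results, dropping duplicates."""
    merged = {}
    for technique in techniques:
        for key, values in technique(html).items():
            bucket = merged.setdefault(key, [])
            for value in values:
                if value not in bucket:
                    bucket.append(value)
    return merged
```

Because the techniques run in order, candidates from earlier techniques land first in each list, so a caller can treat the first entry as the best guess and keep the rest as alternatives. Adding a new technique is just adding another function to the list, which is the part that tends to turn into a mess when extraction code grows without a framework.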
The README on GitHub is more in-depth than anything I'll write here, so I'll just do a quick example.
pip install extraction
pip install requests
pip install html5lib
Then let's play around a bit.
>>> import extraction, requests
>>> ext = extraction.Extractor()
>>> url = "http://www.cnn.com/2012/11/23/politics/fiscal-cliff/index.html"
>>> x = ext.extract(requests.get(url).text, source_url=url)
>>> x
<Extracted: (title: 'Some Republicans move away from no-tax pledge', 3 more), (url: 'http://www.cnn.com/2012/11/23/politics/fiscal-clif'), (image: 'http://i2.cdn.turner.com/cnn/dam/assets/1211', 7 more), (feed: 'http://rss.cnn.com/rss/cnn_politics.rss'), (description: 'Nothing riles up the tea party chatter', 5 more)>
>>> x.title
u'Some Republicans move away from no-tax pledge - CNN.com'
>>> x.description
u'Nothing riles up the tea party chattering class...'
>>> x.images
[u'http://i2.cdn.turner.com/cnn/dam/assets/...',
 u'http://i.cdn.turner.com/cnn/images/1.gif'
 u"and five more..."
 ]
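One reason to pass `source_url` alongside the HTML is that pages often reference images and canonical links with relative paths, which are useless without the page's own address. The sketch below shows how relative URLs can be resolved against a source URL with the standard library's `urljoin`. The helper name `absolutize` is hypothetical, and this is my assumption about what such a parameter is good for, not a copy of the library's internals.

```python
from urllib.parse import urljoin


def absolutize(urls, source_url=None):
    """Resolve each URL against source_url; leave them untouched without one."""
    if source_url is None:
        return list(urls)
    return [urljoin(source_url, url) for url in urls]


page = "http://www.cnn.com/2012/11/23/politics/fiscal-cliff/index.html"
candidates = [
    "/images/1.gif",                             # root-relative path
    "photo.jpg",                                 # relative to the page's directory
    "http://i.cdn.turner.com/cnn/images/1.gif",  # already absolute
]
for url in absolutize(candidates, page):
    print(url)
```

Already-absolute URLs pass through unchanged, so the same helper can be applied to every candidate without checking which ones need rewriting.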
Let's try against the GitHub repository for extraction:
>>> url = "https://github.com/lethain/extraction"
>>> x = ext.extract(requests.get(url).text, source_url=url)
>>> x
<Extracted: (title: 'extraction', 7 more), (url: 'https://github.com/lethain/extraction'), (image: 'https://a248.e.akamai.net/assets.githu...', 4 more), (description: 'extraction - A Python library for ex...', 6 more)>
>>> x.titles
[u'extraction',
 u'Extraction',
 u'lethain/extraction \xb7 GitHub',
 u'public lethain / extraction',
 u"and four more..."
 ]
Both of those work pretty well (in large part because both sites implement Facebook Open Graph tags), but this is a new library, so it'll undoubtedly perform poorly on a variety of sites. For example, it's really choking on PyPI right now, which I'll go figure out:
>>> url = "http://pypi.python.org/pypi/extraction/0.1.0"
>>> x = ext.extract(requests.get(url).text, source_url=url)
>>> x
<Extracted: >
We'll see how it evolves.