Extraction: Get Metadata from HTML Documents
I needed to extract titles, canonical URLs, descriptions, and images from HTML pages,
and decided to split that functionality out into its own library.
Then I decided to actually go through the process of uploading it to PyPI
(a first for me), and the result is extraction.
Simple content extraction is pretty routine, but it comes up frequently and often explodes into a hard-to-maintain mess when it evolves ad hoc instead of being designed around a framework for combining multiple extraction techniques.
Hopefully extraction's framework turns out to be useful
(the current techniques themselves are quite barebones, but should become
more robust as I use the library more).
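To illustrate the framework idea, here is a minimal sketch of a technique-based extractor, using only the standard library. This is a hypothetical illustration of the pattern, not extraction's actual internals: each technique collects candidate values from the HTML, and the extractor runs the techniques in priority order and concatenates their candidates, so higher-priority sources (like Open Graph tags) come first.

```python
from html.parser import HTMLParser


class MetaTechnique(HTMLParser):
    """Collects Open Graph <meta property="og:*"> tags (illustrative only)."""
    def __init__(self):
        super().__init__()
        self.results = {"titles": [], "descriptions": []}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("property") == "og:title":
            self.results["titles"].append(attrs.get("content", ""))
        elif tag == "meta" and attrs.get("property") == "og:description":
            self.results["descriptions"].append(attrs.get("content", ""))


class TitleTagTechnique(HTMLParser):
    """Falls back to the plain <title> tag."""
    def __init__(self):
        super().__init__()
        self.results = {"titles": [], "descriptions": []}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        self._in_title = tag == "title"

    def handle_endtag(self, tag):
        self._in_title = False

    def handle_data(self, data):
        if self._in_title and data.strip():
            self.results["titles"].append(data.strip())


def extract(html, techniques=(MetaTechnique, TitleTagTechnique)):
    """Run each technique in priority order and merge their candidates."""
    merged = {"titles": [], "descriptions": []}
    for technique_cls in techniques:
        parser = technique_cls()
        parser.feed(html)
        for key, values in parser.results.items():
            merged[key].extend(v for v in values if v)
    return merged


html = ('<head><meta property="og:title" content="OG Title">'
        '<title>Tag Title</title></head>')
print(extract(html)["titles"])  # ['OG Title', 'Tag Title']
```

The payoff of this structure is that adding support for a new site or markup convention means writing one more technique class, not threading new special cases through a monolithic parser.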
The README on GitHub is more in-depth than anything I'll write here, so I'll just do a quick example.
First, install extraction, requests, and html5lib from PyPI:
pip install extraction
pip install requests
pip install html5lib
Then let’s play around a bit.
>>> import extraction, requests
>>> ext = extraction.Extractor()
>>> url = "http://www.cnn.com/2012/11/23/politics/fiscal-cliff/index.html"
>>> x = ext.extract(requests.get(url).text, source_url=url)
>>> x
<Extracted: (title: 'Some Republicans move away from no-tax pledge', 3 more),
(url: 'http://www.cnn.com/2012/11/23/politics/fiscal-clif'),
(image: 'http://i2.cdn.turner.com/cnn/dam/assets/1211', 7 more),
(feed: 'http://rss.cnn.com/rss/cnn_politics.rss'),
(description: 'Nothing riles up the tea party chatter', 5 more)>
>>> x.title
u'Some Republicans move away from no-tax pledge - CNN.com'
>>> x.description
u'Nothing riles up the tea party chattering class...'
>>> x.images
[u'http://i2.cdn.turner.com/cnn/dam/assets/...',
 u'http://i.cdn.turner.com/cnn/images/1.gif',
 u"and five more..."
]
Let's try it against the GitHub repository for extraction:
>>> url = "https://github.com/lethain/extraction"
>>> x = ext.extract(requests.get(url).text, source_url=url)
>>> x
<Extracted: (title: 'extraction', 7 more),
(url: 'https://github.com/lethain/extraction'),
(image: 'https://a248.e.akamai.net/assets.githu...', 4 more),
(description: 'extraction - A Python library for ex...', 6 more)>
>>> x.titles
[u'extraction',
u'Extraction',
u'lethain/extraction \xb7 GitHub',
u'public lethain / extraction',
u"and four more..."
]
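Notice that the singular x.title and the plural x.titles coexist: every extracted field keeps its full candidate list, and the singular form is just the highest-priority candidate. Here is a plausible sketch of how that can work; this is an illustration I wrote, not extraction's actual code.

```python
class Extracted:
    """Hypothetical result object: candidate lists back singular properties."""
    def __init__(self, titles=(), descriptions=(), images=()):
        self.titles = list(titles)
        self.descriptions = list(descriptions)
        self.images = list(images)

    @property
    def title(self):
        # the highest-priority candidate wins; None when nothing was found
        return self.titles[0] if self.titles else None

    @property
    def description(self):
        return self.descriptions[0] if self.descriptions else None


x = Extracted(titles=["extraction", "lethain/extraction \xb7 GitHub"])
print(x.title)        # 'extraction'
print(Extracted().title)  # None, when no technique found anything
```

Keeping the full candidate list around means callers who disagree with the default priority can pick a different candidate without re-extracting.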
Both of those work pretty well (in large part because both sites implement Facebook Open Graph tags), but this is a new library, so it'll undoubtedly perform poorly on a variety of sites. For example, it's really choking on PyPI right now, which I'll go figure out:
>>> url = "http://pypi.python.org/pypi/extraction/0.1.0"
>>> x = ext.extract(requests.get(url).text, source_url=url)
>>> x
<Extracted: >
We’ll see how it evolves.