Extraction: Get Metadata from HTML Documents

Published on November 23, 2012.

I needed to extract titles, canonical URLs, descriptions and images from HTML pages, and decided to split that functionality out into its own library. Then I decided to actually go through the process of uploading it to PyPI (a first for me), and the result is extraction.

Simple content extraction is pretty routine, but it comes up frequently and often explodes into a hard-to-maintain mess when it evolves from scratch instead of being designed around a framework for running multiple extraction techniques.

Hopefully extraction’s framework turns out to be useful. The current techniques themselves are quite bare-bones, but they should become more robust as I use the library more.
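
To make the "framework of techniques" idea concrete, here is a minimal sketch using only the standard library. It is not extraction's actual code: the names TitleTagTechnique, MetaDescriptionTechnique and run_techniques are made up for illustration, and it is written for Python 3. Each technique returns a dictionary of lists, and a small runner merges them in priority order.

```python
from html.parser import HTMLParser


class TitleTagTechnique(HTMLParser):
    """Hypothetical technique: collect the text of <title> tags."""

    def extract(self, html):
        self.reset()
        self._in_title = False
        self._titles = []
        self.feed(html)
        self.close()
        return {"titles": [t.strip() for t in self._titles if t.strip()]}

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
            self._titles.append("")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self._titles[-1] += data


class MetaDescriptionTechnique(HTMLParser):
    """Hypothetical technique: read <meta name="description" content="...">."""

    def extract(self, html):
        self.reset()
        self._descriptions = []
        self.feed(html)
        self.close()
        return {"descriptions": self._descriptions}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") == "description":
            if attrs.get("content"):
                self._descriptions.append(attrs["content"])


def run_techniques(html, techniques):
    """Run each technique in order and merge results, dropping duplicates.

    Earlier techniques win the first slot in each list, so ordering the
    techniques from most to least trusted gives a sensible "best" value.
    """
    merged = {}
    for technique in techniques:
        for key, values in technique.extract(html).items():
            bucket = merged.setdefault(key, [])
            for value in values:
                if value not in bucket:
                    bucket.append(value)
    return merged
```

The appeal of this shape is that adding a new heuristic means writing one more small class rather than threading another special case through a single parsing function.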

The README on GitHub is more in-depth than anything I’ll write here, so I’ll just do a quick example.

First, install extraction, requests and html5lib from PyPI:

pip install extraction
pip install requests
pip install html5lib

Then let’s play around a bit.

>>> import extraction, requests
>>> ext = extraction.Extractor()
>>> url = "http://www.cnn.com/2012/11/23/politics/fiscal-cliff/index.html"
>>> x = ext.extract(requests.get(url).text, source_url=url)
>>> x
<Extracted: (title: 'Some Republicans move away from no-tax pledge', 3 more),
            (url: 'http://www.cnn.com/2012/11/23/politics/fiscal-clif'),
            (image: 'http://i2.cdn.turner.com/cnn/dam/assets/1211', 7 more),
            (feed: 'http://rss.cnn.com/rss/cnn_politics.rss'),
            (description: 'Nothing riles up the tea party chatter', 5 more)>
>>> x.title
u'Some Republicans move away from no-tax pledge - CNN.com'
>>> x.description
u'Nothing riles up the tea party chattering class...'
>>> x.images
[u'http://i2.cdn.turner.com/cnn/dam/assets/...',
 u'http://i.cdn.turner.com/cnn/images/1.gif',
 u"and five more..."
]

Let’s try it against the GitHub repository for extraction:

>>> url = "https://github.com/lethain/extraction"
>>> x = ext.extract(requests.get(url).text, source_url=url)
>>> x
<Extracted: (title: 'extraction', 7 more),
            (url: 'https://github.com/lethain/extraction'),
            (image: 'https://a248.e.akamai.net/assets.githu...', 4 more),
            (description: 'extraction - A Python library for ex...', 6 more)>
>>> x.titles
[u'extraction',
 u'Extraction',
 u'lethain/extraction \xb7 GitHub',
 u'public lethain / extraction',
 u"and four more..."
]

Both of those work pretty well, in large part because both sites implement Facebook’s Open Graph tags, but this is a new library, so it’ll undoubtedly perform poorly on a variety of sites. For example, it’s really choking on PyPI right now, which I’ll go figure out:

>>> url = "http://pypi.python.org/pypi/extraction/0.1.0"
>>> x = ext.extract(requests.get(url).text, source_url=url)
>>> x
<Extracted: >
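
For reference, Open Graph tags are just meta elements whose property attribute starts with "og:". Below is a rough standard-library sketch of reading them (Python 3). The OpenGraphParser class and extract_open_graph function are hypothetical names for illustration, not extraction's API. Any single technique like this returns nothing for a page that lacks the tags it looks for, which is why fallback techniques matter; I'm not claiming that is what is going wrong with PyPI.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..." content="..."> values."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        prop = attrs.get("property") or ""
        content = attrs.get("content")
        if prop.startswith("og:") and content:
            # Strip the "og:" prefix and keep every value in document order.
            self.properties.setdefault(prop[3:], []).append(content)


def extract_open_graph(html):
    """Return a dict such as {"title": [...], "image": [...]}; empty if no tags."""
    parser = OpenGraphParser()
    parser.feed(html)
    parser.close()
    return parser.properties
```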

We’ll see how it evolves.