Stripping Reddit From HackerNews With BOSS Mashup

July 12, 2008. Filed under python, boss

In a previous article, I wrote some search recipes using Python and Yahoo's BOSS Mashup Framework; now let's start looking at actually doing a mashup. We'll walk through a fairly simple example that represents something of a dream of mine:
YC's Hacker News with all the reposts from Reddit Programming stripped out. In other words, we calculate the intersection between HackerNews and Reddit Programming, remove that intersection from HackerNews, and return the remaining HackerNews content.

Note that you'll need to have the BOSS Mashup Framework installed and a Yahoo BOSS App Id before you get started. I walked through that process here.

Getting Started

First let's look at grabbing HackerNews' RSS feed.

>>> from yos.yql import db,udfs
>>> hn = db.create(name="hn",url="http://news.ycombinator.com/rss")
>>> _hn = db.select(udf=udfs.unnest_value,table=hn)
>>> _hn.rows
[ {
    'hn$link': 'http://scarybuggames.com/2008/05/chronotron', 
    'hn$description': '<a href="http://news.ycombinator.com/item?id=244594">Comments</a>', 
    'hn$title': 'Play this game for 20 minutes - learn about concurrency',
    'hn$comments': 'http://news.ycombinator.com/item?id=244594'
  }, 
  {
    'hn$link': 'http://haydenplanetarium.org/resources/starstruck/manhattanhenge/', 
    'hn$description': '<a href="http://news.ycombinator.com/item?id=244612">Comments</a>',
    'hn$title': "Today is one of the 2 days each year when the sun lines up with Manhattan's grid",
    'hn$comments': 'http://news.ycombinator.com/item?id=244612'
  },
  "Additional results truncated for brevity..."
]

The db.create function retrieves the contents of an RSS feed (and also search results from Yahoo's exposed search APIs) and encapsulates them in an easy-to-manipulate object.

db.select performs a map across all the results; in this case it applies the udfs.unnest_value function to each row and returns the results. (The udfs.unnest_value function tries to collapse nested dictionaries into a single dictionary with all values at the base level.) As a quick example of using db.select, let's write a function that strips the hn$description key out of the dictionaries.
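To make that collapsing behavior concrete, here is a rough plain-Python sketch of what a function like udfs.unnest_value might do. This is an illustration of the idea only, not the framework's actual implementation, and the `$`-joined key naming is my assumption based on the output shown below:

```python
def unnest_value(row, prefix=""):
    """Collapse nested dictionaries into one flat dictionary.

    Nested keys are merged up to the base level, so {'hn': {'link': u}}
    becomes {'hn$link': u}. A sketch of the idea behind udfs.unnest_value;
    the real framework's key-naming rules may differ.
    """
    flat = {}
    for key, value in row.items():
        name = prefix + key
        if isinstance(value, dict):
            # Recurse into nested dictionaries, extending the key prefix.
            flat.update(unnest_value(value, name + "$"))
        else:
            flat[name] = value
    return flat

nested = {"hn": {"link": "http://example.com", "title": "A story"}}
print(unnest_value(nested))
# {'hn$link': 'http://example.com', 'hn$title': 'A story'}
```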

>>> from yos.yql import db,udfs
>>> def strip_key(row, key):
...     if key in row:
...         del row[key]
...     return row
... 
>>> def strip_desc(row):
...     return strip_key(row, "hn$description")
... 
>>> hn = db.create(name="hn",url="http://news.ycombinator.com/rss")
>>> _hn = db.select(udf=udfs.unnest_value,table=hn)
>>> __hn = db.select(udf=strip_desc,table=_hn)
>>> __hn.rows
[
  {
    'hn$link': 'http://scarybuggames.com/2008/05/chronotron',
    'hn$title': 'Play this game for 20 minutes - learn about concurrency',
    'hn$comments': 'http://news.ycombinator.com/item?id=244594'
  },
  "Additional results stripped for brevity."
]

Okay, and now let's grab the RSS feed for Reddit Programming.

>>> rp = db.create(name="rp",url="http://www.reddit.com/r/programming/.rss")
>>> _rp = db.select(udf=udfs.unnest_value,table=rp)
>>> _rp.rows 
[
  {
    'rp$link': 'http://www.reddit.com/goto?rss=true&id=t3_6rfwt',
    'rp$description': '<a href="http://www.gnu.org/fun/jokes/dna">[link]</a> <a href="http://www.reddit.com/r/programming/info/6rfwt/comments/">[comments]</a>',
    'rp$title': 'Human DNA in C code',
    'rp$date': '2008-07-12T11:43:35.936253+00:00',
    'rp$guid': 'http://www.reddit.com/goto?rss=true&id=t3_6rfwt'
  }, 
  "Additional results stripped for brevity."
]

Now we want to strip all the Reddit Programming results from the HackerNews results. Our first step is to find a way to correlate results with each other. Oftentimes the titles are very similar, but using URLs would be ideal. HackerNews makes the URL easily available, but Reddit hides the URL inside the description data, so we're going to have to parse it out.

Fortunately we can do that pretty easily with a regular expression.

>>> from yos.yql import db,udfs
>>> import re
>>> REDDIT_LINK_REGEX = re.compile(r'<a href="(?P<url>.*?)">\[link\]</a>')
>>> def update_link(row):
...     m = REDDIT_LINK_REGEX.search(row['rp$description'])
...     if not m: return row
...     row['rp$link'] = m.group('url')
...     return row
... 
>>> rp = db.create(name="rp",url="http://www.reddit.com/r/programming/.rss")
>>> _rp = db.select(udf=udfs.unnest_value,table=rp)
>>> _rp = db.select(udf=update_link,table=_rp)
>>> _rp.rows
[
  {
    'rp$link': 'http://en.wikipedia.org/wiki/Computational_complexity_of_songs',
    'rp$description': '<a href="http://en.wikipedia.org/wiki/Computational_complexity_of_songs">[link]</a> <a href="http://www.reddit.com/r/programming/info/6ri66/comments/">[comments]</a>',
    'rp$title': "Donal Knuth's Complexity of Songs",
    'rp$date': '2008-07-13T01:44:18.961669+00:00',
    'rp$guid': 'http://www.reddit.com/goto?rss=true&id=t3_6ri66'
  },
  "Additional rows truncated for brevity.",
]
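To sanity-check the regular expression on its own, we can run it against a sample 'rp$description' value taken from the feed output above:

```python
import re

REDDIT_LINK_REGEX = re.compile(r'<a href="(?P<url>.*?)">\[link\]</a>')

# Sample 'rp$description' value copied from the feed output above.
description = ('<a href="http://en.wikipedia.org/wiki/Computational_complexity_of_songs">[link]</a> '
               '<a href="http://www.reddit.com/r/programming/info/6ri66/comments/">[comments]</a>')

# The non-greedy .*? stops at the first '">[link]' it finds,
# so only the article URL is captured, not the comments URL.
m = REDDIT_LINK_REGEX.search(description)
print(m.group('url'))
# http://en.wikipedia.org/wiki/Computational_complexity_of_songs
```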

Okay, we have collected the ingredients and just need to cook the stew.

Finding the Intersection

Let's take a look at a sample entry from each of our transformed feeds. First, one from HackerNews:

{
  'hn$link': 'http://scarybuggames.com/2008/05/chronotron', 
  'hn$description': '<a href="http://news.ycombinator.com/item?id=244594">Comments</a>',
  'hn$title': 'Play this game for 20 minutes - learn about concurrency',
  'hn$comments':'http://news.ycombinator.com/item?id=244594'
}

And now one from Reddit Programming (after we have run the update_link function on it):

{
  'rp$link': 'http://en.wikipedia.org/wiki/Computational_complexity_of_songs',
  'rp$description': '<a href="http://en.wikipedia.org/wiki/Computational_complexity_of_songs">[link]</a> <a href="http://www.reddit.com/r/programming/info/6ri66/comments/">[comments]</a>',
  'rp$title': "Donal Knuth's Complexity of Songs",
  'rp$date': '2008-07-13T01:44:18.961669+00:00',
  'rp$guid': 'http://www.reddit.com/goto?rss=true&id=t3_6ri66'
}

Now we use the link key to find the intersection between the two RSS feeds. Notice that in the overlap function we use 'link' as the key, rather than 'rp$link' and 'hn$link'. The db.join function is smart enough to strip off the namespaces before passing the arguments in.

>>> def overlap(r1,r2):
...     return r1['link'].strip() == r2['link'].strip()
...
>>> len(_hn)
25
>>> len(_rp)
25
>>> joint = db.join(overlap, [_hn, _rp])
>>> len(joint)
3
>>> joint.rows[0]
{
  'hn$description': '<a href="http://news.ycombinator.com/item?id=244837">Comments</a>',
  'hn$title': "Donald Knuth's Complexity of Songs",
  'rp$date': '2008-07-13T01:44:18.961669+00:00',
  'hn$comments': 'http://news.ycombinator.com/item?id=244837',
  'hn$link': 'http://en.wikipedia.org/wiki/Computational_complexity_of_songs',
  'rp$description': '<a href="http://en.wikipedia.org/wiki/Computational_complexity_of_songs">[link]</a> <a href="http://www.reddit.com/r/programming/info/6ri66/comments/">[comments]</a>',
  'rp$title': "Donal Knuth's Complexity of Songs",
  'rp$guid': 'http://www.reddit.com/goto?rss=true&id=t3_6ri66',
  'rp$link': 'http://en.wikipedia.org/wiki/Computational_complexity_of_songs'
}
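For intuition, here is a rough plain-Python sketch of what db.join is doing here: a pairwise comparison of rows from both tables, merging any pair the predicate accepts. The framework's real implementation surely differs; in particular the namespace-stripping below is a simplified guess, and the sample rows are hypothetical:

```python
def strip_namespace(row):
    # Drop the 'hn$' / 'rp$' prefix so the predicate can use bare keys.
    return {key.split('$', 1)[-1]: value for key, value in row.items()}

def join(predicate, left_rows, right_rows):
    """Merge every pair of rows the predicate accepts (nested-loop join)."""
    joined = []
    for l in left_rows:
        for r in right_rows:
            if predicate(strip_namespace(l), strip_namespace(r)):
                merged = dict(l)
                merged.update(r)
                joined.append(merged)
    return joined

def overlap(r1, r2):
    return r1['link'].strip() == r2['link'].strip()

# Hypothetical sample rows standing in for _hn.rows and _rp.rows.
hn_rows = [{'hn$link': 'http://example.com/a', 'hn$title': 'A'},
           {'hn$link': 'http://example.com/b', 'hn$title': 'B'}]
rp_rows = [{'rp$link': 'http://example.com/b', 'rp$title': 'B again'}]

print(join(overlap, hn_rows, rp_rows))
# [{'hn$link': 'http://example.com/b', 'hn$title': 'B',
#   'rp$link': 'http://example.com/b', 'rp$title': 'B again'}]
```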

But the goal isn't really to find the intersection; it's to find the content in HackerNews that is not contained in Reddit Programming. Fortunately, we can write a quick Python function to strip the intersection out.

>>> joint = db.join(overlap, [_hn, _rp])
>>> def in_reddit(row):
...     for dup in joint.rows:
...         if row['hn$link'] == dup['hn$link']:
...             return True
...     return False
... 
>>> hn_uniques = [ x for x in _hn.rows if not in_reddit(x) ]
>>> len(hn_uniques)
22
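Since in_reddit rescans joint.rows for every HackerNews row, a slightly tidier alternative is to collect the duplicate links into a set once and filter against that. The sample rows below are hypothetical stand-ins for _hn.rows and joint.rows:

```python
# Hypothetical sample data standing in for _hn.rows and joint.rows.
hn_rows = [{'hn$link': 'http://example.com/a'},
           {'hn$link': 'http://example.com/b'},
           {'hn$link': 'http://example.com/c'}]
joint_rows = [{'hn$link': 'http://example.com/b'}]

# Build the set of links that appeared in both feeds, then keep the rest.
duplicate_links = set(row['hn$link'] for row in joint_rows)
hn_uniques = [row for row in hn_rows if row['hn$link'] not in duplicate_links]

print([row['hn$link'] for row in hn_uniques])
# ['http://example.com/a', 'http://example.com/c']
```

Set membership checks are constant time, so this avoids the quadratic rescan, though for 25-item feeds either approach is plenty fast.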

And we've finally accomplished our noble goal: creating a feed of HackerNews with the intersection of entries with Reddit Programming stripped out. Putting all the code together it looks like this:

import re
from yos.yql import db,udfs

REDDIT_LINK_REGEX = re.compile(r'<a href="(?P<url>.*?)">\[link\]</a>')
def update_link(row):
    "Replace key 'rp$link' with url parsed from 'rp$description'."
    m = REDDIT_LINK_REGEX.search(row['rp$description'])
    if not m: return row
    row['rp$link'] = m.group('url')
    return row

def overlap(r1,r2):
    "Returns true if dicts r1 and r2 have same value for key 'link'."
    return r1['link'].strip() == r2['link'].strip()

# Get HackerNews RSS feed.
hn = db.create(name="hn",url="http://news.ycombinator.com/rss")
_hn = db.select(udf=udfs.unnest_value,table=hn)

# Get Reddit Programming RSS feed.
rp = db.create(name="rp",url="http://www.reddit.com/r/programming/.rss")
_rp = db.select(udf=udfs.unnest_value,table=rp)
_rp = db.select(udf=update_link,table=_rp)

# Calculate the intersection between both feeds.
joint = db.join(overlap, [_hn, _rp])

def in_reddit(row):
    "Returns true if row's link appears in the joint table."
    for dup in joint.rows:
        if row['hn$link'] == dup['hn$link']:
            return True
    return False

# Strip intersection from HackerNews RSS feed.
hn_uniques = [ x for x in _hn.rows if not in_reddit(x) ]

I'll be the first to admit that this is a fairly contrived example. That said, it does show off some of the useful tools the BOSS Mashup Framework provides to play around with. There is certainly more depth to the framework than is touched on here, and this task is a bit of an anomaly in the sense that it is more difficult than most things you'd be trying to accomplish with the framework.

The Mashup Framework has a lot of support for merging things together in interesting ways--ya know, mashing things up--but not much for removing duplicate results, which isn't surprising since that is almost the exact opposite of what it is intended to do. In that sense, it's something of a testament to the framework that it's still fairly easy to accomplish. Of course, it would have been fairly easy to accomplish this example without using the BOSS Mashup Framework at all...

In my next (and likely, for a while, last) tutorial on using the BOSS Mashup Framework I'll put it to use at a task that it's actually good at.

Let me know if there are any mistakes or if you have any questions.