Search Recipes for Yahoo's BOSS in Python
This tutorial is going to take a look at how to use Yahoo's BOSS Mashup Framework's optional Python library to (what else?) Build your Own Search Service. We'll start out by installing the library and getting a BOSS developer's account, and then play with BOSS at the Python command line, putting together a handful of useful recipes for searching the web, news, images and more.
This tutorial isn't going to attempt to build an entire application, just show examples of using the BOSS library itself. If you're interested in an article that uses BOSS in a web framework, I recently wrote an article using BOSS in Django, which should be helpful.
As you work through these examples you may want to refer to the official Yahoo BOSS API documentation (whose link is kind of hard to find on the BOSS page).
Setting Up the BOSS Library
Before we get started, note that the BOSS Framework library depends on Python 2.5 and cannot be used as-is with earlier versions.
Sign up for a BOSS App ID. You'll have to choose in-browser authentication and supply two URLs: one for your app, and one to redirect to after successful authentication.
The library doesn't use those URLs (nor does it perform any kind of authentication for end users of your service, although you will have to supply your BOSS App ID for the library to work). They are used to validate your account, though, so you will need to choose a valid domain where you can place a static file for Yahoo to verify domain ownership.
Download the BOSS Mashup Framework, unzip it, and delete the zip.
unzip boss_mashup_framework_0.1.zip
rm boss_mashup_framework_0.1.zip
Open up the boss_mashup_framework_0.1/config.json file and fill in your values for the first three items:
appid is the ID you got when you registered for BOSS in step #1.
email is your email.
org is the name of your organization (likely your name).
Install Simple JSON if you don't already have it. You can check by opening a Python 2.5 prompt and typing
import simplejson
If that didn't work, download Simple JSON, unzip it, enter the folder, and then install it.
tar -xzvf simplejson-1.9.2.tar.gz
cd simplejson-1.9.2
python2.5 setup.py build
python2.5 setup.py install
cd ../
rm -rf simplejson-1.9.2*
Create the folder boss_mashup_framework_0.1/deps/.
cd boss_mashup_framework_0.1
mkdir deps
Download dict2xml and xml2dict, extract them into the deps folder, remove the .tgz files, and return to the boss_mashup_framework_0.1 directory.
cd deps
cp ~/Desktop/dict2xml.tgz ./
cp ~/Desktop/xml2dict.tgz ./
tar -xzvf dict2xml.tgz
tar -xzvf xml2dict.tgz
rm *.tgz
cd ..
Now we can build the library.
python2.5 setup.py build
python2.5 setup.py install
Finally, let's test that everything worked.
python2.5 examples/ex3.py
You should see some news stories printed to the screen. Success! If you run into an error message that looks like this:
Traceback (most recent call last):
  File "examples/ex3.py", line 16, in <module>
    from yos.yql import db
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/yos/yql/db.py", line 44, in <module>
    from yos.crawl import rest
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/yos/crawl/rest.py", line 15, in <module>
    HEADERS = {"User-Agent": simplejson.load(open("config.json", "r"))["agent"]}
IOError: [Errno 2] No such file or directory: 'config.json'
then you probably started Python in a directory without a config.json file present. I consider this a pretty ugly flaw in the library in its present form: you must run Python from a directory with a config.json file in it. Hopefully this situation can be improved upon.
At this point, we are all set up and ready to move onward. If things didn't quite work out for you, take a look at the boss_mashup_framework_0.1/README file. If that doesn't help, leave a comment and I'll try to help out.
Now, let's begin writing our recipes.
Searching the Web
First, let's look at using the BOSS Mashup Framework to search the web.
>>> from yos.boss import ysearch
>>> from yos.yql import db
>>> data = ysearch.search("Django",count=10)
>>> table = db.create(data=data)
>>> table.rows
[
{ u'dispurl': u'www.<b>djangoproject.com</b>',
u'title': u'<b>Django</b> | The Web framework for perfectionists with deadlines',
u'url': u'http://www.djangoproject.com/', u'abstract': u'<b>Django</b> is a high-level Python Web framework that encourages rapid development and clean, pragmatic design. <b>...</b> <b>Django</b> focuses on automating as much as possible <b>...</b>',
u'clickurl': u'http://www.djangoproject.com/',
u'date': u'2008/06/19',
u'size': u'8524'
},
"would display 9 more results, but removed to save space"
]
From there we can play with the data quite easily:
>>> for row in table.rows:
...     print "(%s) %s : %s" % (row['date'],row['title'],row['url'])
(2008/06/19) <b>Django</b> | The Web framework for perfectionists with deadlines : http://www.djangoproject.com/
(2006/03/15) <b>Django</b> Reinhardt : http://www.redhotjazz.com/django.html
(2008/07/09) <b>Django</b> Reinhardt - Wikipedia, the free encyclopedia : http://en.wikipedia.org/wiki/Django_Reinhardt
...
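Notice that the titles come back with the matched terms wrapped in <b> tags. If you want plain text instead, something like the following sketch would do. The helper name strip_highlight is hypothetical, and I'm assuming the only markup in these fields is the <b> and </b> pair seen in the output above.

```python
import re

# Matches the opening and closing bold tags seen in BOSS result fields.
BOLD_TAG = re.compile(r"</?b>")


def strip_highlight(text):
    """Remove <b> and </b> markers, leaving the rest of the text untouched."""
    return BOLD_TAG.sub(u"", text)


# A row shaped like the ones in table.rows above.
rows = [
    {u"title": u"<b>Django</b> | The Web framework for perfectionists with deadlines"},
]
plain_titles = [strip_highlight(row[u"title"]) for row in rows]
```

The same function works on the dispurl and abstract fields, since they use the same highlighting.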
One thing to point out is the difference between url and clickurl. In the above code they are the same:
{ u'url': u'http://www.djangoproject.com/',
u'clickurl': u'http://www.djangoproject.com/',
...
}
However, Yahoo makes an important distinction between the two in their documentation: you should use clickurl for the actual link used to access the site. For example, if you were writing a Django template, the link might look like one of these examples:
<a href="{{ row.clickurl }}">{{ row.url }}</a>
<a href="{{ row.clickurl }}">{{ row.title }}</a>
but should not look like these:
<a href="{{ row.url }}">{{ row.url }}</a>
<a href="{{ row.url }}">{{ row.title }}</a>
Although I haven't run into any situations where url and clickurl actually differ, it seems safer to follow Yahoo's instructions, as there is likely a very reasonable explanation, even though it's not explained well in their documentation.
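If you build links in Python rather than in a template, a tiny helper can keep that rule in one place. This is only a sketch: link_target is a name I made up, and the fallback to url when clickurl is missing or empty is my own assumption, not something Yahoo's documentation describes.

```python
def link_target(row):
    """Return the address a link should point at.

    Prefers clickurl, as Yahoo asks. Falling back to url when clickurl
    is missing or empty is an assumption made for this sketch.
    """
    return row.get(u"clickurl") or row[u"url"]
```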
Limiting Search to one Site
Limiting your search to a site is fairly simple. If you wanted to search for HttpRequest on djangoproject.com, you would simply do this:
>>> query = u"%s site:%s" % ("HttpRequest", "djangoproject.com")
>>> query
u'HttpRequest site:djangoproject.com'
>>> data = ysearch.search(query)
Here is a full example:
>>> from yos.boss import ysearch
>>> from yos.yql import db
>>> data = ysearch.search("Django site:lethain.com")
>>> results = db.create(data=data)
>>> results.rows
[{ u'dispurl': u'www.<b>lethain.com</b>/tags/<b>django</b>',
u'title': u'<b>django</b>',
u'url': u'http://www.lethain.com/tags/django/',
u'abstract': u"Will Larson's blog about programming and other things. <b>...</b> Tags: Google App Engine <b>django</b> <b>...</b> and disadvantages of using <b>Django</b> with the Google App Engine. <b>...</b>",
u'clickurl': u'http://www.lethain.com/tags/django/',
u'date': u'2008/07/03',
u'size': u'29012'},
u"9 more results truncated for brevity",
]
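If the site comes from user input, it may arrive as a full URL rather than a bare domain. Here is a small sketch that trims it down before building the query. The helper name site_query is hypothetical, and I haven't checked how the site: operator treats a scheme or path, so reducing the input to a bare host is simply the cautious choice.

```python
def site_query(terms, site):
    """Build a query restricted to one site with the site: operator.

    Accepts either a bare domain or a full URL and keeps only the host.
    """
    host = site.strip()
    if "://" in host:
        # Drop a leading scheme such as http://
        host = host.split("://", 1)[1]
    # Drop any path after the host name.
    host = host.split("/", 1)[0]
    return u"%s site:%s" % (terms.strip(), host)
```

The result can be passed straight to ysearch.search, just like the query string built by hand above.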
Searching in Different Regions and Languages
BOSS lets you search in different languages and regions as well, and it publishes a list of the supported regions. These are accessed using the lang and region parameters to ysearch.search. For example, if you want to search the Argentina region (ar) in Spanish (es), it would look like this:
>>> data = ysearch.search("BOSS Framework",lang="es",region="ar")
Here is a full example of searching using the Japanese region and language.
>>> from yos.boss import ysearch
>>> from yos.yql import db
>>> data = ysearch.search("Django",lang='jp',region='jp')
>>> results = db.create(data=data)
>>> results.rows
[{ u'dispurl': u'<b>django</b>.nqsblog.jp',
u'title': u'<b>Django</b> Kumamoto',
u'url': u'http://django.nqsblog.jp/',
u'abstract': u'<b>Django</b> Kumamoto. \u718a\u672c\u30b8\u30e3\u30f3\u30b4\u306e\u30e9\u30a4\u30d6\u30fb\u30a2\u30fc\u30c6\u30a3\u30b9\u30c8\u30fb\u30d4\u30c3\u30af\u30a2\u30c3\u30d7\u60c5\u5831 ... \u718a\u672c<b>\uff24\uff4a\uff41\uff4e\uff47\uff4f</b>\u30e9\u30a4\u30d6\u30b9\u30b1\u30b8\u30e5\u30fc\u30eb ... <b>Django</b>. 07-01-12. \u53ea\u4eca\u597d\u8a55\u767a\u58f2\u4e2d! \u544a\u77e5\u6709\u96e3\u3046\u5fa1\u5ea7\u3044\u307e.. TIGER HOLE ...',
u'clickurl': u'http://django.nqsblog.jp/',
u'date': u'2008/07/07',
u'size': u'23025'},
u"9 more results truncated for brevity",
]
Searching Yahoo News
Searching data from Yahoo News is pretty much the same story, except we add the parameter vertical="news" to the ysearch.search function.
>>> from yos.boss import ysearch
>>> from yos.yql import db
>>> data = ysearch.search("Python",vertical="news",count=10)
>>> news = db.create(data=data)
>>> news.rows
[{u'sourceurl': u'http://www.geek.com/',
u'language': u'en english',
u'title': u'Sun to add Python support to NetBeans IDE',
u'url': u'http://www.geek.com/sun-to-add-python-support-to-netbeans-ide-20080710/',
u'abstract': u'Sun\u2019s open source IDE NetBeans is expanding fast. After announcing support for PHP back in May, Sun has used the EuroPython 2008 event to announce support for Python and Jython in upcoming releases. NetBeans is an all-in-one development environment in the same vein as Microsoft\u2019s Visual Studio. The one key difference being Sun\u2019s version is open [...]',
u'clickurl': u'http://www.geek.com/sun-to-add-python-support-to-netbeans-ide-20080710/',
u'source': u'Geek.com',
u'time': u'14:50:27',
u'date': u'2008/07/10'},
"9 results truncated for brevity"
]
Let's say we wanted to generate a list of HTML links from those results. It would look something like this:
>>> def make_link(row):
... return u'<a href="%s">%s</a>' % (row['clickurl'],row['title'])
...
>>> links = [make_link(row) for row in news.rows]
>>> links
[ u'<a href="http://www.geek.com/sun-to-add-python-support-to-netbeans-ide-20080710/">Sun to add Python support to NetBeans IDE</a>',
u'<a href="http://www.akron.com/akron-ohio-community-news.asp?aID=2875">Peninsula celebrates Python Day with special events</a>',
u'8 results truncated for brevity']
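One caveat with make_link: titles and URLs can contain characters like & and <, which should be escaped before they land in HTML. Here is a sketch of an escaping variant. The name make_safe_link is hypothetical. Keep in mind that escaping the title would also escape the <b> highlighting that web search results carry, so this suits the news rows above, whose titles are plain text.

```python
try:
    from html import escape  # Python 3
except ImportError:
    from cgi import escape  # Python 2


def make_safe_link(row):
    """Build an anchor tag with the URL and title escaped for HTML."""
    href = escape(row[u"clickurl"], True)  # True also escapes double quotes
    text = escape(row[u"title"], True)
    return u'<a href="%s">%s</a>' % (href, text)
```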
Searching for Images
BOSS lets you search for images by adding the parameter vertical="images" to the ysearch.search function.
>>> from yos.boss import ysearch
>>> from yos.yql import db
>>> data = ysearch.search("cherry blossom",vertical="images",count=10)
>>> images = db.create(data=data)
>>> images.rows
[
{ u'mimetype': u'image/jpeg',
u'refererurl': u'http://whatdigitalcamera.com/gallery/view_photo_properties.php?set_albumName=wdc_gallery&index=127&gallery_popup=true',
u'format': u'jpeg',
u'url': u'http://whatdigitalcamera.com/albums/wdc_gallery/Cherry_Blossom.thumb.jpg',
u'abstract': u'Photo Properties Cherry Blossom',
u'clickurl': u'http://whatdigitalcamera.com/albums/wdc_gallery/Cherry_Blossom.thumb.jpg',
u'thumbnail_width': u'125',
u'height': u'100',
u'width': u'150',
u'refererclickurl': u'http://whatdigitalcamera.com/gallery/view_photo_properties.php?set_albumName=wdc_gallery&index=127&gallery_popup=true',
u'date': u'2006/07/26',
u'title': u'Cherry_Blossom.thumb.jpg',
u'thumbnail_height': u'83',
u'filename': u'Cherry_Blossom.thumb.jpg',
u'thumbnail_url': u'http://sp1.yt-thm-a01.yimg.com/image/25/m3/2578341441',
u'size': u'5500'
},
u"9 more results truncated for brevity",
]
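The image rows carry everything needed to render a thumbnail gallery. Here is a rough sketch that turns one row into a linked thumbnail. The function name thumbnail_html is hypothetical, and linking to refererclickurl (rather than refererurl) is my own guess, made by analogy with the clickurl advice earlier.

```python
try:
    from html import escape  # Python 3
except ImportError:
    from cgi import escape  # Python 2


def thumbnail_html(row):
    """Render one image row as a thumbnail linking to the page it came from."""
    return u'<a href="%s"><img src="%s" width="%s" height="%s" alt="%s"></a>' % (
        escape(row[u"refererclickurl"], True),
        escape(row[u"thumbnail_url"], True),
        escape(row[u"thumbnail_width"], True),
        escape(row[u"thumbnail_height"], True),
        escape(row[u"title"], True),
    )
```

Mapping it over images.rows gives you a list of HTML snippets, in the same way the make_link example did for news.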
Searching for Spelling
You can use BOSS to check spelling as well. To do this, add the parameter vertical="spelling" to the ysearch.search function.
>>> from yos.boss import ysearch
>>> from yos.yql import db
>>> data = ysearch.search("exubberance",vertical="spelling")
>>> spellings = db.create(data=data)
>>> spellings.rows
[{u'suggestion': u'exuberance'}]
>>> data = ysearch.search("horrifiy",vertical="spelling")
>>> spellings = db.create(data=data)
>>> spellings.rows
[{u'suggestion': u'horrific'}]
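A common use for this is a "did you mean" feature. Here is a small sketch that picks a suggestion out of the rows and falls back to the original word. The name best_spelling is hypothetical, and I'm assuming (without having verified it) that a correctly spelled word yields no rows with a suggestion.

```python
def best_spelling(word, rows):
    """Return the first suggestion found in rows, or word if there is none."""
    for row in rows:
        suggestion = row.get(u"suggestion")
        if suggestion:
            return suggestion
    return word
```

You would call it as best_spelling("exubberance", spellings.rows), using the rows from the example above.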
Paginating Search Results
Often you'll want the ability to page through results, showing the first ten results, then the next ten, and so on. Fortunately this is easy as well. Simply use the start named parameter for the ysearch.search function, incrementing its value by the page size each time:
>>> pos = 0
>>> first_five = ysearch.search("Django",start=pos,count=5)
>>> pos = pos + 5
>>> second_five = ysearch.search("Django",start=pos,count=5)
>>> pos = pos + 5
>>> third_five = ysearch.search("Django",start=pos,count=5)
And here is a full example of paginating the search results:
>>> from yos.boss import ysearch
>>> from yos.yql import db
>>> data = ysearch.search("Python",start=0,count=5)
>>> results = db.create(data=data)
>>> titles = [ row['title'] for row in results.rows ]
>>> titles
[u'<b>Python</b> Programming Language', u'<b>Python</b> (programming language) - Wikipedia, the free encyclopedia', u'<b>Python</b> (programming language) - Wikipedia, the free encyclopedia', u'Download <b>Python</b> Software', u'<b>Python</b> - Wikipedia, the free encyclopedia']
>>> data = ysearch.search("Python",start=5,count=5)
>>> results = db.create(data=data)
>>> titles = [ row['title'] for row in results.rows ]
>>> titles
[u'<b>Python</b> (programming language) - Wikipedia, the free encyclopedia', u'<b>Python</b> for S60', u'PythOnline', u"<b>Python</b> | O'Reilly Media", u'<b>Python</b> for S60']
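The bookkeeping for start is easy to wrap up in a loop. Here is a minimal sketch. The function name fetch_pages is hypothetical, and it takes the search function as an argument so it can be tried out without the BOSS library installed. The only thing it relies on is the start and count keyword arguments used above.

```python
def fetch_pages(search, query, pages, per_page=5, **extra):
    """Call search once per page and return the raw responses in order.

    search is any callable accepting (query, start=..., count=...), such
    as ysearch.search. Extra keyword arguments are passed through as-is.
    """
    responses = []
    for page in range(pages):
        start = page * per_page  # 0, per_page, 2 * per_page, ...
        responses.append(search(query, start=start, count=per_page, **extra))
    return responses
```

With the library installed, fetch_pages(ysearch.search, "Django", 3) should make the same three calls as the first example in this section, and each response can then be handed to db.create(data=...) as before.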
Going on from here
The recipes here only look at the functionality in ysearch.search, and do not take advantage of the functionality in the rest of the BOSS Mashup library. I'll be putting together some recipes using the yos.yql functionality in a bit; until then, the best way to learn how things work is to read the examples in the examples directory and to simply read the source code. There is also the official documentation, but it doesn't deal with the Python library specifically, so it may not be of great help.
Let me know if there are any questions.