December 2, 2008.
Last week I was doing some performance work with a client, and one of the big improvements we made was making HTTP requests in parallel. If your server needs to hit two or three APIs before it can render (the bane of the mashup crowd), then making sequential requests can take a huge bite out of your performance. With the client, the solution needed to be in PHP, but this evening I decided to whip up a similar solution for Python (someone might have suggested it as well).
Using the threading and urllib modules, it turned out to be a fairly straightforward task. My (very basic) strategy was to create a thread for each request, poll the threads until they finished, and then return the received data as a list of 2-tuples in the form of (url, data). The one other item on my wish list was a timeout that applied to all of the threads (again, to keep things feeling snappy).
My code ended up looking like this (let's say that it is stored in a file named multi_get.py for the following examples):

    from threading import Thread
    from urllib import urlopen
    from time import sleep

    UPDATE_INTERVAL = 0.01  # seconds between polls of the worker threads

    class URLThread(Thread):
        """Fetch a single URL in its own thread."""
        def __init__(self, url):
            super(URLThread, self).__init__()
            self.url = url
            self.response = None  # stays None until the download finishes

        def run(self):
            self.request = urlopen(self.url)
            self.response = self.request.read()

    def multi_get(uris, timeout=2.0):
        def alive_count(lst):
            # sum() also copes with an empty list, where reduce() would raise
            return sum(1 for x in lst if x.isAlive())

        threads = [URLThread(uri) for uri in uris]
        for thread in threads:
            thread.start()
        # poll until every thread is done or the shared timeout runs out
        while alive_count(threads) > 0 and timeout > 0.0:
            timeout = timeout - UPDATE_INTERVAL
            sleep(UPDATE_INTERVAL)
        return [(x.url, x.response) for x in threads]

Usage looks like this:

    from multi_get import multi_get

    sites = ['http://msn.com/', 'http://yahoo.com/', 'http://google.com/']
    requests = multi_get(sites, timeout=1.5)
    for url, data in requests:
        # data is None if the request had not finished before the timeout
        print "received this data %s from this url %s" % (data, url)

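The multi_get listing above targets Python 2. For anyone on Python 3, where urlopen lives in urllib.request and Thread.isAlive() has given way to is_alive(), here is a minimal sketch of the same idea. It is a variation rather than a straight port: instead of polling, it joins each thread against one shared deadline, which gives the same "one timeout for everything" behaviour. The names fetch_url and FetchThread, the fetch parameter (handy for testing without a network), and the choice to report failed requests as None are all assumptions of this sketch.

```python
import threading
import time
from urllib.request import urlopen


def fetch_url(url):
    """Default fetcher: a plain blocking GET that returns the body as bytes."""
    with urlopen(url) as response:
        return response.read()


class FetchThread(threading.Thread):
    """Fetch a single URL in its own thread (hypothetical helper class)."""

    def __init__(self, url, fetch):
        # daemon=True so a stalled request cannot keep the interpreter alive
        super().__init__(daemon=True)
        self.url = url
        self.fetch = fetch
        self.response = None  # stays None until the fetch succeeds

    def run(self):
        try:
            self.response = self.fetch(self.url)
        except Exception:
            # Assumption: a failed request looks the same as a timed-out one.
            self.response = None


def multi_get(urls, timeout=2.0, fetch=fetch_url):
    """Fetch all urls in parallel; return a list of (url, data) tuples."""
    threads = [FetchThread(url, fetch) for url in urls]
    for thread in threads:
        thread.start()
    deadline = time.monotonic() + timeout  # one deadline shared by all threads
    for thread in threads:
        # Wait only for whatever is left of the shared time budget.
        thread.join(max(0.0, deadline - time.monotonic()))
    return [(thread.url, thread.response) for thread in threads]
```

As in the original, any request that has not finished when the timeout expires comes back with None as its data, so callers should check for that before using the result.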
I did some comparison testing against this straightforward sequential implementation,

    from urllib import urlopen
    import time

    results = []
    sites = ('http://msn.com/', 'http://yahoo.com/', 'http://google.com/')
    for site in sites:
        # time each request individually
        start = time.time()
        req = urlopen(site)
        results.append((site, req.read()))
        end = time.time()
        print "took %s seconds" % (end - start)

and the results were what one would expect. On my connection it was taking MSN 1.14 seconds to load, while Yahoo and Google took between 0.1 and 0.2 seconds (who knows, maybe they were rejecting the user agent with an error page ;). The first script, executing the retrievals in parallel, finished all of its requests only ten or twenty milliseconds after the slowest single response came back, while the sequential script took a good bit longer (the sum of all response times).
As you build your next Django or TurboGears mashup--or even plan your next doomed foray into screen-scraping Google--give parallel requests a try and see just how helpful they can be.
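The arithmetic behind that comparison (sequential time is roughly the sum of the response times, parallel time is roughly the slowest one) can be reproduced without touching the network. The Python 3 sketch below swaps real requests for time.sleep calls. The latencies and site names in it are made up for illustration and are not the measurements quoted above, and fake_fetch, time_sequential, and time_parallel are hypothetical helper names.

```python
import threading
import time

# Made-up latencies in seconds; they stand in for real network round trips.
LATENCIES = {"slow-site": 0.30, "fast-site-a": 0.15, "fast-site-b": 0.15}


def fake_fetch(name):
    """Pretend to download `name` by sleeping for its latency."""
    time.sleep(LATENCIES[name])


def time_sequential(names):
    """Fetch one after another; elapsed time is about the sum of latencies."""
    start = time.monotonic()
    for name in names:
        fake_fetch(name)
    return time.monotonic() - start


def time_parallel(names):
    """Fetch in one thread each; elapsed time is about the slowest latency."""
    start = time.monotonic()
    threads = [threading.Thread(target=fake_fetch, args=(name,)) for name in names]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return time.monotonic() - start


if __name__ == "__main__":
    names = list(LATENCIES)
    print("sequential: %.2fs (sum of latencies: %.2fs)"
          % (time_sequential(names), sum(LATENCIES.values())))
    print("parallel:   %.2fs (slowest latency: %.2fs)"
          % (time_parallel(names), max(LATENCIES.values())))
```

The gap between the two numbers grows with every extra API call added to the list, which is why the parallel version pays off most for pages that depend on several slow services.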
(You could also approach this problem using the asyncore module, but it started throwing some weird errors my way and I went with the threaded approach. Given that the threaded approach performs admirably, I decided to leave it as it is; using asyncore should be almost identical but you'd have to write a bit more glue code for the HTTP aspects, especially if you wanted to get more complex than GET requests.)