Last week I was doing some performance work with a client, and one of the big improvements we made was making HTTP requests in parallel. If your server needs to hit two or three APIs before it can render (the bane of the mashup crowd), then making those requests sequentially can take a huge bite out of your performance. With the client, the solution needed to be in PHP, but this evening I decided to whip up a similar solution for Python (someone might have suggested it as well).
Using the threading and urllib modules, it turned out to be a fairly straightforward task. My (very basic) strategy was to create a thread for each request, poll the threads until they finished, and then return the received data as a list of 2-tuples in the form (url, data). The one other item on my wish list was a timeout that applied to all of the threads (again, to keep things feeling snappy).
My code ended up looking like this (let's say that it is stored in a file named multi_get.py for the following example):
    from threading import Thread
    from urllib import urlopen
    from time import sleep

    UPDATE_INTERVAL = 0.01

    class URLThread(Thread):
        def __init__(self, url):
            super(URLThread, self).__init__()
            self.url = url
            self.response = None  # stays None until the request has finished

        def run(self):
            self.request = urlopen(self.url)
            self.response = self.request.read()

    def multi_get(uris, timeout=2.0):
        def alive_count(lst):
            alive = map(lambda x : 1 if x.isAlive() else 0, lst)
            # the initial value 0 keeps reduce from failing on an empty list
            return reduce(lambda a,b : a + b, alive, 0)

        threads = [ URLThread(uri) for uri in uris ]
        for thread in threads:
            thread.start()
        # poll until every thread is done or the shared timeout runs out
        while alive_count(threads) > 0 and timeout > 0.0:
            timeout = timeout - UPDATE_INTERVAL
            sleep(UPDATE_INTERVAL)
        return [ (x.url, x.response) for x in threads ]
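For readers on Python 3, the same strategy can be sketched roughly as follows. This is a loose port rather than the original code: urllib.request.urlopen stands in for urllib.urlopen, is_alive() for isAlive(), and the loop checks a monotonic deadline instead of subtracting the sleep interval. The fetch parameter, the daemon flag, and the decision to swallow request errors (leaving the response as None) are all assumptions added for illustration.

```python
import time
from threading import Thread
from urllib.request import urlopen

UPDATE_INTERVAL = 0.01


def fetch_url(url):
    """Default fetcher: return the response body as bytes."""
    with urlopen(url) as response:
        return response.read()


class URLThread(Thread):
    def __init__(self, url, fetch):
        # daemon=True (an assumption) lets the program exit even if a request hangs
        super().__init__(daemon=True)
        self.url = url
        self.fetch = fetch
        self.response = None  # stays None if the request fails or is too slow

    def run(self):
        try:
            self.response = self.fetch(self.url)
        except Exception:
            pass  # assumption: a failed request is reported as None


def multi_get(uris, timeout=2.0, fetch=fetch_url):
    """Fetch all uris in parallel; return a list of (url, data) tuples.

    fetch is a hypothetical hook so the polling logic can be tried
    without a network connection.
    """
    threads = [URLThread(uri, fetch) for uri in uris]
    for thread in threads:
        thread.start()
    deadline = time.monotonic() + timeout
    # poll until every thread is done or the shared deadline passes
    while any(t.is_alive() for t in threads) and time.monotonic() < deadline:
        time.sleep(UPDATE_INTERVAL)
    return [(t.url, t.response) for t in threads]
```

Passing a stand-in function as fetch (for example, one that sleeps and returns a canned string) makes it easy to see that a request slower than the timeout comes back as (url, None) while the others return their data.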
Usage looks like this:
    from multi_get import multi_get

    sites = ['http://msn.com/','http://yahoo.com/','http://google.com/']
    requests = multi_get(sites,timeout=1.5)
    for url, data in requests:
        print "received this data %s from this url %s" % (data, url)
I did some comparison testing against this straightforward sequential implementation,
    from urllib import urlopen
    import time

    results = []
    sites = ('http://msn.com/','http://yahoo.com/','http://google.com/')
    for site in sites:
        start = time.time()
        req = urlopen(site)
        results.append((site, req.read()))
        end = time.time()
        print "took %s seconds" % (end-start)
and the results were what one would expect. On my connection it was taking MSN 1.14 seconds to load, while Yahoo and Google took between 0.1 and 0.2 seconds (who knows, maybe they were rejecting the user agent with an error page ;). The first script, executing the retrievals in parallel, finished all three requests only ten or twenty milliseconds after the slowest single response, while the sequential script took a good bit longer (the sum of all the response times).
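The "slowest response versus sum of responses" behavior can be reproduced without touching the network. In this Python 3 sketch the latencies are made-up values and fake_request simply sleeps, so nothing here measures the real sites; all of the names are hypothetical.

```python
import time
from threading import Thread

# Made-up latencies in seconds, not measurements of real sites
LATENCIES = {"slow-site": 0.3, "fast-site-a": 0.05, "fast-site-b": 0.1}


def fake_request(name):
    """Stand-in for a blocking HTTP request: sleep for the site's latency."""
    time.sleep(LATENCIES[name])


def time_sequential(names):
    """Run the fake requests one after another and return the elapsed time."""
    start = time.monotonic()
    for name in names:
        fake_request(name)
    return time.monotonic() - start


def time_parallel(names):
    """Run each fake request in its own thread and return the elapsed time."""
    start = time.monotonic()
    threads = [Thread(target=fake_request, args=(name,)) for name in names]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return time.monotonic() - start


if __name__ == "__main__":
    names = list(LATENCIES)
    # sequential time is roughly the sum of the latencies
    print("sequential: %.2f s" % time_sequential(names))
    # parallel time is roughly the slowest single latency
    print("parallel:   %.2f s" % time_parallel(names))
```

Threads help here because the work is waiting on I/O rather than computing, so the waits overlap instead of stacking up.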
As you build your next Django or TurboGears mashup--or even plan your next doomed foray into screen scraping Google--give parallel requests a try and see just how helpful they can be.
(You could also approach this problem using the asyncore module, but it started throwing some weird errors my way and I went with the threaded approach. Given that the threaded approach performs admirably, I decided to leave it as it is; using asyncore should be almost identical but you'd have to write a bit more glue code for the HTTP aspects, especially if you wanted to get more complex than GET requests.)
Thanks for the post, Will. It's quite interesting to see a threading solution. I make about 5 API requests every few minutes. Maybe I should give this a try.
I hacked on this code a bit, mostly just to keep my Python skills up. I tried to eliminate the call to sleep() by using join, and to allow a timeout of None to indicate that it should wait until all urls are fetched.
I also made the threads daemons so that the program wouldn't wait on them to exit, since if no other threads were waiting on them then we don't care about what they're fetching anymore. Oh, and I had to butcher alive_count() to make it compatible with my ancient Python 2.4.
Performance turned out to be comparable, with or without an explicit call to sleep().
http://gist.github.com/33001
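The gist is not reproduced here, so what follows is only a Python 3 sketch of the approach described in this comment, not its actual code: each thread is joined against one shared deadline instead of polling with sleep(), a timeout of None waits for everything, and the threads are daemons. The helper names multi_join and run_all and their signatures are assumptions made for illustration.

```python
import time
from threading import Thread


def multi_join(threads, timeout=None):
    """Join every thread, sharing one overall timeout across all of them.

    timeout=None blocks until all threads finish. Returns the threads
    that are still running once the time is up.
    """
    if timeout is None:
        for thread in threads:
            thread.join()
        return []
    deadline = time.monotonic() + timeout
    for thread in threads:
        remaining = deadline - time.monotonic()
        # once the deadline has passed, join(0.0) only checks the thread
        thread.join(max(remaining, 0.0))
    return [t for t in threads if t.is_alive()]


def run_all(tasks, timeout=None):
    """Run zero-argument callables in daemon threads; return results in order.

    A task that has not finished within the timeout yields None.
    """
    results = [None] * len(tasks)

    def worker(index, task):
        results[index] = task()

    threads = [Thread(target=worker, args=(i, task), daemon=True)
               for i, task in enumerate(tasks)]
    for thread in threads:
        thread.start()
    multi_join(threads, timeout)
    # copy, so that stragglers finishing later cannot change what was returned
    return list(results)
```

Because each join() receives only the time left until the shared deadline, the total wait is bounded by the timeout no matter how many threads there are.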
Good stuff Peter. Your current version of alive_count() is pretty much identical to my original version before I decided to compact it for readability. Of course you could have written this cluster-fuck instead:
Which actually works, I think, although it's treading narrowly close to hard-to-understand disaster because of the one and zero being booleans...
I think your multi_join approach is definitely preferable to my sleeping, although underneath it is likely doing something very similar. Also, I've never taken the time to understand the daemon stuff; it looks simpler than I imagined.