
You are writing a comment about An Introduction to Compassionate Screen Scraping. Here is a quick summary:

One of the most common quickie projects on the web is to screenscrape a website and play around with its data. These projects are a lot of fun, and can allow for inventive mashups, but often the screenscraping scripts cause unnecessary load on the site's servers due to inconsiderate technique. This is an introduction to the art of compassionate screenscraping.


You are responding to this comment written by Will Larson on August 12th 2008, 06:27.

Well, BeautifulSoup is adept at scraping malformed and invalid HTML, so that isn't a big problem when screen scraping with Python. It is true that this stack cannot deal well with sites that use JavaScript to display their content, but in my experience relatively few websites use JavaScript so heavily that they cannot be parsed with this combination. Such sites would essentially be making two trips to their servers for each page they loaded, first to load the JavaScript and second to load the content via the JavaScript, so their load time would be twice what it might otherwise be, and that's just one drop in the bucket of issues that design causes.
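
To illustrate what tolerant parsing of malformed HTML looks like, here is a minimal sketch. BeautifulSoup is a third-party package, so to stay dependency-free this uses the standard library's `html.parser` (one of the parsers BeautifulSoup 4 can sit on top of) rather than BeautifulSoup itself. The `LinkCollector` class, the `extract_links` helper, and the `BROKEN_HTML` sample are all made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, ignoring how well the markup nests."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag just seen.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


# Hypothetical sample: unclosed <p>, <li>, and <a> tags, an unquoted
# attribute value, and no closing </html>.
BROKEN_HTML = (
    "<html><body><p>Unclosed paragraph"
    "<ul><li><a href='/one'>One<li><a href=/two>Two</ul></body>"
)


def extract_links(markup):
    """Return every link target found in the markup, in document order."""
    parser = LinkCollector()
    parser.feed(markup)
    parser.close()
    return parser.links


if __name__ == "__main__":
    print(extract_links(BROKEN_HTML))
```

The parser never raises on the missing closing tags; it just reports the start tags it sees, which is usually all a scraper needs. BeautifulSoup goes a step further and builds a repaired tree you can search.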

Really, I think it's pretty easy and effective to scrape in Python. That said, I haven't tried the tools you just suggested, and I'll have to take a look at them in the near future. Thanks for the tip.


Comments may make use of LifeFlow MarkDown. Raw HTML will be escaped.


Quick Introduction to LifeFlow MarkDown Syntax

A highlighted code block:

@@ ruby
def a(b, c)
  b * c
end
@@

Other common languages work as well: scheme, python, java, html, etc.

Other markdown syntax:

### This is an h3 title
#### This is an h4 title
**this is bold**
*this is italics*

1. This is an
2. ordered list

* And an unordered
* list too

[this is a link](http://www.lethain.com/ "Lethain")