Scrape The World


This week Ted Han of DocumentCloud and I presented a session on web scraping at the Investigative Reporters and Editors annual conference in Boston.

The technique is invaluable for taking unstructured data off the web and placing it into tables. And, in my opinion, it’s key to better stories about hard-to-get public data.
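The core idea can be sketched in a few lines of Python using only the standard library: feed raw HTML to a parser and collect the table cells into rows you could drop into a spreadsheet. The HTML snippet and the data in it are made up for illustration; a real scraper would fetch a live page instead.

```python
# A minimal sketch of scraping: pull structured rows out of raw HTML.
# The HTML below is a hypothetical stand-in for a fetched web page.
from html.parser import HTMLParser

HTML = """
<table>
  <tr><td>Smith, Jane</td><td>$52,000</td></tr>
  <tr><td>Doe, John</td><td>$48,500</td></tr>
</table>
"""

class TableScraper(HTMLParser):
    """Collects the text of each <td> into rows, one row per <tr>."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_td = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag == "td":
            self._in_td = False

    def handle_data(self, data):
        if self._in_td:
            self._row.append(data.strip())

scraper = TableScraper()
scraper.feed(HTML)
print(scraper.rows)
# each row is now a list of cells, ready for a CSV or a spreadsheet
```

In practice you would point this at pages fetched with `urllib.request` (or a library like BeautifulSoup, which handles messier real-world HTML), but the shape of the work is the same: unstructured markup in, tabular data out.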

From Dan Nguyen of The Bastard’s Book of Ruby fame: 

Whether it’s worth it to learn to code custom scrapers* is definitely still a debate. It certainly is more justifiable than it was even 5 years ago, given the much-lower barrier to entry and the much higher amount of digital information. I agree with Aron Pilhofer that a day of Excel training for every journalist would bring massive benefits to the industry as a whole. But that’s because journalism has been in overdue need of spreadsheet-skills, not because spreadsheets are in themselves, alone, useful sources of information.

(* of course, learning enough programming to make custom scrapers will as a bonus let you do every other powerful data-related task made possible through the mechanism and abstract principles of code)

Check out this series of slides @KnowTheory and I presented at #IRE12. It’s full of examples. 

Also look for tip sheets and other resources from the conference, here and here.

