This week Ted Han of DocumentCloud and I presented a session on web scraping at the Investigative Reporters and Editors annual conference in Boston.
The technique is money for taking unstructured data off the web and putting it into tables. And, in my opinion, it’s key to better stories about hard-to-get public data.
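To make the idea concrete, here’s a minimal sketch of what a scraper does: walk the HTML of a page and pull each table row out as a structured record. The sample page and the data in it are made up for illustration; a real scraper would first fetch the page with `urllib` or the `requests` library, and most working scrapers use a friendlier parsing library like BeautifulSoup. This one sticks to Python’s standard library so it runs as-is.

```python
# Sketch: turn an HTML table into rows of structured data.
# SAMPLE_PAGE is a hypothetical stand-in for a fetched web page.
from html.parser import HTMLParser

SAMPLE_PAGE = """
<table>
  <tr><td>Smith, Jane</td><td>$1,200</td></tr>
  <tr><td>Doe, John</td><td>$950</td></tr>
</table>
"""

class TableScraper(HTMLParser):
    """Collects the text of each <td> cell, grouped by <tr> row."""
    def __init__(self):
        super().__init__()
        self.rows = []        # one list of cell strings per <tr>
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])   # start a new record
        elif tag == "td":
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and data.strip():
            self.rows[-1].append(data.strip())

scraper = TableScraper()
scraper.feed(SAMPLE_PAGE)
print(scraper.rows)
# → [['Smith, Jane', '$1,200'], ['Doo... each row is now tabular data
```

From here it’s one short step to writing the rows out with the `csv` module and opening them in a spreadsheet.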
Whether it’s worth it for journalists to learn to code custom scrapers* is definitely still a debate. It’s certainly more justifiable than it was even 5 years ago, given the much lower barrier to entry and the much greater volume of digital information. I agree with Aron Pilhofer that a day of Excel training for every journalist would bring massive benefits to the industry as a whole. But that’s because journalism has long been overdue for spreadsheet skills, not because spreadsheets are, in themselves, useful sources of information.
(* of course, learning enough programming to build custom scrapers will, as a bonus, let you do every other powerful data-related task that the mechanism and abstract principles of code make possible)
Check out this series of slides that @KnowTheory and I presented at #IRE12. It’s full of examples.