Searching For a Needle

Scraping the web for information can be, you guessed it, as difficult as finding a needle in a haystack.

So, with that in mind, a group of hackers and journalists has created a program with a similar-sounding name to solve the problem.

Haystax is an open source tool that works as a bookmarklet: users drag it to the toolbar at the top of their browsers and then launch it with a click.

Tyler Dukes, the managing editor of Reporters’ Lab, took some time out to chat.

What is Haystax?

We want Haystax to eventually be flexible enough to tackle a bunch of different databases and to work in a variety of use cases. It’s not very flexible right now, since we’re still building it out.

Where was the product developed?

So we developed this at Newshack day in San Francisco last weekend.

It was just a hackathon put on by the Chronicle and the Center for Investigative Reporting.

And basically we just got together for a weekend and pitched the idea and built it out in the span of about 30 hours or so.

It got a pretty big response from the people who were there. And, yeah, we were able to create a working prototype in the span of a weekend. 

How did the idea come up?

I came to the event with the pitch, but it’s something that we have been thinking about at the lab a lot.

We have a wishlist of things that we wish we could get to at the lab.

This was something that we were hoping for, a really good point-and-click web scraper. That was at the top of the list.

And really that came about because we have done a lot of work evaluating what’s out there in terms of web scraping and came to the conclusion that there is nothing out there that is great and cheap for reporters who are just looking to do something off the shelf.

Who was involved?

We had a team of about eight people. A bunch of different journalists and one developer, mostly from the San Francisco area.

Randall Leeds, who works at Hypothes.is, deserves a lot of credit for putting up with a bunch of journalists for a weekend and coming up with a really brilliant, elegant solution.

How does it work?

So it’s actually built entirely in JavaScript, on top of this thing called Hackasaurus. It’s an open source project that Mozilla put out. There is a feature within that project called X-Ray Goggles.

You can turn it on in any browser. Mozilla built it as a tool for exploring the structure of web pages, so when our developer was thinking through this problem, he thought it might be a really cool way to tackle it.

So, with X-Ray Goggles you click on this thing and you can hover over elements of a web page. It shows you: this is a table, this is a div. It helps people learn how websites are coded.

Our developer’s thinking was that, with a couple of extra lines of code, it could let a user define what was there and then set it up to scrape based on the structure of the page.
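
To make the idea of scraping "based on structure" a little more concrete, here is a minimal sketch. It is not Haystax's code: Haystax runs as JavaScript in the browser, while this is a standalone Python illustration that uses only the standard library's html.parser. The names TableScraper, scrape_rows and SAMPLE_PAGE are made up for this example, and the "user's selection" is reduced to two assumed parameters: which tag marks a record and which tags mark its fields.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every cell, grouped by the row that contains it.

    Hypothetical helper for illustration only; not part of Haystax.
    """

    def __init__(self, row_tag="tr", cell_tags=("td", "th")):
        super().__init__()
        self.row_tag = row_tag        # tag that marks one record
        self.cell_tags = cell_tags    # tags that mark the fields of a record
        self.rows = []                # finished rows, each a list of strings
        self._row = None              # cells of the row currently open
        self._cell = None             # text fragments of the cell currently open

    def handle_starttag(self, tag, attrs):
        if tag == self.row_tag:
            self._row = []
        elif tag in self.cell_tags and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Keep text only while we are inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in self.cell_tags and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == self.row_tag and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # drop any cell left open by sloppy markup


def scrape_rows(html, row_tag="tr", cell_tags=("td", "th")):
    """Return every row in the page as a list of cell strings."""
    parser = TableScraper(row_tag, cell_tags)
    parser.feed(html)
    parser.close()
    return parser.rows


# A made-up page standing in for whatever the user pointed at.
SAMPLE_PAGE = """
<table>
  <tr><th>Name</th><th>City</th></tr>
  <tr><td>Ada</td><td>London</td></tr>
  <tr><td>Grace</td><td>New York</td></tr>
</table>
"""

rows = scrape_rows(SAMPLE_PAGE)
for row in rows:
    print(", ".join(row))
```

The same function can be pointed at a different structure by changing the two tag parameters, for example scrape_rows(page, row_tag="ul", cell_tags=("li",)) for a bulleted list. That is the general shape of what the interview describes: once the user has identified one element, everything with the same structure can be collected automatically.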

When do you think this will be ready for prime time?

That’s a really good question.

We put this thing together in about 30 hours and it has been about a week, and we haven’t done any significant work on it.

I would like to say it’s up and ready now, and people can use it. They can look at the source code on GitHub.

But it doesn’t seem to work on lots of different types of things, so it really isn’t prime time ready.

If we can get a lot of buy-in from the open source community, however, and from other folks who want to use a tool like this, then I think it’s a simple enough problem that we could have a real tool in a few weeks, or a few months.

It really depends on how quickly we can mobilize people and get them interested in this sort of problem.
