#HackJersey: Reputational Risk Pitch

Politicians. Businesses. Banks. All are worried about their reputation.

It affects elections. It can determine stock prices. And it can create jobs — think about all those social media gigs created to manage disgruntled Twitter followers.

That’s why so many are pouring resources into sentiment analysis technology that promises to guard against this type of risk.

Technology giants such as IBM and SAS offer clients software packages that comb the Web for clues to consumer sentiment. Those programs parse language and attempt to understand the meaning of words, pairs of words or phrases, in context.

Luckily, rather than having to create a new methodology for this type of analysis, several Stanford students have published an API that takes care of the hard work for us. The software counts the number of positive and negative tweets containing specific keywords (for example, ‘BofA’, ‘SuperStorm’, or ‘Christie’).

The Guardian used the tool to analyze the sentiment around Rupert and James Murdoch in the wake of the British tabloid cell phone hacking scandal.

The Goal: 

Develop a site that can track either large publicly traded companies or elections using the aforementioned API.

At the same time, the visualization will track stock prices and company news, or polls, to see if there is any correlation.

The Details: 

I’m open to how this news app might be designed, but I imagine it will have three main elements: some kind of score based on the number of positive and negative tweets; a graphic for each keyword, #hashtag or username; and a baseline to measure a topic’s sentiment against, such as polling results, stock prices or a map showing where the most negative and positive tweets are coming from.

Also, since the tool only counts current tweets, I’d recommend querying the API at a regular interval (whether that’s several times an hour, a day or a week) to give viewers an idea of how sentiment around a topic has changed over time.
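To make the scoring idea concrete, here is a minimal Python sketch. The `net_score` function, the -1-to-1 scale and the sample tweet counts are my own assumptions for illustration, not part of the Stanford API.

```python
# A minimal sketch of the scoring idea. The sentiment API itself is not
# shown here; assume each scheduled poll returns counts of positive and
# negative tweets for a keyword. The -1 to 1 scale is an assumption.

def net_score(positive, negative):
    """Net sentiment on a -1 (all negative) to 1 (all positive) scale."""
    total = positive + negative
    if total == 0:
        return 0.0
    return (positive - negative) / total

# Each scheduled run appends a snapshot, building a time series
# that shows how sentiment around a keyword changes.
history = []

def record_snapshot(keyword, positive, negative, timestamp):
    history.append({
        "keyword": keyword,
        "time": timestamp,
        "score": net_score(positive, negative),
    })

# Invented example counts for two hourly polls.
record_snapshot("Christie", 120, 80, "2013-01-25T10:00")
record_snapshot("Christie", 90, 110, "2013-01-25T11:00")
```

Running this at whatever interval you choose (cron, a scheduled task, etc.) is what turns a one-off score into a trend line.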

There are several different targets for this visualization. I’m open to concentrating on municipal government, polling results or publicly traded companies.

About HackJersey: 

On the weekend of Jan. 25, Hack Jersey will host the first hackathon in the state to invite journalists and coders to work together, competing to build innovative projects that can transform the way we use data and experience news in the Garden State.

The idea of a Jersey-based hackathon began with a conversation between Debbie Galant, director of the NJ News Commons at Montclair State University, and Tom Meagher, data editor at Digital First Media, at the Online News Association conference this September. Since then, dozens of volunteers from news organizations, nonprofits and tech startups across the state have joined our planning team. And Knight-Mozilla’s OpenNews initiative joined the NJ News Commons as a leading sponsor of our hack weekend.

Participants will meet at our launch party on Friday, Jan. 25 at Fitzgerald’s 1928 in Glen Ridge (RSVP here!). The next morning, through our primary sponsor, the NJ News Commons, our hackathon will begin at University Hall at Montclair State. Participants will break into teams and have 24 hours to create their open source projects. On Sunday afternoon, Jan. 27, a panel of media and tech judges will choose the winners and award prizes to the best projects.

To keep up with the latest news on our hack weekend, follow us on Twitter @hackjersey. You can find more information about preparing for the hackathon on our blog. Want to start brainstorming ideas for your projects? Check out our list of public data where you might want to start and be sure to read our rules for the competition.

Feel free to reach me on Twitter, Gmail or LinkedIn. Or, better yet, just leave a comment below.


Searching For a Needle

Scraping the web for information can be — you guessed it — as difficult as finding a needle in a haystack.

So, with that in mind, a group of hackers and journalists has created a program with a similar-sounding name to solve the problem.

Haystax is an open source tool that works as a bookmarklet: users drag it to the top of their browsers and activate it with a click.

Tyler Dukes, the managing editor of Reporters’ Lab, took some time out to chat.

What is Haystax?

We want Haystax to be flexible enough to tackle a bunch of different databases eventually, so it’s versatile enough to work in a variety of use cases. It’s not very flexible right now, since we’re still building it out.

Where was the product developed?

So we developed this at Newshack day in San Francisco last weekend.

It was just a hackathon put on by the Chronicle and the Center for Investigative Reporting.

And basically we just got together for a weekend and pitched the idea and built it out in the span of about 30 hours or so.

It got a pretty big response from the people who were there. And, yeah, we were able to create a working prototype in the span of a weekend. 

How did the idea come up?

I came to the event with the pitch, but it’s something that we have been thinking about at the lab a lot.

We have a wishlist of things that we wish we could get to at the lab. This was something that we were hoping for: a really good point-and-click web scraper. That was at the top of the list.

And really that came about because we have done a lot of work evaluating what’s out there in terms of web scraping, and we came to the conclusion that there is nothing out there that is great and cheap for reporters who are just looking to do something off the shelf.

Who was involved?

We had a team of about eight people. A bunch of different journalists and one developer, mostly from the San Francisco area.

Randall Leeds, who works at Hypothes.is, deserves a lot of credit for putting up with a bunch of journalists for a weekend and coming up with a really brilliant, elegant solution.

How does it work?

So it’s actually built in JavaScript, entirely on top of this thing called Hackasaurus. It’s an open source project that Mozilla put out. There is a feature within that project called X-Ray Goggles.

You just turn it on in any browser. Mozilla built it as a tool for exploring the structure of web pages, and when our developer was thinking through this problem, he thought it might be a really cool way to tackle it.

So, with X-Ray Goggles, you click on this thing and you can hover over elements of a webpage. It shows: this is a table; this is a div. And it helps people learn how websites are coded.

Our developer’s thinking was that, with a couple of extra lines of code, it would allow a user to define what was there and then set it up to scrape based on the structure of the pages.
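Haystax itself is JavaScript built on top of X-Ray Goggles, but the underlying idea (once you know a page’s structure, a few lines of code can pull the data out of it) can be sketched in any language. Here is a toy Python version using only the standard library; the sample table is invented for the example.

```python
# A toy illustration of structure-based scraping: tell the parser which
# elements hold the data (here, table rows and cells) and collect them.
from html.parser import HTMLParser

class TableScraper(HTMLParser):
    """Collects the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self.row = None     # row currently being built
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = []
        elif tag in ("td", "th"):
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self.row is not None:
            self.rows.append(self.row)
            self.row = None
        elif tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.row.append(data.strip())

# An invented page fragment standing in for a real government site.
page = ("<table><tr><th>Name</th><th>Total</th></tr>"
        "<tr><td>Smith</td><td>1,200</td></tr></table>")

scraper = TableScraper()
scraper.feed(page)
```

After `feed()`, `scraper.rows` holds the table as a list of lists, ready to be written out as a spreadsheet.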

When do you think this will be ready for prime time?

That’s a really good question.

We put this thing together in about 30 hours and it has been about a week, and we haven’t done any significant work on it.

I would like to say it’s up and ready now, and people can use it. They can look at the source code on GitHub.

But it doesn’t seem to work on lots of different types of things, so it really isn’t prime time ready.

If we can get a lot of buy-in from the open source community, however, and other folks who want to use a tool like this, then I think it’s a simple enough problem that we could have a real tool in a few weeks, or a few months.

It really depends on how quickly we can mobilize people and get them interested in this sort of problem.

Scrape The World


This week Ted Han of DocumentCloud and I presented a session on web scraping at the Investigative Reporters and Editors annual conference in Boston.

The technique is invaluable for taking unstructured data off of the web and placing it into tables. And, in my opinion, it’s key to better stories about hard-to-get public data.
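As a minimal illustration of that payoff, here is a Python sketch that writes scraped rows into a CSV table anyone can open in Excel. The inspection data is invented for the example.

```python
# Once rows have been scraped, the standard csv module turns them into
# a table. The rows below are made up for illustration.
import csv
import io

rows = [
    ["inspection_date", "restaurant", "grade"],
    ["2012-05-01", "Tony's Diner", "A"],
    ["2012-05-03", "Harbor Grill", "B"],
]

# An in-memory buffer; swap in open("inspections.csv", "w", newline="")
# to write a real file instead.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```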

From Dan Nguyen of The Bastard’s Book of Ruby fame: 

Whether it’s worth it to learn to code custom scrapers* is definitely still a debate. It certainly is more justifiable than it was even 5 years ago, given the much-lower barrier to entry and the much higher amount of digital information. I agree with Aron Pilhofer that a day of Excel training for every journalist would bring massive benefits to the industry as a whole. But that’s because journalism has been in overdue need of spreadsheet-skills, not because spreadsheets are in themselves, alone, useful sources of information.

(* of course, learning enough programming to make custom scrapers will as a bonus let you do every other powerful data-related task made possible through the mechanism and abstract principles of code)

Check out this series of slides @KnowTheory and I presented at #IRE12. It’s full of examples. 

*Also look for tip-sheets and other resources from the conference, here and here.

Docs on Docs on Docs

Developers and programmers are creating tools that use Google Docs, rather than databases such as MySQL or PostgreSQL, to power visualizations.

Inspired by the open source culture of the internet, code-writing reporters and hackers for hire are using easy-to-understand JavaScript to fuel simple applications with Excel-like spreadsheets.

These journalists are fueled by the opportunity to bring visualizations to newsrooms that are devoid of news app developers.

The latest example is Tabletop, created by Balance Media, which has already produced news apps for WNYC, The New York Times and ProPublica, among others.

“Tabletop was originally built to work with ProPublica’s TimelineSetter, a JS+Ruby library that creates timelines. You need some specifically-formatted JSON which is created by a Ruby script from a CSV, which means your workflow is usually spreadsheet -> CSV -> Ruby -> JSON -> JS.

With Tabletop, though, you get to hook right into a Google Spreadsheet for all of your info! You just need to massage your data a little bit, thanks to Google’s API messing with column names and you needing a timestamp.”
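Tabletop itself is JavaScript, but its core trick, turning each spreadsheet row into an object keyed by the column headers, is easy to sketch. This hypothetical Python version assumes the sheet has already been published and exported as CSV text; the data is invented.

```python
# A rough analogue of Tabletop's "simpleSheet" mode: each spreadsheet
# row becomes a dict keyed by the column headers.
import csv
import io

# In practice this CSV text would come from a published Google
# Spreadsheet; the contents here are invented for illustration.
sheet_csv = ("date,headline\n"
             "2011-07-04,Scandal breaks\n"
             "2011-07-19,Murdochs testify")

rows = list(csv.DictReader(io.StringIO(sheet_csv)))
```

The result is a list of dicts, which is exactly the shape most charting and timeline libraries want to consume.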

I am by no means a programmer. But show me an example that I can open up in a text editor and upload to a website using FileZilla, and even I can figure it out.

Shout Out: A big thanks to Andy Boyle of The Boston Globe for pointing me in the direction of this tool via Twitter. I’ll be sure to use it.