CitationHunt

Published 03 May 2015

As with most people who were students at some point over the past decade, I use Wikipedia quite heavily: it’s simply the best source for getting an overview of a new topic and finding a trail of references for in-depth study. It’s a project I truly believe in and to which I’ve always wanted to give something back.

In addition to that, I’ve become increasingly interested lately in the process of welcoming and mentoring newcomers in open source communities. I began to think that, while it would be nice to turn myself into a Wikipedia editor, it would be great if I could also produce some sort of tool for simplifying the daunting task of making the first few edits.

So I built CitationHunt.

The idea

CitationHunt is a minimalistic tool for browsing unsourced statements – those marked with the iconic “[citation needed]” superscript – in the English Wikipedia.

My thought process in coming up with it went something like this: as a wannabe editor, I asked myself what was getting in the way of my contributing. Three main factors came to mind:

  1. The Wikipedia manual of style is hefty and intimidating;
  2. I didn’t want to accidentally get involved in a lame edit war;
  3. I didn’t quite know how to find articles and sections in need of editing.

I tried to think of a class of edits that would be at the same time easy to perform and uncontroversial, but still interesting, and adding citations seemed like a good fit. The remaining problem was, of course, to facilitate the search for article snippets lacking citations, and that’s the problem I sought to address.

To keep it from being overwhelming, CitationHunt had to be simple, even to the detriment of the feature set, and minimalistic in appearance. The user experience was inspired by the swipe right/swipe left mechanics of the Tinder dating app. The most technically challenging part of this project is parsing Wikipedia dumps correctly and reliably, but thankfully none of that complexity made its way into the web interface.

I hosted a first working version on Heroku and used it to fill in a few references. The first I ever did was on the article about Paris Hilton’s sex tape, something I’m equally proud of and embarrassed about. This early dogfooding was fundamental in assuring me there was some value in what I was doing, if only for my own use.

I then introduced CitationHunt to the community on the Village Pump page, where it was quite well received. Other Wikipedians were quick to inform me that the Wikimedia Foundation is kind enough to provide hosting for tools like mine, and I immediately moved CitationHunt to their infrastructure.

Lessons learned

This project has taught me a few interesting lessons:

Adding citations can be super easy, but also super hard: Many unsourced statements reflect their source nearly verbatim, so anyone with basic search engine skills can easily fill them in. On the other hand, if you can find the decades-old pamphlet in which hardcore singer H.R. is described as “James Brown gone berserk” (one of the snippets in CitationHunt), I’ll be immensely impressed.

Serendipity happens: Just as I was looking for an autocomplete library that didn’t depend on jQuery, Ilya Grigorik tweeted about Awesomplete, which is pretty great. Speaking of him, CitationHunt incorporates quite a few tricks from his book High Performance Browser Networking.

Parsing Wikipedia dumps is hard: It’s a lot of data, and few parsers properly support templates. Several snippets appear broken because of that. I’m sorry.
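
To give a flavor of the problem, here is a minimal Python sketch of the naive approach. It is not CitationHunt's actual code: the regular expression, the function name `find_unsourced_snippets` and the sample text are all made up for illustration, and it only recognizes two spellings of the template.

```python
import re

# Matches {{citation needed}} and the {{cn}} shorthand, with optional
# parameters such as |date=May 2015. Templates nested inside the
# parameters are NOT handled, which is exactly the kind of gap that
# makes snippets come out broken.
CITATION_NEEDED = re.compile(
    r"\{\{\s*(?:citation needed|cn)\s*(?:\|[^{}]*)?\}\}",
    re.IGNORECASE,
)


def find_unsourced_snippets(wikitext):
    """Return the paragraphs of `wikitext` that contain a marker."""
    paragraphs = re.split(r"\n\s*\n", wikitext)
    return [p.strip() for p in paragraphs if CITATION_NEEDED.search(p)]


# Made-up wikitext, for illustration only.
sample = (
    "The bridge is located north of the town.\n\n"
    "It was the longest bridge of its kind when it opened."
    "{{Citation needed|date=May 2015}}\n\n"
    "It was later rebuilt.<ref>Some book</ref>"
)
snippets = find_unsourced_snippets(sample)
for snippet in snippets:
    print(snippet)
```

A regular expression like this is fine for spotting the marker, but turning the surrounding wikitext into clean, readable text means expanding or stripping every other template in the paragraph, and that is where things get hard.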

Speed is a feature: Not really a new lesson, but the only way I could support the streamlined experience I wanted for CitationHunt was for it to be very fast. That motivated aggressive caching and prefetching, as well as the move from SQLite to MySQL on the server.
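
The post doesn't go into which layer the caching lives in, so the following Python sketch only illustrates the general idea, assuming simple in-process memoization. The `FAKE_DB` dictionary, the `get_snippet` function and the snippet ids are hypothetical stand-ins, not part of CitationHunt.

```python
from functools import lru_cache

# Hypothetical stand-in for the database; a real deployment would
# query MySQL here instead of a dictionary.
FAKE_DB = {
    "a1": "An unsourced statement.",
    "b2": "Another unsourced statement.",
}
db_hits = 0


@lru_cache(maxsize=1024)
def get_snippet(snippet_id):
    """Look up a snippet, touching the 'database' only on a cache miss."""
    global db_hits
    db_hits += 1
    return FAKE_DB[snippet_id]


get_snippet("a1")
get_snippet("a1")  # repeated request, served from the cache
get_snippet("b2")
print(db_hits, get_snippet.cache_info())
```

Repeated requests for the same snippet skip the slow lookup entirely, which is the property that matters when the goal is to flip through snippets without waiting.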

Conclusion

CitationHunt has been pretty useful to me, and I’ve made several edits through it since its launch. If you made it this far in the post, do consider trying it out and letting me know what you think!