The technology driving churnalism.com


Donovan Hide

24.02.11

Churnalism API

In my experience, tech projects generally start with a mistake. In our case it started with the belief that the best way to find churn was through a traditional full-text search index like Xapian or Lucene. These indexes work by identifying the rarest terms in a document and looking for other documents that share those rare terms. Here's one of the algorithms they use: http://en.wikipedia.org/wiki/Okapi_BM25. Our initial thought was that, by identifying press releases and articles with similarly unusual terms, we could unearth churn.
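For the curious, here is a minimal sketch of that BM25 weighting in Python. It illustrates the standard formula, not code from our system; the parameter names k1 and b follow the usual formulation.

```python
import math

def bm25_weight(term_freq, doc_len, avg_doc_len, num_docs, docs_with_term,
                k1=1.5, b=0.75):
    """Okapi BM25 weight for a single term in a single document."""
    # Rare terms (those found in few documents) get a high inverse
    # document frequency, so they dominate the ranking.
    idf = math.log((num_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))
    # Term frequency is dampened and normalised by relative document length.
    tf = (term_freq * (k1 + 1)) / (
        term_freq + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * tf
```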

Trouble is, this ignores the order and grammar of the search text, which are crucial indicators of churn. When looking for an article that has mostly been cut and pasted from a press release, the prepositions, the adverbs, even the spaces in a sentence are as important as the distinctive terms used in the text.

So we gave up on Xapian and started experimenting with methodologies that focus on exact matching of sequences.

The most effective methodology we developed, which we've called super fast match (SFM), works by 'hashing' strings of text. In other words, it compresses strings of text into numbers. Once compressed, identical strings of text are much easier, and much quicker, to compare.
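Roughly speaking, the hashing step looks like this. CRC32 here is just a stand-in for illustration, not necessarily the hash we use; the point is that any fast hash turns text into a fixed-size unsigned integer.

```python
import zlib

def hash_text(text):
    # Compress an arbitrary string into a 32-bit unsigned integer.
    # The same text always produces the same number, so comparing two
    # strings becomes a single integer comparison.
    return zlib.crc32(text.encode("utf-8"))

hash_text("cut and pasted!")  # a fixed-size number, however long the input
```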

We decided to try hashing every fifteen-character string of text, and then storing document ids against each hash. Whenever we found a hash with multiple document ids we would note an association. Bodies of text (i.e. articles and press releases) with lots of associations with other bodies of text would then have the highest chance of being churn.
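A toy version of that index, again assuming CRC32 as a stand-in hash, might look like this:

```python
import zlib
from collections import defaultdict

WINDOW = 15  # the fifteen-character window length

index = defaultdict(set)  # window hash -> ids of documents containing it

def add_document(doc_id, text):
    # Hash every overlapping fifteen-character window and record which
    # document it came from.
    for i in range(len(text) - WINDOW + 1):
        index[zlib.crc32(text[i:i + WINDOW].encode("utf-8"))].add(doc_id)

def associations(text):
    # Count the windows this text shares with every stored document.
    counts = defaultdict(int)
    for i in range(len(text) - WINDOW + 1):
        h = zlib.crc32(text[i:i + WINDOW].encode("utf-8"))
        for doc_id in index.get(h, ()):
            counts[doc_id] += 1
    return counts  # documents with high counts are churn candidates
```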

The immediate benefits of this methodology were that every new document could be compared with every existing document, using a consistently accurate scoring method. Also, because hashes are compact and of fixed length (unsigned integers in this case), they could potentially be stored in memory or, if not, in a very fast key-value store such as Redis or Kyoto Cabinet. This would enable extremely quick similarity checking of pasted documents.
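As a sketch of how that might look in Redis, one set per hash is a natural fit. This assumes the redis-py client and a local server, and is an illustrative layout rather than our production one:

```python
import redis  # assumes the redis-py client and a local Redis server

r = redis.Redis()

def store(window_hash, doc_id):
    # One Redis set per window hash; members are the document ids that
    # contain that window.
    r.sadd(window_hash, doc_id)

def documents_for(window_hash):
    # More than one member means two documents share this window.
    return r.smembers(window_hash)
```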

To make SFM work with press releases and news articles we came up with two algorithms to replace the full-text approach: one that narrows the full set of documents down to likely churn candidates, and another that takes this candidate set and ranks it by the amount of copying and pasting, while also joining the fifteen-character windows into the longest possible copied sections.
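The joining step in that second algorithm is essentially an interval merge: sort the offsets of the matched windows and fuse any that overlap. A sketch, using a hypothetical helper rather than our production code:

```python
def merge_matches(positions, window=15):
    # Join overlapping fifteen-character window matches into the longest
    # possible contiguous fragments, as (start, end) character offsets.
    spans = []
    for pos in sorted(positions):
        if spans and pos <= spans[-1][1]:
            spans[-1] = (spans[-1][0], pos + window)  # extend the current run
        else:
            spans.append((pos, pos + window))  # start a new run
    return spans

# Windows matched at offsets 0-5 and 40 collapse into two fragments;
# summing their lengths measures how much was copied and pasted.
merge_matches([0, 1, 2, 3, 4, 5, 40])  # -> [(0, 20), (40, 55)]
```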

Though this worked as a methodology, it still required stacks of processing power to run quickly. So we then went through about seven implementations of the first algorithm to find the fastest, most cost-effective way of getting adequate performance.
