In my experience, tech projects generally start with a mistake. In our case it started with the belief that the best way to find churn was through a traditional full-text search index like Xapian or Lucene. These indexes work by identifying the rarest terms in a document and looking for other documents that share those rare terms. Here’s one of the algorithms they use: http://en.wikipedia.org/wiki/Okapi_BM25. Our initial thought was that, by identifying press releases and articles with similarly unusual terms, we could unearth churn.
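To make concrete why rare terms dominate this kind of scoring, here is a toy BM25-style scorer. The function name, parameters and example values are my own illustration; only the weighting scheme follows the linked article:

```python
import math

def bm25_score(query_terms, doc_terms, doc_freq, n_docs, avg_len,
               k1=1.5, b=0.75):
    """Toy BM25: score one document against some query terms.

    doc_freq maps each term to the number of documents containing it.
    Rare terms (low document frequency) get a high IDF weight, so they
    dominate the score, which is exactly the behaviour described above.
    """
    score = 0.0
    for term in query_terms:
        tf = doc_terms.count(term)
        if tf == 0:
            continue
        # Inverse document frequency: the rarer the term, the bigger the weight
        idf = math.log((n_docs - doc_freq[term] + 0.5)
                       / (doc_freq[term] + 0.5) + 1)
        # Term-frequency saturation with document-length normalisation
        score += idf * (tf * (k1 + 1)) / (
            tf + k1 * (1 - b + b * len(doc_terms) / avg_len))
    return score
```

With a corpus of 100 documents where 'the' appears in 90 of them and an unusual word in only 2, the unusual word contributes far more to the score even if 'the' appears more often in the document.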
Trouble is, this ignores the order and grammar of the search text, which are crucial indicators of churn. When looking for an article that is mostly cut and pasted from a press release, the prepositions, the adverbs, even the spaces in the sentences are as important as the distinctive terms used in the text.
So we gave up on Xapian and started experimenting with methodologies that focus on exact matching of sequences.
The most effective methodology we developed, which we’ve called super fast match (SFM), works by ‘hashing’ strings of text. In other words, it compresses strings of text into numbers. Once compressed, identical strings are much easier, and quicker, to compare.
We decided to try hashing every fifteen-character window of text, and then storing document ids against each hash. When we found hashes with multiple document ids, we would note an association. Bodies of text (i.e. articles and press releases) with lots of associations with other bodies of text would then have the highest chance of being churn.
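A minimal sketch of that scheme in Python. Python’s built-in `hash` stands in for whatever hash function the production system used, and the function names are my own:

```python
from collections import defaultdict

WINDOW = 15  # fifteen-character windows, as described above

def window_hashes(text, size=WINDOW):
    """Hash every overlapping `size`-character window of the text."""
    return {hash(text[i:i + size]) for i in range(len(text) - size + 1)}

def build_index(documents):
    """Map each window hash to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for h in window_hashes(text):
            index[h].add(doc_id)
    return index

def association_counts(index):
    """Count shared windows between every pair of documents.
    High counts suggest one body of text was pasted from another."""
    pairs = defaultdict(int)
    for doc_ids in index.values():
        ids = sorted(doc_ids)
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs[(ids[i], ids[j])] += 1
    return pairs
```

A press release and an article that copies a sentence from it will share dozens of window hashes, while unrelated documents will share none.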
The immediate benefits of this methodology were that every new document could be compared with every existing document, using a consistently accurate scoring method. Also, because hashes are compact and of fixed length, unsigned integers in this case, they could potentially be stored in memory, or if not, in a very fast key value store such as Redis or Kyoto Cabinet. This would enable extremely quick similarity checking of pasted documents.
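To illustrate the fixed-length point, here is a sketch that packs each window hash into a 32-bit unsigned integer. `crc32` is just a convenient stand-in, not necessarily the hash SFM used:

```python
import zlib
from array import array

def fixed_width_hashes(text, size=15):
    """Hash each 15-character window to a 32-bit unsigned integer,
    stored in a compact C-style array rather than Python objects."""
    return array('I', (zlib.crc32(text[i:i + size].encode())
                       for i in range(len(text) - size + 1)))

text = "Sixty-four bytes of text gives fifty 15-character windows here!!"
hashes = fixed_width_hashes(text)
# Each hash occupies a fixed 4 bytes regardless of window content, which
# is why a whole corpus of windows can plausibly be held in memory, or
# used directly as keys in a store like Redis or Kyoto Cabinet.
```

Because every key is the same small size, lookups against a new, possibly pasted document reduce to a batch of integer key fetches, which is what makes the similarity check so quick.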
To make SFM work with press releases and news articles we came up with two replacement algorithms: one that narrows the full set of documents down to likely churn candidates, and another that takes this set and ranks it by the amount of copying and pasting, while also joining the 15-character windows into the longest possible matched sections.
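The joining step of that second algorithm can be sketched like this: overlapping or adjacent window matches merge into the longest contiguous copied sections. This is an illustration under my own assumptions, not the actual implementation:

```python
def merge_windows(match_positions, size=15):
    """Join overlapping or adjacent 15-character window matches into the
    longest possible copied sections, as (start, end) character offsets."""
    sections = []
    for pos in sorted(match_positions):
        if sections and pos <= sections[-1][1]:
            # This window overlaps or touches the current section: extend it
            sections[-1][1] = max(sections[-1][1], pos + size)
        else:
            # Gap before this window: start a new section
            sections.append([pos, pos + size])
    return [tuple(s) for s in sections]
```

Candidates could then be ranked by the total length of their merged sections, i.e. how many characters appear to have been copied and pasted.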
Though this worked as a methodology, it still required stacks of processing power to run quickly. So we then went through about seven implementations of the first algorithm to find the fastest, most cost-effective way of getting adequate performance.