The new churnalism.com – behind the scenes

 

Churnalism.com is a tool that helps you detect churnalism. The original churnalism site was a Django-based website, backed by lots of Python and some C++.

As part of a big overhaul, things have been split up into components, mostly written in Go.

In addition to the website, there is now a browser extension, which sits passively in a user’s web browser and pops up to warn them if any churnalism is detected.

 

The overall system now looks like this:

[Image: diagram of the overall system]

Churnalism website

The website itself is just a thin front-end layer. It delegates all the hard work. Searches are handled by calling the SFM search API, then displaying the results.

If the user enters an article URL, the site uses the Readability API service to fetch the article and extract its text. The result is passed on to the SFM API.

The website’s API, as used by the browser extension, is just a simple pass-through to the SFM API.

Website source code: https://github.com/bcampbell/Churnalism-go.

 

SuperfastMatch (SFM)

SuperfastMatch is the core of the system. It lets you compare a document against a large set of others very quickly.

It is based upon the Rabin-Karp algorithm. SFM uses a MongoDB database to hold its collection of documents. From this, SFM builds an in-memory index for performing searches.

The database is also used for:

- holding document associations (stored search results)

- a work queue for adding and removing documents and associations

- tracking feeds of incoming documents
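To give a feel for the Rabin-Karp idea behind the in-memory index, here is a minimal sketch of a rolling hash over fixed-size text windows, plus a toy index built from it. It is not SFM's implementation: the window size, the hash constants and the names `windowHashes`, `Index`, `Add` and `SharedWindows` are all assumptions made for illustration.

```go
package main

import "fmt"

const (
	base   = 257        // multiplier for the polynomial hash (assumed value)
	modulo = 1000000007 // large prime modulus (assumed value)
)

// windowHashes returns the hash of every window of `size` bytes in text.
// Each hash is derived from the previous one in constant time, which is
// the "rolling" part of Rabin-Karp.
func windowHashes(text string, size int) []uint64 {
	if size <= 0 || len(text) < size {
		return nil
	}
	// high = base^(size-1) mod modulo, the weight of the window's first byte.
	var high uint64 = 1
	for i := 0; i < size-1; i++ {
		high = (high * base) % modulo
	}
	// Hash the first window directly.
	var h uint64
	for i := 0; i < size; i++ {
		h = (h*base + uint64(text[i])) % modulo
	}
	hashes := []uint64{h}
	for i := size; i < len(text); i++ {
		// Drop the byte leaving the window, then shift in the new byte.
		h = (h + modulo - (uint64(text[i-size])*high)%modulo) % modulo
		h = (h*base + uint64(text[i])) % modulo
		hashes = append(hashes, h)
	}
	return hashes
}

// Index maps a window hash to the ids of the documents containing it.
type Index map[uint64][]int

// Add records every distinct window hash of text under docID.
func (idx Index) Add(docID int, text string, size int) {
	seen := map[uint64]bool{}
	for _, h := range windowHashes(text, size) {
		if !seen[h] {
			seen[h] = true
			idx[h] = append(idx[h], docID)
		}
	}
}

// SharedWindows counts, per indexed document, how many distinct window
// hashes it shares with query. Different windows can collide on the same
// hash, so a full Rabin-Karp search would confirm candidates by comparing
// the actual text.
func (idx Index) SharedWindows(query string, size int) map[int]int {
	counts := map[int]int{}
	seen := map[uint64]bool{}
	for _, h := range windowHashes(query, size) {
		if seen[h] {
			continue
		}
		seen[h] = true
		for _, id := range idx[h] {
			counts[id]++
		}
	}
	return counts
}

func main() {
	const size = 12
	idx := Index{}
	idx.Add(1, "the minister announced a new scheme today", size)
	idx.Add(2, "an entirely different press release", size)
	fmt.Println(idx.SharedWindows("sources say the minister announced a new scheme", size))
}
```

Because the index is keyed on hashes rather than text, looking up a new article costs roughly one map lookup per window, regardless of how many documents are stored.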

 

Feeds

SFM can be configured to monitor a bunch of Server-Sent Event (SSE) feeds to stream documents into the system as they become available.

An SSE server is an HTTP server. Clients connect to it via HTTP and keep the connection open. Whenever a new event occurs (i.e. a new document has come in), more data is sent. Each event has an id associated with it, and a client can connect with a `Last-Event-ID` HTTP header containing the id of the last event it received. The server can then send older events – if it still has them – to help the client catch up on any it may have missed.

SSE feeds were originally intended to be used by web browsers, but in this system we use them behind the scenes, purely on the server side.

The SFM instance used to power churnalism.com is set up to take press releases from UKPR and news articles from journalisted.com.

SFM source code: https://github.com/donovanhide/superfastmatch.

 

UKPR

To get press releases, we screen-scrape a whole bunch of websites. UKPR is the component which handles this.

Written in Go, it performs the scraping and provides SSE feeds to any interested clients – to SFM, in this case.

The current scrapers all cover sources of interest to the UK (see the FAQs for the current list), but it is reasonably simple to add to those scrapers or swap them out for a different set.

UKPR has a database (SQLite) in which it stores the press releases as they are scraped. This means any clients which go down for a while can catch up on any press releases they may have missed.

UKPR source code: https://github.com/bcampbell/ukpr.

 

Browser extension

The churnalism.com browser extension monitors articles for churnalism as the user browses the web. If any churn is detected, a warning is displayed and the user has the option of highlighting the copied text.

There are versions of the extension for Firefox and Chrome. A lot of the code is shared between the two. It was originally based on the unsourced.org extension, which in turn used some code from the Sunlight Foundation’s US churnalism extension. Thank you, Sunlight.

It uses a version of the Readability algorithm (from Sunlight’s code) to extract the article text, then performs a search using the churnalism.com search API (which is really just a pass-through to the SFM search).

Browser extension source code: https://github.com/bcampbell/churnalism-extensions.