Do you know what Clustering is?

June 11th, 2018
One of the big problems with automated news feeds is that they keep sending you the same stories over and over again. The articles might technically be different, appearing in different papers or written by different people, but they aren’t saying anything really new. With the rise of automated journalism and content marketing, and with hardly any “old school” journalists left writing and researching from scratch, most “news” articles just rehash the same story. To make matters worse, once you have clicked on something, the news feed provider’s algorithm latches on to it as one of your “likes” or “preferences” and sends you more and more of the same.

At truba we are building a system that will let you take control of the “sameyness” of the articles you receive. To do this, we are doing a lot of research into clustering and semantic similarity techniques.

Clustering has many meanings, so it is easy to get confused. For example, clustering can mean grouping lots of computers together on a network or it can mean allowing lots of different servers to connect to the same database. It can also refer to the way storage is managed within a single computer. Clustering in statistics means finding the data points that have similar values and grouping them together. This statistical clustering is the basis of lots of different techniques in data mining, machine learning, and natural language processing.
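
To make the statistical meaning concrete, here is a minimal sketch in Python. It is not one of the methods we are researching, just about the simplest possible way to group numbers with similar values: sort them, then start a new group whenever the gap to the previous value is too large. The function name group_by_gap and the max_gap threshold are made up for this illustration.

```python
def group_by_gap(values, max_gap):
    """Group numbers so that neighbouring values in a group differ by at most max_gap."""
    groups = []
    for value in sorted(values):
        # Join the current group if this value is close to the last one added,
        # otherwise start a new group.
        if groups and value - groups[-1][-1] <= max_gap:
            groups[-1].append(value)
        else:
            groups.append([value])
    return groups


if __name__ == "__main__":
    print(group_by_gap([1.0, 1.2, 0.9, 5.1, 5.3, 9.8], max_gap=1.0))
```

Real clustering methods work on many dimensions at once and do not need a hand-picked threshold like this, but the underlying idea of putting nearby points together is the same.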

We are interested in using clustering as a tool for natural language processing, because we want to design better, more precise ways of identifying which news articles are similar to each other. There are lots of existing clustering methods and techniques, some of which are more effective than others. For example, some simple statistical clustering techniques rely on counting the number of words two articles have in common. This is great for detecting plagiarism, where one article has copied chunks of another one. It is less effective at identifying two articles that are about a similar subject but written using different words, for example where a writer has written an “explainer” article that simplifies a complex report or scientific publication.
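
The word-counting idea can be sketched in a few lines of Python. This is an illustration only, assuming a very crude tokenizer and using the Jaccard measure (shared words divided by all distinct words in either text) as the score. The function names and example sentences are invented for this post.

```python
import re


def words(text):
    """Return the set of lowercase words in a text (a deliberately crude tokenizer)."""
    return set(re.findall(r"[a-z']+", text.lower()))


def word_overlap(a, b):
    """Jaccard similarity: shared words divided by all distinct words in either text."""
    words_a, words_b = words(a), words(b)
    union = words_a | words_b
    if not union:
        return 0.0
    return len(words_a & words_b) / len(union)


if __name__ == "__main__":
    original = "The council approved the new budget on Monday"
    near_copy = "The council approved the new budget late on Monday"
    paraphrase = "Local officials signed off on next year's spending plan"

    print(word_overlap(original, near_copy))   # high: almost every word is shared
    print(word_overlap(original, paraphrase))  # low: same story, different words
```

The near copy scores highly, while the paraphrase scores close to zero even though it reports the same event, which is exactly the weakness described above.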

Some clustering techniques focus on words, some on sentences, and some on long texts. Some rely on pre-defined knowledge – for example a thesaurus that sets out which words have similar meanings to other words.
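
As a rough sketch of how pre-defined knowledge can help, the Python below maps each word to a canonical term using a tiny, hand-made thesaurus before comparing two texts. The THESAURUS dictionary, the function names, and the example phrases are all hypothetical. A real system would draw on a much larger lexical resource and would have to cope with words that have several meanings.

```python
import re

# Hypothetical toy thesaurus: each word maps to one canonical term.
THESAURUS = {
    "officials": "council",
    "endorsed": "approved",
    "spending": "budget",
}


def canonical_words(text, thesaurus=THESAURUS):
    """Return the set of words in a text, with known synonyms replaced by their canonical term."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return {thesaurus.get(token, token) for token in tokens}


def thesaurus_overlap(a, b, thesaurus=THESAURUS):
    """Jaccard similarity computed on canonical terms rather than raw words."""
    words_a = canonical_words(a, thesaurus)
    words_b = canonical_words(b, thesaurus)
    union = words_a | words_b
    if not union:
        return 0.0
    return len(words_a & words_b) / len(union)


if __name__ == "__main__":
    first = "council approved budget"
    second = "officials endorsed spending"
    print(thesaurus_overlap(first, second, thesaurus={}))  # no thesaurus: no shared words
    print(thesaurus_overlap(first, second))                # with thesaurus: treated as the same
```

With an empty thesaurus the two phrases share nothing. With the toy thesaurus they are treated as identical, which shows both the appeal of this approach and how much it depends on the quality of the word list.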

We are currently researching which combination of methods and techniques will work best for organizing news stories so that you never have to worry about wasting time reading something you already know.

Fran Alexander