At truba we are building a system that will let you take control of the “sameyness” of the articles you receive. To do this, we are researching clustering and semantic similarity techniques.
Clustering has many meanings, so it is easy to get confused. In computing, it can mean grouping computers together on a network, letting several servers share the same database, or even the way storage is managed within a single machine. In statistics, clustering means finding data points with similar values and grouping them together. This statistical sense of clustering is the basis of many techniques in data mining, machine learning, and natural language processing.
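To make the statistical sense concrete, here is a minimal sketch of the classic k-means loop on one-dimensional data: assign each point to its nearest centre, then move each centre to the mean of its group, and repeat. The data and starting centres are invented for illustration; real systems work in many dimensions and use library implementations.

```python
# Toy k-means on 1-D data: the simplest example of statistical clustering.
def kmeans_1d(points, centres, iterations=10):
    for _ in range(iterations):
        # Assign each point to its nearest centre.
        groups = [[] for _ in centres]
        for p in points:
            nearest = min(range(len(centres)), key=lambda i: abs(p - centres[i]))
            groups[nearest].append(p)
        # Move each centre to the mean of the points assigned to it.
        centres = [sum(g) / len(g) if g else c for g, c in zip(groups, centres)]
    return centres, groups

points = [1.0, 1.2, 0.8, 9.7, 10.1, 10.4]
centres, groups = kmeans_1d(points, centres=[0.0, 5.0])
print(centres)  # one centre settles near 1.0, the other near 10.0
```

The same assign-then-update loop underlies clustering of news articles, except each “point” is a numerical representation of an article rather than a single number.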
We are interested in using clustering as a tool for natural language processing, because we want to design better, more precise ways of identifying which news articles are similar to each other. There are many existing clustering methods and techniques, some more effective than others. For example, some simple statistical techniques rely on counting the number of words two articles have in common. This works well for detecting plagiarism, where one article has copied chunks of another. It is much weaker at identifying two articles that are about a similar subject but written using different words, for example where a writer has written an “explainer” article that simplifies a complex report or scientific publication.
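A word-counting comparison of this kind can be sketched in a few lines as Jaccard similarity: the fraction of distinct words two texts share. The example sentences are invented; a real system would also strip punctuation and remove very common words.

```python
# Word-overlap (Jaccard) similarity: shared distinct words / total distinct words.
def word_overlap(text_a, text_b):
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    if not words_a and not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)

# A near-copy scores high...
copied = word_overlap(
    "the report warns of rising sea levels",
    "the report warns of rising sea levels worldwide",
)
# ...but a rephrasing of the same story shares no words and scores zero.
rephrased = word_overlap(
    "the report warns of rising sea levels",
    "oceans are getting higher a new study says",
)
print(copied, rephrased)  # 0.875 0.0
```

The second pair is exactly the “explainer” problem: the articles are about the same thing, but word overlap alone cannot see it.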
Some clustering techniques focus on words, some on sentences, and some on long texts. Some rely on pre-defined knowledge – for example a thesaurus that sets out which words have similar meanings to other words.
We are currently researching which combination of methods and techniques will work best for organizing news stories so that you never have to worry about wasting time reading something you already know.