Clustering

Clustering is an important step in our process, designed to reduce the noise and make it easier to analyze the results by grouping similar articles and showing only one high-quality article from each group.

How does it work?

At the end of our pipeline, we retrieve articles discussing controversies for each company. The same controversy event could be covered in different articles by different sources, or in one article posted by multiple sources, resulting in duplicates. Our clustering models help us link multiple articles across languages corresponding to the same event, to avoid generating too many different outputs.

The idea is to:

  • Measure daily the similarity between articles mentioning the given company
  • Group articles with a similarity indicator greater than a given threshold in one cluster
  • Display only one article from each cluster. The article chosen is the one with the best source reliability score and quality score, ensuring it is the most relevant.
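The selection step above can be sketched in a few lines. This is a minimal illustration with hypothetical field names (`reliability`, `quality`) standing in for the actual source reliability and quality scores; the real pipeline may combine the two scores differently.

```python
def pick_representative(cluster):
    """Return the article in a cluster with the best combined
    source-reliability and quality score (hypothetical scoring)."""
    return max(cluster, key=lambda a: a["reliability"] + a["quality"])

# Two clusters of articles covering the same events (toy data)
clusters = [
    [
        {"title": "Acme fined over spill", "reliability": 0.9, "quality": 0.8},
        {"title": "Acme fined for spill", "reliability": 0.6, "quality": 0.7},
    ],
    [
        {"title": "Beta Corp lawsuit filed", "reliability": 0.7, "quality": 0.9},
    ],
]

# One displayed article per cluster
displayed = [pick_representative(c) for c in clusters]
```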

To do this, we apply two clustering stages.

First Clustering: Jaccard Similarity

In the first clustering stage, we look for articles that are near-identical, differing only slightly in their titles. This case is handled by computing the Jaccard similarity between the titles.

Jaccard Similarity

Jaccard similarity is a measure of similarity between two sets of data, defined as the size of the intersection of the sets divided by the size of their union. In other words, it quantifies the amount of overlap between the two sets. The Jaccard similarity coefficient ranges between 0 and 1, where 0 indicates no overlap between the sets, and 1 indicates complete overlap (i.e., the sets are identical).
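Applied to titles, the two "sets" are typically the sets of words in each title. A minimal sketch of the computation (tokenizing on whitespace for simplicity; the production tokenizer may differ):

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the word sets of two titles:
    |intersection| / |union|, ranging from 0 (no overlap) to 1 (identical sets)."""
    set_a = set(a.lower().split())
    set_b = set(b.lower().split())
    if not set_a and not set_b:
        return 1.0  # two empty titles are trivially identical
    return len(set_a & set_b) / len(set_a | set_b)
```

For example, "Acme fined over toxic spill" and "Acme fined over the toxic spill" share 5 words out of 6 distinct words in total, giving a similarity of 5/6 ≈ 0.83, which would typically exceed a duplicate threshold.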

Second Clustering: Semantic Clustering

After the first layer of clustering, which groups articles whose titles share most of their words, comes a second layer: semantic clustering. Semantic clustering relies on machine learning and can detect duplicates even when the articles use different words or are written in different languages.

Semantic clustering relies on two models: an embedding model and a clustering model. The embedding model computes, for each article, a vector intended to capture the article's meaning: two articles covering the same topic will be close in the embedding space. A clustering model then groups nearby articles into clusters.
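The two-model idea can be sketched as follows. The toy vectors below stand in for the output of a real multilingual embedding model, and the greedy single-link grouping is only an illustrative stand-in for the actual clustering model; the similarity threshold is likewise an assumed parameter.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return dot / (norm_u * norm_v)

def cluster_embeddings(vectors, threshold=0.95):
    """Greedy single-link grouping: attach each vector to the first
    cluster whose first member is similar enough, else start a new one."""
    clusters = []  # each cluster is a list of article indices
    for i, v in enumerate(vectors):
        for members in clusters:
            if cosine(vectors[members[0]], v) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters

# Toy embeddings standing in for a multilingual model's output:
# the first two articles cover the same event in different languages.
vecs = [
    [0.90, 0.10, 0.00],
    [0.88, 0.12, 0.05],
    [0.00, 0.20, 0.95],
]
groups = cluster_embeddings(vecs)
```

Because embeddings from a multilingual model place translations of the same event close together, this grouping works across languages without any translation step.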