From Articles to Events to Cases: Event & Case Detection and Creation

How It Works (In a Nutshell)

Clean up: We filter out noise and irrelevant articles.

Deduplicate: Very similar articles are grouped into daily clusters (see Clustering in Step 1), and one representative article is selected per cluster.

Summarize: An AI model reads each retained article and writes two short summaries of the controversy: one covering the full context and one covering the new developments the article reports.

Group: Clustering algorithms group articles into Events and Events into Cases.

Simplify: Clean titles, summaries, and metadata are generated for Events and Cases to ensure usability and explainability.

This ensures that users see only the most relevant documents and have a clear timeline of how each ESG topic unfolds.

Step-by-Step Process

1. Pre-Processing: Removing the Noise

Filtering: Removing irrelevant documents

Before any clustering occurs, SESAMm filters incoming documents to retain only ESG-relevant content. This includes:

Removing predefined irrelevant sources
Deduplicating articles with similar titles
Filtering content using SESAMm's ESG filtering model (a lightweight model tuned for recall)
Filtering content using SESAMm's controversy detection model (a fine-tuned language model)
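
The filtering stages above can be read as a chain of checks applied to each incoming article. The following is a minimal sketch only: the field names, the blocklist, the two scoring callables, and both thresholds are hypothetical stand-ins, not SESAMm's actual models or settings. Title deduplication is simplified here to an exact match after normalization, whereas the real step matches similar titles.

```python
import re
from typing import Callable, Iterable

def normalize_title(title: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]+", "", title.lower())
    return " ".join(cleaned.split())

def filter_articles(
    articles: Iterable[dict],
    blocked_sources: set[str],
    esg_score: Callable[[str], float],          # hypothetical ESG filter
    controversy_score: Callable[[str], float],  # hypothetical controversy model
    esg_threshold: float = 0.2,          # assumed value; low to favour recall
    controversy_threshold: float = 0.5,  # assumed value
) -> list[dict]:
    seen_titles: set[str] = set()
    kept = []
    for art in articles:
        # 1. Drop predefined irrelevant sources.
        if art["source"] in blocked_sources:
            continue
        # 2. Drop articles whose normalized title was already seen.
        key = normalize_title(art["title"])
        if key in seen_titles:
            continue
        seen_titles.add(key)
        # 3. Cheap ESG filter first, then the heavier controversy model.
        if esg_score(art["text"]) < esg_threshold:
            continue
        if controversy_score(art["text"]) < controversy_threshold:
            continue
        kept.append(art)
    return kept
```

Running the cheap checks first means the more expensive model only sees articles that survived the earlier stages.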

Clustering: Grouping similar documents

To reduce redundancy, articles are clustered daily before further processing. This includes:

Calculating article embeddings
Applying agglomerative clustering based on those embeddings to group similar and redundant articles together
Retaining only the most relevant article from each cluster for downstream processing
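
As an illustration of the daily clustering step, here is a small pure-Python sketch of agglomerative clustering over article embeddings. The linkage rule (average linkage), the cosine distance, the distance threshold, and the relevance scores used to pick a representative are all assumptions made for the example; the source does not specify them. Embeddings are assumed to be non-zero vectors.

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def agglomerative_clusters(embeddings, distance_threshold=0.1):
    """Average-linkage agglomerative clustering; returns lists of indices."""
    clusters = [[i] for i in range(len(embeddings))]

    def linkage(c1, c2):
        dists = [cosine_distance(embeddings[i], embeddings[j])
                 for i in c1 for j in c2]
        return sum(dists) / len(dists)

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        dist, a, b = min(
            (linkage(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        # Stop once the closest pair is too far apart to be redundant.
        if dist > distance_threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

def pick_representatives(clusters, relevance):
    """Keep the index with the highest (hypothetical) relevance per cluster."""
    return [max(cluster, key=lambda i: relevance[i]) for cluster in clusters]
```

A distance threshold, rather than a fixed number of clusters, suits this step because the number of distinct stories per day is not known in advance.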

2. Document Insights

For each retained article, SESAMm generates two AI-based summaries:

Main Story Summary

A comprehensive overview of the ESG event or controversy, consolidating key contextual information to give users a full picture of the ongoing situation.

Novelty Summary

A focused summary covering only the new insights or developments that the document adds to the broader ESG event, that is, what is new or updated compared to what is already known.

This dual-summary approach enables users to understand both the broader context and incremental developments of an ESG situation.

3. Creating Embeddings

The summaries are transformed into semantic embeddings:

Case embeddings are generated from the Main Story summary.
Event embeddings are generated from the combination of Main Story + Novelty summaries.
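
The mapping from summaries to embeddings can be sketched as follows. This is illustrative only: the `embed` callable stands in for whatever embedding model is used, and combining the two summaries by plain concatenation is an assumption, since the source does not say how the combination is done.

```python
from typing import Callable, Sequence

# Hypothetical type: any function mapping text to a vector.
Embedder = Callable[[str], Sequence[float]]

def build_embeddings(main_story: str, novelty: str, embed: Embedder) -> dict:
    # Case embedding: Main Story summary only.
    case_text = main_story
    # Event embedding: Main Story + Novelty (assumed: simple concatenation).
    event_text = f"{main_story}\n{novelty}"
    return {"case": embed(case_text), "event": embed(event_text)}
```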

These embeddings allow the system to detect semantic similarity across documents, in multiple languages.

4. Clustering into Cases and Events

Case Creation

A clustering algorithm is applied to Case embeddings to group documents referring to the same overarching ESG controversy for a given entity. No temporal constraint is applied at this stage; documents published weeks or months apart may belong to the same Case if semantic similarity remains strong.

Event Creation

Within each Case, a second clustering step is performed using Event embeddings. The purpose of this step is to separate distinct incidents and developments within the broader Case. Each Event corresponds to one specific incident or development.

To support this distinction, the clustering incorporates a temporal component in addition to topical similarity. More specifically, documents are clustered into Events using the Event embeddings, and the cosine similarity between document clusters is reduced as the number of days between them increases. As a result, documents published closer in time are more likely to be grouped into the same Event, while documents published farther apart are more likely to be split into separate Events.

However, there is no fixed temporal cutoff defining when a new Event must be created. Two documents published several weeks apart may still belong to the same Event if their novelty is very similar and the topic remains stable. Conversely, if their content differs meaningfully, even a relatively short time gap may contribute to splitting them into separate Events.

For example, a strike in 2022 and a similar strike in 2024 would be captured as separate Events, as they represent distinct developments rather than a continuation of the same incident.
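
One way to picture the temporal component is a similarity score that shrinks with the publication gap. The sketch below is an assumption-laden illustration: the source states only that cosine similarity is reduced as the gap in days grows, so the exponential form, the `half_life_days` parameter, and its value are hypothetical. It also compares two documents for simplicity, whereas the described system compares document clusters.

```python
import math
from datetime import date

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def time_decayed_similarity(emb_a, date_a, emb_b, date_b, half_life_days=30.0):
    """Cosine similarity scaled down as the publication gap grows."""
    gap_days = abs((date_a - date_b).days)
    # Assumed decay shape: the score halves every `half_life_days`.
    decay = 0.5 ** (gap_days / half_life_days)
    # Clamp at zero so the decay never raises a negative similarity.
    return max(0.0, cosine_similarity(emb_a, emb_b)) * decay
```

Because the decay is smooth, there is no hard cutoff: two near-identical documents a few weeks apart can still score above a merge threshold, while the 2022 and 2024 strikes end up with a score close to zero even if their text is similar.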

Structural constraints:

Each Event is linked to one and only one Case
Each Event and Case is linked to a single entity
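
These constraints can be expressed as a small data model with a validation pass. The class and field names below are hypothetical and chosen only for illustration. The check that an Event shares its parent Case's entity is inferred from the fact that Events are created within a Case and Cases are built per entity.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Case:
    case_id: str
    entity_id: str   # a Case belongs to a single entity

@dataclass(frozen=True)
class Event:
    event_id: str
    case_id: str     # a single field, so an Event has exactly one parent Case
    entity_id: str   # an Event belongs to a single entity

def validate(events, cases):
    """Return a list of human-readable constraint violations."""
    cases_by_id = {c.case_id: c for c in cases}
    problems = []
    for ev in events:
        parent = cases_by_id.get(ev.case_id)
        if parent is None:
            problems.append(f"{ev.event_id}: unknown case {ev.case_id}")
        elif parent.entity_id != ev.entity_id:
            problems.append(f"{ev.event_id}: entity differs from case {ev.case_id}")
    return problems
```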

5. Titles and Summaries

Event Titles and Summaries

Event titles and summaries are generated using LLMs, based on the top 10 articles ranked by SESAMm’s Article Reliability Score (ARS). The ARS ranks articles primarily based on the reliability of their sources, assessed through indicators such as the number of recognized authors, the source's presence in Wikipedia citations, and structural quality signals, such as writing style and grammar. Only articles validated as ESG-relevant are considered.

Titles and summaries prioritize information such as:

Number of victims
Legal actions and lawsuits
Financial penalties or quantified impacts
Environmental or community impacts
Actions taken or commitments made by the company
Major developments during the Event period

To ensure accuracy and avoid repetition, the model’s input depends on the Event’s chronological position within its Case:

First Event in a Case: the model uses both the Main Story and Novelty summaries from the source documents to establish the baseline context.
Subsequent Events: the model uses only the Novelty summaries, so that it focuses strictly on new developments and does not repeat earlier information.
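
The input selection described above can be sketched as a small helper. The document structure (dictionaries with `main_story` and `novelty` keys) and the way the pieces are joined are assumptions made for the example, not the actual prompt format.

```python
def build_event_input(docs, is_first_event):
    """Assemble the text passed to the summarization model for one Event."""
    parts = []
    for doc in docs:
        # Baseline context is included only for the first Event of a Case.
        if is_first_event:
            parts.append(doc["main_story"])
        # Novelty is always included.
        parts.append(doc["novelty"])
    return "\n\n".join(parts)
```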

Summaries are recomputed whenever the set of top 10 articles changes, so that each summary always reflects the current highest-ranked articles.
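
A minimal sketch of the selection and recompute trigger, under stated assumptions: each article is a dictionary with hypothetical `id`, `ars`, and `esg_relevant` fields, and a "change" is taken to mean a change in which articles are in the top 10, not a change in their order.

```python
def top_articles(articles, k=10):
    """IDs of the k ESG-relevant articles with the highest ARS."""
    relevant = [a for a in articles if a["esg_relevant"]]
    ranked = sorted(relevant, key=lambda a: a["ars"], reverse=True)
    return [a["id"] for a in ranked[:k]]

def needs_recompute(previous_top_ids, articles, k=10):
    """True when membership of the top-k set differs from the last run."""
    return set(top_articles(articles, k)) != set(previous_top_ids)
```

Comparing sets rather than ordered lists avoids regenerating a summary when two articles merely swap ranks.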

Case Titles and Summaries

Case summaries are generated by aggregating information from the underlying Events. To keep summaries readable regardless of how long the controversy runs, the model's input scales with the size of the Case:

Cases with 15 Events or fewer: the model uses the full text of all underlying Event summaries.
Cases with more than 15 Events: the model uses only the Event titles to summarize the broader trajectory.
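
The size rule above amounts to a single threshold check. In this sketch the threshold of 15 comes from the text, while the Event structure (dictionaries with `title` and `summary` keys) is a hypothetical choice for illustration.

```python
MAX_EVENTS_FULL_TEXT = 15  # threshold stated in the text

def build_case_input(events):
    """Choose which Event fields feed the Case summarization model."""
    if len(events) <= MAX_EVENTS_FULL_TEXT:
        # Small Case: full Event summaries fit comfortably.
        return [e["summary"] for e in events]
    # Large Case: fall back to titles to keep the input compact.
    return [e["title"] for e in events]
```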

Additional Metadata

Each Event and Case is then enriched with an Intensity score, an ESG risk category, and an ESG sub-risk classification. More information is provided below.