Data Lake Sources Overview

SESAMm has the largest data lake available in the industry, featuring over 25 billion documents from over 4 million sources in over 100 languages. Every day, 10 million new documents are processed by SESAMm and added to the data lake, ensuring it stays up to date with the latest information.

The documents come from across the web, including premium news content, NGO websites, retail forums, company websites, and more. Additionally, we have deep access to local information in Europe, Asia, and emerging countries, providing important insight into controversies that might arise there, including those involving local subsidiaries.

Sources Overview

Our data lake features approximately 30,000 premium sources and over 4 million public sources (news, blogs / NGOs / reports, and social / discussion forums).

The graph below shows the breakdown of articles by source type.


Percentage of Articles in the Data Lake by Source Type

The articles in the data lake are sourced globally. The graph below highlights the geographical distribution.


Percentage of Articles in the Data Lake by Country

The geographical information is determined by the website’s domain and traffic source information. If it cannot be determined, it’s listed as “Unknown”.
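As an illustration only, the fallback described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not SESAMm's implementation: the function name infer_country, the tiny CCTLD_TO_COUNTRY table, the top_traffic_country parameter, and the choice to check the domain before the traffic signal are all hypothetical.

```python
from typing import Optional
from urllib.parse import urlparse

# Hypothetical, deliberately tiny table of country-code top-level domains.
# A real mapping would cover far more domains.
CCTLD_TO_COUNTRY = {"fr": "France", "de": "Germany", "jp": "Japan", "in": "India"}


def infer_country(url: str, top_traffic_country: Optional[str] = None) -> str:
    """Return a country label for a source, or "Unknown" if no signal is available."""
    host = urlparse(url).hostname or ""
    tld = host.rsplit(".", 1)[-1].lower() if "." in host else ""

    # Assumption: a country-code domain is checked first.
    if tld in CCTLD_TO_COUNTRY:
        return CCTLD_TO_COUNTRY[tld]

    # Assumption: otherwise use the country that sends the most traffic, if known.
    if top_traffic_country:
        return top_traffic_country

    # Neither signal is available, so the source is listed as "Unknown".
    return "Unknown"
```

With this sketch, a .fr address resolves to France from the domain alone, a .com address with a known traffic country takes that country, and a .com address with no traffic information is labeled "Unknown".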

The data lake also covers over 100 languages. The graph below shows the distribution.


Percentage of Articles in the Data Lake by Language

Example Sources

To give you an idea of which websites our data is coming from, we’ve put together a short sample below. If you’re looking for a specific source that isn’t listed, contact your Account Manager.

Type | Source
Premium News | Bloomberg
Premium News | CNN
Premium News | Forbes
Premium News | The Boston Globe
Premium News | Telegraph
Premium News | The Economist
Premium News | The Atlantic
Premium News | The Economic Times
Premium News | Caixin Global
Premium News | Nikkei Asian Review
Premium News | The New Yorker
Premium News | The New York Times Blogs
News | CBS News
News | CNBC
News | USA Today
News | Financial Times
News | WickedLocal
News | The Guardian
News | Huffington Post
News | Reuters
News | The Verge
News | Politico
Blogs | Yahoo
Blogs | BuzzFeed
Blogs | Medium
Blogs | WordPress
Blogs | YouTube
Blogs | Pinterest
Blogs | Mumsnet
Blogs | Google
Blogs | JustAnswer
Blogs | Something Awful

What’s Covered by Social / Discussion Forums

Data from social media and discussion forums represents more than 70% of the data lake. Our coverage of social sites includes public and specialized forums, such as Apple Communities, Stack Overflow, NGO communities, corporate forums, and more. These forums are important sources because controversies and whistleblower alerts are often shared there first.

Some mainstream social media platforms, such as Facebook and Instagram, cannot be accessed for legal and compliance reasons. These platforms are generally excluded from the data lake, except for some public pages with limited amounts of information.

AI-Generated Information

AI-generated information is flourishing across the web. To differentiate between news, documents, and information generated by AI versus humans, SESAMm assigns a quality score to every article. The score assesses the amount of actual information, readability, source authority, and more. The assessment includes scraping each source’s Wikipedia page to identify the number of journalists, agency ties, and the number of references.
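To make the idea of a composite quality score concrete, here is a minimal sketch of how several signals could be combined into one number. It is not SESAMm's scoring method: the ArticleSignals fields, the 0-to-1 scale, the WEIGHTS values, and the weighted-average formula are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ArticleSignals:
    """Hypothetical per-article signals, each assumed to be scaled to 0..1."""
    information_density: float  # amount of actual information
    readability: float
    source_authority: float


# Hypothetical weights; the real weighting is not described in this document.
WEIGHTS = {
    "information_density": 0.4,
    "readability": 0.2,
    "source_authority": 0.4,
}


def quality_score(signals: ArticleSignals) -> float:
    """Return the weighted average of the signals, in the range 0..1."""
    total_weight = sum(WEIGHTS.values())
    weighted_sum = sum(
        weight * getattr(signals, name) for name, weight in WEIGHTS.items()
    )
    return weighted_sum / total_weight
```

In this sketch, an article that scores highly on every signal lands near 1, while thin, hard-to-read content from a low-authority source lands near 0. A threshold on the result could then be used to filter low-quality articles.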