Data Lake Sources Overview

SESAMm has the largest data lake available in the industry, featuring over 30 billion documents from over 4 million sources in more than 100 languages. Every day, SESAMm processes 10 million new documents and adds them to the data lake, keeping it up to date with the latest information.

The documents come from across the web, including premium news content, NGO websites, retail forums, company websites, and more. Additionally, we have deep access to local information in Europe, Asia, and emerging countries, providing important insight into controversies that might arise there, including those involving local subsidiaries.

Sources Overview

Our data lake features approximately 30,000 premium sources and over 4 million public sources (news, blogs / NGOs / reports, and social / discussion forums).

The graph below shows the breakdown of articles by source type.

Percentage of Articles in the Data Lake by Source Type

The articles in the data lake are sourced globally. The graph below highlights the geographical distribution.

Percentage of Articles in the Data Lake by Country

The geographical information is determined by the website’s domain and traffic source information. If it cannot be determined, it’s listed as “Unknown”.
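As a rough illustration of the rule above, the following minimal Python sketch assigns a country from a website's domain, falls back to a traffic-based hint, and otherwise returns "Unknown". It is not SESAMm's implementation: the function name `country_for`, the `traffic_country` parameter, the small `CCTLD_TO_COUNTRY` mapping, and the order in which the two signals are checked are all assumptions made for the example.

```python
from urllib.parse import urlparse

# Hypothetical, deliberately tiny mapping from country-code TLD to country name.
CCTLD_TO_COUNTRY = {
    "fr": "France",
    "de": "Germany",
    "jp": "Japan",
    "in": "India",
    "uk": "United Kingdom",
}


def country_for(url, traffic_country=None):
    """Return a country label for a source URL, or "Unknown".

    Assumed precedence: the domain's country-code TLD first, then an
    optional traffic-based hint supplied by the caller.
    """
    host = urlparse(url).hostname or ""
    # Take the last dot-separated label of the hostname as the TLD.
    tld = host.rsplit(".", 1)[-1].lower() if "." in host else ""
    if tld in CCTLD_TO_COUNTRY:
        return CCTLD_TO_COUNTRY[tld]
    if traffic_country:
        return traffic_country
    return "Unknown"
```

For instance, `country_for("https://news.example.fr/story")` resolves through the domain, while `country_for("https://example.com")` has neither signal and is labeled "Unknown".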

The data lake also covers over 100 languages. The graph below shows the distribution.

Percentage of Articles in the Data Lake by Language

Example Sources

To give you an idea of which websites our data is coming from, we’ve put together a short sample below. If you’re looking for a specific source that isn’t listed, contact your Account Manager.

Type	Source
Premium News	Bloomberg
Premium News	CNN
Premium News	Forbes
Premium News	The Boston Globe
Premium News	Telegraph
Premium News	The Economist
Premium News	The Atlantic
Premium News	The Economic Times
Premium News	Caixin Global
Premium News	Nikkei Asian Review
Premium News	The New Yorker
Premium News	The New York Times Blogs
News	CBS News
News	CNBC
News	USA Today
News	Financial Times
News	WickedLocal
News	The Guardian
News	Huffington Post
News	Reuters
News	The Verge
News	Politico
Blogs	Yahoo
Blogs	BuzzFeed
Blogs	Medium
Blogs	WordPress
Blogs	YouTube
Blogs	Pinterest
Blogs	Mumsnet
Blogs	Google
Blogs	JustAnswer
Blogs	SomethingAwful

What’s Covered by Social / Discussion Forums

Data from social media and discussion forums represents more than 70% of the data lake. Our coverage of social sites includes public and specialized forums, such as Apple Communities, Stack Overflow, NGO communities, corporate forums, and more. These forums are important sources because controversies and whistleblower alerts are often shared there first.

Some mainstream social media platforms, such as Facebook and Instagram, cannot be accessed for legal and compliance reasons. These sites are generally excluded from the data lake, except for some public pages with limited amounts of information.

AI-Generated Information

AI-generated information is proliferating across the web. To distinguish news, documents, and information generated by AI from content written by humans, SESAMm assigns a quality score to every article. The score assesses the amount of actual information, readability, source authority, and more. The assessment includes scraping the Wikipedia pages of sources to identify the number of journalists, agency ties, and the number of references.