A massive dataset frequently used to train artificial intelligence models has been found to contain nearly 12,000 valid secrets, including API keys and passwords. The discovery raises concerns about data security, especially as AI models pull from vast digital archives that are difficult to clean thoroughly.
Common Crawl’s Massive Digital Archive Holds Hidden Risks
Common Crawl, a nonprofit organization, has been collecting web data and maintaining a free, open repository of it since 2008. The petabyte-scale dataset serves as a resource for various AI projects, including language models from OpenAI, Google, Meta, Anthropic, and Stability AI.
Given its scale, many AI models rely on Common Crawl at least in part for training. However, researchers at Truffle Security, a company specializing in detecting exposed credentials, found that the archive contained thousands of valid secrets. These secrets—hardcoded into public-facing code—could pose security threats if exploited by malicious actors.
AWS Root Keys and MailChimp API Keys Found in Publicly Available Data
Truffle Security’s team used their open-source scanner, TruffleHog, to analyze 400 terabytes of data from 2.67 billion web pages in Common Crawl’s December 2024 archive. Their findings were alarming:
- 11,908 valid secrets were identified, each of which authenticated successfully against its service (see the verification sketch after this list).
- 219 distinct types of secrets were exposed, including API keys for Amazon Web Services (AWS), MailChimp, and WalkScore.
- 1,500 unique MailChimp API keys were hardcoded in front-end HTML and JavaScript, making them easily accessible.
- A single WalkScore API key appeared 57,029 times across nearly 1,900 subdomains, an extreme level of reuse.
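The word "valid" is doing real work here: TruffleHog does not just pattern-match, it attempts to authenticate each candidate against the relevant service. As a rough illustration only (this is not TruffleHog's actual code), a check for a MailChimp-style key might look like the following, assuming Node 18+ for the built-in fetch:

```javascript
// Rough sketch: verify a MailChimp-style API key by calling the API's
// health-check endpoint. Not TruffleHog's implementation, just an
// illustration of the idea that a "valid" secret is one a live API accepts.
async function verifyMailchimpKey(apiKey) {
  // MailChimp keys embed their datacenter after the last dash, e.g. "...-us12".
  const dc = apiKey.split("-").pop();
  const response = await fetch(`https://${dc}.api.mailchimp.com/3.0/ping`, {
    headers: {
      // MailChimp accepts HTTP Basic auth with any username and the key as the password.
      Authorization:
        "Basic " + Buffer.from(`anystring:${apiKey}`).toString("base64"),
    },
  });
  return response.ok; // a 200 response means the key authenticated successfully
}
```

A successful response means the credential is live and usable by whoever finds it, which is what separates these 11,908 verified secrets from ordinary pattern matches.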
Why Are These Secrets Leaking?
The issue largely stems from poor security practices by developers. Many of the discovered secrets were hardcoded into publicly accessible HTML and JavaScript, rather than stored securely in server-side environment variables.
For instance, MailChimp API keys, commonly used for email marketing, were embedded in front-end code, making them visible to anyone who inspected a webpage’s source. Similarly, AWS root keys, highly sensitive credentials that grant full access to an account’s cloud infrastructure, were found in HTML files.
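To make the anti-pattern concrete, here is a hypothetical example of the kind of front-end snippet the researchers describe; the key below is a placeholder, not a real credential:

```javascript
// Anti-pattern: an API key hardcoded into front-end JavaScript is shipped to
// every visitor and readable via "View Source" or the browser's dev tools.
// The key here is a placeholder, not a real credential.
const MAILCHIMP_API_KEY = "0123456789abcdef0123456789abcdef-us12";

fetch("https://us12.api.mailchimp.com/3.0/lists", {
  headers: {
    // Basic auth assembled in the browser: the secret is fully visible client-side.
    Authorization: "Basic " + btoa(`anystring:${MAILCHIMP_API_KEY}`),
  },
});
```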
Truffle Security also uncovered live Slack webhooks, including a single webpage containing 17 unique webhook URLs that could allow unauthorized users to post messages into Slack channels. The messaging platform explicitly warns against exposing these URLs, stating, “Your webhook URL contains a secret. Don’t share it online, including via public version control repositories.”
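To see why the URL itself is the secret, consider a minimal sketch of how an incoming webhook is used (the URL below is a placeholder in Slack’s documented webhook format): a single unauthenticated POST is all it takes to publish a message.

```javascript
// Anyone holding an incoming-webhook URL can post into its Slack channel.
// The URL below is a placeholder, not one of the leaked webhooks.
const webhookUrl =
  "https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX";

await fetch(webhookUrl, {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ text: "Posted by whoever found this URL." }),
});
```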
The Challenge of Scrubbing Sensitive Data from AI Training Sets
AI companies do not use raw training data without modification. Large datasets like Common Crawl go through extensive pre-processing, which includes filtering out irrelevant, duplicate, harmful, or sensitive content. Despite these efforts, Truffle Security’s findings suggest that some confidential data still makes it through.
The reason is simple: eliminating all sensitive information from a dataset of this scale is nearly impossible. Personally identifiable information (PII), financial data, medical records, and API keys can slip through, influencing the outputs of AI models in unintended ways.
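Pre-processing pipelines commonly rely on pattern-based detection to catch credentials, and the toy sketch below shows the general idea; the two patterns are illustrative and would miss most real secret formats, which is part of why leakage persists at this scale.

```javascript
// Toy sketch of pattern-based redaction during dataset pre-processing.
// Real pipelines combine many detectors (and often live verification);
// these two illustrative patterns alone would miss most secret formats.
const SECRET_PATTERNS = [
  /AKIA[0-9A-Z]{16}/g,          // AWS access key IDs begin with "AKIA"
  /[0-9a-f]{32}-us[0-9]{1,2}/g, // MailChimp-style key with datacenter suffix
];

function redactSecrets(text) {
  let cleaned = text;
  for (const pattern of SECRET_PATTERNS) {
    cleaned = cleaned.replace(pattern, "[REDACTED_SECRET]");
  }
  return cleaned;
}

// Example: a crawled page containing a hardcoded key is scrubbed before training.
console.log(redactSecrets('const key = "AKIAABCDEFGHIJKLMNOP";'));
// -> const key = "[REDACTED_SECRET]";
```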
What Happens Next?
After identifying these exposed credentials, Truffle Security reached out to the affected vendors. Their report states that they successfully helped organizations revoke or rotate several thousand keys, reducing the immediate risk. However, the findings highlight broader concerns about security practices in software development and AI model training.
Some key takeaways from this incident:
- Developers should avoid hardcoding API keys in front-end code and instead use secure server-side storage such as environment variables (see the sketch after this list).
- AI researchers and data engineers must refine filtering techniques to prevent sensitive data from making it into training sets.
- Companies should regularly audit their web-facing code for accidentally exposed secrets before they end up in public datasets.
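As a counterpart to the hardcoded-key snippet above, here is a minimal sketch of the first takeaway, assuming Node 18+ and Express; the route, environment variable name, and datacenter are illustrative:

```javascript
// Safer pattern: the key lives only in a server-side environment variable,
// and the browser talks to this server, never to the third-party API directly.
import express from "express";

const app = express();
const apiKey = process.env.MAILCHIMP_API_KEY; // read at runtime, never shipped to the client

app.get("/api/lists", async (req, res) => {
  const upstream = await fetch("https://us12.api.mailchimp.com/3.0/lists", {
    headers: {
      Authorization:
        "Basic " + Buffer.from(`anystring:${apiKey}`).toString("base64"),
    },
  });
  res.status(upstream.status).json(await upstream.json());
});

app.listen(3000);
```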
Even if AI models are trained on older archives than the one Truffle Security analyzed, the risk remains. Poor security hygiene at the source can lead to unintended consequences in AI behavior, making data security an ongoing challenge for the industry.