The Low-Background Steel Problem of AI

In 1945, a physicist at Kodak noticed something strange. Batches of the company’s sensitive X-ray film were arriving fogged, as if they had been exposed to radiation. The mystery unraveled to reveal an unsettling cause: fallout from nuclear weapons testing. The Trinity test in July 1945 had sent radioactive particles into the atmosphere, contaminating even the packaging materials used for sensitive film. The incident did more than expose America’s nuclear ambitions; it left an enduring legacy. Steel produced after the first atomic detonations bore trace radiation, because the air blown through furnaces during production now carried fallout, rendering the metal unusable for the most sensitive scientific instruments. For decades, researchers scavenged the pre-war wrecks of scuttled battleships, seeking “low-background steel” uncontaminated by atomic history.

Now, artificial intelligence researchers are grappling with a digital equivalent of radioactive fallout.

On November 30, 2022, OpenAI released ChatGPT to the public. In the months that followed, a wave of generative AI models flooded the internet with synthetic content. Academics and technologists began to ask a quiet but growing question: Are we poisoning our data the same way we once poisoned our metals?

The concern is that AI models, trained on internet-scale data, increasingly ingest content that was itself generated by other AI models. This feedback loop may degrade future models, a phenomenon some call “model collapse” or “Model Autophagy Disorder” (MAD). The fear is not just degraded performance but irreversibility: training environments so thoroughly polluted that clean, original data becomes a precious and dwindling resource.
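The mechanism is easy to demonstrate in miniature. The sketch below is a toy illustration, not a reproduction of any published experiment: the “model” is just a Gaussian fitted to data, and each generation trains only on samples produced by the previous generation’s model. Estimation error compounds, so the learned distribution drifts and narrows:

```python
import numpy as np

rng = np.random.default_rng(0)

# Generation 0: "human" data drawn from the true distribution.
data = rng.normal(loc=0.0, scale=1.0, size=200)

for gen in range(1, 31):
    # "Train" a model on the current data. Here the model is simply
    # a Gaussian, fitted by estimating the sample mean and std.
    mu, sigma = data.mean(), data.std()
    # The next generation sees only this model's synthetic output.
    data = rng.normal(loc=mu, scale=sigma, size=200)
    if gen % 5 == 0:
        print(f"generation {gen:2d}: mu={mu:+.3f}, sigma={sigma:.3f}")
```

A language model is a far richer object than a Gaussian, but the structural problem is the same: each generation inherits its predecessor’s errors, and the tails of the original distribution, the rare and surprising things humans actually write, are the first to disappear.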

John Graham-Cumming, Cloudflare’s former CTO, picked up on this analogy early. In 2023, he registered the domain lowbackgroundsteel.ai, referring to repositories of data free from generative contamination, such as GitHub’s Arctic Code Vault snapshot from early 2020. He and others have floated the idea of building archives of “known human-created stuff,” safeguarding it from the AI echo chamber.

Researchers like Ilia Shumailov, who led early work on model collapse, and Maurice Chiodo at Cambridge’s Centre for the Study of Existential Risk worry about the long-term consequences. Their concern isn’t just technical; it’s economic and political. If access to pre-2022 human-generated data becomes a competitive edge, it may entrench today’s tech giants and lock out future innovators. As Chiodo explained in a recent interview, “Everyone participating in generative AI is polluting the data supply for everyone.”

One proposed solution is federated learning: letting others train models on sensitive or clean datasets without ever taking possession of the data. It’s an elegant compromise, but a difficult one to regulate. Governments would need to create and maintain repositories of uncontaminated data for model builders to train against, raising serious concerns about privacy, centralization, and political influence.
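To make the idea concrete, here is a minimal sketch in the spirit of federated averaging (FedAvg). Everything in it, the three data custodians, the toy linear-regression task, the learning rates, is hypothetical; real deployments add secure aggregation, differential privacy, and far more careful optimization:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: three custodians each hold a private, clean
# dataset drawn from the same underlying linear relationship.
true_w = np.array([2.0, -1.0])
clients = []
for _ in range(3):
    X = rng.normal(size=(200, 2))
    y = X @ true_w + rng.normal(scale=0.1, size=200)
    clients.append((X, y))

w = np.zeros(2)              # global model, shared with all parties
for _round in range(50):
    local_ws = []
    for X, y in clients:
        w_local = w.copy()
        for _ in range(5):   # a few local gradient-descent steps
            grad = 2 * X.T @ (X @ w_local - y) / len(y)
            w_local -= 0.05 * grad
        local_ws.append(w_local)
    # Only model parameters leave each silo; the raw data never does.
    w = np.mean(local_ws, axis=0)

print("recovered weights:", w)   # converges toward [2.0, -1.0]
```

Each custodian shares only weights, never the underlying text or images, which is exactly the property that makes the approach attractive for guarding scarce human-generated data.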

Labeling AI-generated content might help, but the technical challenges are formidable. Watermarks are fragile, metadata is trivially stripped from text, and images and videos cross jurisdictional boundaries. On a distributed internet, even a light regulatory touch becomes a coordination nightmare.
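A toy example shows just how fragile. The scheme below is invented purely for illustration: it hides a signature in zero-width Unicode characters, a trick some real text-steganography tools use, and a single regular expression removes it:

```python
import re

ZWSP, ZWNJ = "\u200b", "\u200c"   # zero-width Unicode characters

def watermark(text: str, bits: str) -> str:
    # Toy scheme: encode one invisible character per bit after the
    # first word. The text looks unchanged to a human reader.
    payload = "".join(ZWSP if b == "0" else ZWNJ for b in bits)
    first, sep, rest = text.partition(" ")
    return first + payload + sep + rest

def strip_watermark(text: str) -> str:
    # Defeating the scheme is a one-liner: delete zero-width chars.
    return re.sub(r"[\u200b\u200c\u200d\ufeff]", "", text)

original = "Generated by a model, allegedly."
marked = watermark(original, "1011")
print(marked == original)                   # False: signature present
print(strip_watermark(marked) == original)  # True: signature gone
```

Statistical watermarks baked into a model’s word choices are harder to remove than hidden characters or metadata, but paraphrasing attacks have been shown to degrade those too.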

And yet, the stakes are high. If too much of the internet becomes AI-generated noise, it may become prohibitively expensive—or even impossible—to reconstruct datasets pure enough to train reliable next-generation models.

Rupprecht Podszun, a legal scholar from Heinrich Heine University Düsseldorf, draws the line clearly: “Pre-2022 data reflects how humans actually communicate. That’s more useful for training than what a chatbot generated after 2022.” It’s not about factual truth, he says—it’s about creative style, the nuance that makes language worth learning.

In the nuclear age, the German fleet scuttled at Scapa Flow in 1919 became an accidental goldmine. A century later, coders and ethicists may find themselves digging through digital archives with a similar goal: to locate the last uncontaminated artifacts of human thought, our low-background steel of the AI era.
