AI Is a Black Box - Anthropic Figured Out a Way to Look Inside
By Steven Levy. WIRED. May 21, 2024 11:00 AM
What goes on in artificial neural networks work is largely a mystery, even to their creators. But researchers from Anthropic have caught a glimpse.
For the past decade, AI researcher Chris Olah has been obsessed with artificial neural networks. One question in particular engaged him, and has been the center of his work, first at Google Brain, then OpenAI, and today at AI startup Anthropic, where he is a cofounder. “What's going on inside of them?” he says. “We have these systems, we don't know what's going on. It seems crazy.”
That question has become a core concern now that generative AI has become ubiquitous. Large language models like ChatGPT, Gemini, and Anthropic’s own Claude have dazzled people with their language prowess and infuriated people with their tendency to make things up. Their potential to solve previously intractable problems enchants techno-optimists. But LLMs are strangers in our midst. Even the people who build them don’t know exactly how they work, and massive effort is required to create guardrails to prevent them from churning out bias, misinformation, and even blueprints for deadly chemical weapons. If the people building the models knew what happened inside these “black boxes,'' it would be easier to make them safer.
Olah believes that we’re on the path to this. He leads an Anthropic team that has peeked inside that black box. Essentially, they are trying to reverse engineer large language models to understand why they come up with specific outputs—and, according to a paper released today, they have made significant progress.
Maybe you’ve seen neuroscience studies that interpret MRI scans to identify whether a human brain is entertaining thoughts of a plane, a teddy bear, or a clock tower. Similarly, Anthropic has plunged into the digital tangle of the neural net of its LLM, Claude, and pinpointed which combinations of its crude artificial neurons evoke specific concepts, or “features.” The company’s researchers have identified the combination of artificial neurons that signify features as disparate as burritos, semicolons in programming code, and—very much to the larger goal of the research—deadly biological weapons. Work like this has potentially huge implications for AI safety: If you can figure out where danger lurks inside an LLM, you are presumably better equipped to stop it.
I met with Olah and three of his colleagues, among 18 Anthropic researchers on the “mechanistic interpretability” team. They explain that their approach treats artificial neurons like letters of Western alphabets, which don’t usually have meaning on their own but can be strung together sequentially to have meaning. “C doesn’t usually mean something,” says Olah. “But car does.” Interpreting neural nets by that principle involves a technique called dictionary learning, which allows you to associate a combination of neurons that, when fired in unison, evoke a specific concept, referred to as a feature.
“It’s sort of a bewildering thing,” says Josh Batson, an Anthropic research scientist. “We’ve got on the order of 17 million different concepts [in an LLM], and they don't come out labeled for our understanding. So we just go look, when did that pattern show up?”
Last year, the team began experimenting with a tiny model that uses only a single layer of neurons. (Sophisticated LLMs have dozens of layers.) The hope was that in the simplest possible setting they could discover patterns that designate features. They ran countless experiments with no success. “We tried a whole bunch of stuff, and nothing was working. It looked like a bunch of random garbage,” says Tom Henighan, a member of Anthropic’s technical staff. Then a run dubbed “Johnny”—each experiment was assigned a random name—began associating neural patterns with concepts that appeared in its outputs.