Manipulating Chess-GPT’s World Model

Chess-GPT, a language model trained to predict the next move in PGN strings, plays chess at roughly 1500 Elo and also forms internal estimates of the board state and of player skill. In initial tests, its performance dropped when games began with random opening moves, which at first suggested a lack of deeper game understanding. However, adjusting the model’s internal activations makes it possible to manipulate both its effective skill level and its internal picture of the board.

The model’s performance against Stockfish dropped significantly in games that began with random moves, but intervening on its internal skill representation improved its win rate substantially. This indicates that the model had been predicting moves as if it were a low-skilled player in those positions, rather than being unable to play them well. Similarly, intervening on the model’s internal board state representation led it to output moves that were legal under the modified board state, although with limited success.
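As a rough illustration of what a skill intervention could look like, the sketch below builds a "skill direction" as the difference between the mean activations of high-skill and low-skill games, then adds a scaled copy of it to a single activation vector. This is only one common way to steer a model and is an assumption here, not necessarily the exact procedure used for Chess-GPT. The function names, the `scale` parameter, and the toy three-dimensional vectors are all hypothetical; in a real model the vectors would be activations read from one layer.

```python
from statistics import fmean


def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    return [fmean(col) for col in zip(*rows)]


def skill_direction(high_skill_acts, low_skill_acts):
    """Difference of means: points from 'low skill' toward 'high skill'."""
    high = mean_vector(high_skill_acts)
    low = mean_vector(low_skill_acts)
    return [h - l for h, l in zip(high, low)]


def steer(activation, direction, scale=1.0):
    """Add the scaled direction to one activation vector."""
    return [a + scale * d for a, d in zip(activation, direction)]


# Toy activations with made-up numbers (hypothetical, not real model data).
high_skill_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
low_skill_acts = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]

direction = skill_direction(high_skill_acts, low_skill_acts)
steered = steer([0.5, 0.5, 0.5], direction, scale=0.5)
```

In this toy setup, a larger `scale` pushes the activation further toward the high-skill mean, and a negative `scale` pushes it toward the low-skill mean.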

These findings suggest that language models like Chess-GPT can develop a sophisticated world model through self-supervised learning, beyond simple pattern recognition. However, the partial success of interventions also highlights the current limitations in our understanding of machine learning models. The work points to the need for more advanced interpretability methods in AI, akin to the role of microscopes in early biology. The research, code, and datasets are openly available for further exploration and collaboration.
Read more at Adam Karvonen…
