Manipulating Chess-GPT’s World Model

Chess-GPT, a language model trained to predict the next move in chess games written as PGN strings, plays at roughly 1500 Elo and also estimates the board state and the players’ skill level. In initial tests, its performance dropped when games began with random opening moves, which could suggest a lack of deeper game understanding. However, adjusting the model’s internal activations makes it possible to manipulate both its apparent skill level and its representation of the board.

The model’s win rate against Stockfish dropped significantly in games that began with random opening moves, but an intervention on its internal skill representation raised that win rate substantially. This indicates that, in these scenarios, the model had been predicting moves as if it were a low-skilled player. Similarly, intervening on the model’s internal board state representation allowed it to output moves that were legal under the modified board state, although with limited success.
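As a rough illustration of what an activation intervention can look like, the sketch below adds a scaled direction to a single activation vector. It is not the author's code: the 4-dimensional `activation`, the `skill_direction` vector, and the helper names `unit`, `projection`, and `steer` are all hypothetical, and it assumes the skill signal can be approximated by a single linear direction in activation space.

```python
import math

def unit(v):
    """Return v scaled to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in v]

def projection(activation, direction):
    """Component of the activation along the (normalized) direction."""
    return sum(a * d for a, d in zip(activation, unit(direction)))

def steer(activation, direction, scale):
    """Return activation + scale * unit(direction).

    A positive scale pushes the activation toward the direction,
    a negative scale pushes it away.
    """
    if len(activation) != len(direction):
        raise ValueError("activation and direction must have equal length")
    return [a + scale * d for a, d in zip(activation, unit(direction))]

# Toy values, invented for illustration only.
activation = [0.5, -1.0, 2.0, 0.0]       # hypothetical hidden activation
skill_direction = [0.0, 3.0, 0.0, 4.0]   # hypothetical "high skill" direction

before = projection(activation, skill_direction)
steered = steer(activation, skill_direction, scale=2.0)
after = projection(steered, skill_direction)

# The component along the direction grows by exactly `scale`;
# everything orthogonal to the direction is left unchanged.
print(f"before={before:.2f} after={after:.2f}")
```

Under the same assumption, a board state edit could reuse `steer` with a negative scale along a direction associated with "piece on this square", pushing the activation away from that reading. In a real model the modified vector would have to be written back into the forward pass at the chosen layer, which this sketch leaves out.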

These findings suggest that language models like Chess-GPT can develop a sophisticated world model through self-supervised learning, going beyond simple pattern recognition. However, the only partial success of the interventions also highlights how limited our current understanding of machine learning models is. The work points to the need for more advanced interpretability methods in AI, akin to the role microscopes played in early biology. The research, code, and datasets are openly available for further exploration and collaboration.
Read more at Adam Karvonen…