Researchers Evaluate Abstraction Abilities of Text and Multimodal Versions of GPT-4

Recent advances in large language models (LLMs) like GPT-3 and GPT-4 have led to claims that these models can perform human-like reasoning and abstraction. However, new research indicates that significant gaps remain between LLM and human reasoning abilities.

In a new study, researchers from the Santa Fe Institute evaluated text and multimodal versions of GPT-4 on a benchmark called ConceptARC. This benchmark tests understanding and reasoning related to basic concepts like above/below, inside/outside, and same/different.

The researchers first tested a text-only version of GPT-4 using more detailed prompting than in previous work, including a solved example task. This improved performance over simpler prompting, but GPT-4’s accuracy remained around 33%, compared with 91% for humans, on the 480 ConceptARC tasks.
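To make the text-only setup concrete, ConceptARC tasks are small colored grids, and a text-only model sees them serialized as characters rather than as images. The following sketch is purely illustrative and assumes a simple serialization; the function names, prompt wording, and toy task here are hypothetical, not the authors’ exact prompt format.

```python
# Illustrative sketch (assumed format, not the paper's actual prompt):
# ARC-style grids are lists of rows of small integers, where each integer
# encodes a color (0 = background). A text-only model receives the grids
# serialized as whitespace-separated rows.

def grid_to_text(grid):
    """Serialize a grid (list of rows of ints) as newline-separated rows."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)

def make_prompt(train_pairs, test_input):
    """Build a one-shot-style prompt from demonstration pairs plus a query."""
    parts = ["Each task shows input/output grid pairs. Infer the rule."]
    for i, (inp, out) in enumerate(train_pairs, 1):
        parts.append(f"Example {i} input:\n{grid_to_text(inp)}")
        parts.append(f"Example {i} output:\n{grid_to_text(out)}")
    parts.append(f"Test input:\n{grid_to_text(test_input)}")
    parts.append("Test output:")
    return "\n\n".join(parts)

# A toy "above/below"-flavored task: the colored cell (2) moves from
# above the bar of 1s to below it. This task is invented for illustration.
train = [([[2, 0], [1, 1]], [[1, 1], [2, 0]])]
test_input = [[0, 2], [1, 1]]
print(make_prompt(train, test_input))
```

Serializations like this are why a text-only model can attempt visual tasks at all; the model must recover spatial relations such as above/below from the row ordering alone.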

Figure: Accuracies of humans and GPT-4 (temperature 0 and 0.5) on each concept group (30 tasks each) and over all concepts (480 tasks) in ConceptARC.

To better compare GPT-4 with human performance on visual tasks, the researchers also tested GPT-4V, a multimodal version, on simplified “minimal” ConceptARC tasks presented as images. Surprisingly, GPT-4V performed significantly worse than the text-only GPT-4, achieving only 23-25% accuracy compared with 65-69% for the text-only model on the same minimal tasks.

Figure: Accuracies of humans, GPT-4 (temperature 0 and 0.5), and GPT-4V (zero-shot and one-shot prompting) on minimal tasks over all concepts (48 tasks) in ConceptARC.

The results reinforce that, despite recent advances, leading LLMs still lack the robust abstraction abilities and flexible reasoning that humans demonstrate even for basic concepts. The authors conclude that better prompting strategies could improve LLM performance, but fundamental gaps likely remain between human and artificial intelligence.

The ConceptARC benchmark provides a useful methodology for continued assessment of reasoning and abstraction capabilities as LLMs evolve. More work is needed to understand differences between human and artificial reasoning mechanisms and work towards more human-like learning and generalization.
