ZebraLogic: Testing LLMs with Logic Puzzles


ZebraLogic is a benchmark that uses logic puzzles to assess how well large language models (LLMs) can reason logically. It aims to evaluate how well AI systems solve complex problems that require logical thinking.

Humans can solve these puzzles by reasoning strategically over the clues, for example through the process of elimination. LLMs, however, still struggle with this kind of logical reasoning.

The best LLM, Claude 3.5 Sonnet, can solve only 33.4% of all puzzles and just 12.4% of the hard ones.

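The best open-weight LLM is 🐳 DeepSeek-v2-Chat (0628), which is much better than Llama-3-70B-Instruct (the leaderboard below lists DeepSeek-V2-Chat at 22.7% puzzle accuracy versus 16.8% for llama-3-70B-Instruct).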

Smaller models with 7 to 10 billion parameters have a tough time with hard puzzles, solving fewer than 1%, and show low accuracy on easy ones.

LLMs Leaderboard: Top 15

#	Model	Puzzle Acc	Easy Puzzle Acc	Hard Puzzle Acc	Cell Acc	No answer
1	🔒 Claude 3.5 Sonnet	33.4	87.5	12.4	54.3	0
2	🔒 gpt-4-turbo-2024-04-09	28.4	80.7	8.1	47.9	0.1
3	🔒 gpt-4o-2024-05-13	28.2	77.9	8.9	38.7	19.3
4	🔒 gpt-4-0314	27.1	77.1	7.6	47.4	0.2
5	🔒 Claude 3 Opus	27	78.2	7.1	48.9	0
6	🔑 DeepSeek-V2-Chat	22.7	68.6	4.9	42.5	5.2
7	🔑 Qwen2-72B-Instruct	21.4	63.9	4.9	38.3	10.2
8	🔑 DeepSeek-Coder-V2	21.1	64.6	4.2	41.6	4.9
9	🔒 gpt-4o-mini-2024-07-18	20.1	62.5	3.6	41.3	0.1
10	🔒 Gemini 1.5 Flash	19.4	59.3	3.9	31.8	22.7
11	🔒 Gemini 1.5 Pro	19.4	55.7	5.3	44.6	0.8
12	🔒 Yi-Large-Preview	18.9	58.9	3.3	42.6	1.4
13	🔒 Yi-Large	18.8	58.2	3.5	39.8	1.8
14	🔒 Claude 3 Sonnet	18.7	58.9	3.1	43.7	0
15	🔑 llama-3-70B-Instruct	16.8	52.9	2.8	42.3	0.2
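
The leaderboard reports puzzle-level accuracy, cell-level accuracy, and a "No answer" rate, but this post does not define them. The sketch below shows one plausible way such numbers could be computed. It assumes that a puzzle counts as solved only when every cell matches the reference grid, that cell accuracy is the share of matching cells across all puzzles, and that an unparseable response is represented as None. The function name `score` and the grid layout are hypothetical, not taken from the ZebraLogic code.

```python
def score(predictions, references):
    """Return (puzzle_acc, cell_acc, no_answer) as percentages.

    predictions: list of grids (lists of rows), or None where the model
                 produced no parseable answer (an assumption of this sketch).
    references:  list of ground-truth grids, one per puzzle.
    """
    solved = correct_cells = total_cells = unanswered = 0
    for pred, ref in zip(predictions, references):
        n_cells = sum(len(row) for row in ref)
        total_cells += n_cells
        if pred is None:
            # Missing answers contribute zero correct cells.
            unanswered += 1
            continue
        hits = sum(
            p == r
            for p_row, r_row in zip(pred, ref)
            for p, r in zip(p_row, r_row)
        )
        correct_cells += hits
        # Assumed rule: a puzzle is solved only if every cell is right.
        solved += hits == n_cells
    n = len(references)
    return (
        100 * solved / n,
        100 * correct_cells / total_cells,
        100 * unanswered / n,
    )


# Toy data: one fully correct grid, one partially correct grid, one missing.
reference = [["Eric", "ford f150", "horse"], ["Arnold", "tesla model 3", "cat"]]
partial = [["Eric", "ford f150", "cat"], ["Arnold", "tesla model 3", "horse"]]
puzzle_acc, cell_acc, no_answer = score(
    [reference, partial, None], [reference] * 3
)
print(f"Puzzle Acc {puzzle_acc:.1f}, Cell Acc {cell_acc:.1f}, No answer {no_answer:.1f}")
```

Under these assumptions, a model can get many cells right while solving few puzzles outright, which would be consistent with the gap between the Cell Acc and Puzzle Acc columns above.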

An Example of ZebraLogic Data

Here is an example of a 2x3 puzzle (2 houses x 3 features):

There are 2 houses, numbered 1 to 2 from left to right. Each house is occupied by a different person. Each house has a unique attribute for each of the following characteristics:
- Each person has a unique name: Arnold, Eric
- People own unique car models: ford f150, tesla model 3
- The people keep unique animals: cat, horse

Clues:
1. Eric is directly left of the person who owns a Tesla Model 3.
2. The person who keeps horses is in the first house.

Reasoning steps:

From Clue 1, we know that Eric is directly left of someone, so he must live in House 1, because House 2 is the rightmost house.
It follows that the person in House 2 must be Arnold, and he owns the Tesla Model 3. Thus, Eric owns the Ford F150.
From Clue 2, we know that the person in House 1, Eric, keeps horses, which means Arnold in House 2 must keep cats. Finally, we arrive at the unique solution to this puzzle.
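
The reasoning above can be checked mechanically. The following is a minimal brute-force sketch, not the benchmark's own solver: it enumerates every assignment of names, cars, and animals to the two houses and keeps those that satisfy both clues. The names `satisfies_clues` and `solve` are hypothetical helpers introduced for illustration.

```python
from itertools import permutations

NAMES = ("Arnold", "Eric")
CARS = ("ford f150", "tesla model 3")
ANIMALS = ("cat", "horse")


def satisfies_clues(names, cars, animals):
    """Check both clues for one candidate assignment.

    Each argument is a tuple indexed by house position (index 0 is House 1).
    """
    # Clue 1: Eric is directly left of the person who owns a Tesla Model 3.
    clue1 = cars.index("tesla model 3") - names.index("Eric") == 1
    # Clue 2: the person who keeps horses is in the first house.
    clue2 = animals.index("horse") == 0
    return clue1 and clue2


def solve():
    """Enumerate all assignments and return those consistent with the clues."""
    solutions = []
    for names in permutations(NAMES):
        for cars in permutations(CARS):
            for animals in permutations(ANIMALS):
                if satisfies_clues(names, cars, animals):
                    solutions.append(list(zip(names, cars, animals)))
    return solutions


solutions = solve()
for house, (name, car, animal) in enumerate(solutions[0], start=1):
    print(f"House {house}: {name}, {car}, {animal}")
```

Exhaustive search is only practical for small grids like this 2x3 example, since the number of candidate assignments grows factorially with the number of houses; larger puzzles would call for constraint propagation or a dedicated solver.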

The solution is presented in table format: