ZebraLogic: Testing LLMs with Logic Puzzles

The ZebraLogic benchmark uses logic grid puzzles to assess how well large language models (LLMs) can reason logically, that is, whether they can solve problems that require combining several constraints into a single consistent answer.
Humans solve these puzzles by reasoning strategically over the clues, for example through the process of elimination. LLMs still struggle with this kind of reasoning.
The best-performing model, Claude 3.5 Sonnet, solves only 33.4% of all puzzles and just 12.4% of the hard ones.
The best open-weight model is DeepSeek-V2-Chat (0628), which solves 22.7% of puzzles, well ahead of Llama-3-70B-Instruct at 16.8%.
Smaller models with 7 to 10 billion parameters struggle with hard puzzles, solving fewer than 1% of them, and also show low accuracy on easy ones.
LLMs Leaderboard: Top 15
# | Model | Puzzle Acc (%) | Easy Puzzle Acc (%) | Hard Puzzle Acc (%) | Cell Acc (%) | No answer (%) |
---|---|---|---|---|---|---|
1 | Claude 3.5 Sonnet | 33.4 | 87.5 | 12.4 | 54.3 | 0 |
2 | gpt-4-turbo-2024-04-09 | 28.4 | 80.7 | 8.1 | 47.9 | 0.1 |
3 | gpt-4o-2024-05-13 | 28.2 | 77.9 | 8.9 | 38.7 | 19.3 |
4 | gpt-4-0314 | 27.1 | 77.1 | 7.6 | 47.4 | 0.2 |
5 | Claude 3 Opus | 27 | 78.2 | 7.1 | 48.9 | 0 |
6 | DeepSeek-V2-Chat | 22.7 | 68.6 | 4.9 | 42.5 | 5.2 |
7 | Qwen2-72B-Instruct | 21.4 | 63.9 | 4.9 | 38.3 | 10.2 |
8 | DeepSeek-Coder-V2 | 21.1 | 64.6 | 4.2 | 41.6 | 4.9 |
9 | gpt-4o-mini-2024-07-18 | 20.1 | 62.5 | 3.6 | 41.3 | 0.1 |
10 | Gemini 1.5 Flash | 19.4 | 59.3 | 3.9 | 31.8 | 22.7 |
11 | Gemini 1.5 Pro | 19.4 | 55.7 | 5.3 | 44.6 | 0.8 |
12 | Yi-Large-Preview | 18.9 | 58.9 | 3.3 | 42.6 | 1.4 |
13 | Yi-Large | 18.8 | 58.2 | 3.5 | 39.8 | 1.8 |
14 | Claude 3 Sonnet | 18.7 | 58.9 | 3.1 | 43.7 | 0 |
15 | llama-3-70B-Instruct | 16.8 | 52.9 | 2.8 | 42.3 | 0.2 |
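The leaderboard reports both a puzzle-level and a cell-level accuracy, but this summary does not define them. As a rough sketch, assuming that a puzzle counts as solved only when every cell of the grid matches the reference and that cell accuracy is the fraction of matching cells, the two scores for a single puzzle could be computed as follows. The function names and the dictionary layout are hypothetical and are not taken from the benchmark's code.

```python
# Hypothetical scoring sketch; the metric definitions are assumptions,
# not the benchmark's official implementation.

def cell_accuracy(predicted, reference):
    """Fraction of (house, feature) cells where the prediction matches."""
    total = 0
    correct = 0
    for house, attributes in reference.items():
        for feature, value in attributes.items():
            total += 1
            # A missing house or feature in the prediction counts as wrong.
            if predicted.get(house, {}).get(feature) == value:
                correct += 1
    return correct / total


def puzzle_solved(predicted, reference):
    """A puzzle is solved only if every single cell is correct."""
    return cell_accuracy(predicted, reference) == 1.0


# Illustrative two-house grid with three features per house.
reference = {
    1: {"Name": "Eric", "CarModel": "ford f150", "Animal": "horse"},
    2: {"Name": "Arnold", "CarModel": "tesla model 3", "Animal": "cat"},
}
# A prediction that swaps the two animals: 4 of 6 cells are right.
predicted = {
    1: {"Name": "Eric", "CarModel": "ford f150", "Animal": "cat"},
    2: {"Name": "Arnold", "CarModel": "tesla model 3", "Animal": "horse"},
}

print(round(cell_accuracy(predicted, reference), 3))
print(puzzle_solved(predicted, reference))
```

Under these assumptions a model can get most cells right and still score zero on the puzzle, which would be consistent with the Cell Acc column being much higher than the Puzzle Acc column in the table.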
An Example of ZebraLogic Data
Here is an example of a 2x3 puzzle (2 houses x 3 features):
There are 2 houses, numbered 1 to 2 from left to right. Each house is occupied by a different person. Each house has a unique attribute for each of the following characteristics:
- Each person has a unique name: Arnold, Eric
- People own unique car models: ford f150, tesla model 3
- The people keep unique animals: cat, horse
Clues:
1. Eric is directly left of the person who owns a Tesla Model 3.
2. The person who keeps horses is in the first house.
Reasoning steps:
- From Clue 1, we know that Eric is directly to the left of someone, so he must live in House 1, because House 2 is the rightmost house.
- It follows that the person in House 2 must be Arnold, and by Clue 1 he owns the Tesla Model 3. Thus, Eric owns the Ford F150.
- From Clue 2, the person in House 1, Eric, keeps horses, which means Arnold in House 2 keeps the cat. This gives the unique solution to the puzzle.
The solution is presented in table format:
Houses | Name | CarModel | Animal |
---|---|---|---|
1 | Eric | ford f150 | horse |
2 | Arnold | tesla model 3 | cat |
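The elimination argument above can also be checked mechanically. The following is a minimal brute-force sketch, not the benchmark's own tooling: it enumerates every assignment of names, cars, and animals to the two houses (2! x 2! x 2! = 8 candidates) and keeps those that satisfy both clues. The function names are illustrative.

```python
from itertools import permutations

NAMES = ("Arnold", "Eric")
CARS = ("ford f150", "tesla model 3")
ANIMALS = ("cat", "horse")


def satisfies_clues(names, cars, animals):
    """Each argument is a tuple indexed by house position (0 = House 1)."""
    # Clue 1: Eric is directly left of the Tesla Model 3 owner.
    clue1 = cars.index("tesla model 3") - names.index("Eric") == 1
    # Clue 2: the person who keeps horses is in the first house.
    clue2 = animals.index("horse") == 0
    return clue1 and clue2


def solve():
    """Return every full assignment that is consistent with both clues."""
    solutions = []
    for names in permutations(NAMES):
        for cars in permutations(CARS):
            for animals in permutations(ANIMALS):
                if satisfies_clues(names, cars, animals):
                    solutions.append(list(zip(names, cars, animals)))
    return solutions


solutions = solve()
print(len(solutions))  # number of consistent assignments
for house, (name, car, animal) in enumerate(solutions[0], start=1):
    print(house, name, car, animal)
```

Exhaustive enumeration is only practical for tiny grids like this one, since the number of candidates grows factorially with the number of houses; larger puzzles would need constraint propagation or a dedicated solver.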
Published: Jul 23, 2024 at 3:43 AM