AGEofLLMs.com

ZebraLogic is Testing LLMs with Logic Puzzles

Puzzled LLMs Struggling Logically

The ZebraLogic benchmark uses logic grid puzzles to assess how well large language models (LLMs) can reason logically. It evaluates an AI system's ability to solve problems that require step-by-step deduction from a set of clues.

Humans can solve these puzzles by reasoning strategically over the clues, for example through the process of elimination. LLMs still struggle with this kind of logical reasoning.

The best-performing LLM on the leaderboard, Claude 3.5 Sonnet, solves only 33.4% of all puzzles and just 12.4% of the hard ones.

The best open-weight LLM is 🐳 DeepSeek-v2-Chat (0628), which solves 22.7% of all puzzles compared with 16.8% for Llama-3-70B-Instruct.

Smaller models with 7 to 10 billion parameters have a tough time with hard puzzles, solving fewer than 1%, and show low accuracy on easy ones.

LLMs Leaderboard: Top 15

| # | Model | Puzzle Acc (%) | Easy Puzzle Acc (%) | Hard Puzzle Acc (%) | Cell Acc (%) | No answer (%) |
|---|-------|----------------|---------------------|---------------------|--------------|---------------|
| 1 | 🔒 Claude 3.5 Sonnet | 33.4 | 87.5 | 12.4 | 54.3 | 0 |
| 2 | 🔒 gpt-4-turbo-2024-04-09 | 28.4 | 80.7 | 8.1 | 47.9 | 0.1 |
| 3 | 🔒 gpt-4o-2024-05-13 | 28.2 | 77.9 | 8.9 | 38.7 | 19.3 |
| 4 | 🔒 gpt-4-0314 | 27.1 | 77.1 | 7.6 | 47.4 | 0.2 |
| 5 | 🔒 Claude 3 Opus | 27.0 | 78.2 | 7.1 | 48.9 | 0 |
| 6 | 🔑 DeepSeek-V2-Chat | 22.7 | 68.6 | 4.9 | 42.5 | 5.2 |
| 7 | 🔑 Qwen2-72B-Instruct | 21.4 | 63.9 | 4.9 | 38.3 | 10.2 |
| 8 | 🔑 DeepSeek-Coder-V2 | 21.1 | 64.6 | 4.2 | 41.6 | 4.9 |
| 9 | 🔒 gpt-4o-mini-2024-07-18 | 20.1 | 62.5 | 3.6 | 41.3 | 0.1 |
| 10 | 🔒 Gemini 1.5 Flash | 19.4 | 59.3 | 3.9 | 31.8 | 22.7 |
| 11 | 🔒 Gemini 1.5 Pro | 19.4 | 55.7 | 5.3 | 44.6 | 0.8 |
| 12 | 🔒 Yi-Large-Preview | 18.9 | 58.9 | 3.3 | 42.6 | 1.4 |
| 13 | 🔒 Yi-Large | 18.8 | 58.2 | 3.5 | 39.8 | 1.8 |
| 14 | 🔒 Claude 3 Sonnet | 18.7 | 58.9 | 3.1 | 43.7 | 0 |
| 15 | 🔑 llama-3-70B-Instruct | 16.8 | 52.9 | 2.8 | 42.3 | 0.2 |
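
The leaderboard reports both a puzzle-level and a cell-level accuracy. The article does not define them, so the sketch below rests on an assumption: puzzle accuracy counts a puzzle only when every cell of the grid is correct, while cell accuracy gives partial credit per cell, and a missing answer earns no credit. The function names and the dictionary layout are hypothetical and are not taken from the ZebraLogic code.

```python
def count_correct_cells(predicted, truth):
    """Count cells of one puzzle grid that match the reference solution."""
    if predicted is None:  # assumed handling of a "no answer" response
        return 0
    return sum(
        1
        for pred_row, true_row in zip(predicted, truth)
        for feature, value in true_row.items()
        if pred_row.get(feature) == value
    )


def score(predictions, truths):
    """Return puzzle-level and cell-level accuracy as percentages."""
    total_cells = sum(len(row) for truth in truths for row in truth)
    correct_cells = sum(
        count_correct_cells(p, t) for p, t in zip(predictions, truths)
    )
    solved = sum(1 for p, t in zip(predictions, truths) if p == t)
    return {
        "puzzle_acc": 100 * solved / len(truths),
        "cell_acc": 100 * correct_cells / total_cells,
    }


# Toy data: one reference grid (one dict per house), scored twice.
truth = [
    {"Name": "Eric", "CarModel": "ford f150", "Animal": "horse"},
    {"Name": "Arnold", "CarModel": "tesla model 3", "Animal": "cat"},
]
perfect = [dict(row) for row in truth]
animals_swapped = [
    {"Name": "Eric", "CarModel": "ford f150", "Animal": "cat"},
    {"Name": "Arnold", "CarModel": "tesla model 3", "Animal": "horse"},
]

result = score([perfect, animals_swapped], [truth, truth])
print(result)
```

Under these assumptions, a model that gets most cells right but rarely completes a whole grid would show a much higher cell accuracy than puzzle accuracy, which matches the pattern in the leaderboard.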

An Example of ZebraLogic Data

Here is an example of a 2x3 puzzle (2 houses x 3 features):

There are 2 houses, numbered 1 to 2 from left to right. Each house is occupied by a different person. Each house has a unique attribute for each of the following characteristics:
- Each person has a unique name: Arnold, Eric
- People own unique car models: ford f150, tesla model 3
- The people keep unique animals: cat, horse

Clues:
1. Eric is directly left of the person who owns a Tesla Model 3.
2. The person who keeps horses is in the first house.

Reasoning steps:

  • From Clue 1, we know that Eric is to the left of someone, so he must be the owner of House 1 because House 2 is the rightmost house.
  • Additionally, we know that the person in House 2 must be Arnold, and he owns a Tesla Model 3. Thus, Eric owns a Ford F150.
  • From Clue 2, we know that Eric keeps horses in House 1, which means the other house must keep cats. Finally, we arrive at the unique solution to this puzzle.
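
The same deduction can be checked mechanically. The following is a minimal brute-force sketch, not the benchmark's own solver: it enumerates every assignment of names, cars, and animals to the two houses and keeps those that satisfy both clues. The helper names `satisfies_clues` and `solve` are illustrative.

```python
from itertools import permutations

NAMES = ("Arnold", "Eric")
CARS = ("ford f150", "tesla model 3")
ANIMALS = ("cat", "horse")


def satisfies_clues(names, cars, animals):
    """Each tuple is ordered by house; index 0 is House 1."""
    # Clue 1: Eric is directly left of the person who owns a Tesla Model 3.
    clue1 = names.index("Eric") + 1 == cars.index("tesla model 3")
    # Clue 2: The person who keeps horses is in the first house.
    clue2 = animals.index("horse") == 0
    return clue1 and clue2


def solve():
    """Return every assignment consistent with both clues."""
    solutions = []
    for names in permutations(NAMES):
        for cars in permutations(CARS):
            for animals in permutations(ANIMALS):
                if satisfies_clues(names, cars, animals):
                    solutions.append(list(zip(names, cars, animals)))
    return solutions


solutions = solve()
for house, row in enumerate(solutions[0], start=1):
    print(house, *row)
```

Exhaustive search works here because a 2x3 puzzle has only 8 candidate assignments. The search space grows factorially with the number of houses, so larger puzzles would call for constraint propagation rather than enumeration.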

The solution is presented in table format:

| House | Name | CarModel | Animal |
|-------|------|----------|--------|
| 1 | Eric | ford f150 | horse |
| 2 | Arnold | tesla model 3 | cat |

Sources:

https://huggingface.co/blog/yuchenlin/zebra-logic
