AGEofLLMs.com
OpenAI Drops SWE-Lancer: A New Benchmark for AI Coders

OpenAI just launched SWE-Lancer, a new benchmark designed to test AI coding skills in a real-world freelance setting. It’s built on more than 1,400 actual software engineering tasks from Upwork, with a total payout value of $1 million USD. GitHub repo: https://github.com/openai/SWELancer-Benchmark

SWE-Lancer image shared on X by OpenAI

The tasks cover everything from UI/UX design to systems architecture, with projects ranging from $50 bug fixes to $32,000 feature builds. There are even management-level tasks, where a model has to pick the best technical approach rather than write the code itself.

How Realistic Is It?

Pretty realistic. The average task took human freelancers over 21 days to finish. Task pricing matches real-world rates, so tougher tasks pay more.

And here’s the kicker: current AI models can’t solve most of these tasks.

OpenAI on X: "Current frontier models are unable to solve the majority of tasks."

OpenAI is setting this up as a baseline to measure future improvements in AI coding. They’ve even open-sourced a Docker image and released SWE-Lancer Diamond, a public evaluation set, so others can test their models too.
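To make the idea of a dollar-valued baseline concrete, here is a minimal Python sketch of how results on priced tasks could be tallied. It is not OpenAI's evaluation harness: the `TaskResult` type, the `summarize` helper, and the three sample tasks are hypothetical, with payouts chosen only to echo the $50 to $32,000 range mentioned above.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    task_id: str        # hypothetical identifier, not a real SWE-Lancer task ID
    payout_usd: float   # what the freelance task paid
    solved: bool        # whether the model's attempt was accepted


def summarize(results):
    """Return the solve rate and the dollar value of the solved tasks."""
    total = sum(r.payout_usd for r in results)
    earned = sum(r.payout_usd for r in results if r.solved)
    solved = sum(1 for r in results if r.solved)
    return {
        "solve_rate": solved / len(results) if results else 0.0,
        "earned_usd": earned,
        "total_usd": total,
        "earned_share": earned / total if total else 0.0,
    }


# Made-up results for illustration only.
results = [
    TaskResult("bug-fix", 50.0, True),
    TaskResult("ui-tweak", 250.0, True),
    TaskResult("feature-build", 32000.0, False),
]
summary = summarize(results)
print(summary)
```

In this made-up run the model solves two of three tasks but collects under 1% of the available money, which is why counting dollars can tell a different story than counting solved tasks.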

The Internet Reacts

On Reddit, people are already debating what this means for AI in software development.

@Chance_Attorney_8296 pointed out: “Funny, Claude is still the best per this benchmark and they all perform pretty meh despite freelance tasks being fairly self-contained, the types of things these models should excel at.”

Meanwhile, @The-AI-Crackhead said: “Software agents are coming this month I fear. I work at a big company and the higher-ups have been in literal all-day ‘planning’ meetings for the past week. This is a few weeks after a major re-org / layoffs so it’s bizarre.”

But not everyone is worried. @Independent_Pitch598 wrote: “Why fear? Isn’t it great?”

And @Ok-Chemical3 fired back: “Many people will lose their jobs, and the government currently has no plans to bring in any sort of sustainable safety net. Why do you think it’s great?”
