Elevator Pitch

  • A whirlwind tour of the past six months in large language models (LLMs), using the whimsical benchmark of "pelicans riding bicycles" to illustrate rapid technical progress, shifting benchmarks, memorable bugs, and ongoing challenges in AI engineering.

Key Takeaways

  • The LLM landscape has seen a surge of new models—over 30 noteworthy releases in six months—with major advances in local model capability, cost, and tool integration, but also some disappointing "lemons."
  • Creative, qualitative benchmarks (like generating SVGs of pelicans on bicycles) can reveal model strengths and weaknesses more vividly than traditional leaderboards; a minimal prompt sketch follows this list.
  • Major risks remain, especially in the areas of prompt injection, tool-based agents, and the "lethal trifecta" of access to private data, exposure to malicious instructions, and exfiltration paths.
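
A minimal sketch of how the pelican benchmark might be driven programmatically, assuming the OpenAI Python SDK and an OPENAI_API_KEY in the environment; the model name and output path are illustrative placeholders rather than details from the talk:

    # Hypothetical harness for the "pelican riding a bicycle" benchmark.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder: swap in whichever model is being compared
        messages=[
            {"role": "user",
             "content": "Generate an SVG of a pelican riding a bicycle"},
        ],
    )

    svg = response.choices[0].message.content
    # Replies sometimes wrap the SVG in a Markdown fence that needs stripping.
    with open("pelican.svg", "w") as f:
        f.write(svg)

The "benchmark" is simply re-running this prompt against each new model and comparing the resulting drawings by eye; the ranking is qualitative rather than scored.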

Most Memorable Aspects

  • The use of "pelican riding a bicycle" SVG generation as an offbeat, surprisingly effective way to compare and rank models.
  • Notable bugs: ChatGPT’s sycophancy fiasco, Grok’s prompt mishaps, and LLMs’ tendency to "snitch" when exposed to evidence of wrongdoing.
  • The hilarious and revealing leaderboard of 34 model-generated pelicans, with Gemini 2.5 Pro Preview topping the list.

Direct Quotes

  • "Most importantly: pelicans can’t ride bicycles. They’re the wrong shape!"
  • "I think tools combined with reasoning is the most powerful technique in all of AI engineering right now."
  • "There’s this thing I’m calling the lethal trifecta, which is when you have an AI system that has access to private data, and potential exposure to malicious instructions... and there’s a mechanism to exfiltrate stuff."

Source URL

  • Original: 5289 words, Summary: 260 words