Tag: devil in the details

  • The AI Megapixel Wars – Benchmarks vs Practical Utility

    soundtrack by Jason Sanders

    The non-stop progression of Generative AI benchmarks over the past year has been both exciting and exhausting to follow. Big leaps in capability make for great headlines, but I find myself growing more skeptical about how much these improvements actually matter for everyday users. When I see reports of the latest model posting a better score on some arcane academic test, I can’t help but think of my own experiences watching advanced models struggle with tasks like keeping CSS styling consistent, or running themselves in circles trying to fix unit tests.

    At times this disconnect between benchmark performance and practical utility feels like a repeat of the Great Digital Camera Megapixel Wars. More megapixels didn’t automatically translate to better photos, and I suspect that higher MMLU scores don’t always mean that a model will be more helpful for common tasks.

    That said, there are cases where cutting-edge models can obviously shine – like complex code refactoring projects or handling nuanced technical discussions that require deep ‘understanding’ across multiple domains. The key is matching the tool to the task: I wouldn’t use a simpler model to help architect a distributed system, but I also wouldn’t pay premium rates to use GPT-o1 for basic text summarization.

    Maybe instead of fixating on universal benchmarks, we need more personal metrics that reflect our own very specific definitions of real-world usability. For example, how many attempts does it take to write a working Tabletop Simulator script so I can play a custom Magic: the Gathering game format? How well does the model keep the most relevant context in view across a long conversation about building out my Pathfinder tabletop RPG character? I doubt that OpenAI researchers are building benchmarks for these problems. (Side note: I think it’s interesting that while embellishing this blog post, Claude suggested I should avoid using examples that are ‘too niche.’ ‘Niche’ is real life. We are all a niche of one.)

    I’d also hypothesize that a skilled verbal communicator working with an older model often outperforms an unfocused prompter using the latest frontier model, much as a pro with an old iPhone will still take better pictures than an amateur with the newest professional-grade digital camera. If that hypothesis holds, it suggests we should focus more on developing our own reasoning and communication skills, and on choosing the right tool for each specific need, rather than on chasing the latest breakthroughs.

    The most practical benchmark for your own everyday use can be as simple as keeping notes on how different models handle your real-world tasks. For example, this post was largely written with Claude 3.5 Sonnet v2 inside a custom project, because I consistently prefer the style and tone I get from Claude that way. I then asked GPT-o1 for technical feedback, because I prefer to use o1 as the ‘critic’ rather than the ‘creator.’ My own unscientific personal testing has shown that while frontier models often impress me with their ‘reasoning’ abilities, they’re not always the best fit for every step of every task. And as this technology continues to evolve, finding the balance between capability and practicality will only become more important for anyone just trying to get things done.
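
    If you wanted to make that note-keeping slightly more systematic, here is a minimal sketch of what it could look like in code. Everything in it (the model_log.csv file, the field names, the example entries) is a hypothetical illustration of a personal benchmark, not an established tool or anyone’s official methodology:

        # model_log.py: a hypothetical personal benchmark log (a sketch, not a real tool)
        import csv
        from collections import defaultdict
        from pathlib import Path

        LOG_PATH = Path("model_log.csv")  # assumption: a local CSV file is plenty
        FIELDS = ["model", "task", "attempts", "satisfied"]

        def log_attempt(model: str, task: str, attempts: int, satisfied: bool) -> None:
            """Append one record; write the header the first time through."""
            is_new = not LOG_PATH.exists()
            with LOG_PATH.open("a", newline="") as f:
                writer = csv.DictWriter(f, fieldnames=FIELDS)
                if is_new:
                    writer.writeheader()
                writer.writerow({"model": model, "task": task,
                                 "attempts": attempts, "satisfied": satisfied})

        def summarize() -> None:
            """Print average attempts per (model, task) pair: a personal leaderboard."""
            attempts_by_pair: dict[tuple[str, str], list[int]] = defaultdict(list)
            with LOG_PATH.open(newline="") as f:
                for row in csv.DictReader(f):
                    key = (row["model"], row["task"])
                    attempts_by_pair[key].append(int(row["attempts"]))
            for (model, task), attempts in sorted(attempts_by_pair.items()):
                avg = sum(attempts) / len(attempts)
                print(f"{model:24} {task:32} avg attempts: {avg:.1f}")

        if __name__ == "__main__":
            # Illustrative entries only, echoing the examples in this post.
            log_attempt("claude-3.5-sonnet-v2", "tabletop-simulator-script", 3, True)
            log_attempt("gpt-o1", "unit-test-fixes", 5, False)
            summarize()

    Even a few weeks of entries like these would probably tell you more about which model fits which of your tasks than any public leaderboard.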