Overview
We set out to test the knowledge of the leading AI models (ChatGPT, Claude, Gemini, Grok) to see which was best at answering a wide array of questions. All tests were run on the premium version of each chatbot. Initially, we asked fact-based questions and thought-provoking riddles, and found that all models were equally equipped to answer the prompts correctly (with a few exceptions, hallucinations, and glitches).
Our test then expanded to questions involving unknowns and opinions. We also pushed the models to give us a single, concise answer, rather than a long-winded response covering considerations and options.
Our Goal: To see which models provide the best responses to Subjective, Indirect, and Future-Based prompts.
Our Approach: We tested the four leading generative AI models: OpenAI’s ChatGPT, Anthropic’s Claude, xAI’s Grok, and Google’s Gemini. The same 40 questions were prompted across four categories: Commerce, Future-Based, Consulting, and Detail Extraction. The objective was to get the model to give a definitive answer to the prompt in order to score full points. Multiple prompt attempts and framing strategies were used to elicit an acceptable response.
Methodology: Our test included prompting the models across four query types:
- Commerce: looking for the best purchase option based on specific parameters.
- Future-Based: looking for a single forward-looking answer based on relevant information.
- Consulting: how-to instructions based on specific scenarios.
- Detail Extraction: asking for specific details that aren’t readily available and must be extracted from a larger data set or item.
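By way of illustration, here is a minimal sketch (in Python, not part of the test tooling itself) of how per-model, per-category scores could be tallied under this methodology. The model and category names come from this test; the tally helper, the assumed even split of 10 questions per category, and the example data are hypothetical and for illustration only.

```python
from collections import defaultdict

MODELS = ["ChatGPT", "Claude", "Grok", "Gemini"]
CATEGORIES = ["Commerce", "Future-Based", "Consulting", "Detail Extraction"]
MAX_PER_QUERY = 10           # maximum score per query, per the criteria below
QUESTIONS_PER_CATEGORY = 10  # assumption: 40 questions split evenly across 4 categories

def tally(scores):
    """Sum graded scores into per-model, per-category totals.

    `scores` is a list of (model, category, score) tuples produced by the
    graders; each score is 0-10 per the criteria below.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for model, category, score in scores:
        assert model in MODELS and category in CATEGORIES
        assert 0 <= score <= MAX_PER_QUERY
        totals[model][category] += score
    return totals

# Hypothetical usage: under these assumptions each model can earn at most
# MAX_PER_QUERY * QUESTIONS_PER_CATEGORY = 100 points per category,
# or 400 points across all four categories.
example = [("ChatGPT", "Commerce", 8), ("Claude", "Commerce", 10)]
print(tally(example)["Claude"]["Commerce"])  # -> 10
```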
Criteria: Responses were graded on the following criteria for a maximum score of 10 per query: