Wait, shouldn’t people be doing their own reading? There’s still no substitute for reading yourself, particularly if you’re trying to learn or experience art. But for better or worse, people are turning to AI for help when they want to get up to speed on a new topic, need help decoding jargon or need to cheat their way through a meeting. Summarisation is emerging as a core use for AI, and chatbots promise to be a kind of CliffsNotes where you can ask follow-up questions.
If you use AI, this test offers a real-world assessment of what the current tech can – and cannot – reliably accomplish. (The Washington Post has a content partnership with ChatGPT’s maker, OpenAI.)
Here’s how the bots performed on each topic, followed by an overall champion and our judges’ conclusions.
Literature
Best: ChatGPT
Literature was the worst subject overall for the bots. Only Claude got all the facts right about Chris Bohjalian’s 2025 Civil War love story, The Jackal’s Mistress.
Gemini, which wrote very short responses to our questions, was most often guilty of what Bohjalian called inaccurate, misleading and sloppy reading. In one summary, Gemini described a man who had just had a leg amputated “appearing” on another character’s doorstep. Bohjalian says the answer reminded him of the Seinfeld episode in which George Costanza watches the Breakfast at Tiffany’s movie instead of reading the novel and ends up embarrassing himself at the book club.
Even the best overall summary of the book, which came from ChatGPT, left something to be desired. “The response could be copy for the dust jacket. But it also discusses only three of the five major characters, ignoring the important role of the two formerly enslaved people,” says Bohjalian. In fact, he noticed the overly “positive” AI helpers often failed to address slavery and the Civil War.
That said, the quality of answers to more analytical questions by both ChatGPT and Claude left Bohjalian gobsmacked. Prompted to describe how the book’s epilogue “made you feel,” both bots appeared to have “all the feels”, Bohjalian says.
“These responses express precisely what I was trying to convey,” says Bohjalian.
Scores, out of 10: ChatGPT 7.8; Claude 7.3; Meta AI 4.3; Copilot 3.5; Gemini 2.3
Law
Best: Claude
Sterling Miller, a long-time corporate lawyer, judged our AI tools’ understanding of two common legal contracts that people often have to navigate without a lawyer’s help. What he found was inconsistency.
At times, Meta AI and ChatGPT tried to reduce complex parts of the contracts to one-line summaries. “That is basically useless,” Miller says.
Worse, the bots sometimes didn’t seem to appreciate significant nuances. In our test rental agreement, Meta AI skipped several sections entirely and missed that a landlord could enter the property at any time. ChatGPT forgot to mention a key clause in a contractor agreement about who owned inventions.
Claude won overall by offering the most consistently decent answers to our questions. And it did its best work on our most complex request: suggesting changes to our test rental agreement. Miller said Claude’s answer was complete, picked up on nuance and laid things out exactly like he would.
On that prompt, it came the closest to being a “good substitute for a lawyer,” Miller says. “The problem is none of the tools got 10s across the board.”
Scores, out of 10: Claude 6.9; Gemini 6.1; Copilot 5.4; ChatGPT 5.3; Meta AI 2.6
Health science
Best: Claude
On average, all of the AI tools scored better at analysing scientific research than at the other subjects. In our test of two papers co-written by judge Eric Topol, less than two points separated the best and worst performances.
It’s hard to say exactly why. AI might have access to a lot of scientific papers in its training data. Research reports were also the only documents in our tests that follow a very predictable structure, including a human-written summary (the abstract) at the top.
Topol’s lowest score of 4 went to Gemini for its summary of a study on Parkinson’s disease. The response didn’t introduce hallucinations, but it left out key descriptions of the study and why it mattered.
Claude was the only AI tool to earn a score of 10 out of 10. Topol gave that for its summary of his paper on long covid, which helpfully broke down the results for different kinds of patients and highlighted the most important takeaway from the paper for doctors treating covid patients.
However, on an analytical question about how one study accounted for racial differences, Claude scored only a 5. “I was very surprised at how different the responses were for the different prompts,” says Topol.
Scores, out of 10: Claude 7.7; ChatGPT 7.2; Copilot 7.0; Gemini 6.5; Meta AI 6.0
Politics
Best: ChatGPT
Trump’s speeches can be so meandering that they’ve garnered their own stylistic nickname: “the weave”. Cat Zakrzewski, a Washington Post White House reporter, judged whether AI could make out what he was actually asserting and analyse what it meant.
For example, we asked the bots to analyse Trump’s 100-day rally in Michigan, in which he mentioned jobs returning to the state about a dozen times. But how many jobs? Copilot incorrectly said thousands, apparently conflating the question with comments Trump made about keeping an Air Force base open. Meta AI answered best by reporting that Trump never specified, while also highlighting what he did suggest about auto jobs.
ChatGPT stood out from the pack with impressive responses to about half of our questions. For example, when we asked it to identify what rival Democrats wouldn’t like about Trump’s unscripted 100-day rally, it produced a bullet-point list that hit all the right notes. “This answer does a good job of drawing specific examples from the speech, and it provides accurate context,” Zakrzewski says. What’s more, it “accurately fact-checks Trump’s false claims that he won the 2020 election”.
The bots got into the most trouble conveying Trump’s tone. For example, Copilot’s summary of the 100-day rally was factually accurate but didn’t capture its charged nature. “If you only read this summary, you might not believe Trump delivered this speech,” says Zakrzewski.
Scores, out of 10: ChatGPT 7.2; Claude 6.2; Meta AI 5.2; Gemini 5.0; Copilot 3.7
And the overall winner is …
Claude edged out ChatGPT and left the others in the dust.
It was also the only model that never hallucinated in our tests.
What did we learn?
So is that good or bad? Both Claude and ChatGPT produced some analysis that knocked it out of the park, the judges said.
During his evaluations of those two tools, Bohjalian was flabbergasted. “Okay, I’m done. Whole human race is. Stick a fork in us,” he noted.
But you could also see the results this way: none of the bots scored higher than 70% overall – about the cutoff for a C-minus.
Beyond hallucinations, a number of limitations echoed across the tests. AI summaries frequently left out important information and overemphasised the positive (while ignoring the negative). Too often, Bohjalian says, you could “really see the robot hiding behind the human mask” pretending to be an expert in something it didn’t actually understand.
And an AI tool’s capability in one field didn’t necessarily translate to another. ChatGPT, for example, was tops in politics and literature but ranked near the bottom in law.
The judges highlight the inconsistency as a reason for caution.
Miller says AI is not a substitute for a lawyer. “If paying an attorney is out of the question or if you just want to have something in hand while you also read through the agreement or document,” he says, “then using generative AI is an ‘okay’ solution.”
I’d also recommend running your document through at least two AI tools, so you can compare the results. And for anything that’s actually important in your life, it’s definitely worth taking the time to read it yourself.