Wait, shouldn’t people be doing their own reading? There’s still no substitute for reading yourself, particularly if you’re trying to learn or experience art. But for better or worse, people are turning to AI for help when they want to get up to speed on a new topic, need help decoding jargon or need to cheat their way through a meeting. Summarisation is emerging as a core use for AI, and chatbots promise to be a kind of CliffsNotes where you can ask follow-up questions.
If you use AI, this test offers a real-world assessment of what the current tech can – and cannot – reliably accomplish. (The Washington Post has a content partnership with ChatGPT’s maker, OpenAI.)
Here’s how the bots performed on each topic, followed by an overall champion and our judges’ conclusions.
Literature
Best: ChatGPT
Literature was the worst subject overall for the bots. Only Claude got all the facts right about Chris Bohjalian’s 2025 Civil War love story, The Jackal’s Mistress.
Gemini, which wrote very short responses to our questions, was most often guilty of what Bohjalian called inaccurate, misleading and sloppy reading. In one summary, Gemini described a man who had just had a leg amputated “appearing” on another character’s doorstep. Bohjalian says the answer reminded him of the Seinfeld episode in which George Costanza watches the Breakfast at Tiffany’s movie instead of reading the novel and ends up embarrassing himself at the book club.
Even the best overall summary of the book, which came from ChatGPT, left something to be desired. “The response could be copy for the dust jacket. But it also discusses only three of the five major characters, ignoring the important role of the two formerly enslaved people,” says Bohjalian. In fact, he noticed the overly “positive” AI helpers often failed to address slavery and the Civil War.
That said, the quality of answers to more analytical questions by both ChatGPT and Claude left Bohjalian gobsmacked. Prompted to describe how the book’s epilogue “made you feel,” both bots appeared to have “all the feels”, Bohjalian says.
“These responses express precisely what I was trying to convey,” says Bohjalian.
Scores, out of 10: ChatGPT 7.8; Claude 7.3; Meta AI 4.3; Copilot 3.5; Gemini 2.3
Law
Best: Claude
Sterling Miller, a long-time corporate lawyer, judged our AI tools’ understanding of two common legal contracts that people often have to navigate without a lawyer’s help. What he found was inconsistency.
At times, Meta AI and ChatGPT tried to reduce complex parts of the contracts to one-line summaries. “That is basically useless,” Miller says.
Worse, the bots sometimes didn’t seem to appreciate significant nuances. In our test rental agreement, Meta AI skipped several sections entirely and missed that a landlord could enter the property at any time. ChatGPT forgot to mention a key clause in a contractor agreement about who owned inventions.
Claude won overall by offering the most consistently decent answers to our questions. And it did its best work on our most complex request: suggesting changes to our test rental agreement. Miller said Claude’s answer was complete, picked up on nuance and laid things out exactly like he would.
On that prompt, it came the closest to being a “good substitute for a lawyer,” Miller says. “The problem is none of the tools got 10s across the board.”
Scores, out of 10: Claude 6.9; Gemini 6.1; Copilot 5.4; ChatGPT 5.3; Meta AI 2.6
Health science
Best: Claude
On average, all of the AI tools scored better at analysing scientific research than at the other subjects. In our test of two papers co-written by judge Eric Topol, less than two points separated the best and worst performances.
It’s hard to say exactly why. AI might have access to a lot of scientific papers in its training data. Research reports were also the only documents in our tests that follow a very predictable structure, including a human-written summary (the abstract) at the top.
Topol’s lowest score of 4 went to Gemini for its summary of a study on Parkinson’s disease. The response didn’t introduce hallucinations, but it left out key descriptions of the study and why it mattered.
Claude was the only AI tool to earn a score of 10 out of 10. Topol gave that for its summary of his paper on long covid, which helpfully broke down the results for different kinds of patients and highlighted the most important takeaway from the paper for doctors treating covid patients.
However, on an analytical question about how one study accounted for racial differences, Claude scored only a 5. “I was very surprised at how different the responses were for the different prompts,” says Topol.
Scores, out of 10: Claude 7.7; ChatGPT 7.2; Copilot 7.0; Gemini 6.5; Meta AI 6.0
Politics
Best: ChatGPT
Trump’s speeches can be so meandering that they’ve garnered their own stylistic nickname: “the weave”. Cat Zakrzewski, a Washington Post White House reporter, judged whether AI could make out what he was actually asserting and analyse what it meant.
For example, we asked the bots to analyse Trump’s 100-day rally in Michigan, in which he mentioned jobs returning to the state about a dozen times. But how many jobs? Copilot incorrectly said thousands, apparently conflating the question with comments Trump made about keeping an Air Force base open. Meta AI answered best by reporting that Trump never specified, while also highlighting what he did suggest about auto jobs.
ChatGPT stood out from the pack with impressive responses to about half of our questions. For example, when we asked it to identify what rival Democrats wouldn’t like about Trump’s unscripted 100-day rally, it produced a bullet-point list that hit all the right notes. “This answer does a good job of drawing specific examples from the speech, and it provides accurate context,” Zakrzewski says. What’s more, it “accurately fact-checks Trump’s false claims that he won the 2020 election”.
The bots got into the most trouble conveying Trump’s tone. For example, Copilot’s summary of the 100-day rally was factually accurate but didn’t capture its charged nature. “If you only read this summary, you might not believe Trump delivered this speech,” says Zakrzewski.
Scores, out of 10: ChatGPT 7.2; Claude 6.2; Meta AI 5.2; Gemini 5.0; Copilot 3.7
And the overall winner is …
Claude edged out ChatGPT and left the others in the dust.
It was also the only model that never hallucinated in our tests.
What did we learn?
So is that good or bad? Both Claude and ChatGPT produced some analysis that knocked it out of the park, the judges said.
During his evaluations of those two tools, Bohjalian was flabbergasted. “Okay, I’m done. Whole human race is. Stick a fork in us,” he noted.
But you could also see the results this way: none of the bots scored higher than 70% overall – about the cutoff for a C-minus.
Beyond hallucinations, a number of limitations echoed across the tests. AI summaries frequently left out important information and overemphasised the positive (while ignoring the negative). Too often, Bohjalian says, you could “really see the robot hiding behind the human mask” pretending to be an expert in something it didn’t actually understand.
And an AI tool’s capability in one field didn’t necessarily translate to another. ChatGPT, for example, was tops in politics and literature but ranked near the bottom in law.
The judges highlight the inconsistency as a reason for caution.
Miller says AI is not a substitute for a lawyer. “If paying an attorney is out of the question or if you just want to have something in hand while you also read through the agreement or document,” he says, “then using generative AI is an ‘okay’ solution.”
I’d also recommend running your document through at least two AI tools, so you can compare the results. And for anything that’s actually important in your life, it’s definitely worth taking the time to read it yourself.