Key points:
- AI tools scored higher than lawyers in a 200-question legal research evaluation.
- Alexi, Counsel Stack, Midpage, and ChatGPT all surpassed the human lawyer baseline.
- ChatGPT performed well despite not being purpose-built for legal work.
- AI systems struggled with multi-jurisdictional and citation-specific questions.
- Experts say human review remains essential for accuracy and interpretation.
A new report from LLM evaluation startup Vals AI compared the performance of Alexi, Counsel Stack, Midpage, and OpenAI’s ChatGPT against human lawyers on 200 U.S. legal research questions. The questions were sourced from attorneys at firms including Reed Smith, Fisher Phillips, McDermott Will & Emery, Ogletree Deakins, Paul Hastings, and Paul, Weiss, Rifkind, Wharton & Garrison.
Each response—AI and human—was scored for accuracy, authoritativeness, and clarity. The lawyer baseline averaged 69%, and all four AI tools scored higher: Counsel Stack led at 78%, followed by Alexi at 77%, Midpage at 76%, and ChatGPT at 74%.
Tara Waters, the project’s lead, said she expected ChatGPT to excel in citation quality but found the opposite. “ChatGPT doesn't seem to be, yet, well-engineered for the sourcing and citation,” she told Legaltech News. The generalist AI tended to rely on broad web-based materials rather than pinpointing authoritative statutes or cases, she said.
Still, the legal-focused tools showed weaknesses too. When prompted to survey all 50 states for a single statute, they underperformed ChatGPT. “That was surprising,” said Vals AI CEO Rayan Krishnan, who noted that the systems “should be able to check each one procedurally” without fatigue. He speculated that some tools may have jurisdictional coverage limits or outdated data.
Krishnan cautioned that despite their strong aggregate scores, AI outputs still leave critical gaps. “Even if these tools are getting 70% accuracy, that remaining 30% is really valuable to have human input for,” he said.
Waters added that Vals AI plans to make its evaluation process more repeatable and automated but emphasized the need for continued human involvement in review and scoring. “There won't ever be a pure automated answer for this,” she said, “but we’ll be able to do it more frequently and consistently.”
This study follows Vals AI’s February benchmark that evaluated legal AI platforms from Thomson Reuters, Harvey, vLex, LexisNexis, and Vecflow, assessing how accurately they handled case analysis and transactional work. The new results suggest that, while AI continues to narrow the performance gap with lawyers, the future of legal research may depend as much on oversight as on automation.