Evaluating LLM Outputs for Legal Contracts Using BLEU, ROUGE, and BERTScore

Authors

  • Pralohith Reddy Chinthalapelly, Mayo Clinic, USA
  • Amsa Selvaraj, Amtech Analytics, USA
  • Chandan Jnana Murthy, Amtech Analytics, USA

Keywords:

Large Language Models, BLEU, ROUGE-L, BERTScore, legal contract evaluation

Abstract

Because LLMs now automate legal contract drafting, their output must be reviewed for syntactic correctness, semantic integrity, and jurisdictional conformance. This research measures clause-level text generation quality in complex legal documents, including cross-border M&A negotiations, using BLEU, ROUGE-L, and BERTScore. Three LLM architectures (an instruction-tuned GPT-4 model, Anthropic's Claude, and a fine-tuned LLaMA-3 model) are compared for statutory compliance, omission-risk reduction, and logical coherence across jurisdictions. Quantitative analyses show that BERTScore captures deep semantic and legal contextual correctness better than BLEU and ROUGE-L. For international contract enforceability and interpretive consistency, hybrid evaluation paradigms that combine semantic embeddings with domain-specific legal ontology validation are required.
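To make the comparison concrete, the two surface-overlap metrics named in the abstract can be sketched in plain Python. This is an illustrative sketch, not the paper's implementation: the add-one smoothing, the function names, and the sample indemnification clause are assumptions introduced here. Note how a single legally significant word swap ("shall" vs. "will") still leaves both scores high, which is exactly the blind spot the abstract attributes to n-gram metrics.

```python
from collections import Counter
import math

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU: geometric mean of modified n-gram precisions
    with a brevity penalty. Add-one smoothing (an assumption here) avoids
    log(0) on short clauses."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        overlap = sum((cand_ngrams & ref_ngrams).values())  # clipped matches
        total = sum(cand_ngrams.values())
        precisions.append((overlap + 1) / (total + 1))
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

def rouge_l(candidate, reference):
    """ROUGE-L F1: based on the longest common subsequence of tokens."""
    cand, ref = candidate.split(), reference.split()
    # Dynamic-programming table for LCS length.
    dp = [[0] * (len(ref) + 1) for _ in range(len(cand) + 1)]
    for i, c in enumerate(cand):
        for j, r in enumerate(ref):
            dp[i + 1][j + 1] = dp[i][j] + 1 if c == r else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    p, r = lcs / len(cand), lcs / len(ref)
    return 2 * p * r / (p + r)

# Hypothetical clause pair: one modal-verb substitution with real legal weight.
ref = "the indemnifying party shall hold harmless the indemnified party"
cand = "the indemnifying party will hold harmless the indemnified party"
print(round(bleu(cand, ref), 3), round(rouge_l(cand, ref), 3))
```

BERTScore, by contrast, compares contextual embeddings token by token, which is why the abstract argues it better captures semantic and legal-contextual correctness than these overlap counts.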

Published

11-06-2024

How to Cite

[1]
Pralohith Reddy Chinthalapelly, Amsa Selvaraj, and Chandan Jnana Murthy, “Evaluating LLM Outputs for Legal Contracts Using BLEU, ROUGE, and BERTScore ”, American J Data Sci Artif Intell Innov, vol. 4, pp. 229–262, Jun. 2024, Accessed: Mar. 07, 2026. [Online]. Available: https://ajdsai.org/index.php/publication/article/view/107