Evaluating LLM Outputs for Legal Contracts Using BLEU, ROUGE, and BERTScore
Keywords:
Large Language Models, BLEU, ROUGE-L, BERTScore, legal contract evaluation

Abstract
As LLMs increasingly automate legal contract drafting, their output must be reviewed for syntactic correctness, semantic integrity, and jurisdictional conformance. This research measures clause-level text generation quality in complex legal documents, including cross-border M&A agreements, using BLEU, ROUGE-L, and BERTScore. Three LLM architectures (an instruction-tuned GPT-4 model, Anthropic's Claude, and a fine-tuned LLaMA-3 model) are compared on statutory compliance, omission-risk reduction, and logical coherence across jurisdictions. Quantitative results show that BERTScore captures deep semantic and legal-contextual correctness better than BLEU and ROUGE-L. Ensuring the enforceability and consistent interpretation of international contracts will require hybrid evaluation paradigms that combine semantic embeddings with domain-specific legal-ontology validation.
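To make the clause-level evaluation concrete, the following is a minimal, self-contained sketch of the n-gram metrics used here: a simplified sentence-level BLEU (n-gram precision with a brevity penalty) and ROUGE-L F1 (longest common subsequence). The example clauses are hypothetical, not drawn from the study's corpus, and BERTScore is omitted because it requires a pretrained embedding model rather than pure token matching.

```python
# Simplified clause-level BLEU and ROUGE-L, assuming a reference clause
# (e.g. lawyer-drafted) and an LLM-generated candidate. Illustrative only;
# production evaluation would use standard implementations.
from collections import Counter
import math

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=2):
    """Geometric mean of n-gram precisions, scaled by a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = Counter(ngrams(cand, n)), Counter(ngrams(ref, n))
        overlap = sum((cand_counts & ref_counts).values())  # clipped matches
        precisions.append(overlap / max(sum(cand_counts.values()), 1))
    if min(precisions) == 0:
        return 0.0
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

def rouge_l(candidate, reference):
    """ROUGE-L F1 based on the longest common subsequence of tokens."""
    cand, ref = candidate.split(), reference.split()
    # Dynamic-programming table for LCS length
    dp = [[0] * (len(ref) + 1) for _ in range(len(cand) + 1)]
    for i, c in enumerate(cand):
        for j, r in enumerate(ref):
            dp[i + 1][j + 1] = dp[i][j] + 1 if c == r else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(cand), lcs / len(ref)
    return 2 * prec * rec / (prec + rec)

# Hypothetical indemnification clauses for illustration
ref = "the indemnifying party shall hold harmless the indemnified party"
gen = "the indemnifying party shall indemnify and hold harmless the indemnified party"
print(round(bleu(gen, ref), 3), round(rouge_l(gen, ref), 3))
```

Note how the candidate scores high on ROUGE-L because the reference survives as a subsequence, even though its surface n-gram overlap (BLEU) is lower; neither metric registers whether the added wording changes the clause's legal effect, which is the gap BERTScore and ontology-based validation aim to close.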