Evaluating LLM Outputs for Legal Contracts Using BLEU, ROUGE, and BERTScore
Keywords:
Large Language Models, BLEU, ROUGE-L, BERTScore, legal contract evaluation

Abstract
As LLMs increasingly automate legal contract drafting, their output must be reviewed for syntactic correctness, semantic integrity, and jurisdictional conformance. This research measures clause-level text generation quality in complex legal documents, including cross-border M&A agreements, using BLEU, ROUGE-L, and BERTScore. Three LLM architectures (an instruction-tuned GPT-4 model, Anthropic's Claude, and a fine-tuned LLaMA-3 model) are compared on statutory compliance, omission-risk reduction, and logical coherence across jurisdictions. Quantitative results show that BERTScore captures deep semantic and legal-contextual correctness better than BLEU and ROUGE-L. Ensuring the enforceability and consistent interpretation of international contracts will require hybrid evaluation paradigms that combine semantic embeddings with domain-specific legal-ontology validation.
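To make the clause-level evaluation concrete, the following is a minimal, self-contained sketch of the n-gram metrics used here: a simplified sentence-level BLEU (n-gram precision with a brevity penalty) and ROUGE-L F1 (longest common subsequence). The example clauses are hypothetical, not drawn from the study's corpus, and BERTScore is omitted because it requires a pretrained embedding model rather than pure token matching.

```python
# Simplified clause-level BLEU and ROUGE-L, assuming a reference clause
# (e.g. lawyer-drafted) and an LLM-generated candidate. Illustrative only;
# production evaluation would use standard implementations.
from collections import Counter
import math

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=2):
    """Geometric mean of n-gram precisions, scaled by a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = Counter(ngrams(cand, n)), Counter(ngrams(ref, n))
        overlap = sum((cand_counts & ref_counts).values())  # clipped matches
        precisions.append(overlap / max(sum(cand_counts.values()), 1))
    if min(precisions) == 0:
        return 0.0
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

def rouge_l(candidate, reference):
    """ROUGE-L F1 based on the longest common subsequence of tokens."""
    cand, ref = candidate.split(), reference.split()
    # Dynamic-programming table for LCS length
    dp = [[0] * (len(ref) + 1) for _ in range(len(cand) + 1)]
    for i, c in enumerate(cand):
        for j, r in enumerate(ref):
            dp[i + 1][j + 1] = dp[i][j] + 1 if c == r else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(cand), lcs / len(ref)
    return 2 * prec * rec / (prec + rec)

# Hypothetical indemnification clauses for illustration
ref = "the indemnifying party shall hold harmless the indemnified party"
gen = "the indemnifying party shall indemnify and hold harmless the indemnified party"
print(round(bleu(gen, ref), 3), round(rouge_l(gen, ref), 3))
```

Note how the candidate scores high on ROUGE-L because the reference survives as a subsequence, even though its surface n-gram overlap (BLEU) is lower; neither metric registers whether the added wording changes the clause's legal effect, which is the gap BERTScore and ontology-based validation aim to close.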