LLM Evaluation: Frameworks, Benchmarks, and Best Practices

A deep dive into how we measure the intelligence, reliability, and safety of Large Language Models.

Reference-based metrics

Reference-based metrics score a model's output by comparing it against one or more human-written "gold" references.

As Large Language Models (LLMs) become increasingly integrated into complex workflows, the question of evaluation has moved from a niche research topic to a critical engineering requirement. Unlike traditional software, where unit tests provide deterministic outcomes, LLMs are probabilistic, making “correctness” a moving target.

In this post, we explore the current landscape of LLM evaluation, covering essential metrics, popular benchmarks, and the emerging trend of using LLMs to judge other LLMs.

BLEU: n-gram overlap, no semantics

BLEU (Bilingual Evaluation Understudy) scores a candidate against one or more references by counting overlapping n-grams, with a brevity penalty for overly short outputs. It is a useful baseline for large-scale evaluation, but it is not sufficient on its own and is typically combined with human evaluation or newer semantic metrics.

Limitations

Because BLEU matches surface tokens only, it gives no credit for synonyms or paraphrases, ignores meaning entirely, and correlates poorly with human judgment on open-ended generation tasks.

ROUGE: recall-oriented overlap, no semantics

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is the standard counterpart for summarization: it measures how much reference content the candidate recovers, using n-gram overlap (ROUGE-1, ROUGE-2) and longest common subsequence (ROUGE-L). It shares BLEU's core weakness: a paraphrase that preserves meaning with different words scores poorly.
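Both metrics are cheap enough to run over large test sets. Below is a minimal sketch, assuming the sacrebleu and rouge-score packages are installed; the example strings are illustrative only.

# Compute BLEU and ROUGE for a toy hypothesis/reference pair.
# Assumes: pip install sacrebleu rouge-score
import sacrebleu
from rouge_score import rouge_scorer

hypotheses = ["The JWST can see the first galaxies."]
references = ["The JWST is able to observe the earliest galaxies."]

# Corpus-level BLEU: modified n-gram precision with a brevity penalty.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.1f}")

# ROUGE-1 (unigrams) and ROUGE-L (longest common subsequence).
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(references[0], hypotheses[0])  # (target, prediction)
print(f"ROUGE-1 F1: {scores['rouge1'].fmeasure:.3f}")
print(f"ROUGE-L F1: {scores['rougeL'].fmeasure:.3f}")

Note that the paraphrase pair here ("see" vs. "observe") earns no overlap credit, which is exactly the weakness BERTScore addresses next.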

BERTScore (Zhang et al., ICLR 2020, Cornell): semantic similarity

BERTScore replaces exact n-gram matching with cosine similarity between contextual token embeddings. Each token t in the candidate c is greedily matched to its most similar token in the reference R (and vice versa), where e_x denotes the embedding of token x:

\begin{equation*} \text{Precision} = \frac{1}{|c|} \sum_{t \in c} \max_{r \in R} \cos(e_t, e_r) \end{equation*}

\begin{equation*} \text{Recall} = \frac{1}{|R|} \sum_{r \in R} \max_{t \in c} \cos(e_t, e_r) \end{equation*}

BERTScore

Why it works well: matching in embedding space gives credit to synonyms and paraphrases that BLEU and ROUGE miss ("observe" vs. "see"), so BERTScore correlates much better with human judgments. Precision and Recall are usually combined into an F1 score.

Limitations: scores depend on the choice of encoder (and are not comparable across encoders), computation is far more expensive than n-gram counting, raw values are hard to interpret without baseline rescaling, and, like BLEU and ROUGE, it still requires gold references.
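In practice BERTScore is usually computed with the authors' bert-score package. A minimal sketch, assuming the package is installed; the lang flag selects a default English encoder, and the strings are the same toy pair as above.

# Assumes: pip install bert-score
from bert_score import score

candidates = ["The JWST can see the first galaxies."]
references = ["The JWST is able to observe the earliest galaxies."]

# Returns per-example Precision, Recall, and F1 tensors.
P, R, F1 = score(candidates, references, lang="en")
print(f"P={P[0].item():.3f}  R={R[0].item():.3f}  F1={F1[0].item():.3f}")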

Reference-free metrics

Reference-free metrics evaluate output quality without gold references, which matters when references are expensive or impossible to collect.

BLANC: a reference-free quality metric for summarization

BLANC (Vasilyev et al., 2020) estimates how useful a summary is without any gold reference: a good summary should help a masked language model fill in blanked-out words in the source document.

How it works

  1. The Source Document: The James Webb Space Telescope (JWST) is the largest optical telescope in space. A collaboration between NASA, ESA, and CSA, its high-resolution instruments allow it to observe some of the most distant events and objects in the universe, such as the formation of the first galaxies.

  2. The Generated Summary: The JWST, a large space telescope from NASA, ESA, and CSA, can see the first galaxies.

  3. Masking the Source: BLANC takes the original source document and “masks” (blanks out) important keywords, typically nouns and verbs. The James Webb Space Telescope (JWST) is the largest optical [MASK] in space. A [MASK] between NASA, ESA, and CSA, its high-resolution instruments allow it to [MASK] some of the most distant events and objects in the universe, such as the formation of the first galaxies.

  4. The Two Test Conditions:
    • Condition A (Baseline): A pre-trained language model (like BERT) is asked to fill in the blanks in the masked source without any other context.
    • Let’s say the model predicts:
       [MASK] -> "telescope" (Correct)
       [MASK] -> "project" (Incorrect, the word was "collaboration")
       [MASK] -> "see" (Incorrect, the word was "observe")
    • Baseline Accuracy: 1 out of 3 correct = 33%

    • Condition B (Summary-Cued): The same model is now given the generated summary as context and asked to fill in the same blanks.
    • Now with the context: “The JWST, a large space telescope… can see the first galaxies.”
    • The model predicts:
       [MASK] -> "telescope" (Correct)
       [MASK] -> "collaboration" (Correct)
       [MASK] -> "observe" (Correct)
      
    • Summary-Cued Accuracy: 3 out of 3 correct = 100%
  5. Calculating the BLANC Score:
    • The final BLANC score is simply the difference in accuracy between the summary-cued condition and the baseline.
    • BLANC Score = (Summary-Cued Accuracy) - (Baseline Accuracy)
    • BLANC Score = 100% - 33% = 67% (a code sketch of this procedure follows below)
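To make the mechanics concrete, here is a simplified sketch of the two test conditions using Hugging Face transformers. It is an illustration of the idea, not the reference BLANC implementation: the masking policy is reduced to the three keywords from the walkthrough, and bert-base-uncased is simply an assumed choice of model.

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def unmask_accuracy(context, source, words):
    # Mask each keyword in turn and check whether the model recovers it,
    # optionally prepending `context` (the candidate summary).
    correct = 0
    for word in words:
        masked = source.replace(word, tokenizer.mask_token, 1)
        text = f"{context} {masked}".strip()
        inputs = tokenizer(text, return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = model(**inputs).logits
        # Position of the [MASK] token, then the model's top prediction for it.
        pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0]
        pred = tokenizer.decode(logits[0, pos].argmax(dim=-1)).strip()
        if pred == word:
            correct += 1
    return correct / len(words)

source = ("The James Webb Space Telescope (JWST) is the largest optical "
          "telescope in space. A collaboration between NASA, ESA, and CSA, "
          "its high-resolution instruments allow it to observe some of the "
          "most distant events and objects in the universe.")
summary = ("The JWST, a large space telescope from NASA, ESA, and CSA, "
           "can see the first galaxies.")
words = ["telescope", "collaboration", "observe"]

baseline = unmask_accuracy("", source, words)   # Condition A
cued = unmask_accuracy(summary, source, words)  # Condition B
print(f"BLANC score: {cued - baseline:.2f}")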

LLM-as-a-Judge

Criteria

An LLM judge is prompted to grade outputs against explicit criteria, such as relevance, coherence, fluency, factual consistency, helpfulness, and harmlessness. Spelling the criteria out in the prompt matters: vague instructions yield noisy scores.

LLM evaluators can judge text based on different inputs:

  • the candidate output alone, graded against the criteria (reference-free);
  • the output plus the source or prompt, e.g. checking a summary against its article;
  • the output plus a gold reference answer (reference-based);
  • two or more candidate outputs at once (pairwise or multiple-choice comparison).

How it works

Generating scores

Likert Scale Scoring: the judge assigns an absolute rating on a fixed scale. A typical prompt template, with {placeholders} filled in at evaluation time:
Evaluate the quality of summaries written for a news article. Rate each summary on four
dimensions: {Dimension_1}, {Dimension_2}, {Dimension_3}, and {Dimension_4}. You should
rate on a scale from 1 (worst) to 5 (best).
Article: {Article}
Summary: {Summary}
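A minimal sketch of how such a prompt is wired up in code, using the openai package; the model name and the single "coherence" dimension are assumptions for the demo, not requirements.

import re
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = """Evaluate the quality of the summary written for a news article.
Rate the summary on coherence, on a scale from 1 (worst) to 5 (best).
Answer with the number only.
Article: {article}
Summary: {summary}"""

def likert_score(article, summary):
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model; any capable LLM works
        messages=[{"role": "user",
                   "content": PROMPT.format(article=article, summary=summary)}],
        temperature=0,  # reduce run-to-run variance in the judgment
    )
    match = re.search(r"[1-5]", resp.choices[0].message.content)
    return int(match.group()) if match else None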

Pairwise comparisons

Instead of an absolute score, the judge sees two candidate responses and picks the one that better satisfies the criteria. Pairwise verdicts are easier to elicit reliably than calibrated scores, but they suffer from position bias (a tendency to prefer the first-listed answer), so each pair is typically judged twice with the order swapped, as in the sketch below.
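A sketch of position-debiased pairwise judging, under the same assumptions as the previous example (openai package, hypothetical judge model). Each pair is judged twice with the order swapped; if the verdict flips with position, the result is reported as a tie.

from openai import OpenAI

client = OpenAI()

PAIRWISE_PROMPT = """Which response better answers the question?
Reply with exactly "A" or "B".
Question: {question}
Response A: {a}
Response B: {b}"""

def pairwise_winner(question, resp_a, resp_b):
    def ask(a, b):
        r = client.chat.completions.create(
            model="gpt-4o-mini",  # assumed judge model
            messages=[{"role": "user", "content": PAIRWISE_PROMPT.format(
                question=question, a=a, b=b)}],
            temperature=0,
        )
        return r.choices[0].message.content.strip().upper()[:1]

    first = ask(resp_a, resp_b)   # original order
    second = ask(resp_b, resp_a)  # order swapped
    if first == "A" and second == "B":
        return "A"  # resp_a wins in both orders
    if first == "B" and second == "A":
        return "B"  # resp_b wins in both orders
    return "tie"    # verdict flipped with position: treat as a tie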

Solving Yes/No questions

The judge answers a binary question about the output, e.g. "Does the summary contain any information that is not supported by the article? Answer Yes or No." Binary verdicts are easy to aggregate into pass rates and work well for factual-consistency and safety checks.

Making multiple-choice selections

The judge picks the best response from several labeled candidates (A/B/C/...), generalizing the pairwise setting. As with pairwise judging, candidate order should be shuffled across runs to control for position bias.