
Understanding the BLEU Score for Translation Model Evaluation


David Kirchhoff

Software Engineer

The BLEU (Bilingual Evaluation Understudy) score is a metric for evaluating the quality of text generated by machine translation models by comparing it to one or more reference translations. It was created to provide an automatic, objective, and reproducible way to assess translation quality, which previously relied heavily on subjective human judgment [1].

In this post we will first explore the intuition behind BLEU, then formalize its definition, and finally implement it in Python.

Intuition

To come up with the BLEU score, the authors addressed the question of what makes a machine translation good. Their answer was that a translation is good if it closely matches a good human translation. BLEU works by comparing each machine-translated sentence against one or more human reference translations and aggregating the resulting counts over the entire corpus to get a final score. This score correlates well with human judgment of translation quality.

How can we automatically check how close a candidate machine translation is to a reference human translation? BLEU measures the overlap of n-grams (contiguous sequences of n words/tokens) in a sentence between the generated text and the reference texts. Let’s look at an example to understand what that means:

  • Candidate: the the the the the the the.
  • Reference 1: The cat is on the mat.
  • Reference 2: There is a cat on the mat.

A naive approach to measure the overlap of words from the candidate with the references is to look at each word in the candidate in isolation and see if it appears in the reference. By using these unigrams, we can calculate the precision using

P = \frac{\text{number of candidate words in any reference}}{\text{total number of words in candidate}}

In the example, we get

P = \frac{7}{7} = 1

because all words in the candidate appear in the references, and the candidate contains 7 words. If the candidate additionally contained a word which is not in the references, for example dog, the precision would be lower, at P = 7/8. So far so good.

However, one can see that the translation is of poor quality: it does not convey the information about the cat and the mat that the references contain (lack of adequacy), and it lacks fluency. The problem is that so far we only reward sharing a large number of words between candidate and references.

To fix this, we have to discourage the usage of repeated words. To do so, BLEU clips the word count to the maximum number of times the word appears in any reference. In this example, the appears twice in reference 1 and once in reference 2. This yields a modified unigram precision of:

P_{m} = \frac{2}{7}
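As a quick sanity check, here is a minimal sketch of this clipped unigram counting for the example above, using Python's collections.Counter (the full implementation follows later in the post):

from collections import Counter

candidate = "the the the the the the the".split()
references = ["the cat is on the mat".split(), "there is a cat on the mat".split()]

# Count unigrams in the candidate and the maximum count of each unigram in any reference
candidate_counts = Counter(candidate)
max_ref_counts = Counter()
for reference in references:
    for word, count in Counter(reference).items():
        max_ref_counts[word] = max(max_ref_counts[word], count)

# Clip each candidate count to the maximum count observed in any reference
clipped = sum(min(count, max_ref_counts[word]) for word, count in candidate_counts.items())
print(clipped, "/", len(candidate))  # 2 / 7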

This modified unigram precision introduces a new problem. To achieve a high modified unigram precision, we can shorten the candidate like this:

  • Candidate: the the.
  • Reference 1: The cat is on the mat.
  • Reference 2: There is a cat on the mat.

Now, we would again get a score of 1, because the appears twice in reference 1.

To discourage such short translations, a brevity penalty is introduced. This, as well as how to calculate the score for a corpus consisting of many sentences, is explained next.

BLEU Score Calculation

Calculating the BLEU score consists of three steps: n-gram matching, brevity penalty calculation, and final score calculation.

N-gram Matching

So far, we have used unigrams (single words) to calculate precision. BLEU extends this to n-grams of various lengths (usually up to 4-grams): the n-gram precision is the proportion of n-grams in the candidate text that also appear in the reference texts, with clipped counts.

To evaluate a corpus of text, BLEU maintains sentences as units for evaluation. The corpus is broken up into sentences, and for each candidate sentence C, all n-grams are formed and their counts are aggregated.

For example, for the candidate above with n = 4, the distinct n-grams would be: the, the the, the the the, and the the the the.
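For a single list of tokens, the n-grams can be formed with a simple sliding window (the same expression reappears in the implementation below), shown here for bigrams over reference 1:

tokens = "the cat is on the mat".split()
n = 2
n_grams = [tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]
print(n_grams)
# [('the', 'cat'), ('cat', 'is'), ('is', 'on'), ('on', 'the'), ('the', 'mat')]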

The corpus-level precision P_n is calculated by summing the clipped n-gram counts over all candidate sentences C in the corpus and dividing by the total number of n-grams across all candidates:

P_{n} = \frac{\sum_{C \in \text{Candidates}} \sum_{\text{n-gram} \in C} \text{Count}_{\text{clip}}(\text{n-gram})}{\sum_{C' \in \text{Candidates}} \sum_{\text{n-gram}' \in C'} \text{Count}(\text{n-gram}')}

Brevity Penalty

In the intuition section, we saw that the modified precision already discourages overly long candidate sentences (their length is in the denominator), but it can still reward candidates that are too short.

To prevent very short candidates from receiving high scores, BLEU applies a brevity penalty when a candidate is shorter than its reference. The penalty is an exponential decay in r/c, where c is the candidate length and r is the reference length. Instead of calculating the penalty for every sentence individually, it is calculated over the entire corpus: penalizing each short sentence on its own would be overly harsh, since translations are sometimes short for valid reasons.

At the corpus level, c is the total length of the candidate translations, and r is the effective reference length, obtained by summing, for each candidate sentence, the length of the best-matching reference. This ensures that the generated text is adequately comprehensive.

The brevity penalty is calculated as:

BP = \begin{cases} 1 & \text{if } c > r \\ e^{(1 - r/c)} & \text{if } c \leq r \end{cases}
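As a worked example, treat the shortened candidate the the. from above as the entire corpus and take reference 1 (6 words) as the reference length. Then c = 2 and r = 6, so

BP = e^{1 - 6/2} = e^{-2} \approx 0.14

which heavily discounts the otherwise perfect modified precision of that candidate.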

Final Score Calculation

The BLEU score is the geometric mean of the n-gram precisions, multiplied by the brevity penalty. The final BLEU score is calculated as:

BLEU = BP \cdot \exp \left( \frac{1}{N} \sum_{n=1}^{N} \log P_{n} \right)

where N is the maximum n-gram length used. Note that by using 1/N we are applying uniform weights, meaning every n-gram order is equally important. BLEU can also use different weights for each n-gram order, but I won't go into details here.
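For completeness, the general form from the paper uses per-order weights w_n that sum to 1:

BLEU = BP \cdot \exp \left( \sum_{n=1}^{N} w_{n} \log P_{n} \right)

With uniform weights w_n = 1/N this reduces to the formula above.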

It is important to understand that the BLEU score is sensitive to the number and quality of reference translations used. A score of 1 is only possible if the candidate matches one of the references exactly. When comparing BLEU scores from different datasets, it is crucial to consider the number of reference translations and their quality to ensure a fair comparison. More reference translations generally provide a better evaluation of the candidate translation's quality.

Implementation

Now that we understand what BLEU measures and how it is calculated, let’s implement it. We'll start by preparing our inputs, then calculate the n-gram precisions, and finally put it all together with the brevity penalty. For simplicity, we will support single-sentence evaluation only. You can find a notebook with the code here.

Note that this implementation is mainly for educational purposes, as popular ML libraries already come with BLEU score implementations. Examples include PyTorch [2], Keras [3], and NLTK [4].

As a first step we need the following imports.

from collections import Counter
import math
import numpy as np

Preprocessing

The BLEU score calculation works with sentences as the basic unit. For each sentence, we compare it against one or more reference translations.

Given these requirements, we'll use a list of lists of strings, where each inner list represents a sentence and each word is an element of that inner list. Additionally, we will case-fold all strings and remove punctuation.

To construct this list of lists of words, we use this function to convert a list of sentences.

def preprocess_text(sentences: list[str]) -> list[list[str]]:
    """Splits a list of sentences into a list of lists of words, removes punctuation, and case-folds.

    Args:
        sentences: A list of strings, each string being one sentence.

    Returns:
        A list of lists where each inner list contains the words from the corresponding input string.
    """

    if not isinstance(sentences, list):
        raise ValueError("Expecting a list of strings as input.")
    return [["".join(char.lower() for char in word if char.isalnum()) for word in sentence.split()] for sentence in sentences]
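To illustrate what the preprocessing produces, here is a quick example:

print(preprocess_text(["The cat is on the mat."]))
# [['the', 'cat', 'is', 'on', 'the', 'mat']]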

In the next steps, we'll calculate the n-gram precisions and integrate the brevity penalty into our BLEU score calculation.

n-gram Precision

Using the preprocessing function, we can ensure that our inputs are always in the correct format. To calculate the n-gram precision, we define a function that takes a candidate sentence and a list of references, and returns the result as a tuple containing the numerator and denominator. This makes it easier to verify the results.

def n_gram_precision(candidate: list[str], references: list[list[str]], n: int) -> tuple[int, int]:
    """Calculates the n-gram precision for a candidate given a list of references.
    
    Args:
        candidate: A list of words that make up a sentence.
        references: A list of reference sentences, each being a list of words.
        n: The parameter for the n-gram.

    Returns:
        The clipped n-gram count and the total n-gram count.
    """

    if n < 1:
        raise ValueError("n must be greater than or equal to 1.")

    # Count n-grams in candidate
    candidate_n_grams = Counter([tuple(candidate[i:i+n]) for i in range(len(candidate)-n+1)])

    # Count n-grams in references and take the maximum counts for each n-gram
    max_reference_n_grams = Counter()
    for reference in references:
        reference_n_grams = Counter([tuple(reference[i:i+n]) for i in range(len(reference)-n+1)])
        for n_gram in reference_n_grams:
            max_reference_n_grams[n_gram] = max(max_reference_n_grams[n_gram], reference_n_grams[n_gram])

    # Clip counts
    clipped_counts = {ng: min(count, max_reference_n_grams[ng]) for ng, count in candidate_n_grams.items()}
    
    return sum(clipped_counts.values()), sum(candidate_n_grams.values())

We can test this function with some examples from the paper.

# Example 1
candidate11 = preprocess_text(["It is a guide to action which ensures that the military always obeys the commands of the party."])
candidate12 = preprocess_text(["It is to insure the troops forever hearing the activity guidebook that party direct."])
references1 = preprocess_text(["It is a guide to action that ensures that the military will forever heed Party commands.", 
                               "It is the guiding principle which guarantees the military forces always being under the command of the Party.",
                               "It is the practical guide for the army always to heed the directions of the party."])

# Example 2
candidate21 = preprocess_text(["the the the the the the the."])
references2 = preprocess_text(["The cat is on the mat.", "There is a cat on the mat."])

# Tests
assert(n_gram_precision(candidate11[0], references1, 1) == (17,18))
assert(n_gram_precision(candidate12[0], references1, 1) == (8,14))
assert(n_gram_precision(candidate21[0], references2, 1) == (2,7))

Brevity Penalty

The final part we need is the brevity penalty calculation. This function penalizes shorter candidate translations to avoid rewarding incomplete translations.

def brevity_penalty(candidate: list[str], references: list[list[str]]) -> float:
    """Calculates the brevity penalty for a candidate sentence against reference translations.
    
    Args:
        candidate: A list of words from the candidate sentence.
        references: A list of lists of words from the reference translations.
    
    Returns:
        The brevity penalty score.
    """

    c = len(candidate)
    # Use the length of the shortest reference as r; the original paper instead uses
    # the reference length closest to the candidate length ("best match length")
    r = min(len(reference) for reference in references)
    
    if c > r:
        return 1
    else:
        return math.exp(1 - r / c)

We can test it with the following examples.

candidate1 = ["A", "test"]
references1 = [["A", "test"], ["Another", "test"] ]
references2 = [["A", "test", "hello"], ["Another", "test", "longer", "better?"], ["And", "another", "test", "longer", "worse!"] ]

assert(brevity_penalty(candidate1, references1) == 1.0)
assert(brevity_penalty(candidate1, references2) == math.exp(-0.5))

Next, we can combine these functions to compute the final BLEU score.

Final BLEU Score Calculation

Using the three components (n-gram precision, brevity penalty, and geometric mean), we can now calculate the final BLEU score. This function combines all the steps and provides a comprehensive evaluation of the candidate translation against the reference translations.

def bleu_score_sentence(candidate_seq: str, references_seqs: list[str], max_n: int=4) -> float:
    """Calculates the BLEU score with uniform weights for a candidate given a list of references.
    
    Args:
        candidate_seq: A string with the candidate sentence.
        references_seqs: A list of strings with reference sentences, one string for each sentence.
        max_n: The number up to which n-grams should be calculated.
    
    Returns:
        BLEU Score     
    """

    candidate_seq: list[str] = preprocess_text([candidate_seq])[0]
    references_seqs: list[list[str]] = preprocess_text(references_seqs)

    precisions = []

    for n in range(1, max_n+1):
        p_n, total = n_gram_precision(candidate_seq, references_seqs, n)
        precisions.append(p_n / total if total > 0 else 0)
    
    # If no n-gram order has any match at all, the score is 0
    if all(p == 0 for p in precisions):
        return 0

    # Geometric mean of the precisions; note that zero precisions are skipped here,
    # whereas standard BLEU would let a single zero drive the whole score to 0
    # unless smoothing is applied
    geometric_mean = np.exp(np.mean([math.log(p) for p in precisions if p > 0]))
    bp = brevity_penalty(candidate_seq, references_seqs)

    return bp * geometric_mean

To test the function, we can use an example candidate sentence and reference translations.

candidate_seq = "This is a test."
reference = ["This is a test.", "This is a test."]
score_own = bleu_score_sentence(candidate_seq, reference, max_n=4)
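Since the candidate matches the references exactly, every n-gram precision is 1 and the brevity penalty is 1, so the score should come out as 1.0:

print(score_own)  # 1.0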

And that is it! This implementation allows you to evaluate the quality of a machine translation by comparing it to multiple reference translations, providing a score that correlates well with human judgment.

Conclusion

The BLEU score provides a valuable metric for evaluating the quality of machine translations by measuring n-gram overlap with reference translations. Clipped n-gram counts and the brevity penalty compensate for the lack of an explicit recall term, making it a robust and widely used evaluation method. However, interpreting BLEU scores requires careful consideration of the number and quality of reference translations. While a higher BLEU score generally indicates better translation quality, it is essential to ensure consistent evaluation conditions across different datasets to make meaningful comparisons. Understanding and using BLEU appropriately can significantly enhance the development and assessment of machine translation systems and, more broadly, sequence-to-sequence models.

References

[1] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics (ACL '02). Association for Computational Linguistics, USA, 311–318. https://doi.org/10.3115/1073083.1073135

[2] BLEU score implementation in PyTorch

[3] BLEU score implementation in Keras

[4] BLEU score implementation in the Natural Language Toolkit (NLTK)
