This is an overview of recent methods for evaluating and improving factual consistency in automatic summarization.
Automatic summarization is a common task in natural language processing with many real-world applications. A system that generates factually incorrect or inconsistent summaries is useless at best, and downright harmful in applications such as radiology report summarization.
Consistent vs. correct
The distinction between factual consistency and correctness is an important one. The factual consistency of a summary is determined by its agreement with facts in the source document. In contrast, factual correctness looks at agreement with facts in some external knowledge base. For example, I may summarize a factually incorrect news article with perfect factual consistency. Unfortunately, determining factual correctness is an incredibly difficult challenge even for very smart humans, thus most research has focused on factual consistency.
Extractive vs. abstractive
Originally, most summarization systems were extractive, i.e. they worked by extracting sentences from source documents and stitching them together into a summary. This made factual consistency more or less a non-issue. However, with the rise of seq2seq models and large-scale pretrained language representation models, there has been an increasing number of abstractive systems, which generate summaries token by token and are more prone to producing statistically likely but factually inconsistent output.
The problem of inconsistency
With the improvement of abstractive systems, the field of summarization has grown quite rapidly as shown by the citation histogram for Lin & Hovy (2002).
However, this growth has led Kryściński et al. (2019) to identify several problems with the current state of affairs, including factual inconsistency and the lack of its evaluation.
How prevalent is inconsistency?
More specifically, Kryściński et al. (2019) found that 30% of 200 summaries generated by state-of-the-art abstractive models on documents from the CNN-DM dataset were factually inconsistent.
Similarly, Cao et al. (2017) report that ~30% of summaries generated by a seq2seq model with attention are factually inconsistent, based on a sample of 100 summaries from the Gigaword dataset. In contrast, Goodrich et al. (2019) report ~17% inconsistent summaries for the same model on two samples of 30 summaries generated from Wikipedia articles.
Likewise, Falke et al. (2019) find that, for a sample of 100 CNN-DM documents, state-of-the-art models generate inconsistent summaries ~25-26% of the time. However, this number drops to ~8% for pointer-generator models with coverage.
Examples of inconsistency
Cao et al. (2017) attempt to improve the factual consistency of abstractive models by conditioning on facts extracted from the source document.
They use OpenIE to extract subject-predicate-object fact triples. They then use a dependency parser to extract subject-predicate and predicate-object fact tuples that are not captured by OpenIE. The dependency parser also allows them to filter out “insignificant” patterns like “somebody said/declared/announced”.
For a document, each fact is concatenated into a string, and these are then joined with a special separator token to create a textual representation of the facts (or relations) in the document. The authors then encode both the document and its facts and use a dual-attention decoder to generate a summary conditioned on both.
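A minimal sketch of this fact-string construction (the separator token and the example triples below are placeholders of mine, not the paper's):

```python
# Sketch of building the textual "facts" input for dual-attention decoding.
# The triples stand in for OpenIE / dependency-parser output; the separator
# token "|||" is a placeholder, not the token used by Cao et al.
triples = [
    ("the president", "visited", "paris"),
    ("the visit", "lasted", "two days"),
]

def facts_to_string(triples, sep=" ||| "):
    # Each fact becomes a space-joined string; facts are joined by a
    # special separator token so the encoder can tell them apart.
    return sep.join(" ".join(fact) for fact in triples)

fact_input = facts_to_string(triples)
```

This fact string is encoded alongside the document itself, giving the decoder two attention contexts.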
Using this method, the authors find a statistically significant improvement in informativeness over previous state-of-the-art models, as measured by ROUGE. They also perform a human evaluation of factual consistency on a sample of 100 summaries and find that consistency improves from 68% to 87%.
Goodrich et al. (2019) attempt to create a framework for automatic evaluation of factual consistency based on the overlap of fact triplets.
The authors use Wikipedia articles and associated facts from the Wikidata knowledge base to create a dataset of reference summaries and associated facts (subject-relation-object triples). The subject in a fact can only be the article topic, and relations are constrained to the ten most frequent ones.
As a result of these simplifying constraints, the authors can approximate factual consistency as the number of overlapping subject-relation-object triplets relative to the number of overlapping subject-relation pairs between a reference and candidate summary.
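Under these constraints, the approximation reduces to simple set arithmetic; a sketch (function and variable names are mine):

```python
def triple_consistency(reference_facts, candidate_facts):
    """Approximate factual consistency as in Goodrich et al. (2019):
    the number of (subject, relation, object) triples shared by reference
    and candidate, relative to the (subject, relation) pairs they share."""
    shared_pairs = {(s, r) for s, r, _ in reference_facts} & \
                   {(s, r) for s, r, _ in candidate_facts}
    if not shared_pairs:
        return 0.0  # no comparable facts between the two summaries
    shared_triples = set(reference_facts) & set(candidate_facts)
    return len(shared_triples) / len(shared_pairs)

ref = {("Person1", "born in", "Country1"), ("Person1", "profession", "painter")}
cand = {("Person1", "born in", "Country2"), ("Person1", "profession", "painter")}
# Two shared (subject, relation) pairs, but only one agrees on the object.
```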
To extract facts from text, the authors train a seq2seq model on their dataset in a novel structured prediction task, where the input text

“Person1 was born in Country1. He was a painter”

has the target output

“Person1 <t> born in <t> Country1 <f> Person1 <t> profession <t> painter <end>”.

Here <t> separates the elements within a fact and <f> separates facts. During prediction, decoding continues until <end> is predicted.
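Decoding the predicted string back into triples is then simple string splitting; a rough sketch (the real tokenizer-level handling is presumably more involved):

```python
def parse_facts(output):
    # Stop at <end>, split facts on <f>, and elements within a fact on <t>.
    output = output.split("<end>")[0]
    facts = []
    for fact in output.split("<f>"):
        parts = [p.strip() for p in fact.split("<t>")]
        if len(parts) == 3:  # keep only well-formed (subject, relation, object)
            facts.append(tuple(parts))
    return facts

pred = "Person1 <t> born in <t> Country1 <f> Person1 <t> profession <t> painter <end>"
```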
To evaluate the effectiveness of their automatic evaluation, the authors sample Wikipedia article summaries and ask 4 evaluators to rate them on a scale of 1 to 5 for factual accuracy. They then measure the Spearman correlation between different automatic evaluation metrics and these human scores.
On a subset of 30 summaries where their fact-extraction model performed better (articles about Actors), they obtain a correlation of 0.67 compared to 0.64 for ROUGE-2. On a subset of 30 summaries sampled from all articles, they obtain a correlation of 0.45 compared to 0.44 for ROUGE-2.
The authors also investigate the use of NER + relation classification for fact extraction. This approach generally had higher recall and lower precision, suggesting it overestimates the number of facts compared to their seq2seq model, which had higher precision and higher correlations with human scores.
Falke et al. (2019) attempt to improve the factual consistency of summarization systems by reranking candidate summaries generated during beam search, using a score based on natural language inference (NLI, or logical entailment).
The authors argue that NLI can act as a proxy for factual consistency, since the information in a factually consistent summary should be entailed by the source document.
They use five different NLI models trained on the MultiNLI dataset to score every sentence pair, pairing each source-document sentence with each summary sentence. The score for a summary sentence is then the maximum score over all pairs it appears in, and the score for a summary is the average of its sentence scores.
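The aggregation can be sketched as follows, with a toy word-overlap stub standing in for a real NLI model (the stub is for illustration only):

```python
def summary_score(source_sents, summary_sents, entail_prob):
    # Each summary sentence gets the max entailment probability over all
    # source sentences; the summary score is the mean over its sentences.
    sent_scores = [
        max(entail_prob(src, tgt) for src in source_sents)
        for tgt in summary_sents
    ]
    return sum(sent_scores) / len(sent_scores)

# Toy stub: "entailment" is just word overlap with the hypothesis.
def overlap_prob(premise, hypothesis):
    p, h = set(premise.split()), set(hypothesis.split())
    return len(p & h) / len(h)

score = summary_score(
    ["the cat sat on the mat", "it was raining"],
    ["the cat sat", "it was sunny"],
    overlap_prob,
)
```

Swapping `overlap_prob` for an actual NLI model's entailment probability gives the reranking score used by the authors.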
With 200 documents from the CNN-DM validation set, the authors used state-of-the-art abstractive models and beam search to sample 5 summaries per document. These summaries were then evaluated for factual consistency by humans, finding that 107 out of the 200 documents had both consistent and inconsistent summaries. For each of these documents, the 5 summaries were re-ranked with the authors’ approach.
Unfortunately, the reranking had a negligible impact, leaving close to half of the originally inconsistent summaries in the first position. The authors conclude that out-of-the-box NLI models are not suitable for evaluating or improving factual consistency, since performance on this task does not track performance on standard NLI benchmarks.
Li et al. (2018) attempt to improve factual consistency with an entailment-aware encoder and decoder.
The authors propose a multi-task encoder, trained on both summarization (Gigaword dataset) and entailment (SNLI dataset). They further incorporate entailment information in the decoder by adopting reward-augmented maximum likelihood (RAML) training. This data augmentation method involves generating permutations of the target summary and sampling them, with probability proportional to some reward function (in this case, entailment score), during training.
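A minimal sketch of the RAML sampling step, assuming candidate targets have already been scored by some reward function (the candidates, rewards, and temperature below are made up for illustration):

```python
import math
import random

def raml_sample(candidates, rewards, temperature=1.0, rng=random):
    # Sample one candidate target with probability proportional to
    # exp(reward / temperature); here the reward would be an entailment
    # score for the candidate against the source document.
    weights = [math.exp(r / temperature) for r in rewards]
    return rng.choices(candidates, weights=weights, k=1)[0]
```

The temperature controls how sharply training concentrates on high-reward (here, strongly entailed) candidates.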
The authors find that their method, applied to a standard seq2seq model with attention and selective coverage, increases the proportion of factually consistent summaries from 69.4% to 74.2%, as measured by 5 grad students who had the misfortune to evaluate 500 summaries for different models. The authors also found that their method improved informativeness as measured by ROUGE.
Kryściński et al. (2020) attempt to evaluate factual consistency by training a model to distinguish consistent from inconsistent summaries on a new dataset.
The authors use rule-based transformations to create a new dataset of factually consistent and inconsistent summaries. These text transformations include paraphrasing, negation, noise injection, and swapping of pronouns, entities, and numbers. Examples of these transforms are shown below:
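As an illustration, toy versions of the negation and number-swap transforms might look like this (these are my simplified stand-ins, not the paper's actual transforms, which operate on parses and NER output rather than raw strings):

```python
def negate(sentence):
    # Toy negation transform: flip "was" to "was not", turning a
    # consistent claim into an inconsistent one.
    return sentence.replace(" was ", " was not ", 1)

def swap_number(sentence, old, new):
    # Toy number-swap transform: replace one number with a different one.
    return sentence.replace(old, new, 1)

claim = "The meeting was held on 5 May."
inconsistent_1 = negate(claim)                  # negated claim
inconsistent_2 = swap_number(claim, "5", "7")   # wrong date
```

Applying such transforms to reference sentences yields labeled consistent/inconsistent training pairs at scale.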
The authors then use this dataset to train a model on three tasks:
- Predict whether a transformed sentence is factually consistent;
- Extract a span from the source document that supports the prediction;
- If the sentence is inconsistent, extract the supporting span from the summary.
The authors find that their method outperforms previous entailment-based methods with a weighted accuracy (consistent/inconsistent classification) of ~74% as opposed to ~52%. However, they also remark that their model cannot capture commonsense or multi-sentence consistency mistakes, and that these are hard to formulate as rule-based transforms.
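Assuming "weighted accuracy" here means class-balanced accuracy, i.e. the mean of per-class accuracies (an assumption on my part), it can be computed as:

```python
def weighted_accuracy(y_true, y_pred):
    # Mean of per-class accuracies over the labels present in y_true,
    # so the majority class cannot dominate the score.
    per_class = []
    for label in set(y_true):
        idx = [i for i, t in enumerate(y_true) if t == label]
        correct = sum(1 for i in idx if y_pred[i] == label)
        per_class.append(correct / len(idx))
    return sum(per_class) / len(per_class)

# 3/4 consistent ("C") correct and 1/2 inconsistent ("I") correct.
y_true = ["C", "C", "C", "C", "I", "I"]
y_pred = ["C", "C", "C", "I", "I", "C"]
```

This matters because consistent summaries far outnumber inconsistent ones, so plain accuracy would reward always predicting "consistent".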
Recent approaches for evaluating or improving factual consistency fall into one of three categories:
- Extracting structured fact representations;
- Predicting logical entailment;
- Data augmentation with rule-based transforms;
While logical entailment can help increase model performance, I don’t think it can address the issue of factual consistency beyond a surface level. In principle it could work; however, the amount and diversity of data needed to generalize to different kinds of factual inconsistency seems intractable. Similarly, data augmentation is promising, but I can’t see it scaling to a point that allows generalization. In contrast, structured fact representation is, in my opinion, the most fundamentally sound approach. While it comes with its own slew of problems, related both to discrete representation (e.g. disambiguating two entities or relations that are the same despite being written out differently) and to the difficulty of open information extraction, these don’t seem like fundamental limitations. I prefer to think of them as promising areas of future research.