What is gym-summarizer
gym-summarizer is an OpenAI Gym environment for extractive summarization, a task where the summary of a text is generated by selecting a subset of sentences from it. This was a class project for COMP-767: Reinforcement Learning, built in collaboration with Alexander Nicholson.
In this environment, we define episodes, observations, actions, and rewards as follows:
Episodes are the generation of an entire summary from a given source text;
Observations are the concatenated sentence embeddings from the source text and from the summary generated so far, truncated or padded with zeros to predefined lengths. For example, we use BERT to generate 768-dimensional sentence embeddings and set a maximum source text length of 100 sentences and a summary length of 4 sentences, so the resulting observation is a 104x768 matrix.
Actions are the sentence indices for the source text, up to the predefined maximum number of sentences. For example: if an agent outputs actions 0,1,2,99 in an episode, the first 3 sentences and last sentence are selected for the summary. Every time a sentence is selected, its embedding is added to the corresponding summary slot in the next observation. If the agent takes an invalid action, such as a sentence index that has already been selected or that is out-of-bounds for the source text, it receives a negative reward and the state does not change.
Rewards are generated via different configurations of the ROUGE automatic evaluation metric. In particular, we extend ROUGE with potential-based reward shaping by measuring the incremental increase of ROUGE with each action. This allows dense rewards as opposed to the common sparse reward setting, where the ROUGE score isn’t calculated and observed as a reward until the end of the episode.
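The observation layout and the handling of invalid actions described above can be sketched in plain Python. This is a minimal illustration, not the project's code: the function names, the tiny embedding size, and the penalty value of -1.0 are assumptions chosen for the example (the text only says the reward is negative), and a real implementation would likely use NumPy arrays.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def pad_or_truncate(rows: Sequence[Vector], length: int, dim: int) -> List[Vector]:
    """Return exactly `length` rows, truncating extras or padding with zero rows."""
    kept = [list(row) for row in rows[:length]]
    kept.extend([0.0] * dim for _ in range(length - len(kept)))
    return kept


def build_observation(source: Sequence[Vector], summary: Sequence[Vector],
                      max_source: int, max_summary: int, dim: int) -> List[Vector]:
    """Stack the padded source rows on top of the padded summary rows."""
    return (pad_or_truncate(source, max_source, dim)
            + pad_or_truncate(summary, max_summary, dim))


def select_sentence(action: int, num_sentences: int, selected: Sequence[int],
                    max_source: int, invalid_penalty: float = -1.0
                    ) -> Tuple[List[int], float]:
    """Apply one action; invalid actions leave the selection unchanged.

    `invalid_penalty` is a hypothetical value used only for illustration.
    """
    in_bounds = 0 <= action < min(num_sentences, max_source)
    if not in_bounds or action in selected:
        return list(selected), invalid_penalty
    return list(selected) + [action], 0.0


# Toy setup: 3 source sentences with 4-dimensional stand-in embeddings.
DIM = 4
source = [[float(i + 1)] * DIM for i in range(3)]

selected, p_valid = select_sentence(0, len(source), [], max_source=5)
selected, p_repeat = select_sentence(0, len(source), selected, max_source=5)
selected, p_oob = select_sentence(4, len(source), selected, max_source=5)

summary = [source[i] for i in selected]
obs = build_observation(source, summary, max_source=5, max_summary=2, dim=DIM)
```

With the settings from the example in the text (100 source sentences, 4 summary sentences, 768 dimensions), the same stacking would give the 104x768 observation.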
Why we built gym-summarizer
We built gym-summarizer to explore three issues we saw with current approaches to applying reinforcement learning to summarization:
Reproducibility and baselines
By creating a gym environment for summarization, we wanted to leverage its standard interface to make it easier to run reproducible experiments and baselines. While most approaches use either REINFORCE or actor-critic algorithms, we wanted to make it easy to use the variety of out-of-the-box models from stable-baselines, for example to compare value-based and policy-based methods.
Credit assignment problem
Most approaches use sparse terminal rewards, meaning you only observe a reward once a whole summary has been generated. This is because BLEU and ROUGE, which are commonly used as rewards, are sequence-level metrics. The result is a significant credit assignment problem. For example, if you generate the first half of a summary flawlessly but the second half degenerates into poor output, you have no way of learning that distinction if you only obtain one reward at the end of the summary. In contrast, obtaining a reward signal for every word generated could address this credit assignment issue. We wanted to see if this kind of dense reward could improve learning and performance.
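To make the contrast concrete, here is a small sketch comparing sparse and dense rewards for the same episode. It uses a clipped unigram recall as a simplified stand-in for ROUGE-1 recall, not a full ROUGE implementation, and the function names and example sentences are made up for illustration.

```python
from collections import Counter
from typing import List


def unigram_recall(summary_sentences: List[str], reference: str) -> float:
    """Simplified ROUGE-1 recall: clipped unigram overlap / reference length."""
    ref_counts = Counter(reference.lower().split())
    sum_counts = Counter(" ".join(summary_sentences).lower().split())
    overlap = sum((ref_counts & sum_counts).values())
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0


def dense_rewards(picked: List[str], reference: str) -> List[float]:
    """Reward each step with the increase in score caused by that step."""
    rewards, previous = [], 0.0
    for t in range(1, len(picked) + 1):
        score = unigram_recall(picked[:t], reference)
        rewards.append(score - previous)
        previous = score
    return rewards


def sparse_rewards(picked: List[str], reference: str) -> List[float]:
    """Reward only the final step with the score of the whole summary."""
    if not picked:
        return []
    return [0.0] * (len(picked) - 1) + [unigram_recall(picked, reference)]


reference = "the cat sat on the mat"
picked = ["the cat sat", "on the mat", "stocks fell sharply"]

dense = dense_rewards(picked, reference)    # [0.5, 0.5, 0.0]
sparse = sparse_rewards(picked, reference)  # [0.0, 0.0, 1.0]
```

Both sequences sum to the same final score, because the incremental differences telescope. Only the dense version shows that the third sentence contributed nothing. Note that this stand-in is recall-only and so never decreases; with a precision-sensitive score such as ROUGE F1, an incremental reward can also be negative.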
Most approaches involve supervised pretraining with a maximum likelihood objective, followed by fine-tuning with a reinforcement learning objective. We wanted to see if effective summarization could be learned solely with RL.
How we built gym-summarizer
We use an existing preprocessed version of the CNN-DM dataset.
We use bert-as-service and bert-base-uncased to pregenerate sentence embeddings for each document in the dataset.
Stable baselines algorithms
We use stable-baselines to implement and train the PPO2, DQN, and A2C algorithms.
Effect of dense reward
Using a CNN-LSTM policy and PPO2, we compared intermediate (dense) rewards with terminal (sparse) rewards. We found that intermediate rewards improved final performance on a held-out evaluation set.
Effect of algorithm
We also experimented with different algorithms (PPO2, DQN, A2C) and policy network architectures (MLP, LSTM, CNN-LSTM). However, we found no significant change in performance.
Limitations of end-to-end RL
We found our RL-generated summaries consistently underperformed a simple Lead-3 baseline (where a summary is generated by taking the first three sentences).
This is likely due to limitations of our sentence representations: the embeddings were pregenerated and could not be fine-tuned in our approach.
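For reference, the Lead-3 baseline mentioned above is simple enough to write in a few lines. The sketch below uses hypothetical helper names; the second function expresses the same baseline as the sequence of sentence-index actions an agent would need to output in the environment.

```python
from typing import List


def lead_n(sentences: List[str], n: int = 3) -> List[str]:
    """Lead-n baseline: the summary is the first n sentences of the document."""
    return sentences[:n]


def lead_n_actions(num_sentences: int, n: int = 3) -> List[int]:
    """The same baseline as sentence-index actions (0, 1, 2 for Lead-3)."""
    return list(range(min(n, num_sentences)))


document = [
    "First sentence.",
    "Second sentence.",
    "Third sentence.",
    "Fourth sentence.",
]
baseline_summary = lead_n(document)
baseline_actions = lead_n_actions(len(document))
```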