Authors:
(1) Rohit Saxena, Institute for Language, Cognition and Computation, School of Informatics, University of Edinburgh;
(2) Frank Keller, Institute for Language, Cognition and Computation, School of Informatics, University of Edinburgh.
5. Summarization Using Salient Scenes
We now investigate the benefit of using only salient scenes for the abstractive summarization of movie scripts. We formulate this task as a sequence-to-sequence generation problem: given a movie represented by its set of salient scenes M = {S1, S2, …, SK}, the goal is to generate a target summary Y = {y1, y2, …, ym}. As the combined length of the salient scenes is still quite large (see Figure 2), we use the Longformer Encoder-Decoder (LED) architecture (Beltagy et al., 2020). To handle long input sequences, LED's encoder combines efficient local attention with global attention on designated tokens. The decoder then applies full self-attention over the encoded tokens and the previously decoded positions to generate the summary.
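As an illustration, the sketch below runs concatenated salient scenes through an LED model with Hugging Face Transformers. The checkpoint name, the placeholder scenes, and the global-attention placement are our assumptions, not the authors' released code.

```python
# Minimal sketch: summarize concatenated salient scenes with LED.
import torch
from transformers import LEDTokenizer, LEDForConditionalGeneration

tokenizer = LEDTokenizer.from_pretrained("allenai/led-large-16384")
model = LEDForConditionalGeneration.from_pretrained("allenai/led-large-16384")

# Placeholder scenes; in practice these come from the saliency classifier.
salient_scenes = [
    "INT. APARTMENT - NIGHT. The detective studies the case board.",
    "EXT. HARBOR - DAY. A confrontation ends the investigation.",
]
inputs = tokenizer("\n\n".join(salient_scenes), max_length=16384,
                   truncation=True, return_tensors="pt")

# LED's encoder uses windowed local attention plus global attention on
# designated tokens; here only the first (<s>) token attends globally.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

summary_ids = model.generate(inputs["input_ids"],
                             attention_mask=inputs["attention_mask"],
                             global_attention_mask=global_attention_mask,
                             num_beams=5, max_length=512)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```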
5.1 Dataset
We used the same dataset and split as in Section 4.1, now with Wikipedia plot summaries as the target output for movie script summarization. However, instead of using the whole movie script, we take the output of our scene saliency model and feed only the salient scenes into the summarizer.
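A minimal sketch of this preprocessing step, assuming each movie record carries parallel lists of scenes and binary saliency labels; the field names are illustrative, not the actual Scriptbase schema.

```python
def build_example(movie: dict) -> dict:
    """Keep only scenes the saliency model marked salient; the Wikipedia
    plot summary is the generation target."""
    salient = [scene for scene, label
               in zip(movie["scenes"], movie["saliency"]) if label == 1]
    return {"source": "\n\n".join(salient),   # summarizer input
            "target": movie["plot_summary"]}  # Wikipedia summary
```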
5.2 Baselines
We compare the proposed model with various baselines. Lead-N simply outputs the first N tokens of the movie script as the summary. We varied N to understand the impact of summary length on performance and report results for Lead-512 and Lead-1024. FLAN-T5-XXL (Chung et al., 2022), FLAN-UL2 (Wei et al., 2022), Vicuna-13b-1.5 (Zheng et al., 2023), which is fine-tuned from Llama-2 (Touvron et al., 2023), and GPT-3.5-Turbo[4] (Brown et al., 2020) are instruction-tuned large language models (LLMs) used in a zero-shot setting. SUMM^N (Zhang et al., 2022) is a multi-stage summarization framework for long input dialogues and documents. Unlimiformer (Bertsch et al., 2023) uses a retrieval-based attention mechanism for long-document summarization. Two-Stage Heuristic (Pu et al., 2022) is a two-stage movie script summarization model which first selects essential sentences based on heuristics and then summarizes the text using LED with efficient fine-tuning. Random Selection randomly selects salient scenes for summarization. Full Text takes the full movie script as input (no content selection), truncated to the model's maximum input length.
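The Lead-N baseline is simple enough to state in a few lines; the whitespace tokenization below is an assumption, as the paper does not specify how tokens are counted for this baseline.

```python
def lead_n(script_text: str, n: int) -> str:
    """Return the first n whitespace tokens of the script as the summary."""
    return " ".join(script_text.split()[:n])

script_text = "FADE IN: EXT. CITY STREET - NIGHT ..."  # full movie script
lead_512, lead_1024 = lead_n(script_text, 512), lead_n(script_text, 1024)
```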
5.3 Implementation Details
We experimented with two pre-trained models, LED and Pegasus-X, as base models for summarization, fine-tuned on the Scriptbase corpus (see Section 4.1). Each input sequence for a movie is truncated to 16,384 tokens (including special tokens) to fit the maximum input length of the model. We experimented with both the base and large variants of these models, found that the large models performed better, and used them in our experiments. We used AdamW as the optimizer (β1 = 0.9, β2 = 0.99) with a learning rate of 5e-5 and a linear warmup strategy with 512 warmup steps. We trained the models for 60 epochs and used the checkpoint with the best validation score. We used a beam size of five for decoding. We also created a random selection baseline by selecting a random k% of scenes and using those to generate a summary; we report the best result, obtained for k = 25 with LED. All baseline models were fully trained on our dataset using the best configuration from their respective papers.
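For concreteness, the reported hyperparameters could be expressed with Hugging Face's Seq2SeqTrainingArguments roughly as follows. This is a sketch, not the authors' training script, and values the paper does not state (e.g., batch size) are placeholders.

```python
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="led-salient-scenes",
    learning_rate=5e-5,
    adam_beta1=0.9,
    adam_beta2=0.99,                 # AdamW betas from the paper
    lr_scheduler_type="linear",
    warmup_steps=512,                # linear warmup
    num_train_epochs=60,
    per_device_train_batch_size=1,   # placeholder, not stated in the paper
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,     # keep checkpoint with best validation score
    predict_with_generate=True,
    generation_num_beams=5,          # beam size of five for decoding
)
```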
5.4 Results
Table 5 shows our evaluation results using ROUGE (F1) scores and BERTScore on the Scriptbase corpus. Compared with the baseline models and previous work, our model achieves state-of-the-art results on all metrics. Specifically, our Select and Summarize model, which selects salient scenes, achieves 49.98, 12.11, and 47.95 ROUGE-1/2/L scores and also improves on BERTScore. Compared to a model that uses the full text of the movies, our model improves performance by 3.83, 1.49, and 3.49 ROUGE-1/2/L points, respectively. The Lead-N baseline achieves better results than Agarwal et al. (2022), with a ROUGE-1 of 17.69 for Lead-1024. Our model outperforms SUMM^N (Zhang et al., 2022), which can be attributed to better content selection using salient scenes compared to greedy content selection based on ROUGE: as named entities and places recur throughout a movie script, the greedy alignment used in SUMM^N can produce false positives. Unlimiformer's performance is low compared to our model and the two-stage model, possibly because it does not include explicit content selection. The Pu et al. (2022) model performs slightly better than Full Text, as removing sentences based on heuristics allows it to include movie script text that would otherwise be truncated. FLAN-UL2 performs better than GPT-3.5-Turbo and FLAN-T5-XXL in a zero-shot setting, but our fine-tuned model outperforms all three.
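The paper does not name its scoring packages, but ROUGE (F1) and BERTScore can be reproduced with the `evaluate` library along these lines; the prediction and reference strings are placeholders.

```python
import evaluate

rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

# Placeholder system outputs and gold Wikipedia summaries.
generated = ["A retired detective is pulled back in for one last case."]
gold = ["A retired detective returns to solve a final case."]

rouge_scores = rouge.compute(predictions=generated, references=gold)
bert_scores = bertscore.compute(predictions=generated, references=gold,
                                lang="en")
print(rouge_scores["rouge1"], rouge_scores["rouge2"], rouge_scores["rougeL"])
print(sum(bert_scores["f1"]) / len(bert_scores["f1"]))  # mean BERTScore F1
```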
We also experimented with Pegasus-X (Phang et al., 2023) instead of LED as the base summarization model for SELECT & SUMM. We found that both models perform better with our approach of selecting salient scenes than with the full text, with LED demonstrating superior performance.
Figure 2 also shows that our model yields improvements even though it uses only about half the length of the original script (only the salient scenes). This demonstrates the effectiveness of salient scene selection for movie script summarization. Appendix E shows generated summaries for two movies.
[4] We used the model gpt-3.5-turbo-1106, which has a context length of 16K tokens.