Text Summarization

Text summarization is the process of constructing a fluent and concise summary while retaining the key meaning of the original text (Allahyari et al., 2017; Torres-Moreno, 2014). In the digital age, where textual content on the internet grows rapidly every day, manual summarization takes too much time and effort (El-Kassas et al., 2021). Automatic Text Summarization (ATS) resolves this dilemma by generating summaries through software systems (El-Kassas et al., 2021; Torres-Moreno, 2014).

Approaches

An ATS system may take a single document or multiple documents as input. In either case, the source document(s) are processed by an automatic text summarizer following one of two general approaches: extractive or abstractive (El-Kassas et al., 2021; Li et al., 2009).1

Extractive Text Summarization

Words from the original document(s) are assigned scores using linguistic or statistical features (El-Kassas et al., 2021; Gudivada, 2018; Suleiman & Awajan, 2020). Subsequently, the words with the highest scores are extracted to form a summary (Li et al., 2009).


Example 1 (Garbade, 2018): Extractive Text Summarization

Source text: Joseph and Mary rode on a donkey to attend the annual event in Jerusalem. In the city, Mary gave birth to a child named Jesus.

Extractive summary: Joseph and Mary attend event Jerusalem. Mary birth Jesus.


Pre-Processing

The pre-processing stage constructs a structured format of the original text using linguistic techniques such as word tokenization, stop-word removal, and stemming (El-Kassas et al., 2021).
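
As a rough, hypothetical illustration of these three steps (the stop-word list and suffix stripper below are toy stand-ins, not from the cited sources; a real system would use a library stemmer such as NLTK's PorterStemmer), consider the following Python sketch:

  # Toy pre-processing sketch: tokenization, stop-word removal, crude stemming.
  # Illustrative only; not taken from the cited sources.
  import re

  STOP_WORDS = {"a", "an", "and", "in", "on", "the", "to", "of"}  # tiny demo list

  def preprocess(text: str) -> list[str]:
      tokens = re.findall(r"[a-z']+", text.lower())        # word tokenization
      tokens = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal
      stems = []
      for t in tokens:
          for suffix in ("ing", "ed", "s"):                # naive suffix stripping
              if t.endswith(suffix) and len(t) > len(suffix) + 2:
                  t = t[: -len(suffix)]
                  break
          stems.append(t)
      return stems

  print(preprocess("Joseph and Mary rode on a donkey, attending the annual event."))
  # ['joseph', 'mary', 'rode', 'donkey', 'attend', 'annual', 'event']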

Processing

The processing stage includes three tasks: establishing an intermediate representation of the text, scoring sentences, and selecting sentences with high scores (El-Kassas et al., 2021).

Intermediate Representation

The structured format is stored as an intermediate representation (Allahyari et al., 2017; Nenkova & McKeown, 2012). The following are the two main types of intermediate representation:

Topic Representation

Topic representation approaches use techniques such as TF-IDF or lexical chains (Allahyari et al., 2017; Nenkova & McKeown, 2012).2

  • TF-IDF: the words and their corresponding weights are presented in a table (Nenkova & McKeown, 2012); see the first sketch after this list.
  • Lexical chain: related terms are identified using a thesaurus such as WordNet (Gudivada, 2018; Nenkova & McKeown, 2012).3 For instance, “New York” and “city” would be linked as related terms.
  • N-gram: a consecutive sequence of n terms is retrieved (Li et al., 2009). However, as n increases, discontinuous key terms may be missed (Li et al., 2009). For instance, with n<5 for “I’d like to go to China, if your parents permitted, with you”, the relationship between “go to China” and “with you” is ignored (Li et al., 2009, p.180). Thus, Li et al. (2009) proposed that term proximity and query expansion may help extract important information scattered across sentences.
  • Edit distance: implemented to quantify how similar two strings are (Jurafsky & Martin, 2020). For instance, “Stanford President Marc Tessier-Lavigne” and “Stanford University President Marc Tessier-Lavigne” are judged as similar strings based on edit distance (Jurafsky & Martin, 2020, p.21); see the second sketch after this list.
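
To make the TF-IDF "table" idea concrete, here is a minimal Python sketch (the data and naming are illustrative; real systems apply smoothing and other refinements omitted here):

  # TF-IDF sketch: weight = (term frequency in sentence) x log(N / document frequency).
  # Note: with no smoothing, a word appearing in every sentence gets weight 0.
  import math
  from collections import Counter

  sentences = [
      ["joseph", "mary", "rode", "donkey", "event", "jerusalem"],
      ["city", "mary", "gave", "birth", "child", "jesus"],
  ]

  def tf_idf(sentences):
      n = len(sentences)
      df = Counter(w for sent in sentences for w in set(sent))  # document frequency
      tables = []
      for sent in sentences:
          tf = Counter(sent)
          tables.append({w: (tf[w] / len(sent)) * math.log(n / df[w]) for w in tf})
      return tables

  for table in tf_idf(sentences):
      print(table)  # one word-to-weight table per sentence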
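
And a textbook dynamic-programming sketch of edit distance (this is the standard Levenshtein algorithm, not necessarily the exact formulation used in the cited sources):

  # Levenshtein edit distance: minimum single-character insertions, deletions,
  # and substitutions needed to turn string a into string b.
  def edit_distance(a: str, b: str) -> int:
      dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
      for i in range(len(a) + 1):
          dp[i][0] = i                                  # delete all of a[:i]
      for j in range(len(b) + 1):
          dp[0][j] = j                                  # insert all of b[:j]
      for i in range(1, len(a) + 1):
          for j in range(1, len(b) + 1):
              cost = 0 if a[i - 1] == b[j - 1] else 1
              dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                             dp[i][j - 1] + 1,          # insertion
                             dp[i - 1][j - 1] + cost)   # match or substitution
      return dp[len(a)][len(b)]

  # A small distance relative to string length suggests the strings are similar:
  print(edit_distance("Stanford President", "Stanford University President"))  # 11
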
Indicator Representation

Indicator representation approaches, such as graph models like LexRank, represent the whole document as a network of inter-linked sentences (Gudivada, 2018; Nenkova & McKeown, 2012).4

Scoring Sentences

Depending on the approach, a score is assigned to each sentence in the intermediate representation (Allahyari et al., 2017).

  • Topic representation approaches: the score of each sentence reflects how well it expresses the important topics of the text (Allahyari et al., 2017).
  • Indicator representation approaches: the score is calculated using machine learning to find indicator weights and aggregate the results (Allahyari et al., 2017). For instance, LexRank considers a particular sentence important when it is similar to other sentences (Gudivada, 2018); a sketch follows this list.
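
A minimal sketch of this LexRank-style scoring follows (simplified: it uses raw word-overlap cosine similarity, keeps self-loops, and omits the similarity threshold of the published algorithm):

  # LexRank-style sentence scoring: build a sentence-similarity graph and run
  # power iteration (PageRank-like) so central sentences receive high scores.
  from collections import Counter
  from math import sqrt

  def cosine(a, b):
      ca, cb = Counter(a), Counter(b)
      dot = sum(ca[w] * cb[w] for w in ca)
      norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
      return dot / norm if norm else 0.0

  def lexrank_scores(sentences, iterations=50, damping=0.85):
      n = len(sentences)
      sim = [[cosine(s, t) for t in sentences] for s in sentences]
      row_sums = [sum(row) or 1.0 for row in sim]
      scores = [1.0 / n] * n
      for _ in range(iterations):  # power iteration over the similarity graph
          scores = [
              (1 - damping) / n
              + damping * sum(sim[j][i] / row_sums[j] * scores[j] for j in range(n))
              for i in range(n)
          ]
      return scores

  sents = [
      ["mary", "jesus", "jerusalem"],
      ["mary", "jesus", "born"],
      ["jerusalem", "event", "donkey"],
  ]
  print(lexrank_scores(sents))  # the first sentence, similar to both others, scores highest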

Selecting Sentences

The summarizer system chooses the top-scored sentences and creates a summary (Allahyari et al., 2017; Suleiman & Awajan, 2020). Creating a summary from these sentences can follow several approaches, such as the best-n approach, which selects the top n sentences from the sentence ranking (Gudivada, 2018).5
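
A minimal best-n sketch (hypothetical data; the scores would come from the previous stage) selects the top n sentences and restores their original order:

  # Best-n selection: keep the n highest-scoring sentences, then re-sort the
  # chosen indices into document order so the summary reads coherently.
  def best_n(sentences, scores, n=2):
      ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
      chosen = sorted(ranked[:n])
      return " ".join(sentences[i] for i in chosen)

  sentences = ["Sentence one.", "Sentence two.", "Sentence three."]
  print(best_n(sentences, scores=[0.2, 0.7, 0.5]))  # "Sentence two. Sentence three."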

Post-Processing

The post-processing stage includes steps such as replacing pronouns with their antecedent nouns (e.g. changing they/their/them to the people referred to) and reordering the extracted sentences to form a fluent summary (El-Kassas et al., 2021; Gupta & Lehal, 2010).
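
As a toy illustration of the pronoun-replacement step (the antecedent mapping is assumed to come from an upstream coreference resolver, which is not implemented here):

  # Replace pronouns with antecedents supplied by a (hypothetical) coreference step.
  import re

  def replace_pronouns(sentence: str, antecedents: dict[str, str]) -> str:
      def swap(match: re.Match) -> str:
          word = match.group(0)
          return antecedents.get(word.lower(), word)
      return re.sub(r"\b\w+\b", swap, sentence)

  print(replace_pronouns("They rode to Jerusalem.", {"they": "Joseph and Mary"}))
  # "Joseph and Mary rode to Jerusalem."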

Abstractive Text Summarization

Using natural language processing and machine learning, the summarizer generates new phrases by understanding and paraphrasing crucial information from the original text (Allahyari et al., 2017; Gudivada, 2018; Suleiman & Awajan, 2020). For instance, sentence compression removes irrelevant detail from sentences (Gupta & Gupta, 2019).


Example 2 (Garbade, 2018): Abstractive Text Summarization

Source text: Joseph and Mary rode on a donkey to attend the annual event in Jerusalem. In the city, Mary gave birth to a child named Jesus.

Abstractive summary: Joseph and Mary came to Jerusalem where Jesus was born.


Pre-Processing

Pre-processing is the same as for extractive text summarization.

Processing

The processing stage includes two tasks: establishing an intermediate representation of the text and generating the summary (El-Kassas et al., 2021).

Intermediate Representation & Summary Generation

The following are the two main categories of approaches for generating abstractive summaries:6

Structured-Based

Structured-based approaches use template or tree methods to retrieve the most crucial information from the text (Gupta & Gupta, 2019; Suleiman & Awajan, 2020).

  • The template methods extract sentence fragments using keywords and organize them into templates to formulate the final summary (Gupta & Gupta, 2019). For instance, “The project manager goes through the minutes of the last meeting” (Oya, 2014, p.23) becomes a template such as “[speaker] goes through [evidence] of the last meeting” (Oya, 2014, p.23); see the sketch after this list.
  • The tree methods extract the important text considered for the summary and organize it into a tree-like structure (Gupta & Gupta, 2019). For instance, Barzilay and McKeown (2005) proposed a method that combines fragments of semantically related sentences into new sentences.
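
As a minimal sketch of the template idea, using the Oya (2014) template quoted above (the slot values would come from an information-extraction step that is assumed rather than implemented):

  # Fill a summary template with extracted slot values.
  TEMPLATE = "[speaker] goes through [evidence] of the last meeting"

  def fill(template: str, slots: dict[str, str]) -> str:
      out = template
      for name, value in slots.items():
          out = out.replace(f"[{name}]", value)
      return out

  print(fill(TEMPLATE, {"speaker": "The project manager", "evidence": "the minutes"}))
  # "The project manager goes through the minutes of the last meeting"
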
Semantic-Based

Semantic-based approaches focus on the meaning of the text (Suleiman & Awajan, 2020). A popular method is the semantic graph-based approach (Gupta & Gupta, 2019). Semantic graphs are built by extracting nouns and adjectives from the sentences and are then ranked using PageRank scores (Joshi, Wang, & McClean, 2018).

Post-Processing

Post-processing is the same as for extractive text summarization.

Challenges

Long Length Documents

The majority of current ATS systems focus on short (often multi-document) inputs, and they achieve low accuracy and efficiency when used to summarize long texts such as books (El-Kassas et al., 2021; Wang et al., 2017; Wu et al., 2017). Current systems analyze all the words in long passages and generate the summary word by word sequentially, which makes them slow and prone to repetition when producing multi-sentence summaries (Chen & Bansal, 2018).

Concept Generalization and Fusion

Concept generalization and fusion merge different concepts into a single, more general concept (Gupta & Gupta, 2019). This is one of the most challenging areas of abstractive summarization, since systems need algorithms that can produce well-generalized sentences while keeping summaries concise (Gupta & Gupta, 2019).

Rare Words

For abstractive text summarization, rare words, i.e. words the ATS system has seldom encountered, might be ignored by the system (Gupta & Gupta, 2019). Since rare words (e.g. a specific person’s name) and their syntactic structure are important in summaries, enabling systems to identify these components automatically is a challenge in natural language processing (Gupta & Gupta, 2019).

Areas of Future Work

Abstractive Summarization

Extractive summarization applications have been the main focus of the scientific community (El-Kassas et al., 2021; Li et al., 2009). However, extractive ATS systems produce summaries that are not as fluent as human-generated ones (El-Kassas et al., 2021). Thus, researchers are now focusing more on abstractive summarization, which promises automatic generation of human-like synopses (El-Kassas et al., 2021; Torres-Moreno, 2014).

Multimedia Summaries

Most ATS systems handle textual input and output to this day (El-Kassas et al., 2021). New summarizers are needed that can process sounds and videos, creating multimedia summaries (El-Kassas et al., 2021; Torres-Moreno, 2014).

Quality Dataset

Currently, the majority of datasets used to train abstractive text summarization systems consist of news articles (Gupta & Gupta, 2019). The research community needs large, high-quality datasets to train and develop abstractive summarizers (Gupta & Gupta, 2019; Suleiman & Awajan, 2020).

Bibliography

Allahyari, M., Pouriyeh, S., Assefi, M., Safaei, S., Trippe, E. D., Gutierrez, J. B., & Kochut, K. (2017). Text summarization techniques: A brief survey. arXiv. https://arxiv.org/abs/1707.02268

Barzilay, R., & McKeown, K. R. (2005). Sentence fusion for multidocument news summarization. Computational Linguistics, 31(3), 297-327.

Chen, Y.-C., & Bansal, M. (2018). Fast abstractive summarization with reinforce-selected sentence rewriting. arXiv. https://arxiv.org/abs/1805.11080

El-Kassas, W. S., Salama, C. R., Rafea, A. A., & Mohamed, H. K. (2021). Automatic text summarization: A comprehensive survey. Expert Systems with Applications, 165. doi:10.1016/j.eswa.2020.113679

Garbade, M. J. (2018, September 19). A quick introduction to text summarization in machine learning. Towards Data Science. https://towardsdatascience.com/a-quick-introduction-to-text-summarization-in-machine-learning-3d27ccf18a9f

Gudivada, V. N. (2018). Natural language core tasks and applications. In V. N. Gudivada & C. R. Rao (Eds.), Computational analysis and understanding of natural languages: Principles, methods and applications (Vol. 38, pp. 403-428). doi:10.1016/bs.host.2018.07.010

Gupta, S., & Gupta, S. K. (2019). Abstractive summarization: An overview of the state of the art. Expert Systems With Applications, 121, 49-65. doi:10.1016/j.eswa.2018.12.011

Joshi, M., Wang, H., & McClean, S. (2018). Dense semantic graph and its application in single document summarization. In C. Lai, A. Giuliani, & G. Semeraro (Eds.), Emerging ideas on information filtering and retrieval (Vol. 746, pp. 55-67). Springer. doi:10.1007/978-3-319-68392-8_4

Jurafsky, D., & Martin, J. H. (2020). Regular expressions, text normalization, edit distance. In Speech and language processing (3rd ed. draft). https://web.stanford.edu/~jurafsky/slp3/2.pdf

Li, S., Zhang, Y., Wang, W., & Wang, C. (2009). Using proximity in query focused multi-document extractive summarization. In W. Li & D. Mollá-Aliod (Eds.), Computer processing of oriental languages, language technology for the knowledge-based economy (Vol. 5459, pp. 179-188). Springer. doi:10.1007/978-3-642-00831-3_17

Nenkova, A., & McKeown, K. (2012). A survey of text summarization techniques. In C. C. Aggarwal & C. Zhai (Eds.), Mining text data (pp. 43-76). Springer. doi:10.1007/978-1-4614-3223-4_3

Oya, T. (2014). Automatic abstractive summarization of meeting conversations [Master’s thesis, The University of British Columbia]. UBC Theses and Dissertations. doi: 10.14288/1.0165907

Suleiman, D., & Awajan, A. (2020). Deep learning based abstractive text summarization: Approaches, datasets, evaluation measures, and challenges. Mathematical Problems in Engineering. doi: 10.1155/2020/9365340

Torres-Moreno, J.-M. (2014). Automatic text summarization. Wiley.

Wang, S., Zhao, X., Li, B., Ge, B., & Tang, D. (2017). Integrating extractive and abstractive models for long text summarization. 2017 IEEE International Congress on Big Data. IEEE. doi:10.1109/BigDataCongress.2017.46

Wu, Z., Lei, L., Li, G., Huang, H., Zheng, C., Chen, E., & Xu, G. (2017). A topic modeling based approach to novel document automatic summarization. Expert Systems With Applications, 84, 12-23. doi:10.1016/j.eswa.2017.04.054

Footnotes

  1. There is also a hybrid approach that combines the techniques used in extractive and abstractive approaches (El-Kassas et al., 2021).
  2. Topic representation approaches also include Latent Semantic Analysis and Bayesian topic models (Nenkova & McKeown, 2012).
  3. For more details on WordNet, see Miller, G. A. (1995). WordNet: a lexical database for English. Communications of the ACM. doi:10.1145/219717.219748
  4. For more details on LexRank, see Erkan, G., & Radev, D. R. (2004). LexRank: Graph-based lexical centrality as salience in text summarization. The Journal of Artificial Intelligence Research, 22, 457-479. doi:10.1613/jair.1523
  5. Selecting sentences can also implement the global selection approach that chooses sentences based on optimization procedures or the maximal marginal relevance that uses an iterative procedure, recalculating the sentence scores after each step (Gudivada, 2018).
  6. Another category is deep learning, which strives to imitate human thinking by extracting features at different levels of abstraction (Suleiman & Awajan, 2020). An example is RNN encoder-decoder summarization, where each generated word, together with the input, determines the next output word, following a sequential path that produces the summary (Suleiman & Awajan, 2020).