Generating Text Summaries (News Headlines) with EasyNLP

Authors: Wang Ming, Huang Jun

Overview

Text generation is an important research direction in natural language processing, with rich practical applications and research value. Generative text summarization is an important sub-task of text generation; in practice it covers tasks such as news headline generation, abstract generation, and keyword generation. Although pre-trained language models such as BERT, MASS, and uniLM have achieved impressive performance in NLU scenarios, the word- and subword-level masked language modeling objectives they use are not well suited to text generation, especially generative summarization. The reason is that generative summarization requires the model to understand semantics at a coarser granularity, such as sentences and paragraphs, in order to produce a summary. To address this problem, the PEGASUS model (PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization) designed an unsupervised pre-training task tailored to text summarization, Gap Sentence Generation (GSG), which randomly masks several full sentences in a document and asks the model to generate the masked sentences. Because this pre-training task closely matches the actual summarization task, the pre-trained model achieves good summarization quality after only simple fine-tuning. We have therefore integrated the PEGASUS algorithm and models into the EasyNLP framework, so that users can conveniently train and run prediction for text summarization tasks.

EasyNLP (https://github.com/alibaba/EasyNLP) is an easy-to-use and feature-rich NLP algorithm framework developed by the Alibaba Cloud Machine Learning PAI team on top of PyTorch. It continuously tracks NLP pre-trained models and model deployment technology, and provides a one-stop NLP development experience from training to deployment. EasyNLP offers a concise interface for developing NLP models, including the application zoo AppZoo and the pre-trained ModelZoo, along with technology that helps users efficiently bring very large pre-trained models into production. As a sub-task of natural language processing, text generation has many practical applications, including title generation, text summarization, machine translation, question answering, dialogue systems, and more. EasyNLP is therefore gradually expanding its support for text generation sub-tasks, hoping to serve more NLP and NLG developers and researchers, and to work with the community to promote the development and adoption of NLG technology.

This article provides a technical interpretation of PEGASUS and shows how to use PEGASUS-related models in the EasyNLP framework for text summarization (news headline) generation.

A Detailed Explanation of the PEGASUS Model

Before PEGASUS, pre-trained text generation models such as T5 and BART had achieved clear gains on many text generation tasks, but for text summarization there remained a large gap between these models' pre-training objectives and the summarization objective. As a result, when such pre-trained models are transferred to summarization tasks in different domains, they still need considerable training data for fine-tuning to achieve good results. To alleviate this problem, PEGASUS adds a full-sentence masking loss on top of the original subword masking loss: several random complete sentences in the input document are masked, and the model is trained to recover them.
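To make the GSG objective concrete, the following is a minimal illustrative sketch (our own simplification, not the actual PEGASUS preprocessing code) of how a document can be turned into a (masked input, generation target) pre-training pair:

```python
import random

MASK1 = "[mask1]"  # sentence-level mask token, following the paper's notation

def make_gsg_example(sentences, mask_ratio=0.3, seed=0):
    """Mask full sentences in the input and use them (in order) as the target."""
    rng = random.Random(seed)
    m = max(1, int(len(sentences) * mask_ratio))
    picked = set(rng.sample(range(len(sentences)), m))  # the "Random" strategy
    masked_input = " ".join(
        MASK1 if i in picked else s for i, s in enumerate(sentences)
    )
    target = " ".join(s for i, s in enumerate(sentences) if i in picked)
    return masked_input, target

doc = ["PEGASUS is a pre-trained model.",
       "It masks whole sentences during pre-training.",
       "The decoder must reconstruct them."]
print(make_gsg_example(doc))
```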

Specifically, as shown in the architecture figure, PEGASUS adopts a standard Transformer encoder-decoder architecture. The model applies two types of masking to the input. The first is the subword masking used by BERT, denoted by [mask2], where the encoder must restore the masked subwords (ablation experiments showed that this loss brings no gain on downstream tasks, so it was not used in the final PEGASUS model). The second is GSG, denoted by [mask1], where the decoder must generate the complete sentences that were masked in the input. For this loss, the authors propose three candidate sentence-selection schemes: Random (select m sentences at random), Lead (select the first m sentences), and Ind-Orig (select m sentences by importance score). The importance score of a sentence is the ROUGE score between that sentence and the set of remaining sentences in the document; intuitively, this strategy masks the sentences that best represent the rest of the document. The figure below illustrates the three selection schemes, with the selected sentences marked in green, reddish-brown, and blue, respectively. Experiments show that the third selection strategy yields the best performance.
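The importance-based Ind-Orig selection can be sketched as follows. For brevity, a simple unigram-overlap F1 stands in for the ROUGE1-F score that the paper computes between each sentence and the rest of the document:

```python
from collections import Counter

def overlap_f1(candidate, reference):
    """Unigram-overlap F1, a simplified stand-in for ROUGE-1 F."""
    c, r = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((c & r).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / sum(c.values()), overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

def select_ind_orig(sentences, m):
    """Score each sentence against the rest of the document and keep the top m."""
    scored = []
    for i, s in enumerate(sentences):
        rest = " ".join(sentences[:i] + sentences[i + 1:])
        scored.append((overlap_f1(s, rest), i))
    top = sorted(scored, reverse=True)[:m]
    return sorted(i for _, i in top)  # indices of the sentences to mask
```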

Text Summarization Model Tutorial

Below we briefly describe how to use PEGASUS and other text summarization models in the EasyNLP framework.

Install EasyNLP

Users can refer to the GitHub repository (https://github.com/alibaba/EasyNLP) for instructions on installing the EasyNLP algorithm framework.

Data Preparation

For a specific text summarization scenario, users need to provide training and validation data for the downstream task as tsv files. For a summarization task, the file contains two columns separated by a tab (\t): the first column is the summary and the second column is the source text. An example is as follows:

Hubei: The work resumption rate of "four top enterprises" has reached 93.8%	CCTV News: On April 1, the reporter learned from a press conference on the prevention and control of the novel coronavirus pneumonia epidemic in Hubei Province that, with the joint efforts of all parties, the resumption of work and production in Hubei Province has achieved phased results. As of March 31, the work resumption rate of Hubei Province's "four top enterprises", including industrial enterprises above designated size and legal entities in the service industry above designated size, had reached 93.8%, with a staff return rate of 69.3%. In Wuhan, the work resumption rate and return rate had reached 85.4% and 40.4%, respectively. Responsible editor: Wang Shiyao
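Before training, a small Python check like the one below can verify that every line follows the two-column, tab-separated layout described above (the file name cn_train.tsv is an assumption; substitute your own data file):

```python
# Sanity-check a summarization tsv: exactly two tab-separated columns per line.
with open("cn_train.tsv", encoding="utf-8") as f:
    for lineno, line in enumerate(f, 1):
        cols = line.rstrip("\n").split("\t")
        assert len(cols) == 2, f"line {lineno}: expected 2 columns, got {len(cols)}"
        summary, source = cols  # first column: summary; second column: source text
```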

The following archive contains preprocessed training and validation data for news headline generation and can be used for testing:

https://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/tutorials/generation/title_gen.zip
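For convenience, the archive can be downloaded and unpacked with a short Python snippet; we assume here that it extracts to the cn_train.tsv and cn_dev.tsv files used in the commands below:

```python
# Download and unpack the tutorial data (URL from above).
import urllib.request
import zipfile

url = ("https://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com"
       "/release/tutorials/generation/title_gen.zip")
urllib.request.urlretrieve(url, "title_gen.zip")
with zipfile.ZipFile("title_gen.zip") as zf:
    zf.extractall(".")  # assumed to contain cn_train.tsv and cn_dev.tsv
```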

Chinese News Headline Generation

Since the original PEGASUS models only support English, to serve users in the Chinese community we pre-trained a Chinese news headline summarization model based on the mT5 architecture and integrated it into the EasyNLP model library. We also integrated the Chinese summarization models Randeng (which can be regarded as a Chinese version of PEGASUS) pre-trained by IDEA, so that users can explore the performance of different models. The table below summarizes the models available in EasyNLP and compares their performance on the above datasets. We recommend the first two models for text summarization and the last three models for news headline generation.

| Chinese model | News headline (Rouge1/2/L) | Paper title (Rouge1/2/L) |
|---|---|---|
| hfl/randeng-238M-Summary-Chinese | 59.66/46.26/55.95 | 54.55/39.37/50.69 |
| hfl/randeng-523M-Summary-Chinese | 62.86/49.67/58.89 | 53.83/39.17/49.92 |
| alibaba-pai/mt5-title-generation-zh-275m | 62.35/48.63/58.96 | 54.28/40.26/50.55 |
| alibaba-pai/randeng-238M-Summary-Chinese-tuned | 64.31/51.80/60.97 | 58.83/45.28/55.72 |
| alibaba-pai/randeng-523M-Summary-Chinese-tuned | 64.76/51.65/61.06 | 59.27/45.58/55.92 |
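As an aside, if these checkpoints are also published on the Hugging Face Hub (their availability there, and loadability with the stock transformers classes, are assumptions on our part), a quick standalone inference sketch might look like:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Model name taken from the table above; Hub availability is an assumption.
name = "alibaba-pai/mt5-title-generation-zh"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)

text = "..."  # a Chinese news article
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
ids = model.generate(
    **inputs,
    num_beams=5,             # mirrors num_beams in the commands below
    min_length=12,           # mirrors min_decoder_length
    max_length=32,           # mirrors max_decoder_length
    no_repeat_ngram_size=2,  # mirrors no_repeat_ngram_size
)
print(tokenizer.decode(ids[0], skip_special_tokens=True))
```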

For the news headline generation task, we train the model with the following command. The hyperparameter 'save_checkpoint_steps' controls how often a checkpoint is saved; at each checkpoint the framework evaluates the current model and decides, based on its performance on the validation set, whether to update the saved model parameters. The main.py entry script is located in the EasyNLP/examples/appzoo_tutorials/sequence_generation directory, and the training and validation data need to be placed in that directory. Any model from the table above can be selected via the 'pretrain_model_name_or_path' field inside the 'user_defined_parameters' hyperparameter.

python main.py \
    --mode train \
    --app_name=sequence_generation \
    --worker_gpu=1 \
    --tables=./cn_train.tsv,./cn_dev.tsv  \
    --input_schema=title_tokens:str:1,content_tokens:str:1 \
    --first_sequence=content_tokens \
    --second_sequence=title_tokens \
    --label_name=title_tokens \
    --checkpoint_dir=./finetuned_zh_model \
    --micro_batch_size=8 \
    --sequence_length=512 \
    --epoch_num=1  \
    --save_checkpoint_steps=150 \
    --export_tf_checkpoint_type none \
    --user_defined_parameters 'pretrain_model_name_or_path=alibaba-pai/mt5-title-generation-zh language=zh copy=false max_encoder_length=512 min_decoder_length=12 max_decoder_length=32 no_repeat_ngram_size=2 num_beams=5 num_return_sequences=5'

In addition, users can generate summaries with the trained model using the following command; the model path is specified by 'checkpoint_dir'. The 'append_cols' parameter appends the listed input columns to the output file; fill in none if no columns should be appended.

python main.py \
    --mode=predict \
    --app_name=sequence_generation \
    --worker_gpu=1 \
    --tables=./cn_dev.tsv  \
    --outputs=./cn.preds.txt \
    --input_schema=title:str:1,content:str:1,title_tokens:str:1,content_tokens:str:1,tag:str:1 \
    --output_schema=predictions,beams \
    --append_cols=content,title,tag \
    --first_sequence=content_tokens \
    --checkpoint_dir=./finetuned_zh_model \
    --micro_batch_size=32 \
    --sequence_length=512 \
    --user_defined_parameters 'language=zh copy=false max_encoder_length=512 min_decoder_length=12 max_decoder_length=32 no_repeat_ngram_size=2 num_beams=5 num_return_sequences=5'

The following are several examples of the model's predictions on recent news events. Each example contains five tab-separated columns: the predicted summary (news title), the five beam-search candidates (separated by ||), the input text, the input title, and the news tag; the last three columns are copied directly from the corresponding input data. Because the news texts are long, only the first four columns of each sample are shown below.

**Federer farewell letter: I will play more tennis in the future**	Federer farewell letter: I will play more tennis in the future||Federer's farewell letter: I will play more tennis in the future||Federer's farewell letter: I'll play more tennis in the future, but not at a Grand Slam or Tour||Federer farewell letter: I will play more tennis in the future||Details: Federer announces retirement, farewell letter	**The end of a generation of legends! Tennis king Federer announces retirement**	CCTV News: On the evening of September 15th, Beijing time, the tennis king Roger Federer announced his retirement on personal social media. The 41-year-old Federer is one of the greatest players in men's tennis history. He has won 103 singles titles and 20 Grand Slam singles titles (6 Australian Opens, 1 French Open, 8 Wimbledons, and 5 US Opens), and held the men's singles world No. 1 ranking for 310 weeks. Attached is Federer's farewell letter: Of all the gifts tennis has given me over the years, the best are without a doubt the people I have met along the way: my friends, my competitors, and most importantly, the fans, who give life to the sport. Today, I want to share some news with you. As many of you know, I have been challenged by injuries and surgery for the past three years......
**Typhoon "Meihua" will land on the coast of Dalian and will gradually transform into an extratropical cyclone**	Typhoon "Meihua" will land on the coast of Dalian and will gradually transform into an extratropical cyclone||Typhoon "Meihua" will gradually transform into an extratropical cyclone after making landfall along the coast of Dalian||Typhoon "Meihua" will land on the coast of Dalian and will gradually change into an extratropical cyclone||Typhoon "Meihua" will transform into an extratropical cyclone after making landfall along the coast of Dalian||Typhoon "Meihua" will gradually weaken after landing on the coast of Dalian	**Typhoon "Meihua" will make landfall on the coast of Dalian, Liaoning Province around the evening of the 16th**	The reporter learned from the Meteorological Department of Dalian City, Liaoning Province on September 16 that this year's No. 12 typhoon "Meihua" will make landfall on the coast between Lushunkou District, Dalian City and Zhuanghe City around the evening of the 16th, and then gradually transform into an extratropical cyclone. Affected by the typhoon, from 8:00 on the 14th to 10:00 on the 16th, the average rainfall in Dalian was 132 mm; the maximum rainfall, 283.6 mm, occurred in Zhengmingsi Village, Dalijia Street, Jinpu New District, and the maximum one-hour rainfall was 49.4 mm, in Guangludao Town, Changhai County......
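A prediction file in this layout can be post-processed with a few lines of Python. The sketch below assumes the file name cn.preds.txt from the 'outputs' argument above and the column order described in the previous paragraph:

```python
# Each output line: prediction \t beam candidates (joined by "||") \t appended columns.
rows = []
with open("cn.preds.txt", encoding="utf-8") as f:
    for line in f:
        fields = line.rstrip("\n").split("\t")
        prediction, beams = fields[0], fields[1].split("||")
        appended = fields[2:]  # here: the columns listed in --append_cols
        rows.append({"prediction": prediction, "beams": beams, "appended": appended})
print(rows[0]["prediction"], rows[0]["beams"])
```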

English Text Summarization

The EasyNLP model library also integrates English text summarization models, including PEGASUS and BRIO. The table below shows the performance of the two models on English text summarization data. Users can train and run prediction with the same commands as above. Note that EasyNLP processes Chinese by default, so when working with English text you must set language=en in 'user_defined_parameters'; if it is not provided, the language defaults to Chinese (zh).

| English model | Text summarization (Rouge1/2/L) |
|---|---|
| alibaba-pai/pegasus-summary-generation-en | 37.79/18.69/35.44 |
| hfl/brio-cnndm-uncased | 41.46/23.34/38.91 |

The training process is as follows:

wget http://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/tutorials/generation/en_train.tsv
wget http://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/tutorials/generation/en_dev.tsv

python main.py \
    --mode train \
    --app_name=sequence_generation \
    --worker_gpu=1 \
    --tables=./en_train.tsv,./en_dev.tsv  \
    --input_schema=title:str:1,content:str:1 \
    --first_sequence=content \
    --second_sequence=title \
    --label_name=title \
    --checkpoint_dir=./finetuned_en_model \
    --micro_batch_size=1 \
    --sequence_length=512 \
    --epoch_num 1 \
    --save_checkpoint_steps=500 \
    --export_tf_checkpoint_type none \
    --user_defined_parameters 'language=en pretrain_model_name_or_path=alibaba-pai/pegasus-summary-generation-en copy=false max_encoder_length=512 min_decoder_length=64 max_decoder_length=128 no_repeat_ngram_size=2 num_beams=5 num_return_sequences=5'

The prediction process is as follows:

python main.py \
    --mode=predict \
    --app_name=sequence_generation \
    --worker_gpu=1 \
    --tables=./en_dev.tsv  \
    --outputs=./en.preds.txt \
    --input_schema=title:str:1,content:str:1 \
    --output_schema=predictions,beams \
    --append_cols=title,content \
    --first_sequence=content \
    --checkpoint_dir=./finetuned_en_model \
    --micro_batch_size 32 \
    --sequence_length 512 \
    --user_defined_parameters 'language=en copy=false max_encoder_length=512 min_decoder_length=64 max_decoder_length=128 no_repeat_ngram_size=2 num_beams=5 num_return_sequences=5'

The following shows the models' summary predictions for a recent technology press release:

With the image generator Stable Diffusion, you can conjure within seconds a portrait of Beyoncé as if painted by Vincent van Gogh, a cyberpunk cityscape in the style of 18th century Japanese artist Hokusai and a complex alien world straight out of science fiction. Released to the public just two weeks ago, it's become one of several popular AI-powered text-to-image generators, including DALL-E 2, that have taken the internet by storm. Now, the company behind Stable Diffusion is in discussions to raise $100 million from investors, according to three people with knowledge of the matter. Investment firm Coatue expressed initial interest in a deal that would value the London-based startup Stability AI at $500 million, according to two of the people. Lightspeed Venture Partners then entered talks — which are still underway — to invest at a valuation up to $1 billion, two sources said. Stability AI, Coatue and Lightspeed declined requests for comment. The London-based startup previously raised at least $10 million in SAFE notes (a form of convertible security popular among early-stage startups) at a valuation of up to $100 million, according to one of the sources. An additional fourth source with direct knowledge confirmed Stability AI's previous round. Much of the company's funds came directly from founder and CEO Emad Mostaque, a former hedge fund manager. News of the prior financing was previously unreported. By nature of being open source, Stability AI's underlying technology is free to use. So far, the company does not have a clear business model in place, according to three of the sources. However, Mostaque said in an interview last month with Yannic Kilcher, a machine learning engineer and YouTube personality, that he has already penned partnerships with "governments and leading institutions" to sell the technology. "We've negotiated massive deals so we'd be profitable at the door versus most money-losing big corporations," he claims. The first version of Stable Diffusion itself cost just $600,000 to train, he wrote on Twitter — a fraction of the company's total funding. Mostaque, 39, hails from Bangladesh and grew up in England. He received a master's degree in mathematics and computer science from Oxford University in 2005 and spent 13 years working at U.K. hedge funds. In 2019, he launched Symmitree, a startup that aimed to reduce the cost of technology for people in poverty; it shuttered after one year, according to his LinkedIn profile. He then founded Stability AI in late 2020 with the mission of building open-source AI projects. According to its website, text-to-image generation is only one component of a broader apparatus of AI-powered offerings that the company is helping to build. Other open-source research groups it backs are developing tools for language, audio and biology. Stable Diffusion — created in collaboration with RunwayML, a video editing startup also backed by Coatue, and researchers at the Ludwig Maximilian University of Munich — has generated by far the most buzz among the company's projects. It comes as AI image generators entered the zeitgeist this year, with the release of OpenAI's DALL-E 2 in April and independent research lab Midjourney's eponymous product in July. Google also revealed a text-to-image system, Imagen, in May, though it is not available to the public.
Mostaque and his peers have said that the existing technology only represents the tip of the iceberg of what AI art is capable of creating: Future use cases could include drastically improved photorealism, video and animation. These image generators are already facing controversy: Many of them have been trained by processing billions of images on the internet without the consent of the copyright holder, prompting debate over ethics and legality. Last week, a testy debate broke out online after a Colorado fine arts competition awarded a top prize to an AI-generated work of art. Moreover, unlike DALL-E and Midjourney, which have restrictions in place to prevent the generation of gory or pornographic images, Stable Diffusion's open source nature allows users to bypass such a block. On 4chan, numerous threads have appeared with AI-generated deepfakes of celebrity nudes, while Reddit has banned at least four communities that were dedicated to posting "not safe for work" AI imagery made using Stable Diffusion. It's a double-edged sword for Stability AI, which has accumulated community goodwill precisely due to its open source approach that gives its users full access to its code. The company's website states that the company is "building open AI tools," a mission that mirrors the initial intent of OpenAI to democratize access to artificial intelligence. OpenAI was launched as a nonprofit research organization by prominent technologists including Sam Altman and Elon Musk, but upon accepting a $1 billion investment from Microsoft in 2019, it became a for-profit business. The move led it to focus on commercializing its technology rather than making it more widely available, drawing criticism from the AI community — and Musk himself. Stability AI has been a for-profit corporation from its inception, which Mostaque has said is meant to allow the open source research to reach more people. In an interview with TechCrunch last month, he said that the company was fully independent. "Nobody has any voting rights except our 75 employees — no billionaires, big funds, governments or anyone else with control of the company or the communities we support," he said. At a $1 billion valuation, Mostaque would be ceding up to 10% of the company to the new financiers. Venture capital investors who take significant stakes in startups typically ask for board positions so they can influence the decisions the company is making using their money. Lightspeed, which manages $10 billion of assets, and Coatue, which is in charge of $73 billion, both have a track record of taking board seats, though it's unclear if that will be the case with Stability AI.

The above text comes from https://www.forbes.com/sites/kenrickcai/2022/09/07/stability-ai-funding-round-1-billion-valuation-stable-diffusion-text-to-image/?sh=33ecbe8724d6

For the above press release, the following are the summaries generated by the two models above:

stable Diffusion is in discussions to raise $100 million from investors, three people say. The image generator is one of several popular AI-powered text-to-image generators.
company behind the popular image generator Stable Diffusion is in talks to raise $100 million from investors, according to sources
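To compare generated summaries against references with Rouge1/2/L, as reported in the tables above, one common choice is the rouge_score package (EasyNLP's built-in evaluation may be implemented differently; the strings below are just placeholders):

```python
# pip install rouge-score
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
reference = "Stability AI in talks to raise $100 million"  # placeholder reference
prediction = "stable Diffusion is in discussions to raise $100 million from investors"
scores = scorer.score(reference, prediction)  # signature: score(target, prediction)
for metric, s in scores.items():
    print(metric, f"P={s.precision:.4f} R={s.recall:.4f} F1={s.fmeasure:.4f}")
```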

The above is the complete workflow for training and prediction of text summarization models with EasyNLP. For a more detailed tutorial, refer to the following course: Title Party Crash Course: Chinese News Title Generation Based on Machine Learning PAI EasyNLP.

Future Outlook

In the future, we plan to integrate knowledge-enhanced Chinese pre-trained models into the EasyNLP framework, covering common Chinese NLU and NLG domains, so stay tuned. We will also integrate more SOTA models (especially Chinese models) into EasyNLP to support various NLP and multimodal tasks. In addition, the Alibaba Cloud Machine Learning PAI team is continuing its in-house research on Chinese text generation and Chinese multimodal models. We welcome users to keep following us, and to join our open-source community to build the Chinese NLP and multimodal algorithm library together!

GitHub address: https://github.com/alibaba/EasyNLP

References

  1. Chengyu Wang, Minghui Qiu, Taolin Zhang, Tingting Liu, Lei Li, Jianing Wang, Ming Wang, Jun Huang, Wei Lin. "EasyNLP: A Comprehensive and Easy-to-use Toolkit for Natural Language Processing." arXiv preprint arXiv:2205.00258 (2022).
  2. Zhang, Jingqing, et al. "PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization." International Conference on Machine Learning. PMLR, 2020.
  3. Xue, Linting, et al. "mT5: A massively multilingual pre-trained text-to-text transformer." arXiv preprint arXiv:2010.11934 (2020).
  4. Lewis, Mike, et al. "BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension." arXiv preprint arXiv:1910.13461 (2019).
  5. Song, Kaitao, et al. "MASS: Masked sequence to sequence pre-training for language generation." arXiv preprint arXiv:1905.02450 (2019).
  6. Dong, Li, et al. "Unified language model pre-training for natural language understanding and generation." Advances in Neural Information Processing Systems 32 (2019).
  7. Yixin Liu, Pengfei Liu, Dragomir Radev, and Graham Neubig. "BRIO: Bringing Order to Abstractive Summarization." Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2890–2903, Dublin, Ireland, 2022.

