Text Summarization | NLP tutorial | LSTM | Encoder Decoder Architecture | Python | Gensim | NLTK

Let us first understand what text summarization is before we look at how it works. Here is a definition to get us started: automatic text summarization is the task of producing a concise and fluent summary while preserving key information content and overall meaning.

There are broadly two different approaches used for text summarization: extractive summarization and abstractive summarization. Let's look at these two types in a bit more detail.

Extractive summarization, as the name gives away, identifies the important sentences or phrases from the original text and extracts only those; the extracted sentences form our summary. The diagram below illustrates extractive summarization.

Abstractive summarization is a very interesting approach: here we generate new sentences from the original text. This is in contrast to the extractive approach we saw earlier, where we used only sentences that were already present. The sentences generated through abstractive summarization might not appear in the original text at all.

Let's discuss sequence-to-sequence modeling. We can build a sequence-to-sequence model for any problem that involves sequential information. Some very common applications include sentiment classification, neural machine translation, and named entity recognition. In neural machine translation, the input is text in one language and the output is text in another language. In named entity recognition, the input is a sequence of words and the output is a sequence of tags, one for every word in the input sequence.

Our objective is to build a text summarizer where the input is a long sequence of words and the output is a short summary, so we can model this as a many-to-many sequence-to-sequence problem. There are two major components of a sequence-to-sequence model: the encoder and the decoder. Let us understand these two in detail; they are essential to understanding how text summarization works underneath the code.

The encoder-decoder architecture is mainly used to solve sequence-to-sequence problems where the input and output sequences are of different lengths. Let us understand this from the perspective of text summarization: the input is a long sequence of words and the output is a short version of the input sequence. Generally, variants of recurrent neural networks, that is, gated recurrent units (GRUs) or long short-term memory (LSTM) networks, are preferred as the encoder and decoder components, because they are capable of capturing long-term dependencies by overcoming the vanishing gradient problem. We set up the encoder-decoder in two phases: the training phase and the inference phase.

This is the dataset I am using. You can see the test data and the training data. If you open the training data, there are a huge number of files. If you open one of them, you can see the full article, and at the end the highlight of the article, which is its summary; it is only a few lines long. All the files are like this: the article text followed by its highlight. We train the model using these files. The test dataset has no highlights; we will use those files to test the model.

Let me show you the code I have developed. I iterate through the folder and through all the files in the directory, read each file, and append the content to a data frame. The data frame has columns for file name, text, and summary. I extract the summary after the highlight marker: I search for the word "highlight", and if it is found in the file, everything after it is read and added to the data frame as the summary.
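Here is a minimal sketch of that loading step, assuming the files follow the CNN/DailyMail convention where each article ends in one or more "@highlight" blocks; the directory path and column names are placeholders, not necessarily the exact ones from the video:

```python
import os
import pandas as pd

def load_stories(data_dir):
    """Read every story file in data_dir and split it into text and summary.

    Assumes the CNN/DailyMail convention where the highlights follow
    '@highlight' markers at the end of each article.
    """
    rows = []
    for file_name in os.listdir(data_dir):
        path = os.path.join(data_dir, file_name)
        with open(path, encoding="utf-8") as f:
            content = f.read()
        if "@highlight" in content:
            # Everything before the first marker is the article text;
            # everything after the markers forms the summary.
            text, *highlights = content.split("@highlight")
            summary = ". ".join(h.strip() for h in highlights)
        else:
            text, summary = content, ""
        rows.append({"file_name": file_name,
                     "text": text.strip(),
                     "summary": summary})
    return pd.DataFrame(rows)

df = load_stories("cnn/stories")  # hypothetical path to the training files
print(df.head())
```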
After that I use the Gensim library, which you may be familiar with, along with NLTK, the Natural Language Toolkit. There is a method in Gensim called simple_preprocess; I use it to clean the text and summary columns in the data frame. I then add two more columns, so the data frame now holds the file name, text, summary, clean text, and clean summary: the raw text next to its cleaned version, and the raw summary next to its cleaned version.

After that I use the scikit-learn library to split the data into training and testing sets. You can see it here: I take the maximum text length as 80 and the maximum summary length as 10 (both the cleaning and the split are sketched in code at the end of this section). Then I download the required resources from NLTK, and once that is done I use LSTMs to create the model.

We are finally at the model-building part, but before we do that we need to familiarize ourselves with a few terms that are required prior to building the model.

return_sequences=True: when the return_sequences parameter is set to True, the LSTM produces the hidden state for every timestep.
return_state=True: when return_state is set to True, the LSTM additionally returns the hidden state and cell state of the last timestep only.
initial_state: this is used to initialize the internal states of the LSTM for the first timestep.
Stacked LSTM: a stacked LSTM has multiple layers of LSTM stacked on top of each other. This leads to a better representation of the sequence. I encourage you to experiment with multiple LSTM layers stacked on top of each other.

Here we are building a three-stacked LSTM for the encoder. You can see the code here: I import the Keras backend and build the model, and this is the output we get. I train the model using sparse categorical cross-entropy as the loss function, since it converts the integer sequences to one-hot vectors on the fly; this overcomes any memory issues. Here the concept of early stopping comes in: it is used to stop training the neural network at the right time by monitoring a user-specified metric. That is what we are doing here during training; you can see the code.

We plot a diagnostic plot to understand the behavior of the model over time. You can see it here. From the plot we can infer that there is a slight increase in the validation loss after epoch 10, so we stop training the model after this epoch.
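As a recap of the steps just described, here are two hedged sketches. First, the cleaning and splitting: the snippet below applies Gensim's simple_preprocess to both columns and then splits with scikit-learn's train_test_split. The 80 and 10 length caps come from the video; the split ratio and variable names are assumptions:

```python
from gensim.utils import simple_preprocess
from sklearn.model_selection import train_test_split

max_text_len = 80     # maximum article length from the video
max_summary_len = 10  # maximum summary length from the video

# simple_preprocess lowercases, tokenizes, and drops punctuation
# and very short tokens, returning a list of clean words.
df["clean_text"] = df["text"].apply(lambda t: " ".join(simple_preprocess(t)))
df["clean_summary"] = df["summary"].apply(lambda s: " ".join(simple_preprocess(s)))

x_train, x_val, y_train, y_val = train_test_split(
    df["clean_text"], df["clean_summary"],
    test_size=0.1,    # assumed ratio; the video does not state one
    random_state=0,
)
```

Second, a sketch of the three-stacked LSTM encoder with an LSTM decoder, compiled with sparse categorical cross-entropy and trained with early stopping, ending with the diagnostic loss plot. The vocabulary sizes, dimensions, and hyperparameters are placeholder values, and start/end-of-summary tokens are omitted for brevity:

```python
import matplotlib.pyplot as plt
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Embedding, Dense
from tensorflow.keras.callbacks import EarlyStopping

# Turn the cleaned text into padded integer sequences.
x_tok = Tokenizer(); x_tok.fit_on_texts(x_train)
y_tok = Tokenizer(); y_tok.fit_on_texts(y_train)
x_tr = pad_sequences(x_tok.texts_to_sequences(x_train), maxlen=max_text_len, padding="post")
x_va = pad_sequences(x_tok.texts_to_sequences(x_val), maxlen=max_text_len, padding="post")
y_tr = pad_sequences(y_tok.texts_to_sequences(y_train), maxlen=max_summary_len, padding="post")
y_va = pad_sequences(y_tok.texts_to_sequences(y_val), maxlen=max_summary_len, padding="post")
x_voc, y_voc = len(x_tok.word_index) + 1, len(y_tok.word_index) + 1

latent_dim, embedding_dim = 300, 100  # placeholder sizes

# Encoder: three LSTM layers stacked on top of each other.
encoder_inputs = Input(shape=(max_text_len,))
enc_emb = Embedding(x_voc, embedding_dim)(encoder_inputs)
enc_out, _, _ = LSTM(latent_dim, return_sequences=True, return_state=True)(enc_emb)
enc_out, _, _ = LSTM(latent_dim, return_sequences=True, return_state=True)(enc_out)
enc_out, state_h, state_c = LSTM(latent_dim, return_sequences=True, return_state=True)(enc_out)

# Decoder: its internal states are initialized with the encoder's final states.
decoder_inputs = Input(shape=(None,))
dec_emb = Embedding(y_voc, embedding_dim)(decoder_inputs)
dec_out, _, _ = LSTM(latent_dim, return_sequences=True, return_state=True)(
    dec_emb, initial_state=[state_h, state_c])
decoder_outputs = Dense(y_voc, activation="softmax")(dec_out)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
# Sparse categorical cross-entropy accepts integer targets directly,
# converting them to one-hot vectors on the fly to avoid memory issues.
model.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy")

# Early stopping halts training once the validation loss stops improving.
es = EarlyStopping(monitor="val_loss", mode="min", patience=2)
history = model.fit(
    [x_tr, y_tr[:, :-1]], y_tr[:, 1:],          # teacher forcing: shift by one step
    validation_data=([x_va, y_va[:, :-1]], y_va[:, 1:]),
    epochs=50, batch_size=128, callbacks=[es],
)

# Diagnostic plot of training vs. validation loss per epoch.
plt.plot(history.history["loss"], label="train")
plt.plot(history.history["val_loss"], label="validation")
plt.xlabel("epoch"); plt.ylabel("loss"); plt.legend(); plt.show()
```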
