Objective
The objective of this project is to build a model that can generate relevant summaries for reviews of grocery and gourmet food products sold on Amazon. The review data were obtained from the Stanford Large Network Dataset Collection; the raw data include roughly one million Amazon food reviews.
Table of Contents
- Introduction
- Overview of text summarization
- Introduction to Abstractive Summarization Using Sequence-to-Sequence Modeling
- Implementing a Text Summarization seq2seq Model in Python
- Data Preprocessing
- Data Inspection
- Removing irrelevant features
- Finding and removing null values
- Exploring the reviews
- Data Preparation
- Text Cleaning
- Sentence Length Distribution
- Removing short text and summaries
- Data tokenization and vectorization
- Word Embeddings
- Model Building
- Post Processing
- Inference
The main goal of summarization is to represent the gist of a text in a few words that carry its main idea and most important information. One common example is news headlines. Summaries can be short or long: news headlines are good examples of short summaries, usually containing only a few words, while journal article abstracts are good examples of long summaries, which can run to several sentences.
Overview of text summarization
The traditional way of summarizing a text works like highlighting the important parts of a text while preparing for an exam. We highlight the parts of the text that we think are most important so that we spend less time going over the material the next time we read it. This can save a lot of time, especially when there is a lot of material to read! This traditional approach to summarization is called extraction-based summarization.
With the emergence of deep learning models and subsequently the seq2seq architecture, a new way of summarizing text was introduced that works very differently from the traditional approach. The seq2seq model was first introduced by Google in 2014; it first encodes the input (the text) into a fixed-length vector representation and then decodes that representation into an output sequence (the summary).
The idea is no longer to highlight parts of the text and stitch those pieces together into a summary. Instead, the model generates a new piece of text once it has a good understanding of the meaning of the words used in the article. This is called abstractive summarization. We can think of it as a model that transforms the input sequence of tokens into a limited set of output tokens that carry the core idea of the input. The number of tokens in the output sequence is learned by the model during the training phase, where we need to provide both the text and the summary; therefore, abstractive summarization falls into the supervised learning category.
Introduction to Abstractive Summarization Using Sequence-to-Sequence Modeling
In the context of text summarization, the seq2seq model takes an input sequence of words, where each word is represented as an integer token, and returns an output sequence of tokens that forms the summary. There are different variations of the seq2seq model; this project focuses on the many-to-many seq2seq problem, where the model maps many input tokens (the text) to many output tokens (the summary). The seq2seq model has two main components, namely an encoder and a decoder. I will explain the main properties of the two components below, but the full details, and specifically the math derivations, are outside the scope of this project. Along the way, I will add the references that I found useful for understanding the math!
Encoder-Decoder Model
The main reason for using an encoder-decoder architecture is that the input and output sequences are not necessarily of the same length. The input sequence is typically a long sequence of words, whereas the output contains only a few words.
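To make this concrete, the sketch below (not the project model, which is built later with stacked LSTMs and attention) wires up a bare-bones Keras encoder-decoder in which the input length and the output length are independent; all sizes here are placeholders.
import tensorflow as tf
# Encoder: reads a long input sequence and compresses it into its final LSTM states.
enc_in = tf.keras.Input(shape=(150,))                      # up to 150 input tokens
enc_emb = tf.keras.layers.Embedding(10000, 300)(enc_in)
_, state_h, state_c = tf.keras.layers.LSTM(128, return_state=True)(enc_emb)
# Decoder: generates a much shorter sequence, starting from the encoder's final states.
dec_in = tf.keras.Input(shape=(20,))                       # up to 20 summary tokens
dec_emb = tf.keras.layers.Embedding(5000, 300)(dec_in)
dec_out = tf.keras.layers.LSTM(128, return_sequences=True)(dec_emb, initial_state=[state_h, state_c])
probs = tf.keras.layers.Dense(5000, activation='softmax')(dec_out)
sketch_model = tf.keras.Model([enc_in, dec_in], probs)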
Building Blocks of the Model
Both the encoder and decoder take advantage of a variant of Recurrent Neural Networks (RNNs), usually a Long Short-Term Memory (LSTM) unit or a Gated Recurrent Unit (GRU). The reason for using these specific RNN variants is that they can learn long-term dependencies, which is something regular RNNs fail to achieve.
RNNs are appealing because they can, at least in theory, connect previous information to the present. However, as the time gap between the past and present information becomes large (i.e., as the input sentence becomes long), RNNs fail to learn such dependencies due to the vanishing gradient problem. LSTMs are designed to remedy this long-term dependency issue.
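As a minimal illustration (not part of the project pipeline), the snippet below shows that in Keras the switch from a vanilla RNN to an LSTM is a drop-in layer change; the gated LSTM is what makes learning over long reviews feasible. The shapes used here are placeholders.
import tensorflow as tf
vanilla_rnn = tf.keras.layers.SimpleRNN(128)     # prone to vanishing gradients on long sequences
lstm_unit = tf.keras.layers.LSTM(128)            # gating lets information survive many timesteps
x = tf.random.normal((1, 150, 300))              # dummy (batch, timesteps, features) input
print(vanilla_rnn(x).shape, lstm_unit(x).shape)  # both map the sequence to a (1, 128) vector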
Training and Inference Phases
During the task of text summarization, the encoder-decoder architecture goes through two phases: training and inference. In the training phase, the model is trained to predict the target sequence offset by one timestep, i.e., given the words generated so far, which word is most likely to appear next. In the inference phase, we feed an input sequence to the model and see what summary it returns.
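Concretely, "offset by one timestep" means the decoder is fed the summary up to position t and asked to predict the word at position t+1 (teacher forcing). Below is a tiny sketch with made-up token ids, mirroring the y_tr[:, :-1] / y_tr[:, 1:] slicing used later in the training cell.
import numpy as np
# a padded, tokenized summary: 1 = start token, 2 = end token, 0 = padding
summary_seq = np.array([1, 45, 12, 7, 2, 0, 0, 0])
decoder_input  = summary_seq[:-1]   # [1, 45, 12, 7, 2, 0, 0]  -> fed to the decoder
decoder_target = summary_seq[1:]    # [45, 12, 7, 2, 0, 0, 0]  -> what it must predict
# at every step t the decoder sees decoder_input[t] (plus the encoder context)
# and is trained to output decoder_target[t], i.e. the next word of the summary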
I will provide more details about each component during the model implementation, so let's start!
Implementing a Text Summarization seq2seq Model in Python
Data Preprocessing
Data preprocessing is an essential step of this project, because using uncleaned text data can lead to a bad model and a lot of wasted time and computational power! The text preprocessing step includes:
- Removing irrelevant features
- Removing potential null values from the data
- Expanding text contractions
- Cleaning the text by removing unwanted characters
import pandas as pd
import numpy as np
import re
from nltk.corpus import stopwords
import time
import tensorflow as tf
from bs4 import BeautifulSoup
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
from matplotlib import rc, rcParams
from matplotlib import cm as cm
import matplotlib.ticker as ticker
Inspecting the Data
reviews = pd.read_json('review_data/Grocery_and_Gourmet_Food.json', lines = True, nrows=1000000)
reviews.head()
overall | verified | reviewTime | reviewerID | asin | reviewerName | reviewText | summary | unixReviewTime | vote | image | style |
---|---|---|---|---|---|---|---|---|---|---|---|
5 | True | 06 4, 2013 | ALP49FBWT4I7V | 1888861614 | Lori | Very pleased with my purchase. Looks exactly l... | Love it | 1370304000 | NaN | NaN | NaN |
4 | True | 05 23, 2014 | A1KPIZOCLB9FZ8 | 1888861614 | BK Shopper | Very nicely crafted but too small. Am going to... | Nice but small | 1400803200 | NaN | NaN | NaN |
4 | True | 05 9, 2014 | A2W0FA06IYAYQE | 1888861614 | daninethequeen | still very pretty and well made...i am super p... | the "s" looks like a 5, kina | 1399593600 | NaN | NaN | NaN |
5 | True | 04 20, 2014 | A2PTZTCH2QUYBC | 1888861614 | Tammara | I got this for our wedding cake, and it was ev... | Would recommend this to a friend! | 1397952000 | NaN | NaN | NaN |
4 | True | 04 16, 2014 | A2VNHGJ59N4Z90 | 1888861614 | LaQuinta Alexander | It was just what I want to put at the top of m... | Topper | 1397606400 | NaN | NaN | NaN |
Removing irrelevant features
reviews = reviews.drop(
['reviewerID', 'asin', 'reviewerName', 'reviewTime', 'verified', 'overall', 'unixReviewTime', 'style', 'image', 'vote'], axis=1
)
reviews = reviews.reset_index(drop=True)
Finding and removing null values
reviews.isnull().sum()
reviewText 433 summary 234 dtype: int64
reviews = reviews.dropna()
# sanity check!
reviews.isnull().sum()
reviewText 0 summary 0 dtype: int64
reviews.columns = ['text', 'summary']
Exploring the reviews
for i in range(10,15):
print(f"Review #{i+1}: {reviews.text[i]}")
print(f"Review #{i+1} summary: {reviews.summary[i]}\n")
Review #11: This arrived in the mail and it was packaged so well so it doesn't break. It's so pretty and well worth my money! Can't wait to use it on my wedding cake :D Review #11 summary: So pretty Review #12: No adverse comment. Review #12 summary: Five Stars Review #13: These are hard to find locally and Amazon has it for a good price. I first tasted this tea in Costa Rica and loved it. Review #13 summary: Wonderful tea and great price too! Review #14: Best black tea in US. Highly recommend. I use 3 bags in a large 16 oz glass mug with boiled water then add boiled milk & sugar. Oh my, it's wonderful. I wish I could drink it at night. Review #14 summary: Best black tea in US Review #15: if you like strong flavorful tea you will enjoy this Yellow Label Review #15 summary: Five Stars
Text Contractions
Next, to avoid headaches caused by text contractions, we need to expand them (the list was obtained from here)!
contractions = {
"ain't": "is not",
"aren't": "are not",
"can't": "cannot",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he would",
"he'll": "he will",
"he's": "he is",
"how'd": "how did",
"how'd'y": "how do you",
"how'll": "how will",
"how's": "how is",
"I'd": "I would",
"I'd've": "I would have",
"I'll": "I will",
"I'll've": "I will have",
"I'm": "I am",
"I've": "I have",
"i'd": "i would",
"i'd've": "i would have",
"i'll": "i will",
"i'll've": "i will have",
"i'm": "i am",
"i've": "i have",
"isn't": "is not",
"it'd": "it would",
"it'd've": "it would have",
"it'll": "it will",
"it'll've": "it will have",
"it's": "it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"mightn't've": "might not have",
"must've": "must have",
"mustn't": "must not",
"mustn't've": "must not have",
"needn't": "need not",
"needn't've": "need not have",
"o'clock": "of the clock",
"oughtn't": "ought not",
"oughtn't've": "ought not have",
"shan't": "shall not",
"sha'n't": "shall not",
"shan't've": "shall not have",
"she'd": "she would",
"she'd've": "she would have",
"she'll": "she will",
"she'll've": "she will have",
"she's": "she is",
"should've": "should have",
"shouldn't": "should not",
"shouldn't've": "should not have",
"so've": "so have",
"so's": "so as",
"this's": "this is",
"that'd": "that would",
"that'd've": "that would have",
"that's": "that is",
"there'd": "there would",
"there'd've": "there would have",
"there's": "there is",
"here's": "here is",
"they'd": "they would",
"they'd've": "they would have",
"they'll": "they will",
"they'll've": "they will have",
"they're": "they are",
"they've": "they have",
"to've": "to have",
"wasn't": "was not",
"we'd": "we would",
"we'd've": "we would have",
"we'll": "we will",
"we'll've": "we will have",
"we're": "we are",
"we've": "we have",
"weren't": "were not",
"what'll": "what will",
"what'll've": "what will have",
"what're": "what are",
"what's": "what is",
"what've": "what have",
"when's": "when is",
"when've": "when have",
"where'd": "where did",
"where's": "where is",
"where've": "where have",
"who'll": "who will",
"who'll've": "who will have",
"who's": "who is",
"who've": "who have",
"why's": "why is",
"why've": "why have",
"will've": "will have",
"won't": "will not",
"won't've": "will not have",
"would've": "would have",
"wouldn't": "would not",
"wouldn't've": "would not have",
"y'all": "you all",
"y'all'd": "you all would",
"y'all'd've": "you all would have",
"y'all're": "you all are",
"y'all've": "you all have",
"you'd": "you would",
"you'd've": "you would have",
"you'll": "you will",
"you'll've": "you will have",
"you're": "you are",
"you've": "you have"
}
Text Cleaning
def clean_text(text, remove_stopwords = True):
'''Remove unwanted characters, optionally remove stopwords, and normalize the text so that fewer words are missing from the pretrained word embeddings'''
# Convert to lower case
text = text.lower()
# take advantage of bs lxml parser
text = BeautifulSoup(text, "html.parser").text
# Fix contractions
tokens = text.split()
new_tokens = []
for token in tokens:
if token in contractions:
new_tokens.append(contractions[token])
else:
new_tokens.append(token)
text = " ".join(new_tokens)
# Format words and remove unwanted characters
text = re.sub(r'https?:\/\/.*[\r\n]*', '', text, flags=re.MULTILINE)
text = re.sub(r'\<a href', ' ', text)
text = re.sub(r'&', '', text)
text = re.sub(r'[_"\-;%()|+&=*%.,!?:#$@\[\]/]', ' ', text)
text = re.sub(r'<br />', ' ', text)
text = re.sub(r'\'', ' ', text)
text = re.sub("(\\t)", ' ', text) #remove escape charecters
text = re.sub("(\\r)", ' ', text)
text = re.sub("(\\n)", ' ', text)
text = re.sub("(__+)", ' ', text) #remove _ if it occors more than one time consecutively
text = re.sub("(--+)", ' ', text) #remove - if it occors more than one time consecutively
text = re.sub("(~~+)", ' ', text) #remove ~ if it occors more than one time consecutively
text = re.sub("(\+\++)", ' ', text) #remove + if it occors more than one time consecutively
text = re.sub("(\.\.+)", ' ', text) #remove . if it occors more than one time consecutively
text = re.sub(r"[<>()|&©ø\[\]\'\",;?~*!]", ' ', text) #remove <>()|&©ø"',;?~*!
text = re.sub("(mailto:)", ' ', text) #remove mailto:
text = re.sub(r"(\\x9\d)", ' ', text) #remove \x9* in text
text = re.sub("([iI][nN][cC]\d+)", 'INC_NUM', text) #replace INC nums to INC_NUM
text = re.sub("([cC][mM]\d+)|([cC][hH][gG]\d+)", 'CM_NUM', text) #replace CM# and CHG# to CM_NUM
text = re.sub("(\.\s+)", ' ', text) #remove full stop at end of words(not between)
text = re.sub("(\-\s+)", ' ', text) #remove - at end of words(not between)
text = re.sub("(\:\s+)", ' ', text) #remove : at end of words(not between)
text = re.sub("(\s+.\s+)", ' ', text) #remove any single charecters hanging between 2 spaces
# Optionally, remove stop words
if remove_stopwords:
text = text.split()
stops = set(stopwords.words("english"))
text = [w for w in text if not w in stops]
text = " ".join(text)
return text
Although care must be taken when dealing with stopwords, especially in NLP applications such as sentiment analysis, removing them is not a concern in this project because they do not provide much value for training our model on the review texts. However, we will keep them in the summaries so that the summaries read like natural phrases.
import time
clean_texts = []
start = time.time()
for i,text in enumerate(reviews.text):
clean_texts.append(clean_text(text))
if not (i+1)%100000:
print(f"{(i+1)} reviews are cleaned! time elapsed so far: {(time.time() - start):.1f} seconds!")
print(f"Reviews are cleaned! total time elapsed to clean: {(time.time() - start):.1f} seconds!")
print("\n")
clean_summaries = []
start = time.time()
for i,summary in enumerate(reviews.summary):
clean_summaries.append(clean_text(summary, remove_stopwords=False))
if not (i+1)%100000:
print(f"{(i+1)} summaries are cleaned! time elapsed so far: {(time.time() - start):.1f} seconds!")
print(f"Summaries are cleaned! total time spent to clean: {(time.time() - start):.1f} seconds!")
100000 reviews are cleaned! time elapsed so far: 54.0 seconds! 200000 reviews are cleaned! time elapsed so far: 108.0 seconds! 300000 reviews are cleaned! time elapsed so far: 163.2 seconds! 400000 reviews are cleaned! time elapsed so far: 218.2 seconds! 500000 reviews are cleaned! time elapsed so far: 273.4 seconds! 600000 reviews are cleaned! time elapsed so far: 327.6 seconds! 700000 reviews are cleaned! time elapsed so far: 384.7 seconds! 800000 reviews are cleaned! time elapsed so far: 441.6 seconds! 900000 reviews are cleaned! time elapsed so far: 496.1 seconds! Reviews are cleaned! total time elapsed to clean: 550.9 seconds! 100000 summaries are cleaned! time elapsed so far: 11.0 seconds! 200000 summaries are cleaned! time elapsed so far: 22.1 seconds! 300000 summaries are cleaned! time elapsed so far: 33.1 seconds! 400000 summaries are cleaned! time elapsed so far: 44.1 seconds! 500000 summaries are cleaned! time elapsed so far: 55.1 seconds! 600000 summaries are cleaned! time elapsed so far: 66.1 seconds! 700000 summaries are cleaned! time elapsed so far: 77.1 seconds! 800000 summaries are cleaned! time elapsed so far: 88.1 seconds! 900000 summaries are cleaned! time elapsed so far: 99.1 seconds! Summaries are cleaned! total time spent to clean: 110.0 seconds!
# Sanity check to make sure they are actually cleaned
for i in range(5):
print(f"Cleaned Review #{i+1}: {clean_texts[i]}")
print(f"Cleaned Summary #{i+1}: {clean_summaries[i]}\n")
Cleaned Review #1: pleased purchase looks exactly like picture look great cake definitely sparkle Cleaned Summary #1: love it Cleaned Review #2: nicely crafted small going add flowers something compensate size Cleaned Summary #2: nice but small Cleaned Review #3: still pretty well made super picky listen whispers look like number Cleaned Summary #3: the looks like 5 kina Cleaned Review #4: got wedding cake everything even person would recommend anyone Cleaned Summary #4: would recommend this to friend Cleaned Review #5: want put top wedding cake love true picture Cleaned Summary #5: topper
data=pd.DataFrame({'text':clean_texts,'summary':clean_summaries})
data.head()
text | summary |
---|---|
pleased purchase looks exactly like picture lo... | love it |
nicely crafted small going add flowers somethi... | nice but small |
still pretty well made super picky listen whis... | the looks like 5 kina |
got wedding cake everything even person would ... | would recommend this to friend |
want put top wedding cake love true picture | topper |
import pickle
# reload the cleaned texts and summaries from pickle files (presumably saved in an earlier run)
f = open("clean_texts.pkl", "rb")
clean_texts = pickle.load(f)
f.close()
f = open("clean_summaries.pkl", "rb")
clean_summaries = pickle.load(f)
f.close()
data=pd.DataFrame({'text':clean_texts,'summary':clean_summaries})
data.head()
text | summary |
---|---|
pleased purchase looks exactly like picture lo... | love it |
nicely crafted small going add flowers somethi... | nice but small |
still pretty well made super picky listen whis... | the looks like 5 kina |
got wedding cake everything even person would ... | would recommend this to friend |
want put top wedding cake love true picture | topper |
Sentence Length Distribution
Next, we analyze the lengths of the texts and summaries to get an overall idea of their distributions. This helps us decide on the maximum length to allow for both the review texts and the summaries.
sns.set_style('white')
fig, axes = plt.subplots(2, 1, figsize=(10, 10), dpi=80)
axes = axes.flatten()
plt.subplots_adjust(hspace=0.35)
sns.set(font_scale=1.5)
sns.despine()
entities = ["text", "summary"]
colors = ["teal", "orange"]
for i, entity in enumerate(entities):
sns.distplot(data[entity].apply(lambda x: len(x.split())), color=colors[i], ax=axes[i], label=entities[i])
axes[i].set_xlabel("Number of words in the sentence")
axes[i].legend(loc='best')
plt.suptitle('Sentence length distributions before removing short reviews');
Removing short text and summaries
There are a lot of very long review texts that can negatively influence the model's behavior. I set the following thresholds for removing text and summary instances that are too long or too short from the original data:
text_max_num_words = 150
summary_max_num_words = 20
text_min_num_words = 25
summary_min_num_words = 2
data['text_word_count'] = data['text'].apply(lambda x: len(x.strip().split()))
data['summary_word_count'] = data['summary'].apply(lambda x: len(x.strip().split()))
data.head()
text | summary | text_word_count | summary_word_count |
---|---|---|---|
pleased purchase looks exactly like picture lo... | love it | 11 | 2 |
nicely crafted small going add flowers somethi... | nice but small | 9 | 3 |
still pretty well made super picky listen whis... | the looks like 5 kina | 11 | 5 |
got wedding cake everything even person would ... | would recommend this to friend | 9 | 5 |
want put top wedding cake love true picture | topper | 8 | 1 |
text_max_num_words = 150
summary_max_num_words = 20
text_min_num_words = 25
summary_min_num_words = 2
data = data[(data.text_word_count>text_min_num_words)
& (data.text_word_count<text_max_num_words)
& (data.summary_word_count>summary_min_num_words)
& (data.summary_word_count<summary_max_num_words)
]
data = data.drop(
['text_word_count', 'summary_word_count'], axis=1
)
Now we can take a look at the resulting data that meet the length thresholds we previously defined.
sns.set_style('white')
fig, axes = plt.subplots(2, 1, figsize=(10, 10), dpi=80)
axes = axes.flatten()
plt.subplots_adjust(hspace=0.35)
sns.set(font_scale=1.5)
sns.despine()
entities = ["text", "summary"]
colors = ["teal", "orange"]
for i, entity in enumerate(entities):
sns.distplot(data[entity].apply(lambda x: len(x.strip().split())),
color=colors[i], ax=axes[i], label=entities[i], bins=16)
axes[i].set_xlabel("word count")
axes[i].legend(loc='best')
plt.suptitle('Sentence length distributions after removing short reviews');
Data tokenization and vectorization
Adding special tokens
Two special tokens are used in seq2seq models:
- START - the first token, which is fed to the decoder along with the thought vector in order to start generating the tokens of the summary.
- END - the "end of sequence" token; as soon as the decoder generates this token we consider the summary to be complete (ordinary punctuation marks cannot be used for this purpose because their meaning can differ).
start_token = 'starttoken'
end_token = 'endtoken'
data['summary'] = data['summary'].apply(lambda x : start_token + ' ' + x + ' ' + end_token)
data = data.reset_index(drop=True)
data.head()
text | summary |
---|---|
tried overseas last year remember exactly sinc... | starttoken yellow label lipton tea endtoken |
first came across lipton yellow label tea trip... | starttoken a great tea endtoken |
first tasted caracas business trip south ameri... | starttoken best black tea endtoken |
best tea ever first france readily available e... | starttoken best tea ever nothing like the ame... |
wow new flavor block real tea looking received... | starttoken wow this is outstanding endtoken |
Splitting the data into training and validation sets
Next, I split the data into training and validation sets, where 90% of the data is used for training and the rest for validation.
from sklearn.model_selection import train_test_split
indices = np.arange(len(data['text']))
x_tr, x_val, y_tr, y_val, tr_indices, val_indices = train_test_split(
data['text'],
data['summary'],
indices,
test_size=0.1,
random_state=1
)
Vocabulary Size
The vocabulary size is one of the important parameters we need to know when building the model: it determines the input dimension of the embedding layers and the size of the final softmax over the summary vocabulary.
def count_words(count_dict, text):
'''Count the number of occurrences of each word in a set of text'''
for sentence in text:
for word in sentence.split():
count_dict[word] = count_dict.get(word, 0) + 1
word_count_dict = {}
count_words(word_count_dict, x_tr)
print("Size of vocabulary train (text):", len(word_count_dict))
text_wc = len(word_count_dict)
count_words(word_count_dict, y_tr)
print("Size of vocabulary train (summary):", len(word_count_dict) - text_wc)
print("Size of vocabulary train (text + review):", len(word_count_dict))
Size of vocabulary train (text): 65923 Size of vocabulary train (summary): 2230 Size of vocabulary train (text + review): 68153
Text Tokenizer
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
# Tokenizer for review texts
text_tokenizer = Tokenizer()
text_tokenizer.fit_on_texts(x_tr)
# a word needs to be present at least 4 times so that we don't consider it rare
text_rare_min_count=4
unique_rare_word_count=0
unique_word_count=0
num_rare_word_used=0
num_word_used=0
for word, count in text_tokenizer.word_counts.items():
unique_word_count += 1
num_word_used += count
if count < text_rare_min_count:
unique_rare_word_count += 1
num_rare_word_used += count
print(f"number of unique words is {unique_word_count:d}",
f"\nand number unique rare words is {unique_rare_word_count}")
print("unique rare word to uniqe word ratio in all review texts = {:.2f}".format(
(unique_rare_word_count/unique_word_count)*100)
)
print("rare word usage percentage in all review texts = {:.2f}%".format(
(num_rare_word_used/num_word_used)*100)
)
number of unique words is 65770 and number unique rare words is 38717 unique rare word to unique word ratio in all review texts = 58.87 rare word usage percentage in all review texts = 0.80%
# Tokenizer for review texts
text_tokenizer = Tokenizer(num_words=unique_word_count-unique_rare_word_count+1)
text_tokenizer.fit_on_texts(x_tr)
# convert text sequences into integer sequences
x_tr_seq = text_tokenizer.texts_to_sequences(x_tr)
x_val_seq = text_tokenizer.texts_to_sequences(x_val)
# padding zero upto maximum length
x_tr = pad_sequences(x_tr_seq, maxlen=text_max_num_words, padding='post')
x_val = pad_sequences(x_val_seq, maxlen=text_max_num_words, padding='post')
words = list(text_tokenizer.index_word.values())
i = 0
while text_tokenizer.word_counts[words[i]] >= text_rare_min_count:
i += 1
print(f"dictionary has {i} words that appear more than the minimum threshold")
dictionary has 27053 words that appear more than the minimum threshold
# Sanity check
print(words[i-1],text_tokenizer.word_counts[words[i-1]])
print(words[i-1],text_tokenizer.texts_to_sequences([words[i-1]]))
print(words[i],text_tokenizer.word_counts[words[i]])
print(words[i],text_tokenizer.texts_to_sequences([words[i]]))
mezzomix 4 mezzomix [[27053]] rissoto 3 rissoto [[]]
text_vocab_size = text_tokenizer.num_words
print(f"text_vocab_size: {text_vocab_size}")
text_vocab_size: 27054
Summary Tokenizer
# Tokenizer for review summaries
summary_tokenizer = Tokenizer()
summary_tokenizer.fit_on_texts(y_tr)
# needs to be present at least 6 times so that we dont consider it rare
summary_rare_min_count=6
unique_rare_word_count=0
unique_word_count=0
num_rare_word_used=0
num_word_used=0
for word, count in summary_tokenizer.word_counts.items():
unique_word_count += 1
num_word_used += count
if count < summary_rare_min_count:
unique_rare_word_count += 1
num_rare_word_used += count
print(f"number of unique words is {unique_word_count:d}",
f"\nand number unique rare words is {unique_rare_word_count}")
print("unique rare word to uniqe word ratio in all review summaries = {:.2f}".format(
(unique_rare_word_count/unique_word_count)*100)
)
print("rare word usage percentage in all review summaries = {:.2f}%".format(
(num_rare_word_used/num_word_used)*100)
number of unique words is 19731 and number unique rare words is 13691 unique rare word to uniqe word ratio in all review summaries = 69.39 rare word usage percentage in all review summaries = 2.15%
# Tokenizer for review summaries
summary_tokenizer = Tokenizer(num_words=unique_word_count - unique_rare_word_count + 1)
summary_tokenizer.fit_on_texts(y_tr)
# convert text sequences into integer sequences
y_tr_seq = summary_tokenizer.texts_to_sequences(y_tr)
y_val_seq = summary_tokenizer.texts_to_sequences(y_val)
# padding zero upto maximum length
y_tr = pad_sequences(y_tr_seq, maxlen=summary_max_num_words, padding='post')
y_val = pad_sequences(y_val_seq, maxlen=summary_max_num_words, padding='post')
words = list(summary_tokenizer.index_word.values())
i = 0
while summary_tokenizer.word_counts[words[i]] > summary_rare_min_count-1:
i += 1
print(f"dictionary has {i} words that appear more than the minimum threshold")
dictionary has 6040 words that appear more than the minimum threshold
# Sanity check
print(words[i-1],summary_tokenizer.word_counts[words[i-1]])
print(words[i-1],summary_tokenizer.texts_to_sequences([words[i-1]]))
print(words[i],summary_tokenizer.word_counts[words[i]])
print(words[i],summary_tokenizer.texts_to_sequences([words[i]]))
poisoning 6 poisoning [[6040]] impact 5 impact [[]]
summary_vocab_size = summary_tokenizer.num_words
print(f"summary_vocab_size: {summary_vocab_size}")
summary_vocab_size: 6041
assert summary_tokenizer.word_counts['starttoken']==len(y_tr), 'we have a problem!'
assert summary_tokenizer.word_counts['endtoken']==len(y_tr), 'we have a problem!'
Removing instances whose summaries only include the start and end tokens
train data
remove_ids = []
for i,summary in enumerate(y_tr):
# more than 17 zeros in a padded length-20 summary means only the start and end tokens are left
if len(summary[np.where(summary == 0)])>17:
remove_ids.append(i)
for id_ in remove_ids:
print(y_tr[id_])
[1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] [1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] [1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] [1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] [1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] [1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] [1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] [1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] [1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] [1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] [1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] [1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] [1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] [1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] [1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] [1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] [1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] [1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] [1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] [1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] [1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] [1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] [1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] [1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] [1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] [1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] [1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] [1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] [1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
print(f"{len(remove_ids):d} rows will be removed from the training data!")
x_tr=np.delete(x_tr, remove_ids, axis=0)
y_tr=np.delete(y_tr, remove_ids, axis=0)
29 rows will be removed from the training data!
val data
remove_ids = []
for i,summary in enumerate(y_val):
if len(summary[np.where(summary == 0)])>17:
remove_ids.append(i)
for id_ in remove_ids:
print(y_val[id_])
[1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] [1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] [1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
print(f"{len(remove_ids):d} rows are being removed from the validation data!")
x_val=np.delete(x_val, remove_ids, axis=0)
y_val=np.delete(y_val, remove_ids, axis=0)
3 rows are being removed from the validation data!
Embeddings
A word embedding is a representation of a word as a vector, where similar words are expected to have similar representations in the vector space. In this work, I tested two well-known sets of pretrained word embeddings, ConceptNet Numberbatch and GloVe. ConceptNet Numberbatch (CN) and GloVe provide embeddings for about 500,000 and 400,000 words, respectively. GloVe offers word embeddings in 50, 100, and 300 dimensions, while the CN embeddings are 300-dimensional.
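As a toy illustration of "similar words have similar vectors" (the three-dimensional vectors below are made up; the real embeddings are 300-dimensional), cosine similarity between embedding vectors is high for related words and low for unrelated ones:
import numpy as np
def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
# made-up toy vectors, purely for illustration
tea    = np.array([0.8, 0.1, 0.3])
coffee = np.array([0.7, 0.2, 0.4])
cable  = np.array([-0.5, 0.9, -0.1])
print(cosine_similarity(tea, coffee))  # close to 1: related words
print(cosine_similarity(tea, cable))   # much smaller: unrelated words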
Using Conceptnet Numberbatch for word embeddings
embeddings_index = {}
with open('numberbatch-en-19.08.txt', encoding='utf-8') as f:
for line in f:
values = line.split(' ')
word = values[0]
embedding = np.asarray(values[1:], dtype='float32')
embeddings_index[word] = embedding
print('Word embeddings count:', len(embeddings_index))
Word embeddings count: 516783
embedding_dim = len(values)-1
print(f"Embedding dimension = {embedding_dim}")
Embedding dimension = 300
We use the same minimum word count threshold as before (text_rare_min_count) to decide which review words that are missing from the ConceptNet Numberbatch embeddings still get added to the vocabulary (with a random embedding). This ensures that the added words are common enough for the model to learn their meaning.
Text Embeddings
# Find the number of words that are missing from CN, and are used more than our threshold.
added_missing_words_count = 0
total_missing_words_count = 0
for word, count in text_tokenizer.word_counts.items():
if word not in embeddings_index:
total_missing_words_count += 1
if count >= text_rare_min_count:
added_missing_words_count += 1
missing_ratio = 100*total_missing_words_count/len(word_count_dict)
print(f"Number of review text words included in CN embeddings: {text_vocab_size-total_missing_words_count}")
print(f"Number of review text words missing from CN embeddings: {total_missing_words_count}")
print(f"Number of review text words missing from CN embeddings that will be added: {added_missing_words_count}")
Number of review text words included in CN embeddings: 2714 Number of review text words missing from CN embeddings: 24340 Number of review text words missing from CN embeddings that will be added: 3750
embeddings_matrix_text = np.zeros((text_vocab_size, embedding_dim))
first_index = 1
for word, index in list(text_tokenizer.word_index.items())[:text_vocab_size-1]: # only include non rare
embeddings_vector = embeddings_index.get(word)
if embeddings_vector is not None:
embeddings_matrix_text[index] = embeddings_vector
else:
# If word not in CN, create a random embedding for it
new_embedding = np.array(np.random.uniform(-1.0, 1.0, embedding_dim))
embeddings_matrix_text[index] = new_embedding
Summary Embeddings
# Find the number of words that are missing from CN, and are used more than our threshold.
added_missing_words_count = 0
total_missing_words_count = 0
for word, count in summary_tokenizer.word_counts.items():
if word not in embeddings_index:
total_missing_words_count += 1
if count >= text_rare_min_count:
added_missing_words_count += 1
missing_ratio = 100*total_missing_words_count/len(word_count_dict)
print(f"Number of review summary words included in CN embeddings: {summary_vocab_size-total_missing_words_count}")
print(f"Number of review summary words missing from CN embeddings: {total_missing_words_count}")
print(f"Number of review summary words missing from CN embeddings that will be added: {added_missing_words_count}")
Number of review summary words included in CN embeddings: 1784 Number of review summary words missing from CN embeddings: 4257 Number of review summary words missing from CN embeddings that will be added: 480
embeddings_matrix_summary = np.zeros((summary_vocab_size, embedding_dim))
first_index = 1
for word, index in list(summary_tokenizer.word_index.items())[:summary_vocab_size-1]: # only include non rare
embeddings_vector = embeddings_index.get(word)
if embeddings_vector is not None:
embeddings_matrix_summary[index] = embeddings_vector
else:
# If word not in CN, create a random embedding for it
new_embedding = np.array(np.random.uniform(-1.0, 1.0, embedding_dim))
embeddings_matrix_summary[index] = new_embedding
Building the model
from tensorflow.keras.layers import Input, LSTM, Dense, Concatenate, TimeDistributed, Embedding
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from keras import backend as K
K.clear_session()
latent_dim = 128
embedding_dim=300
Encoder
encoder_inputs = Input(shape=(text_max_num_words,))
# embedding layer
enc_emb = Embedding(
text_vocab_size,
embedding_dim,
embeddings_initializer=tf.keras.initializers.Constant(embeddings_matrix_text),
trainable=True,
)(encoder_inputs)
# The output of the embedding layer is fed to a LSTM
# encoder lstm 1
encoder_lstm1 = LSTM(latent_dim,return_sequences=True,return_state=True,dropout=0.4,recurrent_dropout=0.4)
encoder_output1, state_h1, state_c1 = encoder_lstm1(enc_emb)
# encoder lstm 2
encoder_lstm2 = LSTM(latent_dim,return_sequences=True,return_state=True,dropout=0.4,recurrent_dropout=0.4)
encoder_output2, state_h2, state_c2 = encoder_lstm2(encoder_output1)
# encoder lstm 3
encoder_lstm3=LSTM(latent_dim, return_state=True, return_sequences=True,dropout=0.4,recurrent_dropout=0.4)
encoder_outputs, state_h, state_c= encoder_lstm3(encoder_output2)
Decoder
# Set up the decoder, using `encoder_states` as initial state.
decoder_inputs = Input(shape=(None,))
# embedding layer
dec_emb_layer = Embedding(
summary_vocab_size,
embedding_dim,
embeddings_initializer=tf.keras.initializers.Constant(embeddings_matrix_summary),
trainable=True,
)
dec_emb = dec_emb_layer(decoder_inputs)
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True,dropout=0.4,recurrent_dropout=0.2)
decoder_outputs,decoder_fwd_state, decoder_back_state = decoder_lstm(dec_emb,initial_state=[state_h, state_c])
Attention
The main intuition behind the attention mechanism is to learn, in order to generate the summary word at time step $t$, how much weight, i.e. attention, should be assigned to each word of the input sequence. For instance, for the input sentence "The weather is so nice today that I can imagine myself being outdoors for the entire day" and the summary "Love this weather!", the model should attend mostly to input words such as "weather" and "nice" when generating the summary, rather than weighting every input word equally.
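For reference, the quantities computed by the attention layer below can be written as follows (a standard statement of additive attention following Bahdanau et al., arXiv:1409.0473, with the weight names matching the $W_a$, $U_a$, and $V_a$ defined in the class, $h_j$ the $j$-th encoder output, and $s_t$ the decoder output at step $t$):
$$e_{t,j} = V_a^{\top}\tanh\left(W_a h_j + U_a s_t\right), \qquad \alpha_{t,j} = \frac{\exp(e_{t,j})}{\sum_{k}\exp(e_{t,k})}, \qquad c_t = \sum_{j}\alpha_{t,j}\, h_j$$
The context vector $c_t$ is then concatenated with the decoder output before the final softmax layer, which is exactly what the Concatenate layer applied after the attention layer does.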
from tensorflow.python.keras.layers import Layer
from tensorflow.python.keras import backend as K
class BahdanauAttention(Layer):
"""
This class implements Bahdanau attention (https://arxiv.org/pdf/1409.0473.pdf).
There are three sets of weights introduced W_a, U_a, and V_a
"""
def __init__(self, **kwargs):
super(BahdanauAttention, self).__init__(**kwargs)
def build(self, input_shape):
assert isinstance(input_shape, list)
# Create a trainable weight variable for this layer.
self.W_a = self.add_weight(name='W_a',
shape=tf.TensorShape((input_shape[0][2], input_shape[0][2])),
initializer='uniform',
trainable=True)
self.U_a = self.add_weight(name='U_a',
shape=tf.TensorShape((input_shape[1][2], input_shape[0][2])),
initializer='uniform',
trainable=True)
self.V_a = self.add_weight(name='V_a',
shape=tf.TensorShape((input_shape[0][2], 1)),
initializer='uniform',
trainable=True)
super(BahdanauAttention, self).build(input_shape) # Be sure to call this at the end
def call(self, inputs, verbose=False):
"""
inputs: [encoder_output_sequence, decoder_output_sequence]
"""
assert type(inputs) == list
encoder_out_seq, decoder_out_seq = inputs
if verbose:
print('encoder_out_seq>', encoder_out_seq.shape)
print('decoder_out_seq>', decoder_out_seq.shape)
def energy_step(inputs, states):
""" Step function for computing energy for a single decoder state
inputs: (batchsize * 1 * de_in_dim)
states: (batchsize * 1 * de_latent_dim)
"""
assert_msg = "States must be an iterable. Got {} of type {}".format(states, type(states))
assert isinstance(states, list) or isinstance(states, tuple), assert_msg
""" Some parameters required for shaping tensors"""
en_seq_len, en_hidden = encoder_out_seq.shape[1], encoder_out_seq.shape[2]
de_hidden = inputs.shape[-1]
""" Computing S.Wa where S=[s0, s1, ..., si]"""
# <= batch size * en_seq_len * latent_dim
W_a_dot_s = K.dot(encoder_out_seq, self.W_a)
""" Computing hj.Ua """
U_a_dot_h = K.expand_dims(K.dot(inputs, self.U_a), 1) # <= batch_size, 1, latent_dim
if verbose:
print('Ua.h>', U_a_dot_h.shape)
""" tanh(S.Wa + hj.Ua) """
# <= batch_size*en_seq_len, latent_dim
Ws_plus_Uh = K.tanh(W_a_dot_s + U_a_dot_h)
if verbose:
print('Ws+Uh>', Ws_plus_Uh.shape)
""" softmax(va.tanh(S.Wa + hj.Ua)) """
# <= batch_size, en_seq_len
e_i = K.squeeze(K.dot(Ws_plus_Uh, self.V_a), axis=-1)
# <= batch_size, en_seq_len
e_i = K.softmax(e_i)
if verbose:
print('ei>', e_i.shape)
return e_i, [e_i]
def context_step(inputs, states):
""" Step function for computing ci using ei """
assert_msg = "States must be an iterable. Got {} of type {}".format(states, type(states))
assert isinstance(states, list) or isinstance(states, tuple), assert_msg
# <= batch_size, hidden_size
c_i = K.sum(encoder_out_seq * K.expand_dims(inputs, -1), axis=1)
if verbose:
print('ci>', c_i.shape)
return c_i, [c_i]
fake_state_c = K.sum(encoder_out_seq, axis=1)
fake_state_e = K.sum(encoder_out_seq, axis=2) # <= (batch_size, enc_seq_len)
""" Computing energy outputs """
# e_outputs => (batch_size, de_seq_len, en_seq_len)
last_out, e_outputs, _ = K.rnn(
energy_step, decoder_out_seq, [fake_state_e],
)
""" Computing context vectors """
last_out, c_outputs, _ = K.rnn(
context_step, e_outputs, [fake_state_c],
)
return c_outputs, e_outputs
def compute_output_shape(self, input_shape):
""" Outputs produced by the layer """
return [
tf.TensorShape((input_shape[1][0], input_shape[1][1], input_shape[1][2])),
tf.TensorShape((input_shape[1][0], input_shape[1][1], input_shape[0][1]))
]
# Attention layer
attn_layer = BahdanauAttention(name='attention_layer')
attn_out, attn_states = attn_layer([encoder_outputs, decoder_outputs])
# Concat attention input and decoder LSTM output
decoder_concat_input = Concatenate(axis=-1, name='concat_layer')([decoder_outputs, attn_out])
# dense layer
decoder_dense = TimeDistributed(Dense(summary_vocab_size, activation='softmax'))
decoder_outputs = decoder_dense(decoder_concat_input)
# Define the model
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.summary()
Model: "model" __________________________________________________________________________________________________ Layer (type) Output Shape Param # Connected to ================================================================================================== input_1 (InputLayer) [(None, 150)] 0 __________________________________________________________________________________________________ embedding (Embedding) (None, 150, 300) 8116200 input_1[0][0] __________________________________________________________________________________________________ lstm (LSTM) [(None, 150, 128), ( 219648 embedding[0][0] __________________________________________________________________________________________________ input_2 (InputLayer) [(None, None)] 0 __________________________________________________________________________________________________ lstm_1 (LSTM) [(None, 150, 128), ( 131584 lstm[0][0] __________________________________________________________________________________________________ embedding_1 (Embedding) (None, None, 300) 1812300 input_2[0][0] __________________________________________________________________________________________________ lstm_2 (LSTM) [(None, 150, 128), ( 131584 lstm_1[0][0] __________________________________________________________________________________________________ lstm_3 (LSTM) [(None, None, 128), 219648 embedding_1[0][0] lstm_2[0][1] lstm_2[0][2] __________________________________________________________________________________________________ attention_layer (BahdanauAttent ((None, None, 128), 32896 lstm_2[0][0] lstm_3[0][0] __________________________________________________________________________________________________ concat_layer (Concatenate) (None, None, 256) 0 lstm_3[0][0] attention_layer[0][0] __________________________________________________________________________________________________ time_distributed (TimeDistribut (None, None, 6041) 1552537 concat_layer[0][0] ================================================================================================== Total params: 12,216,397 Trainable params: 12,216,397 Non-trainable params: 0 __________________________________________________________________________________________________
Training the model
from tensorflow.keras.optimizers import Nadam, SGD
# opt = SGD(lr=0.001)
model.compile(optimizer='Nadam', loss='sparse_categorical_crossentropy')
model_config = model.get_config()
model_config['name'] = "seq2seq"
checkpoint = ModelCheckpoint(filepath=f"{model_config['name']}.h5",
monitor='val_loss',
mode='min',
verbose=1,
save_best_only=True)
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=30)
callbacks_list = [checkpoint, es]
# fit model
history = model.fit([x_tr,y_tr[:,:-1]],
y_tr.reshape(y_tr.shape[0],y_tr.shape[1], 1)[:,1:],
validation_data=([x_val,y_val[:,:-1]],
y_val.reshape(y_val.shape[0],y_val.shape[1], 1)[:,1:]),
epochs=20,
batch_size=256,
callbacks=callbacks_list,
shuffle=False)
Epoch 1/20 576/576 [==============================] - ETA: 0s - loss: 2.1279 Epoch 00001: val_loss improved from inf to 1.84586, saving model to seq2seq.h5 576/576 [==============================] - 1143s 2s/step - loss: 2.1279 - val_loss: 1.8459 Epoch 2/20 576/576 [==============================] - ETA: 0s - loss: 1.7921 Epoch 00002: val_loss improved from 1.84586 to 1.71711, saving model to seq2seq.h5 576/576 [==============================] - 1139s 2s/step - loss: 1.7921 - val_loss: 1.7171 Epoch 3/20 576/576 [==============================] - ETA: 0s - loss: 1.6870 Epoch 00003: val_loss improved from 1.71711 to 1.64130, saving model to seq2seq.h5 576/576 [==============================] - 1148s 2s/step - loss: 1.6870 - val_loss: 1.6413 Epoch 4/20 576/576 [==============================] - ETA: 0s - loss: 1.6165 Epoch 00004: val_loss improved from 1.64130 to 1.59147, saving model to seq2seq.h5 576/576 [==============================] - 1141s 2s/step - loss: 1.6165 - val_loss: 1.5915 Epoch 5/20 576/576 [==============================] - ETA: 0s - loss: 1.5621 Epoch 00005: val_loss improved from 1.59147 to 1.54568, saving model to seq2seq.h5 576/576 [==============================] - 1144s 2s/step - loss: 1.5621 - val_loss: 1.5457 Epoch 6/20 576/576 [==============================] - ETA: 0s - loss: 1.5087 Epoch 00006: val_loss improved from 1.54568 to 1.50402, saving model to seq2seq.h5 576/576 [==============================] - 1142s 2s/step - loss: 1.5087 - val_loss: 1.5040 Epoch 7/20 576/576 [==============================] - ETA: 0s - loss: 1.4583 Epoch 00007: val_loss improved from 1.50402 to 1.46740, saving model to seq2seq.h5 576/576 [==============================] - 1151s 2s/step - loss: 1.4583 - val_loss: 1.4674 Epoch 8/20 576/576 [==============================] - ETA: 0s - loss: 1.4145 Epoch 00008: val_loss improved from 1.46740 to 1.44078, saving model to seq2seq.h5 576/576 [==============================] - 1137s 2s/step - loss: 1.4145 - val_loss: 1.4408 Epoch 9/20 576/576 [==============================] - ETA: 0s - loss: 1.3773 Epoch 00009: val_loss improved from 1.44078 to 1.42007, saving model to seq2seq.h5 576/576 [==============================] - 1155s 2s/step - loss: 1.3773 - val_loss: 1.4201 Epoch 10/20 576/576 [==============================] - ETA: 0s - loss: 1.3461 Epoch 00010: val_loss improved from 1.42007 to 1.40498, saving model to seq2seq.h5 576/576 [==============================] - 1141s 2s/step - loss: 1.3461 - val_loss: 1.4050 Epoch 11/20 576/576 [==============================] - ETA: 0s - loss: 1.3180 Epoch 00011: val_loss improved from 1.40498 to 1.39290, saving model to seq2seq.h5 576/576 [==============================] - 1146s 2s/step - loss: 1.3180 - val_loss: 1.3929 Epoch 12/20 576/576 [==============================] - ETA: 0s - loss: 1.2928 Epoch 00012: val_loss improved from 1.39290 to 1.38313, saving model to seq2seq.h5 576/576 [==============================] - 1140s 2s/step - loss: 1.2928 - val_loss: 1.3831 Epoch 13/20 576/576 [==============================] - ETA: 0s - loss: 1.2705 Epoch 00013: val_loss improved from 1.38313 to 1.37605, saving model to seq2seq.h5 576/576 [==============================] - 1145s 2s/step - loss: 1.2705 - val_loss: 1.3761 Epoch 14/20 576/576 [==============================] - ETA: 0s - loss: 1.2500 Epoch 00014: val_loss improved from 1.37605 to 1.36912, saving model to seq2seq.h5 576/576 [==============================] - 1136s 2s/step - loss: 1.2500 - val_loss: 1.3691 Epoch 15/20 576/576 
[==============================] - ETA: 0s - loss: 1.2316 Epoch 00015: val_loss improved from 1.36912 to 1.36402, saving model to seq2seq.h5 576/576 [==============================] - 1138s 2s/step - loss: 1.2316 - val_loss: 1.3640 Epoch 16/20 576/576 [==============================] - ETA: 0s - loss: 1.2141 Epoch 00016: val_loss improved from 1.36402 to 1.36063, saving model to seq2seq.h5 576/576 [==============================] - 1139s 2s/step - loss: 1.2141 - val_loss: 1.3606 Epoch 17/20 576/576 [==============================] - ETA: 0s - loss: 1.1979 Epoch 00017: val_loss improved from 1.36063 to 1.35715, saving model to seq2seq.h5 576/576 [==============================] - 1142s 2s/step - loss: 1.1979 - val_loss: 1.3572 Epoch 18/20 576/576 [==============================] - ETA: 0s - loss: 1.1833 Epoch 00018: val_loss improved from 1.35715 to 1.35532, saving model to seq2seq.h5 576/576 [==============================] - 1141s 2s/step - loss: 1.1833 - val_loss: 1.3553 Epoch 19/20 576/576 [==============================] - ETA: 0s - loss: 1.1693 Epoch 00019: val_loss improved from 1.35532 to 1.35249, saving model to seq2seq.h5 576/576 [==============================] - 1142s 2s/step - loss: 1.1693 - val_loss: 1.3525 Epoch 20/20 576/576 [==============================] - ETA: 0s - loss: 1.1561 Epoch 00020: val_loss improved from 1.35249 to 1.35094, saving model to seq2seq.h5 576/576 [==============================] - 1139s 2s/step - loss: 1.1561 - val_loss: 1.3509
# save the results
import pickle
f = open(f"{model_config['name']}_history.pkl", "wb")
pickle.dump(history.history, f)
f.close()
Post Processing
Visualizing the model history
def col_to_hex(n, colmap='tab20'):
"""colormap to n hex colors"""
out = []
for i in range(n):
r,g,b,_ = plt.cm.get_cmap(colmap,n)(i)
out.append(f"#{int(r*255):02x}{int(g*255):02x}{int(b*255):02x}")
return out
model_names = ['CN', 'noembedding', 'Glove_50', 'Glove_100', 'Glove_300']
histories = []
for model in model_names:
history_file=f"seq2seq_{model}_history.pkl"
f = open(history_file, "rb")
histories.append(pickle.load(f))
f.close()
sns.set_style('white')
fig, ax = plt.subplots(1, 1, figsize=(10, 6), dpi=80)
sns.set(font_scale=1.5)
sns.despine()
n_epochs=20
entities = ["train", "val"]
metrics = "loss"
colors = col_to_hex(len(model_names))
for idx, model_name in enumerate(model_names):
ax.plot(np.arange(1,n_epochs+1), histories[idx][metric], marker='s',
mfc='white',color=colors[idx], linestyle='--', label=f'{model_name} (train)')
ax.plot(np.arange(1,n_epochs+1), histories[idx]['val_'+metric], marker='o',
mfc=colors[idx], color=colors[idx], linestyle=':', label=f'{model_name} (val)')
ax.set_xlabel("epochs", labelpad=10)
ax.set_ylabel("loss", labelpad=10)
ax.legend(bbox_to_anchor=(1,1,0,0), loc='upper right', fontsize=12)
func = lambda x, pos: f"${x:0.0f}$"
ax.xaxis.set_major_locator(ticker.MultipleLocator(2))
ax.xaxis.set_major_formatter(ticker.FuncFormatter(func))
Inference
inverse_summary_word_index=summary_tokenizer.index_word
inverse_text_word_index=text_tokenizer.index_word
summary_word_index=summary_tokenizer.word_index
text_word_index=text_tokenizer.word_index
# Encode the input sequence to get the feature vector
encoder_model = Model(inputs=encoder_inputs,outputs=[encoder_outputs, state_h, state_c])
# Decoder setup
# Below tensors will hold the states of the previous time step
decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))
decoder_hidden_state_input = Input(shape=(text_max_num_words,latent_dim))
# Get the embeddings of the decoder sequence
dec_emb2= dec_emb_layer(decoder_inputs)
# To predict the next word in the sequence, set the initial states to the states from the previous time step
decoder_outputs2, state_h2, state_c2 = decoder_lstm(dec_emb2, initial_state=[decoder_state_input_h, decoder_state_input_c])
# Attention inference
attn_out_inf, attn_states_inf = attn_layer([decoder_hidden_state_input, decoder_outputs2])
decoder_inf_concat = Concatenate(axis=-1, name='concat')([decoder_outputs2, attn_out_inf])
# A dense softmax layer to generate prob dist. over the target vocabulary
decoder_outputs2 = decoder_dense(decoder_inf_concat)
# Final decoder model
decoder_model = Model(
[decoder_inputs] + [decoder_hidden_state_input,decoder_state_input_h, decoder_state_input_c],
[decoder_outputs2] + [state_h2, state_c2])
Greedy Search
Greedy search identifies the token with the highest probability at each time step, appends the word corresponding to that token to the generated summary, and feeds the token back to the model to generate the next word. Summary generation continues until a stop condition is met: either the end token is generated or the summary reaches summary_max_num_words.
def greedy_search(text):
input_seq = text2seq(text)
# Encode the input as state vectors.
e_out, e_h, e_c = encoder_model.predict(input_seq)
# Generate empty target sequence of length 1.
target_seq = np.zeros((1,1))
# Populate the first word of target sequence with the start word.
target_seq[0, 0] = summary_word_index['starttoken']
stop_condition = False
decoded_sentence = ''
while not stop_condition:
output_tokens, h, c = decoder_model.predict([target_seq] + [e_out, e_h, e_c])
# Sample a token
sampled_token_index = np.argmax(output_tokens[0, -1, :])
sampled_token = inverse_summary_word_index[sampled_token_index]
if(sampled_token!='starttoken'):
# Exit condition: either hit max length or find stop word.
if (sampled_token == 'endtoken' or len(decoded_sentence.split()) >= (summary_max_num_words-1)):
stop_condition = True
else:
decoded_sentence += ' '+sampled_token
# Update the target sequence (of length 1).
target_seq = np.zeros((1,1))
target_seq[0, 0] = sampled_token_index
# Update internal states
e_h, e_c = h, c
return decoded_sentence
def seq2summary(input_seq):
newString=''
for i in input_seq:
if((i!=0 and i!=summary_word_index['starttoken']) and i!=summary_word_index['endtoken']):
newString=newString+inverse_summary_word_index[i]+' '
return newString.strip()
def seq2text(input_seq):
newString=''
for i in input_seq:
if(i!=0):
newString=newString+inverse_text_word_index[i]+' '
return newString.strip()
def text2seq(text):
seq = []
cleaned_text = clean_text(text)
for word in cleaned_text.split():
if word in text_word_index.keys():
if text_word_index[word] < text_vocab_size:
seq.append(text_word_index[word])
if len(seq) > text_max_num_words:
return np.array(seq)[:text_max_num_words].reshape(1,text_max_num_words).astype('int32')
input_seq = np.zeros((text_max_num_words))
input_seq[:len(seq)]=np.array(seq)
return input_seq.reshape(1,text_max_num_words).astype('int32')
for i in range(5):
print("Review:",data['text'].iloc[i])
print("Original summary:",ndata['summary'].iloc[i])
print("Predicted summary:", greedy_search(data['text'].iloc[i]))
print("\n")
Review: tried overseas last year remember exactly since whirlwind tour four countries would guess bangkok kuala lumpur hotel could find brand costco supermarkets shop price amazon lot regular lipton brand tempted order anyway Original summary: yellow label lipton tea Predicted summary: lipton lipton tea bags Review: first came across lipton yellow label tea trip france many years ago became favorite tea travel found china dubai europe live small town course could find local grocery tea shop shelves always buy boxes travel delighted see sale amazon flavor light clean Original summary: a great tea Predicted summary: best tea ever Review: first tasted caracas business trip south america immediately tasted difference usual lipton sell states much better bitterness great aftertaste bought next trip back contacted lipton sell states said latin america region 15 years ago could even get yellow label london started get internet suppliers favorite went business went back old stand amazon getting since everyone taste falls love much better standard lipton tea buy love black tea must Original summary: best black tea Predicted summary: best ceylon tea have found Review: best tea ever first france readily available except online marketed us lipton must bought third party sellers nothing like american market lipton tea round flavor much smoother subtle richness highly recommended Original summary: best tea ever nothing like the american lipton tea Predicted summary: best tea ever Review: wow new flavor block real tea looking received amazon drank first cup sweat felt really effect whole body specially head direct effects way thinking love never drink tea one Original summary: wow this is outstanding Predicted summary: i like this tea
# decoder model + attention
decoder_model_attn = Model(
[decoder_inputs] + [decoder_hidden_state_input,decoder_state_input_h, decoder_state_input_c],
[decoder_outputs2] + [state_h2, state_c2, attn_states_inf])
import matplotlib.ticker as ticker  # used below for MultipleLocator

def plot_attention(attention_layer_weights, text, summary, cmap='viridis'):
    # Plot the attention weights as a heatmap: one row per summary word, one column per input word
    fig, ax = plt.subplots(1, 1, figsize=(10, 10), dpi=100)
    ax.matshow(attention_layer_weights[:, :], cmap=cmap)
    # Scale the tick-label font size with the length of the input text
    if len(text) < 50:
        text_fs = 14
    elif len(text) < 80:
        text_fs = 11
    else:
        text_fs = 9
    ax.set_xticklabels([' '] + text, fontsize=text_fs, rotation=90)
    ax.set_yticklabels([' '] + summary, fontsize=text_fs)
    ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
    ax.yaxis.set_major_locator(ticker.MultipleLocator(1))
    plt.show()
Attention Plots
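Before generating the plots, it helps to recall what the heatmaps will show. At each decoding step $t$, the attention layer produces a weight $\alpha_{t,i}$ for every encoder position $i$; the weights are a softmax over alignment scores $e_{t,i}$ between the current decoder state and the encoder outputs $h_i$, and they also define the context vector $c_t$ that is fed to the decoder (this is the standard attention formulation; the notation below is generic and not tied to a particular scoring function):

$$\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j}\exp(e_{t,j})}, \qquad c_t = \sum_{i}\alpha_{t,i}\,h_i$$

In the greedy_search function below, the weight vector returned by decoder_model_attn at each step is stored as one row of attention_plot, so each row of the heatmap corresponds to one generated summary word and each column to one input word.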
def greedy_search(text):
    # To store the attention weights of each generated summary word
    attention_plot = np.zeros((summary_max_num_words, text_max_num_words))
    # Convert the raw text into a zero-padded integer sequence
    input_seq = text2seq(text)
    # Encode the input: returns encoder_outputs, state_h, state_c
    e_out, e_h, e_c = encoder_model.predict(input_seq)
    # Generate empty target sequence of length 1.
    target_seq = np.zeros((1, 1))
    # Populate the first word of the target sequence with the start token.
    target_seq[0, 0] = summary_word_index['starttoken']
    stop_condition = False
    decoded_summary = ''
    counter = 0
    while not stop_condition:
        output_tokens, h, c, attention_weights = decoder_model_attn.predict([target_seq] + [e_out, e_h, e_c])
        # Sample the most likely token
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_token = inverse_summary_word_index[sampled_token_index]
        if sampled_token != 'starttoken':
            # Exit condition: either hit max length or find the stop token.
            if sampled_token == 'endtoken' or len(decoded_summary.split()) >= (summary_max_num_words - 1):
                stop_condition = True
            else:
                decoded_summary += ' ' + sampled_token
                # Store the attention weights to plot later on
                attention_plot[counter] = attention_weights
                counter += 1
        # Update the target sequence (of length 1).
        target_seq = np.zeros((1, 1))
        target_seq[0, 0] = sampled_token_index
        # Update internal states
        e_h, e_c = h, c
    return text, decoded_summary, attention_plot
# Summarize
def summarize(original_text, original_summary=None, algo='greedy'):
    if algo == 'greedy':
        text, summary, attention_plot = greedy_search(original_text)
    else:
        print("Algorithm {} not implemented".format(algo))
        return
    print(f'Input text: {original_text}')
    if original_summary is not None:
        print(f'** Original Summary: {original_summary}')
    print(f'** Predicted Summary: {summary}')
    text = text.strip().split(' ')
    summary = summary.strip().split(' ')
    attention_plot = attention_plot[:len(summary), :len(text)]
    plot_attention(attention_plot, text, summary)
data['text_word_count'] = data['text'].apply(lambda x: len(x.strip().split()))
data['summary_word_count'] = data['summary'].apply(lambda x: len(x.strip().split()))
test_data = data[(data.text_word_count > 10) & (data.text_word_count < 40)]
for i in range(5):
    input_text = test_data['text'].iloc[i]
    original_summary = test_data['summary'].iloc[i]
    summarize(input_text, original_summary=original_summary)
Input text: tried overseas last year remember exactly since whirlwind tour four countries would guess bangkok kuala lumpur hotel could find brand costco supermarkets shop price amazon lot regular lipton brand tempted order anyway
** Original Summary: yellow label lipton tea
** Predicted Summary: lipton lipton tea bags

Input text: best tea ever first france readily available except online marketed us lipton must bought third party sellers nothing like american market lipton tea round flavor much smoother subtle richness highly recommended
** Original Summary: best tea ever nothing like the american lipton tea
** Predicted Summary: best tea ever

Input text: wow new flavor block real tea looking received amazon drank first cup sweat felt really effect whole body specially head direct effects way thinking love never drink tea one
** Original Summary: wow this is outstanding
** Predicted Summary: i like this tea

Input text: like product rich full bodied flavor lipton tea long long ago weak tea drinker e dunks bag cup water buy regular lipton black tea like strong rich flavored product spend bit purchase yellow label tea believe disappointed
** Original Summary: what is not to like about this product
** Predicted Summary: not as good as the same as the same as the same as the same

Input text: best tea ever sold us market us anything like lipton buy us rich dark tasty also come wrapped heavy duty plastic bag keeps fresh protects box stars way
** Original Summary: the best tea flavor
** Predicted Summary: best tea ever
test_data = data[(data.text_word_count > 50) & (data.text_word_count < 90)]
for i in range(40, 45):
    input_text = test_data['text'].iloc[i]
    original_summary = test_data['summary'].iloc[i]
    summarize(input_text, original_summary=original_summary)
Input text: difficulty breathing due viral infection chronic asthma tea keep hand time case someone family develops cough addition various western herbs support clear breathing eucalyptus pleurisy root tea traditional chinese herbal mixture bi yan pan magic ingredient helps clear mucous lungs grateful amazon carries entire traditional medicinals line sometimes hard find breathe easy tea stores variety competitors market teas claim provide relief breathing difficulties breathe easy outclasses others tried plus pleasant flavor licorice peppermint especially compared herbal teas perceptible medical effects highly recommend breathe easy tea
** Original Summary: herbal remedy for bronchial distress
** Predicted Summary: great for sore throats

Input text: acupuncturist said bi yan pian good allergies sinus problems suffer extreme allergies going acupuncture get rid body reaction allergies meantime going antihistamines tea really works wonders bi yan pian disolves mucus allergy reactions thereby letting body get rid toxins antihistamines dry making toxins stay body oh yeah lowers immunity get sick good reason go anithistamines tea tastes good combination use honey ginger amazing continue using tea needed works well drink much actually dry acupuncture part hope helps decide
** Original Summary: very taste will continue to purchase
** Predicted Summary: great for your health

Input text: let tell tea mmk baby comes sick right dying cold cannot breathe sad miserable like hey lets try tea maybe help best thing ever happen breathe magic swear seeped full 15 minutes worth also tastes awesome want buy tea 10 10 would recommend write review breathing normally despite cold telling buy
** Original Summary: magic in cup
** Predicted Summary: i am so happy to have this

Input text: get wrong like tea lot tastes good cooling effect overall nice herbal tea however reading reviews see people impression herbs made organic garden california small time company company based california source herbs around world read vague description countries areas herbs come website many two herbs come us anything wrong misleading packaging
** Original Summary: a good tea for winter
** Predicted Summary: good for you

Input text: past months suffered regular sinus congestion considering product sure know frustrating able breathe properly tea wonders drink sinuses clear remarkably finally breathe way want tremendous relief moments cannot stand congestion definitely turn tea still looking long term solution keep shelf find downside taste initially pleasant leaves annoyingly sweet aftertaste back mouth yuck worth
** Original Summary: cloyingly sweet aftertaste but totally worth it
** Predicted Summary: works for me
Beam Search
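Greedy decoding commits to the single most likely token at every step, which can lock the decoder into a poor summary early on. Beam search instead keeps the beam_width most promising partial summaries at each step, extends each of them, and scores every candidate by the cumulative negative log-probability of its tokens (lower is better):

$$\text{score}(y_1,\dots,y_t) = -\sum_{k=1}^{t}\log p\left(y_k \mid y_{<k}, x\right)$$

This is exactly the quantity accumulated by the line score = row[3] - np.log(token_score) in the implementation below, and it is the number printed next to each candidate summary in the results further down. Once no beam can be extended any further, the lowest-scoring (most probable) beam is returned as the final summary.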
# Beam search implementation
def beam_search(text, beam_width=3, text_max_num_words=150, summary_max_num_words=20,
                num_units=128, start_token='starttoken', end_token='endtoken', verbose=True):
    attention_plot = np.zeros((summary_max_num_words, text_max_num_words))
    # Convert the raw text into a zero-padded integer sequence
    input_seq = text2seq(text)
    e_out, e_h, e_c = encoder_model.predict(input_seq)
    # Generate empty target sequence of length 1.
    target_seq = np.zeros((1, 1))
    # Populate the first word of the target sequence with the start token.
    target_seq[0, 0] = summary_word_index[start_token]
    end_token_id = summary_word_index[end_token]
    # Initial beam: (tokens, last hidden state, attention plot, score);
    # the last hidden state starts as the encoder hidden state e_h.
    start_pt = [(target_seq, e_h, attention_plot, 0.0)]
    stop_condition = False
    counter = 0
    decoded_summary = ''
    while not stop_condition:
        candidates = []  # candidate extensions of the current beams
        for row in start_pt:
            # handle beams that have already emitted the end signal
            allend = True
            dec_input = row[0].ravel()[-1]  # last token in this beam
            if dec_input != end_token_id:
                tmp = np.zeros((1, 1))
                # Feed the last sampled token of this beam back to the decoder
                tmp[0, 0] = dec_input
                dec_input = tmp
                e_h = row[1]  # second item is the decoder hidden state
                attention_plt = np.zeros((summary_max_num_words, text_max_num_words)) + row[2]  # copy of this beam's attention weights
                output_tokens, h, c, attention_weights = decoder_model_attn.predict([dec_input] + [e_out, e_h, e_c])
                # storing the attention weights to plot later on
                attention_plt[counter] = attention_weights
                # take the top-k tokens in this beam, where k is the beam width
                top_k_indices = np.argsort(output_tokens[0, -1, :])[::-1][:beam_width]
                top_k_scores = output_tokens[0, -1, :][top_k_indices]
                for token_index, token_score in zip(top_k_indices, top_k_scores):
                    sampled_token = inverse_summary_word_index[token_index]
                    score = row[3] - np.log(token_score)
                    tmp = np.hstack((row[0], np.array(token_index).reshape(1, 1)))  # extend this beam's token sequence
                    candidates.append((tmp, h, attention_plt, score))
                    # Stop once a beam emits the end token or reaches the maximum summary length
                    if token_index == end_token_id or candidates[-1][0].size >= (summary_max_num_words - 1):
                        stop_condition = True
                allend = False
            else:
                candidates.append(row)  # add ended beams back in
        if allend:
            break  # end the loop, as the remaining sequences have ended
        # Update internal states
        e_h, e_c = h, c
        # keep the beam_width candidates with the lowest cumulative negative log-likelihood
        start_pt = sorted(candidates, key=lambda x: x[3])[:beam_width]
        counter += 1
    if verbose:
        # print all the final summaries
        for i, row in enumerate(start_pt):
            tokens = [x for x in row[0].ravel() if x > end_token_id]  # drop padding/start/end tokens (end_token_id = 2)
            print("Summary {} with {:5f}: {}".format(i, row[3], seq2summary(tokens)))
    # return the best-scoring sequence
    summary = seq2summary([x for x in start_pt[0][0].ravel() if x > end_token_id])
    attention_plot = start_pt[0][2]  # third item in the tuple
    return text, summary, attention_plot
# Summarize
def summarize(text, original_summary=None, algo='greedy', beam_width=3, verbose=1):
    if algo == 'greedy':
        text, summary, attention_plot = greedy_search(text)
    elif algo == 'beam':
        text, summary, attention_plot = beam_search(text, beam_width=beam_width, verbose=verbose)
    else:
        print("Algorithm {} not implemented".format(algo))
        return
    print(f'Input text: {text}')
    if original_summary is not None:
        print(f'** Original Summary: {original_summary}')
    print(f'** Predicted Summary: {summary}')
    text = text.strip().split(' ')
    summary = summary.strip().split(' ')
    attention_plot = attention_plot[:len(summary), :len(text)]
    plot_attention(attention_plot, text, summary)
test_data = data[(data.text_word_count > 10) & (data.text_word_count < 40)]
for i in range(15, 20):
    input_text = test_data['text'].iloc[i]
    original_summary = test_data['summary'].iloc[i]
    summarize(input_text, original_summary=original_summary, algo='beam', beam_width=5)
Summary 0 with 6.054612: not the tea is
Summary 1 with 6.473819: this tea is the
Summary 2 with 6.594498: this tea is not
Summary 3 with 6.935582: not what is the
Summary 4 with 6.987707: not the tea was
Input text: discovered yellow label tea e asia strong consistently fresh never bitter tea comes tiny rolled leaves ie balls even brewing remains balls easier get cup leaves
** Original Summary: fresh strong tea
** Predicted Summary: not the tea is

Summary 0 with 8.859888: beware of the tea were
Summary 1 with 9.179970: beware of the seller is
Summary 2 with 9.286277: beware of the seller
Summary 3 with 9.296453: beware of the product were
Summary 4 with 9.306166: beware of the seller was
Input text: received brand new box tea amazon expires weeks tomorrow mailed back going two stars excellent tea fresh update going one star since grocery items cannot returned shame since good tea
** Original Summary: good tea old product
** Predicted Summary: beware of the tea were

Summary 0 with 5.130577: great for beginners
Summary 1 with 5.940405: i love it
Summary 2 with 6.115343: great for beginners and
Summary 3 with 6.839922: great for beginners but
Summary 4 with 7.025602: my favorite for beginners
Input text: glad tough find local store makes nice cup tea know buying enjoy used loose leaves little bags filled powder try smaller size first see like love
** Original Summary: it is nice and difficult to find locally
** Predicted Summary: great for beginners

Summary 0 with 3.939362: best black tea
Summary 1 with 4.403360: the best black tea
Summary 2 with 5.191790: my favorite tea
Summary 3 with 5.632589: the best tea
Summary 4 with 6.030829: best tea for the
Input text: black tea harsh black teas churn stomach breakfast one get dark let maintains nice flavor matter long short let steep excellent product loose teas try forlife brew mug extra fine tea infuser lid best infuser owned
** Original Summary: love this tea
** Predicted Summary: best black tea

Summary 0 with 5.473339: this is the best
Summary 1 with 6.242138: pero is the best
Summary 2 with 6.819582: the best coffee
Summary 3 with 7.568939: this is not the
Summary 4 with 7.609267: i am not the
Input text: doctor suggested drank coffee try organo gold cafe supreme doctor advised regrets noticed increase vitality alertness taste name implies supreme advising family friends others merits beverage hot iced quite refreshing
** Original Summary: organo gold cafe supreme 100 certified ganoderma extract sealed
** Predicted Summary: this is the best
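One pattern worth noting in these results: because the beam score is a plain sum of negative log-probabilities, shorter candidates tend to receive better (lower) scores, which is part of why the top beam is sometimes a very terse or generic phrase. A common refinement, which is not implemented in beam_search above, is to rerank the final beams by a length-normalized score. The sketch below is a minimal illustration of that idea; it assumes a list of candidate tuples in the same (tokens, hidden state, attention, score) format as the start_pt list built inside beam_search, which the function would need to return for this to be plugged in directly.

def rerank_by_length(beams, alpha=0.7):
    # Rerank beam candidates by a length-normalized score (lower is still better).
    # `beams` is assumed to be a list of (token_array, hidden_state, attention, score)
    # tuples, matching the entries of `start_pt` inside beam_search above.
    # `alpha` controls how strongly longer summaries are rewarded; alpha=0
    # reproduces the original ranking.
    def normalized_score(beam):
        num_tokens = beam[0].size  # token count, including the start token
        return beam[3] / (num_tokens ** alpha)
    return sorted(beams, key=normalized_score)

Values of alpha around 0.6 to 0.7 are a commonly used starting point; since this only changes the final ranking of the beams, the decoding loop itself does not need to be modified.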
Summary and Conclusion
References