<img src="./images/qinsti.png" align="left" alt="drawing" width="100"/>
<br><br>
<div align="left">
    <h2>Topic Modeling - Primer</h2>
    <h3>Latent Semantic Analysis on Financial news items</h3>
</div>



## Context

This notebook demonstrates a simple example of topic modeling on a set of news items. The sample items come from
Refinitiv's *Machine Readable News* content offering.

The main steps in building the topic model are:
* Load the news items
* Preprocess the text
  * Load spaCy
  * Remove stopwords
  * Lemmatize
  * Tokenize
* Create a document-term matrix
* Use `sklearn` to perform LSA

## Load spaCy Language Model

The `spaCy` library for Python provides many useful components for NLP tasks, including preprocessing modules, and it interoperates with
several popular deep learning frameworks.

In [1]:
import spacy
import pandas as pd
import numpy as np
# Large English model; install it first with: python -m spacy download en_core_web_lg
nlp = spacy.load('en_core_web_lg')


## Load News Data

The dataset comprises sample news items from the financial domain. The items relate to two topics: *commodity arbitrage* and *loans*.

In [2]:
# Load the tab-separated sample file
news_items = pd.read_csv("./data/news-body-samples-v1.csv", sep="\t")
# Flag items whose body length is between 300 and 6000 characters
cond       = news_items.apply(lambda x: 300 <= len(x['body']) <= 6000, axis=1)
news_items = news_items.assign(l_status=cond)
# Keep only the flagged items, then count the remaining items per topic
news_items = news_items[news_items.l_status]
news_items.topic.value_counts()

N2:COMARB    1484
N2:LOA       1368
Name: topic, dtype: int64

## Define custom tokenizer 

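To use a specific tokenizer with `sklearn`, define a tokenizer function based on the library of your choice. The function below uses spaCy to tokenize the text, drop stopwords and non-alphabetic tokens, and lemmatize what remains.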

In [4]:
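def my_tokenizer(text):
    # Keep alphabetic tokens that are not whitespace, punctuation, stopwords, or number-like
    tokens = [t for t in nlp(text) if t.is_alpha and not (t.is_space or t.is_punct or t.is_stop or t.like_num)]
    # Return lower-cased lemmas; spaCy v2 lemmatizes pronouns to "-PRON-", so keep their lower-cased text instead
    return [t.lemma_.lower().strip() if t.lemma_ != "-PRON-" else t.lower_ for t in tokens]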
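
To see the filtering rules without loading a spaCy model, here is a rough stand-in that uses only the standard library. `STOPWORDS` and `simple_tokenizer` are hypothetical names, the stopword set is a tiny illustrative sample rather than spaCy's list, and no lemmatization is performed, so this only approximates `my_tokenizer`.

```python
# Tiny illustrative stopword set (an assumption, not spaCy's stopword list)
STOPWORDS = {"the", "of", "in", "a", "an", "to", "and", "is", "are", "will", "on"}

def simple_tokenizer(text):
    """Keep purely alphabetic, non-stopword tokens, lower-cased (no lemmatization)."""
    kept = []
    for raw in text.split():
        # Strip surrounding punctuation, then apply the same kind of checks
        # as my_tokenizer: alphabetic only, not a stopword
        word = raw.strip(".,;:!?()<>\"'").lower()
        if word.isalpha() and word not in STOPWORDS:
            kept.append(word)
    return kept

tokens = simple_tokenizer("Bank of Yokohama will dispose of 270 billion yen in problem loans.")
print(tokens)
```

Unlike the spaCy version, this sketch leaves inflected forms such as "loans" untouched, which is exactly the gap that lemmatization closes.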


## Create Document Term Matrix

The `sklearn.feature_extraction.text` module can create a document-term matrix.
Here `TfidfVectorizer` receives the custom tokenizer function through its `tokenizer` parameter.
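
As a reminder of what the matrix entries mean, the weighting below is the one `TfidfVectorizer` applies under its default settings (`smooth_idf=True`, `sublinear_tf=False`, `norm='l2'`); this is a summary under that assumption, not a description of any custom configuration:

$$
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d)\,\mathrm{idf}(t)
$$

Here $n$ is the number of documents, $\mathrm{df}(t)$ is the number of documents containing term $t$, and $\mathrm{tf}(t, d)$ is the count of $t$ in document $d$. Each document row is then scaled to unit Euclidean length.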

In [5]:
news_items.shape

(2852, 3)

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer
# Keep unigrams that occur in at least 20% and at most 90% of the documents
tfidf_vectorizer = TfidfVectorizer(tokenizer=my_tokenizer,
                                   ngram_range=(1, 1),
                                   min_df=0.2,
                                   max_df=0.9,
                                   max_features=1000)


X = tfidf_vectorizer.fit_transform(news_items.iloc[:,0].values)
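
The `min_df` and `max_df` thresholds prune the vocabulary by document frequency, which is why the resulting matrix can have far fewer columns than the `max_features` cap allows. The sketch below mimics that pruning in plain Python on a made-up toy corpus; `prune_vocabulary` is a hypothetical helper, not part of `sklearn`.

```python
from collections import Counter

# Toy tokenized corpus (made-up documents, for illustration only)
docs = [
    ["oil", "price", "rise"],
    ["oil", "loan", "bank"],
    ["oil", "price", "fall"],
    ["oil", "bank", "rate"],
    ["oil", "price", "loan"],
]

def prune_vocabulary(tokenized_docs, min_df, max_df):
    """Keep terms whose share of documents lies within [min_df, max_df]."""
    n_docs = len(tokenized_docs)
    # Count each term at most once per document
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    return sorted(t for t, df in doc_freq.items() if min_df <= df / n_docs <= max_df)

vocab = prune_vocabulary(docs, min_df=0.4, max_df=0.9)
print(vocab)
```

In this toy corpus "oil" appears in every document and is dropped by `max_df`, while terms that appear only once are dropped by `min_df`.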

## Use the `sklearn.decomposition` module to fit LSA

The document-term matrix can then be used as input to an unsupervised LSA model. In `sklearn`, LSA is implemented by `TruncatedSVD`.

In [7]:
from sklearn.decomposition import TruncatedSVD
# Fit a 5-component truncated SVD (LSA) with the ARPACK solver
lsa = TruncatedSVD(n_components=5, random_state=123,
                   algorithm='arpack')

X_topics = lsa.fit_transform(X)

In [8]:
print(X.shape)
print(X_topics.shape)
print(lsa.components_.shape)


(2852, 99)
(2852, 5)
(5, 99)
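
The shapes printed above follow from the rank-$k$ truncated SVD, here with $k = 5$. The notation is introduced for illustration and does not appear in the notebook code:

$$
X \approx U_k \Sigma_k V_k^{\top},
\qquad
X \in \mathbb{R}^{2852 \times 99},\quad
U_k \Sigma_k \in \mathbb{R}^{2852 \times 5},\quad
V_k^{\top} \in \mathbb{R}^{5 \times 99}
$$

`X_topics` corresponds to $U_k \Sigma_k = X V_k$, the score of each document on each topic, and `lsa.components_` corresponds to $V_k^{\top}$, the weight of each term in each topic.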


## Explore the top words of each topic

In [13]:
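# In scikit-learn 1.0 and later, use get_feature_names_out(); get_feature_names() was removed in 1.2
print(tfidf_vectorizer.get_feature_names())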

['accord', 'april', 'ara', 'asia', 'bank', 'barge', 'barrel', 'benchmark', 'bid', 'bp', 'brent', 'cargo', 'cent', 'change', 'click', 'close', 'company', 'compare', 'contract', 'crack', 'crude', 'data', 'datum', 'day', 'demand', 'diesel', 'diff', 'discount', 'double', 'early', 'east', 'edit', 'energy', 'euro', 'europe', 'european', 'expect', 'export', 'fall', 'fob', 'friday', 'fuel', 'future', 'gasoil', 'gmt', 'guide', 'high', 'include', 'international', 'keyword', 'late', 'loan', 'london', 'low', 'march', 'margin', 'market', 'message', 'mogas', 'monday', 'month', 'new', 'northwest', 'o', 'offer', 'oil', 'percent', 'premium', 'previous', 'price', 'rate', 'refinery', 'refining', 'report', 'reuter', 'reuters', 'rise', 'say', 'sell', 'shell', 'show', 'source', 'speed', 'stock', 'sulphur', 'supply', 'swap', 'thursday', 'tonne', 'total', 'trade', 'trader', 'tuesday', 'vitol', 'wednesday', 'week', 'west', 'window', 'year']


In [10]:
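n_top_words = 30

feature_names = tfidf_vectorizer.get_feature_names()
for topic_idx, topic in enumerate(lsa.components_):
    print(topic_idx)
    # Note: the slice [-n_top_words:-1] lists terms in ascending weight order and stops before
    # the highest-weighted term, so 29 terms are printed; use [-n_top_words:] to include it
    print(" ".join([feature_names[i] for i in topic.argsort()[-n_top_words:-1]]))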


0
refinery day margin crack price year sell sulphur future europe reuters loan crude euro fob click percent diff cargo fuel gasoil say bank oil diesel barrel barge ara trade
1
accord bp early expect wednesday total month source monday price friday include tuesday thursday london offer market new asia international percent april company march rate year reuter say loan
2
market company east rate o bid price report edit late week message swap close west month window cent asia cargo offer say source barrel crack new reuters brent oil
3
change data thursday early rise east click fall sulphur diff close gasoil high ara discount euro expect diesel week day month international bp asia percent price market offer bank
4
cent euro low company close supply cargo market march o price include say contract east double source west change brent discount click asia sulphur crack diff diesel oil fuel
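
Each list above contains 29 terms because of how the `argsort` slice in the loop is written. The sketch below reproduces the effect in plain Python on made-up weights; `terms`, `weights`, and the other names are hypothetical, and `sorted(range(...))` stands in for `numpy`'s `argsort`.

```python
# Made-up term weights for a single topic (illustration only)
terms = ["bank", "loan", "oil", "price", "rate"]
weights = [0.10, 0.70, -0.20, 0.40, 0.55]

# Pure-Python stand-in for numpy's argsort: indices in ascending weight order
order = sorted(range(len(weights)), key=lambda i: weights[i])

n_top = 3
# Slice used in the loop above: it stops before the last index, so the
# highest-weighted term is skipped and only n_top - 1 terms are returned
ascending_without_top = [terms[i] for i in order[-n_top:-1]]
# Alternative: take the last n_top indices and reverse them
descending_top = [terms[i] for i in order[-n_top:][::-1]]

print(ascending_without_top)
print(descending_top)
```

With the alternative slice the strongest term comes first, which is usually the more convenient order when reading a topic.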


## Explore sample news items from *topic 2*

Topic 2 (printed above with index 1) has many terms related to loans. The following retrieves the two news items with the highest scores on this topic:

In [11]:
# Rank documents by their score on topic index 1, highest first
sample_topic = X_topics[:, 1].argsort()[::-1]

for doc_idx in sample_topic[:2]:
    print(news_items.iloc[doc_idx, 0])
    print("*" * 100)


    TOKYO, March 26 (Reuter) - Bank of Yokohama Ltd <8332.T>
will dispose of about 270 billion yen in problem loans in the
business year ending on March 31, the bank said in a statement
on Tuesday.
    Bank of Yokohama officials told a news conference that all
the bank's problem loans to the ailing mortgage firms will be
disposed of by the end of this business year.
    The officials also said the bank plans to forgive all the
problem loans to the mortgage firms and receive tax breaks for
writing off the bad loans.
    As part of its plan to streamline its business, Bank of
Yokohama said in a statement it will ask its directors to
return bonuses given to them this fiscal year and added that it
will also cut their salaries.
****************************************************************************************************
    QUITO, March 20 (Reuter) - Ecuador's Central Bank will
grant Banco Continental a $160 million subordinated loan,
taking temporary control of the tr

## Explore sample news items from _topic 5_

Topic 5 (printed above with index 4) has many terms related to commodity arbitrage. The following retrieves the two news items with the highest scores on this topic:

In [12]:
# Rank documents by their score on topic index 4, highest first
sample_topic = X_topics[:, 4].argsort()[::-1]

for doc_idx in sample_topic[:2]:
    print(news_items.iloc[doc_idx, 0])
    print("*" * 100)


    SINGAPORE, April 10 (Reuters) - Asian jet fuel margins are expected to firm in the coming
week as supply of the fuel has tightened, with refiners maximising gasoil production due to a
negative regrade, traders said.
    
    - JET FUEL: With jet fuel prices weaker than gasoil, refiners are inclined to maximise
production of the latter, traders said. This could boost jet fuel prices again and boost margins
further, they added. 
    - INDIA GASOIL EXPORTS: India's Numaligarh Refinery has signed a pact with Bangladesh
Petroleum Corp for the sale and purchase of gasoil, according to an Indian government
presentation. [nD8N1FU00N]
    - INDIA MAINTENANCE: India's Mangalore Refinery and Petrochemicals Ltd has likely started
its planned maintenance at a crude distillation unit, industry sources said. [REF/A]
    - INDIA FUEL DEMAND: India's diesel sales in March were up 0.3 percent year-on-year, data
from the Petroleum Planning and Analysis Cell (PPAC) of the oil ministry sho