<img src="./images/qinsti.png" align="left" alt="drawing" width="100"/>
<br><br>
<div align="left">
    <h2>Topic Modeling - Primer</h2>
    <h3>Fitting LDA model to Financial news items</h3>
</div>



## Context

This notebook demonstrates a simple example of topic modeling on a set of news items. The sample items come from
Refinitiv's *Machine Readable News* content offering.

The main steps in building the topic model are:
* Load the news items
* Preprocess the text
  * Load spaCy
  * Remove stopwords
  * Lemmatize
  * Tokenize
* Create a Document Term Matrix
* Use `sklearn` to fit an LDA model


## Load spaCy Language Model

The `spaCy` library for Python provides many useful components for NLP tasks. It includes a range of preprocessing modules and interoperates with many of the popular deep learning
frameworks.

In [1]:
import spacy
import pandas as pd
import numpy as np
nlp = spacy.load('en_core_web_lg')


## Load News Data

The dataset comprises sample news items from the financial domain. The items relate to two topics: *commodity arbitrage* and *loans*.

In [2]:
news_items = pd.read_csv("./data/news-body-samples-v1.csv", sep="\t")
# Keep only items whose body is between 300 and 6000 characters long
cond       = news_items.apply(lambda x: 300 <= len(x['body']) <= 6000, axis=1)
news_items = news_items.assign(l_status=cond)
news_items = news_items[news_items.l_status]
# Count the remaining items per topic code
news_items.topic.value_counts()

N2:COMARB    1484
N2:LOA       1368
Name: topic, dtype: int64
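
The length filter above can also be written without a row-wise `apply`. The following is a minimal sketch on a made-up frame (the rows are illustrative, only the column names mirror the notebook), assuming every `body` value is a string:

```python
import pandas as pd

# Toy frame standing in for the news data; the rows are made up for illustration.
toy = pd.DataFrame({
    "body": ["x" * 100, "x" * 300, "x" * 6000, "x" * 6001],
    "topic": ["N2:LOA", "N2:LOA", "N2:COMARB", "N2:COMARB"],
})

# Series.str.len() gives the character count per row; between() is inclusive
# on both ends by default, which matches 300 <= len(body) <= 6000.
keep = toy["body"].str.len().between(300, 6000)
filtered = toy[keep]

print(filtered["topic"].value_counts())
```

One difference worth keeping in mind: a missing `body` makes `len()` raise a `TypeError` in the row-wise version, whereas here the row is simply dropped.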

## Define custom tokenizer 

To use a specific tokenizer with `sklearn`, define a tokenizer function built on the library of your choice. The one below relies on the spaCy model loaded earlier.

In [3]:
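def my_tokenizer(text):
    # Keep alphabetic tokens; drop whitespace, punctuation, stopwords, and number-like tokens
    tokens = [t for t in nlp(text) if t.is_alpha and not (t.is_space or t.is_punct or t.is_stop or t.like_num)]
    # Lowercase the lemma; spaCy v2 lemmatizes pronouns to "-PRON-", so keep their lowercased text instead
    return [t.lemma_.lower().strip() if t.lemma_ != "-PRON-" else t.lower_ for t in tokens]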


## Create Document Term Matrix

The `sklearn.feature_extraction` module can build the Document Term Matrix.
`CountVectorizer` takes the custom tokenizer function through its `tokenizer` argument.

In [4]:
from sklearn.feature_extraction.text import CountVectorizer
ct_vectorizer= CountVectorizer(tokenizer = my_tokenizer, ngram_range=(1,1),
                               min_df=0.2,
                               max_df=0.9,
                               max_features=1000)


X = ct_vectorizer.fit_transform(news_items.iloc[:,0].values)
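
To see what `min_df` and `max_df` do, here is a small sketch on a toy corpus. The documents and the `simple_tokenizer` function are hypothetical stand-ins (no spaCy model is needed), and the thresholds differ from the ones used in the notebook:

```python
from sklearn.feature_extraction.text import CountVectorizer

def simple_tokenizer(text):
    # Hypothetical stand-in for my_tokenizer: lowercase, split on whitespace,
    # keep alphabetic tokens only (no lemmatization or stopword removal).
    return [tok for tok in text.lower().split() if tok.isalpha()]

docs = [
    "oil price rise",
    "oil cargo sell",
    "bank loan rate",
    "bank loan oil",
]

# As floats, min_df and max_df are fractions of the corpus:
# min_df=0.5 drops terms found in fewer than half of the documents,
# max_df=0.9 drops terms found in more than 90% of them.
vec = CountVectorizer(tokenizer=simple_tokenizer, min_df=0.5, max_df=0.9)
dtm = vec.fit_transform(docs)

print(sorted(vec.vocabulary_))  # surviving terms
print(dtm.toarray())            # one row per document, one column per term
```

Only `bank`, `loan`, and `oil` appear in at least two of the four documents, so the other terms are pruned from the vocabulary. With `min_df=0.2`, as in the notebook, a term has to appear in at least 20% of the news items to be kept.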

## Use `sklearn.decomposition` module to fit LDA 

The Document Term Matrix then serves as the input for fitting an unsupervised LDA model.

In [5]:
from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(n_components=5 , random_state = 123,
                                learning_method='batch')

X_topics = lda.fit_transform(X)
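
The matrix returned by `fit_transform` has one row per document and one column per topic. The sketch below uses made-up counts (not the news data) to show how such a matrix can be read: each row is a distribution over topics, and `argmax` picks the dominant one.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Toy document-term counts (6 documents x 4 terms), made up for illustration.
counts = np.array([
    [5, 4, 0, 0],
    [6, 3, 1, 0],
    [4, 5, 0, 1],
    [0, 0, 5, 4],
    [1, 0, 4, 6],
    [0, 1, 6, 3],
])

toy_lda = LatentDirichletAllocation(n_components=2, random_state=0,
                                    learning_method="batch")
doc_topics = toy_lda.fit_transform(counts)

# Each row sums to 1: it is that document's mixture over the topics.
print(doc_topics.sum(axis=1))

# Index of the highest-weight topic for each document.
dominant = doc_topics.argmax(axis=1)
print(dominant)
```

Topic indices carry no meaning of their own, so which index ends up describing which theme depends on the random initialization.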

## Explore top 30 words from the topics

In [9]:
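n_top_words=30

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
feature_names = ct_vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    print(topic_idx)
    # argsort() is ascending and the slice stops before the last index, so this prints
    # 29 terms in increasing weight and leaves out the single highest-weight term
    print(" ".join([ feature_names[i] for i in topic.argsort()[-n_top_words:-1]] ))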


0
refining say margin double contract sell data barrel o energy low price discount europe fob percent oil fuel cargo future sulphur gasoil euro diff click barge ara trade diesel
1
compare wednesday high early friday fall close thursday new week rise london expect margin day international april march bp month market say offer reuter year price asia percent bank
2
close fall shell report month crack compare total oil stock rise future refinery week crude say cargo sell percent refining day margin ara reuters europe fob barge trade barrel
3
month report high premium barge trader change offer source week east west close o trade price cent barrel diff asia reuters gasoil tonne say crack cargo brent crude fuel
4
stock week month low accord export early report international expect market wednesday friday thursday include tuesday monday london percent new source total april company march year reuter loan bank
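
Each list above contains 29 terms rather than 30, because the slice `[-n_top_words:-1]` stops before the last position of the ascending `argsort()` result. The sketch below, on made-up weights and terms, contrasts that slice with one that returns the full top-n in descending order:

```python
import numpy as np

# Made-up topic weights and vocabulary, for illustration only.
weights = np.array([0.1, 0.7, 0.3, 0.9, 0.5])
terms = np.array(["loan", "oil", "bank", "crude", "cargo"])
n_top = 3

order = weights.argsort()  # indices sorted by ascending weight

# Slice used in the notebook: ascending order, largest weight excluded.
partial = terms[order[-n_top:-1]]

# Reverse first, then take n_top: the n_top largest, highest weight first.
top = terms[order[::-1][:n_top]]

print(partial)  # ['cargo' 'oil']
print(top)      # ['crude' 'oil' 'cargo']
```

Applying the second form to `lda.components_` would list 30 terms per topic, starting with the most heavily weighted one.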


## Explore sample news items from *topic 1*

Topic 1 has many terms that are related to loans. The following retrieves the news items that load most heavily on this topic:

In [7]:
# Rank documents by their weight on topic 1, highest first
sample_topic = X_topics[:,1].argsort()[::-1]

for doc_idx in sample_topic[:2]:
    print(news_items.iloc[doc_idx,0])


    MANILA, Feb 1 (Reuter) - The Philippine Interbank Offered
Rate (PHIBOR) made its debut on Thursday with the average quote
for one week at 13.75 percent and for one year at 14.25
percent, the Bankers Association of the Philippines said.
    PHIBOR is compiled by Reuters on the basis of
contributions from 20 banks.
    The PHIBOR is designed to be the alternative for Treasury
bill rates as benchmarks for corporate loan pricing, the
Bankers Association said.
    The one-week PHIBOR averaging 13.75 percent was slightly
higher than the 13.625 percent yield of one-week reverse repos
being offered by the Central Bank.
    The two-week PHIBOR was also 13.75 percent. There were no
quoted rates for two- week money in the interbank market.
    The one-month PHIBOR was 13.875 percent which was higher
as compared with the 32- day reverse repos' yield of 12.375
percent being offered by the Central Bank.
    The two-month PHIBOR averaged 13.9375 percent which was
firmer than 12.
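
The retrieval step relies on row positions: row `i` of the document-topic matrix corresponds to the `i`-th row of the filtered frame, which is why `iloc` is used rather than label-based indexing. A minimal sketch with made-up weights and texts:

```python
import numpy as np
import pandas as pd

# Made-up document-topic weights (3 documents x 2 topics) and matching texts.
doc_topics = np.array([
    [0.2, 0.8],
    [0.9, 0.1],
    [0.4, 0.6],
])
toy_news = pd.DataFrame({"body": ["loan story", "oil story", "mixed story"]})

topic = 1
# Positions of documents sorted by their weight on the chosen topic, highest first.
ranked = doc_topics[:, topic].argsort()[::-1]

for doc_idx in ranked[:2]:
    # Print the weight next to the text to show how strongly the item loads on the topic.
    print(doc_topics[doc_idx, topic], toy_news.iloc[doc_idx, 0])
```

Printing the weight alongside the text makes it easier to judge whether the top-ranked items are dominated by the topic or only loosely associated with it.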

## Explore sample news items from _topic 3_

Topic 3 has many terms that are related to crude oil and refined-product trading (for example *crude*, *brent*, *gasoil*, *cargo*). The following retrieves the news items that load most heavily on this topic:

In [8]:
# Rank documents by their weight on topic 3, highest first
sample_topic = X_topics[:,3].argsort()[::-1]

for doc_idx in sample_topic[:2]:
    print(news_items.iloc[doc_idx,0])
    


    LONDON, March 29 (Reuters) -    * Forties differentials
held almost unchanged on Tuesday from the average seen late last
week. Total bought two cargoes for loading in late April,
helping offset some of the more aggressive offers in the daily
window.
    * There were very few takers for a slew of offers, with
diffs offered down at their lowest since early February and the
first ship-to-ship transfers in four months.
    * Shell offered two cargoes for ship-to-ship transfer at
Scapa Flow, which traders sometimes view as a sign of an over
supplied market.
    * The two Aframaxes from which the STS transfers were
offered were the NS Challenger, currently moored off the German
port of Wilhelmshaven, and the Alfa Germania, off Rotterdam. 
    * Forties diffs have fallen by more than 60 cents so far
this month. Historically, March tends to be the weakest month,
in which on average in the last 20 years, diffs have fallen by
32 cents in this month, compared with July, the m