The purpose of this notebook is to demonstrate a simple example of topic modeling on a set of news items. The sample news items are from Refinitiv's Machine Readable news content offering.
The following are the main steps of building a topic model:
Use sklearn to perform LSA
The spaCy library in Python has many useful modules for NLP tasks. It provides many preprocessing modules and interoperates with many of the popular deep learning frameworks.
import spacy
import pandas as pd
import numpy as np
nlp = spacy.load('en_core_web_lg')
The dataset comprises sample news items from the financial domain. The items relate to two topics: commodity arbitrage and loans.
news_items = pd.read_csv("./data/news-body-samples-v1.csv",sep="\t")
news_items.topic.value_counts()
cond = news_items.apply(lambda x: 300<=len(x['body']) <=6000, axis=1)
news_items = news_items.assign(l_status = cond)
news_items = news_items[news_items.l_status==True]
news_items.topic.value_counts()
N2:COMARB    1484
N2:LOA       1368
Name: topic, dtype: int64
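The row-wise apply used for the length filter can also be written with vectorized string methods. The following is a minimal sketch on a made-up three-row frame (the rows and the name toy_items are assumptions for illustration, not the Refinitiv data). It keeps rows whose body length is between 300 and 6000 characters inclusive, the same bounds as the filter above.

```python
import pandas as pd

# Hypothetical toy frame standing in for news_items (not the real data)
toy_items = pd.DataFrame({
    "body": ["x" * 100, "y" * 300, "z" * 7000],
    "topic": ["N2:LOA", "N2:COMARB", "N2:LOA"],
})

# str.len() gives the character count per row; between() is inclusive on both ends by default
keep = toy_items["body"].str.len().between(300, 6000)
filtered = toy_items[keep]

print(filtered["topic"].value_counts())
```

This avoids the temporary l_status column, since the boolean mask can index the frame directly.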
To use a specific tokenizer with sklearn, define a tokenizer function based on any library of your choice.
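def my_tokenizer(text):
    # Keep alphabetic tokens only; drop whitespace, punctuation, stop words and number-like tokens
    tokens = [t for t in nlp(text) if t.is_alpha and not (t.is_space or t.is_punct or t.is_stop or t.like_num)]
    # Return lowercased lemmas; "-PRON-" is the pronoun placeholder lemma in spaCy 2.x, so use the lowercased token text there
    return [t.lemma_.lower().strip() if t.lemma_ != "-PRON-" else t.lower_ for t in tokens]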
The sklearn.feature_extraction module can be used to create a document-term matrix.
The custom tokenizer function is passed to the vectorizer through its tokenizer argument.
news_items.shape
(2852, 3)
from sklearn.feature_extraction.text import TfidfVectorizer
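tfidf_vectorizer = TfidfVectorizer(tokenizer=my_tokenizer,
                                   ngram_range=(1, 1),   # unigrams only
                                   min_df=0.2,           # keep terms found in at least 20% of documents
                                   max_df=0.9,           # drop terms found in more than 90% of documents
                                   max_features=1000)    # upper bound on vocabulary size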
X = tfidf_vectorizer.fit_transform(news_items.iloc[:,0].values)
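To illustrate how a callable tokenizer plugs into TfidfVectorizer and how min_df and max_df prune the vocabulary, here is a small sketch on a made-up four-document corpus. It uses a regex-based stand-in tokenizer (simple_tokenizer, a hypothetical name) instead of the spaCy tokenizer so that it runs without a language model; the thresholds are chosen for the toy corpus and are not those used above.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer

def simple_tokenizer(text):
    # Hypothetical stand-in for my_tokenizer: lowercase alphabetic tokens only
    return re.findall(r"[a-z]+", text.lower())

toy_docs = [
    "Oil prices rise as crude supply tightens",
    "Bank grants loan as oil prices fall",
    "Bank loan rates rise",
    "Crude oil cargo offered",
]

# min_df=0.5: keep terms found in at least 50% of documents (2 of 4)
# max_df=0.7: drop terms found in more than 70% of documents ("oil" is in 3 of 4)
# token_pattern=None signals that the default pattern is unused when a tokenizer is given
vec = TfidfVectorizer(tokenizer=simple_tokenizer, token_pattern=None,
                      min_df=0.5, max_df=0.7)
X_toy = vec.fit_transform(toy_docs)

print(sorted(vec.vocabulary_))
print(X_toy.shape)
```

Terms that occur in a single document and the near-ubiquitous "oil" are both dropped, which is the same mechanism that reduces the news vocabulary to a small set of terms.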
sklearn.decomposition module to fit LSA
The document-term matrix can then be used as input for building an unsupervised LSA model.
from sklearn.decomposition import TruncatedSVD
lsa = TruncatedSVD(n_components=5, random_state=123,
                   algorithm='arpack')
X_topics = lsa.fit_transform(X)
print(X.shape)
print(X_topics.shape)
print(lsa.components_.shape)
(2852, 99)
(2852, 5)
(5, 99)
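The three shapes follow from the truncated SVD factorization: the document-term matrix is approximated as U_k Σ_k V_k^T, where fit_transform returns U_k Σ_k (documents by topics) and components_ holds V_k^T (topics by terms). The sketch below reproduces this relationship with numpy.linalg.svd on a small random matrix. The matrix values and variable names are assumptions for illustration; TruncatedSVD with arpack computes only the leading k singular triplets rather than the full decomposition.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.random((6, 4))   # 6 "documents", 4 "terms" (made-up values)
k = 2                         # number of topics to keep

# Full SVD, then keep the k largest singular triplets
U, S, Vt = np.linalg.svd(X_demo, full_matrices=False)
doc_topics = U[:, :k] * S[:k]   # plays the role of lsa.fit_transform(X)
components = Vt[:k, :]          # plays the role of lsa.components_

print(doc_topics.shape, components.shape)

# Projecting the documents onto the components gives the same document-topic scores
assert np.allclose(X_demo @ components.T, doc_topics)
```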
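# Note: get_feature_names() was removed in scikit-learn 1.2; newer versions provide get_feature_names_out()
print(tfidf_vectorizer.get_feature_names())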
['accord', 'april', 'ara', 'asia', 'bank', 'barge', 'barrel', 'benchmark', 'bid', 'bp', 'brent', 'cargo', 'cent', 'change', 'click', 'close', 'company', 'compare', 'contract', 'crack', 'crude', 'data', 'datum', 'day', 'demand', 'diesel', 'diff', 'discount', 'double', 'early', 'east', 'edit', 'energy', 'euro', 'europe', 'european', 'expect', 'export', 'fall', 'fob', 'friday', 'fuel', 'future', 'gasoil', 'gmt', 'guide', 'high', 'include', 'international', 'keyword', 'late', 'loan', 'london', 'low', 'march', 'margin', 'market', 'message', 'mogas', 'monday', 'month', 'new', 'northwest', 'o', 'offer', 'oil', 'percent', 'premium', 'previous', 'price', 'rate', 'refinery', 'refining', 'report', 'reuter', 'reuters', 'rise', 'say', 'sell', 'shell', 'show', 'source', 'speed', 'stock', 'sulphur', 'supply', 'swap', 'thursday', 'tonne', 'total', 'trade', 'trader', 'tuesday', 'vitol', 'wednesday', 'week', 'west', 'window', 'year']
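n_top_words = 30
feature_names = tfidf_vectorizer.get_feature_names()
for topic_idx, topic in enumerate(lsa.components_):
    print(topic_idx)
    # argsort() is ascending, so this slice lists terms from lower to higher weight;
    # the -1 stop excludes the single highest-weighted term, giving 29 terms per topic
    print(" ".join([feature_names[i] for i in topic.argsort()[-n_top_words:-1]]))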
0
refinery day margin crack price year sell sulphur future europe reuters loan crude euro fob click percent diff cargo fuel gasoil say bank oil diesel barrel barge ara trade
1
accord bp early expect wednesday total month source monday price friday include tuesday thursday london offer market new asia international percent april company march rate year reuter say loan
2
market company east rate o bid price report edit late week message swap close west month window cent asia cargo offer say source barrel crack new reuters brent oil
3
change data thursday early rise east click fall sulphur diff close gasoil high ara discount euro expect diesel week day month international bp asia percent price market offer bank
4
cent euro low company close supply cargo market march o price include say contract east double source west change brent discount click asia sulphur crack diff diesel oil fuel
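If the aim is to list the heaviest terms first and to include the top term, the argsort result can be reversed before slicing. The following sketch uses made-up weights for a single topic (demo_features and demo_topic are hypothetical, not taken from the fitted model) and contrasts that with the slice used in the loop above.

```python
import numpy as np

# Made-up term weights for a single topic (not taken from the fitted model)
demo_features = np.array(["bank", "loan", "oil", "crude", "rate"])
demo_topic = np.array([0.9, 0.7, 0.1, -0.2, 0.4])
n_top = 3

# argsort() is ascending; reversing it and taking the first n_top gives the heaviest terms first
top_idx = demo_topic.argsort()[::-1][:n_top]
top_terms = demo_features[top_idx].tolist()
print(top_terms)  # ['bank', 'loan', 'rate']

# For comparison, the slice [-n_top:-1] yields n_top - 1 terms in ascending order, without the top one
asc_terms = demo_features[demo_topic.argsort()[-n_top:-1]].tolist()
print(asc_terms)  # ['rate', 'loan']
```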
Topic 2 (index 1 in the output above) has many terms related to loans. The following retrieves the two news items with the highest scores on this topic:
sample_topic = X_topics[:, 1].argsort()[::-1]  # document indices sorted by descending topic score
for doc_idx in sample_topic[:2]:
    print(news_items.iloc[doc_idx, 0])
    print("*" * 100)
TOKYO, March 26 (Reuter) - Bank of Yokohama Ltd <8332.T>
will dispose of about 270 billion yen in problem loans in the
business year ending on March 31, the bank said in a statement
on Tuesday.
Bank of Yokohama officials told a news conference that all
the bank's problem loans to the ailing mortgage firms will be
disposed of by the end of this business year.
The officials also said the bank plans to forgive all the
problem loans to the mortgage firms and receive tax breaks for
writing off the bad loans.
As part of its plan to streamline its business, Bank of
Yokohama said in a statement it will ask its directors to
return bonuses given to them this fiscal year and added that it
will also cut their salaries.
****************************************************************************************************
QUITO, March 20 (Reuter) - Ecuador's Central Bank will
grant Banco Continental a $160 million subordinated loan,
taking temporary control of the troubled financial
institution, a high ranking government official said.
The one-year loan will allow Banco Continental, one of
Ecuador's largest banks, survive its liquidity crisis, the
official who asked for anonymity said.
The Central bank will acquire the voting rights of all of
the bank's shares and some of the its subsidiary companies'
shares, such as Banco Continental de Curazao. The shares will
be deposited in a trust fund.
****************************************************************************************************
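When inspecting retrieved items, it can help to print the topic score next to each one to see how strongly it loads on the topic. A minimal sketch with made-up scores and placeholder document labels (all names and values here are assumptions for illustration):

```python
import numpy as np

# Made-up document-topic scores: 4 documents, 2 topics
demo_scores = np.array([[0.10, 0.80],
                        [0.55, 0.05],
                        [0.20, 0.60],
                        [0.90, 0.30]])
demo_docs = ["doc A", "doc B", "doc C", "doc D"]  # hypothetical labels

topic = 1
order = demo_scores[:, topic].argsort()[::-1]  # highest score first
ranked = [(demo_docs[i], float(demo_scores[i, topic])) for i in order[:2]]
for name, score in ranked:
    print(f"{score:.2f}  {name}")
```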
Topic 5 (index 4 in the output above) has many terms related to commodity arbitrage. The following retrieves the two news items with the highest scores on this topic:
sample_topic = X_topics[:, 4].argsort()[::-1]  # document indices sorted by descending topic score
for doc_idx in sample_topic[:2]:
    print(news_items.iloc[doc_idx, 0])
    print("*" * 100)
SINGAPORE, April 10 (Reuters) - Asian jet fuel margins are expected to firm in the coming
week as supply of the fuel has tightened, with refiners maximising gasoil production due to a
negative regrade, traders said.
- JET FUEL: With jet fuel prices weaker than gasoil, refiners are inclined to maximise
production of the latter, traders said. This could boost jet fuel prices again and boost margins
further, they added.
- INDIA GASOIL EXPORTS: India's Numaligarh Refinery has signed a pact with Bangladesh
Petroleum Corp for the sale and purchase of gasoil, according to an Indian government
presentation. [nD8N1FU00N]
- INDIA MAINTENANCE: India's Mangalore Refinery and Petrochemicals Ltd has likely started
its planned maintenance at a crude distillation unit, industry sources said. [REF/A]
- INDIA FUEL DEMAND: India's diesel sales in March were up 0.3 percent year-on-year, data
from the Petroleum Planning and Analysis Cell (PPAC) of the oil ministry showed. [nENNH450SE]
- It also rose more than 10 percent from February, suggesting the dip in demand earlier this
year was temporary. [nL3N1HI2UA]
- EMARAT JET FUEL: Emirates General Petroleum Corp (Emarat) is seeking 60,000 tonnes of jet
fuel for delivery on May 13-14. The tender closes on April 17 and is valid until April 19.
- NIGERIA SULPHUR SPEC: Nigeria will raise quality standards on its imported gasoline,
diesel and kerosene from July 1, a change health campaigners have long said is necessary to
protect citizens from toxic fuel.
- All imported diesel from July 1 can contain a maximum of 50 ppm sulphur, while gasoline
and kerosene can contain a maximum of 150 ppm, according to an environment ministry official and
information from the Standards Organization of Nigeria, the body responsible for setting
requirements for imported goods. [nL8N1HF2NU]
- SINGAPORE CASH DEALS: One gasoil and one jet fuel deal.
- Please refer to <O/AS>
MID-DISTILLATES
CASH ($/T) ASIA CLOSE Change % Change Prev RIC
Close
Spot Gas Oil 0.5% 65.01 1.21 1.90 63.80 <GO-SIN>
GO 0.5 Diff -0.65 0.00 0.00 -0.65 <GO-SIN-DIF>
Spot Gas Oil 0.25% 65.21 1.21 1.89 64.00 <GO25-SIN>
GO 0.25 Diff -0.45 0.00 0.00 -0.45 <GO25-SIN-DIF>
Spot Gas Oil 0.05% 65.71 1.21 1.88 64.50 <GO005-SIN>
GO 0.05 Diff 0.05 0.00 0.00 0.05 <GO005-SIN-DIF>
Spot Gas Oil 0.001% 66.11 1.26 1.94 64.85 <GO10-SIN>
GO 0.001 Diff 0.45 0.05 12.50 0.40 <GO10-SIN-DIF>
Spot Jet/Kero 65.35 1.32 2.06 64.03 <JET-SIN>
Jet/Kero Diff -0.18 -0.03 20.00 -0.15 <JET-SIN-DIF>
For a list of derivatives prices, including margins, please double click
the RICs below.
Brent M1 <BRENTSGMc1>
Gasoil M1 <GOSGSWMc1>
Gasoil M1/M2 <GOSGSPDMc1>
Gasoil M2 <GOSGSWMc2>
Regrade M1 <JETREGSGMc1>
Regrade M2 <JETREGSGMc2>
Jet M1 <JETSGSWMc1>
Jet M1/M2 <JETSGSPDMc1>
Jet M2 <JETSGSWMc2>
Gasoil 500ppm-Dubai Cracks M1 <GOSGCKMc1>
Gasoil 500ppm-Dubai Cracks M2 <GOSGCKMc2>
Jet Cracks M1 <JETSGCKMc1>
Jet Cracks M2 <JETSGCKMc2>
East-West M1 <LGOAEFSMc1>
East-West M2 <LGOAEFSMc2>
LGO M1 <LGOAMc1>
LGO M1/M2 <LGOASPDMc1>
LGO M2 <LGOAMc2>
Crack LGO-Brent M1 <LGOACKMc1>
Crack LGO-Brent M2 <LGOACKMc2>
(Reporting by Jessica Jaganathan; Editing by Biju Dwarakanath)
((Jessica.Jaganathan@thomsonreuters.com; +65 6870 3822; Reuters Messaging:
jessica.jaganathan.thomsonreuters.com@reuters.net))
Keywords: MARKETS DISTILLATES/ASIA
****************************************************************************************************
SINGAPORE, Sept 19 (Reuters) - The ultra low sulphur diesel cash differential continued its
climb on Tuesday as it was more profitable for traders to ship cargoes from Asia to Europe,
trade sources said.
- NEW ZEALAND FUEL SHORTAGE: New Zealand's jet fuel shortage on Tuesday forced 39 flights to
be cancelled, 13 of them international, with concerns the fuel crisis may spread after fuel
stations in the country's largest city Auckland halted high-octane gasoline sales. [nL4N1M0276]
The fuel shortage, caused by a damaged pipeline to Auckland Airport, has caused widespread
disruption to air travel since the weekend and comes only days before Saturday's national
election with infrastructure shortages a hotly contested issues.
- INDIA OFFERS GASOIL: Indian Oil Corp and Essar Oil offered gasoil cargoes for loading in
October, traders said.
The offers came despite the monsoon season ending in September. IOC has recently upgraded
refining units and plans to upgrade more units to produce better quality fuel and may have more
high sulphur gasoil as a result, a trader said.
- PAKISTAN BUYS JET FUEL: Pakistan State Oil bought 10,000 tonnes of jet fuel for Oct. 5 to
31 delivery from E3 Energy at a premium of $5.36 a barrel on a delivered basis, traders said.
- SINGAPORE CASH DEALS: Six gasoil deals and one jet fuel trade. <O/AS>
MID-DISTILLATES
CASH ($/T) ASIA CLOSE Change % Change Prev RIC
Close
Spot Gas Oil 0.5% 65.61 -0.57 -0.86 66.18 <GO-SIN>
GO 0.5 Diff -1.50 0.00 0.00 -1.50 <GO-SIN-DIF>
Spot Gas Oil 0.25% 65.91 -0.57 -0.86 66.48 <GO25-SIN>
GO 0.25 Diff -1.20 0.00 0.00 -1.20 <GO25-SIN-DIF>
Spot Gas Oil 0.05% 67.67 -0.61 -0.89 68.28 <GO005-SIN>
GO 0.05 Diff 0.56 -0.04 -6.67 0.60 <GO005-SIN-DIF>
Spot Gas Oil 0.001% 68.46 -0.47 -0.68 68.93 <GO10-SIN>
GO 0.001 Diff 1.35 0.10 8.00 1.25 <GO10-SIN-DIF>
Spot Jet/Kero 67.46 -0.59 -0.87 68.05 <JET-SIN>
Jet/Kero Diff 0.13 -0.02 -13.33 0.15 <JET-SIN-DIF>
For a list of derivatives prices, including margins, please double click
the RICs below.
Brent M1 <BRENTSGMc1>
Gasoil M1 <GOSGSWMc1>
Gasoil M1/M2 <GOSGSPDMc1>
Gasoil M2 <GOSGSWMc2>
Regrade M1 <JETREGSGMc1>
Regrade M2 <JETREGSGMc2>
Jet M1 <JETSGSWMc1>
Jet M1/M2 <JETSGSPDMc1>
Jet M2 <JETSGSWMc2>
Gasoil 500ppm-Dubai Cracks M1 <GOSGCKMc1>
Gasoil 500ppm-Dubai Cracks M2 <GOSGCKMc2>
Jet Cracks M1 <JETSGCKMc1>
Jet Cracks M2 <JETSGCKMc2>
East-West M1 <LGOAEFSMc1>
East-West M2 <LGOAEFSMc2>
LGO M1 <LGOAMc1>
LGO M1/M2 <LGOASPDMc1>
LGO M2 <LGOAMc2>
Crack LGO-Brent M1 <LGOACKMc1>
Crack LGO-Brent M2 <LGOACKMc2>
(Reporting by Jessica Jaganathan; Editing by Keith Weir)
((Jessica.Jaganathan@thomsonreuters.com; +65 6870 3822; Reuters Messaging:
jessica.jaganathan.thomsonreuters.com@reuters.net))
Keywords: MARKETS DISTILLATES/ASIA
****************************************************************************************************
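Because the dataset carries a topic label for each item, one possible sanity check on the interpretation of the components is to cross-tabulate the label against the component on which each document scores highest. The sketch below uses made-up labels and scores, not results from this notebook. Assigning a document to its highest-scoring component is an assumption made for illustration: LSA scores can be negative, and the components are not exclusive clusters.

```python
import numpy as np
import pandas as pd

# Made-up labels and document-topic scores (not results from the notebook)
demo_labels = pd.Series(["N2:LOA", "N2:LOA", "N2:COMARB", "N2:COMARB"], name="topic")
demo_scores = np.array([[0.2, 0.7],
                        [0.1, 0.6],
                        [0.8, 0.1],
                        [0.5, 0.3]])

# Assumption for illustration: assign each document to its highest-scoring component
dominant = pd.Series(demo_scores.argmax(axis=1), name="dominant_component")

table = pd.crosstab(demo_labels, dominant)
print(table)
```

With the real data, news_items.topic and X_topics would take the place of the toy inputs.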