The purpose of this notebook is to demonstrate a simple example of topic modeling on a set of news items. The sample news items are from Refinitiv's Machine Readable news content offering.
The main steps of building the topic model are: tokenizing the text with spaCy, building a document-term matrix with sklearn, and using sklearn to create an LDA model.
The spaCy library in Python has a lot of useful modules for NLP tasks. It has many preprocessing modules and interoperates with many of the popular deep learning frameworks.
import spacy
import pandas as pd
import numpy as np
nlp = spacy.load('en_core_web_lg')
The dataset comprises sample news items from the financial domain. These news items relate to commodity arbitrage and to loans.
news_items = pd.read_csv("./data/news-body-samples-v1.csv",sep="\t")
news_items.topic.value_counts()
cond = news_items.apply(lambda x: 300<=len(x['body']) <=6000, axis=1)
news_items = news_items.assign(l_status = cond)
news_items = news_items[news_items.l_status==True]
news_items.topic.value_counts()
N2:COMARB    1484
N2:LOA       1368
Name: topic, dtype: int64
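The length filter above can also be written without a row-wise `apply`. The following is a minimal sketch on a hypothetical toy frame (the `toy` data is made up for illustration and is not part of the dataset); it assumes, as in the notebook, that the text lives in a `body` column.

```python
import pandas as pd

# Hypothetical toy frame standing in for news_items; only 'body' and 'topic' are used.
toy = pd.DataFrame({
    "body": ["short", "x" * 300, "y" * 6000, "z" * 6001],
    "topic": ["N2:LOA", "N2:LOA", "N2:COMARB", "N2:COMARB"],
})

# Vectorized equivalent of the lambda: keep bodies of 300 to 6000 characters.
# Series.between is inclusive on both ends by default.
mask = toy["body"].str.len().between(300, 6000)
filtered = toy[mask]

kept_lengths = filtered["body"].str.len().tolist()
print(kept_lengths)
print(filtered["topic"].value_counts())
```

This avoids the temporary `l_status` column, since the boolean mask is used directly for indexing.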
To use a specific tokenizer with sklearn, one can define a tokenizer function based on any library of choice. Here it is built on spaCy.
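def my_tokenizer(text):
    # Keep alphabetic tokens only; drop whitespace, punctuation, stop words and number-like tokens
    tokens = [t for t in nlp(text) if t.is_alpha and not (t.is_space or t.is_punct or t.is_stop or t.like_num)]
    # Return lowercased lemmas; for pronouns (lemma "-PRON-") fall back to the lowercased surface form
    return [t.lemma_.lower().strip() if t.lemma_ != "-PRON-" else t.lower_ for t in tokens]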
One can use the sklearn.feature_extraction module to create a document-term matrix.
The customized tokenizer function is passed as input to CountVectorizer.
from sklearn.feature_extraction.text import CountVectorizer
ct_vectorizer = CountVectorizer(tokenizer=my_tokenizer, ngram_range=(1, 1),
                                min_df=0.2,
                                max_df=0.9,
                                max_features=1000)
X = ct_vectorizer.fit_transform(news_items.iloc[:,0].values)
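To make the effect of `min_df` and `max_df` concrete, here is a small sketch on a made-up corpus (the `docs` list and the thresholds 0.3 and 0.9 are illustrative assumptions, not the notebook's settings). With fractional values, a term is kept only if the share of documents containing it lies between the two thresholds.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical five-document corpus, used only to illustrate the thresholds.
docs = [
    "oil price rise",
    "oil price fall",
    "oil loan bank",
    "oil loan rate",
    "oil cargo barge",
]

# 'oil' appears in 5/5 documents (above max_df=0.9) and is dropped.
# 'price' and 'loan' appear in 2/5 documents and are kept.
# Every other term appears in 1/5 documents (below min_df=0.3) and is dropped.
vec = CountVectorizer(min_df=0.3, max_df=0.9)
X_toy = vec.fit_transform(docs)

vocab = sorted(vec.vocabulary_)
print(vocab)
print(X_toy.shape)
```

The same pruning applies in the notebook, where terms must appear in at least 20% and at most 90% of the news items.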
sklearn.decomposition module to fit LDA
The document-term matrix can then be used as input for building an unsupervised LDA model.
from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(n_components=5 , random_state = 123,
learning_method='batch')
X_topics = lda.fit_transform(X)
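The matrix returned by `fit_transform` has one row per document and one column per topic, and each row is a normalized topic distribution. The sketch below illustrates this on a small random count matrix (`toy_counts` is a hypothetical stand-in for the document-term matrix, and 3 topics is an arbitrary choice).

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical document-term counts: 6 documents, 8 terms.
rng = np.random.RandomState(0)
toy_counts = rng.randint(0, 5, size=(6, 8))

lda_toy = LatentDirichletAllocation(n_components=3, random_state=123,
                                    learning_method="batch")
doc_topics = lda_toy.fit_transform(toy_counts)

# Each row is a distribution over topics, so it sums to 1.
row_sums = doc_topics.sum(axis=1)
# The most probable topic for each document.
dominant = doc_topics.argmax(axis=1)

print(doc_topics.shape)            # (documents, topics)
print(lda_toy.components_.shape)   # (topics, terms)
print(dominant)
```

Sorting a column of this matrix in descending order, as done later with `X_topics`, ranks documents by how strongly they belong to that topic.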
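n_top_words = 30
# Note: scikit-learn 1.0 and later provide get_feature_names_out(); get_feature_names() was removed in 1.2
feature_names = ct_vectorizer.get_feature_names()
for topic_idx, topic in enumerate(lda.components_):
    print(topic_idx)
    # argsort is ascending, so this slice lists terms by increasing weight
    # and stops before the last index, i.e. the single highest-weighted term is not printed
    print(" ".join([feature_names[i] for i in topic.argsort()[-n_top_words:-1]]))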
0
refining say margin double contract sell data barrel o energy low price discount europe fob percent oil fuel cargo future sulphur gasoil euro diff click barge ara trade diesel
1
compare wednesday high early friday fall close thursday new week rise london expect margin day international april march bp month market say offer reuter year price asia percent bank
2
close fall shell report month crack compare total oil stock rise future refinery week crude say cargo sell percent refining day margin ara reuters europe fob barge trade barrel
3
month report high premium barge trader change offer source week east west close o trade price cent barrel diff asia reuters gasoil tonne say crack cargo brent crude fuel
4
stock week month low accord export early report international expect market wednesday friday thursday include tuesday monday london percent new source total april company march year reuter loan bank
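Each topic above lists 29 terms rather than 30, in increasing order of weight, because the slice `[-n_top_words:-1]` stops before the last (highest-weighted) index. The following sketch, on made-up weights and a made-up vocabulary, compares that slice with one that returns the full top-n in descending order.

```python
import numpy as np

# Hypothetical topic-term weights and vocabulary, for illustration only.
weights = np.array([0.1, 0.5, 0.2, 0.9, 0.3])
vocab = ["bank", "loan", "rate", "oil", "cargo"]
n_top = 3

order = weights.argsort()  # indices in ascending order of weight

# Slice as used in the notebook: ascending, and the top term is left out.
notebook_slice = [vocab[i] for i in order[-n_top:-1]]

# Alternative: walk backwards from the end to get all n_top terms, highest first.
full_slice = [vocab[i] for i in order[:-n_top - 1:-1]]

print(notebook_slice)
print(full_slice)
```

Whether to include the top term is a presentation choice; the fitted model is the same either way.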
Topic 1 has many terms that are related to loans. The following retrieves a few news items belonging to this topic:
# Rank documents by their weight on topic 1, highest first
sample_topic = X_topics[:, 1].argsort()[::-1]
for doc_idx in sample_topic[:2]:
    print(news_items.iloc[doc_idx, 0])
MANILA, Feb 1 (Reuter) - The Philippine Interbank Offered
Rate (PHIBOR) made its debut on Thursday with the average quote
for one week at 13.75 percent and for one year at 14.25
percent, the Bankers Association of the Philippines said.
PHIBOR is compiled by Reuters on the basis of
contributions from 20 banks.
The PHIBOR is designed to be the alternative for Treasury
bill rates as benchmarks for corporate loan pricing, the
Bankers Association said.
The one-week PHIBOR averaging 13.75 percent was slightly
higher than the 13.625 percent yield of one-week reverse repos
being offered by the Central Bank.
The two-week PHIBOR was also 13.75 percent. There were no
quoted rates for two- week money in the interbank market.
The one-month PHIBOR was 13.875 percent which was higher
as compared with the 32- day reverse repos' yield of 12.375
percent being offered by the Central Bank.
The two-month PHIBOR averaged 13.9375 percent which was
firmer than 12.5625 percent offered by the Central Bank for its
60-day reverse repos.
The three-month PHIBOR was 14.0 percent, higher than the
12.6 percent yield offered by Central Bank for its 90-day
special series Treasury bills.
The six-month PHIBOR was 14.0625 percent as against
13.5625 percent yield for Central Bank's 223-day bills.
The one-year PHIBOR was 14.25 percent. The Central Bank
was not offering comparable terms to the market.
The PHIBOR will be updated every day.
- Manila newsroom 63 2 8109636 Fax 8176267
MANILA, Feb 1 (Reuter) - The Philippine Interbank Offered
Rate (PHIBOR) made its debut on Thursday with the average quote
for one week at 13.75 percent and for one year at 14.25 percent,
the Bankers Association of the Philippines said.
PHIBOR is compiled by Reuters on the basis of contributions
from 20 banks.
The PHIBOR is designed to be the alternative for Treasury
bill rates as benchmarks for corporate loan pricing, the Bankers
Association said.
The one-week PHIBOR averaging 13.75 percent was slightly
higher than the 13.625 percent yield of one-week reverse repos
being offered by the Central Bank.
The two-week PHIBOR was also 13.75 percent. There were no
quoted rates for two-week money in the interbank market.
The one-month PHIBOR was 13.875 percent which was higher as
compared with the 32-day reverse repos' yield of 12.375 percent
being offered by the Central Bank.
The two-month PHIBOR averaged 13.9375 percent which was
firmer than 12.5625 percent offered by the Central Bank for its
60-day reverse repos.
The three-month PHIBOR was 14.0 percent, higher than the
12.6 percent yield offered by Central Bank for its 90-day
special series Treasury bills.
The six-month PHIBOR was 14.0625 percent as against 13.5625
percent yield for Central Bank's 223-day bills.
The one-year PHIBOR was 14.25 percent. The Central Bank was
not offering comparable terms to the market.
The PHIBOR will be updated every day.
- Manila newsroom 63 2 8109636 Fax 8176267
Topic 3 has many terms that are related to crude oil and fuel trading. The following retrieves a few news items belonging to this topic:
# Rank documents by their weight on topic 3, highest first
sample_topic = X_topics[:, 3].argsort()[::-1]
for doc_idx in sample_topic[:2]:
    print(news_items.iloc[doc_idx, 0])
LONDON, March 29 (Reuters) - * Forties differentials
held almost unchanged on Tuesday from the average seen late last
week. Total bought two cargoes for loading in late April,
helping offset some of the more aggressive offers in the daily
window.
* There were very few takers for a slew of offers, with
diffs offered down at their lowest since early February and the
first ship-to-ship transfers in four months.
* Shell offered two cargoes for ship-to-ship transfer at
Scapa Flow, which traders sometimes view as a sign of an over
supplied market.
* The two Aframaxes from which the STS transfers were
offered were the NS Challenger, currently moored off the German
port of Wilhelmshaven, and the Alfa Germania, off Rotterdam.
* Forties diffs have fallen by more than 60 cents so far
this month. Historically, March tends to be the weakest month,
in which on average in the last 20 years, diffs have fallen by
32 cents in this month, compared with July, the month in which
diffs see the largest rise of an average 46 cents.
FLOWS, NEWS, DATA
* Norway's key industrial trade union broke off wage
negotiations with employers, raising the risk of a large strike
involving workers from oil services firm Aker Solutions
AKSOL.OL, among numerous others. [nL5N1712EH]
FIXTURES - REUTERS SHIPPING DATA
* No new fixtures in the week to Monday, according to
Reuters data.
* Olympic Loyalty II loaded at Hound Point over the weekend,
according to Genscape data and was last sailing south along the
English coast, bound for China.
* Maran Canopus exiting the English Channel, according to
Reuters ship-tracking data.
* Samail loaded Forties in the last week, but remained
anchored off Hound Point.
WINDOW SUMMARY
FORTIES
* BP sold Total a cargo of Forties for loading Apr 18-22 at
a discount of 35 cents to dated Brent.
* Chevron sold a cargo to Total of Forties for loading Apr
18-22 at a discount of 35 cents to dated Brent.
* Shell offered a cargo of Forties for loading Apr 13-22 via
STS at Scapa Flow from the Aframax NS Challenger at a discount
of 20 cents to the dated price.
* Shell offered a second cargo of Forties for loading Apr
13-22 via STS at Scapa Flow from the Aframax Alfa Germania at a
discount of 20 cents to the dated Brent benchmark.
* Shell offered a cargo of Forties for loading Apr 9-11 at a
discount of 80 cents to dated Brent.
* Shell offered a cargo of Forties for loading Apr 22-24 at
a discount of 45 cents to dated Brent.
* Glencore offered a cargo of Forties for loading Apr 7-9 at
a discount of 90 cents to dated Brent.
* Chevron offered a cargo of Forties for loading Apr 19-21
at a discount of 25 cents to dated Brent.
* Petroineos offered a cargo of Forties for loading Apr
23-25 at 30 cents below the dated benchmark.
* Vitol withdrew an offer for Forties for loading Apr 24-26
at a discount of 35 cents to dated Brent.
* BP withdrew an offer for Forties for loading Apr 12-14 at
a discount of 75 cents to dated Brent.
* BP withdrew an offer for Forties for loading Apr 14-16 at
a discount of 20 cents to dated Brent.
* Petroineos withdrew an offer for Forties for loading Apr
13-15 at a discount of 35 cents.
* On average, based on all qualifying bids, offers and
trades, Forties is the lowest-priced crude and therefore sets
the price of dated Brent.
EKOFISK/OSEBERG
* Total bid for a cargo of Ekofisk for loading Apr 22-26 at
35 cents above dated Brent.
* Shell offered a cargo of Ekofisk for loading Apr 14-16 at
35 cents above the dated price.
* The quality premium for April-loading cargoes for Ekofisk
is 45 cents and 75 cents for Oseberg.
* Total bid for a cargo of Oseberg for Apr 18-20 at 75 cents
above the dated price.
BRENT
* Vitol offered a cargo of Brent for loading Apr 13-15 at a
discount of 60 cents to the dated benchmark price.
* Total bid for a cargo of Brent for loading Apr 19-27 at a
discount of 20 cents to dated Brent.
* Shell offered a cargo of Brent for loading Apr 19-21 at 15
cents below dated Brent.
(Reporting by Amanda Cooper; Editing by Alexander Smith)
((amanda.cooper@thomsonreuters.com; +442075423424; Reuters
Messaging: amanda.cooper.thomsonreuters.com@reuters.net))
((NORTH SEA CRUDE OIL DIFFERENTIALS AND OUTRIGHT PRICES:
<0#BFO-DIF> <0#C-E>
Dated BFO <BFO->
Brent <BFO-BRT> <BFO-E>
Forties <BFO-FOT> <FOT-E>
Oseberg <BFO-OSE> <OSE-E>
Ekofisk <BFO-EKO> <EKO-E>
Statfjord <BFO-STA> <STA->
Monthly North Sea crude loading programmes [O/LOAD]
New North Sea oilfields [NSEA/NEW]
NORTH SEA SWAPS
<FFFP> for the latest contracts for difference.
BFO <0#BFO-> Dated BFO weekly swaps <0#BRT-CFD>
Dated BFO front line swaps <IHEU/SWAP/BRENT>
NYMEX crude <0#CL:>
ICE crude <0#LCO:>
[OPEC] OPEC [NSEA] North Sea
[CRU] crude oil [PROD] oil products
[DRV] derivatives [PRO/E] European products
[O/L] Latest IPE [O/N] Latest NYMEX
<OILOIL> <NYMOIL> <IPEOIL> <OILSPD> <OILARB>
<CRDWLD><PRODEUR> <PRODUS> <APROD>
<ENERGY> speed guide <CRUDE/1> crude speed guide))
Keywords: NORTHSEA OIL/
LONDON, March 31 (Reuters) -
* Forties differentials offered down at their lowest since
November at one point in a session dominated by steep discounts
and with Total the lone bidder and buyer for all four BFOE
grades.
* The average of the trades, bids and offers for Forties
left the differential marginally higher on the day, while Brent
diffs traded down at last week's three-month lows.
* Monthly BFOE loading programmes due on Friday.
FIXTURES - REUTERS SHIPPING DATA
* No new fixtures in the week to Thursday, according to
Reuters data.
* Olympic Loyalty II loaded at Hound Point over the weekend,
according to Genscape data and was last exiting the English
Channel, bound for China. Traders said Shell had fixed the
vessel.
* Maran Canopus is off the coast of Portugal, heading to
South Korea, according to Reuters ship-tracking data.
* Samail loaded Forties in the last week, but remained
anchored off Hound Point.
WINDOW SUMMARY
FORTIES
* BP sold Total a cargo of Forties for loading April 15-17
at a discount of 80 cents to dated Brent.
* Total bid for a cargo of Forties for April 23-25 at a
discount of 45 cents to the dated price.
* Total withdrew a bid for Forties loading April 17-19 at a
discount of 50 cents to the dated price.
* Shell withdrew an offer for a cargo of Forties for loading
via STS at Scapa Flow from the Aframax NS Challenger April 19-21
at a discount of 50 cents to dated Brent.
* Shell withdrew an offer for a cargo of Forties for loading
via STS at Scapa Flow from the VLCC Atlantas April 19-21 at a
discount of 25 cents to dated Brent.
BRENT
* Vitol sold Total a cargo of Brent for loading April 13-15
at a discount of 70 cents to dated Brent.
* Shell sold Total a cargo of Brent for loading April 19-28
at a discount of 45 cents to the dated Brent.
* Total bid for a cargo of Brent for loading April 19-28 at
a discount of 45 cents to the dated price.
* Shell withdrew an offer for a cargo of Brent loading April
19-21 at a discount of 45 cents to dated Brent and withdrew a
second offer for a cargo of Brent for loading April 26-28 at a
discount of 35 cents to the dated benchmark.
EKOFISK/OSEBERG
* Total bid for a cargo of Ekofisk for loading April 22-26
at a premium of 45 cents to dated Brent.
* Total bid for a cargo of Oseberg for loading April 18-29
at a premium of 85 cents to the dated price.
* The quality premium for April-loading cargoes for Ekofisk
is 45 cents and 75 cents for Oseberg.
(Reporting by Amanda Cooper; Editing by David Evans)
((amanda.cooper@thomsonreuters.com; +442075423424; Reuters
Messaging: amanda.cooper.thomsonreuters.com@reuters.net))
((NORTH SEA CRUDE OIL DIFFERENTIALS AND OUTRIGHT PRICES:
<0#BFO-DIF> <0#C-E>
Dated BFO <BFO->
Brent <BFO-BRT> <BFO-E>
Forties <BFO-FOT> <FOT-E>
Oseberg <BFO-OSE> <OSE-E>
Ekofisk <BFO-EKO> <EKO-E>
Statfjord <BFO-STA> <STA->
Monthly North Sea crude loading programmes [O/LOAD]
New North Sea oilfields [NSEA/NEW]
NORTH SEA SWAPS
<FFFP> for the latest contracts for difference.
BFO <0#BFO-> Dated BFO weekly swaps <0#BRT-CFD>
Dated BFO front line swaps <IHEU/SWAP/BRENT>
NYMEX crude <0#CL:>
ICE crude <0#LCO:>
[OPEC] OPEC [NSEA] North Sea
[CRU] crude oil [PROD] oil products
[DRV] derivatives [PRO/E] European products
[O/L] Latest IPE [O/N] Latest NYMEX
<OILOIL> <NYMOIL> <IPEOIL> <OILSPD> <OILARB>
<CRDWLD><PRODEUR> <PRODUS> <APROD>
<ENERGY> speed guide <CRUDE/1> crude speed guide))
Keywords: NORTHSEA OIL/