# MLT-02 SVM Model Building

- Authored by: *Jay Parmar*
- Last modified on: *20th August 2023*

## Goal: Predict whether the next day will open above the current day's open.

## Approach

1. Define a Classification Task
2. Read the Dataset
3. Generate Target Values
4. Feature Selection
5. Feature Extraction
6. Generate Train-Test Datasets
7. Feature Scaling
8. Build Model
9. Train Model
10. Predict
11. Evaluate

### 1. Classification Task

*Predict whether the next day's open will be above the current day's open.*

In [None]:
# Import the necessary libraries
import pandas as pd
import numpy as np
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import yfinance as yf
%matplotlib inline

warnings.filterwarnings('ignore')

# Set the seaborn visualization style
sns.set()

### 2. Read the dataset

In [None]:
# Fetch data
df = yf.download('TSLA', start='2012-01-01', end='2022-12-31', auto_adjust=True)

In [None]:
# Make a copy of the data. We will work on the copy.
data = df.copy()

In [None]:
# Inspect the data
data.head()

In [None]:
# Check the shape of the data
data.shape

### 3. Generate Target Values

Let's say we want to predict the next day's movement: up or down. What kind of ML problem is that? It is a classification task, and for it we need target values. Let's create them.

In [None]:
# Generate log returns
data['returns'] = np.log(data['Close'] / data['Close'].shift(1))

If the next day opens above the current day's open, we label the current day with 1; otherwise we label it with 0.

In [None]:
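# Create target values
# Alternative based on next-day returns (not used here):
# data['target'] = np.where(data.returns.shift(-1) > 0, 1, 0)

# 1 if the next day's open is above the current day's open, else 0
data['target'] = np.where(data['Open'].shift(-1) > data['Open'], 1, 0)

# Lag volume by one day, so each row carries the previous day's volume
data['Volume'] = data['Volume'].shift(1)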
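
To make the labelling rule concrete, here is a minimal sketch on a toy series. The prices and the `toy` and `has_next_day` names are made up for illustration and are not part of the TSLA data. It also shows that the last row has no next day, so its comparison against `NaN` is `False` and it falls back to 0.

```python
import numpy as np
import pandas as pd

# Hypothetical opening prices for five consecutive trading days
toy = pd.DataFrame({'Open': [100.0, 102.0, 101.0, 101.0, 105.0]})

# 1 if the next day's open is above today's open, else 0
next_open = toy['Open'].shift(-1)
toy['target'] = np.where(next_open > toy['Open'], 1, 0)

# The last row has no next day: comparing with NaN is False, so it gets 0
toy['has_next_day'] = next_open.notna()

print(toy)
```

Note that an unchanged open (day 3 to day 4 in the toy data) is also labelled 0, because the rule uses a strict `>`.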

In [None]:
# Count the values in the target column
data['target'].value_counts()

In [None]:
features = ['Volume', 'returns']
label = 'target'

### 4. Feature Selection

We have OHLCV data with us. These OHLCV columns are our candidate features, and from them we will try to predict the next day's movement. But let's first understand which of these features can actually be used. Our intuition says that the Close price plays a major role in determining the next day's movement, so we'll consider it. What about the other features?

To decide on what features to use and which one to ignore, let's analyze their relationship, starting with the Volume column.

In [None]:
# Scatter plot of returns and Volume
plt.figure(figsize=(10,6))
sns.scatterplot(x=data['returns'], y=data['Volume']);

### 5. Feature Extraction

Our intuition says that these two features alone might not capture the intricacies of the stock's movement. We need more features. What can we do to generate them? The answer is to create, or extract, new features from the existing ones.

Let's try to create new features. We will consider the following quantitative features.

- Rolling standard deviation
- Rolling moving average of close price
- Rolling percentage change
- Rolling moving average of volume
- Difference between close and open

In [None]:
# Creating features
features_list = []

# SD based features
for i in range(5, 20, 5):
    col_name = 'std_' + str(i)
    data[col_name] = data['Close'].rolling(window=i).std()
    features_list.append(col_name)
    
# MA based features
for i in range(10, 30, 5):
    col_name = 'ma_' + str(i)
    data[col_name] = data['Close'].rolling(window=i).mean()
    features_list.append(col_name)
    
# Daily pct change based features
for i in range(3, 12, 3):
    col_name = 'pct_' + str(i)
    data[col_name] = data['Close'].pct_change().rolling(i).sum()
    features_list.append(col_name)
    
# Feature based on volume
col_name = 'vma_4'
data[col_name] = data['Volume'].rolling(4).mean()
features_list.append(col_name)

# Intraday movement
col_name = 'co'
data[col_name] = data['Close'] - data['Open']
features_list.append(col_name)

This process of extracting information from the existing features is called feature extraction. We now have a handful of features as shown below. 

In [None]:
features_list

We'll be using these features to predict the next day's movement. We won't be using the `Close` and `Volume` columns directly. Now is the time to generate our train and test data. Onwards to it.

As we are dealing with time-series data, we need to split our dataset in such a way that it doesn't introduce lookahead bias. But before we do that, can you think of any potential issue? Our old friend `.info()` will help us check.

In [None]:
data.info()

In [None]:
data.isna().sum()

Yes, there is an issue. Several feature columns contain null values, which come from the rolling windows and shifts used to build them. We need to get rid of these rows before we move further.
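
Where the nulls come from can be sketched on synthetic data. The series below is made up (30 evenly spaced values, not market data), and the column names only mirror the ones used above. A rolling window of size `w` needs `w` valid observations, so it leaves `w - 1` leading nulls, and one more when it is applied on top of `pct_change()`.

```python
import numpy as np
import pandas as pd

# Hypothetical closing prices: 30 synthetic values, not market data
close = pd.Series(np.linspace(100.0, 129.0, 30))

toy_features = pd.DataFrame({
    'std_5': close.rolling(window=5).std(),
    'ma_25': close.rolling(window=25).mean(),
    'pct_9': close.pct_change().rolling(9).sum(),
})

# Count the leading nulls per column
nan_counts = toy_features.isna().sum()
print(nan_counts)

# dropna() keeps only rows where every feature is available,
# so the longest window decides how many leading rows are lost
clean = toy_features.dropna()
print(len(toy_features), '->', len(clean))
```

In this sketch the 25-day moving average is the binding constraint, so 24 of the 30 rows are dropped.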

In [None]:
# Removing nan values
data.dropna(inplace=True)

In [None]:
data[features_list + ['target']].head()

In [None]:
# sns.pairplot(data[features_list+['target']], hue='target')

### 6. Generate Training & Testing Datasets

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
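# Drop the last row: it has no next-day open, so its target is not meaningful
X = data[features_list].iloc[:-1]
y = data['target'].iloc[:-1]

# shuffle=False keeps the rows in chronological order
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.25,
                                                    shuffle=False)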

X_train.shape, y_train.shape, X_test.shape, y_test.shape
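
The effect of `shuffle=False` can be checked on a small example. This is a sketch on dummy data; the dates, the `X_toy` and `y_toy` names, and the feature values are all hypothetical. With shuffling disabled, the test set is simply the final 25% of rows, so every training date precedes every test date.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical daily index with 20 business days of dummy data
idx = pd.bdate_range('2022-01-03', periods=20)
X_toy = pd.DataFrame({'feature': np.arange(20)}, index=idx)
y_toy = pd.Series(np.arange(20) % 2, index=idx)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, shuffle=False
)

# The last training date comes before the first test date
print(X_tr.index.max(), '<', X_te.index.min())
print(len(X_tr), len(X_te))
```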

### 7. Feature Scaling

Now we are almost ready to train our model and start predicting. But many ML algorithms, SVMs included, are sensitive to the scale of their inputs, so we need to make sure that the data we feed to our model is scaled. Let's start by analyzing the distribution of the features.

In [None]:
X_train.columns

In [None]:
sns.pairplot(X_train[['std_5', 'ma_10', 'vma_4']]);

In [None]:
X_train[['std_5', 'ma_10', 'vma_4']].describe()

From the above plot and summary we can see that the features have different distributions and very different scales. Hence, it won't be a good idea to feed this data as it is to the ML algorithm. We need to scale it. We can use the `StandardScaler` class from `sklearn.preprocessing` to do so.

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
# Scaling the features
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [None]:
X_train_scaled_df = pd.DataFrame(X_train_scaled, columns=X_train.columns)

sns.pairplot(X_train_scaled_df[['std_5', 'ma_10', 'vma_4']]);

From the above plot, we can see that the values are now scaled to mean 0 and standard deviation 1. Another interesting thing to note about `StandardScaler` is that it only shifts and rescales the values; it doesn't change the shape of the distribution.

Likewise, all features will now have mean 0 and std 1. 

In [None]:
X_train_scaled_df.describe().round(2)
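
Both points can be checked numerically. The sketch below uses a made-up right-skewed feature (lognormal draws, not the notebook's data) and a small hypothetical `skewness` helper. It fits the scaler on the training part only, as done above, and compares the skewness before and after scaling.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def skewness(x):
    """Third standardized moment (hypothetical helper for this sketch)."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

rng = np.random.default_rng(0)
# Hypothetical right-skewed feature, split into train and test parts
toy_train = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))
toy_test = rng.lognormal(mean=0.0, sigma=1.0, size=(250, 1))

toy_scaler = StandardScaler()
toy_train_scaled = toy_scaler.fit_transform(toy_train)  # learns mean/std from train
toy_test_scaled = toy_scaler.transform(toy_test)        # reuses the train statistics

print('train mean/std:', round(float(toy_train_scaled.mean()), 2),
      round(float(toy_train_scaled.std()), 2))
print('test mean/std: ', round(float(toy_test_scaled.mean()), 2),
      round(float(toy_test_scaled.std()), 2))
print('skew before/after:', round(skewness(toy_train), 3),
      round(skewness(toy_train_scaled), 3))
```

The training part ends up with mean 0 and standard deviation 1, while the test part is only approximately standardized because it reuses the training statistics. That is intended: fitting the scaler on the test data would leak information from the future into the model.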

### 8. Define a Model

Now that our data is scaled, we can define the model. We will use the `SVC` (support vector classifier) estimator from scikit-learn.

In [None]:
# Import necessary package
from sklearn.svm import SVC

In [None]:
SVC?

In [None]:
# Create the model: an SVC with a polynomial kernel (degree 3 by default)
model = SVC(kernel='poly', random_state=1)

### 9. Train the Model

In [None]:
# Train model
model.fit(X_train_scaled, y_train)

Finally, we have arrived at the most interesting point, where we can predict. Let's do it.

### 10. Predict using the Trained Model

In [None]:
# Predict on a train dataset
y_pred_train = model.predict(X_train_scaled)

In [None]:
print('Model accuracy on training data:', model.score(X_train_scaled, y_train))

In [None]:
# Predict on a test dataset
y_pred = model.predict(X_test_scaled)

In [None]:
print('Model accuracy on testing data:', model.score(X_test_scaled, y_test))

### 11. Evaluate the Model

In [None]:
# Another method to calculate accuracy
from sklearn.metrics import accuracy_score

print('Model accuracy on training data:', accuracy_score(y_train, y_pred_train))
print('Model accuracy on testing data:', accuracy_score(y_test, y_pred))

In [None]:
# Importing necessary packages
from sklearn.metrics import confusion_matrix, classification_report

In [None]:
# Printing the confusion matrix
print(confusion_matrix(y_test, y_pred))

In [None]:
# Plotting the confusion matrix
cm = confusion_matrix(y_test, y_pred)
# Use a separate name so the raw price data in `df` is not overwritten
cm_df = pd.DataFrame(cm, index=['Nope', 'Up'], columns=['Nope', 'Up'])
plt.figure(figsize=(5, 4))
sns.heatmap(cm_df, annot=True, fmt='g')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

In [None]:
# Printing the classification report
print(classification_report(y_test, y_pred))
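
To see how the numbers in the classification report relate to the confusion matrix, here is a small sketch. The labels and predictions are made up and are not output from the model above. It recomputes precision, recall and accuracy for the "Up" class (label 1) by hand and compares them with scikit-learn's own functions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels and predictions, not output from the model above
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 1])
y_hat = np.array([1, 1, 1, 0, 0, 1, 1, 0, 1, 1])

# For binary labels [0, 1], ravel() yields tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)   # of the predicted "Up" days, how many were up
recall = tp / (tp + fn)      # of the actual "Up" days, how many were caught
accuracy = (tp + tn) / (tp + tn + fp + fn)

print('precision:', round(float(precision), 3),
      'sklearn:', round(float(precision_score(y_true, y_hat)), 3))
print('recall:   ', round(float(recall), 3),
      'sklearn:', round(float(recall_score(y_true, y_hat)), 3))
print('accuracy: ', round(float(accuracy), 3))
```

Rows of the confusion matrix are actual classes and columns are predicted classes, which matches the axis labels used in the heatmap above.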