# Notebook Instructions

1. All the <u>code and data files</u> used in this course are available in the downloadable unit of the <u>last section of this course</u>.
2. You can run the notebook document sequentially (one cell at a time) by pressing **Shift + Enter**. 
3. While a cell is running, a [*] is shown on the left. After the cell is run, the output will appear on the next line.

This course is based on specific versions of Python packages. You can find the details of the packages in <a href='https://quantra.quantinsti.com/quantra-notebook' target="_blank" >this manual</a>.

# Random Forest
Random Forest, also called Random Decision Forests, is a machine learning method capable of performing both regression and classification tasks. It is a type of ensemble learning: it combines the predictions of many decision trees instead of relying on a single model.

A Random Forest comprises many decision trees, which are graphs of decisions and their possible outcomes. Individual trees of this kind are known as Classification and Regression Tree (CART) models. To classify an object based on its attributes, each tree produces a classification, which counts as a vote for that class. The forest then chooses the class with the most votes. For regression, it takes the average of the outputs of the individual trees.

<b>Working</b>
1. Let N be the number of cases in the training data. For each tree, a sample of N cases is drawn at random, with replacement, and used as the training set for that tree.
2. If there are M input variables, a number m is chosen such that m < M. At each node, m variables are selected at random out of the M, and the best split on these m variables is used to split the node. The value of m is held constant as the trees are grown.
3. Each tree is grown as large as possible.
4. New data is predicted by aggregating the predictions of the n trees (majority vote for classification, average for regression).
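
The steps above can be sketched by hand with plain decision trees. This is only an illustration under stated assumptions: the data is synthetic (generated with `make_classification`), the names `X_demo`, `y_demo`, `n_trees` and `forest_pred` are hypothetical, and m is set to the square root of M through `max_features='sqrt'`. It is not how `RandomForestClassifier` is implemented internally.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data with labels 0 and 1 (an assumption, not the course data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25              # odd, so a two-class vote cannot tie
n_cases = len(X_demo)     # N in the steps above
trees = []

for i in range(n_trees):
    # Step 1: draw N cases at random, with replacement
    idx = rng.integers(0, n_cases, size=n_cases)
    # Steps 2 and 3: consider m = sqrt(M) random features at each split
    # and grow the tree fully (no depth limit)
    tree = DecisionTreeClassifier(max_features='sqrt', random_state=i)
    tree.fit(X_demo[idx], y_demo[idx])
    trees.append(tree)

# Step 4: aggregate the predictions by majority vote
all_votes = np.array([t.predict(X_demo) for t in trees])   # shape (n_trees, N)
forest_pred = (all_votes.mean(axis=0) > 0.5).astype(int)

print('Share of training cases classified correctly:',
      (forest_pred == y_demo).mean())
```

For regression, the last step would average the tree outputs instead of taking a vote.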


Random Forest has certain advantages and disadvantages.

<b>Advantages</b>
1. It has methods for balancing errors in datasets where the classes are imbalanced.
2. It can maintain accuracy even when some of the data is missing and has to be estimated.
3. The out-of-bag error estimate can reduce the need for a set-aside test set.
4. Random Forest helps in unsupervised clustering, data views, and outlier detection.
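
As an illustration of the out-of-bag estimate mentioned above, the sketch below fits a forest with `oob_score=True` and reads the `oob_score_` attribute. The data is synthetic and the names `X_demo`, `y_demo` and `oob_clf` are hypothetical. Each case is scored only by the trees whose bootstrap sample did not contain it.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (an assumption, not the course data)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=3, random_state=0)

# oob_score=True asks sklearn to score each case using only the trees
# that did not see it during training
oob_clf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=5)
oob_clf.fit(X_demo, y_demo)

print('Out-of-bag accuracy:', oob_clf.oob_score_)
```

For time-ordered data such as stock prices, the bootstrap ignores the ordering of the observations, so it is still reasonable to keep a chronological test set as this notebook does.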

<b>Disadvantages</b>

Random Forest is often less well suited to regression problems, because its predictions are averages of training targets and therefore not truly continuous. It also cannot predict beyond the range of target values in the training data. Further, it does not provide complete control to the modeller.
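
The range limitation can be seen in a small sketch. Here a `RandomForestRegressor` is trained on a hypothetical straight line y = 2x for x between 0 and 10, and then asked to predict at x = 20. The names `x_train`, `y_lin`, `reg` and `pred_far` are made up for this illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Training data on the line y = 2x, with x from 0.0 to 9.9
x_train = np.arange(0, 10, 0.1).reshape(-1, 1)
y_lin = 2.0 * x_train.ravel()

reg = RandomForestRegressor(n_estimators=100, random_state=5)
reg.fit(x_train, y_lin)

# x = 20 lies far outside the training range; the true value would be 40
pred_far = reg.predict(np.array([[20.0]]))[0]

print('Prediction at x = 20:', pred_far)
print('Largest target seen in training:', y_lin.max())
```

Because every tree predicts an average of training targets, the forest's prediction cannot exceed the largest target it was trained on.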

You can learn more about <a href="https://blog.quantinsti.com/random-forest-algorithm-in-python/">Random Forest</a> and its application in trading in this article.

In this notebook, you will perform the following steps:

1. [Import Data](#data)
2. [Independent Variables](#x)
3. [Dependent Variable](#y)
4. [Split the Dataset](#split)
5. [Train the Model](#model)
6. [Accuracy Score](#score)

## Import Libraries

In [1]:
# For data manipulation
import numpy as np
import pandas as pd

# Import the RandomForestClassifier class and the accuracy_score function from sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

import warnings
warnings.filterwarnings("ignore")

<a id='data'></a> 

## Import Data

We will read the daily price data of the Bank of America (BAC) stock to create features.

In [2]:
# The data is stored in the directory 'data'
path = '../data/'

# Read stock data from csv file
data = pd.read_csv(path + 'BAC_2010_2021.csv', index_col=0)
data.index = pd.to_datetime(data.index)
data.head()

| Date | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|
| 2009-12-31 | 15.09 | 15.24 | 15.01 | 15.06 | 13.114834 | 94322600 |
| 2010-01-04 | 15.24 | 15.75 | 15.12 | 15.69 | 13.663458 | 180845200 |
| 2010-01-05 | 15.74 | 16.209999 | 15.7 | 16.200001 | 14.107587 | 209521300 |
| 2010-01-06 | 16.209999 | 16.540001 | 16.030001 | 16.389999 | 14.273045 | 205257900 |
| 2010-01-07 | 16.68 | 17.190001 | 16.51 | 16.93 | 14.743298 | 320868400 |


<a id='x'></a> 

## Independent Variables
We will create two features to use as the independent variables:
1. The difference between the open and close prices (`Open - Close`)
2. The difference between the high and low prices (`High - Low`)

In [3]:
# Create input features
data['Open-Close'] = (data['Open'] - data['Close'])
data['High-Low'] = (data['High'] - data['Low'])

# Drop NaN values
data.dropna(inplace=True)

# Store the features in a variable X
X = data[['Open-Close', 'High-Low']]
X.head(2)

| Date | Open-Close | High-Low |
|---|---|---|
| 2009-12-31 | 0.03 | 0.23 |
| 2010-01-04 | -0.45 | 0.63 |


<a id='y'></a> 

## Dependent Variable 
When the next day's adjusted close price is greater than today's adjusted close price, we use 1 as the signal; otherwise, we use -1. We will store this in the variable y, which is the dependent (target) variable.

In [4]:
# 1 if the next day's adjusted close is higher than today's, otherwise -1
y = np.where(data['Adj Close'].shift(-1) > data['Adj Close'], 1, -1)
y

array([ 1,  1,  1, ...,  1, -1, -1])
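
To see how `shift(-1)` and `np.where` build the target, here is the same logic on a small, made-up price series. The names `prices`, `next_day` and `signal` are hypothetical and used only for this illustration.

```python
import numpy as np
import pandas as pd

# Five made-up prices, not taken from the BAC data
prices = pd.Series([10.0, 10.5, 10.2, 10.8, 10.8])

# shift(-1) moves each value up one row, so row i holds the price of day i + 1.
# The last row has no next day and becomes NaN.
next_day = prices.shift(-1)

# Comparisons with NaN are False, so the last row falls into the -1 branch
signal = np.where(next_day > prices, 1, -1)

print(next_day.tolist())
print(signal)   # [ 1 -1  1 -1 -1]
```

Note that an unchanged price and the final row, which has no next-day price, are both labelled -1. The last label is therefore a by-product of the comparison rather than an observed outcome.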

<a id='split'></a> 

## Split the Dataset
We will split the dataset into train and test samples. The train data consists of the first 75% of the dataset. We will test the accuracy of the model on the remaining 25%.

In [5]:
# Training dataset length
split = int(len(data) * 0.75)

# Splitting the X and y into train and test datasets
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]
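
The slicing above keeps the rows in date order, so the model is tested only on data that comes after the training period. As a sketch, the same chronological split can also be obtained from sklearn's `train_test_split` with `shuffle=False`. The toy arrays `X_demo` and `y_demo` below are assumptions used only to show that the two approaches agree.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data with 10 rows in a fixed order (not the course data)
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Manual split, as in the cell above
split_demo = int(len(X_demo) * 0.75)
X_tr, X_te = X_demo[:split_demo], X_demo[split_demo:]

# Equivalent split with sklearn; shuffle=False preserves the row order
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, train_size=0.75, shuffle=False)

print(len(X_tr), len(X_te))
print(np.array_equal(X_tr, X_tr2), np.array_equal(X_te, X_te2))
```

Leaving `shuffle` at its default of `True` would mix future rows into the training set, which is usually undesirable for time series.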

<a id='model'></a> 

## Train the Model

We will use the `RandomForestClassifier` class from sklearn to create and fit the model.

In [6]:
# Create and fit the model on train dataset
clf = RandomForestClassifier(random_state=5)
model = clf.fit(X_train, y_train)
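
Once a forest is fitted, its `feature_importances_` attribute gives a rough view of how much each feature contributed to the splits. The sketch below uses synthetic data in which the label depends only on the first feature. The column names mirror the notebook's features, but the values, and the names `X_demo`, `y_demo`, `demo_clf` and `importances`, are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic features (an assumption, not the BAC data)
X_demo = pd.DataFrame({
    'Open-Close': rng.normal(size=300),
    'High-Low': rng.random(300),
})
# The label depends only on 'Open-Close' by construction
y_demo = np.where(X_demo['Open-Close'] < 0, 1, -1)

demo_clf = RandomForestClassifier(n_estimators=100, random_state=5)
demo_clf.fit(X_demo, y_demo)

# Importances are normalised so that they sum to 1
importances = pd.Series(demo_clf.feature_importances_, index=X_demo.columns)
print(importances)
```

On real price data the importances are unlikely to be this clear-cut, so treat them as a diagnostic rather than as evidence of predictive power.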

<a id='score'></a> 

## Accuracy Score
The model has been trained on the training dataset. Now it's time to test its accuracy on the test dataset. We will use the `accuracy_score` function to do so.

In [7]:
print('Prediction Accuracy (%): ', accuracy_score(y_test, model.predict(X_test), normalize=True)*100.0)

Prediction Accuracy (%):  51.41242937853108


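This is a very simple model with an accuracy of around 51% on the test dataset, which is only marginally above the 50% you would expect from guessing at random between two equally likely classes.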
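
To judge an accuracy figure like this, it can help to compare it with a naive baseline that always predicts the most frequent class in the training data. The sketch below uses sklearn's `DummyClassifier` on randomly generated labels. The data and the names `X_demo`, `y_demo`, `baseline` and `baseline_acc` are assumptions for illustration only, so the printed number says nothing about the BAC data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Random features and labels of 1 and -1 (synthetic, not the course data)
X_demo = rng.normal(size=(400, 2))
y_demo = np.where(rng.random(400) > 0.48, 1, -1)

# Same 75/25 chronological-style split as in the notebook
split_demo = int(len(X_demo) * 0.75)

# The baseline ignores the features and always predicts the majority class
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_demo[:split_demo], y_demo[:split_demo])

baseline_acc = accuracy_score(
    y_demo[split_demo:], baseline.predict(X_demo[split_demo:]))
print('Baseline accuracy (%): ', baseline_acc * 100.0)
```

If a model's test accuracy does not clearly exceed such a baseline, its predictions add little beyond the class frequencies.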