# Python Basics for Quants-2 (PBQ-2)

#### Notebook updated on: 18-02-2022 by Ashutosh Dave
#### This document builds on, and is a continuation of, the PBQ-1 document by Vivek Krishnamoorthy

### Today's Agenda

- A quick recap from the last lecture
    - Primitive and advanced datatypes
    - Control structures
    - Functions in Python
    
- Libraries/Packages in Python
    - Understanding the hierarchy
    - Importing and working with various modules
    
- The `NumPy` Library
    - Understanding the numpy ndarray data structure
    - Vectorization and how NumPy arrays are different from Python lists
    - Use-cases of various methods for NumPy arrays
    - Using mathematical functions from the NumPy library
     
- The `pandas` Library
    - Understanding pandas Series and DataFrames 
    - Series creation and manipulation
    - Use-cases of various methods for Series
    - DataFrame creation and manipulation
    - Reading and writing csv/excel data
    - Use-cases of various methods for DataFrames
    
- Some commonly done calculations in quantitative trading
    - Percentage change calculations
    - rolling() and shift() methods
    - Dealing with missing/NaN data
  

In [None]:
# To display multiple outputs from the same cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## Python Data Types & Data Structures

### Role of syntax in Determining the Data Type

In [None]:
# Syntax automatically determines the type in Python, hence pay attention to the syntax

a = 2
type(a)

In [None]:
a = 2.0
type(a)

In [None]:
a = '2'
type(a)

In [None]:
a = 2>3
type(a)

### Data Structures

- Advanced data types (also called **data structures**) can hold multiple data points
- They are a handy way to store and manipulate data
- Select a data structure depending on the type of data and your requirements

In [None]:
# List: A collection of heterogeneous or homogeneous data points

top_stocks = ['MSFT', 'AAPL', 'TSLA']

In [None]:
# Info about a stock in a list

microsoft = ['MSFT', 210.9, 1800900]
type(microsoft)

In [None]:
# Info about a stock in a tuple, which is like an immutable list
# Use tuples when you want to safeguard the sanctity of data
# Tuples are more efficient but less flexible compared to a list

microsoft = ('MSFT', 210.9, 1800900)
type(microsoft)

In [None]:
# Dictionary: key-value pairs

microsoft = {'symbol':'MSFT', 'last_price':210.9, 'traded_volume': 1800900}
type(microsoft)

In [None]:
microsoft['last_price']

![](pbq_pic1.png)

In [None]:
# A set contains a collection of unique unordered elements
# Set can only have unique elements unlike lists

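# Named 'unique_letters' rather than 'set' so that the built-in set() is not overwritten
unique_letters = {'a','a','u'}
unique_letters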

In [None]:
# Ordering of elements is not an issue in case of sets
{'a','u'} == {'u','a'}

In [None]:
# Compare that to lists
['a', 'u'] == ['u', 'a']

### Methods and Attributes

Python objects such as lists, integers, strings, etc. all have functions associated with them that are used to manipulate the data contained in them. These functions are often referred to as methods. **List objects have list methods, float objects have float methods, and so on.**

Similarly, objects also have **attributes which are the properties associated with them**.

We will use some of them as we go along this lecture.
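
As a minimal sketch of the difference (the variable names here are illustrative and not part of the lecture code): a method is called with parentheses, while an attribute is simply read.

```python
# A method is called with parentheses; an attribute is read without them.
ticker = 'msft'
upper_ticker = ticker.upper()      # method call on a string object

z = complex(3, 4)
real_part = z.real                 # attribute of a complex number
imag_part = z.imag                 # attribute of a complex number

print(upper_ticker, real_part, imag_part)
```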

In [None]:
# Methods for strings
name = 'google'

name.upper()
name.capitalize()

In [None]:
# Methods for lists
players = ['Ronaldo', 'Messi', 'Ibrahimovic', 'Suarez']

# Using the append() method of lists to add another player
players.append('Lewandowski')
players

In [None]:
# append() is a list method; tuples are immutable and have no append(), so the call below raises an AttributeError

my_tuple = ('Ronaldo', 'Messi', 'Ibrahimovic', 'Suarez')

my_tuple.append('Lewandowski')

<br>
<br>
<br>

## Conditional statements & Control Structures in Python

### The `for` loop and `if-elif-else` branching

In programming, one of the key tasks we perform is a **certain set of actions repeatedly for a sequence of data.** Here's where Python's `for` loop comes into its own. It has a simple and flexible interface which allows us to iterate through many 'iterable' objects. 'Iterable' essentially means anything that can be looped over.
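
Lists are iterable, so a `for` loop can walk over their elements directly. The short sketch below uses made-up tickers and prices, and the built-in helpers `enumerate()` and `zip()`, to show a few common ways of looping.

```python
tickers = ['MSFT', 'AAPL', 'TSLA']        # illustrative values
last_prices = [210.9, 120.5, 650.0]       # illustrative values

# Looping over the elements directly (no index needed)
for ticker in tickers:
    print(ticker)

# enumerate() yields (position, element) pairs
for position, ticker in enumerate(tickers):
    print(position, ticker)

# zip() pairs up elements from several iterables
quotes = {}
for ticker, price in zip(tickers, last_prices):
    quotes[ticker] = price

print(quotes)
```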


In [None]:
# Quarterly GDP growth in %
gdp = [7.2, 7, 8, 6.6]


signal = []

In [None]:
# Pay attention to the indentation and the syntax


for i in range(len(gdp)):
    
    if gdp[i]>7 :
        signal.append('buy')
    
    elif gdp[i]< 7:
        signal.append('sell')
        
    else:
        signal.append('hold')

In [None]:
print(signal)

###  One line `if` statements and List Comprehensions

In [None]:
last_price = 12
average_price = 10

# One line if statement with multiple conditions
'long' if last_price > average_price else 'short' if last_price < average_price else 'hold' 


Now we'll look at some examples of **list comprehension, which is an elegant tool to dynamically create sequences from other sequences** <br>
The general **syntax** for list comprehension is **[expression for item in list]**.
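
A small sketch of this syntax, plus the optional trailing `if` that filters items (the variable names are illustrative):

```python
gdp_growth = [7.2, 7, 8, 6.6]   # same illustrative numbers as the 'gdp' list

# [expression for item in list]
doubled = [2 * g for g in gdp_growth]

# [expression for item in list if condition] keeps only the matching items
strong_quarters = [g for g in gdp_growth if g > 7]

print(doubled)
print(strong_quarters)
```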

In [None]:
# Quarterly GDP growth in %
gdp = [7.2, 7, 8, 6.6]

# Using list comprehension and a one-line if statement together to generate a list of signals from the list 'gdp'
signal = [('buy' if growth > 7 else 'sell' if growth <  7 else 'hold') for growth in gdp]
signal

In [None]:
# Another example
import numpy as np   # needed for np.log()

prices = [12.8, 13.9, 13.2, 11.5, 10.3, 7.9, 10, 11.3, 11.5, 12.0, 12.9]

log_prices = [np.log(p) for p in prices]
log_prices

### The `while` loop 

The `while` statement (loop): **repeat the action as long as a specified condition remains true.** <br>
The statement `a = a + 1` can be replaced by `a += 1`. <br>
Similarly, `a -= 1` means `a = a - 1`.
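
As an illustration with hypothetical numbers, a `while` loop is handy when the number of iterations is not known in advance, for instance counting how many years it takes for capital to double:

```python
balance = 100.0        # hypothetical starting capital
annual_return = 0.07   # hypothetical 7% yearly return
years = 0

# Keep looping while the balance has not yet doubled
while balance < 200.0:
    balance *= 1 + annual_return   # same as: balance = balance * (1 + annual_return)
    years += 1                     # same as: years = years + 1

print(years)
```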

In [None]:
month = 1      # initialize a counter variable

while month <=12:
    print(f'{month}/2020')
    month = month + 1     # update the counter variable


### Using `break` and `continue` in loops

When iterating through a sequence, you can **exit the loop completely using break**
and **skip the current iteration using continue.** See the example below.
Before running it, think about what the expected output would be.

In [None]:
for i in range(10):
    if i == 4:
        continue
    else:
        print(i)

In [None]:
for i in range(10):
    if i == 4:
        break
    else:
        print(i)
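
The same two keywords in a slightly more practical sketch: finding the first day on which a price falls below a threshold. The prices and the threshold are made up for illustration.

```python
prices = [12.8, 13.9, 13.2, 11.5, 10.3, 7.9, 10, 11.3]   # illustrative prices
stop_level = 11.0                                          # hypothetical threshold

first_breach = None
for day, price in enumerate(prices):
    if price >= stop_level:
        continue          # nothing to do on this day, move to the next one
    first_breach = day    # first day on which the price fell below the threshold
    break                 # no need to look at the remaining days

print(first_breach)
```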

<br>
<br>
<br>

## Functions

**Functions are a set of statements that together perform a particular task.** <br>For example, we can create a function to find the maximum in a list of numbers.

### In-built functions:

Fortunately for us, there are hundreds of functions which are native or built-in to Python. 

```python
## Some commonly used functions, some of which we have seen already
## Try each of them out and see if you can guess what the output would be

min(20, 21, 25, 21, 9, 18, 15)
round(100/9, 2)
list((1, 2, 3, 4, 5))
type(range(10))
```

In [None]:
# passing integers as arguments into min()

min(1, 2, 3)

In [None]:
# chaining of functions

my_tuple =(1,2,3)

type(min(list(my_tuple)))

#### `methods`

Yes, we have seen them before!

Methods are in-built functions for a datatype or data structure.
Some methods are common between datatypes/data structures, whereas some are unique to a specific datatype/data structure.

In [None]:
# The append() method for lists
my_list = [1, 2, 3]

my_list.append(5)

print(my_list)

In [None]:
# The upper() method for strings

my_string = 'nick leeson'

my_string.upper()

In [None]:
# .items() method for dictionaries

my_dict = {'US':'NASDAQ','UK':'FTSE'}
my_dict.items()

#### `In-built` functions from different libraries

Different libraries/packages also come loaded with their own data structures and functions.<br>We need to import the library/package, before we can use the functions in them.<br><br>
For example, the NumPy library contains a function `np.log()` which, when run on a sequence of values, returns the natural log of each value:


In [None]:
import numpy as np

oil_price = [54.3, 56.8, 58, 57.6, 55]

np.log(oil_price)

### User-defined functions:

We shall also learn to create our own functions for tasks that are unique to our objectives (if the built-in or the functions that come from libraries do not suffice). **User-defined functions** improve the clarity of our code and **facilitate code reuse** (https://en.wikipedia.org/wiki/Don%27t_repeat_yourself), a cardinal principle of good programming.

We **define a function using the keyword `def` followed by the function name and the arguments that it takes as input.** We then write our lines of code as a code block to specify the actions that need to be performed. A function can return any kind of object including other functions. For this, we use `return` statements. We can have any number of `return` statements (including zero). 

In [None]:
# Defining a function that returns a string stating whether the number is negative or not.

def negative_or_not(x):
    if x < 0:
        return 'Negative number'
    else:
        return 'Non-negative number'

We **call a function using its name and passing it the arguments that it needs as input.**

In [None]:
negative_or_not(2)
negative_or_not(-2)
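
Two points made above, sketched with hypothetical function names: a function with no `return` statement gives back `None`, and a function can return another function.

```python
def log_trade(symbol):
    # No return statement: the function implicitly returns None
    print('Traded', symbol)

def make_threshold_rule(threshold):
    # Returns another function that remembers 'threshold'
    def rule(value):
        if value > threshold:
            return 'buy'
        return 'sell'
    return rule

result = log_trade('MSFT')
gdp_rule = make_threshold_rule(7)

print(result)
print(gdp_rule(7.2), gdp_rule(6.6))
```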

### `lambda` functions/Anonymous functions:

We can also create anonymous functions called `lambda` functions which can be very convenient in some cases, such as **when the operations performed are relatively straightforward.**
<br><br>
**Syntax** for a lambda function is <br> **`lambda` arguments : expression**<br>
The expression is executed and the result is returned.

We can write the `lambda` version of the user-defined function above in a single line of code:

In [None]:
negative_or_not_lambda = lambda x: 'Negative' if x<0 else 'Non negative'

In [None]:
negative_or_not_lambda(-2)

In [None]:
negative_or_not_lambda(0)

In the above case, we have explicitly given a name to our lambda function.
However, we will see use cases in the subsequent lectures where we use them directly,
without naming or saving them at all, which keeps short one-off operations concise. Due to this feature, lambda functions are also called 'anonymous' functions.
<br>
For now, just learn the syntax of a `lambda` function carefully.
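
A minimal sketch of using a lambda without naming it, here with the built-in `sorted()` and `map()` functions and made-up quotes:

```python
quotes = [('MSFT', 210.9), ('AAPL', 120.5), ('TSLA', 650.0)]   # illustrative data

# The lambda is passed directly as the 'key' argument and never given a name
by_price = sorted(quotes, key=lambda quote: quote[1])

# map() applies the unnamed lambda to every element
symbols = list(map(lambda quote: quote[0], by_price))

print(symbols)
```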

<br>
<br>


## Libraries/modules/ packages

One of the key design features of Python is that it is a relatively small core language supported by **many high quality libraries (mostly from third parties)**. So we import them depending on the actions/tasks we need to perform. These **libraries are also loosely referred to as modules or packages**. You'll find these words used interchangeably, so get used to it.

```python
## Importing libraries
```
We import or load these libraries using the keyword `import`. We usually import it using an alias to make access easier. If we want to import all the components of a library, we can use *

```python
import pandas as pd
import numpy as np

from math import * # this is almost always a bad idea. Because you won't know what exactly is imported. 
# Plus namespace issues (you can read about this online)
# instead import only what you need
from math import log

# You can also import only a sublibrary or submodule if that's all you need
import statsmodels.stats as sms

## to check what components exist in the imported library we use dir()
```
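
As a small sketch of the `dir()` tip above, using the standard `math` module (the list comprehension that hides private names is just one convenient way to read the output):

```python
import math

# dir() lists the names defined in a module; filter out the private ones
public_names = [name for name in dir(math) if not name.startswith('_')]

print(len(public_names))
print('log' in public_names, 'sqrt' in public_names)
```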

In [None]:
# Importing the in-built OS library
import os
# Using the getcwd function from the OS library to get the current working directory
os.getcwd()

In [None]:
os.chdir('C:\\Users\\Ashutosh\\Desktop\\new')

In [None]:
dir(os)

## The `NumPy` Library

`NumPy` is a very popular library and an integral part of the scientific ecosystem in Python. It is widely used in academia, finance and industry. The key purpose that `NumPy` serves is **fast processing with low memory overheads.**<br>
<br>
`NumPy` provides the powerful array data type called `numpy.ndarray`. These arrays are somewhat like native Python lists, except that the data must be **homogeneous (all elements of the same type).**

`NumPy` is best utilized when the computations that have to be performed can be **vectorized**. Vectorization essentially means expressing a task without an explicit Python `for` loop: rather than performing the same operation sequentially hundreds of times, it is applied to the whole array **batchwise**. For example, computing moving averages for an entire array, or generating a sequence of random values.
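
To make the idea concrete, here is a rough sketch that computes simple returns twice, once with a loop and once in vectorized form. The prices are illustrative.

```python
import numpy as np

prices = [100.0, 102.0, 101.0, 105.0]   # illustrative prices

# Loop version: one simple return at a time
loop_returns = []
for i in range(1, len(prices)):
    loop_returns.append(prices[i] / prices[i - 1] - 1)

# Vectorized version: the whole array is processed in one expression
price_array = np.array(prices)
vector_returns = price_array[1:] / price_array[:-1] - 1

print(np.allclose(loop_returns, vector_returns))
```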

#### Importing

In [None]:
import numpy as np # np is a standard alias for NumPy 

In [None]:
# The library has an attribute called __version__
np.__version__

### Multiple ways to create a NumPy ndarray

`NumPy` provides a number of functions to create arrays:

In [None]:
# Conversion of lists into arrays using np.array()

array_from_list = np.array([1,2,3,4])
array_from_list

In [None]:
print([1,2,3,4])

In [None]:
# notice that there are no commas when we print a NumPy array
print(array_from_list)

In [None]:
# Creation of arrays using np.linspace()
# linspace() returns an array of evenly spaced numbers over a specified interval

array_linspace = np.linspace(0, 4, num=6)
print(array_linspace)

In [None]:
help(np.linspace)

In [None]:
# Creation of arrays using np.arange()

array_arange = np.arange(2, 6, step=0.5)
print(array_arange)

As an exercise, check out how you can create arrays using:  np.zeros() and np.ones().
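
One possible way to approach this exercise (a sketch, so do try your own shapes as well):

```python
import numpy as np

zeros_1d = np.zeros(5)          # five zeros (floats by default)
ones_2d = np.ones((2, 3))       # 2 rows x 3 columns of ones

print(zeros_1d)
print(ones_2d)
print(ones_2d.shape, ones_2d.dtype)
```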

### NumPy array vs Python list

Both data structures have their own advantages. Lists are more flexible and can store heterogeneous data, whereas NumPy arrays store homogeneous data and are much faster for numerical/scientific computations due to **vectorization**.
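
A quick sketch of what "homogeneous" means in practice: when a mixed list is converted, NumPy picks a single dtype for all elements. The values are illustrative.

```python
import numpy as np

mixed_list = [1, 2.5, 3]            # a list happily stores int and float together
mixed_array = np.array(mixed_list)  # the array upcasts everything to one dtype

print(mixed_array, mixed_array.dtype)

# Mixing numbers and strings turns every element into a string
text_array = np.array([1, 'MSFT'])
print(text_array, text_array.dtype)
```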

In [None]:
a= [i for i in range(1000)]
print(a)

In [None]:
b = np.array(a)
print(b)

In [None]:
%%timeit
max(a)

In [None]:
%%timeit
np.max(b)

### Vectorized operations on NumPy arrays

In [None]:
# Let us consider two lists:

height = [1.73, 1.68, 1.71, 1.89, 1.79]
weight = [65.4, 59.2, 63.6, 88.4, 68.7]

In [None]:

weight / height ** 2
# such operations do not work on lists straightaway

In [None]:
# so, we have to use a lengthy for-loop to calculate the bmi for each observation

bmi_list = []

for i in range(len(height)):
    bmi_list.append(weight[i] / height[i] ** 2)

print(bmi_list)

In [None]:
# Need another for-loop to categorize the observations into 'healthy' and 'obese' based on the bmi values

health_report = []

for i in bmi_list:
    if i < 23:
        health_report.append('Healthy')
    else:
        health_report.append('Obese')

print(health_report)

In [None]:
# We can achieve the same objective in a much more convenient way through vectorization
# let us convert the lists to NumPy ndarrays

np_height = np.array(height) 
np_weight = np.array(weight)

In [None]:
# this is a vectorized calculation step

bmi_array = np_weight / np_height ** 2   

In [None]:
# The final result of the vectorized operation is also a NumPy ndarray

bmi_array

In [None]:
# Using np.where() to generate health report

health_report_array = np.where(bmi_array < 23, 'Healthy', 'Obese')
print(health_report_array)

In [None]:
dir(np)

### Generating random samples from statistical distributions using NumPy

We can generate random samples from many statistical distributions such as normal, uniform, binomial, Poisson, F, etc.
using the functions from the `random` submodule of NumPy:

In [None]:
# Generating 10,000 random samples from a normal distribution with mean=0 and stdev=1

x = np.random.normal(0, 1, size=10000) 

In [None]:
x

In [None]:
np.mean(x)

In [None]:
np.std(x)

In [None]:
# Generating 10 random samples from a uniform distribution between 2 and 4.

x = np.random.uniform(2, 4, size=10)
print(x)

In [None]:
np.random.uniform(2, 4, size=10)

In [None]:
np.random.uniform(2, 4, size=10)

In [None]:
# Seeding the random number generation
# This essentially helps us to generate the same set of random numbers every time
# Helpful when you have to present your findings/analysis to others

In [None]:
np.random.seed(100)
np.random.uniform(2, 4, size=10)

In [None]:
# We can recreate the same set of random numbers by providing the same seed
np.random.seed(100)
np.random.uniform(2, 4, size=10)
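
As an aside that is not part of the lecture code: newer NumPy versions also offer `np.random.default_rng()`, which creates a generator object that carries its own seed. A minimal sketch, assuming NumPy 1.17 or later:

```python
import numpy as np

# Each generator carries its own seed and state
rng_a = np.random.default_rng(100)
rng_b = np.random.default_rng(100)

sample_a = rng_a.uniform(2, 4, size=10)
sample_b = rng_b.uniform(2, 4, size=10)

# Same seed, same sequence of draws
print(np.array_equal(sample_a, sample_b))
```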

In [None]:
help(np.random)

In [None]:
help(np.random.seed)

In [None]:
# np.random.randint() can be used to generate random integers between specific values
ty  = np.random.randint(10,21,5)
ty

In [None]:
# Optional/Try later: Visualizing the distributions of created random samples using the Seaborn library
import seaborn as sns
import matplotlib.pyplot as plt

# Generating random data from the normal and Uniform distributions
standard_normal = np.random.normal(0,1,10000)
uniform = np.random.uniform(-2,2,10000)

# Plotting the estimated density curves using the kdeplot function from the Seaborn library
sns.set(rc={'figure.figsize':(12,7)})
sns.kdeplot(standard_normal, color='blue', label='Standard Normal')
sns.kdeplot(uniform, color='red', label='Uniform')
plt.title(' Samples from Normal and Uniform Distributions',fontsize=15)
plt.xlabel('Values')
plt.ylabel('PDF')
plt.legend()
plt.show()

### Attributes of NumPy arrays: ndim, dtype and shape

In [None]:
# creating higher dimensional array like a matrix
x = np.array([[2, 4, 6], [6, 8, 10]], np.int64)
print(x)

In [None]:
print(type(x))

In [None]:
print(x.ndim) # The 'ndim' attribute gives the dimension of the array

In [None]:
print(x.shape)

In [None]:
print(x.dtype)

### Other functions of interest provided by NumPy

#### `np.where()`


**np.where(condition, expression_1, expression_2)**<br>
For each element of the array,
- the first expression value is returned if the condition is true
- the second expression value is returned if the condition is false

In [None]:
x = np.array([2, 3, 1, 7, 2, 8])

In [None]:
np.where(x>3, 'buy', 'sell')

- If only the condition is given (without any expressions), np.where() returns a tuple of arrays (one per dimension) holding the index locations at which the condition is true

In [None]:
# With only a condition, np.where() returns a tuple containing an array of index locations at which the condition is true.

np.where(x>3)

In [None]:
# Fetching elements of array x which are greater than 3

x[np.where(x>3)]
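
`np.where()` calls can also be nested to handle three outcomes. The sketch below reuses the GDP numbers from the earlier loop example, so the result can be compared with the `signal` list built there.

```python
import numpy as np

gdp = np.array([7.2, 7, 8, 6.6])   # same numbers as the 'gdp' list used earlier

# The inner np.where handles the 'sell' vs 'hold' split for the non-buy cases
signal = np.where(gdp > 7, 'buy', np.where(gdp < 7, 'sell', 'hold'))
print(signal)

# np.where(condition) returns a tuple; its first element holds the positions
positions = np.where(gdp > 7)[0]
print(positions)
```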

#### Homework: Try other functions from NumPy such as np.mean, np.std, np.log, np.exp etc.

#### `np.log()`

In [None]:
x

In [None]:
z= np.log(x)
z

In [None]:
np.exp(z)

In [None]:
x = np.array([1,2,3])

In [None]:
x
print(x)

<br>

## The `pandas` Library

`pandas`, like `NumPy`, is a key constituent of the scientific computing framework in Python.
It's a powerful tool which can perform data manipulation and other Excel-like operations at high speed. 

To give you a sense of how comprehensive it is, the [official documentation](https://pandas.pydata.org/pandas-docs/stable/) of `pandas` is over 3000 pages!

**`pandas` is built on top of NumPy.** Many of the more sophisticated statistical libraries, such as `statsmodels` and `scikit-learn`, in turn work closely with `pandas` objects (`scikit-learn` itself is built mainly on NumPy and SciPy). Thus there is a **high degree of compatibility** among these libraries.<br><br>
Two of the most important data structures in `pandas` that we will extensively use are `Series` and `DataFrame`.

In [None]:
# importing the pandas library
import pandas as pd

![](pbq5.png)

### `Series` data structure in `pandas`
**`Series` can be thought of as a single 'column' of data in a spreadsheet, along with its index.** In other words, it is a 
one-dimensional labelled array.


In [None]:
# Consider a list which contains daily returns of a stock for 5 days.

returns_list = [0.05, 0.03, -0.04, 0.04, 0.02]

In [None]:
# We can easily convert it into a pandas Series

import pandas as pd

returns_series = pd.Series(returns_list)

print(type(returns_series))
print(returns_series)

Notice that the output **automatically creates indices numbered from 0 to 4.** This is similar to a NumPy array except that
there are many additional features unique to pandas Series.

In [None]:
# We can provide a more meaningful index:

returns_series = pd.Series(returns_list, index=['Monday','Tuesday','Wednesday','Thursday','Friday'])
print(returns_series)

In [None]:
returns_series.index

In [None]:
# We can also change the index 

returns_series.index=['Mon','Tue','Wed','Thu','Fri']
print(returns_series)

In fact, **`Series` are like Python dictionaries**, with the index being analogous to the `keys` 
and the actual data equivalent to `values`. Even the syntax is similar to Python dictionaries.

In [None]:
returns_series['Thu']
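
A few more dictionary-style operations that also work on a Series, sketched on a small made-up Series:

```python
import pandas as pd

returns = pd.Series([0.05, 0.03, -0.04], index=['Mon', 'Tue', 'Wed'])  # illustrative

has_tuesday = 'Tue' in returns               # membership is tested against the index, like dict keys
saturday_return = returns.get('Sat', 0.0)    # .get() returns a default for a missing label
labels = list(returns.keys())                # .keys() gives the index

print(has_tuesday, saturday_return, labels)
```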

Similarly, we can also manually create a Series object by passing in a dictionary object into the pd.Series() function:

In [None]:
a = {0: 'a', 1: 'b', 2: 'a', 3: 'b', 4: 'd', 5: 'b', 6: 'c', 7: 'a', 8: 'a', 9: 'c'}

# converting a dictionary to a Series
my_series = pd.Series(a)

In [None]:
my_series

In [None]:
my_series.index

In [None]:
my_series.values

In [None]:
my_series.value_counts()

In [None]:
my_series.unique()


#### Some methods and attributes for Series

In [None]:
#fetching all the unique values only
my_series.unique()

# counting the occurrences of each unique value
my_series.value_counts()

In [None]:
# checking the dimension and shape
# Note that the NumPy array attributes 'ndim' and 'shape' exist for pandas Series also
my_series.ndim
my_series.shape

In [None]:
# The series.describe() gives us a statistical summary of the data in the Series container
my_series.describe()

In [None]:
returns_series.describe()

## `DataFrame` data structure in `pandas`

We can think of `DataFrame` like an Excel sheet with **multiple columns of data along with its index**, where each column is capable of handling different data types. In other words, it is a **two-dimensional labelled data container.**

### Creating a DataFrame manually using a python dictionary

In [None]:
# We can create a DataFrame manually, just by passing in a dictionary object into pd.DataFrame command
my_dataframe = pd.DataFrame({"Name":["Ash", "Jay", "Mario", "Vivek"], 
                             "Roll":[1, 2, 3, 4], 
                             "Marks":[35, 64, 34, 65]})
my_dataframe

In [None]:
type(my_dataframe)

### Deconstructing a pandas DataFrame

A DataFrame consists of three parts viz.
1. `index`
2. `columns`
3. `values`

In [None]:
my_dataframe.index

In [None]:
type(my_dataframe.index)

In [None]:
# Gives you details of the columns in the DataFrame
my_dataframe.columns

In [None]:
my_dataframe.values

In [None]:
# values are of the type NumPy arrays
type(my_dataframe.values)

**Notice that the values in a DataFrame container are of the type NumPy ndarray.**


### Reading csv/excel data into a DataFrame 

#### The working directory


We will now read a csv file into a pandas DataFrame, however
before we do that, **we need to make sure our csv file is stored in the current working directory.**
How do you check where your current working directory is? We need to take the help of the 'os' module for that!

In [None]:
import os

# Using getcwd() function from the os library to get the current working directory
os.getcwd()

#os.chdir('provide path here')      # You can use this command to change/set a folder as your working directory

In [None]:
# Reading a csv file into a DataFrame using read_csv() function from the pandas library

data1 = pd.read_csv("TCS.NS.csv")

data1.head(7)   # Useful way to get a quick look at your DataFrame
#data1.tail()
# try pd.read_excel()

In [None]:
# Here we're loading the csv file and specifying that the 'Date' column is to be assigned as index

data2 = pd.read_csv("TCS.NS.csv", index_col="Date")
data2.head(7)

In [None]:
type(data2.index)

In [None]:
# We use parse_dates=True so that pandas recognizes our date column which makes filtering and data manipulation more convenient later
import pandas as pd
data3 = pd.read_csv("TCS.NS.csv", index_col="Date", parse_dates=True, dayfirst=True)
data3.head(7)

In [None]:
type(data3.index)

In [None]:
# The .copy() method creates a copy of the original dataframe. 
# We will now create a copy and work with it.
df = data3.copy()

In [None]:
df.shape # Gives you an idea of how many rows and columns are there

In [None]:
df.info() #Information about the DataFrame

In [None]:
df.describe() # Gives a statistical summary for each column in the DataFrame

### Selecting subsets of DataFrame

There are three key ways by which we select subsets of our dataset for further analysis.
* Using the indexing operator `[]`
* Using the label based indexing operator `.loc[]` 
* Using the integer based indexing operator `.iloc[]`

We'll now look at some use cases using these different indexers.
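
Before turning to the TCS data, here is a compact sketch of the three indexers side by side on a tiny DataFrame with hypothetical prices:

```python
import pandas as pd

# Hypothetical prices, only to illustrate the three indexers
demo = pd.DataFrame(
    {'Open': [100.0, 101.0, 102.0], 'Close': [101.5, 100.5, 103.0]},
    index=pd.to_datetime(['2019-01-07', '2019-01-08', '2019-01-09']),
)

open_column = demo['Open']                   # [] with a column name gives a Series
by_label = demo.loc['2019-01-08', 'Close']   # .loc uses row and column labels
by_position = demo.iloc[1, 1]                # .iloc uses integer positions

print(by_label, by_position)
```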

### Using the indexing operator `[]`

In [None]:
openprices = df['Open']
type(openprices)
# openprices is a Series object which has two components viz. index and the data (called values). There are NO columns

In [None]:
closeprices = df['Close']
closeprices 

In [None]:
closeprices = df[['Close']]
# this is technically a DataFrame. Check the type and see for yourself

In [None]:
closeprices

In [None]:
# We can select multiple columns by passing a list of column names
# The columns appear in the result in the order in which they are listed

close_open = df[['Close', 'Open']]
close_open.head(10)

In [None]:
df.head()

In [None]:
# creating a new column and initializing it with all zeroes

df['New'] = 0 
df.head()

In [None]:
# Deleting a column

del df['New'] 
df.head()

In [None]:
# We can use methods on this subsetted data

df['Open'].sum()

In [None]:
# Selecting specific rows
df[0:4]

In [None]:
# Selecting rows based on conditions

df[(df['Open']<1250)]    # single condition

In [None]:
df[(df['Open']<1250) & (df['Close']>1260)]    # multiple conditions

In [None]:
# Another way is to use df.query
df.query('Open<1250 and Close>1260')

### Using the label based indexing operator `.loc[]`

It selects data based on the **'label'** of the rows and columns. Also, it can simultaneously select subsets of rows and columns.

In [None]:
# Selecting a single row by label

df.loc['2019-01-07'] 

In [None]:
# using the slice notation : to select a range of columns or rows
# Unlike slicing in other data structures such as Python lists, the last value is included here

df.loc['2019-01-07': '2019-02-03'] 

In [None]:
df.loc['2019-05-02':] ## slicing from the specified date to the end

In [None]:
df.loc[:'2019-01-16'] ## slicing from the start to the specified date

In [None]:
df.loc['2019-01-07': '2019-02-03':3] ## slicing between the two specified dates and selecting every
# 3rd row (i.e. every 3rd trading day in the data)

In [None]:
# Selecting rows and columns simultaneously

df.loc['2019-01-07': '2019-02-03',['Open','High','Low']]

In [None]:
# remember that when we created the 'data2' DataFrame object, 
# we didn't ask Python to parse the index as a date

data2.loc['Aug 2016'].tail()
# so we get an error.

In [None]:
# However, for 'data3' we used parse_dates=True
data3.loc['Aug 2016'].tail()

In [None]:
# When the index of the DataFrame is a DatetimeIndex object as in the case of data3,
# there is increased flexibility in manipulation using the index labels.

# Remember that when we created the data3 DataFrame object,
# we explicitly told Python to parse the index as a date, unlike the data2.

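# The condition is applied to data3 first so that the boolean mask and the DataFrame share the same index
data3[data3['Direction']=='UP'].loc['Oct 2019']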

In [None]:
# fetching the same in two steps
df3=data3.loc['Oct 2019']
df3[df3['Direction']=='UP']

In [None]:
# We can also use the following format for date label for data3 DataFrame

data3.loc['08-2016'].tail()

### Using the integer based indexing operator `.iloc[]`

It selects data based on the **integer locations** of the rows and columns. Very similar to the .loc[] operator. One difference is that `.loc` includes the last value of index when slicing, whereas `.iloc` excludes it like the rest of Python.
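
The difference in how the two operators treat the end of a slice, sketched on a small DataFrame with hypothetical labels:

```python
import pandas as pd

demo = pd.DataFrame({'Close': [10, 11, 12, 13, 14]},
                    index=['a', 'b', 'c', 'd', 'e'])   # hypothetical labels

label_slice = demo.loc['b':'d']      # rows b, c and d: the end label is included
position_slice = demo.iloc[1:3]      # rows at positions 1 and 2: position 3 is excluded

print(len(label_slice), len(position_slice))
```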

In [None]:
df =data3.copy()

In [None]:
# Selecting a single row

df.iloc[4] 

In [None]:
# Selecting multiple rows

df.iloc[[4, 8, 20]] 

In [None]:
# using the slice notation : to select a range of columns or rows

df.iloc[0:6] 

In [None]:
# Every alternate row from position 0 up to (but excluding) position 6
    
df.iloc[0:6:2]

In [None]:
# Selecting rows and columns simultaneously with .iloc

df.iloc[[4, 5], [0,3]] # selecting two rows and two columns

In [None]:
df.iloc[4:6, [0, 3]] # selecting a slice of rows and two columns

### Visualization: Plotting the data using Matplotlib library 

One of the great features about `pandas` is how it works so symbiotically with some of the other popular Python packages such as `NumPy` and `Matplotlib`. Under the hood, the `values` of the `pandas Series` and the `pandas DataFrame` objects are `NumPy ndarrays`. 

We will use `matplotlib` in conjunction with `pandas` to plot and visualize our data. You can read and learn more about it [here](https://matplotlib.org/api/pyplot_summary.html).

In [None]:
# You can do a quick plot of the Adj Close prices over time using the .plot() method

import matplotlib.pyplot as plt
%matplotlib inline

df['Adj Close'].plot(figsize=(15,7)) 
plt.show()


For plotting multiple lines on the same graph, **pass a list of column names to the plot() method.**


In [None]:
df[['Close','Open']].plot(figsize=(15,7))
plt.show()

In [None]:
df.head()

## Some commonly done calculations in quantitative trading

In [None]:
# Recreating a copy of data3
df = data3.copy()
df.head()

### Percentage change calculations

In [None]:
# Computing the percentage change between today's Close and previous day's Close using the method pct_change()

df['Close_to_Close'] = 100 * df['Close'].pct_change()
df.head()

In [None]:
# Computing the percentage change between today's Close and Open
# This is an example of vectorized operation

df['intraday_return'] = 100 * (df['Close'] / df['Open'] - 1) 
df.head()

### The shift operator

The shift operator **time-shifts the specified columns either forward or backward by the number of steps specified.** 
Here `shift(1)` (the default, written as `shift()` in the code) moves the Close price column down by one row, so that each row holds the previous day's Close. The shift operator is helpful in making 
vectorized operations possible in certain cases.
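
A small sketch of both directions of the shift on made-up closing prices, including a daily return built by hand from the shifted column:

```python
import pandas as pd

close = pd.Series([100.0, 102.0, 101.0, 105.0])   # illustrative closing prices

previous_close = close.shift(1)    # values move down one row; the first row becomes NaN
next_close = close.shift(-1)       # values move up one row; the last row becomes NaN

# A daily return built with shift(), for comparison with pct_change()
manual_return = close / previous_close - 1

print(pd.DataFrame({'close': close,
                    'previous_close': previous_close,
                    'next_close': next_close,
                    'manual_return': manual_return,
                    'pct_change': close.pct_change()}))
```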

In [None]:
# Creating a new column called 'Previous_Close' by using the shift operator on Close column.

df['Previous_Close'] =  df['Close'].shift()
df.head()

In [None]:
# Calculating the percentage return between previous close and today's open

df['overnight_return'] = 100*(df['Open']/df['Previous_Close']-1)
df.head()

### Moving Average calculations using .rolling()

In [None]:
# Creating a new column called 'MA_5' containing the 5-day Moving Average of Close prices.

n = 5

df['MA_5'] = df['Close'].rolling(window=n).mean()
df.head(10)
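
The first `n - 1` rows of a rolling average are NaN because the window is not yet full. A sketch on made-up prices, which also shows the optional `min_periods` argument of `.rolling()`:

```python
import pandas as pd

close = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 15.0])   # illustrative prices

ma_3 = close.rolling(window=3).mean()
# The first window-1 entries are NaN because the window is not yet full
print(ma_3)

# min_periods=1 computes the average over however many values are available
ma_3_partial = close.rolling(window=3, min_periods=1).mean()
print(ma_3_partial)
```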

### Working with missing/NaN values

In [None]:
df.head()

In [None]:
# Find out the number of missing values in each column
df.isnull().sum()

In [None]:
df.dropna(axis=0).head()  #deletes all rows with NaN values and creates a new dataframe (leaves the original dataframe unchanged)

In [None]:
df.dropna(axis=1).head() # deletes all columns with NaN values and creates a new dataframe (leaves the original dataframe unchanged)
# df.dropna(axis=1,inplace=True)

In [None]:
df.fillna( value = 0) 

In [None]:
df.bfill().head()   # backward fill: each NaN takes the next valid value below it

# Try df.ffill() after the lecture

In [None]:
# Filling NaNs in a particular column
# Assigning the result back to the column modifies the original DataFrame

df['MA_5'] = df['MA_5'].fillna(value=df['MA_5'].mean())
df.head()

### The correlation matrix

In [None]:
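df.corr(numeric_only=True)   # restrict to numeric columns; text columns such as 'Direction' cannot be correlated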

In [None]:
df[['intraday_return','overnight_return']].corr()

### The apply method

In [None]:
# Creating a function to calculate the day's Close minus Open (named 'daily_range' here)
def daily_range(x):
    return x['Close']-x['Open']

In [None]:
df.head()

In [None]:
# Creating a new column by applying the function we defined above to the DataFrame df
df['daily_range'] = df.apply(daily_range, axis=1)
df.head()
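
For a simple calculation like this one, `apply` with `axis=1` calls the function once per row, whereas plain column arithmetic does the same job in one vectorized step. A sketch on hypothetical prices standing in for the TCS data:

```python
import pandas as pd

# Hypothetical prices standing in for the TCS data
demo = pd.DataFrame({'Open': [100.0, 101.0, 102.0],
                     'Close': [101.5, 100.5, 103.0]})

def daily_range(x):
    return x['Close'] - x['Open']

via_apply = demo.apply(daily_range, axis=1)     # calls the function once per row
vectorized = demo['Close'] - demo['Open']       # one operation on whole columns

print(via_apply.tolist())
print(vectorized.tolist())
```

`apply` remains useful when the row-wise logic cannot easily be written as column arithmetic.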

### Writing back to excel

In [None]:
df.to_excel('modified_df.xlsx')
#df.to_csv('modified_df.csv')

### A little homework!

Search and find out the syntax and use cases for the following:
- **df.set_index()**
- **pd.concat()**
- **df.groupby()**
<br>
<br>
An awesome source for beginners: https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html

<a id = 'references'></a>
## References

1. The Official Python documentation - https://docs.python.org/3/contents.html
2. Python Basics - https://www.quantinsti.com/Python-Basics-Handbook.pdf
3. Jupyter Lab - http://jupyterlab.readthedocs.io/en/stable/
4. `pandas` documentation - https://pandas.pydata.org/pandas-docs/stable/
5. Learning pandas, Michael Heydt - https://www.amazon.com/Learning-Pandas-Python-Discovery-Analysis/dp/1783985127
6. Style Guide for Python Code - https://www.python.org/dev/peps/pep-0008/
7. Dealing With Error And Exceptions In Python - https://www.quantinsti.com/blog/dealing-python-error-exceptions/
8. Python Exception: Raising And Catching Exceptions In Python - https://www.quantinsti.com/blog/python-exception/
9. Don't repeat yourself - https://en.wikipedia.org/wiki/Don%27t_repeat_yourself
10. `Matplotlib` - https://matplotlib.org/api/pyplot_summary.html