Mathematics is the most fundamental part of ML. But the list of relevant fields goes on and on (algebra, statistics, calculus, geometry, etc.), and it is easy to get confused. When we ask experts where to start, most of them suggest beginning with Linear Algebra.
But the problem does not stop there. We get stuck on the deep derivations and get confused again. So here we will go through an overview of the topics in Linear Algebra and the motivation behind them.
Why Do We Learn Linear Algebra?
When we work with data, the first step is to arrange it in rows and columns, which looks like a matrix. This is where linear algebra comes into play.
Data in 1 dimension is called a vector, data in 2 dimensions is called a matrix, and data in n dimensions is called a tensor.
A matrix can be implemented with the NumPy package, which we discussed in the previous blog. Before we look into that, let us take a quick look at the reasons NOT to learn Linear Algebra:
1. Studying the entire field of Linear Algebra takes months to years. This delays the goal of working on real ML problems.
2. Not all topics in Linear Algebra are relevant to Machine Learning, whether theoretical or applied.
Implementing Matrices in Python
A 2-dimensional NumPy array is used to represent a matrix in Python. Passing a list of lists to a NumPy array gives us a matrix. For example, a list containing 2 lists of 3 values each gives a 2x3 matrix.
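A minimal sketch of such a 2x3 matrix is given here. The values and the variable name A are only illustrative.

```python
import numpy as np

# A list of lists: 2 inner lists (rows), each with 3 values (columns)
A = np.array([[1, 2, 3],
              [4, 5, 6]])

print(A)
print(A.shape)  # (rows, columns)
```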
Matrix – Matrix Multiplication
Matrix multiplication is also called the matrix dot product, and it is more complicated than the previous operations.
The basic rule for this dot product is that the number of columns in the first matrix must equal the number of rows in the second matrix.
If A is of shape M x N and B is of shape N x P, then C = AB is of shape M x P.
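The shape rule can be checked with a small sketch. The matrices A and B below are made-up examples with M = 2, N = 3 and P = 2.

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])      # shape (2, 3): M = 2, N = 3
B = np.array([[1, 2],
              [3, 4],
              [5, 6]])         # shape (3, 2): N = 3, P = 2

C = A.dot(B)                   # same result as A @ B
print(C.shape)                 # (2, 2), i.e. M x P
print(C)
```

Swapping the shapes so that the inner dimensions differ (for example, A.dot(A)) raises a ValueError, which is a quick way to see the rule in action.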
#identity Matrix
from numpy import identity
I = identity(3)
print(I)
Have fun by making changes to the above code and do let me know if you found something interesting 🙂
We will explore a new dataset extracted from the open-source UCI Repository.
As a first step, we will understand our data using the pandas library.
1. Reading our data: Our data is downloaded as a CSV file, hence we use the read_csv method to read it.
#import necessary python packages
import numpy as np
import pandas as pd
#Reading our data
data=pd.read_csv("your csv file location path goes here[data.csv]")
#To print the values, just enter the variable name
data
print(data.shape)
print(data.columns)
In order to display first 5 rows of data, we use head() method and for the last 5 rows we use tail() method.
data.head() #For first 5 rows
data.tail() #For last 5 rows
Now that we have read our data, let’s have a quick look at the variables in it.
Unique values of features (for more information please see the link above):
If a variable can take on any value between two specified values, it is called a continuous variable; otherwise, it is called a discrete variable.
The below examples will clarify the difference between discrete and continuous variables.
Suppose the fire department mandates that all fire fighters must weigh between 150 and 250 pounds. The weight of a fire fighter would be an example of a continuous variable; since a fire fighter’s weight could take on any value between 150 and 250 pounds.
Suppose we flip a coin and count the number of heads. The number of heads could be any integer value between 0 and plus infinity. However, it could not be any number between 0 and plus infinity. We could not, for example, get 2.5 heads. Therefore, the number of heads must be a discrete variable.
2. Exploring our data:
Now, we will try to answer some of the basic questions to understand the data better.
** Count the number of males and females in the sex feature (column). Hint: To count the values, we need to use the value_counts() method from the pandas library.
data['sex'].value_counts()
** Calculate the average age of men and women. Hint: To access a column, we use the column name itself, but to access rows we use the .loc method.
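One possible way to follow the hint is sketched below. Since the real CSV file is not available here, the small data frame is a hypothetical stand-in, and the column name 'age' and the labels 'Male' and 'Female' are assumptions about the dataset.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset; column names and values are assumed
data = pd.DataFrame({
    'sex': ['Male', 'Female', 'Male', 'Female'],
    'age': [40, 30, 50, 36],
})

# Select the rows for each group with .loc, then average the 'age' column
male_mean = data.loc[data['sex'] == 'Male', 'age'].mean()
female_mean = data.loc[data['sex'] == 'Female', 'age'].mean()
print(male_mean, female_mean)

# The same result in one step
print(data.groupby('sex')['age'].mean())
```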
** Find the maximum number of hours a person works per week (hours-per-week feature). Likewise, check the minimum and average hours a person works. Hint: Use the column hours-per-week and the max() method.
data['hours-per-week'].max()
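** Find the number of people who work that maximum number of hours. Hint: Use the above function and store the result in a variable.
# Avoid naming the variable max, which would shadow Python's built-in max()
max_hours = data['hours-per-week'].max()
workaholics = data[data['hours-per-week'] == max_hours].shape[0]
workaholics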
** Now we can see that 85 of them were working 99 hours per week. Can we find out how many of them earn a salary of >50K?
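A hedged sketch of one approach is to combine two conditions in a single filter. The tiny data frame below is a hypothetical stand-in for the real dataset, and the salary labels '>50K' and '<=50K' are assumptions.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset; the salary labels are assumed
data = pd.DataFrame({
    'hours-per-week': [99, 40, 99, 99, 20],
    'salary': ['>50K', '<=50K', '<=50K', '>50K', '>50K'],
})

max_hours = data['hours-per-week'].max()

# Combine two conditions with & (each condition needs its own parentheses)
rich_workaholics = data[(data['hours-per-week'] == max_hours)
                        & (data['salary'] == '>50K')].shape[0]
print(rich_workaholics)
```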
3. Visualizing our data: We can visualize our data through scatter plots, pair plots, histograms, etc., as we studied in our previous matplotlib blog.
import matplotlib.pyplot as plt
import seaborn as sns
# seaborn is another Python visualization library
sns.set_style("whitegrid")
sns.pairplot(data, hue='salary', height=3)
plt.show()
Hope you have fun changing the above code and getting more insights from the data. We will look at other interesting datasets in the coming blogs.
As the name suggests, once we are provided with real-world data, we should first understand the data and then try to get some insights from it.
These insights can be gained only by analyzing and exploring the provided data, a process we call Exploratory Data Analysis, or EDA.
The data analysis can be done with the basic Python libraries NumPy, pandas, Matplotlib, etc. Good knowledge of our data helps us get the answers we need or develop an intuition for interpreting the results of future modeling.
There are a lot of ways to reach these goals: we can get a basic description of the data, visualize it, identify patterns in it, identify challenges of using the data, etc.
Direct definition: Exploratory data analysis is an approach to analyzing data sets by summarizing their main characteristics with visualizations. The EDA process is a crucial step prior to building a model in order to unravel various insights that later become important in developing a robust algorithmic model.
Now we will go through the basic analysis steps to follow once we have real-world data.
1. Choosing a Dataset: First, rely on open-source data to get started with ML. There are mountains of data for machine learning around, and some companies (like Google) are ready to give it away. We have many open sources like the UCI Repository, Data.gov, Kaggle, etc. From these we can choose a dataset based on our interests.
2. Exploring the Dataset: The critical step in the EDA process is understanding the data better by using different Python libraries and checking the relationships within the data.
3. Data Visualization: This step makes use of plotting libraries like matplotlib to portray our understanding in pictures. This helps us get more insights from the data.
4. The Final step: This step is where our ML algorithms come into play to provide insights on future unseen data.
Now that we have some basic understanding of the Python libraries, we are going to have fun with real-world data in the next blog.
Get ready to get your hands dirty. Have a great day :) Bye
Scikit-Learn is the standard library for machine learning in Python. It provides solid implementations of a wide selection of supervised and unsupervised learning algorithms. Best of all, it is widely regarded as one of the easiest and cleanest ML libraries.
Scikit-Learn is built on top of several common data and math Python libraries. Such a design makes it super easy to integrate between them all. You can pass numpy arrays and pandas data frames directly to the ML algorithms of Scikit-Learn! It uses the following libraries:
NumPy: For any work with matrices, especially math operations
There is an old saying, “A picture is worth a thousand words”. This is true of the matplotlib library in Python, which lets us visualize our work in 2-dimensional space.
Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy. One of the greatest benefits of visualization is that it allows us visual access to huge amounts of data in easily digestible visuals. Matplotlib consists of several plots like line, bar, scatter, histogram etc.
Importing matplotlib :
from matplotlib import pyplot as plt
or
import matplotlib.pyplot as plt
Basic plots in Matplotlib :
Matplotlib comes with a wide variety of plots. Plots help us understand trends and patterns and make correlations. They’re typically instruments for reasoning about quantitative information. Some sample plots are covered here.
#importing matplotlib module
from matplotlib import pyplot as plt
#x-axis values
x=[5,3,7,2,1]
#y-axis values
y=[10,5,2,9,2]
plt.plot(x,y)
#Function to show the plot
plt.show()
Pandas consists of two basic data structures called Series and DataFrame.
Pandas is also one of the most important Python libraries for data manipulation and analysis.
Like the NumPy package, we install the pandas package with the command “pip install pandas” at the command prompt. The installed package can then be loaded with the import command “import pandas as pd”, where pd is the alias used to access the package.
Series:
Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic method to create a Series is to call:
s = pd.Series(data)
Now, we will jump into the basic operations on pandas series in python.
Creating a Pandas Series – In order to create a series from an array, we have to import the numpy module and use the array() function. Check my Numpy Basics blog to know more about numpy arrays.
import pandas as pd
import numpy as np
#Simple Array
data = np.array(['g', 'e', 'e', 'k', 's'])
# Converting the numpy array into a Series object and storing it in a variable
series = pd.Series(data)
In the above operation, we created a series using a numpy array. Likewise, we are going to create a series using a Python list.
# a simple list (named letters to avoid shadowing Python's built-in list)
letters = ['g', 'e', 'e', 'k', 's']
# create a series from a list
series = pd.Series(letters)
print(series)
Accessing Elements from Series:
The Series element can be accessed by position as well as by index.
# creating a simple array
data = np.array(['g','e','e','k','s','f','o','r','g','e','e','k','s'])
series = pd.Series(data)
# Retrieving the first 5 elements
print(series[:5])
Let’s check on some binary operations on series:
# importing pandas module
import pandas as pd
# creating a series and assigning it to a variable data
data = pd.Series([5, 2, 3, 7], index=['a', 'b', 'c', 'd'])
# creating a series and assigning it to a variable data1
data1 = pd.Series([1, 6, 4, 9], index=['a', 'b', 'd', 'e'])
print(data, "\n\n", data1)
# adding two series using .add
data.add(data1)
# subtracting two series using .sub
data.sub(data1)
mul() – Multiplies the caller series element-wise by another series or list-like object of the same length.
div() – Divides the caller series element-wise by another series or list-like object of the same length.
sum() – Returns the sum of the values for the requested axis.
prod() – Returns the product of the values for the requested axis.
mean() – Returns the mean of the values for the requested axis.
pow() – Raises each element of the caller series to the power given by the corresponding element of the passed series.
abs() – Returns the absolute numeric value of each element in a Series/DataFrame.
cov() – Computes the covariance of two series.
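A short sketch of a few of these methods follows, reusing the two series from the binary operations example. One behaviour worth noticing is that binary methods match elements by index label, not by position.

```python
import pandas as pd

data = pd.Series([5, 2, 3, 7], index=['a', 'b', 'c', 'd'])
data1 = pd.Series([1, 6, 4, 9], index=['a', 'b', 'd', 'e'])

# Binary methods align on index labels; labels found in only one
# series ('c' and 'e' here) produce NaN in the result
product = data.mul(data1)
print(product)

# fill_value substitutes for a label missing from one of the two series
print(data.mul(data1, fill_value=1))

# Reductions work on a single series
print(data.sum(), data.prod(), data.mean())
```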
DataFrame:
A pandas DataFrame consists of three main components: the data, the rows, and the columns.
A pandas DataFrame can be created by loading data from existing external storage such as a SQL database or CSV files.
But a pandas DataFrame can also be created from lists, dictionaries, etc.
One of the ways to create a pandas data frame is shown below:
# import the pandas library
import pandas as pd
# Dictionary of key-value pairs called data
data = {'Name': ['Ashika', 'Tanu', 'Ashwin', 'Mohit', 'Sourabh'], 'Age': [24, 23, 22, 19, 10]}
data
# Output: {'Age': [24, 23, 22, 19, 10], 'Name': ['Ashika', 'Tanu', 'Ashwin', 'Mohit', 'Sourabh']}
# Calling the pandas data frame method by passing the dictionary (data) as a parameter
df = pd.DataFrame(data)
df
Output
Performing on Rows and Columns:
Selecting a Column: To select a particular column, we just pass the name of the column in brackets to the data frame.
# Calling the pandas data frame method by passing the dictionary (data) as a parameter
df = pd.DataFrame(data)
# Selecting column
df[['Name']]
output
Selecting a Row: A pandas DataFrame provides an indexer called “loc”, which retrieves rows by label. Rows can also be selected by integer position using “iloc”.
# Calling the pandas data frame method by passing the dictionary (data) as a parameter
df = pd.DataFrame(data)
# Selecting a row
row = df.loc[1]
row
# output
Name    Tanu
Age       23
Name: 1, dtype: object
The loc indexer selects rows by index label (here the labels are the default integers 0, 1, 2, ...), whereas iloc accepts only integer positions.
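The difference between the two indexers becomes visible once the index labels are not plain integers. The sketch below uses a shortened version of the data above, and the string labels 'x', 'y', 'z' are assumed purely for illustration.

```python
import pandas as pd

data = {'Name': ['Ashika', 'Tanu', 'Ashwin'], 'Age': [24, 23, 22]}
# Custom string labels (assumed for this example) make the difference visible
df = pd.DataFrame(data, index=['x', 'y', 'z'])

by_label = df.loc['y']      # row whose label is 'y'
by_position = df.iloc[1]    # second row, whatever its label is

print(by_label)
print(by_position)
```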
Working with Missing Data :
Missing data occurs often when we work with big data sets. It usually appears as NaN (Not a Number). To find those values, we can use the “isnull()” method, which checks whether each value in a data frame is null.
# importing both pandas and numpy libraries
import pandas as pd
import numpy as np
# Dictionary of key-value pairs called data
data = {'First name': ['Tanu', np.nan], 'Age': [23, np.nan]}
df = pd.DataFrame(data)
df
# using the isnull() function
df.isnull()
isnull() returns False where a value is present and True where it is null.
Now that we have found the missing values, the next task is to fill them with 0, which can be done as shown below:
df.fillna(0)
Now the null values are replaced with 0 in the returned data frame.
We have now reached the end of this blog. Hope you enjoyed reading it!
We are here to explore one of the basic packages in Python, called NumPy. Without mathematical computations, we cannot infer any insights from the data, hence we use the NumPy package for this.
We can arrange our data in one dimension, two dimensions, and so on. 1D data are vectors, 2D data are matrices, and nD data are tensors.
Here we will perform some operations on these 1D, 2D and nD forms of data. The data is stored in a data structure called an array, which can have different dimensions as said above. The library also provides high-level mathematical functions and supports the following:
Powerful N-dimensional array object.
Broadcasting functions.
Tools for integrating C/C++ and Fortran code.
Useful for computing linear algebra, Fourier transform, and random number capabilities.
Now we will install the numpy package using the command “pip install numpy”.
In order to use this package with our data, we need to import it using the command
“import numpy as np”, where np is an alias that will be used everywhere to access the numpy package.
Creating Numpy Arrays
NumPy arrays are created through the alias np. Since a 2-dimensional array is built from a list of lists, we create it using two levels of square brackets: [[ ]].
The type() function tells us the type of the array object, whereas the shape attribute gives us the number of rows and columns in the array. The dtype attribute gives us the datatype of the array values.
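A minimal sketch of creating a 2-dimensional array and inspecting it is shown here. The values and the name arr are illustrative only.

```python
import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6]])

print(type(arr))    # the class of the array object
print(arr.shape)    # (number of rows, number of columns)
print(arr.dtype)    # datatype of the elements (an integer type; exact width depends on the platform)
```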
Now, we will explore some basic methods in numpy arrays which can be used for our analysis on the real world data.
arange() – The arange method creates an array of evenly spaced values, with the following parameters.
start – the first value of the array. [0]
stop – the end of the array (the stop value itself is excluded). [till 10]
step – the spacing between consecutive values. The default value is 1.
The related linspace() method takes start, stop and the number of samples, plus the following 2 parameters.
endpoint – This is a boolean value. If endpoint is True, then stop is the last sample of the sequence. The default value of endpoint is True.
retstep – This is a boolean value. If retstep is True, the method returns both the samples and the step (the difference between 2 consecutive samples). The default value of retstep is False.
reshape() – The total number of array elements must equal the product of the reshape parameters.
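The parameters above can be tried out with the sketch below. It assumes that endpoint and retstep refer to NumPy's linspace() (arange() itself takes start, stop and step); the concrete numbers are arbitrary.

```python
import numpy as np

a = np.arange(0, 10)          # 0, 1, ..., 9 (stop is excluded)
b = np.arange(0, 10, 2)       # step of 2: 0, 2, 4, 6, 8

# linspace includes stop by default; retstep=True also returns the spacing
samples, step = np.linspace(0, 1, 5, retstep=True)

# 10 elements can be reshaped to 2 x 5 because 2 * 5 == 10
m = a.reshape(2, 5)

print(a, b, samples, step, m, sep="\n")
```

Trying a.reshape(3, 4) raises a ValueError, because 3 * 4 does not equal the 10 elements available.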
zeros() – This method creates a matrix in which every element is 0.
np.zeros((2,3))
array([[0., 0., 0.],
       [0., 0., 0.]])
ones() – np.ones() creates a matrix in which every element is 1.
full() – np.full() creates a matrix filled with a given constant value.
eye() – This method returns an identity matrix. For example, np.eye(2) gives the 2-dimensional identity matrix.
random – The np.random module generates random samples. For example, np.random.rand(2, 3) gives a 2-dimensional matrix of shape 2*3 (number of rows = 2 and number of columns = 3) with random values.
We have reached the end of this blog. Hope you enjoyed reading it!
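These creation methods can be compared side by side in a short sketch. The shapes and the constant 7 are arbitrary choices.

```python
import numpy as np

ones = np.ones((2, 3))          # 2 x 3 matrix of 1.0
sevens = np.full((2, 3), 7)     # 2 x 3 matrix filled with the constant 7
identity = np.eye(2)            # 2 x 2 identity matrix
rand = np.random.rand(2, 3)     # 2 x 3 matrix of uniform samples in [0, 1)

print(ones, sevens, identity, rand, sep="\n\n")
```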
Let’s explore pandas basics in the next blog. Comments are welcome 🙂
Data structures are simple structures that hold our data. They help us perform operations on our data such as storing, managing, and accessing it.
The basic data structures of Python include lists, tuples, sets and dictionaries.
Python Lists :
Python lists hold data inside square brackets [], arranged sequentially and separated by commas. A list can store different types of values, including other lists.
Lists are MUTABLE, which means the values can be changed after they are assigned.
We will check some operations on lists below:
In order to store data in a list, we need to declare the list first. After that, we can perform many operations on it.
Declaring a List:
# Declaring a variable my_List and storing different data in it
my_List = ['L1', 'L2', 3, [53]]
Accessing a List:
A list can be accessed by positive as well as negative indexing. Positive indices start at zero for the first value, and negative indices start at -1 for the last value.
my_List[3] # returns [53], the 4th element of the list
The following are the methods to perform some operations on the list values.
1. Finding the length of the list – len()
2. List Append – append() adds a value at the end of the list.
3. List Extend – extend() also adds values at the end, but there is a small difference when we add a list: extend() adds each of its elements separately, while append() adds the whole list as one element.
4. List Insert – insert() takes two parameters, the index and the value, and inserts the value at that index.
5. List Remove – Like the other methods, we use the remove() method to remove a value from the list, and we can also use the pop() method to remove a value.
6. List Sorting – The sort() method sorts the list in place, while the sorted() function returns a new sorted list.
7. List Reverse – reverse() reverses the order of the elements in place.
8. List Looping – We can loop over the elements of the list using a for loop and the in operator.
9. List Slicing – Taking a segment of the list is known as slicing. It is performed with index ranges such as my_List[1:3].
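The operations listed above can be sketched in a few lines. The list nums and its values are made up for illustration, and the comments show the state of the list after each step.

```python
nums = [3, 1, 2]

nums.append(4)            # [3, 1, 2, 4]
nums.extend([5, 6])       # [3, 1, 2, 4, 5, 6] (each element added separately)
nums.insert(0, 10)        # [10, 3, 1, 2, 4, 5, 6]
nums.remove(10)           # [3, 1, 2, 4, 5, 6] (removes the first matching value)

ordered = sorted(nums)    # new sorted list; nums itself is unchanged
nums.reverse()            # reverses nums in place: [6, 5, 4, 2, 1, 3]

for value in nums:        # looping with for ... in
    print(value)

print(len(nums), ordered, nums[1:3])
```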
TEST YOURSELF:
Q1) What is the difference between the del statement and the remove() and pop() methods of Python lists?
Use the methods and explore the difference in the methods by yourself and have fun 🙂
Python Tuples :
Tuples operate in the same way as lists. The only differences are that tuple elements are enclosed in parentheses () and that tuples are immutable (elements cannot be changed once assigned).
Let’s see an example of the immutable property of the tuple.
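A minimal sketch of immutability is given here. The tuple t and its values are illustrative and not taken from the original example.

```python
t = (1, 2, 3)

try:
    t[0] = 10                  # tuples do not support item assignment
except TypeError as err:
    print("Cannot modify a tuple:", err)

print(t)                       # still (1, 2, 3)
```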
TEST YOURSELF:
Q2) How does satish replace the value 6 in the tuple? Check and comment below, awaiting your answers 🙂
Q3) In the above example, we saw that the type of t is str. What do we have to do to get the type “tuple”? Check and comment below 🙂
Python Sets:
Till now, we looked at the sequential data structures (lists and tuples) in Python. Sets in Python are unordered collections of unique items, so there are no duplicates.
A set is mutable and is enclosed in curly braces {}.
A set can be created with curly braces or with the set() function.
We add elements to a set with the add() method (one element) or the update() method (several elements). Sets do not support indexing.
As the name suggests, we can perform mathematical set operations like union, intersection, difference, etc. Why don’t you try some of them 🙂
Elements are removed from a set using the remove(), discard(), pop() and clear() methods.
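A short sketch of set creation, the mathematical operations and element removal follows. The sets a and b are made-up examples.

```python
a = {1, 2, 3, 3}          # the duplicate 3 is dropped
b = set([3, 4])

a.add(5)                  # add a single element
a.update([6, 7])          # add several elements

print(a | b)              # union
print(a & b)              # intersection
print(a - b)              # difference

a.discard(100)            # no error if the element is absent
a.remove(7)               # raises KeyError if the element is absent
print(a)
```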
Special Type of Set : Frozen Sets
A special type of set called frozenset can be used, whose elements cannot be changed after creation, just like Python tuples.
Being immutable, it does not support the add() method.
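The sketch below illustrates this with a made-up frozenset. It also shows one practical consequence of immutability: a frozenset can be used as a dictionary key, which a regular set cannot.

```python
fs = frozenset([1, 2, 3])

print(fs | {4})                 # set operations still work and return new sets
print(hasattr(fs, 'add'))       # False: there is no add() method

# Unlike a regular set, a frozenset is hashable, so it can be a dictionary key
lookup = {fs: 'frozen'}
print(lookup[frozenset([3, 2, 1])])
```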
Python Dictionary:
A Python dictionary is a collection of items in which each item is a key-value pair rather than a single value. Since Python 3.7, dictionaries preserve insertion order.
A dictionary can be created with curly braces or with the dict() function.
Accessing the Dict Values:
We can access dict values by using keys in square brackets or the get() method. The only difference is that get() returns None if the key is not present in the dictionary, whereas square brackets raise a KeyError.
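Creation and both ways of access can be sketched as follows. The dictionary person and its keys are illustrative.

```python
person = {'name': 'Tanu', 'age': 23}      # creation with curly braces
same = dict(name='Tanu', age=23)          # creation with dict()

print(person['name'])                     # access by key
print(person.get('city'))                 # missing key: get() returns None
print(person.get('city', 'unknown'))      # or a default value of our choice

try:
    person['city']                        # missing key: [] raises KeyError
except KeyError as err:
    print("KeyError:", err)
```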
Python dictionaries come with methods such as keys(), values() and items().
Can you guess the output of the values() method?
Hope you enjoyed reading my blog. Looking forward to your answers and comments below.
Hello guys, This is the first post on my new blog. I’m just getting this new blog going, so stay tuned for more. Subscribe below to get notified when I post new updates.
Let’s start our discussion on the basics of python.
Keywords and Identifiers
Python has keywords just like other programming languages. Keywords are the reserved words of Python and are CASE SENSITIVE. Keywords cannot be used as a function name, variable name or any other identifier.
An identifier is the name given to entities like classes, functions, variables, etc., in Python. It helps differentiate one entity from another.
Rules for Writing Identifiers:
Identifiers can be a combination of letters in lowercase (a to z) or uppercase (A to Z) or digits (0 to 9) or an underscore (_).
An identifier cannot start with a digit. 1variable is invalid, but variable1 is perfectly fine.
Keywords cannot be used as identifiers.
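These rules can be checked with the standard library, as in the sketch below. The helper is_valid_name is a hypothetical name introduced only for this example.

```python
import keyword

print(keyword.iskeyword('for'))        # True: reserved word
print(keyword.iskeyword('For'))        # False: keywords are case sensitive

print('variable1'.isidentifier())      # True
print('1variable'.isidentifier())      # False: starts with a digit

# isidentifier() only checks the spelling rules, so keywords must be ruled out separately
def is_valid_name(name):
    return name.isidentifier() and not keyword.iskeyword(name)

print(is_valid_name('for'))            # False
```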
Comments, Indentation and Statements:
We use the hash symbol (#) for comments in Python. Comments make the code more readable and are ignored by the interpreter.
For multi-line comments, we use triple quotes: ''' or """
Most programming languages like C, C++ and Java use braces { } to define a block of code. Python uses indentation instead.
Statements are the instructions given to the Python interpreter. We can put multiple statements on a single line using ; (semicolon).
Variables and Datatypes:
Variables work the same as in other programming languages, except that we don’t have to declare variable types; Python handles them internally.
Every value in Python has a datatype. Since everything is an object in Python programming, data types are actually classes, and variables are instances (objects) of these classes.
We can use the type() function to know which class a variable or a value belongs to and the isinstance() function to check if an object belongs to a particular class.
a=5
print(a, "is of type",type(a))
a=2.5
print(a, "is of type",type(a))
a=1+2j
print(a, "is of type",type(a))
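The code above uses type(). A small sketch of the isinstance() function mentioned earlier is added here for comparison.

```python
a = 1 + 2j

print(isinstance(a, complex))        # True
print(isinstance(a, (int, float)))   # False: a tuple of classes checks against any of them
print(isinstance(True, int))         # True: bool is a subclass of int
```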
Operators:
In order to carry out arithmetic and logical computations, we can use operators which are special symbols in python.
The following are the basic operator types in python:
Arithmetic operators
Comparison (Relational) operators
Logical (Boolean) operators
Bitwise operators
Assignment operators
Special operators
Arithmetic operators are used to perform mathematical operations like addition, subtraction, multiplication etc.
+ , -, *, /, %, //, ** are arithmetic operators
Comparison operators are used to compare values. It either returns True or False according to the condition.
>, <, ==, !=, >=, <= are comparison operators
Logical operators are and, or, not operators.
Bitwise operators act on operands as if they were strings of binary digits. They operate bit by bit.
&, |, ~, ^, >>, << are Bitwise operators
Assignment operators are used in Python to assign values to variables.
a = 5 is a simple assignment operator that assigns the value 5 on the right to the variable a on the left.
The special operators in python are Identity operator and Membership operator.
is and is not are the identity operators in Python. They are used to check whether two values (or variables) refer to the same object in memory.
in and not in are the membership operators in Python. They are used to test whether a value or variable is found in a sequence (string, list, tuple, set and dictionary).
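A quick sketch exercising one or two operators from each group is given below. The values 7 and 2 are arbitrary.

```python
a, b = 7, 2

print(a / b, a // b, a % b, a ** b)   # 3.5 3 1 49
print(a > b, a == b, a != b)          # True False True
print(a > 5 and b > 5, not a > 5)     # False False
print(a & b, a | b, a ^ b, a << 1)    # 2 7 5 14

x = [1, 2]
y = [1, 2]
print(x == y, x is y)                 # True False: equal values, different objects
print(2 in x, 3 not in x)             # True True
```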
Control Flows:
1. IF_ELSE Statement:
The if…elif…else statement is used in Python for decision making. As in many other programming languages, Python interprets non-zero values as True. None, 0 and empty containers are interpreted as False.
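A minimal sketch of the statement and of the truth-value rule follows. The function describe is a made-up example.

```python
def describe(n):
    if n > 0:
        return "positive"
    elif n < 0:
        return "negative"
    else:
        return "zero"

print(describe(5), describe(-3), describe(0))

# Truthiness: non-zero numbers are True; 0 and None are False
for value in (5, 0, None):
    if value:
        print(value, "is treated as True")
    else:
        print(value, "is treated as False")
```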
2. WHILE Loop:
The while loop in Python is used to iterate over a block of code as long as the test expression (condition) is true.
#Find the product of all numbers in the list
my_list=[10,20,30,40,50]
product = 1
index=0
while index < len(my_list):
    product *= my_list[index]
    index += 1
print(product)
3. FOR Loop:
The for loop in Python is used to iterate over a sequence (list, tuple, string) or other iterable objects. Iterating over a sequence is called traversal.
We can generate a sequence of numbers using range() function.
range(10) will generate numbers from 0 to 9 (10 numbers).
We can also define the start, stop and step size as range(start,stop,step size). step size defaults to 1 if not provided.
This function does not store all the values in memory, as that would be inefficient. Instead, it remembers the start, stop and step size and generates the next number on the go.
for i in range(10): print(i)
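The start, stop and step parameters of range() can be tried out as in this short sketch. The numbers are arbitrary.

```python
print(list(range(10)))          # 0 to 9
print(list(range(2, 10, 3)))    # start=2, stop=10 (excluded), step=3: [2, 5, 8]

total = 0
for i in range(1, 6):           # 1, 2, 3, 4, 5
    total += i
print(total)                    # 15
```

Wrapping range() in list() is only needed to display the values; the for loop consumes them one at a time.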
Thank you for reading my blog , Hope you enjoyed it:)