Implementing EDA on our Sample Data

We will explore a new dataset, taken from the open-source UCI Machine Learning Repository.

As a first step, we will get to know our data using the pandas library.

1. Reading our data: Our data is downloaded as a CSV file, so we use the read_csv method to read it.

#import necessary python packages
import numpy as np
import pandas as pd

#Reading our data
data=pd.read_csv("your csv file location path goes here[data.csv]")

#To print the values, just enter the variable name
data

print(data.shape)
print(data.columns)

To display the first 5 rows of the data, we use the head() method; for the last 5 rows, we use the tail() method.

data.head() #For first 5 rows

data.tail() #For last 5 rows

Now that we have read our data, let's take a quick look at the variables in it.

Unique values of features (for more information please see the link above):

age: continuous.
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
fnlwgt: continuous.
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
education-num: continuous.
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
sex: Female, Male.
capital-gain: continuous.
capital-loss: continuous.
hours-per-week: continuous.
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
salary: >50K, <=50K.
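
The list above can also be rebuilt from the data itself. The sketch below is only an illustration: it uses a tiny made-up DataFrame called toy (an assumption, not the real dataset) with a few of the same column names. With the real data, the same calls would be applied to the data variable.

```python
import pandas as pd

# Toy stand-in for the real dataset: a few made-up rows, same column names.
toy = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "workclass": ["State-gov", "Private", "Private", "Self-emp-inc"],
    "sex": ["Male", "Male", "Female", "Female"],
    "salary": ["<=50K", "<=50K", ">50K", ">50K"],
})

# Split the columns by kind: text (categorical) versus numeric (continuous).
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Collect the distinct values of every categorical column.
unique_values = {col: sorted(toy[col].unique()) for col in categorical_cols}
for col, values in unique_values.items():
    print(col, "->", values)
```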

Things to Note:

If a variable can take on any value between two specified values, it is called a continuous variable; otherwise, it is called a discrete variable.

The examples below clarify the difference between discrete and continuous variables.

Suppose the fire department mandates that all firefighters must weigh between 150 and 250 pounds. The weight of a firefighter would be an example of a continuous variable, since a firefighter's weight could take on any value between 150 and 250 pounds.
Suppose we flip a coin repeatedly and count the number of heads. The number of heads could be any integer value between 0 and plus infinity. However, it could not be just any number between 0 and plus infinity. We could not, for example, get 2.5 heads. Therefore, the number of heads must be a discrete variable.
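
The two examples can be mimicked in a few lines of Python. This is only a rough sketch: the 10 flips, the fixed seed and the variable names are assumptions chosen for illustration.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Discrete: count the heads in 10 coin flips (1 = heads, 0 = tails).
# The result is always a whole number, never something like 2.5.
heads = sum(random.choice([0, 1]) for _ in range(10))

# Continuous: a weight drawn from anywhere between 150 and 250 pounds.
weight = random.uniform(150, 250)

print("heads:", heads)
print("weight:", weight)
```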

2. Exploring our data:

Now, we will try to answer some of the basic questions to understand the data better.

** Count the number of males and females in the sex feature (column). Hint: to count the values, use the value_counts() method from the pandas library.

data['sex'].value_counts()

** Calculate the average age of men and women. Hint: to select a column, use the column name itself; to select rows, use the .loc method.
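
One possible way to answer this question, following the hint, is sketched below. It runs on a small made-up DataFrame called toy (an assumption for illustration). With the real data, replace toy with data.

```python
import pandas as pd

# Made-up rows for illustration only; the real dataset has many more.
toy = pd.DataFrame({
    "age": [39, 50, 38, 53, 28],
    "sex": ["Male", "Male", "Female", "Female", "Female"],
})

# .loc[row_condition, column] picks the matching rows, then the column.
mean_age_men = toy.loc[toy["sex"] == "Male", "age"].mean()
mean_age_women = toy.loc[toy["sex"] == "Female", "age"].mean()
print(round(mean_age_men, 2), round(mean_age_women, 2))

# groupby returns both averages in a single call.
mean_age_by_sex = toy.groupby("sex")["age"].mean()
print(mean_age_by_sex)
```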

** Find the maximum number of hours a person works per week (hours-per-week feature). Likewise, check the minimum and average hours worked. Hint: use the hours-per-week column and the max() method.

data['hours-per-week'].max()
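
The minimum and the average can be found the same way with min() and mean(). A minimal sketch on a made-up toy column (an assumption for illustration; use data with the real dataset):

```python
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({"hours-per-week": [40, 13, 40, 99, 60]})

hours = toy["hours-per-week"]
print(hours.min())   # smallest value in the column
print(hours.mean())  # average of the column

# agg computes several statistics at once and labels each result.
summary = hours.agg(["min", "max", "mean"])
print(summary)
```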

** Count the number of people who work that maximum number of hours. Hint: store the result of the function above in a variable.

# Use a name other than max so Python's built-in max() is not shadowed
max_hours = data['hours-per-week'].max()

workaholics = data[data['hours-per-week'] == max_hours].shape[0]

workaholics

** We can now see that 85 people were working 99 hours per week. How many of them earn a salary of >50K?

rich = data[(data['hours-per-week'] == max_hours)
            & (data['salary'] == '>50K')].shape[0]
rich

** Calculate the average working time (hours-per-week) of those who earn a little and those who earn a lot (salary) for each country (native-country). Hint: use a for loop.

for (country, salary), sub_df in data.groupby(['native-country', 'salary']):
    print(country, salary, round(sub_df['hours-per-week'].mean(), 2))

** Another way of getting the same output:

# Passing the string 'mean' avoids the warning newer pandas versions raise for np.mean
pd.crosstab(data['native-country'], data['salary'], values=data['hours-per-week'], aggfunc='mean').T
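
A third option, sketched here as an alternative rather than as part of the original walkthrough, is to group by both columns and then unstack the salary level into columns. The toy DataFrame below is made up for illustration. With the real data, replace toy with data.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "native-country": ["India", "India", "Japan", "Japan", "Japan"],
    "salary": ["<=50K", ">50K", "<=50K", "<=50K", ">50K"],
    "hours-per-week": [40, 50, 38, 42, 60],
})

# Average hours per (country, salary) pair, with salary spread across columns.
mean_hours = (
    toy.groupby(["native-country", "salary"])["hours-per-week"]
    .mean()
    .unstack("salary")
    .round(2)
)
print(mean_hours)
```

Here the countries are the rows and the salary groups are the columns; adding .T at the end would flip the table.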

3. Visualizing our data: We can visualize our data with scatter plots, pair plots, histograms, etc., as we studied in our previous matplotlib blog.

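import matplotlib.pyplot as plt
import seaborn as sns
# seaborn is another Python visualization library, built on top of matplotlib

# set_style is a function, so call it instead of assigning to it
sns.set_style("whitegrid")
sns.pairplot(data, hue='salary', height=3)
plt.show()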

Hope you have fun changing the code above and getting more insights from the data. We will look at other interesting datasets in the coming blogs.

Have a great day:)
