
We will explore a new dataset taken from the open-source UCI Repository.
As a first step, we will get to know the data using the pandas library.
1. Reading our data: The data is downloaded as a CSV file, so we use the read_csv method to load it.
#import necessary python packages
import numpy as np
import pandas as pd
#Reading our data
data=pd.read_csv("your csv file location path goes here[data.csv]")
#To print the values, just enter the variable name
data
print(data.shape)
print(data.columns)
To display the first 5 rows of the data, we use the head() method; for the last 5 rows, we use the tail() method.
data.head() #For first 5 rows
data.tail() #For last 5 rows
Now that we have read the data, let’s take a quick look at its variables.
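If you do not have the CSV file at hand yet, the same calls can be tried on a tiny in-memory sample. This is only a sketch: the three rows in csv_text are made up for illustration, and the real file has many more rows and columns.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for the real CSV file
csv_text = """age,sex,hours-per-week,salary
39,Male,40,<=50K
50,Male,13,<=50K
38,Female,45,>50K
"""

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)          # (number of rows, number of columns)
print(list(sample.columns))  # column names as a plain list
print(sample.head(2))        # first 2 rows instead of the default 5
print(sample.tail(1))        # last row only
```

Both head() and tail() take an optional number of rows, which is handy when 5 rows are too many or too few.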
Unique values of features (for more information please see the link above):
- age: continuous.
- workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
- fnlwgt: continuous.
- education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
- education-num: continuous.
- marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
- occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
- relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
- race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
- sex: Female, Male.
- capital-gain: continuous.
- capital-loss: continuous.
- hours-per-week: continuous.
- native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
- salary: >50K, <=50K.
Things to Note:
If a variable can take on any value between two specified values, it is called a continuous variable; otherwise, it is called a discrete variable.
The examples below clarify the difference between discrete and continuous variables.
- Suppose the fire department mandates that all firefighters must weigh between 150 and 250 pounds. The weight of a firefighter is an example of a continuous variable, since it could take on any value between 150 and 250 pounds.
- Suppose we flip a coin and count the number of heads. The number of heads could be any integer between 0 and plus infinity, but not just any number in that range: we could not, for example, get 2.5 heads. Therefore, the number of heads is a discrete variable.
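In pandas, a rough first check is to see which columns are numeric and which hold labels. The sketch below uses a small made-up DataFrame (toy) whose column names follow the feature list above. Treat it as a heuristic only: a numeric column such as a count is still discrete, so the dtype alone does not settle the question.

```python
import pandas as pd

# Hypothetical rows; column names follow the feature list above
toy = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "hours-per-week": [40, 13, 40, 45],
    "sex": ["Male", "Male", "Female", "Female"],
    "salary": ["<=50K", "<=50K", ">50K", ">50K"],
})

# Numeric columns: the ones the feature list marks as continuous
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Remaining columns hold a fixed set of labels
label_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(label_cols)
print(toy[label_cols].nunique())  # number of distinct labels per column
```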
2. Exploring our data:
Now, we will try to answer some of the basic questions to understand the data better.
** Count the number of males and females in the sex feature (column). Hint: To count the values, use the value_counts() method from the pandas library.

data['sex'].value_counts()
** Calculate the average age of men and women. Hint: To select a column, we use the column name itself; to select rows, we use the .loc method.
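One possible way to follow the hint is sketched below on a small made-up DataFrame (toy); the 'age' and 'sex' column names and the 'Male'/'Female' labels are taken from the feature list, everything else is illustrative. On the real data, replace toy with data.

```python
import pandas as pd

# Hypothetical rows for illustration only
toy = pd.DataFrame({
    "age": [30, 40, 50, 20],
    "sex": ["Male", "Female", "Male", "Female"],
})

# .loc[row_condition, column_name] picks rows by condition and one column by name
mean_age_men = toy.loc[toy["sex"] == "Male", "age"].mean()
mean_age_women = toy.loc[toy["sex"] == "Female", "age"].mean()
print(mean_age_men, mean_age_women)

# groupby computes both averages in a single step
mean_age_by_sex = toy.groupby("sex")["age"].mean()
print(mean_age_by_sex)
```

The groupby version scales better when a column has many categories, since no label has to be typed by hand.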

** Find the maximum number of hours a person works per week (hours-per-week feature). Likewise, check the minimum and average hours worked. Hint: Use the hours-per-week column and the max() method.
data['hours-per-week'].max()
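The minimum and average can be computed the same way with min() and mean(), or all three at once with agg(). A minimal sketch on a made-up Series (hours) follows; on the real data you would call agg() on data['hours-per-week'] instead.

```python
import pandas as pd

# Hypothetical weekly hours for five people
hours = pd.Series([40, 13, 99, 45, 1], name="hours-per-week")

# agg() with a list of function names returns one value per name
summary = hours.agg(["min", "max", "mean"])
print(summary)
```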
** Count the number of people who work that maximum number of hours. Hint: Use the function above and store the result in a variable.
#Named max_hours so that Python's built-in max() is not shadowed
max_hours = data['hours-per-week'].max()
workaholics = data[data['hours-per-week'] == max_hours].shape[0]
workaholics
** We can now see that 85 of them work 99 hours per week. How many of them earn a salary of >50K?
rich = (data[(data['hours-per-week'] == max_hours)
             & (data['salary'] == '>50K')].shape[0])
rich
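The same kind of count can also be obtained by summing a boolean mask, which avoids building the filtered DataFrame first. The sketch below runs on a small made-up DataFrame (toy); the names longest_week, mask and rich_count are illustrative only.

```python
import pandas as pd

# Hypothetical rows for illustration only
toy = pd.DataFrame({
    "hours-per-week": [99, 40, 99, 99, 20],
    "salary": [">50K", ">50K", "<=50K", ">50K", "<=50K"],
})

longest_week = toy["hours-per-week"].max()

# Each comparison needs its own parentheses because & binds tighter than ==
mask = (toy["hours-per-week"] == longest_week) & (toy["salary"] == ">50K")

# True counts as 1, so summing the mask counts the matching rows
rich_count = int(mask.sum())
print(rich_count)
```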
** Compute the average working time (hours-per-week) of those who earn a little and a lot (salary) for each country (native-country). Hint: Use a for loop.
for (country, salary), sub_df in data.groupby(['native-country', 'salary']):
    print(country, salary, round(sub_df['hours-per-week'].mean(), 2))
** Another way to get the same result:
pd.crosstab(data['native-country'], data['salary'], values=data['hours-per-week'], aggfunc='mean').T
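To convince ourselves that the loop and the crosstab agree, we can run both on a small made-up DataFrame (toy) and compare the numbers. The rows below are invented for illustration, and the transpose is left out to keep the lookup simple.

```python
import pandas as pd

# Hypothetical rows for illustration only
toy = pd.DataFrame({
    "native-country": ["India", "India", "Japan", "Japan", "Japan"],
    "salary": ["<=50K", ">50K", "<=50K", "<=50K", ">50K"],
    "hours-per-week": [40, 50, 30, 50, 60],
})

# Loop version: one mean per (country, salary) group
by_loop = {}
for (country, salary), sub_df in toy.groupby(["native-country", "salary"]):
    by_loop[(country, salary)] = sub_df["hours-per-week"].mean()

# Crosstab version: countries as rows, salary bands as columns
table = pd.crosstab(
    toy["native-country"],
    toy["salary"],
    values=toy["hours-per-week"],
    aggfunc="mean",
)
print(table)

# Every value from the loop should match the corresponding table cell
same = all(
    table.loc[country, salary] == value
    for (country, salary), value in by_loop.items()
)
print(same)
```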
3. Visualizing our data: We can visualize the data with scatter plots, pair plots, histograms, etc., as we studied in our previous matplotlib blog.

import matplotlib.pyplot as plt
import seaborn as sns
# seaborn is another Python visualization library
sns.set_style("whitegrid")
sns.pairplot(data, hue='salary', height=3)
plt.show()
We hope you have fun modifying the code above and drawing more insights from the data. We will look at other interesting datasets in upcoming blogs.
Have a great day:)