According to IBM Data Analytics you can expect to spend up to 80% of your time cleaning data.
Hi, Folks hope you all are doing awesome, In today’s tutorial, we are going to learn Data Cleaning Process with Python and Pandas framework.
Python programming language for data science is very popular and easy to learn. and in Data Science / Machine Learning, Data Cleaning is very important. Most of the time data is not in the form as you want, either its a paragraph, string, image, or any other format.
Suppose you want to create Chart, Bar Graph and you have data like this :
Name Class Year Ram five 2k17 Mohan three 2k18 Neeti Six 2Keighteen
in the above table, we have Name, Class, and Year respectively, and you want to differentiate classes on the basis of Year. But in the above table year is like 2k17,2k18 and 2keighteen, So it is impossible to define which is year 2017 or 2018. For this Situation we have an algorithm named: – Fuzzy logic, this algorithm helps you! for this, we’ll talk later in a separate post.
So here I’m taking an example of salary records, were some of the data are missing and we are going to handle.
# Importing libraries import pandas as pd import numpy as np # Read csv file into a pandas dataframe df = pd.read_csv("salary.csv") # Take a look at the first few rows print df.head()
Here Pandas framework and NumPy will he to clean the data.
df = pd.read_csv("salary.csv")
Reading the CSV (or we say that we are loading the data to clean). And the above code will give us tabular data that counted as row and column.
.head() method, we can easily see the first few rows
Last Name First Name Status Salary 0 Williams Mary Full Time 35000 1 Brown Robert NaN 32000 2 Wilson Elizabeth Part Time 12000 3 Moore Jennifer Full Time 0 4 Brown Charles Full Time 39000 5 Price Lisa Part Time 14000 6 Wood Daniel Part Time 13750 7 Coleman Donald Full Time 37500 8 Perry George Part Time 0 9 Steele Donna Full Time 36750
Standard Missing Values: that Pandas can automatically detect.
Going back to our original dataset, let’s take a look at the “status” column. In the second row, there’s an empty cell.
Clearly these are both missing values. Let’s see how Pandas deals with these.
Count Missing Value
Display the numbers of missing value in each column names/feature name which helps us to define what we want.
df.isnull().sum() #output Last Name 0 First Name 0 Status 4 Salary 5 dtype: int64
So, Missing value in the feature Salary is 5 and Status is 4
If we have any column in our datasets that is not relevant to our need or irrelevant information we drop that from our datasets. here we have 4 columns , lets assume we don’t need ‘Status’ column for now, So we are dropping that table. lets see how.
columns_to_delete = ['Status'] df.drop(columns_to_delete, inplace=True, axis=1) df.head() #Output Last Name First Name Salary 0 Williams Mary 35000.0 1 Brown Robert 32000.0 2 Wilson Elizabeth 12000.0 3 Moore Jennifer NaN 4 Brown Charles 39000.0
Here, I made a list of the columns I wanted to delete and then passed it to
A lot of pandas functions can be run in-place.
In pandas, axis 0 represents rows and axis 1 represents columns, so I’ve indicated
axis=1 because I’m deleting columns.
If a user accidentally submits a form twice, or you scrape a website twice without realizing it, or in plenty of other scenarios, you could end up with duplicate data.
You could have entire rows that are duplicates, or just duplicate values in a column when the column should be unique.
drop_duplicates() function in pandas can be used in both of these cases.
Either the whole row can be a duplicate, or you can specify columns to check for duplicates with the subset parameter, which takes a list of column names.
The keep parameter designates whether or not to keep any of the duplicates. The default is first so it would keep the first duplicate and drop the rest of them.
Last will drop all duplicates except for the last occurrence. False will drop all of them.
There is also the option to run it in-place.
Pandas recognizes a few things as missing values.
- Empty cells
- If the data is the wrong type for the column — string data in a numeric column, for example.
their are many way to handle missing value :-
When a dataset has missing or null values, it’s important to decide what to do about them in the context of your project. “Delete rows or columns with missing values.”
You can delete missing/null values in pandas with
For Column – If you wanted to drop any column with missing values.
For Row – You could choose to delete any rows that have a missing value in any column. Or you could delete rows that have a missing value in a subset of columns.
If I want to analyze the violations on a per-building basis, I might just delete rows that do not have a building identification number.
subset parameter allows you to specify a list of columns
Replace missing values.
fillna() on the DataFrame to fill in missing values.
If you wanted to fill in every missing value with a zero.
Or missing values can also be filled in by propagating the value that comes before or after it in the same column
Propagating values backward.
Propagating values forward.
There are many ways to determine what to replace missing data with
- Fixed value
- Regression to predict correct value
- Stochastic regression
- Hot-deck imputation — Randomly pick another row that has similar values on other variables and use its value.
We discuss about all the above point in detail on later articles.
Check Python Python Libraries for Data Science and other articles
I hope this post helped you to know Data Cleaning with Pandas Python : Missing Values. To get the latest news and updates follow us on Twitter & Facebook, subscribe to our YouTube channel. And If you have any query then please let us know by using the comment form.