LA ILAHA ILLA HU
Allah, Your Lord There Is No Deity Except Him.

Python Data Science: Pandas - Cleaning Wrong Data - Removing Duplicates
How to remove duplicates in Pandas?
Step 1. Check for duplicates: the duplicated() method returns a Boolean value for each row: df.duplicated()
Step 2. Remove all duplicates from the dataset: df.drop_duplicates(inplace=True)
Duplicate rows are rows that have been registered more than once. Examine the dataset below.
Apple, Bananas and Mangoes are values that appear more than once.
By taking a look at our test dataset, we can observe that rows 2, 5 and 8 are duplicates.
In a large enough dataset you may not be able to discover duplicates just by looking at it. To discover duplicates, we can use the duplicated() method.
The duplicated() method returns a Boolean value for each row.
Example 1: Check for duplicates in the dataset.
Code
import pandas as pd

# Test dataset: low-GI fruits with deliberate duplicate rows (2, 5 and 8)
LGI = {
    'Low GI Diet Fruits':
        ["Apple", "Apricots", "Apple",
         "Bananas", "Grapes", "Bananas",
         "Mangoes", "Oranges", "Mangoes",
         "Pineapple"],
    'Weight (Gms)':
        [120, 60, 120, 120, 120, 120, 120, 120, 120, 120],
    'GI Scores':
        [40, 32, 40, 47, 43, 47, 51, 48, 51, 51]
}
df = pd.DataFrame(LGI)

# duplicated() returns True for every row that repeats an earlier row
print(df.duplicated())
the output will be:
0    False
1    False
2     True
3    False
4    False
5     True
6    False
7    False
8     True
9    False
dtype: bool
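In a bigger dataset you may also want to look at the duplicated rows themselves, not just the True/False flags. A small sketch (reusing the LGI data from Example 1) that filters the DataFrame with the Boolean Series:

```python
import pandas as pd

# Same LGI test dataset as in Example 1
LGI = {
    'Low GI Diet Fruits':
        ["Apple", "Apricots", "Apple",
         "Bananas", "Grapes", "Bananas",
         "Mangoes", "Oranges", "Mangoes",
         "Pineapple"],
    'Weight (Gms)':
        [120, 60, 120, 120, 120, 120, 120, 120, 120, 120],
    'GI Scores':
        [40, 32, 40, 47, 43, 47, 51, 48, 51, 51]
}
df = pd.DataFrame(LGI)

# Boolean indexing: keep only the rows where duplicated() is True
duplicate_rows = df[df.duplicated()]
print(duplicate_rows)
```

This prints only rows 2, 5 and 8 - the same rows that drop_duplicates() will remove.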
Remove All Duplicates
Example 2: Remove all duplicates from the dataset.
Code
import pandas as pd

# Same test dataset as in Example 1
LGI = {
    'Low GI Diet Fruits':
        ["Apple", "Apricots", "Apple",
         "Bananas", "Grapes", "Bananas",
         "Mangoes", "Oranges", "Mangoes",
         "Pineapple"],
    'Weight (Gms)':
        [120, 60, 120, 120, 120, 120, 120, 120, 120, 120],
    'GI Scores':
        [40, 32, 40, 47, 43, 47, 51, 48, 51, 51]
}
df = pd.DataFrame(LGI)

# inplace=True removes the duplicate rows from df itself
df.drop_duplicates(inplace=True)
print(df)
the output will be:
  Low GI Diet Fruits  Weight (Gms)  GI Scores
0              Apple           120         40
1           Apricots            60         32
3            Bananas           120         47
4             Grapes           120         43
6            Mangoes           120         51
7            Oranges           120         48
9          Pineapple           120         51
Note: observe that rows 2, 5 and 8 have been removed.
If your data is in a CSV file, use
df = pd.read_csv('data.csv')
in place of
df = pd.DataFrame(LGI)
to read the dataset.
Point to Remember: With inplace=True the method does NOT return a new DataFrame; instead it removes the duplicates from the original DataFrame itself.
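If you do not want to modify the original DataFrame, omit inplace=True and assign the returned copy instead. drop_duplicates() also accepts keep and subset parameters; a small sketch using the LGI data from the examples above:

```python
import pandas as pd

# Same LGI test dataset as in the examples above
LGI = {
    'Low GI Diet Fruits':
        ["Apple", "Apricots", "Apple",
         "Bananas", "Grapes", "Bananas",
         "Mangoes", "Oranges", "Mangoes",
         "Pineapple"],
    'Weight (Gms)':
        [120, 60, 120, 120, 120, 120, 120, 120, 120, 120],
    'GI Scores':
        [40, 32, 40, 47, 43, 47, 51, 48, 51, 51]
}
df = pd.DataFrame(LGI)

# Without inplace=True, drop_duplicates() returns a new
# DataFrame and leaves df untouched
df_clean = df.drop_duplicates()

# keep='last' keeps the last occurrence of each duplicate
# instead of the first
df_keep_last = df.drop_duplicates(keep='last')

# subset=... checks for duplicates in the named columns only
df_by_fruit = df.drop_duplicates(subset=['Low GI Diet Fruits'])

print(len(df), len(df_clean))
```

Here df still has all 10 rows, while df_clean has the 7 unique ones.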
Real Life Example: How To Remove Duplicates From Big Data With Pandas
We had a dataset containing 49775 items. We worked with the code shared below; after applying it, our dataset came down to about 46000 items.
Code
import pandas as pd

# Read the raw dataset; use the encoding that matches your file
# ('ansi' is a Windows-specific alias)
df = pd.read_csv('filename.csv', encoding='ansi')

# Remove duplicate rows from df itself
df.drop_duplicates(inplace=True)

# Save the cleaned dataset without the index column
df.to_csv("updated_csv.csv", index=False)
print(df)
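Before cleaning a big file like the 49775-item dataset above, it can help to know in advance how many rows will be dropped. Since the Boolean Series from duplicated() treats True as 1, summing it gives the count. A sketch using the small LGI dataset, since the original CSV is not available here:

```python
import pandas as pd

# Same LGI test dataset as in the examples above
LGI = {
    'Low GI Diet Fruits':
        ["Apple", "Apricots", "Apple",
         "Bananas", "Grapes", "Bananas",
         "Mangoes", "Oranges", "Mangoes",
         "Pineapple"],
    'Weight (Gms)':
        [120, 60, 120, 120, 120, 120, 120, 120, 120, 120],
    'GI Scores':
        [40, 32, 40, 47, 43, 47, 51, 48, 51, 51]
}
df = pd.DataFrame(LGI)

# Summing the Boolean Series counts the duplicate rows,
# i.e. the number of rows drop_duplicates() will remove
n_duplicates = df.duplicated().sum()
print(n_duplicates)

df.drop_duplicates(inplace=True)
print(len(df))
```

On the big dataset, the same one-liner would have told us up front that roughly 3775 rows were going to be removed.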