Real-world data is rarely clean. It comes with missing values, duplicate records, inconsistent formatting, incorrect data types, and outliers that can mislead your analysis or break your model.
Data cleaning, also called data preprocessing, is the process of identifying and fixing these issues before analysis or model training begins. In AI and machine learning, the quality of your data directly determines the quality of your results.
A well-cleaned dataset is not optional; it is the foundation of every reliable AI model.
Before cleaning, it helps to know what you are looking for:
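- Missing values (blank cells, NaN, or placeholders such as "N/A")
- Duplicate records
- Inconsistent formatting in text columns (capitalization, stray whitespace)
- Incorrect data types (numbers stored as strings, dates stored as plain text)
- Outliers and impossible values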

Always start by loading your data and running a quick inspection to identify what needs cleaning.
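A quick first pass might look like the sketch below; the filename data.csv is a placeholder for your own dataset.

```python
import pandas as pd

# Load the dataset (hypothetical filename)
df = pd.read_csv("data.csv")

# Shape, column types, and non-null counts
print(df.shape)
print(df.info())

# Missing values per column
print(df.isnull().sum())

# Number of fully duplicated rows
print(df.duplicated().sum())

# Summary statistics to spot suspicious ranges
print(df.describe())
```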

This gives you a complete picture of where the problems are before you start fixing them.
Handling Missing Values
Missing values are the most common data quality problem. There are two standard approaches, dropping or filling, and the right choice depends on how many values are missing and how important the column is.
Detecting Missing Values
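A minimal check in pandas:

```python
# Count missing values per column
print(df.isnull().sum())

# Share of missing values per column, as a percentage
print(df.isnull().mean() * 100)
```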

Option 1 — Drop Rows or Columns with Missing Values
Use this when very few rows are missing or an entire column is mostly empty.
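A sketch of both variants; the mostly-empty Notes column is hypothetical.

```python
# Drop every row that contains at least one missing value
df = df.dropna()

# Or drop a column that is mostly empty (hypothetical column name)
df = df.drop(columns=["Notes"])
```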

Option 2 — Fill Missing Values
Use this when dropping rows would lose too much data.
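A sketch using the Score and Grade columns from the end-to-end example further below:

```python
# Numeric column: fill with the median, which is robust to outliers
df["Score"] = df["Score"].fillna(df["Score"].median())

# Categorical column: fill with the most frequent value
df["Grade"] = df["Grade"].fillna(df["Grade"].mode()[0])
```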


Removing Duplicates
Duplicate records inflate your dataset and distort model training. Identifying and removing them is straightforward.
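For example (the Name and Email columns in the subset call are hypothetical):

```python
# Count exact duplicate rows
print(df.duplicated().sum())

# Drop them, keeping the first occurrence
df = df.drop_duplicates()

# Or treat rows as duplicates based on specific columns only
df = df.drop_duplicates(subset=["Name", "Email"], keep="first")
```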

Fixing Data Types
Columns are sometimes stored in the wrong type: numbers as strings, dates as plain text. This causes errors in calculations and model training.
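A sketch of the usual conversions; the EnrollDate column is hypothetical.

```python
# Convert strings to numbers; invalid entries become NaN instead of raising
df["Score"] = pd.to_numeric(df["Score"], errors="coerce")

# Parse date strings into datetime objects (hypothetical column name)
df["EnrollDate"] = pd.to_datetime(df["EnrollDate"], errors="coerce")

# Cast a float column to integer once its missing values are handled
df["Age"] = df["Age"].astype(int)
```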

Standardizing Text Values
Inconsistent capitalization and spacing in categorical columns cause incorrect grouping and analysis.
```python
# Standardize to lowercase
df["Gender"] = df["Gender"].str.lower()

# Remove leading/trailing whitespace
df["Name"] = df["Name"].str.strip()

# Map lowercase variants and abbreviations to canonical labels
df["Gender"] = df["Gender"].replace({
    "male": "Male",
    "m": "Male",
    "female": "Female",
    "f": "Female"
})

# Check unique values after fixing
print(df["Gender"].unique())
```
Handling Outliers
Outliers are extreme values that can distort statistical analysis and reduce model accuracy. The most common detection method uses the IQR (Interquartile Range).
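One way to sketch IQR-based detection, using the Score column from the example dataset:

```python
# Compute the IQR bounds for the Score column
Q1 = df["Score"].quantile(0.25)
Q3 = df["Score"].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

# Inspect the outliers before removing anything
print(df[(df["Score"] < lower) | (df["Score"] > upper)])

# Keep only the rows inside the bounds
df = df[(df["Score"] >= lower) & (df["Score"] <= upper)]
```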

For cases where you want to keep the rows but cap extreme values:
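Continuing with the bounds computed above, clipping caps values instead of dropping rows:

```python
# Cap extreme values at the IQR bounds rather than removing the rows
df["Score"] = df["Score"].clip(lower=lower, upper=upper)
```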

Dropping Irrelevant Columns
Columns that add no value to your analysis or model should be removed to keep the dataset clean and efficient.
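For example (StudentID appears in the example dataset below; Comments is hypothetical):

```python
# Drop identifier and free-text columns that carry no predictive signal
df = df.drop(columns=["StudentID", "Comments"])
```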

Renaming Columns
Clean, consistent column names make your code more readable and prevent errors.
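One common convention, sketched below; the final_score rename is a hypothetical mapping.

```python
# Normalize all column names: trim whitespace, lowercase, underscores for spaces
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

# Or rename specific columns explicitly (hypothetical mapping)
df = df.rename(columns={"score": "final_score"})
```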

Here is a clean, practical end-to-end example:
```python
import pandas as pd

df = pd.read_csv("students.csv")

# Step 1 — Inspect
print(df.info())
print(df.isnull().sum())

# Step 2 — Remove duplicates
df = df.drop_duplicates()

# Step 3 — Fix data types first, so bad entries become NaN before imputing
df["Score"] = pd.to_numeric(df["Score"], errors="coerce")

# Step 4 — Fix missing values
df["Score"] = df["Score"].fillna(df["Score"].median())
df["Grade"] = df["Grade"].fillna(df["Grade"].mode()[0])
df["Age"] = df["Age"].fillna(df["Age"].median()).astype(int)

# Step 5 — Fix inconsistent text
df["Gender"] = df["Gender"].str.lower().str.strip()

# Step 6 — Remove outliers with the IQR rule
Q1, Q3 = df["Score"].quantile([0.25, 0.75])
IQR = Q3 - Q1
df = df[(df["Score"] >= Q1 - 1.5 * IQR) & (df["Score"] <= Q3 + 1.5 * IQR)]

# Step 7 — Drop irrelevant columns and clean column names
df = df.drop(columns=["StudentID"])
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

# Step 8 — Save
df.to_csv("students_clean.csv", index=False)
print("Cleaning complete. Rows remaining:", len(df))
```
After this pipeline runs, every core cleaning step has been covered:
1. Missing values detected and handled
2. Duplicate rows removed
3. Data types corrected
4. Inconsistent text values standardized
5. Outliers detected and addressed
6. Irrelevant columns removed
7. Column names cleaned and consistent
8. Cleaned dataset saved