Data Cleaning and Transformation

Lesson 19/37 | Study Time: 60 Min

Course: Enroll in Python for Data Analysis Course Online Today

Raw data is rarely ready for analysis straight out of the box. It often comes with inconsistent formatting, duplicate records, incorrect data types, unwanted whitespace, and columns that need restructuring.

Data cleaning and transformation is the process of fixing these issues and reshaping your data into a form that is accurate, consistent, and analysis-ready.

In practice, data professionals spend a significant portion of their time on this step, as the quality of your analysis is directly tied to the quality of your data.

Setting Up Sample DataFrame

import pandas as pd

import numpy as np

data = {

'Name': [' Alice ', 'bob', 'CAROL', 'David', 'David', 'Eva'],

'Age': [28, 34, 29, 41, 41, 'thirty-six'],

'Department': ['Marketing', 'finance', 'IT', 'HR', 'HR', 'Finance'],

'Salary': [52000, 67000, 71000, 48000, 48000, 74000],

'JoiningDate':['2021-03-15', '2019/07/22', '01-11-2020', '2017-05-10', '2017-05-10', '2018-09-30']

}

df = pd.DataFrame(data)

This DataFrame intentionally contains common real-world issues — extra spaces, inconsistent casing, duplicates, wrong data types, and inconsistent date formats, all of which we will fix step by step.

Removing Duplicate Records

Duplicate rows are one of the most common data quality issues, especially when data is merged from multiple sources or exported from databases. They silently inflate counts and skew statistics.

Detecting Duplicates

Viewing Duplicate Rows

Removing Duplicates

Fixing Inconsistent Text — Whitespace and Casing

Text data is highly prone to inconsistency. A column with 'finance', 'Finance', and 'FINANCE' will be treated as three different categories, which breaks grouping and filtering operations. Similarly, extra spaces cause silent mismatches.

Removing Whitespace

Standardizing Letter Case

After applying .str.lower() to Department, values like 'finance', 'Finance', and 'FINANCE' all become 'finance' — consistent and ready for grouping.

Other Useful String Operations

# Replace specific text within a column

df['Department'] = df['Department'].str.replace('fin', 'Finance', regex=False)

# Check if column contains a substring

df['Name'].str.contains('al', case=False)

# Split a column into two

df[['First', 'Last']] = df['Name'].str.split(' ', expand=True)

Fixing Incorrect Data Types

Pandas sometimes misreads data types during import, numbers stored as strings, dates stored as objects, or boolean-like columns stored as integers. Operating on wrong data types produces errors or incorrect results.

Checking Current Data Types

Converting Data Types

Using errors='coerce' in pd.to_numeric() is particularly important — it converts any value that cannot be converted to a number (like 'thirty-six') into NaN instead of raising an error.

Handling Inconsistent Date Formats

Dates are notoriously inconsistent in raw data. The same dataset might have '2021-03-15', '2019/07/22', and '01-11-2020' in the same column. Pandas' pd.to_datetime() intelligently handles most formats automatically.

Once converted to datetime64, you can perform date arithmetic, sort chronologically, filter by date ranges, and extract time components — none of which is possible when the column is stored as a plain string.

Renaming Columns

Clean, consistent column names make your code more readable and prevent errors, especially when column names have spaces, special characters, or inconsistent casing.

Adding and Dropping Columns

Transformation often involves creating new columns derived from existing ones or removing columns that are no longer needed.

Adding a New Column

Dropping Columns

Applying Custom Transformations — apply() and map()

When built-in methods are not enough, apply() lets you run any custom function across rows or columns, while map() works on individual Series values.

# Apply a custom function to a column

def categorize_salary(salary):

if salary >= 70000:

return 'High'

elif salary >= 50000:

return 'Medium'

else:

return 'Low'

df['Salary_Band'] = df['Salary'].apply(categorize_salary)

# Using lambda for simple transformations

df['Salary_K'] = df['Salary'].apply(lambda x: x / 1000)

# map() — replace values using a dictionary

df['Dept_Code'] = df['Department'].map({

'marketing': 'MKT',

'finance': 'FIN',

'it': 'IT',

'hr': 'HR'

})

Complete Cleaning Workflow — Putting It All Together

In practice, data cleaning follows a logical, sequential order. Here is a concise end-to-end cleaning routine:

# Step 1: Remove duplicates

df = df.drop_duplicates()

# Step 2: Fix text inconsistencies

df['Name'] = df['Name'].str.strip().str.title()

df['Department'] = df['Department'].str.strip().str.lower()

# Step 3: Fix data types

df['Age'] = pd.to_numeric(df['Age'], errors='coerce')

df['Salary'] = df['Salary'].astype(float)

# Step 4: Fix dates

df['JoiningDate'] = pd.to_datetime(df['JoiningDate'], errors='coerce')

# Step 5: Handle missing values (introduced by type conversion)

df['Age'] = df['Age'].fillna(df['Age'].median())

# Step 6: Rename columns for consistency

df.columns = df.columns.str.lower().str.replace(' ', '_')

# Step 7: Verify result

print(df.dtypes)

print(df.isnull().sum())

Previous Lesson Next Lesson

Blake Turner

Product Designer

Profile

Class Sessions

1- What is Data Analysis 2- Importance of Data in Decision Making 3- Overview of Python for Data Analysis 4- Setting up Python Environment (Anaconda / Jupyter Notebook) 5- Introduction to Jupyter Notebook Interface 6- Variables, Data Types, and Operators 7- Conditional Statements and Loops 8- Functions and Modules 9- Working with Lists, Tuples, Sets, and Dictionaries 10- Introduction to NumPy Arrays 11- Array Creation and Operations 12- Indexing, Slicing, and Reshaping 13- Mathematical and Statistical Functions 14- Introduction to Series and DataFrames 15- Reading and Writing Data (CSV, Excel) 16- Data Inspection (.head(), .info(), .describe()) 17- Data Selection and Filtering (.loc[], .iloc[]) 18- Handling Missing Values 19- Data Cleaning and Transformation 20- Importance of Data Visualization 21- Introduction to Matplotlib 22- Creating Line, Bar, Pie, and Histogram Charts 23- Customizing Plots (Titles, Labels, Styles) 24- Introduction to Seaborn (Basic Plots and Styling) 25- Understanding Data Distributions 26- Identifying Patterns and Trends 27- Detecting Outliers 28- Correlation and Relationships Between Variables 29- Summarizing Insights 30- Descriptive Statistics (Mean, Median, Mode) 31- Variance and Standard Deviation 32- Probability Basics 33- Correlation Concepts 34- Loading Real Datasets 35- Data Cleaning Workflow 36- Feature Selection Basics 37- Case Study: End-to-End Data Analysis