Working with Datasets

Lesson 29/35 | Study Time: 60 Min

Course: Python Essentials Course Online

A dataset is the foundation of every AI and machine learning project. Before any model can be trained or any prediction made, you need data — structured, loaded, and understood. Working with datasets means knowing how to find them, load them into Python, explore their structure, and prepare them for analysis.

What is a Dataset?

A dataset is a structured collection of data organized in rows and columns, where each row represents a record and each column represents a feature or attribute.
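
To make the row and column idea concrete, here is a minimal sketch using a Pandas DataFrame. The student names, scores, and grades are made-up values used only for illustration.

```python
import pandas as pd

# Hypothetical sample data: each row is one student (a record),
# each column is one attribute of that student (a feature)
students = pd.DataFrame({
    "Name": ["Ana", "Ben", "Cara"],
    "Score": [91, 47, 78],
    "Grade": ["A", "F", "B"],
})

print(students)           # the whole table
print(students.iloc[0])   # one row: all features of the first record
print(students["Score"])  # one column: one feature across all records
```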

Where to Get Datasets

Several reliable platforms provide free datasets for learning and AI projects:

1. Kaggle — kaggle.com/datasets — largest collection of real-world datasets.

2. UCI ML Repository — archive.ics.uci.edu — classic machine learning datasets.

3. Scikit-learn — built-in sample datasets ready to use in Python.

4. Seaborn — comes with several small built-in datasets for visualization practice.

5. Google Dataset Search — datasetsearch.research.google.com.

Loading a Dataset

The most common way to load a dataset in Python is through Pandas.

Loading from a CSV file
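
A minimal sketch of reading CSV data with `pd.read_csv`. So that it runs without any file on disk, the CSV content below is a made-up string wrapped in `io.StringIO`; the column names and values are assumptions for illustration only.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = """Name,Score,Grade
Ana,91,A
Ben,47,F
Cara,78,B
"""

# pd.read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())  # first rows of the table
print(df.shape)   # (number of rows, number of columns)
```

With a real file you would pass its path instead, for example `pd.read_csv("students.csv")`.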

Loading from a URL

You can load datasets directly from the internet without downloading them first.

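python

import pandas as pd

# Public copy of the Titanic dataset; reading it requires an internet connection

url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"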

df = pd.read_csv(url)

print(df.head())

Loading Built-in Datasets from Scikit-learn
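
One possible way to load a bundled Scikit-learn dataset is sketched below, using the Iris dataset as an example choice. It assumes scikit-learn 0.23 or newer, where `as_frame=True` is available.

```python
from sklearn.datasets import load_iris

# as_frame=True returns the data as Pandas objects instead of NumPy arrays
iris = load_iris(as_frame=True)

# .frame combines the feature columns and the target column in one DataFrame
df = iris.frame

print(df.head())
print(df.shape)
print(iris.target_names)  # class names that the numeric target values stand for
```

These datasets ship with the library, so no download is needed.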

Loading Built-in Datasets from Seaborn
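
A short sketch of loading one of Seaborn's sample datasets, using "tips" as an example choice. Seaborn downloads these files on first use, so the sketch assumes an internet connection and falls back to a message when the download fails.

```python
import seaborn as sns

try:
    # Downloads the file on first use and caches it locally afterwards
    tips = sns.load_dataset("tips")
    print(tips.head())
    print(tips.shape)
except OSError as exc:
    # Network errors raised during the download are subclasses of OSError
    tips = None
    print(f"Could not download the dataset: {exc}")
```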

Exploring a Dataset

Before doing anything else, always explore your dataset to understand what you are working with.

What to Look for During Exploration

1. Shape: How many rows and columns?

2. Data types: Are numeric columns stored as numbers or strings?

3. Missing values: Which columns have nulls and how many?

4. Unique values: Are categorical columns clean and consistent?

5. Statistical summary: What are the ranges, means, and distributions?
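
The five checks above map onto a handful of Pandas calls. The sketch below uses a small made-up DataFrame that deliberately contains one missing score and one inconsistent grade label, so each check has something to show.

```python
import pandas as pd

# Hypothetical sample data with one missing Score and a lowercase "a" grade
df = pd.DataFrame({
    "Name": ["Ana", "Ben", "Cara", "Dev", "Eli"],
    "Score": [91.0, 47.0, None, 78.0, 85.0],
    "Grade": ["A", "F", "B", "B", "a"],
})

print(df.shape)              # 1. number of rows and columns
print(df.dtypes)             # 2. data type of each column
print(df.isnull().sum())     # 3. count of missing values per column
print(df["Grade"].unique())  # 4. distinct values in a categorical column
print(df.describe())         # 5. summary statistics for numeric columns
```

`df.head()` and `df.info()` are also commonly used for a first look at the rows and the column types.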

Selecting and Filtering Data

Once loaded, you frequently need to extract specific parts of the dataset.

python

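import pandas as pd

df = pd.read_csv("students.csv")  # assumed to contain Name, Score, and Grade columns

# Select a single column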

print(df["Score"])

# Select multiple columns

print(df[["Name", "Score", "Grade"]])

# Filter rows by condition

passed = df[df["Score"] >= 50]

print(passed)

# Filter with multiple conditions

top_students = df[(df["Score"] >= 80) & (df["Grade"] == "A")]

print(top_students)

# Select specific rows and columns

print(df.loc[0:4, ["Name", "Score"]]) # By label: rows 0 through 4, end included

print(df.iloc[0:5, 0:3]) # By position: first 5 rows and first 3 columns, end excluded

Sorting and Ranking Data

Sorting helps you identify top and bottom performers in a dataset.
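
A minimal sketch of sorting and ranking with Pandas. The names and scores are made-up values, and two students share the top score to show how ties are handled.

```python
import pandas as pd

# Hypothetical sample data; Ana and Dev are tied on Score
df = pd.DataFrame({
    "Name": ["Ana", "Ben", "Cara", "Dev"],
    "Score": [91, 47, 78, 91],
})

# Sort by Score, highest first
by_score = df.sort_values("Score", ascending=False)
print(by_score)

# Sort by Score (descending), then by Name (ascending) to order ties
by_score_name = df.sort_values(["Score", "Name"], ascending=[False, True])
print(by_score_name)

# Rank scores from highest to lowest; method="min" gives tied rows the same rank
df["Rank"] = df["Score"].rank(ascending=False, method="min").astype(int)
print(df)

# Shortcut for the top N rows by a column
top2 = df.nlargest(2, "Score")
print(top2)
```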

Understanding Value Distribution

Knowing how your data is distributed is critical before building any AI model.
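
One way to inspect distributions is sketched below: `value_counts` for a categorical column and `pd.cut` to bin a numeric column. The data, the bin edges, and the band labels are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    "Score": [91, 47, 78, 85, 62, 55],
    "Grade": ["A", "F", "B", "A", "C", "C"],
})

# How often each category appears
grade_counts = df["Grade"].value_counts()
print(grade_counts)

# The same counts as fractions of the total
grade_share = df["Grade"].value_counts(normalize=True)
print(grade_share)

# Bin a numeric column into bands: (0, 50], (50, 75], (75, 100]
score_bands = pd.cut(df["Score"], bins=[0, 50, 75, 100], labels=["Low", "Mid", "High"])
band_counts = score_bands.value_counts().sort_index()
print(band_counts)

# Spread of the numeric column: mean, standard deviation, quartiles
print(df["Score"].describe())
```

A heavily unbalanced category or a skewed numeric column seen here is worth noting before training a model.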

Grouping and Summarizing

Grouping lets you summarize data by category, which makes it easy to compare groups within a dataset.
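
A minimal sketch of grouping with `groupby`, using made-up student data grouped by the Grade column.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    "Name": ["Ana", "Ben", "Cara", "Dev", "Eli"],
    "Grade": ["A", "F", "B", "B", "A"],
    "Score": [91, 47, 78, 72, 85],
})

# Average score within each grade
mean_by_grade = df.groupby("Grade")["Score"].mean()
print(mean_by_grade)

# Several summary statistics per group in one call
summary = df.groupby("Grade")["Score"].agg(["count", "mean", "min", "max"])
print(summary)
```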

Saving a Processed Dataset

After exploring and processing your dataset, save the result for later use.
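
A short sketch of saving a DataFrame with `to_csv` and reading it back to confirm the round trip. The data is made up, and the file is written to a temporary folder (an assumption of this sketch) so that running it does not touch your project files.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical processed data
df = pd.DataFrame({
    "Name": ["Ana", "Ben", "Cara"],
    "Score": [91, 47, 78],
    "Status": ["Pass", "Fail", "Pass"],
})

# Temporary folder used only for this demonstration
out_path = Path(tempfile.mkdtemp()) / "students_processed.csv"

# index=False leaves the row index out of the file
df.to_csv(out_path, index=False)

# Read the file back to check that it was written as expected
reloaded = pd.read_csv(out_path)
print(reloaded)
```

In your own project you would typically pass a path such as `"students_processed.csv"` directly to `to_csv`.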

A Complete Dataset Workflow

Here is a full, practical example of the dataset workflow from load to save:

python

import pandas as pd

# Step 1 — Load

df = pd.read_csv("students.csv")

# Step 2 — Explore

print(df.shape)

print(df.isnull().sum())

print(df.describe())

# Step 3 — Filter

df = df[df["Score"] >= 0] # Remove invalid scores

# Step 4 — Add a feature

df["Status"] = df["Score"].apply(lambda x: "Pass" if x >= 50 else "Fail")

# Step 5 — Sort

df = df.sort_values("Score", ascending=False)

# Step 6 — Summarize

print(df.groupby("Status")["Score"].mean())

# Step 7 — Save

df.to_csv("students_processed.csv", index=False)

print("Dataset saved successfully.")


Dean Walker

Product Designer
