Working with Datasets

Lesson 29/35 | Study Time: 60 Min

A dataset is the foundation of every AI and machine learning project. Before any model can be trained or any prediction made, you need data — structured, loaded, and understood. Working with datasets means knowing how to find them, load them into Python, explore their structure, and prepare them for analysis. 

What is a Dataset?

A dataset is a structured collection of data organized in rows and columns, where each row represents a record and each column represents a feature or attribute.
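
To make the row-and-column idea concrete, here is a minimal sketch using a pandas DataFrame. The student names, scores, and grades are made-up sample values, not data from a real source.

```python
import pandas as pd

# Hypothetical sample data: three student records with three features each.
df = pd.DataFrame(
    {
        "Name": ["Asha", "Ben", "Chen"],
        "Score": [82, 47, 91],
        "Grade": ["A", "F", "A"],
    }
)

print(df)          # each row is one record
print(df.columns)  # each column is one feature or attribute
print(df.shape)    # (number of rows, number of columns)
```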

Where to Get Datasets

Several reliable platforms provide free datasets for learning and AI projects:


1. Kaggle (kaggle.com/datasets): a large collection of real-world datasets.

2. UCI ML Repository (archive.ics.uci.edu): classic machine learning datasets.

3. Scikit-learn: built-in sample datasets ready to use in Python.

4. Seaborn: comes with several small built-in datasets for visualization practice.

5. Google Dataset Search (datasetsearch.research.google.com): a search engine for datasets hosted across the web.

Loading a Dataset

The most common way to load a dataset in Python is through Pandas.

Loading from a CSV file
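
A minimal sketch of loading a local CSV file with `pd.read_csv`. So that it runs anywhere, the example first writes a tiny stand-in file to a temporary folder; the file name `students.csv` and its contents are assumptions made for illustration. With your own data, you would pass the path of your file instead.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Create a small stand-in file so the example is self-contained.
# The file name and its contents are hypothetical.
csv_path = Path(tempfile.mkdtemp()) / "students.csv"
csv_path.write_text("Name,Score,Grade\nAsha,82,A\nBen,47,F\nChen,91,A\n")

# Load the CSV into a DataFrame; the first line becomes the column names.
df = pd.read_csv(csv_path)

print(df.head())  # first rows of the dataset (up to five)
```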


Loading from a URL

You can load datasets directly from the internet without downloading them first.


python

import pandas as pd

url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"

# read_csv accepts a URL as well as a local file path
df = pd.read_csv(url)

print(df.head())


Loading Built-in Datasets from Scikit-learn
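
Scikit-learn ships a few small datasets inside the package, so they load without a download. The sketch below picks the Iris dataset as one example; the choice of dataset is an assumption, and other loaders in `sklearn.datasets` follow the same pattern.

```python
from sklearn.datasets import load_iris

# as_frame=True returns pandas objects instead of NumPy arrays.
iris = load_iris(as_frame=True)
df = iris.frame  # the feature columns plus a "target" column

print(df.head())
print(iris.target_names)  # class names that the target numbers stand for
```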


Loading Built-in Datasets from Seaborn
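
Seaborn's `load_dataset` fetches its sample datasets from an online repository and caches them locally, so the first call needs an internet connection. The sketch below uses the `tips` dataset as an assumed example and falls back gracefully when the download is not possible.

```python
import seaborn as sns

try:
    # Downloads the dataset on first use, then reads it from the local cache.
    tips = sns.load_dataset("tips")
    print(tips.head())
except OSError as exc:
    # Network failures surface as OSError subclasses.
    tips = None
    print(f"Could not load the dataset: {exc}")
```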

Exploring a Dataset

Before doing anything else, always explore your dataset to understand what you are working with.

What to Look for During Exploration


1. Shape: How many rows and columns?

2. Data types: Are numeric columns stored as numbers or strings?

3. Missing values: Which columns have nulls and how many?

4. Unique values: Are categorical columns clean and consistent?

5. Statistical summary: What are the ranges, means, and distributions?
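
Each item in the checklist above maps to a single pandas call. The sketch below runs them on made-up sample data that deliberately contains one missing score and one inconsistent grade label, so each check has something to find.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with a missing Score and a lowercase "a" grade.
df = pd.DataFrame(
    {
        "Name": ["Asha", "Ben", "Chen", "Dina"],
        "Score": [82.0, 47.0, np.nan, 91.0],
        "Grade": ["A", "F", "B", "a"],
    }
)

print(df.shape)              # 1. number of rows and columns
print(df.dtypes)             # 2. data type of each column
print(df.isnull().sum())     # 3. count of missing values per column
print(df["Grade"].unique())  # 4. distinct values in a categorical column
print(df.describe())         # 5. count, mean, std, min, quartiles, max
```

Here the unique-values check reveals both "A" and "a", the kind of inconsistency worth cleaning before analysis.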

Selecting and Filtering Data

Once loaded, you frequently need to extract specific parts of the dataset.


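python

import pandas as pd

# Assumes students.csv has Name, Score, and Grade columns
df = pd.read_csv("students.csv")

# Select a single column (returns a Series)
print(df["Score"])

# Select multiple columns (returns a DataFrame)
print(df[["Name", "Score", "Grade"]])

# Filter rows by condition
passed = df[df["Score"] >= 50]
print(passed)

# Filter with multiple conditions (wrap each condition in parentheses)
top_students = df[(df["Score"] >= 80) & (df["Grade"] == "A")]
print(top_students)

# Select specific rows and columns
print(df.loc[0:4, ["Name", "Score"]])       # By label; the end label 4 is included
print(df.iloc[0:5, 0:3])                    # By position; the end position 5 is excluded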


Sorting and Ranking Data

Sorting helps you identify top and bottom performers in a dataset.
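
A minimal sketch of sorting and ranking with pandas, using made-up scores. Two students share a score on purpose, to show how `rank` handles ties.

```python
import pandas as pd

# Hypothetical sample data; Asha and Dina are tied.
df = pd.DataFrame(
    {
        "Name": ["Asha", "Ben", "Chen", "Dina"],
        "Score": [82, 47, 91, 82],
    }
)

# Sort from highest to lowest score.
by_score = df.sort_values("Score", ascending=False)
print(by_score)

# Rank with 1 as the best score; method="min" gives tied rows the same rank.
df["Rank"] = df["Score"].rank(ascending=False, method="min")
print(df)

# Shortcuts for the top and bottom rows.
top_two = df.nlargest(2, "Score")
bottom_two = df.nsmallest(2, "Score")
print(top_two)
print(bottom_two)
```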


Understanding Value Distribution

Knowing how your data is distributed is critical before building any AI model.
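
One way to inspect distributions, sketched on made-up data: `value_counts` for categorical columns, and `describe`, `quantile`, and `skew` for numeric ones.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame(
    {
        "Score": [82, 47, 91, 82, 65, 58],
        "Grade": ["A", "F", "A", "A", "C", "D"],
    }
)

# Categorical column: how often does each value occur?
grade_counts = df["Grade"].value_counts()
grade_shares = df["Grade"].value_counts(normalize=True)  # as proportions
print(grade_counts)
print(grade_shares)

# Numeric column: center, spread, and shape.
print(df["Score"].describe())
print(df["Score"].quantile([0.25, 0.5, 0.75]))  # quartiles
print(df["Score"].skew())  # sign shows which side has the longer tail
```

A heavily unbalanced `value_counts` result is worth noticing early, since a model trained on it may favor the most common category.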


Grouping and Summarizing

Grouping lets you summarize data by category, useful for comparing groups within a dataset.
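
A short sketch of `groupby` on made-up data: first a single statistic per group, then several at once with `agg`.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame(
    {
        "Name": ["Asha", "Ben", "Chen", "Dina", "Eli"],
        "Grade": ["A", "B", "A", "B", "C"],
        "Score": [90, 70, 80, 60, 50],
    }
)

# Average score within each grade.
mean_by_grade = df.groupby("Grade")["Score"].mean()
print(mean_by_grade)

# Several summary statistics per grade in one table.
summary = df.groupby("Grade")["Score"].agg(["count", "mean", "min", "max"])
print(summary)
```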




Saving a Processed Dataset

After exploring and processing your dataset, save the result for later use.
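
A minimal sketch of saving with `to_csv` and reading the file back to check it. The data is made up, and the file is written to a temporary folder so the example leaves nothing behind in your project; in practice you would choose your own output path.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical processed data.
df = pd.DataFrame(
    {
        "Name": ["Asha", "Ben", "Chen"],
        "Score": [82, 47, 91],
        "Status": ["Pass", "Fail", "Pass"],
    }
)

out_path = Path(tempfile.mkdtemp()) / "students_processed.csv"

# index=False leaves the row index out of the file.
df.to_csv(out_path, index=False)

# Read the file back to confirm that it was written as expected.
reloaded = pd.read_csv(out_path)
print(reloaded)
```

Without `index=False`, the row index is written as an extra unnamed column, which then appears as a stray column when the file is loaded again.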


A Complete Dataset Workflow

Here is a full, practical example of the dataset workflow from load to save:

python

import pandas as pd

# Step 1 — Load
df = pd.read_csv("students.csv")

# Step 2 — Explore
print(df.shape)
print(df.isnull().sum())
print(df.describe())

# Step 3 — Filter
df = df[df["Score"] >= 0]           # Remove negative scores; rows with a missing Score are dropped too

# Step 4 — Add a feature
df["Status"] = df["Score"].apply(lambda x: "Pass" if x >= 50 else "Fail")

# Step 5 — Sort
df = df.sort_values("Score", ascending=False)

# Step 6 — Summarize
print(df.groupby("Status")["Score"].mean())

# Step 7 — Save
df.to_csv("students_processed.csv", index=False)
print("Dataset saved successfully.")
