Introduction to Data Wrangling

Lesson 15/32 | Study Time: 27 Min

Data wrangling is the step that transforms raw, messy, and unstructured data into formats optimized for analysis, modeling, reporting, or visualization.

While data cleaning focuses on correcting errors, data wrangling focuses on reshaping, restructuring, and enriching the dataset so it becomes analytically powerful.

Data wrangling is a creative, iterative process—part engineering, part analysis, and part domain reasoning.

When performed effectively, it elevates dataset usability and directly improves model performance.

Reshaping Data 

Reshaping data involves reorganizing the structure of a dataset to make it more suitable for analytical objectives.

Raw data is rarely structured exactly the way a data scientist needs it. You may receive it as a wide spreadsheet, nested JSON, log files, or a narrow long-format table.

Reshaping techniques convert these structures into analytical formats compatible with statistical models, BI dashboards, and machine learning algorithms.

Long vs Wide Format Conversion

Data can exist in two fundamental structural formats:


Wide Format:

1. Each subject appears once

2. Multiple attributes appear as separate columns


Example: Monthly sales columns like Jan, Feb, Mar

Long Format:

1. Each row represents one observation of a variable

2. Suitable for time-series and machine learning


Example: Month and Sales columns instead of 12 monthly columns

Why Long/Wide Conversion Matters:

Converting between formats ensures the dataset's shape matches the intended analysis: many statistical and machine learning tools expect long-format input, while reports and dashboards often read better in wide format.

Tools such as pd.melt() and DataFrame.pivot() in pandas make these conversions straightforward, as the sketch below shows.
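A minimal sketch of the conversion in pandas (the store and month column names are illustrative, not from a real dataset):

    import pandas as pd

    # Wide format: one row per store, one column per month.
    wide = pd.DataFrame({
        "store": ["A", "B"],
        "Jan": [100, 80],
        "Feb": [120, 90],
        "Mar": [130, 95],
    })

    # Wide -> long: each row becomes one (store, month, sales) observation.
    long_df = wide.melt(id_vars="store", var_name="month", value_name="sales")

    # Long -> wide: pivot back to one column per month.
    wide_again = long_df.pivot(index="store", columns="month",
                               values="sales").reset_index()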

Pivoting and Unpivoting Data

Pivoting transforms long data into wide format to create summaries or reorganize categories.


Use Cases of Pivoting:

1. Creating user-level summaries (e.g., total purchases by category)

2. Generating cross-tab reports for BI dashboards

3. Reshaping transactional data into analytical matrices

Unpivoting (also called melting) is the reverse process: wide data is condensed into long format.


Why It Matters: Unpivoting standardizes datasets with many repeated columns and makes them more compatible with statistical algorithms.

For example, converting columns like “Test1, Test2, Test3” into a single “Test Score” column improves modeling consistency.
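As a sketch of that exact case (the gradebook values are hypothetical):

    import pandas as pd

    # Hypothetical gradebook: one column per test.
    scores = pd.DataFrame({
        "student": ["Asha", "Ben"],
        "Test1": [78, 85],
        "Test2": [82, 88],
        "Test3": [91, 79],
    })

    # Unpivot (melt): condense the repeated test columns into one score column.
    long_scores = scores.melt(id_vars="student",
                              value_vars=["Test1", "Test2", "Test3"],
                              var_name="test", value_name="test_score")

    # Pivot: reshape back into a student-by-test matrix when needed.
    matrix = long_scores.pivot(index="student", columns="test",
                               values="test_score")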


Grouping and Aggregating

Aggregation helps summarize and condense large datasets.


Common Aggregations:

1. Sum of sales

2. Average temperature readings

3. Maximum and minimum values per group

4. Count of unique users or sessions


Why Aggregation is Critical:

1. Simplifies large datasets

2. Supports feature engineering (e.g., customer lifetime value)

3. Helps identify trends and patterns

4. Enables efficient model-ready datasets


Grouping and aggregation directly influence the quality of descriptive analytics and predictive features.
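A short sketch of grouped aggregation in pandas (the transaction table is invented for illustration):

    import pandas as pd

    tx = pd.DataFrame({
        "customer_id": [1, 1, 2, 2, 2],
        "category": ["books", "games", "books", "books", "games"],
        "amount": [12.0, 40.0, 8.0, 15.0, 30.0],
    })

    # One pass computes several model-ready features per customer.
    summary = tx.groupby("customer_id").agg(
        total_spend=("amount", "sum"),
        avg_order=("amount", "mean"),
        n_orders=("amount", "count"),
        n_categories=("category", "nunique"),
    ).reset_index()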


Feature Creation (Feature Engineering)

Feature engineering is the process of creating meaningful new variables that enhance the modeling power of algorithms.

Raw data often lacks the expressiveness required by machine learning models. By deriving new features, data scientists expose hidden patterns and relationships.


1. Deriving New Columns from Existing Variables

This involves creating new metrics or indicators based on combinations or transformations of existing columns.


Examples:

1. Age from birthdate

2. Time difference between events

3. Revenue growth rate

4. Customer activity scores


Why It Matters: Machine learning models identify patterns based on the features provided.

Well-designed features capture essential business logic and improve predictions significantly.
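A minimal sketch of such derivations (the column names and the fixed reference date are assumptions made for the example):

    import pandas as pd

    df = pd.DataFrame({
        "birthdate": pd.to_datetime(["1990-05-01", "1985-11-23"]),
        "last_login": pd.to_datetime(["2024-02-01", "2024-01-15"]),
        "revenue_2023": [1200.0, 900.0],
        "revenue_2024": [1500.0, 810.0],
    })
    today = pd.Timestamp("2024-06-01")  # fixed reference date for reproducibility

    # Approximate age in years from birthdate.
    df["age"] = (today - df["birthdate"]).dt.days // 365

    # Time difference between events, in days.
    df["days_since_login"] = (today - df["last_login"]).dt.days

    # Revenue growth rate between two periods.
    df["revenue_growth"] = (df["revenue_2024"] - df["revenue_2023"]) / df["revenue_2023"]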


Encoding Categorical Variables (OHE, Label Encoding)

Most machine learning algorithms require numerical inputs, so categorical variables must be encoded before modeling.


One-Hot Encoding:

1. Creates binary columns for each category.

2. Useful for nominal (unordered) categories like color or city.


Label Encoding:


1. Assigns an integer code to each category.

2. Useful for ordinal categories like education level, provided the integer order matches the natural order of the categories.


Why Encoding is Important: Incorrect encoding distorts patterns and biases models.

Proper encoding ensures compatibility with ML algorithms and preserves categorical meaning.
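Both encodings in pandas, as a sketch (the city and education values are hypothetical):

    import pandas as pd

    df = pd.DataFrame({
        "city": ["Delhi", "Dubai", "Delhi"],        # nominal
        "education": ["Bachelor", "PhD", "Master"], # ordinal
    })

    # One-hot encoding: one binary column per city.
    df = pd.get_dummies(df, columns=["city"], prefix="city")

    # Ordinal label encoding: integers that respect the natural order.
    edu_order = {"Bachelor": 0, "Master": 1, "PhD": 2}
    df["education_level"] = df["education"].map(edu_order)

In ML pipelines, scikit-learn's OneHotEncoder and OrdinalEncoder provide the same transformations.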


Binning and Scaling Data

Binning (Discretization):


1. Splits continuous variables into ranges.

2. Examples include income brackets or age groups.


Scaling:

1. Standardization (Z-score scaling)

2. Normalization (Min–Max scaling)


Impact on Modeling:


1. Many algorithms—such as kNN, SVM, and neural networks—are sensitive to scale.

2. Scaling puts all features on a comparable range so that no single feature dominates, and it helps many models converge faster.
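A compact sketch of both techniques (the bin edges and labels are arbitrary choices for illustration):

    import pandas as pd

    df = pd.DataFrame({"age": [22, 35, 47, 63],
                       "income": [28000, 52000, 91000, 40000]})

    # Binning: discretize age into labeled ranges.
    df["age_group"] = pd.cut(df["age"], bins=[0, 30, 50, 100],
                             labels=["young", "middle", "senior"])

    # Standardization (z-score): mean 0, standard deviation 1.
    df["income_z"] = (df["income"] - df["income"].mean()) / df["income"].std()

    # Normalization (min-max): rescale to the [0, 1] range.
    rng = df["income"].max() - df["income"].min()
    df["income_minmax"] = (df["income"] - df["income"].min()) / rng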

Merging and Joining Datasets 

Modern data ecosystems rely on multiple structured and unstructured sources—databases, APIs, log files, CRM systems, ERP platforms, and more.

Combining these datasets is essential to build a unified and complete analytical view.


A. Combining Multiple Data Sources Using Keys

Joining involves merging datasets on common identifiers like:


1. Customer ID

2. Order ID

3. Product SKU

4. Email or phone number

5. Timestamp


Why Joins Are Critical:

Real-world data is distributed across systems:

1. The CRM contains customer profiles

2. The transaction system contains purchase data

3. Web logs contain browsing behavior


Joining these sources enriches the data, producing a single unified dataset that models real-world entities accurately.
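As a sketch, merging two hypothetical sources on Customer ID (the tables and values are invented):

    import pandas as pd

    crm = pd.DataFrame({
        "customer_id": [1, 2, 3],
        "segment": ["retail", "retail", "enterprise"],
    })
    transactions = pd.DataFrame({
        "order_id": [101, 102, 103],
        "customer_id": [1, 1, 3],
        "amount": [40.0, 25.0, 300.0],
    })

    # Join on the shared key to enrich each transaction with its CRM profile.
    unified = transactions.merge(crm, on="customer_id", how="left")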


B. Handling Join Types (Inner, Left, Right, Outer)

Different join types control how the datasets combine.

Common Join Types:

1. Inner join: keeps only rows whose key appears in both datasets.

2. Left join: keeps every row from the left dataset and fills unmatched right-side fields with nulls.

3. Right join: keeps every row from the right dataset and fills unmatched left-side fields with nulls.

4. Outer (full) join: keeps the union of keys from both datasets.

Real-World Relevance: Choosing the wrong join type can silently drop records or duplicate rows, so the join type should always match the analytical question, as the sketch below illustrates.
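A brief sketch of how the how= argument of pandas' merge selects the join type (the toy tables are illustrative):

    import pandas as pd

    customers = pd.DataFrame({"customer_id": [1, 2, 3],
                              "name": ["Asha", "Ben", "Chloe"]})
    orders = pd.DataFrame({"customer_id": [1, 1, 4],
                           "amount": [40.0, 25.0, 10.0]})

    inner = orders.merge(customers, on="customer_id", how="inner")  # keys in both
    left  = orders.merge(customers, on="customer_id", how="left")   # all orders
    right = orders.merge(customers, on="customer_id", how="right")  # all customers
    outer = orders.merge(customers, on="customer_id", how="outer")  # union of keys

    # Customers 2 and 3 place no orders: the inner and left joins drop them,
    # while the right and outer joins keep them with NaN in the order columns.
    # Customer 4 has an order but no profile: only the left and outer joins
    # keep that row.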