Using Scikit-learn (Basic Example)

Lesson 34/35 | Study Time: 60 Min


Knowing the theory of machine learning is important, but applying it in code is where the real learning happens. Scikit-learn is one of Python's most popular and beginner-friendly machine learning libraries.

It provides a clean, consistent interface for training models, making predictions, and evaluating performance, all in just a few lines of code.

Whether you are building a classifier, a regression model, or a clustering algorithm, Scikit-learn handles the heavy lifting so you can focus on the problem, not the mathematics.

Installing and Importing Scikit-learn

Scikit-learn comes pre-installed with the Anaconda distribution. To install it manually, run pip install scikit-learn in a terminal.
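A quick way to confirm the installation worked is to import the package and print its version. This is only a sanity check, and the version you see will depend on your environment.

```python
# The package is installed as "scikit-learn" but imported as "sklearn".
import sklearn

# Print the installed version string; the exact value depends on your setup.
print(sklearn.__version__)
```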

Import what you need at the top of your script:
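The original import listing is not shown here. As a sketch, the imports below cover the steps in this lesson; the exact selection is an assumption (accuracy_score in particular is included on the assumption that accuracy is the evaluation metric).

```python
# Dataset loader for the built-in Iris data
from sklearn.datasets import load_iris

# Utility for splitting data into training and test sets
from sklearn.model_selection import train_test_split

# Feature scaling (zero mean, unit variance)
from sklearn.preprocessing import StandardScaler

# The classifier used in this lesson
from sklearn.linear_model import LogisticRegression

# Assumed evaluation metric: fraction of correct predictions
from sklearn.metrics import accuracy_score
```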

The Scikit-learn Workflow

Every Scikit-learn project follows the same consistent pattern, regardless of which algorithm you use:

Load Data → Prepare Data → Split Data → Train Model → Evaluate → Predict

This consistent structure is one of Scikit-learn's greatest strengths: once you learn it for one algorithm, you can apply it to any other.
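To illustrate the shared interface, here is a minimal sketch that trains two different classifiers with exactly the same calls. The choice of KNeighborsClassifier as the second model, and the values test_size=0.2, random_state=42, max_iter=1000, and n_neighbors=5, are assumptions made for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load features and labels in one call
X, y = load_iris(return_X_y=True)

# Hold out 20% of the rows for testing (assumed split size)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Two unrelated algorithms, one identical workflow: fit, then score
models = [LogisticRegression(max_iter=1000), KNeighborsClassifier(n_neighbors=5)]
scores = {}
for model in models:
    model.fit(X_train, y_train)
    scores[type(model).__name__] = model.score(X_test, y_test)

for name, accuracy in scores.items():
    print(f"{name}: test accuracy = {accuracy:.2f}")
```

Only the line that constructs the model changes; the fit and score calls stay the same.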

Core Concepts Before Building

Before writing code, understand these four essentials:

Step-by-Step — Building a Classification Model

The following example uses the built-in Iris dataset, a classic beginner dataset containing measurements of flowers from three iris species. The goal is to predict which species a flower belongs to from its measurements.

Step 1 — Load the Data

The dataset has 150 rows, 4 features (sepal length, sepal width, petal length, and petal width, all in centimeters), and 3 species encoded as the class labels 0, 1, and 2 (setosa, versicolor, and virginica).
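A minimal sketch of the loading step, assuming the data comes from sklearn.datasets.load_iris (the variable name iris is a choice made here):

```python
from sklearn.datasets import load_iris

# Load the built-in dataset; no download is required
iris = load_iris()

print(iris.data.shape)     # (150, 4): 150 rows, 4 features
print(iris.feature_names)  # names of the four measurements
print(iris.target_names)   # ['setosa' 'versicolor' 'virginica']
```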

Step 2 — Prepare Features and Target
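The code for this step is not shown here. A minimal sketch, assuming the dataset is loaded with load_iris() and following the common convention of naming the feature matrix X and the target vector y:

```python
from sklearn.datasets import load_iris

iris = load_iris()

X = iris.data    # feature matrix: one row per flower, one column per measurement
y = iris.target  # target vector: the species label (0, 1, or 2) for each row

print(X.shape, y.shape)  # (150, 4) (150,)
```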

Step 3 — Split Data into Training and Test Sets

Never train and test on the same data: the model could simply memorize the answers, and its score would say nothing about how it handles unseen data. Use train_test_split to divide your data.
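A sketch of the split follows. The 80/20 ratio, the random_state value, and the use of stratify are assumptions chosen for illustration, not settings confirmed by the lesson.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# test_size=0.2   -> 20% of rows are held out for testing (assumed ratio)
# random_state=42 -> fixes the shuffle so the split is reproducible
# stratify=y      -> keeps the three species equally represented in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)
```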

Step 4 — Scale the Features

Feature scaling puts all features on a comparable numerical scale. This helps scale-sensitive models such as logistic regression converge faster and often improves their performance.

Important: Always fit the scaler on the training data only, then use that fitted scaler to transform both the training and the test data. Fitting on the full dataset causes data leakage, where statistics from the test data influence the training process.
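A minimal sketch of leakage-free scaling, assuming StandardScaler is the scaler in use (the split parameters are illustrative assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()

# fit_transform: learn mean and standard deviation from the training set, then scale it
X_train_scaled = scaler.fit_transform(X_train)

# transform only: reuse the training statistics, never refit on test data
X_test_scaled = scaler.transform(X_test)

# Training columns now have mean close to 0 and standard deviation close to 1
print(X_train_scaled.mean(axis=0).round(2))
print(X_train_scaled.std(axis=0).round(2))
```

The test set will generally not have a mean of exactly 0 after scaling, which is expected: it was scaled with the training set's statistics.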

Step 5 — Choose and Train the Model

This example uses logistic regression, a simple and effective classification algorithm.
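A sketch of the training step, repeating the earlier preparation so the block runs on its own. The max_iter value and the split parameters are assumptions made for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale using statistics learned from the training set only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Create the model, then train it on the scaled training data
model = LogisticRegression(max_iter=200)
model.fit(X_train_scaled, y_train)

print(model.classes_)     # the class labels the model learned: [0 1 2]
print(model.coef_.shape)  # one row of weights per class, one column per feature
```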