When we talk about data science, most people immediately think of machine learning models.
But in reality, a large portion of the work happens before the model is even built.
This step is called data preprocessing.
What is Data Preprocessing?
Data preprocessing is the process of:
- cleaning data
- transforming data
- preparing it for analysis or modeling
Raw data is rarely usable in its original form.
It often contains:
- missing values
- inconsistent formats
- duplicate records
- irrelevant features
Why Preprocessing is Important
Even the best algorithm cannot fix poor-quality data.
For example:
- Missing values can break models
- Inconsistent formats lead to incorrect analysis results
- Outliers can distort predictions
A simple model on clean data often performs better than a complex model on messy data.
Common Steps in Data Preprocessing
1️⃣ Handling Missing Values
Missing data is very common.
Options include:
- removing rows
- filling with mean/median
- using interpolation
Example (Python):
import pandas as pd
df = pd.read_csv("data.csv")
# Fill missing values with the column mean
# (assignment instead of the deprecated chained inplace=True)
df['age'] = df['age'].fillna(df['age'].mean())
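The other options from the list above follow the same pattern. A minimal sketch, assuming the same 'age' column:
# Fill with the median instead (more robust to outliers)
df['age'] = df['age'].fillna(df['age'].median())
# Or interpolate between neighboring values (useful for ordered data, e.g. time series)
df['age'] = df['age'].interpolate(method='linear')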
2️⃣ Removing Duplicates
Duplicate data can bias results.
# Drop exact duplicate rows, keeping the first occurrence
df = df.drop_duplicates()
3️⃣ Encoding Categorical Variables
Most machine learning models work with numbers, not text, so categorical columns must be converted to a numeric representation.
Example:
# One-hot encode 'city' into binary indicator columns
df = pd.get_dummies(df, columns=['city'])
4️⃣ Feature Scaling
Some algorithms, such as k-nearest neighbors and anything trained by gradient descent, perform poorly when features are on very different scales.
from sklearn.preprocessing import StandardScaler
# Standardize 'salary' to zero mean and unit variance
scaler = StandardScaler()
df[['salary']] = scaler.fit_transform(df[['salary']])
5️⃣ Handling Outliers
Outliers can distort models.
Common approaches:
- remove extreme values
- cap values (sketched below)
- use robust methods
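Here is a minimal sketch of the capping option, assuming a numeric 'salary' column and the common 1.5 * IQR rule:
# Compute the interquartile range (IQR) of the column
q1 = df['salary'].quantile(0.25)
q3 = df['salary'].quantile(0.75)
iqr = q3 - q1
# Cap values outside the IQR fences instead of dropping the rows
df['salary'] = df['salary'].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)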
Real-World Example
Suppose you are building a model to predict house prices.
The raw dataset may have:
- missing values in price
- inconsistent formats
- duplicate entries
- extreme outliers
After preprocessing:
- missing values handled
- clean numerical data
- standardized features
Only then is the data ready for modeling.
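Putting the steps together, here is a minimal end-to-end sketch of the house-price scenario. The file name "houses.csv" and the columns price, area, and city are hypothetical, chosen only for illustration:
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical dataset; file and column names are assumptions
df = pd.read_csv("houses.csv")

# 1. Handle missing values: drop rows missing the target, fill a numeric feature
df = df.dropna(subset=['price'])
df['area'] = df['area'].fillna(df['area'].median())

# 2. Remove duplicate listings
df = df.drop_duplicates()

# 3. One-hot encode the categorical 'city' column
df = pd.get_dummies(df, columns=['city'])

# 4. Standardize the numeric feature
scaler = StandardScaler()
df[['area']] = scaler.fit_transform(df[['area']])

# 5. Cap outliers in the target using the 1.5 * IQR rule
q1, q3 = df['price'].quantile(0.25), df['price'].quantile(0.75)
iqr = q3 - q1
df['price'] = df['price'].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)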