Tuesday, 24 March 2026

📊 Data Preprocessing in Data Science: Why Cleaning Data Matters

When we talk about data science, most people immediately think of machine learning models.

But in reality, a large portion of the work happens before the model is even built.

This step is called data preprocessing.


🧠 What is Data Preprocessing?

Data preprocessing is the process of:

  • cleaning data
  • transforming data
  • preparing it for analysis or modeling

Raw data is rarely usable in its original form.

It often contains:

  • missing values
  • inconsistent formats
  • duplicate records
  • irrelevant features



🔍 Why Preprocessing is Important

Even the best algorithm cannot fix poor-quality data.

For example:

  • Missing values can break models
  • Inconsistent formats lead to wrong analysis
  • Outliers can distort predictions

A simple model on clean data often performs better than a complex model on messy data.


🧩 Common Steps in Data Preprocessing




1️⃣ Handling Missing Values

Missing data is very common.

Options include:

  • removing rows
  • filling with mean/median
  • using interpolation

Example (Python)

import pandas as pd

df = pd.read_csv("data.csv")

# Fill missing values in 'age' with the column mean
df['age'] = df['age'].fillna(df['age'].mean())

2️⃣ Removing Duplicates

Duplicate data can bias results.

df.drop_duplicates(inplace=True)

3️⃣ Encoding Categorical Variables

Machine learning models work with numbers, not text.

Example:

df = pd.get_dummies(df, columns=['city'])

4️⃣ Feature Scaling

Some algorithms require data to be on similar scales.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df[['salary']] = scaler.fit_transform(df[['salary']])

5️⃣ Handling Outliers

Outliers can distort models.

Example:

  • remove extreme values
  • cap values
  • use robust methods
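The capping option above can be sketched with pandas using the interquartile range (IQR). The column name and values here are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"salary": [40_000, 45_000, 50_000, 52_000, 48_000, 1_000_000]})

# Compute the interquartile range (IQR)
q1 = df["salary"].quantile(0.25)
q3 = df["salary"].quantile(0.75)
iqr = q3 - q1

# Cap values that fall outside 1.5 * IQR of the quartiles
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df["salary"] = df["salary"].clip(lower=lower, upper=upper)
```

Capping keeps every row (unlike removal) while limiting the influence of extreme values such as the 1,000,000 entry.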



📊 Real-World Example

Suppose you are building a model to predict house prices.

The raw dataset may contain:

  • missing values in price
  • inconsistent formats
  • duplicate entries
  • extreme outliers

After preprocessing:

  • missing values handled
  • clean numerical data
  • standardized features

Only then is the data ready for modeling.
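Putting the steps together, a minimal preprocessing pipeline for a hypothetical house-price dataset might look like this (column names and values are invented for illustration):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical raw data: a missing price, a duplicate row, an extreme outlier
df = pd.DataFrame({
    "price": [250_000, None, 310_000, 310_000, 9_000_000],
    "city":  ["Austin", "Dallas", "Austin", "Austin", "Dallas"],
})

df = df.drop_duplicates()                               # remove duplicate records
df["price"] = df["price"].fillna(df["price"].median())  # fill missing values
df = pd.get_dummies(df, columns=["city"])               # encode the categorical column

# Standardize the numeric feature
scaler = StandardScaler()
df[["price"]] = scaler.fit_transform(df[["price"]])
```

Each step mirrors one of the sections above; a real pipeline would also handle outliers and validate formats before modeling.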

Sunday, 22 March 2026

Exploring Oracle Database 26ai: 5 Features That Stand Out

Databases have evolved far beyond simple data storage systems. Modern databases are expected to support AI workloads, analytics, application development, and secure data management—all within a single platform.

Oracle Database 26ai reflects this evolution. It introduces several features designed to make databases more intelligent, scalable, and developer-friendly.

In this post, I’ll highlight five interesting capabilities of Oracle 26ai that caught my attention while exploring recent demos and documentation.


1️⃣ Hybrid Vector Index – Enabling AI-Driven Search

One of the most exciting additions in modern databases is vector search.

Vector search allows databases to store and query embeddings generated by machine learning models. Instead of matching exact words, the system can find results based on semantic similarity.

This capability is essential for building:

  • AI assistants

  • semantic search systems

  • recommendation engines

  • Retrieval Augmented Generation (RAG) applications

Oracle 26ai introduces the Hybrid Vector Index, which combines vector similarity search with traditional SQL filtering.

Example

Suppose we store product descriptions and their embeddings.

CREATE TABLE products (
  id          NUMBER,
  description VARCHAR2(500),
  embedding   VECTOR(768)
);

To search for products similar to a query embedding:

SELECT id, description
FROM products
ORDER BY VECTOR_DISTANCE(embedding, :query_vector)
FETCH FIRST 5 ROWS ONLY;

The hybrid vector index improves performance by combining vector search with structured filtering.

Example:

SELECT id, description
FROM products
WHERE category = 'electronics'
ORDER BY VECTOR_DISTANCE(embedding, :query_vector)
FETCH FIRST 5 ROWS ONLY;

This allows AI search and structured queries to work together efficiently.




2️⃣ JSON Relational Duality – Bridging Documents and Tables

Applications often use JSON-based APIs, while databases traditionally store data in relational tables.

Oracle introduced JSON Relational Duality, allowing data to be viewed both as:

  • relational tables

  • JSON documents

without duplicating the data.

This makes it easier to build modern applications while keeping relational integrity.

Example

Create a table storing customer data.

CREATE TABLE customers (
  id   NUMBER PRIMARY KEY,
  name VARCHAR2(100),
  city VARCHAR2(100)
);

The same data can be exposed as a JSON document.

SELECT JSON_OBJECT(
         'id'   VALUE id,
         'name' VALUE name,
         'city' VALUE city
       )
FROM customers;

This flexibility allows developers to interact with the database in the format most suitable for their application.




3️⃣ Secure Data Redaction – Protecting Sensitive Information

Security and privacy are critical for modern data systems, especially when dealing with personal or financial information.

Oracle provides data redaction capabilities to prevent sensitive data from being exposed to unauthorized users.

Instead of returning real values, the database can automatically mask or redact sensitive fields.

Example

Suppose we want to hide credit card numbers from certain users.

BEGIN
  DBMS_REDACT.ADD_POLICY(
    object_schema       => 'HR',
    object_name         => 'PAYMENTS',
    column_name         => 'CARD_NUMBER',
    policy_name         => 'REDACT_CARD',
    function_type       => DBMS_REDACT.PARTIAL,
    function_parameters => DBMS_REDACT.REDACT_CCN16_F12
  );
END;
/

When queried, the column might appear as:

****-****-****-1234

This feature helps enforce data protection policies directly inside the database.




4️⃣ Domain Types – Improving Data Consistency

Another interesting feature is Domain Types, which allow developers to define reusable data definitions with constraints.

This helps maintain data consistency across multiple tables.

Example

Define a domain for email addresses.

CREATE DOMAIN email_domain AS VARCHAR2(255)
  CHECK (REGEXP_LIKE(email_domain,
         '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$'));

Now the domain can be reused across tables.

CREATE TABLE users (
  id    NUMBER,
  email email_domain
);

This ensures every table using this domain follows the same validation rules.




5️⃣ Built-in GraphQL Support

Modern applications often use GraphQL APIs instead of traditional REST APIs.

Oracle Database now supports GraphQL queries directly within the database layer, allowing developers to expose data through GraphQL without building separate middleware.

This reduces complexity in application architectures.

Example GraphQL query:

{
  customers {
    id
    name
    city
  }
}

The database can resolve this query and return structured data directly.

This capability helps simplify data access for modern web and mobile applications.




Why These Features Matter

Looking at these features together reveals an interesting trend.

Modern databases are evolving to support multiple workloads:

Capability       Purpose
---------------  ----------------------------------
Vector Search    AI and semantic search
JSON Duality     Modern API-friendly data access
Data Redaction   Security and compliance
Domain Types     Data quality and consistency
GraphQL          Simplified application integration

Instead of relying on multiple separate systems, many of these capabilities are now built directly into the database engine.


Final Thoughts

Oracle Database 26ai reflects a broader shift in data platforms.

Databases are no longer just repositories for structured tables—they are becoming intelligent data platforms capable of supporting AI, analytics, and modern application architectures.

Features like vector search, JSON duality, and built-in security controls show how database systems are adapting to the needs of modern applications.

For developers and data professionals, understanding these capabilities can open up new possibilities for building AI-powered and data-driven systems.

Many of these capabilities were introduced in Oracle 23ai and continue to evolve in Oracle 26ai as part of Oracle’s AI-focused database platform.

Thursday, 5 March 2026

📊 Analysis vs Analytics: Understanding the Foundation of Data-Driven Decisions

In conversations about data, two terms often appear together: analysis and analytics.

Although they sound similar, they represent slightly different concepts.

Understanding this difference is important before exploring more advanced topics like predictive or prescriptive analytics.


🔍 What is Analysis?

Analysis refers to the detailed examination of something in order to understand its structure, components, or meaning.

It is usually focused on a specific problem or dataset.

In simple terms:

Analysis means breaking something down into smaller parts to understand it better.

Examples

  • Examining financial statements to understand company performance

  • Investigating why website traffic dropped last week

  • Studying customer feedback to identify common complaints

Analysis is often manual or investigative, and it answers questions like:

  • What happened?

  • What patterns exist in this data?


📈 What is Analytics?

Analytics is broader than analysis.

Analytics is the systematic computational analysis of data using tools, algorithms, and statistical methods to discover patterns and generate insights.

Unlike traditional analysis, analytics typically involves:

  • automated tools

  • statistical models

  • machine learning techniques

  • large datasets

Analytics aims to transform raw data into actionable insights for decision-making.


Example

A company might:

  • Analyze last quarter’s sales report manually

  • Use analytics tools to automatically detect trends and predict future demand

So while analysis is a process, analytics is often a system or discipline that uses data technologies to perform analysis at scale.


🧠 Simple Comparison

Aspect     Analysis                        Analytics
---------  ------------------------------  --------------------------------------------
Scope      Focused investigation           Broader discipline
Approach   Manual or exploratory           Systematic and computational
Tools      Basic tools or manual review    Statistical models, AI, analytics platforms
Goal       Understand a specific problem   Extract insights and support decision-making

📊 The Four Types of Data Analytics

Once data is processed through analytics methods, organizations typically apply four levels of insight.

These levels represent increasing sophistication in how data is used.

1️⃣ Descriptive Analytics — What Happened?

Descriptive analytics summarizes historical data to understand past events.

Examples:

  • sales reports

  • website traffic dashboards

  • financial summaries

It provides a snapshot of past performance.
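As a tiny illustration, descriptive analytics can be as simple as summarizing historical sales with pandas (the figures here are made up):

```python
import pandas as pd

sales = pd.DataFrame({
    "month":   ["Jan", "Feb", "Mar"],
    "revenue": [12_000, 15_000, 11_000],
})

# Summarize what happened: totals, averages, and the best month
total = sales["revenue"].sum()
average = sales["revenue"].mean()
best_month = sales.loc[sales["revenue"].idxmax(), "month"]
```

No modeling is involved; the output simply describes past performance.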




2️⃣ Diagnostic Analytics — Why Did It Happen?

Diagnostic analytics investigates causes and relationships within the data.

Techniques include:

  • correlation analysis

  • root cause investigation

  • drill-down reporting

Example:
Understanding why customer churn increased last month.





3️⃣ Predictive Analytics — What Will Happen?

Predictive analytics uses statistical models and machine learning to forecast future outcomes.

Examples:

  • sales forecasting

  • demand prediction

  • fraud detection models

This stage introduces data science techniques.
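A minimal sketch of the idea, using a toy linear model to forecast the next period's sales (the numbers are invented, and real forecasting would involve far more than four data points):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Historical periods (1..4) and their sales figures
X = np.array([[1], [2], [3], [4]])
y = np.array([100.0, 110.0, 120.0, 130.0])

# Fit a linear trend to the past data
model = LinearRegression()
model.fit(X, y)

# Forecast the next period
forecast = model.predict(np.array([[5]]))[0]
```

The model learns the historical trend and extrapolates it forward, which is the core move of predictive analytics.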




4️⃣ Prescriptive Analytics — What Should We Do?

Prescriptive analytics goes further by recommending optimal actions based on predictions.

Examples:

  • dynamic pricing recommendations

  • supply chain optimization

  • personalized product suggestions

Here analytics begins to guide decisions automatically.



📊 The Analytics Maturity Ladder

These four analytics types often represent an organization’s data maturity progression.

Level         Question Answered
------------  -------------------
Descriptive   What happened?
Diagnostic    Why did it happen?
Predictive    What will happen?
Prescriptive  What should we do?

Organizations gradually move from understanding past data to making future-oriented decisions.


🌱 Final Thoughts

While analysis focuses on understanding specific data problems, analytics represents a broader discipline that uses computational methods to extract insights from large datasets.

Together, they form the foundation of modern data-driven decision-making.

Understanding these concepts is the first step toward deeper fields such as data science, machine learning, and artificial intelligence.


You can check out the related blogs here:

What is Data Science

Types of Data Explained




