Normalization

Normalization in Relational Databases

Normalization is the process of organizing data in a relational database to reduce redundancy and improve data integrity. It divides large tables into smaller, related tables and defines relationships between them.

1. Why Normalize?

✅ Eliminates Redundant Data – Avoids duplicate data storage.
✅ Ensures Data Integrity – Reduces inconsistency and anomalies.
✅ Improves Maintainability – Easier updates and modifications.
✅ Optimizes Queries – Reduces storage space and speeds up searches.

First Normal Form (1NF): Eliminating Repeating Groups

1NF ensures that a table follows the basic rules of a relational database by eliminating repeating groups and ensuring atomicity.

Rules of 1NF:

Atomicity – Each column must contain atomic (indivisible) values (no multiple values in a single field).
Uniqueness of Rows – Each row must have a unique identifier (Primary Key).
No Repeating Groups – Each column should contain a single value per row, meaning no arrays or lists inside a column.

Example of a Non-1NF Table (Repeating Groups Present)

OrderID	Customer	Products	Quantity
1	Alice	Laptop, Mouse	1, 2
2	Bob	Keyboard	1
3	Charlie	Monitor, Mouse, Cable	1, 1, 1

✅ Issues:
- The "Products" column contains multiple values per row.
- The "Quantity" column also has multiple values corresponding to "Products."
- Difficult to query individual products or quantities.

1NF-Compliant Table (Repeating Groups Removed)

OrderID	Customer	Product	Quantity
1	Alice	Laptop	1
1	Alice	Mouse	2
2	Bob	Keyboard	1
3	Charlie	Monitor	1
3	Charlie	Mouse	1
3	Charlie	Cable	1

✅ Benefits of 1NF:
- No multi-valued attributes—each field contains a single piece of data.
- Improved query performance and data integrity.
- Makes it easier to establish relationships with other tables.

Next step: Second Normal Form (2NF)—eliminating partial dependencies.

Second Normal Form (2NF): Removing Partial Dependencies

A table is in 2NF if:
1. It is already in 1NF (no repeating groups, atomic values).
2. All non-key attributes are fully functionally dependent on the entire primary key. (No partial dependencies)

What is a Partial Dependency?

A partial dependency occurs when a non-key attribute depends on only part of a composite primary key instead of the whole key.

Example of a Non-2NF Table (Partial Dependency Exists)

Consider a table that tracks orders and includes product details:

OrderID	ProductID	ProductName	CustomerID	CustomerName
1	101	Laptop	5001	Alice
1	102	Mouse	5001	Alice
2	103	Keyboard	5002	Bob

✅ Issues:
- Composite Primary Key = (OrderID, ProductID)
- ProductName only depends on ProductID, not on OrderID.
- CustomerName only depends on CustomerID, not on OrderID or ProductID.

This violates 2NF because ProductName and CustomerName should not depend on only part of the composite key.

2NF-Compliant Tables (Removing Partial Dependencies)

We split the table into three separate tables:

1️⃣ Orders Table
| OrderID | CustomerID |
|---------|-----------|
| 1 | 5001 |
| 2 | 5002 |

2️⃣ Customers Table
| CustomerID | CustomerName |
|-----------|--------------|
| 5001 | Alice |
| 5002 | Bob |

3️⃣ OrderDetails Table
| OrderID | ProductID |
|---------|----------|
| 1 | 101 |
| 1 | 102 |
| 2 | 103 |

4️⃣ Products Table
| ProductID | ProductName |
|----------|------------|
| 101 | Laptop |
| 102 | Mouse |
| 103 | Keyboard |

✅ Benefits of 2NF:
- Eliminates partial dependencies.
- Reduces data redundancy.
- Improves data consistency.

Next step: Third Normal Form (3NF)—removing transitive dependencies.

Third Normal Form (3NF): Removing Transitive Dependencies

A table is in 3NF if:
1. It is already in 2NF (no partial dependencies).
2. There are no transitive dependencies, meaning non-key attributes must depend only on the primary key and not on other non-key attributes.

What is a Transitive Dependency?

A transitive dependency occurs when a non-key attribute depends on another non-key attribute, rather than directly on the primary key.

Example of a Non-3NF Table (Transitive Dependency Exists)

OrderID	CustomerID	CustomerName	CustomerAddress
1	5001	Alice	123 Maple St.
2	5002	Bob	456 Oak St.
3	5001	Alice	123 Maple St.

✅ Issues:
- Primary Key = OrderID
- CustomerName and CustomerAddress depend on CustomerID, not directly on OrderID.
- This means CustomerName and CustomerAddress should be in a separate Customers table.

3NF-Compliant Tables (Removing Transitive Dependencies)

We separate the Customers data from the Orders table.

1️⃣ Orders Table
| OrderID | CustomerID |
|---------|-----------|
| 1 | 5001 |
| 2 | 5002 |
| 3 | 5001 |

2️⃣ Customers Table
| CustomerID | CustomerName | CustomerAddress |
|-----------|--------------|-----------------|
| 5001 | Alice | 123 Maple St. |
| 5002 | Bob | 456 Oak St. |

✅ Benefits of 3NF:
- Eliminates transitive dependencies.
- Reduces data redundancy.
- Improves data integrity by preventing inconsistencies.

Next step: Boyce-Codd Normal Form (BCNF)—handling edge cases where 3NF is not strict enough.

Boyce-Codd Normal Form (BCNF): A More Rigorous 3NF

A table is in BCNF if:
1. It is already in 3NF (no transitive dependencies).
2. Every determinant is a candidate key.

What is a Determinant?

A determinant is any attribute (or set of attributes) that uniquely determines another attribute.

BCNF ensures that no non-trivial functional dependency exists where a non-candidate key attribute determines another attribute.

Example of a Non-BCNF Table (Violating BCNF)

StudentID	Course	Instructor
1001	Math	Dr. Smith
1002	Science	Dr. Brown
1003	Math	Dr. Smith
1004	Science	Dr. Brown

✅ Issues:
- Primary Key = (StudentID, Course)
- Instructor depends on Course, not on StudentID
- Course → Instructor is a functional dependency, but Course is not a candidate key

This violates BCNF because Course determines Instructor, but Course is not a superkey.

BCNF-Compliant Tables (Fixing the Violation)

We split the table into two:

1️⃣ StudentCourses Table
| StudentID | Course |
|-----------|--------|
| 1001 | Math |
| 1002 | Science |
| 1003 | Math |
| 1004 | Science |

2️⃣ Courses Table
| Course | Instructor |
|--------|------------|
| Math | Dr. Smith |
| Science | Dr. Brown |

✅ Benefits of BCNF:
- Eliminates all redundancy caused by functional dependencies.
- Ensures every determinant is a candidate key.
- Leads to a more robust and normalized schema.

Downside: BCNF can sometimes cause more table fragmentation, requiring more joins for queries.

Next step: Fourth Normal Form (4NF)—handling multi-valued dependencies.

Fourth Normal Form (4NF): Eliminating Multi-Valued Dependencies

A table is in 4NF if:
1. It is already in BCNF.
2. It has no multi-valued dependencies (MVDs).

What is a Multi-Valued Dependency (MVD)?

A multi-valued dependency (A →→ B) means that for a single value of A, there exist multiple independent values of B and C.

Unlike functional dependencies (where one column uniquely determines another), an MVD allows multiple independent values for one key.
This leads to redundancy and data anomalies.

Example of a Non-4NF Table (Violating 4NF)

StudentID	Course	Hobby
1001	Math	Chess
1001	Math	Music
1001	Science	Chess
1001	Science	Music

✅ Issues:
- One student can take multiple courses.
- One student can have multiple hobbies.
- Courses and hobbies are independent of each other (but both depend on StudentID).
- This causes redundancy because the same student’s hobbies repeat for each course.

4NF-Compliant Tables (Fixing the Violation)

We split the table into two:

1️⃣ StudentCourses Table
| StudentID | Course |
|-----------|--------|
| 1001 | Math |
| 1001 | Science |

2️⃣ StudentHobbies Table
| StudentID | Hobby |
|-----------|--------|
| 1001 | Chess |
| 1001 | Music |

✅ Now:
- Each table represents a single independent relationship.
- No multi-valued dependencies exist.

Benefits of 4NF:
- Removes redundant data caused by independent multi-values.
- Improves data consistency by avoiding duplication.
- Optimizes queries by reducing unnecessary joins.

Next step: Fifth Normal Form (5NF)—handling join dependencies.

Fifth Normal Form (5NF): Eliminating Join Dependencies

A table is in 5NF if:
1. It is already in 4NF.
2. It has no join dependencies that are not implied by candidate keys.

What is a Join Dependency?

A join dependency occurs when a table can be decomposed into smaller tables, but rejoining them does not produce extra or incorrect data.

If a table can be split into multiple tables without losing data and without introducing redundancy, it may need further normalization into 5NF.
This is also called "Projection-Join Normal Form (PJNF)".

Example of a Non-5NF Table (Violating 5NF)

Consider a company assigning projects to suppliers with specific parts.

Supplier	Part	Project
S1	P1	A
S1	P2	A
S2	P1	A
S2	P3	B

✅ Issues:
- This table captures three independent relationships:
1. Supplier ↔ Part
2. Supplier ↔ Project
3. Part ↔ Project
- If we update or delete a row, we may lose independent relationships.

5NF-Compliant Tables (Fixing the Violation)

We decompose the table into three separate tables:

1️⃣ SupplierParts Table
| Supplier | Part |
|----------|------|
| S1 | P1 |
| S1 | P2 |
| S2 | P1 |
| S2 | P3 |

2️⃣ SupplierProjects Table
| Supplier | Project |
|----------|---------|
| S1 | A |
| S2 | A |
| S2 | B |

3️⃣ PartProjects Table
| Part | Project |
|------|---------|
| P1 | A |
| P2 | A |
| P3 | B |

✅ Now:
- The three independent relationships are correctly stored.
- We eliminate redundant data while ensuring no extra tuples appear on joins.

Benefits of 5NF:

Removes redundancy caused by complex many-to-many relationships.
Prevents data anomalies when inserting, updating, or deleting records.
Improves efficiency in large, complex databases with multiple relationships.

The next step is Sixth Normal Form (6NF)—handling temporal data.

Sixth Normal Form (6NF): Handling Temporal Data & Irreducible Relations

6NF is the highest level of normalization and focuses on decomposing tables into irreducible relations while handling temporal data (time-variant information).

Key Concepts of 6NF

Eliminates all forms of redundancy, including temporal dependencies.
Breaks data into irreducible relations (relations that cannot be decomposed further without losing meaning).
Often used in data warehousing, event logging, and versioned databases where historical changes matter.
No join dependencies remain after decomposition.

Why 6NF?

Standard normalization (1NF → 5NF) removes redundancy but does not fully handle time-dependent data (e.g., tracking changes over time).
6NF solves this by storing each version of data as a separate tuple instead of updating rows.

Example Use Case:
- Healthcare records: A patient's medical condition changes over time. Instead of overwriting the latest status, 6NF stores historical versions.
- Finance transactions: Stock prices, interest rates, and account balances change daily and must be stored with timestamps.

Example: Moving from 5NF to 6NF

5NF Table (Tracks Employee Salaries by Department)

employee_id	department	salary	start_date	end_date
101	HR	50000	2024-01-01	2025-01-01
101	HR	52000	2025-01-02	NULL
102	IT	60000	2024-03-15	NULL

🔹 Problems:
- If an employee changes salary, we must update the row (or insert a new one).
- Potential redundancy when multiple attributes change independently (e.g., salary changes but department stays the same).

6NF Decomposition (Separating Temporal Data)

Each attribute is stored in a separate relation with time tracking.

Employee-Department Table (Only Tracks Department Changes)

employee_id	department	start_date	end_date
101	HR	2024-01-01	NULL
102	IT	2024-03-15	NULL

Employee-Salary Table (Only Tracks Salary Changes)

employee_id	salary	start_date	end_date
101	50000	2024-01-01	2025-01-01
101	52000	2025-01-02	NULL
102	60000	2024-03-15	NULL

🔹 Benefits of 6NF:
✔ No redundant updates—each attribute changes independently.
✔ Better historical tracking—we can reconstruct past records.
✔ Efficient querying—pulling only necessary attributes.

Trade-offs of 6NF

Factor	Pros	Cons
Redundancy	Eliminates all data duplication	Requires more storage for historical data
Query Speed	Faster for time-based queries	Slower for reconstructing full records
Data Integrity	Ensures independent attribute changes	Complex to manage across multiple tables

When to Use 6NF?

✅ Historical data storage (e.g., finance, healthcare, compliance).
✅ Event-driven systems (e.g., logging, audit trails).
✅ Temporal databases (e.g., tracking time-sensitive data changes).
❌ Not needed for simple CRUD applications.

6NF is rarely required for most transactional systems but is essential in historical and event-driven databases.

Denormalization for Performance Optimization

Denormalization is the process of intentionally introducing redundancy into a database to improve read performance at the cost of higher storage usage and potential update anomalies.

Why Denormalize?

Normalization (1NF → 5NF) reduces redundancy but can lead to:
- Complex queries with multiple joins, slowing down retrieval.
- High CPU and memory usage due to frequent table joins.
- Scalability issues in high-read environments.

Denormalization pre-joins data to improve read speeds at the expense of storage and update complexity.

When to Use Denormalization

✅ High-read, low-write workloads (e.g., analytics, reporting)
✅ Avoiding complex joins that impact performance
✅ Caching frequently accessed data
✅ Reducing network calls in distributed systems

❌ Not recommended for high-write, transactional databases where data integrity is crucial.

Common Denormalization Techniques

1. Precomputed Aggregate Columns

Store total counts or sums in a table instead of recalculating them.
Example: Instead of summing all order amounts every time, store total_order_amount in the customers table.

2. Storing Derived Data

Example: Store the full name (first_name + last_name) in a column rather than computing it at query time.

3. Duplicating Data Across Tables

Example: Instead of joining orders and customers, store customer_name inside orders for faster retrieval.

4. Adding Redundant Tables (Materialized Views)

Create read-optimized tables that pre-join data from multiple normalized tables.

5. Using Key-Value or Document Stores

Some NoSQL databases (MongoDB, Redis) use denormalized structures natively for performance.

Trade-offs of Denormalization

Factor	Pros	Cons
Read Speed	Faster query execution	Increased storage overhead
Write Speed	Slower (redundant updates)	More complex data consistency
Data Integrity	Can be compromised	Harder to enforce constraints
Storage	Higher (redundant data)	More disk space required

When to Use Normalization vs. Denormalization

Scenario	Normalization	Denormalization
Transactional systems (OLTP)	✅ Best choice	❌ Not ideal
Analytical/Reporting (OLAP)	❌ Can slow down	✅ Improves performance
High write operations	✅ Preferred	❌ Can cause anomalies
High read operations	❌ May require many joins	✅ Faster queries

Example: Normalized vs. Denormalized Approach

Normalized Schema (3NF)

Customers Table
| customer_id | name |
|------------|-------|
| 1 | Alice |
| 2 | Bob |
Orders Table
| order_id | customer_id | total_amount |
|---------|-------------|--------------|
| 101 | 1 | 100.00 |
| 102 | 2 | 250.00 |

🔹 Fetching an order with a customer’s name requires a JOIN:

SELECT orders.order_id, customers.name, orders.total_amount
FROM orders
JOIN customers ON orders.customer_id = customers.customer_id;

Denormalized Schema

Orders Table (with customer name stored)
| order_id | customer_id | customer_name | total_amount |
|---------|-------------|--------------|--------------|
| 101 | 1 | Alice | 100.00 |
| 102 | 2 | Bob | 250.00 |

🔹 Now, no need for a JOIN:

SELECT order_id, customer_name, total_amount FROM orders;

✅ Faster reads but ❌ Updates are harder (if customer name changes, multiple rows must be updated).

Best Practices for Denormalization

✔ Use selectively—only for performance bottlenecks.
✔ Leverage caching (Redis, Memcached) before denormalizing.
✔ Automate redundancy updates via triggers or application logic.
✔ Test query performance before and after denormalization.

Denormalization trades storage and consistency for speed—use it wisely.

Why Most Applications Don’t Need 6NF

While normalization is crucial for designing efficient and scalable databases, most real-world applications do not need to go beyond 3NF or BCNF. Higher normal forms, such as 4NF, 5NF, and 6NF, are useful in specific cases but introduce complexity that may not be necessary for many applications.

Understanding Where Your Application Stands

Most applications fall into one of these categories:

1. Simple CRUD Applications (1NF - 3NF)

Examples: Small business apps, blogs, personal finance apps, inventory tracking systems.
Why? These applications store structured, well-defined data (e.g., customers, orders, products).
Normalization Level: 3NF or BCNF is usually enough to eliminate redundancy and ensure data integrity.

2. Large Enterprise Systems (BCNF - 4NF)

Examples: ERPs, large-scale financial systems, healthcare records.
Why? They require more complex relationships and data consistency while avoiding anomalies.
Normalization Level: BCNF or 4NF is typically applied to handle dependencies effectively.

3. High-Performance or Distributed Systems (Denormalization, 3NF, or Hybrid Approaches)

Examples: Social media platforms, real-time analytics, recommendation systems.
Why? These systems prioritize performance over strict normalization to optimize read-heavy operations.
Normalization Level: Partial normalization + denormalization (mix of normalized and denormalized tables).

4. Temporal & Historical Data Applications (6NF)

Examples: Financial auditing, compliance tracking, medical records.
Why? These systems need accurate historical data without modifying past records.
Normalization Level: 6NF to separate attributes into temporal relations for tracking changes over time.

Trade-offs: Simplicity vs. Normalization Depth

Normalization Level	Best For	Pros	Cons
1NF - 3NF	Most CRUD apps, business applications	Simple, easy to query, avoids redundant data	Can have minor redundancy
BCNF - 4NF	Enterprise apps, structured business data	Ensures strong integrity, prevents anomalies	Queries become more complex
5NF - 6NF	Time-sensitive, versioned databases	Handles historical data & eliminates redundancy completely	Harder to manage, performance-heavy
Denormalized Models	Read-heavy apps (social media, analytics)	Faster reads, easier queries	Increases redundancy, risk of inconsistency

Key Takeaway: Choose the Right Approach for Your Needs

For most applications, 3NF or BCNF is enough.
Go beyond 4NF only if your system requires advanced data integrity (e.g., complex dependencies, temporal tracking).
Denormalization is useful when read performance matters more than strict consistency.

It’s crucial to balance normalization and performance based on your use case rather than strictly following every normal form.