How Do Relational Databases Work?

Relational databases are the most widely used database technology for structured data. They power everything from enterprise business systems to mobile apps. But how do they actually work under the hood? This article provides an in-depth look.

History

Relational databases emerged in the 1970s based on Edgar Codd’s relational model formalized in his influential paper titled “A Relational Model of Data for Large Shared Data Banks.” The model provided a solid theoretical foundation for representing and manipulating structured data.

Early Implementations

Early relational database implementations include System R in 1974 at IBM and Ingres in 1976 at UC Berkeley. These pioneers demonstrated the viability of the relational model.

The 1980s saw commercial successes for Oracle and DB2 establishing relational databases as the de facto standard for business applications. SQL was adopted as the primary language for defining and manipulating relational data.

Relational Concepts

Relational databases are based on key theoretical concepts defined by Codd:

Relations

Data is organized into tables called relations.
Each relation represents an entity like Employees, Products.

Attributes

Table columns are called attributes. They represent properties of that entity.

Tuples

Table rows are called tuples. Each tuple represents one instance of the entity like a specific employee.

Keys

Unique identifiers like employee_id allow uniquely identifying tuples.
These unique keys are called candidate keys. One candidate key is designated as the primary key.

Schema

The overall design of relations and their attributes is called the database schema.

Referential Integrity

Relations are linked to each other via foreign key constraints that reference primary keys.
This establishes relationships while ensuring referential integrity.

These core concepts form the foundation of all relational databases.

Architecture

While relational theory defines what a relational database is conceptually, the physical architecture enables efficient storage and querying at scale. The core components of a relational database’s architecture include the following items:

Storage Engine

The storage engine manages physical storage and retrieval of data. These are the popular storage engines.

InnoDB: Default in MySQL. Uses row-level locking for transactions.
MyISAM: Legacy MySQL engine. Table-level locking increases speed.
PostgreSQL: Advanced storage techniques in default PostgreSQL engine.

Buffer Pool

To reduce disk I/O, frequently accessed data is cached in memory in the buffer pool. All queries check buffer pool before looking for data on disk.

Query Optimizer

When a query is received, the query optimizer determines the most efficient way to lookup and retrieve data using indexes, joins etc. This provides huge performance gains.

Query Processor

The query processor executes the optimized query plan by looking up rows in the storage engine’s data files or buffer pool based on access path determined.

Indexing

Indexes like B-Trees and hash indexes allow quickly locating rows by column values without scanning entire tables. Drastically improves query performance.

Transactions

The transaction manager uses locking and other techniques to ensure ACID properties. Atomicity guarantees all-or-nothing execution to maintain consistency.

These components work together to enable storing, querying and manipulating relational data efficiently even at scale.

Data Model

At the core of any relational database is its schema which formally defines data storage structure according to the relational model. It includes,

Tables

Data is stored in two-dimensional tables composed of rows and columns similar to spreadsheets. Multiple related tables store the entire dataset.

Columns

Columns define the attributes of the entity represented by the table. For example, Employees table has columns id, name, salary, dept.

Rows

Each row represents one instance of the entity, like a specific employee with unique ID, name and other attributes stored in the columns.

Keys

A key is one or more columns that uniquely identify a row. The primary key is the main unique identifier for table rows, usually a synthetic id column.

Relationships

Tables are related to each other using special columns called foreign keys that reference primary keys in other tables. Joins merge data across related tables.

This schema of multiple, interconnected tables provides a structured, rigid and portable data model.

Data Manipulation

While the data model determines how data is stored, languages like SQL provide a standard way to access and manipulate that data.

SQL

SQL is the primary language for relational databases that allows creating, querying, updating and deleting data:

DDL statements like CREATE TABLE define the schema.
DML statements like SELECT, INSERT, UPDATE manipulate data itself.
Complex joins, aggregations, subqueries enable powerful analysis.

Stored Procedures

Procedural code can be bundled into stored procedures to modularize SQL operations. Allow parameterization and encapsulation.

Triggers

Triggers enable automatically invoking SQL statements on data change events like updates or deletes to specific tables. Help enforce rules.

Through these mechanisms, relational databases provide a flexible yet consistent way to model, store, access and update structured data at scale.

How does a relational database store its data?

Relational databases store data in tables with rows and columns. These are the key points on how relational databases physically store data:

Data is persisted in fixed tables with predefined columns and data types.
Each row in a table represents a distinct record or entity.
Columns store attributes for that entity. The columns have a defined data type like integer, string, date, etc.
Tables are linked together via primary key and foreign key relationships to model connections between entities.
Behind the scenes, relational database management systems (RDBMS) store data in pages on disk. Data in tables is organized into pages for efficient retrieval via indexes.
Indexes enhance performance by storing a fast lookup mapping between column values and the physical page where that row is stored. This avoids scanning all rows.
To optimize space utilization and access speed, databases use techniques like clustering, partitioning, compression, and caching.
Memory caching mechanisms keep frequently accessed data in RAM to avoid slower disk access.
Database files containing table data are stored on disk in operating system files managed by the RDBMS software.
Storage engines within the RDBMS handle concurrency, transactions, and how data is laid out at the lowest level.

In short, relational databases provide an abstracted view of data as tables, rows and columns to the developer. Underneath, the database system implements complex physical storage and indexing to optimize how this relational data is persisted on disk and in memory. The physical storage is managed entirely by the database engine.

Final Words

While many exciting non-relational and NewSQL datastores have emerged, relational databases based on Codd’s mathematical principles remain the workhorses of the software world. Their proven ability to model real world entities, preserve integrity through transactions and enable complex querying using SQL continues to power thousands of business critical applications today.