Relational databases are the most widely used database technology for structured data. They power everything from enterprise business systems to mobile apps. But how do they actually work under the hood? This article provides an in-depth look.
Relational databases emerged in the 1970s based on Edgar Codd’s relational model formalized in his influential paper titled “A Relational Model of Data for Large Shared Data Banks.” The model provided a solid theoretical foundation for representing and manipulating structured data.
Early relational database implementations include System R in 1974 at IBM and Ingres in 1976 at UC Berkeley. These pioneers demonstrated the viability of the relational model.
The 1980s saw commercial successes for Oracle and DB2 establishing relational databases as the de facto standard for business applications. SQL was adopted as the primary language for defining and manipulating relational data.
Relational databases are based on key theoretical concepts defined by Codd:
- Data is organized into tables called relations.
- Each relation represents an entity like Employees, Products.
- Table columns are called attributes. They represent properties of that entity.
- Table rows are called tuples. Each tuple represents one instance of the entity like a specific employee.
- Unique identifiers like employee_id allow uniquely identifying tuples.
- These unique keys are called candidate keys. One candidate key is designated as the primary key.
- The overall design of relations and their attributes is called the database schema.
- Relations are linked to each other via foreign key constraints that reference primary keys.
- This establishes relationships while ensuring referential integrity.
These core concepts form the foundation of all relational databases.
While relational theory defines what a relational database is conceptually, the physical architecture enables efficient storage and querying at scale. The core components of a relational database’s architecture include the following items:
The storage engine manages physical storage and retrieval of data. These are the popular storage engines.
- InnoDB: Default in MySQL. Uses row-level locking for transactions.
- MyISAM: Legacy MySQL engine. Table-level locking increases speed.
- PostgreSQL: Advanced storage techniques in default PostgreSQL engine.
To reduce disk I/O, frequently accessed data is cached in memory in the buffer pool. All queries check buffer pool before looking for data on disk.
When a query is received, the query optimizer determines the most efficient way to lookup and retrieve data using indexes, joins etc. This provides huge performance gains.
The query processor executes the optimized query plan by looking up rows in the storage engine’s data files or buffer pool based on access path determined.
Indexes like B-Trees and hash indexes allow quickly locating rows by column values without scanning entire tables. Drastically improves query performance.
The transaction manager uses locking and other techniques to ensure ACID properties. Atomicity guarantees all-or-nothing execution to maintain consistency.
These components work together to enable storing, querying and manipulating relational data efficiently even at scale.
At the core of any relational database is its schema which formally defines data storage structure according to the relational model. It includes,
Data is stored in two-dimensional tables composed of rows and columns similar to spreadsheets. Multiple related tables store the entire dataset.
Columns define the attributes of the entity represented by the table. For example, Employees table has columns id, name, salary, dept.
Each row represents one instance of the entity, like a specific employee with unique ID, name and other attributes stored in the columns.
A key is one or more columns that uniquely identify a row. The primary key is the main unique identifier for table rows, usually a synthetic id column.
Tables are related to each other using special columns called foreign keys that reference primary keys in other tables. Joins merge data across related tables.
This schema of multiple, interconnected tables provides a structured, rigid and portable data model.
While the data model determines how data is stored, languages like SQL provide a standard way to access and manipulate that data.
SQL is the primary language for relational databases that allows creating, querying, updating and deleting data:
- DDL statements like CREATE TABLE define the schema.
- DML statements like SELECT, INSERT, UPDATE manipulate data itself.
- Complex joins, aggregations, subqueries enable powerful analysis.
Procedural code can be bundled into stored procedures to modularize SQL operations. Allow parameterization and encapsulation.
Triggers enable automatically invoking SQL statements on data change events like updates or deletes to specific tables. Help enforce rules.
Through these mechanisms, relational databases provide a flexible yet consistent way to model, store, access and update structured data at scale.
How does a relational database store its data?
Relational databases store data in tables with rows and columns. These are the key points on how relational databases physically store data:
- Data is persisted in fixed tables with predefined columns and data types.
- Each row in a table represents a distinct record or entity.
- Columns store attributes for that entity. The columns have a defined data type like integer, string, date, etc.
- Tables are linked together via primary key and foreign key relationships to model connections between entities.
- Behind the scenes, relational database management systems (RDBMS) store data in pages on disk. Data in tables is organized into pages for efficient retrieval via indexes.
- Indexes enhance performance by storing a fast lookup mapping between column values and the physical page where that row is stored. This avoids scanning all rows.
- To optimize space utilization and access speed, databases use techniques like clustering, partitioning, compression, and caching.
- Memory caching mechanisms keep frequently accessed data in RAM to avoid slower disk access.
- Database files containing table data are stored on disk in operating system files managed by the RDBMS software.
- Storage engines within the RDBMS handle concurrency, transactions, and how data is laid out at the lowest level.
In short, relational databases provide an abstracted view of data as tables, rows and columns to the developer. Underneath, the database system implements complex physical storage and indexing to optimize how this relational data is persisted on disk and in memory. The physical storage is managed entirely by the database engine.
While many exciting non-relational and NewSQL datastores have emerged, relational databases based on Codd’s mathematical principles remain the workhorses of the software world. Their proven ability to model real world entities, preserve integrity through transactions and enable complex querying using SQL continues to power thousands of business critical applications today.