Reading the internals of Postgres: Database cluster, databases, and tables
Logical Structure of Database Cluster
A database cluster is not a collection of database servers in the context of PostgreSQL (the SQL standard uses the term catalog cluster). It means a group of databases managed by a single PostgreSQL instance. That is most probably a correct definition in dictionary terms, but usually when you hear "database cluster" you would assume multiple nodes/instances of databases that act as a single system.
A database is a collection of database objects such as tables, indexes, views, etc. In PostgreSQL, databases are also database objects and they are represented by Oid, which is an unsigned int object identifier.
SELECT oid, datname FROM pg_database ORDER BY oid;
| oid | datname |
|---|---|
| 1 | template1 |
| 4 | template0 |
| 5 | postgres |
Built-in objects (databases in this query) have hardcoded low values. Other user-created tables/objects start having OIDs from 16384 (OIDs 1–16383 are reserved or used by init objects).
These objects and their relations are stored in system catalogs, which are just regular tables in PostgreSQL. Some examples are:
| Table | Description |
|---|---|
pg_class |
Tables and other objects that are similar to tables (views, indexes, TOAST tables, etc.) |
pg_database |
Stores information about available databases. |
pg_index |
Index information |
| ... | ... |
Some trivia about catalogs:
pg_databaseis shared across all databases of a cluster (one table per cluster), whereas most system catalogs are created per database.pg_classstores indexes and there is apg_indexcatalog as well. The reason is thatpg_classis for generic relational information.pg_indexand other respective catalogs have their own customized schema. This helps with separating concerns and not having apg_class2table in the future (overly exaggerated, but remembered the famousmerchants2table: https://jimmyhmiller.com/ugliest-beautiful-codebase).
As mentioned above, these are just regular tables - you can run queries on them (at your own risk!). Many built-in objects such as types, functions, and operators are stored in these tables, and user-defined ones are added to them the same way. These OIDs are automatically created when new rows are added to catalog tables. For example, when registering an extension (e.g., pgvector), what happens is that pgvector is added to the pg_extension table with an automatically created OID.
This behavior was true for user-defined tables previously. The chronological history is:
- PG <= 11: User-defined tables automatically got an OID.
- PG >= 12: The feature is completely removed.
Physical Structure of DB Cluster
A Postgres cluster stores everything in the data directory. Its path is set by the PGDATA environment variable. Common default locations are /var/lib/pgsql/data and /var/lib/postgresql/ /main. initdb is responsible for setting up and creating this directory, and it is automated by Postgres installers.
When brew install postgresql@18 is called, postgresql@18.rb runs the line below in its post-install method, after Postgres itself is installed:
system bin/"initdb", "--locale=en_US.UTF-8", "-E", "UTF-8", postgresql_datadir unless pg_version_exists?
Similar logic is implemented across other Postgres installation methods (EDB for Windows, apt/deb, etc.).
Inside $PGDATA there are many subdirectories. The full list is in https://www.postgresql.org/docs/current/storage-file-layout.html. Subdirectory changes seem to be very rare. The table here shows some naming changes and new additions in PG9 and PG10. I also checked the Postgres source code and saw that the current_logfiles subdirectory was added in the PG10 release. It is added in 19dc233 which is included starting from PG10:
git tag --contains 19dc233c32f | grep -E '^REL' | sort -V | head
REL_10_0
REL_10_1
REL_10_2
...
Database Subdirectory Layout
As said above, each database has its own subdirectory in the base directory, named after its OID (/base/{OID}). Tables and indexes are stored in a single file under the database subdirectory if their size is under 1GB. Similar to OID, physical files are identified by relfilenode and this information is stored in the pg_class row of the table and/or index.
Now let's explore these layouts a bit. I'm using a Mac. I've installed PostgreSQL 18 with brew install postgresql@18 and run it as a service. Then, connect in a terminal with psql -d postgres.
SHOW data_directory;
| data_directory |
|---|
| /opt/homebrew/var/postgresql@18 |
As described above, the data directory path is /opt/homebrew/var/postgresql@{VERSION}.
Now let's create the shop database and explore it.
CREATE DATABASE shop;
SELECT oid, datname FROM pg_database WHERE datname = 'shop';
| oid | datname |
|---|---|
| 16384 | shop |
A new directory with that OID is now on disk:
ls /opt/homebrew/var/postgresql@18/base/
1 16384 4 5
One surprise is that the shop directory is not empty. The reason is that the CREATE DATABASE command creates the database by copying an existing one, and template1 is the default source database for it. Additional details on templates are in the official documentation.
ls /opt/homebrew/var/postgresql@18/base/16384 | head
112 112_fsm 112_vm 113 113_fsm 113_vm 1247 1247_fsm 1247_vm 1249
Create a basic table:
\c shop
CREATE TABLE orders (
id bigserial PRIMARY KEY,
customer_id bigint NOT NULL,
total_cents bigint NOT NULL,
created_at timestamptz NOT NULL DEFAULT now()
);
Now, this table will have OID for itself and its index for primary key. Let's examine them by querying pg_class:
SELECT oid, relname, relfilenode, relkind FROM pg_class WHERE relname IN ('orders', 'orders_pkey');
| oid | relname | relfilenode | relkind |
|---|---|---|---|
| 16386 | orders | 16386 | r |
| 16395 | orders_pkey | 16395 | i |
As you can see, relfilenodes (identifier for physical location) and OIDs are identical. We can use pg_relation_filepath which is a built-in function to compute paths:
SELECT pg_relation_filepath('orders'), pg_relation_filepath('orders_pkey');
| pg_relation_filepath | pg_relation_filepath |
|---|---|
| base/16384/16386 | base/16384/16395 |
Our table and index files are stored in base/{database_oid}/{table|index_relfilenode}.
Insert some rows and trigger VACUUM:
INSERT INTO orders (customer_id, total_cents)
SELECT (random() * 1000)::bigint, (random() * 100000)::bigint
FROM generate_series(1, 1000);
VACUUM orders;
VACUUM is mainly used to reclaim storage. Its details can be checked in the documentation, and I'll delve into it in future posts.
The resulting files are:
ls -l /opt/homebrew/var/postgresql@18/base/16384/{16386*,16395}
-rw-------@ 1 burak admin 65536 Jun 15 10:43 .../base/16384/16386
-rw-------@ 1 burak admin 24576 Jun 15 10:43 .../base/16384/16386_fsm
-rw-------@ 1 burak admin 8192 Jun 15 10:43 .../base/16384/16386_vm
-rw-------@ 1 burak admin 40960 Jun 15 10:43 .../base/16384/16395
For our table, 16386 is the main relation data; 16386_fsm and 16386_vm are its forks. They are called auxiliary files and are used for tracking free space in the table's pages (FSM) and the visibility status of each page (VM).
Now, to demonstrate that OID == relfilenode does not necessarily hold, trigger a VACUUM FULL, which does the following (simplified):
- get an exclusive lock on the table
- copy rows to new files (new relfilenode/s)
- swap the pointer to the new relfilenode and commit
VACUUM FULL orders;
SELECT oid, relname, relfilenode FROM pg_class WHERE relname IN ('orders', 'orders_pkey');
SELECT pg_relation_filepath('orders');
| oid | relname | relfilenode |
|---|---|---|
| 16386 | orders | 16397 |
| 16395 | orders_pkey | 16400 |
| pg_relation_filepath |
|---|
| base/16384/16397 |
The rewrite caused by VACUUM FULL updated the relfilenodes, and they are not equal to the OIDs at the latest state.
Tablespaces
Postgres also supports tablespaces. With tablespaces, a user can specify other directories to store/use database files. It allows the following:
- extending the partition/volume that Postgres was initially configured on.
- placing frequently-accessed indexes on fast disks and cold tables on slower ones.
We've talked about the base/ and global/ directories. pg_default is the default tablespace for ordinary database objects, and its path is $PGDATA/base, whereas pg_global is the tablespace for cluster-wide objects.
Let's create a directory and add it as a tablespace.
mkdir -p /Users/burak/pg-tblspc-extra
CREATE TABLESPACE extra_space LOCATION '/Users/burak/pg-tblspc-extra';
SELECT oid, spcname FROM pg_tablespace WHERE spcname = 'extra_space';
| oid | spcname |
|---|---|
| 16401 | extra_space |
As expected, our tablespace has an OID as well, and it's 16401. Now, instead of going to the raw path of this tablespace, Postgres creates a symlink between the file paths.
ls -la /opt/homebrew/var/postgresql@18/pg_tblspc/
lrwx------@ 1 burak admin 28 Jun 20 15:56 16401 -> /Users/burak/pg-tblspc-extra
When creating a tablespace, Postgres runs symlink("/Users/burak/pg-tblspc-extra","$PGDATA/pg_tblspc/16401").
How does Postgres know if the file it's searching for is in base/, global/, or symlinked in pg_tblspc/? Internally, there is a mapping below:
spcOid == 0→$PGDATA/global/spcOid == 1663→$PGDATA/base/spcOid == other→$PGDATA/pg_tblspc/{spcOid}/{PG_xx_yyy}/
where spcOid is the tablespace OID. You may have noticed PG_xx_yyy, which is PG_{major version}_{catalog version}, and for my local setup, it's:
ls -la /Users/burak/pg-tblspc-extra/
drwx------@ 2 burak staff 64 Jun 20 15:56 PG_18_202506291
Create a table in this tablespace with an index and check their paths:
\c shop
CREATE TABLE archived_orders (LIKE orders INCLUDING ALL) TABLESPACE extra_space;
SELECT oid, relname, reltablespace, relfilenode FROM pg_class WHERE relname IN ('archived_orders', 'archived_orders_pkey');
| oid | relname | reltablespace | relfilenode |
|---|---|---|---|
| 16402 | archived_orders | 16401 | 16402 |
| 16411 | archived_orders_pkey | 0 | 16411 |
Our tablespace's OID was 16401, and the table's catalog row reflects that, but the index has reltablespace as 0, which stands for using the database's default tablespace. By default, indexes use the database's default tablespace and need to be overridden with USING INDEX TABLESPACE.
\db+
| Name | Owner | Location | Access privileges | Options | Size | Description |
|---|---|---|---|---|---|---|
| extra_space | burak | /Users/burak/pg-tblspc-extra | 96 bytes | |||
| pg_default | burak | 30 MB | ||||
| pg_global | burak | 548 kB |
Heap Table Structure
Previously, we've talked about data hierarchy and where data is actually located on disk. Now, this section explores how they are stored instead of where.
A heap table is an unordered collection of pages with tuples (~= physical representation of rows) that are stored in random order. Pages are numbered from 0th to Nth in a file, and each is 8192 bytes. This 8192 is configurable only by recompiling Postgres from the source code (--with-blocksize=BLOCKSIZE).
A page has the following data:
- Headers: Contain general information about the page that Postgres looks at on every access (free-space pointers, checksum, flags, etc.)
- Line Pointers: Point to tuples. They behave like indexes to tuples and any access to tuples goes through line pointers. They are numbered from 1 to N and sequentially increase.
- Tuples: Where data is actually stored. Tuples are numbered 1 to N as well, but they are stored starting from the end of the page. Tuples also have headers, a null bitmap, and user data.
Now let's examine how they work and what the actual values are in a Postgres cluster. To do that, I'll utilize a built-in Postgres extension named pageinspect. It introduces functions to inspect the contents of pages at a low level.
Let's create a database, a table, and the extension.
CREATE DATABASE notes_lab;
\c notes_lab
CREATE EXTENSION pageinspect;
CREATE TABLE notes (
id integer GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
note text NOT NULL
);
INSERT INTO notes (note) VALUES ('small row'), ('second row'), ('third row');
First examine where table data is located as we've checked in the previous section:
SELECT oid, relname, relfilenode, relkind FROM pg_class WHERE relname IN ('notes', 'notes_pkey') ORDER BY relname;
| oid | relname | relfilenode | relkind |
|---|---|---|---|
| 16461 | notes | 16461 | r |
| 16468 | notes_pkey | 16468 | i |
Note that relkind r stands for an ordinary table and i stands for an index (docs).
Let's check the heap table's location and size:
SELECT pg_relation_filepath('notes') AS heap_path, pg_relation_size('notes') AS heap_bytes;
| heap_path | heap_bytes |
|---|---|
| base/16413/16461 | 8192 |
Location is base/{database_oid}/{relfilenode_oid} as expected, and it consists of 1 page of 8KB at the moment.
We've added three rows to the table and in any SELECT query we can also get the physical location of a row with ctid. ctid is a system column that represents (page|block_number, offset_number) of any row.
SELECT ctid, id, note FROM notes ORDER BY id;
| ctid | id | note |
|---|---|---|
| (0,1) | 1 | small row |
| (0,2) | 2 | second row |
| (0,3) | 3 | third row |
Now we've only 3 rows and one 8KB page, and ctid effectively represents this. For the first row, (0,1) means block 0 and line pointer 1.
SELECT * FROM page_header(get_raw_page('notes', 0));
| lsn | checksum | flags | lower | upper | special | pagesize | version | prune_xid |
|---|---|---|---|---|---|---|---|---|
| 0/20BBC40 | 0 | 0 | 36 | 8072 | 8192 | 8192 | 4 | 0 |
lower = 36: 24 bytePageHeaderData+ 3 × 4 byte line pointers.upper = 8072: first byte of tuple storage.- free space is
upper - lower = 8036bytes. special = 8192: no special area since this is a simple heap page.
Comments
No comments yet. Start the discussion.