Hacker News

Reading the internals of Postgres: Database cluster, databases, and tables

Logical Structure of Database Cluster

A database cluster is not a collection of database servers in the context of PostgreSQL (the SQL standard uses the term catalog cluster). It means a group of databases managed by a single PostgreSQL instance. That is most probably a correct definition in dictionary terms, but usually when you hear "database cluster" you would assume multiple nodes/instances of databases that act as a single system.

A database is a collection of database objects such as tables, indexes, views, etc. In PostgreSQL, databases are also database objects and they are represented by Oid, which is an unsigned int object identifier.

SELECT oid, datname FROM pg_database ORDER BY oid;
oid datname
1 template1
4 template0
5 postgres

Built-in objects (databases in this query) have hardcoded low values. Other user-created tables/objects start having OIDs from 16384 (OIDs 1–16383 are reserved or used by init objects).

These objects and their relations are stored in system catalogs, which are just regular tables in PostgreSQL. Some examples are:

Table Description
pg_class Tables and other objects that are similar to tables (views, indexes, TOAST tables, etc.)
pg_database Stores information about available databases.
pg_index Index information
... ...

Some trivia about catalogs:

  • pg_database is shared across all databases of a cluster (one table per cluster), whereas most system catalogs are created per database.
  • pg_class stores indexes and there is a pg_index catalog as well. The reason is that pg_class is for generic relational information. pg_index and other respective catalogs have their own customized schema. This helps with separating concerns and not having a pg_class2 table in the future (overly exaggerated, but remembered the famous merchants2 table: https://jimmyhmiller.com/ugliest-beautiful-codebase).

As mentioned above, these are just regular tables - you can run queries on them (at your own risk!). Many built-in objects such as types, functions, and operators are stored in these tables, and user-defined ones are added to them the same way. These OIDs are automatically created when new rows are added to catalog tables. For example, when registering an extension (e.g., pgvector), what happens is that pgvector is added to the pg_extension table with an automatically created OID.

This behavior was true for user-defined tables previously. The chronological history is:

  • PG <= 11: User-defined tables automatically got an OID.
  • PG >= 12: The feature is completely removed.

Physical Structure of DB Cluster

A Postgres cluster stores everything in the data directory. Its path is set by the PGDATA environment variable. Common default locations are /var/lib/pgsql/data and /var/lib/postgresql/ /main. initdb is responsible for setting up and creating this directory, and it is automated by Postgres installers.

When brew install postgresql@18 is called, postgresql@18.rb runs the line below in its post-install method, after Postgres itself is installed:

system bin/"initdb", "--locale=en_US.UTF-8", "-E", "UTF-8", postgresql_datadir unless pg_version_exists?

Similar logic is implemented across other Postgres installation methods (EDB for Windows, apt/deb, etc.).

Inside $PGDATA there are many subdirectories. The full list is in https://www.postgresql.org/docs/current/storage-file-layout.html. Subdirectory changes seem to be very rare. The table here shows some naming changes and new additions in PG9 and PG10. I also checked the Postgres source code and saw that the current_logfiles subdirectory was added in the PG10 release. It is added in 19dc233 which is included starting from PG10:

git tag --contains 19dc233c32f | grep -E '^REL' | sort -V | head
REL_10_0
REL_10_1
REL_10_2
...

Database Subdirectory Layout

As said above, each database has its own subdirectory in the base directory, named after its OID (/base/{OID}). Tables and indexes are stored in a single file under the database subdirectory if their size is under 1GB. Similar to OID, physical files are identified by relfilenode and this information is stored in the pg_class row of the table and/or index.

Now let's explore these layouts a bit. I'm using a Mac. I've installed PostgreSQL 18 with brew install postgresql@18 and run it as a service. Then, connect in a terminal with psql -d postgres.

SHOW data_directory;
data_directory
/opt/homebrew/var/postgresql@18

As described above, the data directory path is /opt/homebrew/var/postgresql@{VERSION}.

Now let's create the shop database and explore it.

CREATE DATABASE shop;
SELECT oid, datname FROM pg_database WHERE datname = 'shop';
oid datname
16384 shop

A new directory with that OID is now on disk:

ls /opt/homebrew/var/postgresql@18/base/
1  16384  4  5

One surprise is that the shop directory is not empty. The reason is that the CREATE DATABASE command creates the database by copying an existing one, and template1 is the default source database for it. Additional details on templates are in the official documentation.

ls /opt/homebrew/var/postgresql@18/base/16384 | head
112  112_fsm  112_vm  113  113_fsm  113_vm  1247  1247_fsm  1247_vm  1249

Create a basic table:

\c shop
CREATE TABLE orders (
    id bigserial PRIMARY KEY,
    customer_id bigint NOT NULL,
    total_cents bigint NOT NULL,
    created_at timestamptz NOT NULL DEFAULT now()
);

Now, this table will have OID for itself and its index for primary key. Let's examine them by querying pg_class:

SELECT oid, relname, relfilenode, relkind FROM pg_class WHERE relname IN ('orders', 'orders_pkey');
oid relname relfilenode relkind
16386 orders 16386 r
16395 orders_pkey 16395 i

As you can see, relfilenodes (identifier for physical location) and OIDs are identical. We can use pg_relation_filepath which is a built-in function to compute paths:

SELECT pg_relation_filepath('orders'), pg_relation_filepath('orders_pkey');
pg_relation_filepath pg_relation_filepath
base/16384/16386 base/16384/16395

Our table and index files are stored in base/{database_oid}/{table|index_relfilenode}.

Insert some rows and trigger VACUUM:

INSERT INTO orders (customer_id, total_cents)
SELECT (random() * 1000)::bigint, (random() * 100000)::bigint
FROM generate_series(1, 1000);
VACUUM orders;

VACUUM is mainly used to reclaim storage. Its details can be checked in the documentation, and I'll delve into it in future posts.

The resulting files are:

ls -l /opt/homebrew/var/postgresql@18/base/16384/{16386*,16395}
-rw-------@ 1 burak admin 65536 Jun 15 10:43 .../base/16384/16386
-rw-------@ 1 burak admin 24576 Jun 15 10:43 .../base/16384/16386_fsm
-rw-------@ 1 burak admin  8192 Jun 15 10:43 .../base/16384/16386_vm
-rw-------@ 1 burak admin 40960 Jun 15 10:43 .../base/16384/16395

For our table, 16386 is the main relation data; 16386_fsm and 16386_vm are its forks. They are called auxiliary files and are used for tracking free space in the table's pages (FSM) and the visibility status of each page (VM).

Now, to demonstrate that OID == relfilenode does not necessarily hold, trigger a VACUUM FULL, which does the following (simplified):

  • get an exclusive lock on the table
  • copy rows to new files (new relfilenode/s)
  • swap the pointer to the new relfilenode and commit
VACUUM FULL orders;
SELECT oid, relname, relfilenode FROM pg_class WHERE relname IN ('orders', 'orders_pkey');
SELECT pg_relation_filepath('orders');
oid relname relfilenode
16386 orders 16397
16395 orders_pkey 16400
pg_relation_filepath
base/16384/16397

The rewrite caused by VACUUM FULL updated the relfilenodes, and they are not equal to the OIDs at the latest state.

Tablespaces

Postgres also supports tablespaces. With tablespaces, a user can specify other directories to store/use database files. It allows the following:

  • extending the partition/volume that Postgres was initially configured on.
  • placing frequently-accessed indexes on fast disks and cold tables on slower ones.

We've talked about the base/ and global/ directories. pg_default is the default tablespace for ordinary database objects, and its path is $PGDATA/base, whereas pg_global is the tablespace for cluster-wide objects.

Let's create a directory and add it as a tablespace.

mkdir -p /Users/burak/pg-tblspc-extra
CREATE TABLESPACE extra_space LOCATION '/Users/burak/pg-tblspc-extra';
SELECT oid, spcname FROM pg_tablespace WHERE spcname = 'extra_space';
oid spcname
16401 extra_space

As expected, our tablespace has an OID as well, and it's 16401. Now, instead of going to the raw path of this tablespace, Postgres creates a symlink between the file paths.

ls -la /opt/homebrew/var/postgresql@18/pg_tblspc/
lrwx------@ 1 burak admin 28 Jun 20 15:56 16401 -> /Users/burak/pg-tblspc-extra

When creating a tablespace, Postgres runs symlink("/Users/burak/pg-tblspc-extra","$PGDATA/pg_tblspc/16401").

How does Postgres know if the file it's searching for is in base/, global/, or symlinked in pg_tblspc/? Internally, there is a mapping below:

  • spcOid == 0$PGDATA/global/
  • spcOid == 1663$PGDATA/base/
  • spcOid == other$PGDATA/pg_tblspc/{spcOid}/{PG_xx_yyy}/

where spcOid is the tablespace OID. You may have noticed PG_xx_yyy, which is PG_{major version}_{catalog version}, and for my local setup, it's:

ls -la /Users/burak/pg-tblspc-extra/
drwx------@ 2 burak staff 64 Jun 20 15:56 PG_18_202506291

Create a table in this tablespace with an index and check their paths:

\c shop
CREATE TABLE archived_orders (LIKE orders INCLUDING ALL) TABLESPACE extra_space;
SELECT oid, relname, reltablespace, relfilenode FROM pg_class WHERE relname IN ('archived_orders', 'archived_orders_pkey');
oid relname reltablespace relfilenode
16402 archived_orders 16401 16402
16411 archived_orders_pkey 0 16411

Our tablespace's OID was 16401, and the table's catalog row reflects that, but the index has reltablespace as 0, which stands for using the database's default tablespace. By default, indexes use the database's default tablespace and need to be overridden with USING INDEX TABLESPACE.

\db+
Name Owner Location Access privileges Options Size Description
extra_space burak /Users/burak/pg-tblspc-extra 96 bytes
pg_default burak 30 MB
pg_global burak 548 kB

Heap Table Structure

Previously, we've talked about data hierarchy and where data is actually located on disk. Now, this section explores how they are stored instead of where.

A heap table is an unordered collection of pages with tuples (~= physical representation of rows) that are stored in random order. Pages are numbered from 0th to Nth in a file, and each is 8192 bytes. This 8192 is configurable only by recompiling Postgres from the source code (--with-blocksize=BLOCKSIZE).

A page has the following data:

  • Headers: Contain general information about the page that Postgres looks at on every access (free-space pointers, checksum, flags, etc.)
  • Line Pointers: Point to tuples. They behave like indexes to tuples and any access to tuples goes through line pointers. They are numbered from 1 to N and sequentially increase.
  • Tuples: Where data is actually stored. Tuples are numbered 1 to N as well, but they are stored starting from the end of the page. Tuples also have headers, a null bitmap, and user data.

Now let's examine how they work and what the actual values are in a Postgres cluster. To do that, I'll utilize a built-in Postgres extension named pageinspect. It introduces functions to inspect the contents of pages at a low level.

Let's create a database, a table, and the extension.

CREATE DATABASE notes_lab;
\c notes_lab
CREATE EXTENSION pageinspect;
CREATE TABLE notes (
    id integer GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    note text NOT NULL
);
INSERT INTO notes (note) VALUES ('small row'), ('second row'), ('third row');

First examine where table data is located as we've checked in the previous section:

SELECT oid, relname, relfilenode, relkind FROM pg_class WHERE relname IN ('notes', 'notes_pkey') ORDER BY relname;
oid relname relfilenode relkind
16461 notes 16461 r
16468 notes_pkey 16468 i

Note that relkind r stands for an ordinary table and i stands for an index (docs).

Let's check the heap table's location and size:

SELECT pg_relation_filepath('notes') AS heap_path, pg_relation_size('notes') AS heap_bytes;
heap_path heap_bytes
base/16413/16461 8192

Location is base/{database_oid}/{relfilenode_oid} as expected, and it consists of 1 page of 8KB at the moment.

We've added three rows to the table and in any SELECT query we can also get the physical location of a row with ctid. ctid is a system column that represents (page|block_number, offset_number) of any row.

SELECT ctid, id, note FROM notes ORDER BY id;
ctid id note
(0,1) 1 small row
(0,2) 2 second row
(0,3) 3 third row

Now we've only 3 rows and one 8KB page, and ctid effectively represents this. For the first row, (0,1) means block 0 and line pointer 1.

SELECT * FROM page_header(get_raw_page('notes', 0));
lsn checksum flags lower upper special pagesize version prune_xid
0/20BBC40 0 0 36 8072 8192 8192 4 0
  • lower = 36: 24 byte PageHeaderData + 3 × 4 byte line pointers.
  • upper = 8072: first byte of tuple storage.
  • free space is upper - lower = 8036 bytes.
  • special = 8192: no special area since this is a simple heap page.

Comments

No comments yet. Start the discussion.