Notes on Postgresql Internals
logical structure
- every database object has its own objects, can be queried in pg_database under catalog
- OID is the id of tables, database, indexes etc
physical structure
- initdb to create physical files under a folder, referred as base-directory. PATH is set with PGDATA
- base directory holds the files of tables, indexes etc
- PG_VERSION env variable
heap table file
Heap Table File
The data file (heap table and index, as well as the free space map and visibility map), it is divided into pages (or blocks) of fixed length, the default is 8192 byte (8 KB)

- heap tuple(s) – A heap tuple is a record data itself. They are stacked in order from the bottom of the page.
- line pointer(s) – a pointer to each heap tuple. It is also called an item pointer. Line pointers form a simple array, which plays the role of index(offset number) to the tuples. When a new tuple is added to the page, a new line pointer is also pushed onto the array to point to the new one.
- header data – It is 24 byte long and contains general information about the page.
- pd_lsn – This variable stores the LSN of XLOG record written by the last change of this page. It is an 8-byte unsigned integer, related to the WAL (Write-Ahead Logging) mechanism.
- pd_checksum – This variable stores the checksum value of this page. (Note that this variable is supported in version 9.3 or later; in earlier versions, this part had stored the timelineId of the page.)
- pd_lower, pd_upper – pd_lower points to the end of line pointers, and pd_upper to the beginning of the newest heap tuple.
- pd_special – This variable is for indexes. In the page within tables, it points to the end of the page.
- tuple identifier (TID) comprises block number and offset number of line pointer
Reading Heap Tuples
Two typical access methods, sequential scan and B-tree index scan, are outlined here:
Sequential scan — All tuples in all pages are sequentially read by scanning all line pointers in each page. See Fig. 1.6(a).
B-tree index scan — An index file contains index tuples, each of which is composed of an index key and a TID pointing to the target heap tuple. If the index tuple with the key that you are looking for has been found, PostgreSQL reads the desired heap tuple using the obtained TID value. (The description of the way to find the index tuples in B-tree index is not explained here as it is very common and the space here is limited. See the relevant materials.) For example, in Fig. 1.6(b), TID value of the obtained index tuple is ‘(block = 7, Offset = 2)’. It means that the target heap tuple is 2nd tuple in the 7th page within the table, so PostgreSQL can read the desired heap tuple without unnecessary scanning in the pages.

Query processing
- parser
produces parse tree from plain text sql statement. Only syntax check is carried out.
- Analyzer
produes query tree from parse tree
- Rewriter
realize the rule system with the query tree
- Planner
it is based on cost-based optimization. no rule-based optimization or hints. there is an extension for hints: pg_hint_plan.
Explain command shows the plan tree generated by planner.
- Executor
uses plan tree to execute.
Cost Estimation in Single-Table Query
- dimensionless. not absolute performance indicators, but indicators for relative performance of operations.
- there are 3 kind of costs: start-up, run and total.
- start-up cost: cost before the first tuple is fetched.
- run cost: cost to fetch all tuples.
3.2.1. Sequential Scan
- does not consider if the scanned page is in the shared buffer or not
3.2.3. Sort
- In the sorting operation, if all tuples to be sorted can be stored in work_mem, the quicksort algorithm is used. Otherwise, a temporary file is created and the file merge sort algorithm is used.
5. Concurrency Control
- There are three broad concurrency control techniques, i.e. Multi-version Concurrency Control (MVCC), Strict Two-Phase Locking (S2PL), and Optimistic Concurrency Control (OCC), and each technique has many variations
- In MVCC, each modification creates its own version of data item while retaining the old version of it
- when a transaction reads a data item, the system selects the correct version of the data item to ensure isolation of the individual transaction
- In MVCC, writers don’t block readers, readers don’t block writers
- In S2PL, writers acquire an exclusive lock which blocks readers
- Postgres uses the variation of MVCC, called Snaption Isolation(SI)
- SI does not allow some anomalies such as Dirty reads, Non-repeatable reads, and Phantom Reads. But it cannot achieve true serializability because it allows serialization anomalies such as Write skew, Read-only transaction
- Serializable Snapshot Isolation(SSI) is introduced with version 9.1 to prevent these
- Postgresql supports these transaction isolation levels: READ COMMITTED, REPEATABLE READS, SERIALIZABLE (https://www.postgresql.org/docs/9.1/transaction-iso.html)

- Postgresql uses SSI for DML(Select, insert, delete…), while 2PL for DDL (create table…)
5.1. Transaction ID(txid)
- txids are assigned once the command following the BEGIN command is issued
- at the viewpoint of txid 100, the txids that are greater than 100 are invisible, as they are “in the future”, while txids that are less than 100 are visible, as they are in the past.

5. 2. Tuple Structure

- t_xmin: the transaction which created the tuple
- t_xmax: the transaction which deleted or updated the tuple. otherwise 0.
- t_cid: hold the command id, which means how many SQL commands are executed before this command in the current transaction beginning from 0. Assume that we execute three INSERT commands within a single transaction: ‘BEGIN; INSERT; INSERT; INSERT; COMMIT;’. If the first command inserts this tuple, t_cid is set to 0. If the second command inserts this, t_cid is set to 1, and so on.
- t_ctid: holds the tuple id which points to itself or a new tuple. When this tuple is updated, the t_ctid of this tuple points to the new tuple; otherwise, the t_ctid points to itself.
5.3. Inserting, deleting and updating tuples

- deleted tuples are regarded as dead tuples and eventually removed in VACUUM processing
- every update operation creates a new tuple which is pointed by the old updated tuple.
5.3.1. Insertion

- t_xmin is set to 99 because this tuple is inserted by txid 99.
- t_xmax is set to 0 because this tuple has not been deleted or updated.
- t_cid is set to 0 because this tuple is the first tuple inserted by txid 99.
- t_ctid is set to (0,1), which points to itself because this is the latest tuple.
5.3.2. Deletion

If txid 111 is committed, Tuple_1 is no longer required. Generally, unneeded tuples are referred to as dead tuples in PostgreSQL.
5.3.3. Update

5.4 Commit Log (clog)
- postgres retains a commit log which holds the status of the current transaction in the shared memory
- transaction status: IN_PROGRESS, COMMITTED, ABORTED, SUB_COMMITTED
- the clog is basically an array whose each item holds the status of a transaction

- transaction logs are maintained by storing in files during server shutdown
5.5. Transaction Snapshot
Transaction snapshot is a dataset that stored information about whether all transactions are active or not, at a certain point in time for an individual transaction. An active transaction means either it is in progress or has not yet started.
Example transaction snapshot: “100:104:100,102” “xmin:xmax:xip_list”,
- txids that are less than 100 are not active.
- txids that are equal to or greater than 104 are active.
- txids 100 and 102 are active since they exist in the xip list, whereas txids 101 and 103 are not active.

- Transaction snapshots are provided by the transaction manager. In the READ COMMITTED isolation level, the transaction obtains a snapshot whenever an SQL command is executed; otherwise (REPEATABLE READ or SERIALIZABLE), the transaction only gets a snapshot when the first SQL command is executed. The obtained transaction snapshot is used for a visibility check of tuples, which is described in Section 5.7.

5.6. Visibility Check Rules
a