Implementation of MVCC (Multi-Version Concurrency Control) in PostgreSQL is different and special when compared with other RDBMS. MVCC in PostgreSQL controls which tuples can be visible to transactions via versioning.
What is versioning in PostgreSQL?
Let’s consider the case of an Oracle or a MySQL Database. What happens when you perform a DELETE or an UPDATE of a row? You see an UNDO record maintained in a global UNDO Segment. This UNDO segment contains the past image of a row, to help database achieve consistency. (the “C” in A.C.I.D). For example, if there is an old transaction that depends on the row that got deleted, the row may still be visible to it because the past image is still maintained in the UNDO. If you are an Oracle DBA reading this blog post, you may quickly recollect the error ORA–01555snapshot too old . What this error means is—you may have a smaller undo_retention or not a huge UNDO segment that could retain all the past images (versions) needed by the existing or old transactions.
You may not have to worry about that with PostgreSQL.
Then how does PostgreSQL manage UNDO ?
In simple terms, PostgreSQL maintains both the past image and the latest image of a row in its own Table. It means, UNDO is maintained within each table. And this is done through versioning. Now, we may get a hint that, every row of PostgreSQL table has a version number. And that is absolutely correct. In order to understand how these versions are maintained within each table, you should understand the hidden columns of a table (especially xmin) in PostgreSQL.
Understanding the Hidden Columns of a Table
When you describe a table, you would only see the columns you have added, like you see in the following log.
1
2
3
4
5
6
7
|
percona=# \d scott.employee
Table “scott.employee”
Column | Type | Collation | Nullable | Default
—————+————————————+—————–+—————+————————————————————————
emp_id | integer | | not null | nextval(‘scott.employee_emp_id_seq’::regclass)
emp_name | character varying(100) | | |
dept_id | integer | | |
|
However, if you look at all the columns of the table in pg_attribute, you should see several hidden columns as you see in the following log.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
|
percona=# SELECT attname, format_type (atttypid, atttypmod)
FROM pg_attribute
WHERE attrelid::regclass::text=‘scott.employee’
ORDER BY attnum;
attname | format_type
—————+————————————
tableoid | oid
cmax | cid
xmax | xid
cmin | cid
xmin | xid
ctid | tid
emp_id | integer
emp_name | character varying(100)
dept_id | integer
(9 rows)
|
Let’s understand a few of these hidden columns in detail.
tableoid : Contains the OID of the table that contains this row. Used by queries that select from inheritance hierarchies.
More details on table inheritance can be found here : https://www.postgresql.org/docs/10/static/ddl-inherit.html
xmin : The transaction ID(xid) of the inserting transaction for this row version. Upon update, a new row version is inserted. Let’s see the following log to understand the xmin more.
1
2
3
4
5
6
7
8
9
10
11
12
13
|
percona=# select txid_current();
txid_current
———————
646
(1 row)
percona=# INSERT into scott.employee VALUES (9,’avi’,9);
INSERT 0 1
percona=# select xmin,xmax,cmin,cmax,* from scott.employee where emp_id = 9;
xmin | xmax | cmin | cmax | emp_id | emp_name | dept_id
———+———+———+———+————+—————+————–
647 | 0 | 0 | 0 | 9 | avi | 9
(1 row)
|
As you see in the above log, the transaction ID was 646 for the command => select txid_current().
Thus, the immediate INSERT statement got a transaction ID 647. Hence, the record was assigned an xmin of 647. This means, no transaction ID that has started before the ID 647, can see this row. In other words, already running transactions with txid less than 647 cannot see the row inserted by txid 647.
With the above example, you should now understand that every tuple has an xmin that is assigned the txid that inserted it.
Note: the behavior may change depending on the isolation levels you choose, would be discussed later in another blog post.
xmax : This values is 0 if it was not a deleted row version. Before the DELETE is committed, the xmax of the row version changes to the ID of the transaction that has issued the DELETE. Let’s observe the following log to understand that better.
On Terminal A : We open a transaction and delete a row without committing it.
1
2
3
4
5
6
7
8
9
10
|
percona=# BEGIN;
BEGIN
percona=# select txid_current();
txid_current
———————
655
(1 row)
percona=# DELETE from scott.employee where emp_id = 10;
DELETE 1
|
On Terminal B : Observe the xmax values before and after the delete (that has not been committed).
1
2
3
4
5
6
7
8
9
10
11
12
13
14
|
Before the Delete
—————————
percona=# select xmin,xmax,cmin,cmax,* from scott.employee where emp_id = 10;
xmin | xmax | cmin | cmax | emp_id | emp_name | dept_id
———+———+———+———+————+—————+————–
649 | 0 | 0 | 0 | 10 | avi | 10
After the Delete
—————————
percona=# select xmin,xmax,cmin,cmax,* from scott.employee where emp_id = 10;
xmin | xmax | cmin | cmax | emp_id | emp_name | dept_id
———+———+———+———+————+—————+————–
649 | 655 | 0 | 0 | 10 | avi | 10
(1 row)
|
As you see in the above logs, the xmax value changed to the transaction ID that has issued the delete. If you have issued a ROLLBACK, or if the transaction got aborted, xmax remains at the transaction ID that tried to DELETE it (which is 655) in this case.
Now that we understand the hidden columns xmin and xmax, let’s observe what happens after a DELETE or an UPDATE in PostgreSQL. As we discussed earlier, through the hidden columns in PostgreSQL for every table, we understand that there are multiple versions of rows maintained within each table. Let’s see the following example to understand this better.
We’ll insert 10 records to the table : scott.employee
1
2
|
percona=# INSERT into scott.employee VALUES (generate_series(1,10),’avi’,1);
INSERT 0 10
|
Now, let’s DELETE 5 records from the table.
1
2
3
4
5
6
7
|
percona=# DELETE from scott.employee where emp_id > 5;
DELETE 5
percona=# select count(*) from scott.employee;
count
———–
5
(1 row)
|
Now, when you check the count after DELETE, you would not see the records that have been DELETED. To see any row versions that exist in the table but are not visible, we have an extension called pageinspect. The pageinspect
module provides functions that allow you to inspect the contents of database pages at a low level, which is useful for debugging purposes. Let’s create this extension to see the older row versions those have been deleted.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
|
percona=# CREATE EXTENSION pageinspect;
CREATE EXTENSION
percona=# SELECT t_xmin, t_xmax, tuple_data_split(‘scott.employee’::regclass, t_data, t_infomask, t_infomask2, t_bits) FROM heap_page_items(get_raw_page(‘scott.employee’, 0));
t_xmin | t_xmax | tuple_data_split
————+————+——————————————————————–
668 | 0 | {“\\x01000000”,“\\x09617669”,“\\x01000000”}
668 | 0 | {“\\x02000000”,“\\x09617669”,“\\x01000000”}
668 | 0 | {“\\x03000000”,“\\x09617669”,“\\x01000000”}
668 | 0 | {“\\x04000000”,“\\x09617669”,“\\x01000000”}
668 | 0 | {“\\x05000000”,“\\x09617669”,“\\x01000000”}
668 | 669 | {“\\x06000000”,“\\x09617669”,“\\x01000000”}
668 | 669 | {“\\x07000000”,“\\x09617669”,“\\x01000000”}
668 | 669 | {“\\x08000000”,“\\x09617669”,“\\x01000000”}
668 | 669 | {“\\x09000000”,“\\x09617669”,“\\x01000000”}
668 | 669 | {“\\x0a000000”,“\\x09617669”,“\\x01000000”}
(10 rows)
|
Now, we could still see 10 records in the table even after deleting 5 records from it. Also, you can observe here that t_xmax is set to the transaction ID that has deleted them. These deleted records are retained in the same table to serve any of the older transactions that are still accessing them.
We’ll take a look at what an UPDATE would do in the following Log.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
|
percona=# DROP TABLE scott.employee ;
DROP TABLE
percona=# CREATE TABLE scott.employee (emp_id INT, emp_name VARCHAR(100), dept_id INT);
CREATE TABLE
percona=# INSERT into scott.employee VALUES (generate_series(1,10),’avi’,1);
INSERT 0 10
percona=# UPDATE scott.employee SET emp_name = ‘avii’;
UPDATE 10
percona=# SELECT t_xmin, t_xmax, tuple_data_split(‘scott.employee’::regclass, t_data, t_infomask, t_infomask2, t_bits) FROM heap_page_items(get_raw_page(‘scott.employee’, 0));
t_xmin | t_xmax | tuple_data_split
————+————+———————————————————————–
672 | 673 | {“\\x01000000”,“\\x09617669”,“\\x01000000”}
672 | 673 | {“\\x02000000”,“\\x09617669”,“\\x01000000”}
672 | 673 | {“\\x03000000”,“\\x09617669”,“\\x01000000”}
672 | 673 | {“\\x04000000”,“\\x09617669”,“\\x01000000”}
672 | 673 | {“\\x05000000”,“\\x09617669”,“\\x01000000”}
672 | 673 | {“\\x06000000”,“\\x09617669”,“\\x01000000”}
672 | 673 | {“\\x07000000”,“\\x09617669”,“\\x01000000”}
672 | 673 | {“\\x08000000”,“\\x09617669”,“\\x01000000”}
672 | 673 | {“\\x09000000”,“\\x09617669”,“\\x01000000”}
672 | 673 | {“\\x0a000000”,“\\x09617669”,“\\x01000000”}
673 | 0 | {“\\x01000000”,“\\x0b61766969”,“\\x01000000”}
673 | 0 | {“\\x02000000”,“\\x0b61766969”,“\\x01000000”}
673 | 0 | {“\\x03000000”,“\\x0b61766969”,“\\x01000000”}
673 | 0 | {“\\x04000000”,“\\x0b61766969”,“\\x01000000”}
673 | 0 | {“\\x05000000”,“\\x0b61766969”,“\\x01000000”}
673 | 0 | {“\\x06000000”,“\\x0b61766969”,“\\x01000000”}
673 | 0 | {“\\x07000000”,“\\x0b61766969”,“\\x01000000”}
673 | 0 | {“\\x08000000”,“\\x0b61766969”,“\\x01000000”}
673 | 0 | {“\\x09000000”,“\\x0b61766969”,“\\x01000000”}
673 | 0 | {“\\x0a000000”,“\\x0b61766969”,“\\x01000000”}
(20 rows)
|
An UPDATE in PostgreSQL would perform an insert and a delete. Hence, all the records being UPDATED have been deleted and inserted back with the new value. Deleted records have non-zero t_xmax value.
Records for which you see a non-zero value for t_xmax may be required by the previous transactions to ensure consistency based on appropriate isolation levels.
We discussed about xmin and xmax. What are these hidden columns cmin and cmax ?
cmax : The command identifier within the deleting transaction or zero. (As per the documentation). However, both cmin and cmax are always the same as per the PostgreSQL source code.
cmin : The command identifier within the inserting transaction. You could see the cmin of the 3 insert statements starting with 0, in the following log.
See the following log to understand how the cmin and cmax values change through inserts and deletes in a transaction.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
|
On Terminal A
———————–
percona=# BEGIN;
BEGIN
percona=# INSERT into scott.employee VALUES (1,’avi’,2);
INSERT 0 1
percona=# INSERT into scott.employee VALUES (2,’avi’,2);
INSERT 0 1
percona=# INSERT into scott.employee VALUES (3,’avi’,2);
INSERT 0 1
percona=# INSERT into scott.employee VALUES (4,’avi’,2);
INSERT 0 1
percona=# INSERT into scott.employee VALUES (5,’avi’,2);
INSERT 0 1
percona=# INSERT into scott.employee VALUES (6,’avi’,2);
INSERT 0 1
percona=# INSERT into scott.employee VALUES (7,’avi’,2);
INSERT 0 1
percona=# INSERT into scott.employee VALUES (8,’avi’,2);
INSERT 0 1
percona=# COMMIT;
COMMIT
percona=# select xmin,xmax,cmin,cmax,* from scott.employee;
xmin | xmax | cmin | cmax | emp_id | emp_name | dept_id
———+———+———+———+————+—————+————–
644 | 0 | 0 | 0 | 1 | avi | 2
644 | 0 | 1 | 1 | 2 | avi | 2
644 | 0 | 2 | 2 | 3 | avi | 2
644 | 0 | 3 | 3 | 4 | avi | 2
644 | 0 | 4 | 4 | 5 | avi | 2
644 | 0 | 5 | 5 | 6 | avi | 2
644 | 0 | 6 | 6 | 7 | avi | 2
644 | 0 | 7 | 7 | 8 | avi | 2
(8 rows)
|
If you observe the above output log, you see cmin and cmax values incrementing for each insert.
Now let’s delete 3 records from Terminal A and observe how the values appear in Terminal B before COMMIT.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
|
On Terminal A
———————–
percona=# BEGIN;
BEGIN
percona=# DELETE from scott.employee where emp_id = 4;
DELETE 1
percona=# DELETE from scott.employee where emp_id = 5;
DELETE 1
percona=# DELETE from scott.employee where emp_id = 6;
DELETE 1
On Terminal B, before issuing COMMIT on Terminal A
——————————————————————————
percona=# select xmin,xmax,cmin,cmax,* from scott.employee;
xmin | xmax | cmin | cmax | emp_id | emp_name | dept_id
———+———+———+———+————+—————+————–
644 | 0 | 0 | 0 | 1 | avi | 2
644 | 0 | 1 | 1 | 2 | avi | 2
644 | 0 | 2 | 2 | 3 | avi | 2
644 | 645 | 0 | 0 | 4 | avi | 2
644 | 645 | 1 | 1 | 5 | avi | 2
644 | 645 | 2 | 2 | 6 | avi | 2
644 | 0 | 6 | 6 | 7 | avi | 2
644 | 0 | 7 | 7 | 8 | avi | 2
(8 rows)
|
Now, in the above log, you see that the cmax and cmin values have incrementally started from 0 for the records being deleted. Their values where different before the delete, as we have seen earlier. Even if you ROLLBACK, the values remain the same.
After understanding the hidden columns and how PostgreSQL maintains UNDO as multiple versions of rows, the next question would be—what would clean up this UNDO from a table? Doesn’t this increase the size of a table continuously? In order to understand that better, we need to know about VACUUM in PostgreSQL.
VACUUM in PostgreSQL
As seen in the above examples, every such record that has been deleted but is still taking some space is called a dead tuple. Once there is no dependency on those dead tuples with the already running transactions, the dead tuples are no longer needed. Thus, PostgreSQL runs VACUUM on such Tables. VACUUM reclaims the storage occupied by these dead tuples. The space occupied by these dead tuples may be referred to as Bloat. VACUUM scans the pages for dead tuples and marks them to the freespace map (FSM). Each relation apart from hash indexes has an FSM stored in a separate file called <relation_oid>_fsm.
Here, relation_oid is the oid of the relation that is visible in pg_class.
1
2
3
4
5
|
percona=# select oid from pg_class where relname = ’employee’;
oid
———–
24613
(1 row)
|
Upon VACUUM, this space is not reclaimed to disk but can be re-used by future inserts on this table. VACUUM stores the free space available on each heap (or index) page to the FSM file.
Running a VACUUM is a non-blocking operation. It never causes exclusive locks on tables. This means VACUUM can run on a busy transactional table in production while there are several transactions writing to it.
As we discussed earlier, an UPDATE of 10 records has generated 10 dead tuples. Let us see the following log to understand what happens to those dead tuples after a VACUUM.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
|
percona=# VACUUM scott.employee ;
VACUUM
percona=# SELECT t_xmin, t_xmax, tuple_data_split(‘scott.employee’::regclass, t_data, t_infomask, t_infomask2, t_bits) FROM heap_page_items(get_raw_page(‘scott.employee’, 0));
t_xmin | t_xmax | tuple_data_split
————+————+———————————————————————–
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
673 | 0 | {“\\x01000000”,“\\x0b61766969”,“\\x01000000”}
673 | 0 | {“\\x02000000”,“\\x0b61766969”,“\\x01000000”}
673 | 0 | {“\\x03000000”,“\\x0b61766969”,“\\x01000000”}
673 | 0 | {“\\x04000000”,“\\x0b61766969”,“\\x01000000”}
673 | 0 | {“\\x05000000”,“\\x0b61766969”,“\\x01000000”}
673 | 0 | {“\\x06000000”,“\\x0b61766969”,“\\x01000000”}
673 | 0 | {“\\x07000000”,“\\x0b61766969”,“\\x01000000”}
673 | 0 | {“\\x08000000”,“\\x0b61766969”,“\\x01000000”}
673 | 0 | {“\\x09000000”,“\\x0b61766969”,“\\x01000000”}
673 | 0 | {“\\x0a000000”,“\\x0b61766969”,“\\x01000000”}
(20 rows)
|
In the above log, you might notice that the dead tuples are removed and the space is available for re-use. However, this space is not reclaimed to filesystem after VACUUM. Only the future inserts can use this space.
VACUUM does an additional task. All the rows that are inserted and successfully committed in the past are marked as frozen, which indicates that they are visible to all the current and future transactions. We will be discussing this in detail in our future blog post “Transaction ID Wraparound in PostgreSQL”.
VACUUM does not usually reclaim the space to filesystem unless the dead tuples are beyond the high water mark.
Let’s consider the following example to see when a VACUUM could release the space to filesystem.
Create a table and insert some sample records. The records are physically ordered on the disk based on the primary key index.
1
2
3
4
|
percona=# CREATE TABLE scott.employee (emp_id int PRIMARY KEY, name varchar(20), dept_id int);
CREATE TABLE
percona=# INSERT INTO scott.employee VALUES (generate_series(1,1000), ‘avi’, 1);
INSERT 0 1000
|
Now, run ANALYZE on the table to update its statistics and see how many pages are allocated to the table after the above insert.
1
2
3
4
5
6
7
8
9
|
percona=# ANALYZE scott.employee ;
ANALYZE
percona=# select relpages, relpages*8192 as total_bytes, pg_relation_size(‘scott.employee’) as relsize
FROM pg_class
WHERE relname = ’employee’;
relpages | total_bytes | relsize
————–+——————–+————–
6 | 49152 | 49152
(1 row)
|
Let’s now see how VACUUM behaves when you delete the rows with emp_id > 500
1
2
3
4
5
6
7
8
9
10
11
|
percona=# DELETE from scott.employee where emp_id > 500;
DELETE 500
percona=# VACUUM ANALYZE scott.employee ;
VACUUM
percona=# select relpages, relpages*8192 as total_bytes, pg_relation_size(‘scott.employee’) as relsize
FROM pg_class
WHERE relname = ’employee’;
relpages | total_bytes | relsize
————–+——————–+————–
3 | 24576 | 24576
(1 row)
|
In the above log, you see that the VACUUM has reclaimed half the space to filesystem. Earlier, it occupied 6 pages (8KB each or as set to parameter : block_size). After VACUUM, it has released 3 pages to filesystem.
Now, let’s repeat the same exercise by deleting the rows with emp_id < 500
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
|
percona=# DELETE from scott.employee ;
DELETE 500
percona=# INSERT INTO scott.employee VALUES (generate_series(1,1000), ‘avi’, 1);
INSERT 0 1000
percona=# DELETE from scott.employee where emp_id < 500;
DELETE 499
percona=# VACUUM ANALYZE scott.employee ;
VACUUM
percona=# select relpages, relpages*8192 as total_bytes, pg_relation_size(‘scott.employee’) as relsize
FROM pg_class
WHERE relname = ’employee’;
relpages | total_bytes | relsize
—————+——————–+————–
6 | 49152 | 49152
(1 row)
|
In the above example, you see that the number of pages still remain same after deleting half the records from the table. This means, VACUUM has not released the space to filesystem this time.
As explained earlier, if there are pages with no more live tuples after the high water mark, the subsequent pages can be flushed away to the disk by VACUUM. In the first case, it is understandable that there are no more live tuples after the 3rd page. So, the 4th, 5th and 6th page have been flushed to disk.
However, If you would need to reclaim the space to filesystem in the scenario where we deleted all the records with emp_id < 500, you may run VACUUM FULL. VACUUM FULL rebuilds the entire table and reclaims the space to disk.
1
2
3
4
5
6
7
8
9
10
11
|
percona=# VACUUM FULL scott.employee ;
VACUUM
percona=# VACUUM ANALYZE scott.employee ;
VACUUM
percona=# select relpages, relpages*8192 as total_bytes, pg_relation_size(‘scott.employee’) as relsize
FROM pg_class
WHERE relname = ’employee’;
relpages | total_bytes | relsize
—————+——————–+————–
3 | 24576 | 24576
(1 row)
|
Please note that VACUUM FULL is not an ONLINE operation. It is a blocking operation. You cannot read from or write to the table while VACUUM FULL is in progress. We will discuss about the ways to rebuild a table online without blocking in our future blog post.