The commutative property of multiplication is often restated outside maths as “whatever comes first doesn’t matter, as long as it’s all there”, or, as the saying goes, “the order of the factors does not alter the product”.
But in programming, not everything is a “factor”. Let’s look at some cases where order matters.
Because the following scenarios create and manipulate a large set of rows, I feel the need to share my PostgreSQL configuration.
SHOW SERVER_VERSION; -- 13.0 (Debian 13.0-1.pgdg100+1)
SHOW CONFIG_FILE;    -- /var/lib/postgresql/data/postgresql.conf
You can find my full config (postgresql.conf) file here.
We need a table to insert data into, right?
In this case, our table will contain the players’ data. To keep things simple we’ll omit the primary key, so PostgreSQL won’t create the underlying index it would otherwise use to enforce it.
CREATE TABLE player_1
(
id INT4 GENERATED BY DEFAULT AS IDENTITY ,
name TEXT ,
alive BOOLEAN ,
is_in_combat BOOLEAN ,
level INT2 ,
xp INT4 ,
pos_x FLOAT4 ,
pos_y FLOAT4 ,
created TIMESTAMP
);
Let’s create a few million rows (31,536,001 to be exact: one per second over a year, inclusive). We’ll use GENERATE_SERIES to achieve that.
SELECT COUNT( * )
FROM ( SELECT GENERATE_SERIES( CURRENT_TIMESTAMP , CURRENT_TIMESTAMP + '1 year'::INTERVAL , '1 second'::INTERVAL ) ) series;
-- Should return: 31,536,001 (without commas)
The following query will actually create the entries.
Note: Be aware, this query could take a long time to run!
INSERT INTO player_1( name , alive , is_in_combat , level , xp , pos_x , pos_y , created )
SELECT *
FROM (
SELECT MD5( RANDOM( )::TEXT ),
RANDOM( ) > 0.5 ,
RANDOM( ) > 0.5 ,
TRUNC( RANDOM( ) * 100 ),
TRUNC( RANDOM( ) * 3e6 ),
RANDOM( ) * 1000,
RANDOM( ) * 1000,
GENERATE_SERIES( CURRENT_TIMESTAMP , CURRENT_TIMESTAMP + '1 year'::INTERVAL , '1 second'::INTERVAL )
) entries;
So, to test whether the selection order actually matters, let’s do the following.
Note: Run it 2~3 times and pick the last result to ensure the data is in the cache.
EXPLAIN ( ANALYZE, FORMAT YAML ) SELECT * FROM player_1 LIMIT 10e6 ;
The execution plan I got is the following:
- Plan:
Node Type: "Limit"
Parallel Aware: false
Startup Cost: 0.00
Total Cost: 213636.36
Plan Rows: 10000000
Plan Width: 61
Actual Startup Time: 0.596
Actual Total Time: 840.089 # Important!
Actual Rows: 10000000
Actual Loops: 1
Plans:
- Node Type: "Seq Scan"
Parent Relationship: "Outer"
Parallel Aware: false
Relation Name: "player_1"
Alias: "player_1"
Startup Cost: 0.00
Total Cost: 673724.32
Plan Rows: 31536032
Plan Width: 61
Actual Startup Time: 0.014
Actual Total Time: 436.907 # Important!
Actual Rows: 10000000
Actual Loops: 1
Execution Time: 1039.702 # Important!
Let’s try again, but instead of using * we’ll specify all the columns:
EXPLAIN ( ANALYZE, FORMAT YAML )
SELECT id,
name,
alive,
is_in_combat,
level,
xp,
pos_x,
pos_y,
created
FROM player_1
LIMIT 10e6;
The execution plan I got is the following:
- Plan:
Node Type: "Limit"
Parallel Aware: false
Startup Cost: 0.00
Total Cost: 213636.36
Plan Rows: 10000000
Plan Width: 61
Actual Startup Time: 0.496
Actual Total Time: 859.440 # Important!
Actual Rows: 10000000
Actual Loops: 1
Plans:
- Node Type: "Seq Scan"
Parent Relationship: "Outer"
Parallel Aware: false
Relation Name: "player_1"
Alias: "player_1"
Startup Cost: 0.00
Total Cost: 673724.32
Plan Rows: 31536032
Plan Width: 61
Actual Startup Time: 0.008
Actual Total Time: 440.953 # Important!
Actual Rows: 10000000
Actual Loops: 1
Execution Time: 1064.141 # Important!
So it is, in fact, the same plan, with practically the same execution time.
But does the same plan always imply the same execution time?
Now, we’ll change the ordering of a few columns.
EXPLAIN ( ANALYZE, FORMAT YAML )
SELECT
id,
name,
alive,
is_in_combat,
level,
xp,
created,
pos_x,
pos_y
FROM player_1
LIMIT 10e6;
This is the execution plan for the previous query:
- Plan:
Node Type: "Limit"
Parallel Aware: false
Startup Cost: 0.00
Total Cost: 213636.36
Plan Rows: 10000000
Plan Width: 61
Actual Startup Time: 1.797
Actual Total Time: 1792.148 # Important!
Actual Rows: 10000000
Actual Loops: 1
Plans:
- Node Type: "Seq Scan"
Parent Relationship: "Outer"
Parallel Aware: false
Relation Name: "player_1"
Alias: "player_1"
Startup Cost: 0.00
Total Cost: 673724.32
Plan Rows: 31536032
Plan Width: 61
Actual Startup Time: 0.011
Actual Total Time: 1379.079 # Important!
Actual Rows: 10000000
Actual Loops: 1
Execution Time: 1992.985 # Important!
We just moved a few columns around and our Actual Total Time doubled.
All the plans are the same; the only difference is the order of the columns.
Let’s try a few more queries. First, we’ll select just the first column.
EXPLAIN ( ANALYZE, FORMAT YAML ) SELECT id FROM player_1 LIMIT 10e6;
- Plan:
Node Type: "Limit"
Parallel Aware: false
Startup Cost: 0.00
Total Cost: 213636.36
Plan Rows: 10000000
Plan Width: 4
Actual Startup Time: 0.968
Actual Total Time: 1234.556 # Important!
Actual Rows: 10000000
Actual Loops: 1
Plans:
- Node Type: "Seq Scan"
Parent Relationship: "Outer"
Parallel Aware: false
Relation Name: "player_1"
Alias: "player_1"
Startup Cost: 0.00
Total Cost: 673724.32
Plan Rows: 31536032
Plan Width: 4
Actual Startup Time: 0.011
Actual Total Time: 831.063 # Important!
Actual Rows: 10000000
Actual Loops: 1
Execution Time: 1435.405 # Important!
And let’s pick just the last column.
EXPLAIN ( ANALYZE, FORMAT YAML ) SELECT created FROM player_1 LIMIT 10e6;
- Plan:
Node Type: "Limit"
Parallel Aware: false
Startup Cost: 0.00
Total Cost: 213636.36
Plan Rows: 10000000
Plan Width: 8
Actual Startup Time: 1.818
Actual Total Time: 1505.432 # Important!
Actual Rows: 10000000
Actual Loops: 1
Plans:
- Node Type: "Seq Scan"
Parent Relationship: "Outer"
Parallel Aware: false
Relation Name: "player_1"
Alias: "player_1"
Startup Cost: 0.00
Total Cost: 673724.32
Plan Rows: 31536032
Plan Width: 8
Actual Startup Time: 0.016
Actual Total Time: 1097.864 # Important!
Actual Rows: 10000000
Actual Loops: 1
Execution Time: 1710.916 # Important!
As you can see, even selecting fewer columns slows PostgreSQL down.
Why is this happening? PostgreSQL still has to read every tuple from disk (or cache), but since we don’t want the full row (or not in its stored column order), it must extract and rearrange the selected columns of every row to match the output schema, and that transformation takes a lot of time when the dataset is large.
So, is it actually better to always use *? Nope. Even if that pleases PostgreSQL, we’d be sending all that data over the network, which naturally consumes more bandwidth, AND the client receiving it has to process all of it, probably taking longer than just letting PostgreSQL do the transformation and send only the data you actually NEED.
But this scenario proved that ORDER MATTERS, and sometimes a LOT!
What about the order we define our table layout? Can it impact the performance of our application?
Let’s create another table with the same columns but in a different layout, this time using the following query:
CREATE table player_2
(
created TIMESTAMP , -- 8 bytes
id INT4 GENERATED BY DEFAULT AS IDENTITY , -- 4 bytes
xp INT4 , -- 4 bytes
pos_x FLOAT4 , -- 4 bytes
pos_y FLOAT4 , -- 4 bytes
level INT2 , -- 2 bytes
alive BOOLEAN , -- 1 byte
is_in_combat BOOLEAN , -- 1 byte
name TEXT
-- TEXT is variable-length; it is handled internally by TOAST (just google it).
);
And to make it fair, let’s fill it with the same data as our player_1 table.
INSERT INTO player_2 SELECT
created,
id,
xp,
pos_x,
pos_y,
level,
alive,
is_in_combat,
name
FROM player_1;
Both tables should now have the same number of rows:
SELECT ( SELECT COUNT( * ) count1 FROM player_1 ), ( SELECT COUNT( * ) count2 FROM player_2 );
Issuing that query, we get:
count1 | count2 |
---|---|
31536001 | 31536001 |
To compare the sizes of both tables, we can issue the following query:
WITH size_1 AS ( SELECT PG_SIZE_PRETTY( PG_TABLE_SIZE( 'player_1'::regclass ) ) size ),
size_2 AS ( SELECT PG_SIZE_PRETTY( PG_TABLE_SIZE( 'player_2'::regclass ) ) size )
SELECT size_1.size , size_2.size FROM size_1, size_2;
We get the following.
size | size |
---|---|
2801 MB | 2801 MB |
So, apparently the size is exactly the same… but let’s measure it another way.
WITH per_row_size_1 AS ( SELECT pg_size_pretty( SUM( pg_column_size( player_1.* )::bigint ) ) as size FROM player_1 ),
per_row_size_2 AS ( SELECT pg_size_pretty( SUM( pg_column_size( player_2.* )::bigint ) ) as size FROM player_2 )
SELECT per_row_size_1.size as player_1_size,
per_row_size_2.size as player_2_size
FROM per_row_size_1, per_row_size_2;
…And we got:
player_1_size | player_2_size |
---|---|
2647 MB | 2556 MB |
So player_2 (2556 MB) actually takes less space than player_1 (2647 MB).
What happens is that PostgreSQL allocates disk pages of a fixed size, so two tables holding roughly the same amount of data will usually report the same allocated size, whatever the column order.
But the space the rows actually occupy inside those pages is smaller than what PostgreSQL allocated (when it runs out of space, PostgreSQL simply allocates more pages). So we shaved roughly 100 MB off the data that has to be read and manipulated.
We achieved that just by sorting the columns in decreasing (byte) size. Why does that work?
It is because CPUs prefer “aligned” data, for performance and consistency. A CPU works in cycles: in a given cycle it can read a fixed number of bytes, and that number is a power of two (4, 8, 16, 32, 64…). A CPU also starts its reads at addresses that are a multiple of a fixed power of two (usually 4 or 8). Suppose your CPU can read 8 bytes per cycle and can only start a read at an address that is a multiple of 8.
Say you have this data in memory:
|id_1 = int4|id_2 = int8|flag_1 = bool1|id_3 = int4|
Let’s represent each byte of data with a 1:
| 1 1 1 1 | 1 1 1 1 1 1 1 1 | 1 | 1 1 1 1 |
So, to repeat the rules: your CPU always reads 8 bytes per cycle, starting at an address that is a multiple of 8. If the previous set started at memory address 1000, it would end at 1000 + 3 + 8 + 1 + 4 = 1016.
That is, the first byte of id_1 is at 1000 and the last byte of id_3 is at 1016.
Now say you want to read id_2; the addresses that hold id_2 are 1004 ~ 1011.
So, what’s the last multiple of 8 at or before the start of id_2? The answer is 1000.
The CPU will therefore read 1000 ~ 1007 and, in the next cycle, 1008 ~ 1015.
It then has to discard the bytes from 1000 ~ 1003 and those from 1012 ~ 1015.
So to read 8 bytes we spent 2 CPU cycles and fetched 16 bytes, just to keep 8 of them.
The solution to this is to “align” (or “pad”) the memory, so bigger types start at boundaries the CPU can read efficiently.
The memory layout aligned to 8 bytes is the following (we’ll represent padding bytes with 0s):
| 1 1 1 1 | 0 0 0 0 1 1 1 1 1 1 1 1 | 1 | 0 0 0 1 1 1 1 |
So if this set started at memory address 1000, it would end at 1000 + 3 + ( 4 + 8 ) + 1 + ( 3 + 4 ) = 1023.
Since id_2 now begins at 1008 (a multiple of 8), the CPU can read all 8 of its bytes in just one cycle, without wasted reads.
To reduce the padding, we just sort the members in decreasing size (bigger types demand more padding in front of them), like this:
|id_2 = int8|id_1 = int4|id_3 = int4|flag_1 = bool1|
We end up with the following memory layout:
| 1 1 1 1 1 1 1 1 | 1 1 1 1 | 1 1 1 1 | 1 |
We ended up with no padding bytes at all, because every value can now be read with a single aligned 8-byte access:
id_1 can be read in one cycle, from 1008 to 1015
id_2 can be read in one cycle, from 1000 to 1007
id_3 can be read in one cycle, from 1008 to 1015
flag_1 can be read in one cycle, from 1016 to 1023
You can investigate this topic further in the following article.