Does column order matter?

Outside mathematics, people often invoke the commutative property of multiplication in everyday scenarios: whatever comes first doesn’t matter, as long as everything is there. As the saying goes, “the order of the factors does not alter the product”.

But in programming, not all things are “factors”. Let’s look at a few of them.

Because the following scenarios create and manipulate a large set of rows, I feel the need to share my PostgreSQL configuration.

  • SHOW SERVER_VERSION = 13.0 (Debian 13.0-1.pgdg100+1) ;
  • SHOW CONFIG_FILE = /var/lib/postgresql/data/postgresql.conf;

You can find my full config (postgresql.conf) file here.

Column selection order matters.

We need a table to insert data into, right?
In this case, our table will hold player data. To keep things simple we’ll omit the primary key, so PostgreSQL won’t create an underlying index to back it.

CREATE TABLE player_1
(
  id           INT4 GENERATED BY DEFAULT AS IDENTITY ,
  name         TEXT ,
  alive        BOOLEAN ,
  is_in_combat BOOLEAN ,
  level        INT2 ,
  xp           INT4 ,
  pos_x        FLOAT4 ,
  pos_y        FLOAT4 ,
  created      TIMESTAMP
);

Let’s create a few million rows (31,536,001, to be exact). We’ll use GENERATE_SERIES to achieve that.

SELECT COUNT( * )
FROM ( SELECT GENERATE_SERIES( CURRENT_TIMESTAMP , CURRENT_TIMESTAMP + '1 year'::INTERVAL , '1 second'::INTERVAL ) ) series;
-- Should return: 31,536,001 (without commas)

The following query will actually create the entries.
Note: Be aware, this query could take a long time to run!

INSERT INTO player_1( name , alive , is_in_combat , level , xp , pos_x , pos_y , created )
SELECT *
FROM (
       SELECT MD5( RANDOM( )::TEXT ),
              RANDOM( ) > 0.5,
              RANDOM( ) > 0.5,
              TRUNC( RANDOM( ) * 100 ),
              TRUNC( RANDOM( ) * 3e6 ),
              RANDOM( ) * 1000,
              RANDOM( ) * 1000,
              GENERATE_SERIES( CURRENT_TIMESTAMP , CURRENT_TIMESTAMP + '1 year'::INTERVAL , '1 second'::INTERVAL )
     ) entries;

So, to actually test whether the selection order matters, let’s do the following.
Note: Run it 2~3 times and pick the last result to ensure the data is in the cache.

EXPLAIN ( ANALYZE, FORMAT YAML ) SELECT * FROM player_1 LIMIT 10e6 ;

The execution plan I got is the following:

- Plan: 
    Node Type: "Limit"
    Parallel Aware: false
    Startup Cost: 0.00
    Total Cost: 213636.36
    Plan Rows: 10000000
    Plan Width: 61
    Actual Startup Time: 0.596
    Actual Total Time: 840.089 # Important!
    Actual Rows: 10000000
    Actual Loops: 1
    Plans: 
      - Node Type: "Seq Scan"
        Parent Relationship: "Outer"
        Parallel Aware: false
        Relation Name: "player_1"
        Alias: "player_1"
        Startup Cost: 0.00
        Total Cost: 673724.32
        Plan Rows: 31536032
        Plan Width: 61
        Actual Startup Time: 0.014
        Actual Total Time: 436.907 # Important!
        Actual Rows: 10000000
        Actual Loops: 1
  Execution Time: 1039.702 # Important!

Let’s try again, but instead of using * we’ll specify all the columns

EXPLAIN ( ANALYZE, FORMAT YAML )
SELECT id,
       name,
       alive,
       is_in_combat,
       level,
       xp,
       pos_x,
       pos_y,
       created
FROM player_1
LIMIT 10e6;

The execution plan I got is the following:

- Plan: 
    Node Type: "Limit"
    Parallel Aware: false
    Startup Cost: 0.00
    Total Cost: 213636.36
    Plan Rows: 10000000
    Plan Width: 61
    Actual Startup Time: 0.496
    Actual Total Time: 859.440 # Important!
    Actual Rows: 10000000
    Actual Loops: 1
    Plans: 
      - Node Type: "Seq Scan"
        Parent Relationship: "Outer"
        Parallel Aware: false
        Relation Name: "player_1"
        Alias: "player_1"
        Startup Cost: 0.00
        Total Cost: 673724.32
        Plan Rows: 31536032
        Plan Width: 61
        Actual Startup Time: 0.008
        Actual Total Time: 440.953 # Important!
        Actual Rows: 10000000
        Actual Loops: 1
  Execution Time: 1064.141 # Important!

So it is, in fact, the same plan, with roughly the same execution time.

But does the same plan always mean the same execution time?

Now, we’ll change the ordering of a few columns.

EXPLAIN ( ANALYZE, FORMAT YAML ) 
SELECT
  id,
  name,
  alive,
  is_in_combat,
  level,
  xp,
  created,
  pos_x,
  pos_y
FROM player_1
LIMIT 10e6;

This is the execution plan for the previous query:

- Plan: 
    Node Type: "Limit"
    Parallel Aware: false
    Startup Cost: 0.00
    Total Cost: 213636.36
    Plan Rows: 10000000
    Plan Width: 61
    Actual Startup Time: 1.797
    Actual Total Time: 1792.148 # Important!
    Actual Rows: 10000000
    Actual Loops: 1
    Plans: 
      - Node Type: "Seq Scan"
        Parent Relationship: "Outer"
        Parallel Aware: false
        Relation Name: "player_1"
        Alias: "player_1"
        Startup Cost: 0.00
        Total Cost: 673724.32
        Plan Rows: 31536032
        Plan Width: 61
        Actual Startup Time: 0.011
        Actual Total Time: 1379.079 # Important!
        Actual Rows: 10000000
        Actual Loops: 1
  Execution Time: 1992.985 # Important! 

We just swapped a few columns and the Actual Total Time doubled.

All these plans are identical; the only difference is the order of the selected columns.

Let’s try some more queries. Now we’ll select just the first column.

EXPLAIN ( ANALYZE, FORMAT YAML ) SELECT id FROM player_1 LIMIT 10e6;
- Plan: 
    Node Type: "Limit"
    Parallel Aware: false
    Startup Cost: 0.00
    Total Cost: 213636.36
    Plan Rows: 10000000
    Plan Width: 4
    Actual Startup Time: 0.968
    Actual Total Time: 1234.556 # Important! 
    Actual Rows: 10000000
    Actual Loops: 1
    Plans: 
      - Node Type: "Seq Scan"
        Parent Relationship: "Outer"
        Parallel Aware: false
        Relation Name: "player_1"
        Alias: "player_1"
        Startup Cost: 0.00
        Total Cost: 673724.32
        Plan Rows: 31536032
        Plan Width: 4
        Actual Startup Time: 0.011
        Actual Total Time: 831.063 # Important! 
        Actual Rows: 10000000
        Actual Loops: 1
  Execution Time: 1435.405 # Important! 

And let’s pick just the last column.

EXPLAIN ( ANALYZE, FORMAT YAML ) SELECT created FROM player_1 LIMIT 10e6;
- Plan: 
    Node Type: "Limit"
    Parallel Aware: false
    Startup Cost: 0.00
    Total Cost: 213636.36
    Plan Rows: 10000000
    Plan Width: 8
    Actual Startup Time: 1.818
    Actual Total Time: 1505.432 # Important! 
    Actual Rows: 10000000
    Actual Loops: 1
    Plans: 
      - Node Type: "Seq Scan"
        Parent Relationship: "Outer"
        Parallel Aware: false
        Relation Name: "player_1"
        Alias: "player_1"
        Startup Cost: 0.00
        Total Cost: 673724.32
        Plan Rows: 31536032
        Plan Width: 8
        Actual Startup Time: 0.016
        Actual Total Time: 1097.864 # Important! 
        Actual Rows: 10000000
        Actual Loops: 1
  Execution Time: 1710.916 # Important! 

As you can see, even selecting fewer columns slows down performance for PostgreSQL.

Why is this happening? PostgreSQL still has to read the rows from disk/cache, but since we don’t want the full row (or not in its stored column order), it has to pick and transform every selected row to match the output schema, and that transformation takes a lot of time on a large dataset.

So, is it actually better to always use *? Nope. Even if that pleases PostgreSQL, we’d be sending all that information over the network, consuming more bandwidth, AND the client receiving the data would have to process all of it, which probably takes longer than letting PostgreSQL do the transformation and send only the data you actually NEED.

But this scenario proves that ORDER MATTERS, and sometimes a LOT!

Column declaration order matters.

What about the order we define our table layout? Can it impact the performance of our application?

Let’s create another table with the same columns but a different layout, this time using the following query:

CREATE TABLE player_2
(
  created      TIMESTAMP , -- 8 bytes
  id           INT4 GENERATED BY DEFAULT AS IDENTITY , -- 4 bytes
  xp           INT4 , -- 4 bytes
  pos_x        FLOAT4 , -- 4 bytes
  pos_y        FLOAT4 , -- 4 bytes
  level        INT2 , -- 2 bytes
  alive        BOOLEAN , -- 1 byte
  is_in_combat BOOLEAN , -- 1 byte
  name         TEXT
  -- TEXT is variable-length; its storage is handled internally by TOAST (just google it).
);

And to make it fair, let’s fill it with the same data as our player_1 table.

INSERT INTO player_2 SELECT
  created,
  id,
  xp,
  pos_x,
  pos_y,
  level,
  alive,
  is_in_combat,
  name
FROM player_1;

Both tables should now have the same number of rows:

SELECT ( SELECT COUNT( * ) count1 FROM player_1 ), ( SELECT COUNT( * ) count2 FROM player_2 );

Issuing that query, we get:

count1     count2
31536001   31536001

If we were to compare the size that both tables, we could issue the following query:

WITH size_1 AS ( SELECT PG_SIZE_PRETTY( PG_TABLE_SIZE( 'player_1'::regclass ) ) size ),
     size_2 AS ( SELECT PG_SIZE_PRETTY( PG_TABLE_SIZE( 'player_2'::regclass ) ) size )
SELECT size_1.size , size_2.size FROM size_1, size_2;

We get the following.

size      size
2801 MB   2801 MB

So, apparently the size is exactly the same… but let’s look at it another way.

WITH per_row_size_1 AS ( SELECT pg_size_pretty( SUM( pg_column_size( player_1.* )::bigint ) ) as size FROM player_1 ),
     per_row_size_2 AS ( SELECT pg_size_pretty( SUM( pg_column_size( player_2.* )::bigint ) ) as size FROM player_2 )
SELECT per_row_size_1.size as player_1_size,
       per_row_size_2.size as player_2_size
FROM per_row_size_1, per_row_size_2;

…And we got:

player_1_size   player_2_size
2647 MB         2556 MB

So player_2 actually takes less space (2556 MB) than player_1 (2647 MB). What happens is that PostgreSQL allocates disk pages of a fixed size, so the allocated size (what PG_TABLE_SIZE reports) tends to match whenever two tables hold roughly the same amount of data, regardless of column layout.

But the space the tables actually use is less than what PostgreSQL allocated (when it runs out of space, PostgreSQL allocates more disk pages). So we shaved roughly 100 MB off the chunk of data that has to be read and manipulated.

How did we achieve that?

We did it by sorting the columns in decreasing (byte) size.

How does it work?

Well, “aligned” data is preferred by the CPU for performance and consistency. The CPU operates in cycles; in a given cycle it can read X bytes, where X is a power of two (4, 8, 16, 32, 64...). The CPU also reads from addresses that are multiples of a fixed size (usually 4 or 8). Suppose your CPU can read 8 bytes of data per cycle and can only jump to addresses that are multiples of 8.

You then have this data in memory:
|id_1 = int4|id_2 = int8|flag_1 = bool1|id_3 = int4|
Let’s represent each byte of data with a 1:
| 1 1 1 1 | 1 1 1 1 1 1 1 1 | 1 | 1 1 1 1 |

So, to repeat the rules: your CPU always reads 8 bytes per cycle, from addresses that are multiples of 8. If the set above started at memory address 1000, it would end at 1000 + 3 + 8 + 1 + 4 = 1016.
That is, the first byte of id_1 is at 1000 and the last byte of id_3 is at 1016.
Now say you want to read id_2; it lives at addresses 1004 ~ 1011.
What is the last address that is a multiple of 8 at or before the start of id_2? The answer is 1000.
So the CPU reads 1000 ~ 1007 in one cycle and 1008 ~ 1015 in the next.
It then has to discard the bytes from 1000 ~ 1003 and those from 1012 ~ 1015.
So to get 8 bytes of data we spent 2 CPU cycles and read 16 bytes.

The solution is to “align” or “pad” the memory, so that bigger types sit on CPU-friendly boundaries and can be read more efficiently.

The memory layout aligned (to 8th byte) is the following:
Note: We’ll represent padding bytes with 0s.
| 1 1 1 1 | 0 0 0 0 1 1 1 1 1 1 1 1 | 1 | 0 0 0 1 1 1 1 |
So if the previous set started at memory address 1000, it would end at 1000 + 3 + ( 4 + 8 ) + 1 + ( 3 + 4 ) = 1023.
Since id_2 now begins at 1008 (a multiple of 8), the CPU can read all 8 of its bytes in just one cycle.

To reduce the padding, we simply sort the members in decreasing size. Bigger types require more padding bytes in front of them, so if we sort them like this:

|id_2 = int8|id_1 = int4|id_3 = int4|flag_1 = bool1|
We end up with the following memory layout:
| 1 1 1 1 1 1 1 1 | 1 1 1 1 | 1 1 1 1 | 1 |
We ended up with no padding bytes at all, and every value can still be fully read in a single aligned 8-byte fetch:

  • id_2 (bytes 1000 ~ 1007) can be read in the one fetch covering 1000 ~ 1007
  • id_1 (bytes 1008 ~ 1011) can be read in the one fetch covering 1008 ~ 1015
  • id_3 (bytes 1012 ~ 1015) can be read in that same fetch, 1008 ~ 1015
  • flag_1 (byte 1016) can be read in the one fetch covering 1016 ~ 1023

You can investigate this topic further in the following article.