PostgreSQL's query optimization is cost-based. Costs are dimensionless values; they are not absolute performance indicators, but relative indicators for comparing the performance of operations.
Costs are estimated by the functions defined in costsize.c. Every operation executed by the executor has a corresponding cost function. For example, the costs of sequential scans and index scans are estimated by cost_seqscan() and cost_index(), respectively.
In PostgreSQL, there are three kinds of costs: start-up, run and total. The total cost is the sum of the start-up and run costs; thus, only the start-up and run costs are estimated independently.
The EXPLAIN command shows both the start-up and total costs of each operation. The simplest example is shown below:
testdb=# EXPLAIN SELECT * FROM tbl;
                       QUERY PLAN                        
---------------------------------------------------------
 Seq Scan on tbl  (cost=0.00..145.00 rows=10000 width=8)
(1 row)
The Seq Scan line shows information about the sequential scan. Its cost section contains two values, 0.00 and 145.00: the start-up and total costs, respectively.
In this section, we explore in detail how the costs of the sequential scan, index scan and sort operations are estimated.
In the following explanations, we use the table and the index shown below:
testdb=# CREATE TABLE tbl (id int PRIMARY KEY, data int);
testdb=# CREATE INDEX tbl_data_idx ON tbl (data);
testdb=# INSERT INTO tbl SELECT generate_series(1,10000),generate_series(1,10000);
testdb=# ANALYZE;
testdb=# \d tbl
      Table "public.tbl"
 Column |  Type   | Modifiers 
--------+---------+-----------
 id     | integer | not null
 data   | integer | 
Indexes:
    "tbl_pkey" PRIMARY KEY, btree (id)
    "tbl_data_idx" btree (data)
The cost of the sequential scan is estimated by the cost_seqscan() function. In this subsection, we explore how to estimate the sequential scan cost of the following query.
testdb=# SELECT * FROM tbl WHERE id < 8000;
In the sequential scan, the start-up cost is equal to 0, and the run cost is defined by the following equation:
\begin{align} \verb|‘run cost’| &= \verb|‘cpu run cost’| + \verb|‘disk run cost’| \\ &= (\verb|cpu_tuple_cost| + \verb|cpu_operator_cost|) \times N_{tuple} + \verb|seq_page_cost| \times N_{page}, \end{align}where seq_page_cost, cpu_tuple_cost and cpu_operator_cost are set in the postgresql.conf file, and the default values are 1.0, 0.01, and 0.0025, respectively; \(N_{tuple}\) and \(N_{page}\) are the numbers of all tuples and all pages of this table, respectively, and these numbers can be shown using the following query:
testdb=# SELECT relpages, reltuples FROM pg_class WHERE relname = 'tbl';
 relpages | reltuples 
----------+-----------
       45 |     10000
(1 row)

\begin{align} N_{tuple} &= 10000, \tag{1} \\ N_{page} &= 45. \tag{2} \end{align}
Thus,
\begin{align} \verb|‘run cost’| &= (0.01 + 0.0025) \times 10000 + 1.0 \times 45 = 170.0. \end{align}Finally,
\begin{align} \verb|‘total cost’| = 0.0 + 170.0 = 170.0. \end{align}For confirmation, the result of the EXPLAIN command for the above query is shown below:
testdb=# EXPLAIN SELECT * FROM tbl WHERE id < 8000;
                       QUERY PLAN                       
--------------------------------------------------------
 Seq Scan on tbl  (cost=0.00..170.00 rows=8000 width=8)
   Filter: (id < 8000)
(2 rows)
In the Seq Scan line, we can see that the start-up and total costs are 0.00 and 170.00, respectively, and that 8000 rows (tuples) are estimated to be selected by scanning all rows.
The Filter line shows the filter ‘Filter: (id < 8000)’ of the sequential scan. More precisely, it is called a table level filter predicate. Note that this type of filter is applied while reading all the tuples in the table; it does not narrow the range of table pages that are scanned.
As can be understood from the run-cost estimation, PostgreSQL assumes that all pages will be read from storage; that is, PostgreSQL does not consider whether a scanned page is already in the shared buffers.
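The arithmetic above can be reproduced with a short sketch. This is a simplification of cost_seqscan(), not the real implementation; the parameter names follow postgresql.conf, the function name is made up for illustration, and the statistics are the ones shown in the pg_class query above.

```python
# Simplified sketch of the sequential scan cost estimation.
# Parameter defaults follow postgresql.conf; n_tuple and n_page
# correspond to pg_class.reltuples and pg_class.relpages of 'tbl'.
seq_page_cost = 1.0
cpu_tuple_cost = 0.01
cpu_operator_cost = 0.0025

def seqscan_cost(n_tuple, n_page):
    """Return (start-up cost, total cost) of a sequential scan."""
    startup = 0.0
    cpu_run_cost = (cpu_tuple_cost + cpu_operator_cost) * n_tuple
    disk_run_cost = seq_page_cost * n_page
    return startup, startup + cpu_run_cost + disk_run_cost

print(seqscan_cost(10000, 45))  # → (0.0, 170.0)
```

This matches the cost=0.00..170.00 shown by EXPLAIN for the query above.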
Although PostgreSQL supports several index methods, such as B-tree, GiST, GIN and BRIN, the cost of an index scan is estimated using the common cost function cost_index().
In this subsection, we explore how to estimate the index scan cost of the following query:
testdb=# SELECT id, data FROM tbl WHERE data < 240;
Before estimating the cost, the numbers of the index pages and index tuples, \(N_{index,page}\) and \(N_{index,tuple}\), are shown below:
testdb=# SELECT relpages, reltuples FROM pg_class WHERE relname = 'tbl_data_idx';
 relpages | reltuples 
----------+-----------
       30 |     10000
(1 row)

\begin{align} N_{index,tuple} &= 10000, \tag{3} \\ N_{index,page} &= 30. \tag{4} \end{align}
The start-up cost of the index scan is the cost of reading the index pages to access the first tuple in the target table, and it is defined by the following equation:
\begin{align} \verb|‘start-up cost’| = \{\mathrm{ceil}(\log_2 (N_{index,tuple})) + (H_{index} + 1) \times 50\} \times \verb|cpu_operator_cost|, \end{align}where \(H_{index}\) is the height of the index tree.
In this case, according to (3), \(N_{index,tuple}\) is 10000; \(H_{index}\) is 1; \(\verb|cpu_operator_cost|\) is 0.0025 (by default). Thus,
\begin{align} \verb|‘start-up cost’| = \{\mathrm{ceil}(\log_2(10000)) + (1 + 1) \times 50\} \times 0.0025 = 0.285. \tag{5} \end{align}The run cost of the index scan is the sum of the cpu costs and the IO (input/output) costs of both the table and the index:
\begin{align} \verb|‘run cost’| &= (\verb|‘index cpu cost’| + \verb|‘table cpu cost’|) + (\verb|‘index IO cost’| + \verb|‘table IO cost’|). \end{align}If the Index-Only Scans, which is described in Section 7.2, can be applied, \(\verb|‘table cpu cost’|\) and \(\verb|‘table IO cost’|\) are not estimated.
The first three costs (i.e. the index cpu cost, table cpu cost and index IO cost) are shown below:
\begin{align} \verb|‘index cpu cost’| &= \verb|Selectivity| \times N_{index,tuple} \times (\verb|cpu_index_tuple_cost| + \verb|qual_op_cost|), \\ \verb|‘table cpu cost’| &= \verb|Selectivity| \times N_{tuple} \times \verb|cpu_tuple_cost|, \\ \verb|‘index IO cost’| &= \mathrm{ceil}(\verb|Selectivity| \times N_{index,page}) \times \verb|random_page_cost|, \end{align}where cpu_index_tuple_cost and random_page_cost are set in the postgresql.conf file (the defaults are 0.005 and 4.0, respectively); \(\verb|qual_op_cost|\) is, roughly speaking, the cost of evaluating the index predicates, and it is used here without further explanation: \(\verb|qual_op_cost| = 0.0025\). \(\verb|Selectivity|\) is the proportion of the index's search range that satisfies the specified WHERE clause; it is a floating-point number from 0 to 1, and it is described in detail below. For example, \((\verb|Selectivity| \times N_{tuple})\) is the number of table tuples to be read, \((\verb|Selectivity| \times N_{index,page})\) is the number of index pages to be read, and so on.
The selectivity of query predicates is estimated using histogram_bounds or the MCV (Most Common Value), both of which are stored in the statistics information pg_stats. Here, the calculation of the selectivity is briefly described using specific examples. More details are provided in the official document.
The MCV of each column of a table is stored in the pg_stats view as a pair of columns of most_common_vals and most_common_freqs.
A simple example is shown below. The table "countries" has two columns: a column ‘country’ that stores country names and a column ‘continent’ that stores the name of the continent to which each country belongs.
testdb=# \d countries
   Table "public.countries"
  Column   | Type | Modifiers 
-----------+------+-----------
 country   | text | 
 continent | text | 
Indexes:
    "continent_idx" btree (continent)

testdb=# SELECT continent, count(*) AS "number of countries",
testdb-#        (count(*)/(SELECT count(*) FROM countries)::real) AS "number of countries / all countries"
testdb-# FROM countries GROUP BY continent ORDER BY "number of countries" DESC;
   continent   | number of countries | number of countries / all countries 
---------------+---------------------+-------------------------------------
 Africa        |                  53 |                   0.274611398963731
 Europe        |                  47 |                   0.243523316062176
 Asia          |                  44 |                   0.227979274611399
 North America |                  23 |                   0.119170984455959
 Oceania       |                  14 |                  0.0725388601036269
 South America |                  12 |                  0.0621761658031088
(6 rows)
Let us consider the following query which has a WHERE clause, ‘continent = 'Asia'’:
testdb=# SELECT * FROM countries WHERE continent = 'Asia';
In this case, the planner estimates the index scan cost using the MCV of the ‘continent’ column. The most_common_vals and most_common_freqs of this column are shown below:
testdb=# \x
Expanded display is on.
testdb=# SELECT most_common_vals, most_common_freqs FROM pg_stats
testdb-#  WHERE tablename = 'countries' AND attname='continent';
-[ RECORD 1 ]-----+-------------------------------------------------------------
most_common_vals  | {Africa,Europe,Asia,"North America",Oceania,"South America"}
most_common_freqs | {0.274611,0.243523,0.227979,0.119171,0.0725389,0.0621762}
The value of the most_common_freqs corresponding to ‘Asia’ of the most_common_vals is 0.227979. Therefore, 0.227979 is used as the selectivity in this estimation.
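The lookup can be sketched as follows. The two arrays mirror the pg_stats output above; the function name is made up for illustration, and the real planner combines this with other estimates.

```python
# Selectivity lookup from the MCV lists shown above.
most_common_vals = ["Africa", "Europe", "Asia", "North America",
                    "Oceania", "South America"]
most_common_freqs = [0.274611, 0.243523, 0.227979, 0.119171,
                     0.0725389, 0.0621762]

def mcv_selectivity(value):
    """Return the MCV-based selectivity of 'column = value', or None."""
    if value in most_common_vals:
        return most_common_freqs[most_common_vals.index(value)]
    return None  # the planner falls back to other estimation methods

print(mcv_selectivity("Asia"))  # → 0.227979
```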
If the MCV cannot be used, e.g., the target column type is integer or double precision, the value of the histogram_bounds of the target column is used to estimate the cost.
A specific example is shown. This is the value of histogram_bounds of the column ‘data’ in the table ‘tbl’:
testdb=# SELECT histogram_bounds FROM pg_stats WHERE tablename = 'tbl' AND attname = 'data';
                                          histogram_bounds                                          
---------------------------------------------------------------------------------------------------
 {1,100,200,300,400,500,600,700,800,900,1000,1100,1200,1300,1400,1500,1600,1700,1800,1900,2000,2100,
 2200,2300,2400,2500,2600,2700,2800,2900,3000,3100,3200,3300,3400,3500,3600,3700,3800,3900,4000,4100,
 4200,4300,4400,4500,4600,4700,4800,4900,5000,5100,5200,5300,5400,5500,5600,5700,5800,5900,6000,6100,
 6200,6300,6400,6500,6600,6700,6800,6900,7000,7100,7200,7300,7400,7500,7600,7700,7800,7900,8000,8100,
 8200,8300,8400,8500,8600,8700,8800,8900,9000,9100,9200,9300,9400,9500,9600,9700,9800,9900,10000}
(1 row)
By default, the histogram_bounds is divided into 100 buckets. Figure 3.7 illustrates the buckets and the corresponding histogram_bounds in this example. Buckets are numbered starting from 0, and every bucket stores (approximately) the same number of tuples. The values of histogram_bounds are the bounds of the corresponding buckets. For example, the 0th value of histogram_bounds is 1, which means that it is the minimum value of the tuples stored in bucket_0; the 1st value is 100 and this is the minimum value of the tuples stored in bucket_1, and so on.
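To estimate a range predicate such as ‘data < v’, PostgreSQL interpolates linearly within the bucket that contains v. A sketch of that calculation is shown below; hb mirrors the histogram_bounds of the column ‘data’ above, and the function name is hypothetical (the real code lives in selfuncs.c):

```python
# Linear-interpolation selectivity over histogram_bounds.
# hb mirrors pg_stats.histogram_bounds for column 'data':
# [1, 100, 200, ..., 10000], i.e. 101 bounds / 100 buckets.
hb = [1] + list(range(100, 10001, 100))

def hist_selectivity(value):
    """Selectivity of 'column < value' by linear interpolation."""
    # find the bucket i such that hb[i] <= value < hb[i+1]
    i = max(j for j in range(len(hb) - 1) if hb[j] <= value)
    fraction = (value - hb[i]) / (hb[i + 1] - hb[i])
    return (i + fraction) / (len(hb) - 1)

print(round(hist_selectivity(240), 3))  # → 0.024
```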
Fig. 3.7. Buckets and histogram_bounds.

Next, the calculation of the selectivity using the example in this subsection is shown. The query has a WHERE clause ‘data < 240’, and the value ‘240’ falls into bucket_2. In this case, the selectivity can be derived by linear interpolation; thus, the selectivity of the column ‘data’ in this query is calculated using the following equation:
\begin{align} \verb|Selectivity| &= \frac{2 + (240-\verb|hb[2]|)/(\verb|hb[3]|-\verb|hb[2]|)}{100} = \frac{2 + (240-200) / (300-200)}{100} = \frac{2 + 40/100}{100} = 0.024. \tag{6} \end{align}Thus, according to (1),(3),(4) and (6),
\begin{align} \verb|‘index cpu cost’| &= 0.024 \times 10000 \times (0.005 + 0.0025) = 1.8, \tag{7} \\ \verb|‘table cpu cost’| &= 0.024 \times 10000 \times 0.01 = 2.4, \tag{8} \\ \verb|‘index IO cost’| &= \mathrm{ceil}(0.024 \times 30) \times 4.0 = 4.0. \tag{9} \end{align}‘table IO cost’ is defined by the following equation:
\begin{align} \verb|‘table IO cost’| = \verb|max_IO_cost| + \verb|indexCorrelation|^2 \times (\verb|min_IO_cost| - \verb|max_IO_cost|). \end{align}\(\verb|max_IO_cost|\) is the worst case of the IO cost, that is, the cost of randomly scanning all table pages; this cost is defined by the following equation:
\begin{align} \verb|max_IO_cost| = N_{page} \times \verb|random_page_cost|. \end{align}In this case, according to (2), \(N_{page} = 45\), and thus
\begin{align} \verb|max_IO_cost| = 45 \times 4.0 = 180.0. \tag{10} \end{align}\(\verb|min_IO_cost|\) is the best case of the IO cost, that is, the cost of sequentially scanning the selected table pages; this cost is defined by the following equation:
\begin{align} \verb|min_IO_cost| = 1 \times \verb|random_page_cost| + (\mathrm{ceil}(\verb|Selectivity| \times N_{page}) - 1) \times \verb|seq_page_cost|. \end{align}In this case,
\begin{align} \verb|min_IO_cost| = 1 \times 4.0 + (\mathrm{ceil}(0.024 \times 45) - 1) \times 1.0 = 5.0. \tag{11} \end{align}\(\verb|indexCorrelation|\) is described in detail below; in this example,
\begin{align} \verb|indexCorrelation| = 1.0. \tag{12} \end{align}Thus, according to (10),(11) and (12),
\begin{align} \verb|‘table IO cost’| = 180.0 + 1.0^2 \times (5.0 - 180.0) = 5.0. \tag{13} \end{align}Finally, according to (7),(8),(9) and (13),
\begin{align} \verb|‘run cost’| = (1.8 + 2.4) + (4.0 + 5.0) = 13.2. \tag{14} \end{align}The index correlation is a statistical correlation between the physical row ordering and the logical ordering of the column values (cited from the official document). It ranges from \(-1\) to \(+1\). To understand the relation between the index scan and the index correlation, a specific example is shown below.
The table tbl_corr has five columns: two are text type and three are integer type. The three integer columns store the numbers from 1 to 12. Physically, tbl_corr is composed of three pages, each holding four tuples. Each integer column has an index, such as tbl_corr_asc_idx.
testdb=# \d tbl_corr
    Table "public.tbl_corr"
  Column  |  Type   | Modifiers 
----------+---------+-----------
 col      | text    | 
 col_asc  | integer | 
 col_desc | integer | 
 col_rand | integer | 
 data     | text    | 
Indexes:
    "tbl_corr_asc_idx" btree (col_asc)
    "tbl_corr_desc_idx" btree (col_desc)
    "tbl_corr_rand_idx" btree (col_rand)
testdb=# SELECT col,col_asc,col_desc,col_rand
testdb-# FROM tbl_corr;
   col    | col_asc | col_desc | col_rand 
----------+---------+----------+----------
 Tuple_1  |       1 |       12 |        3
 Tuple_2  |       2 |       11 |        8
 Tuple_3  |       3 |       10 |        5
 Tuple_4  |       4 |        9 |        9
 Tuple_5  |       5 |        8 |        7
 Tuple_6  |       6 |        7 |        2
 Tuple_7  |       7 |        6 |       10
 Tuple_8  |       8 |        5 |       11
 Tuple_9  |       9 |        4 |        4
 Tuple_10 |      10 |        3 |        1
 Tuple_11 |      11 |        2 |       12
 Tuple_12 |      12 |        1 |        6
(12 rows)
The index correlations of these columns are shown below:
testdb=# SELECT tablename,attname, correlation FROM pg_stats WHERE tablename = 'tbl_corr';
 tablename | attname  | correlation 
-----------+----------+-------------
 tbl_corr  | col_asc  |           1
 tbl_corr  | col_desc |          -1
 tbl_corr  | col_rand |    0.125874
(3 rows)
When the following query is executed, PostgreSQL reads only the first page because all target tuples are stored in the first page. Refer to Fig. 3.8(a).
testdb=# SELECT * FROM tbl_corr WHERE col_asc BETWEEN 2 AND 4;
On the other hand, when the following query is executed, PostgreSQL has to read all pages. Refer to Fig. 3.8(b).
testdb=# SELECT * FROM tbl_corr WHERE col_rand BETWEEN 2 AND 4;
In this way, the index correlation is a statistical correlation that, when estimating the index scan cost, reflects the influence of the random access caused by the mismatch between the index ordering and the physical tuple ordering in the table.
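The blending of the worst case (random access to all pages) and the best case (sequential access to just the selected pages) by the squared index correlation can be sketched as follows; this is a simplification, and the function name is illustrative:

```python
# Sketch of the 'table IO cost' interpolation: the squared index
# correlation blends max_IO_cost (random access to all pages) with
# min_IO_cost (sequential access to the selected pages).
import math

random_page_cost = 4.0
seq_page_cost = 1.0

def table_io_cost(selectivity, n_page, correlation):
    max_io = n_page * random_page_cost
    min_io = (1 * random_page_cost +
              (math.ceil(selectivity * n_page) - 1) * seq_page_cost)
    return max_io + correlation ** 2 * (min_io - max_io)

# tbl: 45 pages, selectivity 0.024
print(table_io_cost(0.024, 45, 1.0))       # → 5.0 (perfectly correlated)
print(table_io_cost(0.024, 45, 0.125874))  # low correlation: near max_IO_cost
```

A low correlation pulls the estimate toward the random-access worst case, which is why the col_rand-style query is costed much higher than the col_asc one.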
Fig. 3.8. Index correlation.

According to (5) and (14),
\begin{align} \verb|‘total cost’| = 0.285 + 13.2 = 13.485. \tag{15} \end{align}For confirmation, the result of the EXPLAIN command of the above SELECT query is shown below:
testdb=# EXPLAIN SELECT id, data FROM tbl WHERE data < 240;
                                QUERY PLAN                                 
---------------------------------------------------------------------------
 Index Scan using tbl_data_idx on tbl  (cost=0.29..13.49 rows=240 width=8)
   Index Cond: (data < 240)
(2 rows)
In the Index Scan line, we can see that the start-up and total costs are 0.29 and 13.49, respectively, and that 240 rows (tuples) are estimated to be scanned.
The Index Cond line shows the index condition ‘Index Cond: (data < 240)’ of the index scan. More precisely, this condition is called an access predicate, and it expresses the start and stop conditions of the index scan.
According to this post, the EXPLAIN command in PostgreSQL does not distinguish between access predicates and index filter predicates. Therefore, when analyzing the output of EXPLAIN, pay attention not only to the index conditions but also to the estimated number of rows.
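Putting the pieces together, the whole estimation can be reproduced in a short script. This is a simplified sketch of what cost_index() computes, using only the constants and statistics given in this subsection; the variable names are illustrative:

```python
# Simplified sketch of the index scan cost estimation for
# 'SELECT id, data FROM tbl WHERE data < 240'.
import math

# postgresql.conf defaults
cpu_operator_cost = 0.0025
cpu_index_tuple_cost = 0.005
cpu_tuple_cost = 0.01
random_page_cost = 4.0
seq_page_cost = 1.0
qual_op_cost = 0.0025  # stated without derivation in the text

# statistics for tbl / tbl_data_idx and the clause 'data < 240'
n_tuple, n_page = 10000, 45
n_index_tuple, n_index_page = 10000, 30
selectivity, height, correlation = 0.024, 1, 1.0

startup = ((math.ceil(math.log2(n_index_tuple)) + (height + 1) * 50)
           * cpu_operator_cost)

index_cpu = selectivity * n_index_tuple * (cpu_index_tuple_cost + qual_op_cost)
table_cpu = selectivity * n_tuple * cpu_tuple_cost
index_io = math.ceil(selectivity * n_index_page) * random_page_cost

max_io = n_page * random_page_cost
min_io = random_page_cost + (math.ceil(selectivity * n_page) - 1) * seq_page_cost
table_io = max_io + correlation ** 2 * (min_io - max_io)

run = (index_cpu + table_cpu) + (index_io + table_io)
print(round(startup, 3), round(run, 3), round(startup + run, 3))
# → 0.285 13.2 13.485
```

The result matches the cost=0.29..13.49 that EXPLAIN reports (EXPLAIN rounds to two decimal places).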
This document does not explain indexes in detail. To understand them, I recommend reading the valuable posts shown below:
The default values of seq_page_cost and random_page_cost are 1.0 and 4.0, respectively. This means that PostgreSQL assumes a random scan is four times slower than a sequential scan; in other words, the default values are based on the assumption of HDD storage.
On the other hand, since SSDs are widely used nowadays, the default value of random_page_cost is often too large. If the default value is used despite using an SSD, the planner may select inefficient plans. Therefore, when using an SSD, it is better to change random_page_cost to 1.0.
This blog reported the problem when using the default value of random_page_cost.
The sort path is used for sorting operations, such as ORDER BY, the preprocessing of merge join operations and other functions. The cost of sorting is estimated using the cost_sort() function.
In the sorting operation, if all tuples to be sorted fit in work_mem, the quicksort algorithm is used. Otherwise, temporary files are created and the external merge sort algorithm is used.
The start-up cost of the sort path is the cost of sorting the target tuples, therefore, the cost is \(O(N_{sort} \times \log_2(N_{sort})) \), where \(N_{sort}\) is the number of the tuples to be sorted. The run cost of the sort path is the cost of reading the sorted tuples, therefore the cost is \(O(N_{sort}) \).
In this subsection, we explore how to estimate the sorting cost of the following query. Assume that this query will be sorted in work_mem, without using temporary files.
testdb=# SELECT id, data FROM tbl WHERE data < 240 ORDER BY id;
In this case, the start-up cost is defined in the following equation:
\begin{align} \verb|‘start-up cost’| = C + \verb|comparison_cost| \times N_{sort} \times \log_2(N_{sort}), \end{align}where \(C\) is the total cost of the last scan, that is, the total cost of the index scan; according to (15), it is \(13.485\); \(N_{sort} = 240\); and \(\verb|comparison_cost|\) is defined as \(2 \times \verb|cpu_operator_cost|\). Thus,
\begin{align} \verb|‘start-up cost’| = 13.485 + (2 \times 0.0025) \times 240.0 \times \log_2(240.0) = 22.973. \end{align}The run cost is the cost to read sorted tuples in the memory; thus,
\begin{align} \verb|‘run cost’| = \verb|cpu_operator_cost| \times N_{sort} = 0.0025 \times 240 = 0.6. \end{align}Finally,
\begin{align} \verb|‘total cost’| = 22.973 + 0.6 = 23.573. \end{align}For confirmation, the result of the EXPLAIN command of the above SELECT query is shown below:
testdb=# EXPLAIN SELECT id, data FROM tbl WHERE data < 240 ORDER BY id;
                                   QUERY PLAN                                    
---------------------------------------------------------------------------------
 Sort  (cost=22.97..23.57 rows=240 width=8)
   Sort Key: id
   ->  Index Scan using tbl_data_idx on tbl  (cost=0.29..13.49 rows=240 width=8)
         Index Cond: (data < 240)
(4 rows)
In the Sort line, we can see that the start-up and total costs are 22.97 and 23.57, respectively.
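The estimation above can be sketched as follows; this is a simplification of cost_sort() for the in-memory case only, and the function name is illustrative:

```python
# Simplified sketch of the in-memory sort cost estimation.
import math

cpu_operator_cost = 0.0025
comparison_cost = 2 * cpu_operator_cost

def sort_cost(n_sort, input_total_cost):
    """Return (start-up cost, total cost) of an in-memory sort."""
    startup = (input_total_cost +
               comparison_cost * n_sort * math.log2(n_sort))
    run = cpu_operator_cost * n_sort  # cost of reading the sorted tuples
    return startup, startup + run

su, total = sort_cost(240, 13.485)  # 13.485 = total cost of the index scan
print(round(su, 3), round(total, 3))  # → 22.973 23.573
```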
As the processing of the planner is very complicated, this section describes the simplest process, that is, how a plan tree of a single-table query is created. More complex processing, namely, how a plan tree of a multiple-table query is created is described in Section 3.6.
The planner in PostgreSQL performs three steps, as shown below:
An access path is a unit of processing for estimating the cost; for example, the sequential scan, index scan, sort and various join operations have their corresponding paths. Access paths are used only inside the planner to create the plan tree. The most fundamental data structure of access paths is the Path structure defined in relation.h, and it corresponds to the sequential scan. All other access paths are based on it. Details will be described in the following explanations.
To process the above steps, the planner internally creates a PlannerInfo structure, and holds the query tree, the information about the relations contained in the query, the access paths, and so on.
In this section, how plan trees are created from query trees is described using specific examples.
Before creating a plan tree, the planner carries out some preprocessing of the query tree stored in the PlannerInfo structure.
Although preprocessing involves many steps, we only discuss the main preprocessing for the single-table query in this subsection. The other preprocessing operations are described in Section 3.6.
For example, ‘2 + 2’ is rewritten to ‘4’ by the eval_const_expressions() function defined in clauses.c.
For example, ‘NOT (NOT a)’ is rewritten to ‘a’.
AND and OR in the SQL standard are binary operators; however, in PostgreSQL internals they are n-ary operators, and the planner always assumes that all nested AND and OR expressions are flattened.
A specific example is shown below. Consider the Boolean expression ‘(id = 1) OR (id = 2) OR (id = 3)’. Figure 3.9(a) shows the part of the query tree that uses binary operators. The planner simplifies this tree by flattening it into a single ternary operator. Refer to Fig. 3.9(b).
To get the cheapest access path, the planner estimates the costs of all possible access paths and chooses the cheapest one. More specifically, the planner performs the following operations:
A RelOptInfo structure is created by the make_one_rel() function and is stored in the simple_rel_array of the PlannerInfo structure. Refer to Fig. 3.10. In its initial state, the RelOptInfo holds the baserestrictinfo and the indexlist if related indexes exist; the baserestrictinfo stores the WHERE clauses of the query, and the indexlist stores the related indexes of the target table.
Details of this processing are as follows:
To clearly understand how the planner performs, two specific examples are shown below.
First, we explore a simple single-table query without indexes; this query contains both WHERE and ORDER BY clauses.
testdb=# \d tbl_1
     Table "public.tbl_1"
 Column |  Type   | Modifiers 
--------+---------+-----------
 id     | integer | 
 data   | integer | 

testdb=# SELECT * FROM tbl_1 WHERE id < 300 ORDER BY data;
Figures 3.10 and 3.11 depict how the planner performs in this example.
Fig. 3.10. How to get the cheapest path of Example 1.

The WHERE clause ‘id < 300’ is added to the baserestrictinfo by the distribute_restrictinfo_to_rels() function defined in initsplan.c. In addition, the indexlist of the RelOptInfo is NULL because the target table has no related indexes.
A pathkey is a data structure representing the sort ordering of a path. In this example, the column ‘data’ is added to sort_pathkeys as a pathkey because the query contains an ORDER BY clause on that column.
As mentioned before, the Path structure contains both the start-up and total costs, which are estimated by the corresponding cost function, e.g., cost_seqscan().
In this example, the planner only estimates the sequential scan cost because there are no indexes of the target table; therefore, the cheapest access path is automatically determined.
Fig. 3.11. How to get the cheapest path of Example 1 (continued from Fig. 3.10).

Note that the new RelOptInfo does not have the baserestrictinfo, that is, the information of the WHERE clause.
The SortPath structure is composed of two path structures: path and subpath; the path stores information about the sort operation itself, and the subpath stores the cheapest path.
Note that the item ‘parent’ of the sequential scan path holds the link to the old RelOptInfo which stores the WHERE clause in its baserestrictinfo. Therefore, in the next stage, that is, creating a plan tree, the planner can create a sequential scan node that contains the WHERE clause as the ‘Filter’, even though the new RelOptInfo does not have the baserestrictinfo.
Based on the cheapest access path obtained here, a plan tree is generated. Details are described in Section 3.3.3.
Next, we explore another single-table query with two indexes; this query contains a WHERE clause.
testdb=# \d tbl_2
     Table "public.tbl_2"
 Column |  Type   | Modifiers 
--------+---------+-----------
 id     | integer | not null
 data   | integer | 
Indexes:
    "tbl_2_pkey" PRIMARY KEY, btree (id)
    "tbl_2_data_idx" btree (data)

testdb=# SELECT * FROM tbl_2 WHERE id < 240;
Figures 3.12 to 3.14 depict how the planner performs in this example.
In this example, a WHERE clause ‘id < 240’ is added to the baserestrictinfo, and two indexes, tbl_2_pkey and tbl_2_data_idx, are added to the indexlist of the RelOptInfo.
In this example, as there are two indexes, tbl_2_pkey and tbl_2_data_idx, these indexes are processed in order. tbl_2_pkey is processed first.
An IndexPath is created for tbl_2_pkey, and both the start-up and the total costs are estimated. In this example, tbl_2_pkey is the index related to the column ‘id’, and the WHERE clause contains the column ‘id’; therefore, the WHERE clause is stored in the indexclauses of the IndexPath.
Note that when adding access paths to the pathlist, the add_path() function keeps the paths sorted by total cost. In this example, the total cost of this index scan is smaller than that of the sequential scan; thus, the index path is inserted ahead of the sequential scan path.
Next, an IndexPath is created for tbl_2_data_idx, the costs are estimated and this IndexPath is added to the pathlist. In this example, there is no WHERE clause related to the tbl_2_data_idx index; thus, the index clauses are NULL.
Note that the add_path() function does not always add the path. The details are omitted because of the complicated nature of this operation. For details, refer to the comment of the add_path() function.
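The ordering behaviour described above can be illustrated with a toy sketch. This is not the real add_path() logic, which also discards dominated paths and compares more than the total cost; the costs for tbl_2 used here are hypothetical:

```python
# Toy illustration of the pathlist ordering: paths are kept sorted
# by total cost, so the cheapest access path is always at the front.
import bisect

pathlist = []  # list of (total_cost, name), kept sorted

def add_path(total_cost, name):
    # The real add_path() also prunes paths that are worse in every
    # respect; that pruning is omitted here.
    bisect.insort(pathlist, (total_cost, name))

add_path(145.00, "SeqScan")                    # hypothetical cost
add_path(13.49, "IndexScan(tbl_2_pkey)")       # cheaper: inserted first
add_path(210.00, "IndexScan(tbl_2_data_idx)")  # hypothetical cost

print(pathlist[0])  # → (13.49, 'IndexScan(tbl_2_pkey)')
```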
In this example, the cheapest path is the index path using the index tbl_2_pkey; thus, its path is added to the pathlist of the new RelOptInfo.
At the last stage, the planner generates a plan tree from the cheapest path.
The root of the plan tree is a PlannedStmt structure defined in plannodes.h. While it contains nineteen fields, here are four representative fields.
As mentioned above, a plan tree is composed of various plan nodes. The PlanNode structure is the base node, and other nodes always contain it. For example, SeqScanNode, which is for sequential scanning, is composed of a PlanNode and an integer variable ‘scanrelid’. A PlanNode contains fourteen fields. The following are seven representative fields.
In the following, two plan trees, which will be generated from the cheapest paths shown in the examples in the previous subsection, are described.
The first example is the plan tree of the example in Section 3.3.2.1. The cheapest path shown in Fig. 3.11 is the tree composed of a sort path and a sequential scan path; the root path is the sort path, and the child path is the sequential scan path. Although detailed explanations are omitted, it will be easy to understand that the plan tree can be almost trivially generated from the cheapest path. In this example, a SortNode is added to the plantree of the PlannedStmt structure, and a SeqScanNode is added to the lefttree of the SortNode. Refer to Fig. 3.15(a).
Fig. 3.15. Examples of plan trees.

In the SortNode, the lefttree points to the SeqScanNode.
In the SeqScanNode, the qual holds the WHERE clause ‘id < 300’.
The second example is the plan tree of the example in Section 3.3.2.2. The cheapest path shown in Fig. 3.14 is the index scan path; thus, the plan tree is composed of an IndexScanNode structure alone. Refer to Fig. 3.15(b).
In this example, the WHERE clause ‘id < 240’ is an access predicate; it is therefore stored in the indexqual of the IndexScanNode.
In single-table queries, the executor processes the plan nodes in order from the leaves of the plan tree to the root, invoking the functions that perform the processing of the corresponding nodes.
Each plan node has functions for executing its respective operation, and they are located in the src/backend/executor/ directory. For example, the functions for executing the sequential scan (SeqScanNode) are defined in nodeSeqscan.c; the functions for executing the index scan (IndexScanNode) are defined in nodeIndexscan.c; the functions for sorting (SortNode) are defined in nodeSort.c, and so on.
Of course, the best way to understand how the executor performs is to read the output of the EXPLAIN command, because PostgreSQL's EXPLAIN shows the plan tree almost as it is. This is explained below using Example 1 in Section 3.3.3.
testdb=# EXPLAIN SELECT * FROM tbl_1 WHERE id < 300 ORDER BY data;
                          QUERY PLAN                           
---------------------------------------------------------------
 Sort  (cost=182.34..183.09 rows=300 width=8)
   Sort Key: data
   ->  Seq Scan on tbl_1  (cost=0.00..170.00 rows=300 width=8)
         Filter: (id < 300)
(4 rows)
Let us explore how the executor performs. Read the result of the EXPLAIN command from the bottom line to the top line.
Although the executor uses work_mem and temp_buffers, which are allocated in memory, for query processing, it uses temporary files if the processing cannot be performed in memory alone.
Using the ANALYZE option, the EXPLAIN command actually executes the query and displays the true row counts, true run time and the actual memory usage. A specific example is shown below:
testdb=# EXPLAIN ANALYZE SELECT id, data FROM tbl_25m ORDER BY id;
                                                        QUERY PLAN                                                        
--------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=3944070.01..3945895.01 rows=730000 width=4104) (actual time=885.648..1033.746 rows=730000 loops=1)
   Sort Key: id
   Sort Method: external sort  Disk: 10000kB
   ->  Seq Scan on tbl_25m  (cost=0.00..10531.00 rows=730000 width=4104) (actual time=0.024..102.548 rows=730000 loops=1)
 Planning time: 1.548 ms
 Execution time: 1109.571 ms
(6 rows)
In the Sort Method line, the EXPLAIN command shows that the executor used a temporary file whose size is 10000kB.
Temporary files are created temporarily in the base/pgsql_tmp subdirectory, and the naming rule is shown below:
{"pgsql_tmp"} + {PID of the postgres process which creates the file} . {sequential number, starting from 0}
For example, the temporary file ‘pgsql_tmp8903.5’ is the 6th temporary file that is created by the postgres process whose pid is 8903.
$ ls -la /usr/local/pgsql/data/base/pgsql_tmp*
-rw-------  1 postgres  postgres  10240000 12  4 14:18 pgsql_tmp8903.5
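The naming rule can be expressed as a tiny helper (illustrative only; PostgreSQL builds these names internally):

```python
# Sketch of the temporary file naming rule described above.
def temp_file_name(pid, seq):
    """'pgsql_tmp' + PID + '.' + sequential number (starting from 0)."""
    return "pgsql_tmp{}.{}".format(pid, seq)

# the 6th temporary file of the postgres process with PID 8903
print(temp_file_name(8903, 5))  # → pgsql_tmp8903.5
```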