Citus with BRIN index
In this weekend, I tried Citus. The latest version of Citus is for PostgreSQL 9.5, so it can be used BRIN index.
Installation
Refer to the document.
Set up
In this time, I used the append
mode to create distributed tables.
Refer to document.
Login and check
$ csql -h localhost -p 9700 -d postgres csql (9.5.3) Type "help" for help. postgres=# \d List of relations Schema | Name | Type | Owner --------+---------------+-------+---------- public | github_events | table | postgres (1 row) postgres=# \d github_events Table "public.github_events" Column | Type | Modifiers --------------+-----------------------------+----------- event_id | bigint | event_type | text | event_public | boolean | repo_id | bigint | payload | jsonb | repo | jsonb | actor | jsonb | org | jsonb | created_at | timestamp without time zone | postgres=# SELECT count(*) FROM github_events; count ------- 39599 (1 row)
Creating Indexes
B-Tree
At first, let’s create an index of event_id
with B-tree and do SELECT statement.
postgres=# CREATE INDEX github_events_event_id_idx ON github_events (event_id); CREATE INDEX postgres=# ANALYZE; ANALYZE postgres=# EXPLAIN ANALYZE SELECT event_id, event_type, created_at FROM github_events postgres-# WHERE event_id <= 2489378095; QUERY PLAN ------------------------------------------------------------------------------------------------------ Distributed Query into pg_merge_job_0049 Executor: Real-Time Task Count: 7 Tasks Shown: One of 7 -> Task Node: host=localhost port=9701 dbname=postgres -> Index Scan using github_events_event_id_idx_102008 on github_events_102008 github_events (cost=0.28..418. 48 rows=2640 width=27) (actual time=0.020..2.219 rows=2638 loops=1) Index Cond: (event_id <= '2489378095'::bigint) Planning time: 0.529 ms Execution time: 2.762 ms Master Query -> Seq Scan on pg_merge_job_0049 (cost=0.00..0.00 rows=0 width=0) (actual time=0.003..0.197 rows=2638 loops=1) Planning time: 0.260 ms Execution time: 0.312 ms (14 rows)
The created index is used.
BRIN
Next, let’s create BRIN index for created_at
column.
postgres=# CREATE INDEX github_events_created_at_brin_idx ON github_events postgres-# USING brin(created_at) WITH (pages_per_range = 10); CREATE INDEX postgres=# ANALYZE; ANALYZE postgres=# \d github_events Table "public.github_events" Column | Type | Modifiers --------------+-----------------------------+----------- event_id | bigint | event_type | text | event_public | boolean | repo_id | bigint | payload | jsonb | repo | jsonb | actor | jsonb | org | jsonb | created_at | timestamp without time zone | Indexes: "github_events_created_at_brin_idx" brin (created_at) WITH (pages_per_range='10') "github_events_event_id_idx" btree (event_id)
Do SELECT statement below:
postgres=# EXPLAIN ANALYZE SELECT event_id, event_type, created_at FROM github_events postgres-# WHERE event_id <= 2489378095 postgres-# AND created_at BETWEEN '2015-01-01 00:10:00' postgres-# AND '2015-01-01 00:12:00'; QUERY PLAN ------------------------------------------------------------------------------------------------------ Distributed Query into pg_merge_job_0051 Executor: Real-Time Task Count: 2 Tasks Shown: One of 2 -> Task Node: host=localhost port=9701 dbname=postgres -> Bitmap Heap Scan on github_events_102008 github_events (cost=71.30..366.45 rows=100 width=27) (actual tim e=0.394..0.648 rows=280 loops=1) Recheck Cond: ((created_at >= '2015-01-01 00:10:00'::timestamp without time zone) AND (created_at <= '20 15-01-01 00:12:00'::timestamp without time zone) AND (event_id <= '2489378095'::bigint)) Rows Removed by Index Recheck: 110 Heap Blocks: lossy=50 -> BitmapAnd (cost=71.30..71.30 rows=100 width=0) (actual time=0.286..0.286 rows=0 loops=1) -> Bitmap Index Scan on github_events_created_at_brin_idx_102008 (cost=0.00..14.92 rows=292 width =0) (actual time=0.170..0.170 rows=500 loops=1) Index Cond: ((created_at >= '2015-01-01 00:10:00'::timestamp without time zone) AND (created _at <= '2015-01-01 00:12:00'::timestamp without time zone)) -> Bitmap Index Scan on github_events_event_id_idx_102008 (cost=0.00..56.08 rows=2640 width=0) ( actual time=0.007..0.007 rows=2638 loops=1) Index Cond: (event_id <= '2489378095'::bigint) Planning time: 0.460 ms Execution time: 0.833 ms Master Query -> Seq Scan on pg_merge_job_0051 (cost=0.00..0.00 rows=0 width=0) (actual time=0.003..0.019 rows=280 loops=1) Planning time: 1.247 ms Execution time: 0.043 ms (21 rows)
The BRIN index works! Great!
BRIN vs GiST
Tresure Data is a big data management company, and they have a distributed analytical database called Plazma. Plazma is based on PostgreSQL, and the part of realtime storage uses a GiST Index of timestamp range column.
Note: When they created the Plazma, PostgreSQL did not support BRIN.
BRIN is easy-to-use for time sequential data, but it may not be effective in complicated tables and/or queries. On the other hand, GiST can be used for general purpose, and it already has a proven track record in the TD’s Plazma.
In any case, now I am very interested in capability of Citus with BRIN (and GiST). I am going to investigate the internal of Citus and the limit of BRIN.