Efficient Search on Massive Data With Multidimensional Attributes
Cut query time down significantly in a few, simple steps
Join the DZone community and get the full member experience.
Join For FreeBackground
Activities by people and other objects lead to massive amounts of spatio-temporal data. If technologies could bring us back to the past, we may ask what the status of the past world would be?
In fact, this need also exists in databases.
Object Data Classification
Object data can be classified into two types: static data (relatively static, for example, a building) and dynamic data (such as activities of people and activities of IoT sensors).
Data Classification by Research Requirements
Spatio-temporal Snapshot Search
Some objects generate data at a relatively low frequency. For example, inert objects such as buildings and roads may not have any change in years. If data generated for these objects is written into databases, and we query data by time range (for example, query data from 2017-07-01 to 2017-07-02), we may not find any data related to these objects. The reason is simply that no relevant data was written into databases at all within that time period.
Spatio-temporal Behavior Data Search
Spatio-temporal behavior data refers to feed data from dynamic objects like people's activities.
For example, analyze the features of a population group in a specific region within a specific time period, or analyze differences in the makeup of the crowds around universities during the week and on weekends.
Spatio-temporal snapshots are not within the scope of this article. For information about spatio-temporal snapshots, refer to the previously written article on this subject. Now, let's look at how to search for spatio-temporal behavior data.
Data Structure
Spatio-temporal behavior data contains three properties: time, space, and objects.
Unstructured data:
create table test(
id int8,
crt_time timestamp, -- Time
pos geometry, -- Location
obj jsonb -- Object description
);
In addition to JSON, structured data can also be used for object description. Example:
create table test(
id int8,
crt_time timestamp, -- Time
pos geometry, -- Location
c1 int, -- Some property examples
c2 int,
c3 text,
c4 float8,
c5 int,
c6 date,
c7 text,
c8 int,
c9 int,
c10 int
);
SQL query example of spatio-temporal behavior data
select * from test
where
pos <-> ? < ?
and crt_time between ? and ?
and ( (c1 = ? and c2 between ? and ?) or c10=?)
...
;
Optimization Methods
Consider the following knowledge:
Time Series BRIN Indexes
The crt_time field is a time-series field that represents the time when data is generated. Storage and the value of this field have a strong linear correlation in PostgreSQL heap storage.
Therefore, BRIN indexes are very suitable.
I used a BRIN index for a TPC-H test in place of a partitioned table. The performance of searching in a large range is even better than that of using a partitioned table.
create index idx_test_1 on test using brin(crt_time);
Spatial Indexes
Obviously, spatial retrieval requires spatial indexes. Three methods can be used in PostgreSQL to implement spatial retrieval.
1. GIST index for the geometry type
create index idx_test_2 on test using gist(pos);
This index supports features such as spatial KNN search and spatial position determination.
2. SP-GiST index for the geometry type
create index idx_test_2 on test using spgist(pos);
This index supports features such as spatial KNN search and spatial position determination.
3. Geohash and B-tree indexes (convert longitude and latitude into Geohash and create a B-tree index for hash values). Simply use an expression index.
create index idx_test_3 on test using btree( ST_GeoHash(pos,15) );
This index supports prefix search (which implements the relationship contained in the encoded geographic information grid). It belongs to the LOSSY index and requires secondary filtering.
The GiST and SPGiST spatial indexes can find accurate geographical location information and are better than the GEOHASH index. However, you need to pay special attention when querying information.
GIN Index
This index type targets the object property field JSONB or multiple structured object property fields. Simply use a GIN index.
Examples:
create extension btree_gin;
Unstructured index:
create index idx_test_4 on test using gin( obj );
Structured index:
create index idx_test_4 on test using gin( c1,c2,c3,c4,c5,c6,c7,c8,c9 );
BitmapAnd and BitmapOr
In the preceding section, different indexes are selected for different query dimensions based on data types and query requirements.
However, can these indexes be used at the same time? PostgreSQL provides the bitmapAnd and bitmapOr interfaces for multiple indexes. These can combine multiple indexes and reduce the number of databases that need to be scanned.
Heap, one square = one page:
+---------------------------------------------+
|c____u_____X___u___X_________u___cXcc______u_|
+---------------------------------------------+
Rows marked c match customers pkey condition.
Rows marked u match username condition.
Rows marked X match both conditions.
Bitmap scan from customers_pkey:
+---------------------------------------------+
|100000000001000000010000000000000111100000000| bitmap 1
+---------------------------------------------+
One bit per heap page, in the same order as the heap
Bits 1 when condition matches, 0 if not
Bitmap scan from ix_cust_username:
+---------------------------------------------+
|000001000001000100010000000001000010000000010| bitmap 2
+---------------------------------------------+
Once the bitmaps are created a bitwise AND is performed on them:
+---------------------------------------------+
|100000000001000000010000000000000111100000000| bitmap 1
|000001000001000100010000000001000010000000010| bitmap 2
&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&
|000000000001000000010000000000000010000000000| Combined bitmap
+-----------+-------+--------------+----------+
| | |
v v v
Used to scan the heap only for matching pages:
+---------------------------------------------+
|___________X_______X______________X__________|
+---------------------------------------------+
The bitmap heap scan then seeks to the start of each page and reads the page:
+---------------------------------------------+
|___________X_______X______________X__________|
+---------------------------------------------+
seek------->^seek-->^seek--------->^
| | |
------------------------
only these pages read
Example:
select * from test where
c1 ...
and crt_time between ? and ?
and test->>'c1' in (?, ? ...);
Proper indexes are automatically used based on statistics. If necessary, bitmapAnd or bitmapOr merge scans will be automatically performed on multiple indexes. Skip the pages that don't need to be scanned and recheck the hit pages.
Heap Table Storage Grading and Partitioning
Storage can be separated into one level partition or multi-level partitions:
1. Single-level partitioning:
For example, partition by time.
create table test(
id int8,
crt_time timestamp, -- Time
pos geometry, -- Location
obj jsonb -- Object description
)
PARTITION BY range (crt_time)
;
create table test_201701 PARTITION OF test for values FROM ('2017-01-01') TO ('2017-02-01');
......
2. Multi-Level partitioning
For example, partition first by time and then by Geohash.
create table test_201701 PARTITION OF test for values
FROM ('2017-01-01') TO ('2017-02-01') partition by range(st_geohash(pos,15));
...
create table test_201701_prefix1 PARTITION OF test for values
FROM ('xxxx1') TO ('xxxx2');
-- Generate BOX (GRID) on a map, find corresponding boundaries and use
-- boundaries as partitioning conditions
When partitions are used, corresponding partitions are automatically located if query conditions include partition keys (such as time and space range). This is the amount of data that needs to be scanned.
Create GIN indexes targeting object properties to implement an extremely efficient query.
Index Grading and Partitioning
Like data, indexing also supports partitioning logic when partitioned tables are not used.
Spatial index + time partition
create index idx_20170101
on tbl using gist (pos)
where crt_time between '2017-01-01' and '2017-01-02';
...
create index idx_20170102
on tbl using gist (pos)
where crt_time between '2017-01-02' and '2017-01-03';
...
By using the preceding partitioned index, target data can be quickly located after a time range is entered to perform spatial search.
select * from tbl
where crt_time between '2017-01-01' and '2017-01-02' -- Time
and (pos <-> ?) < ? -- Distance to a point to be searched for
and ? -- Other conditions
order by pos <-> ? -- Sort by distance
limit ?; -- Number of results to be returned
More index partitions can be used, such as a dimension (object property) used as the search criterion and shop type (assume it is enumerable or in a relatively small range).
create index idx_20170101_mod0 on tbl using gist (pos) where crt_time between '2017-01-01' and '2017-01-02' and dtype=0;
...
create index idx_20170101_mod1 on tbl using gist (pos) where crt_time between '2017-01-01' and '2017-01-02' and dtype=1;
...
By using the preceding partitioned index, target data can be quickly located after a time range or specific conditions are entered to perform a spatial search.
select * from tbl
where crt_time between '2017-01-01' and '2017-01-02' -- Time
and (pos <-> ?) < ? -- Distance to a point to be searched for
and dtype=0 -- Object condition
and ? -- Other conditions
order by pos <-> ? -- Sort by distance
limit ?; -- Number of results to be returned
Note that the preceding SQL queries can implement an extremely significant performance optimization.
Index organization forms (or index structures) can be restructured by logic partitions to cover all conditions in a manner similar to the preceding index creation method.
CTID Intersect Array JOIN SCAN
As mentioned earlier, BitmapAnd and BitmapOr merge scan is automatically performed in multiple indexes or GIN indexes. In fact, this scan can also be explicitly performed in SQL.
Each condition filters a corresponding CTID.
Use Intersect or Union to generate CTIDs that meet the overall requirements. (Intersect corresponds to the "and" condition; union corresponds to the "or" condition.)
Generate an array of ctids.
Example
1. Create an Object Feed Data Table
postgres=# create table tbl (id int, info text, crt_time timestamp, pos point, c1 int , c2 int, c3 int );
CREATE TABLE
2. Write 50 Million Pieces of Test Data Into the Table
postgres=# insert into tbl select generate_series(1,50000000), md5(random()::text), clock_timestamp(), point(180-random()*180, 90-random()*90), random()*10000, random()*5000, random()*1000;
INSERT 0 50000000
3. Create an Object Index
postgres=# create index idx_tbl_1 on tbl using gin (info, c1, c2, c3);
CREATE INDEX
4. Create a Time Index
postgres=# create index idx_tbl_2 on tbl using btree (crt_time);
CREATE INDEX
5. Create a Spatial Index
postgres=# create index idx_tbl_3 on tbl using gist (pos);
CREATE INDEX
6. Generate Data Layout to Facilitate Subsequent Queries
postgres=# select min(crt_time),max(crt_time),count(*) from tbl;
min | max | count
----------------------------+----------------------------+----------
2017-07-22 17:59:34.136497 | 2017-07-22 18:01:27.233688 | 50000000
(1 row)
7. Create an Extreme KNN Query Function
create or replace function ff(point, float8, int) returns setof tid as
$$
declare
v_rec record;
v_limit int := $3;
begin
set local enable_seqscan=off; -- Force index that exits when scanned rows reach a specific number
for v_rec in
select *,
(pos <-> $1) as dist,
ctid
from tbl
order by pos <-> $1
loop
if v_limit <=0 then
-- raise notice "Sufficient data obtained"
return;
end if;
if v_rec.dist > $2 then
-- raise notice "All matching points returned"
return;
else
return next v_rec.ctid;
end if;
v_limit := v_limit -1;
end loop;
end;
$$
language plpgsql strict volatile;
postgres=# select * from ff(point '(100,100)',100,100) ;
ff
-------------
(407383,11)
(640740,9)
(26073,51)
(642750,34)
...
(100 rows)
Time: 1.061 ms
8. CTID Merge Retrieval
Display records that meet the following criteria
(
c1 in (1,2,3,4,100,200,99,88,77,66,55)
or
c2 < 10
)
and
pos <-> point '(0,0)' < 5
and
crt_time between '2017-07-22 17:59:34' and '2017-07-22 17:59:40';
First, look at each criterion separately, find how many records match a criterion and how much time overhead is incurred on an index scan.
1) 54,907 records
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl where c1 in (1,2,3,4,100,200,99,88,77,66,55);
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on postgres.tbl (cost=820.07..65393.94 rows=54151 width=73) (actual time=23.842..91.911 rows=54907 loops=1)
Output: id, info, crt_time, pos, c1, c2, c3
Recheck Cond: (tbl.c1 = ANY ('{1,2,3,4,100,200,99,88,77,66,55}'::integer[]))
Heap Blocks: exact=52778
Buffers: shared hit=52866
-> Bitmap Index Scan on idx_tbl_1 (cost=0.00..806.54 rows=54151 width=0) (actual time=14.264..14.264 rows=54907 loops=1)
Index Cond: (tbl.c1 = ANY ('{1,2,3,4,100,200,99,88,77,66,55}'::integer[]))
Buffers: shared hit=88
Planning time: 0.105 ms
Execution time: 94.606 ms
(10 rows)
2) 95,147 records
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl where c2<10;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on postgres.tbl (cost=835.73..112379.10 rows=99785 width=73) (actual time=69.243..179.388 rows=95147 loops=1)
Output: id, info, crt_time, pos, c1, c2, c3
Recheck Cond: (tbl.c2 < 10)
Heap Blocks: exact=88681
Buffers: shared hit=88734
-> Bitmap Index Scan on idx_tbl_1 (cost=0.00..810.79 rows=99785 width=0) (actual time=53.612..53.612 rows=95147 loops=1)
Index Cond: (tbl.c2 < 10)
Buffers: shared hit=53
Planning time: 0.094 ms
Execution time: 186.201 ms
(10 rows)
3) 149,930 records (PostgreSQL uses bitmapOr for merge scanning to quickly get results)
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl where c1 in (1,2,3,4,100,200,99,88,77,66,55) or c2 <10;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on postgres.tbl (cost=1694.23..166303.58 rows=153828 width=73) (actual time=98.988..266.852 rows=149930 loops=1)
Output: id, info, crt_time, pos, c1, c2, c3
Recheck Cond: ((tbl.c1 = ANY ('{1,2,3,4,100,200,99,88,77,66,55}'::integer[])) OR (tbl.c2 < 10))
Heap Blocks: exact=134424
Buffers: shared hit=134565
-> BitmapOr (cost=1694.23..1694.23 rows=153936 width=0) (actual time=73.763..73.763 rows=0 loops=1)
Buffers: shared hit=141
-> Bitmap Index Scan on idx_tbl_1 (cost=0.00..806.54 rows=54151 width=0) (actual time=16.733..16.733 rows=54907 loops=1)
Index Cond: (tbl.c1 = ANY ('{1,2,3,4,100,200,99,88,77,66,55}'::integer[]))
Buffers: shared hit=88
-> Bitmap Index Scan on idx_tbl_1 (cost=0.00..810.79 rows=99785 width=0) (actual time=57.029..57.029 rows=95147 loops=1)
Index Cond: (tbl.c2 < 10)
Buffers: shared hit=53
Planning time: 0.149 ms
Execution time: 274.548 ms
(15 rows)
4) 60,687 records (It still takes up to 195 milliseconds even if we use the excellent KNN performance optimization).
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from ff(point '(0,0)',5,1000000);
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Function Scan on postgres.ff (cost=0.25..10.25 rows=1000 width=6) (actual time=188.563..192.114 rows=60687 loops=1)
Output: ff
Function Call: ff('(0,0)'::point, '5'::double precision, 1000000)
Buffers: shared hit=61296
Planning time: 0.029 ms
Execution time: 195.097 ms
(6 rows)
Let's look at how long it takes if we don't use KNN optimization.
The result is very surprising — extreme optimization performance is improved by one order of magnitude.
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl where pos<-> point '(0,0)' < 5 ;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
Seq Scan on postgres.tbl (cost=0.00..1416667.00 rows=16666667 width=73) (actual time=0.016..6393.542 rows=60687 loops=1)
Output: id, info, crt_time, pos, c1, c2, c3
Filter: ((tbl.pos <-> '(0,0)'::point) < '5'::double precision)
Rows Removed by Filter: 49939313
Buffers: shared hit=666667
Planning time: 0.090 ms
Execution time: 6397.087 ms
(7 rows)
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl where pos<-> point '(0,0)' < 5 order by pos<-> point '(0,0)';
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using idx_tbl_3 on postgres.tbl (cost=0.42..2623952.79 rows=16666667 width=81) (actual time=0.088..83076.718 rows=60687 loops=1)
Output: id, info, crt_time, pos, c1, c2, c3, (pos <-> '(0,0)'::point)
Order By: (tbl.pos <-> '(0,0)'::point)
Filter: ((tbl.pos <-> '(0,0)'::point) < '5'::double precision)
Rows Removed by Filter: 49939313
Buffers: shared hit=50454244
Planning time: 0.097 ms
Execution time: 83080.970 ms
(8 rows)
5) 2,640,751 records
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl where crt_time between '2017-07-22 17:59:34' and '2017-07-22 17:59:40';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using idx_tbl_2 on postgres.tbl (cost=0.56..90860.33 rows=2462443 width=73) (actual time=0.017..444.194 rows=2640751 loops=1)
Output: id, info, crt_time, pos, c1, c2, c3
Index Cond: ((tbl.crt_time >= '2017-07-22 17:59:34'::timestamp without time zone) AND (tbl.crt_time <= '2017-07-22 17:59:40'::timestamp without time zone))
Buffers: shared hit=42430
Planning time: 0.140 ms
Execution time: 567.451 ms
(6 rows)
Use all indexes to scan data conditions one-by-one to get ctids and perform ctid scans. Now, let's break down the procedure:
First, let's look at merged querying of time and object properties. The performance is excellent. When using bitmapAnd or bitmapOr, a query can skip most data blocks and scan time is shorter than that of a single-index scan.
Note that the number of records is reduced to 7847 in this step.
postgres=# explain (analyze,verbose,timing,costs,buffers) select ctid from tbl
where crt_time between '2017-07-22 17:59:34' and '2017-07-22 17:59:40'
and (
c1 in (1,2,3,4,100,200,99,88,77,66,55)
or
c2 < 10
);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on postgres.tbl (cost=35025.85..44822.94 rows=7576 width=6) (actual time=205.577..214.821 rows=7847 loops=1)
Output: ctid
Recheck Cond: (((tbl.c1 = ANY ('{1,2,3,4,100,200,99,88,77,66,55}'::integer[])) OR (tbl.c2 < 10)) AND (tbl.crt_time >= '2017-07-22 17:59:34'::timestamp without time zone) AND (tbl.crt_time <= '2017-07-22 17:59:40'::timestamp without time zone))
Heap Blocks: exact=6983
Buffers: shared hit=14343
-> BitmapAnd (cost=35025.85..35025.85 rows=7581 width=0) (actual time=204.048..204.048 rows=0 loops=1)
Buffers: shared hit=7360
-> BitmapOr (cost=1621.11..1621.11 rows=153936 width=0) (actual time=70.279..70.279 rows=0 loops=1)
Buffers: shared hit=141
-> Bitmap Index Scan on idx_tbl_1 (cost=0.00..806.54 rows=54151 width=0) (actual time=15.860..15.860 rows=54907 loops=1)
Index Cond: (tbl.c1 = ANY ('{1,2,3,4,100,200,99,88,77,66,55}'::integer[]))
Buffers: shared hit=88
-> Bitmap Index Scan on idx_tbl_1 (cost=0.00..810.79 rows=99785 width=0) (actual time=54.418..54.418 rows=95147 loops=1)
Index Cond: (tbl.c2 < 10)
Buffers: shared hit=53
-> Bitmap Index Scan on idx_tbl_2 (cost=0.00..33402.60 rows=2462443 width=0) (actual time=127.101..127.101 rows=2640751 loops=1)
Index Cond: ((tbl.crt_time >= '2017-07-22 17:59:34'::timestamp without time zone) AND (tbl.crt_time <= '2017-07-22 17:59:40'::timestamp without time zone))
Buffers: shared hit=7219
Planning time: 0.203 ms
Execution time: 216.697 ms
(20 rows)
Then, look at the scan time for KNN:
Note that 60,687 records meet the KNN distance condition, so I will explain the performance comparison between CTID merge scans and original scans.
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from ff(point '(0,0)',5,1000000);
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Function Scan on postgres.ff (cost=0.25..10.25 rows=1000 width=6) (actual time=188.563..192.114 rows=60687 loops=1)
Output: ff
Function Call: ff('(0,0)'::point, '5'::double precision, 1000000)
Buffers: shared hit=61296
Planning time: 0.029 ms
Execution time: 195.097 ms
(6 rows)
Finally, we combine these pieces into ctid.
select * from ff(point '(0,0)',5,1000000)
intersect
select ctid from tbl
where crt_time between '2017-07-22 17:59:34' and '2017-07-22 17:59:40'
and (
c1 in (1,2,3,4,100,200,99,88,77,66,55)
or
c2 < 10
);
ff
------------
(1394,8)
(3892,50)
(6124,45)
(7235,8)
(7607,45)
(11540,8)
(13397,31)
(14266,36)
(18149,7)
(19256,44)
(24671,62)
(26525,64)
(30235,48)
(13 rows)
Time: 463.012 ms
Obtain the final records.
select * from tbl where ctid = any
(
array( -- array start
select * from ff(point '(0,0)',5,1000000) intersect select ctid from tbl
where crt_time between '2017-07-22 17:59:34' and '2017-07-22 17:59:40'
and (
c1 in (1,2,3,4,100,200,99,88,77,66,55)
or
c2 < 10
)
) -- array end
);
id | info | crt_time | pos | c1 | c2 | c3
---------+----------------------------------+----------------------------+----------------------------------------+------+------+-----
104558 | c4699c933d4e2d2a10d828c4ff0b3362 | 2017-07-22 17:59:34.362508 | (4.20534582808614,2.43749532848597) | 99 | 4858 | 543
291950 | 1c2901689ab1eb7653d8ad972f7aa376 | 2017-07-22 17:59:34.776808 | (2.5384977646172,1.09820357523859) | 3 | 2131 | 360
459345 | 9e46548f29d914019ce53a589be8ebac | 2017-07-22 17:59:35.148699 | (0.715781506150961,3.1486327573657) | 1 | 1276 | 8
542633 | c422d6137f9111d5c2dc723b40c7023f | 2017-07-22 17:59:35.334278 | (0.0631888210773468,2.2334903664887) | 4968 | 3 | 245
570570 | fc57bfc6b7781d89b17c90417bd306f7 | 2017-07-22 17:59:35.39653 | (3.14926156774163,1.04107855819166) | 88 | 2560 | 561
865508 | 34509c7f7640afaf288a5e1d38199701 | 2017-07-22 17:59:36.052573 | (3.12869547866285,2.34822122845799) | 2 | 65 | 875
1004806 | afe9f88cbebf615a7ae5f41180c4b33f | 2017-07-22 17:59:36.362027 | (1.13972157239914,3.28763140831143) | 3 | 1639 | 208
1069986 | 6b9f27bfde993fb0bae3336ac010af7a | 2017-07-22 17:59:36.507775 | (4.51995821669698,2.08761331625283) | 2 | 200 | 355
1361182 | 7c4c1c208c2b2b21f00772c43955d238 | 2017-07-22 17:59:37.155127 | (1.7334086727351,2.18367457855493) | 9742 | 0 | 232
1444244 | 41bf6f8e4b89458c13fb408a7db05284 | 2017-07-22 17:59:37.339594 | (0.52773853763938,2.16670122463256) | 1 | 2470 | 820
1850387 | 6e0011c6db76075edd2aa7f81ec94129 | 2017-07-22 17:59:38.243091 | (0.0168232340365648,0.420973123982549) | 100 | 4395 | 321
1989439 | 6211907ac254a4a3ca54f90822a2095e | 2017-07-22 17:59:38.551637 | (0.0274275150150061,0.490507003851235) | 1850 | 5 | 74
2267673 | 898fdd54dcc5b14c27cf1c8b9afe2471 | 2017-07-22 17:59:39.170035 | (0.394239127635956,2.86229319870472) | 2892 | 6 | 917
(13 rows)
Time: 462.715 ms
It takes 462 milliseconds.
9. Test the Performance of the Original SQL Queries: PostgreSQL Multi-Index BitmapAnd and BitmapOr Skip Scan
Write a SQL query directly instead of using multi CTID scans.
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl
where
crt_time between '2017-07-22 17:59:34' and '2017-07-22 17:59:40'
and (
c1 in (1,2,3,4,100,200,99,88,77,66,55)
or
c2 < 10
)
and
pos <-> point '(0,0)' < 5;
Bitmap Heap Scan on postgres.tbl (cost=35022.06..44857.06 rows=2525 width=73) (actual time=205.542..214.547 rows=13 loops=1)
Output: id, info, crt_time, pos, c1, c2, c3
Recheck Cond: (((tbl.c1 = ANY ('{1,2,3,4,100,200,99,88,77,66,55}'::integer[])) OR (tbl.c2 < 10)) AND (tbl.crt_time >= '2017-07-22 17:59:34'::timestamp without time zone) AND (tbl.crt_time <= '2017-07-22 17:59:40'::timestamp without time zone))
Filter: ((tbl.pos <-> '(0,0)'::point) < '5'::double precision)
Rows Removed by Filter: 7834
Heap Blocks: exact=6983
Buffers: shared hit=14343
-> BitmapAnd (cost=35022.06..35022.06 rows=7581 width=0) (actual time=203.620..203.620 rows=0 loops=1)
Buffers: shared hit=7360
-> BitmapOr (cost=1618.58..1618.58 rows=153936 width=0) (actual time=71.660..71.660 rows=0 loops=1)
Buffers: shared hit=141
-> Bitmap Index Scan on idx_tbl_1 (cost=0.00..806.54 rows=54151 width=0) (actual time=14.861..14.861 rows=54907 loops=1)
Index Cond: (tbl.c1 = ANY ('{1,2,3,4,100,200,99,88,77,66,55}'::integer[]))
Buffers: shared hit=88
-> Bitmap Index Scan on idx_tbl_1 (cost=0.00..810.79 rows=99785 width=0) (actual time=56.797..56.797 rows=95147 loops=1)
Index Cond: (tbl.c2 < 10)
Buffers: shared hit=53
-> Bitmap Index Scan on idx_tbl_2 (cost=0.00..33402.60 rows=2462443 width=0) (actual time=125.255..125.255 rows=2640751 loops=1)
Index Cond: ((tbl.crt_time >= '2017-07-22 17:59:34'::timestamp without time zone) AND (tbl.crt_time <= '2017-07-22 17:59:40'::timestamp without time zone))
Buffers: shared hit=7219
Planning time: 0.160 ms
Execution time: 216.797 ms
(22 rows)
The performance is excellent as expected. We explained the reason earlier. Conditions other than the KNN condition have already converged results to 7,000 records, so it is not necessary to use indexes containing KNN conditions. (It takes 195 milliseconds even a KNN index is used because 60,687 records meet the KNN condition.)
Result verification:
select * from tbl
where
crt_time between '2017-07-22 17:59:34' and '2017-07-22 17:59:40'
and (
c1 in (1,2,3,4,100,200,99,88,77,66,55)
or
c2 < 10
)
and
pos <-> point '(0,0)' < 5;
id | info | crt_time | pos | c1 | c2 | c3
---------+----------------------------------+----------------------------+----------------------------------------+------+------+-----
104558 | c4699c933d4e2d2a10d828c4ff0b3362 | 2017-07-22 17:59:34.362508 | (4.20534582808614,2.43749532848597) | 99 | 4858 | 543
291950 | 1c2901689ab1eb7653d8ad972f7aa376 | 2017-07-22 17:59:34.776808 | (2.5384977646172,1.09820357523859) | 3 | 2131 | 360
459345 | 9e46548f29d914019ce53a589be8ebac | 2017-07-22 17:59:35.148699 | (0.715781506150961,3.1486327573657) | 1 | 1276 | 8
542633 | c422d6137f9111d5c2dc723b40c7023f | 2017-07-22 17:59:35.334278 | (0.0631888210773468,2.2334903664887) | 4968 | 3 | 245
570570 | fc57bfc6b7781d89b17c90417bd306f7 | 2017-07-22 17:59:35.39653 | (3.14926156774163,1.04107855819166) | 88 | 2560 | 561
865508 | 34509c7f7640afaf288a5e1d38199701 | 2017-07-22 17:59:36.052573 | (3.12869547866285,2.34822122845799) | 2 | 65 | 875
1004806 | afe9f88cbebf615a7ae5f41180c4b33f | 2017-07-22 17:59:36.362027 | (1.13972157239914,3.28763140831143) | 3 | 1639 | 208
1069986 | 6b9f27bfde993fb0bae3336ac010af7a | 2017-07-22 17:59:36.507775 | (4.51995821669698,2.08761331625283) | 2 | 200 | 355
1361182 | 7c4c1c208c2b2b21f00772c43955d238 | 2017-07-22 17:59:37.155127 | (1.7334086727351,2.18367457855493) | 9742 | 0 | 232
1444244 | 41bf6f8e4b89458c13fb408a7db05284 | 2017-07-22 17:59:37.339594 | (0.52773853763938,2.16670122463256) | 1 | 2470 | 820
1850387 | 6e0011c6db76075edd2aa7f81ec94129 | 2017-07-22 17:59:38.243091 | (0.0168232340365648,0.420973123982549) | 100 | 4395 | 321
1989439 | 6211907ac254a4a3ca54f90822a2095e | 2017-07-22 17:59:38.551637 | (0.0274275150150061,0.490507003851235) | 1850 | 5 | 74
2267673 | 898fdd54dcc5b14c27cf1c8b9afe2471 | 2017-07-22 17:59:39.170035 | (0.394239127635956,2.86229319870472) | 2892 | 6 | 917
(13 rows)
Partitioned Index Example
Assume that the preceding query criteria remain unchanged. We use partitioned indexes to test the performance.
This is to demonstrate the extreme effects of partitioned indexes. In practical scenarios, convergence level may not be that high, (e.g. converge by day or by ID hash). As long as convergence is possible, we can implement excellent performance.
postgres=# create index idx_tbl_4 on tbl using gist (pos) where crt_time between '2017-07-22 17:59:34' and '2017-07-22 17:59:40'
and (
c1 in (1,2,3,4,100,200,99,88,77,66,55)
or
c2 < 10
) ;
CREATE INDEX
Time: 8359.330 ms (00:08.359)
Reconstruct the extreme KNN optimization function
create or replace function ff(point, float8, int) returns setof record as
$$
declare
v_rec record;
v_limit int := $3;
begin
set local enable_seqscan=off; -- Force index that exits when scanned rows reach a specific number
for v_rec in
select *,
(pos <-> $1) as dist
from tbl
where
crt_time between '2017-07-22 17:59:34' and '2017-07-22 17:59:40'
and (
c1 in (1,2,3,4,100,200,99,88,77,66,55)
or
c2 < 10
)
order by pos <-> $1
loop
if v_limit <=0 then
-- raise notice "Sufficient data obtained"
return;
end if;
if v_rec.dist > $2 then
-- raise notice "All matching points returned"
return;
else
return next v_rec;
end if;
v_limit := v_limit -1;
end loop;
end;
$$
language plpgsql strict volatile;
Query performance:
postgres=# select * from ff(point '(0,0)', 5, 10000000) as t(id int, info text, crt_time timestamp, pos point, c1 int, c2 int, c3 int, dist float8);
id | info | crt_time | pos | c1 | c2 | c3 | dist
---------+----------------------------------+----------------------------+----------------------------------------+------+------+-----+-------------------
1850387 | 6e0011c6db76075edd2aa7f81ec94129 | 2017-07-22 17:59:38.243091 | (0.0168232340365648,0.420973123982549) | 100 | 4395 | 321 | 0.421309141034319
1989439 | 6211907ac254a4a3ca54f90822a2095e | 2017-07-22 17:59:38.551637 | (0.0274275150150061,0.490507003851235) | 1850 | 5 | 74 | 0.49127323294376
1444244 | 41bf6f8e4b89458c13fb408a7db05284 | 2017-07-22 17:59:37.339594 | (0.52773853763938,2.16670122463256) | 1 | 2470 | 820 | 2.23004532710301
542633 | c422d6137f9111d5c2dc723b40c7023f | 2017-07-22 17:59:35.334278 | (0.0631888210773468,2.2334903664887) | 4968 | 3 | 245 | 2.23438404136508
291950 | 1c2901689ab1eb7653d8ad972f7aa376 | 2017-07-22 17:59:34.776808 | (2.5384977646172,1.09820357523859) | 3 | 2131 | 360 | 2.76586731309247
1361182 | 7c4c1c208c2b2b21f00772c43955d238 | 2017-07-22 17:59:37.155127 | (1.7334086727351,2.18367457855493) | 9742 | 0 | 232 | 2.78803520274409
2267673 | 898fdd54dcc5b14c27cf1c8b9afe2471 | 2017-07-22 17:59:39.170035 | (0.394239127635956,2.86229319870472) | 2892 | 6 | 917 | 2.88931598221975
459345 | 9e46548f29d914019ce53a589be8ebac | 2017-07-22 17:59:35.148699 | (0.715781506150961,3.1486327573657) | 1 | 1276 | 8 | 3.22896754478952
570570 | fc57bfc6b7781d89b17c90417bd306f7 | 2017-07-22 17:59:35.39653 | (3.14926156774163,1.04107855819166) | 88 | 2560 | 561 | 3.31688000783581
1004806 | afe9f88cbebf615a7ae5f41180c4b33f | 2017-07-22 17:59:36.362027 | (1.13972157239914,3.28763140831143) | 3 | 1639 | 208 | 3.47958123047986
865508 | 34509c7f7640afaf288a5e1d38199701 | 2017-07-22 17:59:36.052573 | (3.12869547866285,2.34822122845799) | 2 | 65 | 875 | 3.91188935630676
104558 | c4699c933d4e2d2a10d828c4ff0b3362 | 2017-07-22 17:59:34.362508 | (4.20534582808614,2.43749532848597) | 99 | 4858 | 543 | 4.86069100130757
1069986 | 6b9f27bfde993fb0bae3336ac010af7a | 2017-07-22 17:59:36.507775 | (4.51995821669698,2.08761331625283) | 2 | 200 | 355 | 4.97877009299311
(13 rows)
Time: 0.592 ms
Great! Query time is reduced from over 200 milliseconds to less than 1 millisecond.
Summary of Optimization Methods
A retrospective look at the optimization methods:
1. Build different indexes for different data types.
For example, use GiST or SP-GiST indexes for space, B-tree or BRIN indexes for time, and GIN indexes for multiple object properties. The purpose of indexing is to reduce the scope of data scans.
2. Method five mentions data partitioning. The purpose of data partitioning is to organize data in an intentional manner, which means that data is intentionally organized to meet search requirements. For example, if time is a required query condition or a common query condition, then data can be split by time (partitions) to reduce the amount of data that needs to be scanned.
3. Method six describes index partitions. The purpose is similar to method 5. The difference between method 5 and method 6 is that partitions are used at the index level so that the data hit rate is improved directly when an index scan is performed.
4. The ctid merge scan in method seven is similar to multi-index bitmapAnd or bitmapOr scans in PostgreSQL. bitmapAnd/bitmapOr skips blocks that don't require scans, which ctid merge scan in method seven skips rows that don't require scans.
Merge the ctids obtained from multiple index scans. Skip the numbers of rows that don't require scans.
If a filtering condition can significantly reduce ctid (records) when other conditions are AND, it is unnecessary to use a ctid merge scan. Instead, use FILTER as one other condition. (This will slightly increase the CPU overhead.)
5. The best Kung Fu always features ultimate flexibility, freedom, and unlimited imagination of each movement.
PostgreSQL implements multi-index bitmapAnd or bitmapOr scans, which significantly improve the data hit rate for multiple conditions (indexes).
In addition, PostgreSQL features an excellent CBO estimation mechanism, which allows PostgreSQL to not always use all indexes for bitmap merge scans. This is also why the performance described in the section "Test the Performance of the Original SQL Queries — PostgreSQL Multi-index BitmapAnd BitmapOr Skip Scan" is better.
6. How to implement extreme optimization
Adopt method five or six and use fixable conditions as partition keys to partition data or indexes.
For other conditions, multi-index bitmapAnd or bitmapOr scans in PostgreSQL can be used to improve the data hit rate for multiple conditions (indexes).
As we can see, time needed for multidimensional retrieval from 50 million pieces of data by time, space, and object properties is reduced to 0.592 milliseconds.
7. For spatial data, in addition to using the GiST index, we can also use a BRIN index, which requires a lower cost. The performance of filtering is excellent after data is well organized.
Published at DZone with permission of Leona Zhang. See the original article here.
Opinions expressed by DZone contributors are their own.
Comments