To address this, HeavyDB introduced hash-join-based geospatial join processing. This approach eliminates the naive and expensive nested-loop-style join evaluation by leveraging a more efficient hash join framework. Its main benefit is reducing the number of matching tests needed between the two geometric tables. Essentially, it performs a hash table lookup to find a set of candidate pairs of geometries, and for each candidate pair it computes the geospatial function described in the query. This optimization significantly reduces the number of expensive geospatial function evaluations. In terms of algorithmic complexity, the previous nested loop join approach must evaluate the expensive geospatial function N * M times, assuming the cardinalities of the two tables are N (> 0) and M (> 0), respectively. For the hash join approach, assuming we build a hash table on the table with N rows, the worst case is still O(N * M) evaluations of the geospatial function. However, except for occasional edge cases such as extremely skewed datasets, the actual number is usually much smaller than N * M, so we see a noticeable and significant performance improvement in most cases.
Let's consider an example where we have a dataset of 1 million geolocated tweets and want to determine the ZIP zones to which each tweet belongs. Using a simple nested-loop approach, we must test each of the 1 million tweet points against 33,000 ZIP polygons, resulting in 33 billion evaluations of the expensive geospatial function.
In contrast, with the hash-join-based geospatial join, each tweet point is checked only against polygons that are probable matches. For instance, the first tweet point may be checked against three polygons instead of all the polygons in the table, and a point can be skipped entirely if the hash table lookup returns no matching polygons.
The theory behind our hash-join-based geospatial join is to exploit the bounding box of each geometry. In other words, performing the hash join before computing the geospatial function can be seen as filtering out unnecessary geometries. In the above example, it filters out unnecessary polygons and computes the expensive geospatial function only against the matched polygons, which significantly reduces the number of geometry checks required.
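To illustrate the idea only (this is not what HeavyDB executes internally), the bounding-box prefilter can be pictured as the following conceptual SQL sketch over the tweets and us_zipcodes tables from this example, assuming the ST_X/ST_Y and ST_XMin/ST_XMax/ST_YMin/ST_YMax accessors are available:
-- Conceptual sketch only: a cheap bounding-box test prunes polygons
-- before the exact (and expensive) ST_Contains test runs.
SELECT count(*)
FROM tweets
JOIN us_zipcodes
ON ST_X(tweets.geom) BETWEEN ST_XMin(us_zipcodes.geom) AND ST_XMax(us_zipcodes.geom)
AND ST_Y(tweets.geom) BETWEEN ST_YMin(us_zipcodes.geom) AND ST_YMax(us_zipcodes.geom)
AND ST_Contains(us_zipcodes.geom, tweets.geom)
The hash join achieves the same pruning effect far more cheaply, by bucketing bounding boxes into a hash table and looking up each point's bucket.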
We also need to mention table ordering in the query's FROM clause. Until the 6.2 release, the table containing the POINTS had to be specified before the one containing the POLYGONS in the query's FROM clause to enable the hash-join acceleration for a geospatial join between points and polygons. Likewise, the order of the arguments of the geospatial join functions could matter for enabling this optimization. Since the 6.2 release, we improved the related logic, so our query optimizer supports automatic FROM-clause table re-ordering and argument re-ordering where possible. Nevertheless, the following guidelines apply when writing a geospatial join query:
- Use the correct order of tables in the FROM clause.
- In joins involving polygons, the table with the polygons needed to be placed on the right side of the join due to optimizer limitations before version 6.2.
- In Point-in-Polygon joins, use the correct argument order for the geometric functions.
Here is an example query that falls back to the expensive nested loop join:
SELECT count(*)
FROM tweets
JOIN us_zipcodes
ON ST_Contains(tweets.geom, us_zipcodes.geom)
Hash join failed, reason(s): No equijoin expression found |
Cannot fall back to loop join for non-trivial inner table size
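In this case, writing the predicate with the polygon column as the first argument (the same form used in the troubleshooting example later in this article) lets the optimizer apply the hash-join acceleration:
SELECT count(*)
FROM tweets
JOIN us_zipcodes
ON ST_Contains(us_zipcodes.geom, tweets.geom)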
We support three types of geospatial joins that use the hash join framework, and we introduce them one by one in the rest of this article:
- Point in polygon, using the ST_Contains and ST_Intersects geometric functions.
- Point-to-point distance, using the ST_Distance and ST_DWithin geometric functions.
- Polygon in polygon, using ST_Intersects.
2. Point-in-Polygon (aka. PiP join)
This query type determines whether a point is located within one or more polygons using the ST_Contains or ST_Intersects geospatial functions. For instance, we can use this join type when analyzing a list of geolocated tweets to identify the specific neighborhood or building from which each tweet was sent.
Note that, before the 6.2 release, it was necessary to specify the table containing the POINTS in the FROM clause before the table containing the POLYGONS to enable hash-join acceleration, as described above.
Here is a schematic representation of the query that will be optimized by geospatial hash join:
SELECT ...
FROM points_table
JOIN polygons_table
ON ST_Contains|ST_Intersects(polygons_table.geom, points_table.geom)
Performance
Table 1 shows the benchmark results when varying the number of points and/or polygons in the dataset. As you can see, the performance difference between the loop join and the accelerated geospatial join via hash join is noticeable, with hot-run speedups ranging from 3.8x up to 3,500x (cold runs reach up to 1,900x).
One thing worth mentioning is the performance difference between the cold run and the hot run. HeavyDB supports various caching layers, and the main reason for the difference here is the hash table recycler. When our optimizer decides to keep a hash table in the cache, we can recycle it in a subsequent run, e.g., for a query containing the same join condition. Once we recycle it, we can skip the hash-table-building step, which improves performance. In Table 1, the difference between the "Hash Join" and "Hash join cold run" columns represents this improvement. We also support a compressed coordinate space for the ST_Intersects geospatial function, which can yield even faster query execution.
Table 1. A performance benchmark result of geospatial join query using Point-in-Polygon
Cardinalities / Function | Loop Join (ms) | Hash Join (ms) | Hash join cold run (ms) | Speed up | Speed up cold run |
---|---|---|---|---|---|
Points 5M Polys 400K ST_CONTAINS | 7,015,500 | 1,812 | 3,895 | 2871x | 1800x |
Points 5M Polys 400K ST_INTERSECTS | 8,239,000 | 2,463 | 4,523 | 3500x | 1900x |
Points 1K Polys 400K ST_CONTAINS | 1,504 | 400 | 2,520 | 3.9x | 0.6x |
Points 1K Polys 400K ST_INTERSECTS | 1,538 | 400 | 2,455 | 3.8x | 0.6x |
Points 13M Polys 300 ST_CONTAINS | 478,170 | 2,682 | 2,840 | 178x | 168x |
Points 13M Polys 300 ST_INTERSECTS | 1,030,998 | 4,750 | 5,748 | 217x | 179x |
Points 100K Polys 300 ST_CONTAINS | 3,518 | 59 | 295 | 60x | 12x |
Points 100K Polys 300 ST_INTERSECTS | 7,525 | 68 | 295 | 60x | 13x |
The timings were recorded on a GTX 1050 GPU (768 CUDA cores, FP64:FP32 ratio of 1:32) running HeavyDB version 6.4 on a system with 64 GB of RAM.
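Regarding the compressed coordinate space mentioned above, here is a minimal sketch of a table definition that stores points with 32-bit compressed coordinates, assuming the standard HeavyDB geo-column DDL (the table and column names are illustrative):
-- Sketch: point geometries stored with 32-bit compressed coordinates
CREATE TABLE tweets_compressed (
  tweet_id BIGINT,
  geom GEOMETRY(POINT, 4326) ENCODING COMPRESSED(32)
);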
3. Point-to-Point distance join (aka. range hash join)
The Point-to-Point distance join was introduced after we added support for the Point-in-Polygon (PiP) hash join. In terms of geospatial functions, ST_Distance and ST_DWithin fall into this category. The details of this optimization strategy are beyond the scope of this article, but briefly speaking, we use a specialized scheme that handles point coordinates as the hash key instead of using a bounding box.
As mentioned for the Point-in-Polygon query, the optimizer will automatically re-order the tables so that the hash table is built on the smaller table. However, please check the argument order if you are using a version of HeavyDB prior to 6.2 (we recommend manually placing the expression for the smaller table as the first argument of the ST_Distance or ST_DWithin function to enable the hash-join acceleration).
Our hash join framework will accelerate the following example:
SELECT count(*)
FROM table_100k
JOIN table_10000k
ON ST_Distance(geom_field_100k,geom_field_10000k) < 0.001
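The same join can also be written with ST_DWithin, where the third argument is the distance threshold (a sketch using the same hypothetical tables):
SELECT count(*)
FROM table_100k
JOIN table_10000k
ON ST_DWithin(geom_field_100k, geom_field_10000k, 0.001)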
Performance
Table 2 shows the speedup achieved over a loop join, ranging from 50x to 36,700x, when examining various distance thresholds between a dataset of 14 million points and one of 1 million points.
Table 2. A performance benchmark result of geospatial join query using Point-to-Point distance join
Distance (rad) / Returned Rows | Loop Join (ms) | Range hash Join (ms) | Range hash join cold run (ms) | Speed up | Speed up cold run |
---|---|---|---|---|---|
0.0001 / 1M | 24,520,533 | 661 | 847 | 36700x | 29170x |
0.001 / 384M | 24,520,533 | 8,112 | 8,310 | 3013x | 2973x |
0.01 / 36737M | 24,520,533 | 486,885 | 487,003 | 50x | 50x |
One thing worth mentioning is the importance of selecting an appropriate distance range for the query. The queries used for the benchmark in Table 2 calculate the distance between the drop-off points of NYC taxi trips and buildings. Therefore, it is more sensible to use an approximate distance range of 10 to 100 meters (0.0001 to 0.001 degrees) rather than a range such as 1 kilometer (0.01 degrees).
The following are some general guidelines when dealing with range hash join to achieve better performance:
- Avoid relying on automatic table/column ordering. Instead, manually specify the larger table first in the FROM clause.
- Use the smallest possible radius in the query, because larger radii can significantly slow down the query.
4. Polygon-in-Polygon Join
Disclaimer: to enable hash-join acceleration for this join type, the server configuration parameter "enable-hashjoin-many-to-many" must be set to TRUE.
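For example, assuming the standard heavy.conf configuration-file format (the file location depends on your installation), the parameter can be enabled as follows:
# heavy.conf (sketch)
enable-hashjoin-many-to-many = true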
This join type checks the relationship between two polygons. As an example, suppose we want to know which New York City buildings span two or more ZIP codes, as in the following query:
SELECT t1.id
FROM nyc_buildings t1
JOIN us_zip_codes t2
ON ST_Intersects(t2.geom,t1.geom)
Performance
Table 3 shows the benchmark results of joining 1M polygons against {300, 33K} polygons. Note that we vary the complexity of the polygons to show how much the performance can differ when computing the query using the ST_Intersects geospatial function (please compare the results of the first and third rows in Table 3). Compared with the first two join types, the speedup over the loop join is not as large (those reached up to 36,700x), but we still get a meaningful performance improvement in our benchmark scenarios. Specifically, in the second example, our hash-join acceleration was 9.5 times faster than the loop-join approach when matching 1M NYC building polygons against the 33K polygons of the US ZIP code database.
Table 3. A performance benchmark result of geospatial join query using Polygon-in-Polygon join
Cardinalities | Loop Join (ms) | Hash Join (ms) | Hash Join cold run (ms) | Speed up | Speed up cold run |
---|---|---|---|---|---|
1M Polys on 300 Complex Polys | 674820 | 351700 | 378789 | 1.9x | 1.9x |
1M Polys on 33K Simple Polys (overlaps happen on 300 Polys only) | 268347 | 28240 | 31269 | 9.5x | 8.6x |
1M Polys on 300 Simple Polys (all overlapped) | 51878 | 22954 | 23079 | 2.3x | 2.2x |
5. Troubleshooting: System Logs and Parameters
Sometimes a query using a nested loop join can stall the entire system, so it is helpful to know which join strategy was used to process your query: nested loop join or hash join.
For this purpose, we can analyze the generated code that HeavyDB uses internally to process the query. We can obtain the code (written in LLVM IR) by using our EXPLAIN keyword (link).
Let's consider the following query with the "EXPLAIN" keyword:
EXPLAIN
SELECT count(*)
FROM tweets
JOIN us_zipcodes
ON ST_Contains(us_zipcodes.geom,tweets.geom)
You will see lines of code similar to the following:
explanation
IR for the GPU:
===============
...
; Function Attrs: mustprogress nounwind uwtable
... void @multifrag_query_hoisted_literals(...)
...
%3 = call i64 @get_bucket_key_for_range_compressed(...)
%5 = call i64 @get_bucket_key_for_range_compressed(...)
....
If HeavyDB accelerates your Point-in-Polygon or Point-to-Point geospatial join query, you will find get_bucket_key_for_range_compressed or get_bucket_key_for_range_double in the generated code.
For a Polygon-to-Polygon geospatial hash join, you will find get_num_buckets_for_bounds or get_candidate_rows instead.
Example of a search result within a browser window using Immerse
Also, we can automate the process using heavysql combined with the grep command, as follows:
echo "explain select count(*) from tweets join us_zipcodes on st_contains(us_zipcodes.geom,tweets.geom);" \
| bin/heavysql -p HyperInteractive test_geom | grep "@get_bucket_key_for_range"
%3 = call i64 @get_bucket_key_for_range_compressed(...)
%5 = call i64 @get_bucket_key_for_range_compressed(...)
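Similarly, for a Polygon-to-Polygon join we could grep for the many-to-many helper functions (a sketch reusing the same hypothetical database and the tables from the earlier examples):
echo "explain select t1.id from nyc_buildings t1 join us_zip_codes t2 on st_intersects(t2.geom,t1.geom);" \
| bin/heavysql -p HyperInteractive test_geom | grep -E "get_num_buckets_for_bounds|get_candidate_rows"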
Analyzing system logs for more information about hash join acceleration
Examining the system logs can provide valuable insights, such as which type of join is used and how much memory is needed to process it. This information is particularly helpful for system administrators to diagnose problematic cases that may lead to inefficient resource usage and poor query performance.
To obtain such detailed information, we must set the logging server configuration parameter `log-severity` to at least DEBUG1 (or DEBUG2). Explanations of the system log and the server configuration parameters can be found in our documentation: link1 and link2.
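For example, assuming the standard heavy.conf configuration-file format, the log level can be raised as follows:
# heavy.conf (sketch)
log-severity = DEBUG1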
Here is a part of the system log indicating that the join will be evaluated with a nested loop join:
2023-05-23T12:50:10.670126 I 460823 0 7 ExpressionRewrite.cpp:857
Unable to rewrite ST_cContains_MultiPolygon_Point to ********** conjunction.
Cannot build hash table over LHS type. Check to join order.
* Depending on the database version, "overlaps" or "bounding box" can be found here.
In contrast, when the hash-join acceleration is applied, the log shows the hash table being tuned and built:
OverlapsJoinHashTable.cpp:832 Final tuner output: Step Num: 9, Threshold: (0.179337, 0.024150), Step Size: 2.000000, Min: 0.000000 with properties entry_count: 2199892, emitted_keys 1357812, hash table size 58228656, keys per bin 1.234435
OverlapsJoinHashTable.cpp:834 Final bucket sizes:
OverlapsJoinHashTable.cpp:836 dim[0]: 0.14347
OverlapsJoinHashTable.cpp:836 dim[1]: 0.0214668
OverlapsTuningParamRecycler.cpp:61 [Overlaps Join Hashtable's Auto Tuner's Parameters, CPU] Put auto tuner parameters for the overlaps hash join qual to cache
OverlapsJoinHashTable.cpp:1223 Building overlaps join hash table on CPU.
OverlapsJoinHashTable.cpp:1794 Checking CPU hash table cache.
BaselineHashTableBuilder.h:299 Initializing CPU Join Hash Table with 2199892 hash entries and 5757596 entries in the one to many buffers
BaselineHashTableBuilder.h:302 Total hash table size: 58228656 Bytes
HashtableRecycler.cpp:108 [Overlaps Join Hashtable, CPU] Put item to the cache
BaselineHashTableBuilder.h:485 Initializing GPU Hash Table for device 0 with 2199892 hash entries and 5757596 entries in the OneToMany buffer
BaselineHashTableBuilder.h:488 Total hash table size: 58228656 Bytes
OverlapsJoinHashTable.cpp:148 Built geo hash table OneToMany in 375 ms
System parameters related to Geospatial hash joins
System administrators can control the system's behavior by changing server parameters related to geospatial hash joins, as shown in Table 4. These include enabling/disabling specific optimization strategies, changing the memory reserved for hash-table caching, and so on.
Table 4. A list of system parameters related to geospatial hash join
Parameter Name | Default Value | Description |
---|---|---|
enable-overlaps-hashjoin | TRUE | Enable hash join acceleration for geospatial join functions |
enable-distance-rangejoin | TRUE | Enable Point-to-Point distance range hash join |
enable-hashjoin-many-to-many | FALSE | Enable Polygon-to-Polygon hash table generation |
hashtable-cache-total-bytes | 4GB | The total size of the hash table cache. When the memory needed to keep hash tables (including those for geospatial joins) exceeds this value, the system may evict some hash tables from the cache to make room for new ones. |
max-cacheable-hashtable-size-bytes | 2GB | The maximum size of a single hash table that can be cached. If this value is too small, most hash tables will not be cached, which may cause noticeable performance degradation for hash join queries. |
The default values are as of the 7.0.2 release.
The general rule of thumb is to modify the memory parameters only when necessary. For instance, if the system rebuilds the same hash table for every query, it is worth checking these parameters to make sure the table can be cached. In contrast, when the system does not have enough memory, it is worth checking whether these parameters are set too large.
Query Hints related to Geospatial Hash Joins
The default behavior of HeavyDB regarding geospatial joins, which enables various caches and optimization logic to find the best settings for join processing, is suitable for most cases. But we may need to control those settings for specific queries. To do that, we can use the query hints (documentation link) related to geospatial joins. Table 5 lists those query hints.
Table 5. A list of query hints related to geospatial joins.
Name | Description | Values |
---|---|---|
overlaps_no_cache | Do not cache a join hash table for geospatial join processing. Note that HeavyDB currently only supports a cache for CPU hash tables. | N/A |
overlaps_allow_gpu_build | Force using the GPU to build a hash table for geospatial joins instead of the CPU. Currently, we build hash tables for geospatial joins on the CPU regardless of the device type used to execute the query. | N/A |
overlaps_bucket_threshold | Set the threshold for how many hash keys are required to represent a geometry in the hash table. A geometry requiring N hash keys belongs to N hash slots. A larger threshold decreases N for each geometry, meaning more geometries belong to a single hash key. This reduces the number of unique hash entries and therefore the hash table's memory usage, but it can degrade join performance: each matching hash entry can represent many geometries, increasing the number of checks with the expensive geospatial function. | A DOUBLE between 0 and 90 |
overlaps_max_size | Set the maximum total size of the hash table for geospatial joins (in bytes). If the size of the hash table exceeds this value, we fall back to a nested loop join. | An integer greater than 0 |
overlaps_keys_per_bin | Set the maximum number of geometries each hash table entry (or bucket) can hold. This parameter is related to `overlaps_bucket_threshold`, but it controls the bucket's property explicitly. As with the parameter above, using a larger value reduces memory usage but is also likely to increase the number of overlaps between geometries, potentially slowing down the join operation. | A DOUBLE between 0 and the maximum DOUBLE value |
Example queries using query hint(s) related to geospatial hash join
Suppose a table used to build a hash table needs to be updated frequently. In such a case, we can use the overlaps_no_cache hint to prevent keeping the hash table in the cache. Furthermore, since we know the hash table needs to be built for every run, we can adjust certain parameters to improve the hash table build time (e.g., overlaps_bucket_threshold; this may degrade the performance of hash table probing). Additionally, we may consider building the geospatial hash table on a GPU device using the overlaps_allow_gpu_build hint (as explained above, HeavyDB builds geospatial join hash tables on the CPU by default). By combining such query hints, we can experimentally find the best parameter settings for a geospatial join query.
The following query represents a geospatial join between 100 million points and 1 million polygons. Let's assume that the polygon table (polys_table) undergoes frequent changes.
SELECT count(*)
FROM points_table
JOIN polys_table
ON ST_Contains(poly_column,point_column)
Looking for the "Final tuner output" in the database logs (from a DEBUG2-level log):
OverlapsJoinHashTable.cpp:878 Final tuner output: Step Num: 9,
Threshold: (0.000028, 0.000035),
Step Size: 2.000000, Min: 0.000000 with properties entry_count: 18319768,
emitted_keys 14860893, hash table size 499118004, keys per bin 1.622389
...
OverlapsJoinHashTable.cpp:150 Built geo hash table OneToMany in 1296 ms
* Depending on the database version, "Overlaps" or "BoundingBoxIntersect" can be found here.
We can see that our query optimizer uses the thresholds (0.000028, 0.000035) related to `overlaps_bucket_threshold`, and the resulting hash table size is around 499 MB. Also, it took 1296 ms to build the hash table on the CPU, and the total query runtime for this setting was 2000 ms.
Since we assumed the polygon table is updated frequently, we decide to use the overlaps_no_cache hint. Also, adjusting overlaps_bucket_threshold can improve resource usage and increase performance; let's change it from 0.000028 to 0.0001. As a result, the hash table size drops from 499 MB to 227 MB, and the CPU hash table build time improves significantly, from 1296 ms to 486 ms. Although there is a slight decrease in join (probing) performance, the total runtime improved noticeably, from 2000 ms to 1400 ms.
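A sketch of the query at this intermediate step, using the hint syntax shown in the final query at the end of this section:
SELECT /*+ overlaps_no_cache,
           overlaps_bucket_threshold(0.0001) */
       count(*)
FROM points_table
JOIN polys_table
ON ST_Contains(poly_column, point_column)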
We can verify all of this in the server logs:
OverlapsJoinHashTable.cpp:562 Setting overlaps bucket threshold
'overlaps_hashjoin_bucket_threshold' via query hint: 0.0001
OverlapsJoinHashTable.cpp:583 User requests to skip caching overlaps join hashtable
and its tuned parameters for this query
OverlapsJoinHashTable.cpp:193 Computing x and y bucket sizes for overlaps hash join
with maximum bucket size 0.000100, 0.000100
BaselineHashTable.h:110 Initialize a CPU baseline hash table with join type OneToMany,
hash table size: 227848148 Bytes, # hash entries: 8215820,
# entries stored in the payload buffer: 7667117, rowid size: 4 Bytes
...
OverlapsJoinHashTable.cpp:150 Built geo hash table OneToMany in 486 ms
To further optimize the process, we can consider using the GPU to build the hash table via the overlaps_allow_gpu_build query hint. In some cases, this substantially reduces the hash table build time; it was only 82 ms in our example, as you can see in the log below:
...
OverlapsJoinHashTable.cpp:1394 Building overlaps join hash table on GPU.
OverlapsJoinHashTable.cpp:150 Built geo hash table OneToMany in 82 ms
...
After applying all the techniques described above, the query elapsed time decreased from 1400 ms to 965 ms. Table 6 summarizes the improvement from each query hint we applied.
Table 6. Performance improvements per query hint
Query hint | Hash Table Build time (ms) | Hash Table Size (Megabytes) | Total query time (ms) | Speedup |
---|---|---|---|---|
None | 1296 | 499 | 2000 | N/A |
overlaps_bucket_threshold | 486 | 227 | 1400 | 1.42x |
overlaps_allow_gpu_build | 82 | 227 | 965 | 2.07x |
This is the final form of the query, with the hints added inside the /*+ ... */ comment, as follows:
SELECT /*+ overlaps_no_cache,
overlaps_bucket_threshold(0.0001),
overlaps_allow_gpu_build */
count(*)
FROM points_table
JOIN polys_table
ON ST_Contains(poly_column,point_column)
In most situations, our query optimizer finds reasonable parameters for a geospatial join query, but query hints give you a chance to find even better settings that our optimizer may not be able to discover on its own.
6. Conclusion
If you want to understand in depth how our geospatial hash joins are implemented, we recommend reading this fantastic article authored by one of our engineers. It provides detailed insights into the inner workings of the joins, explaining the underlying algorithms and the optimizations employed to enhance performance.
Making Geospatial joins interactive at scale
We hope you will delve into the technical aspects and learn the techniques used to accelerate geospatial joins. This firsthand knowledge will allow you to grasp the nuances of the implementation and make informed decisions when working with geometric data in your queries.