TPCH Q16, sort is slow
Hi I am runnning TPCH (scale=10) on a single T4 against HeavyDB, I find Q16 is very slow. Q16
select
p_brand,
p_type,
p_size,
count(distinct ps_suppkey) as supplier_cnt
from
partsupp,
part
where
p_partkey = ps_partkey
and p_brand <> 'Brand#41'
and p_type not like 'PROMO ANODIZED%'
and p_size in (34, 10, 44, 45, 19, 26, 18, 47)
and ps_suppkey not in (
select
s_suppkey
from
supplier
where
s_comment like '%Customer%Complaints%'
)
group by
p_brand,
p_type,
p_size
order by
supplier_cnt desc,
p_brand,
p_type,
p_size;
And this is the execution time of different steps
6454ms total duration for executeRelAlgQuery
6454ms start(0ms) executeRelAlgQueryNoRetry RelAlgExecutor.cpp:588
0ms start(0ms) Query pre-execution steps RelAlgExecutor.cpp:589
6ms start(0ms) executeRelAlgSeq RelAlgExecutor.cpp:895
6ms start(0ms) executeRelAlgStep RelAlgExecutor.cpp:1043
6ms start(0ms) executeCompound RelAlgExecutor.cpp:2207
6ms start(0ms) executeWorkUnit RelAlgExecutor.cpp:3552
2ms start(0ms) compileWorkUnit NativeCodegen.cpp:2672
0ms start(2ms) markDeadRuntimeFuncs NativeCodegen.cpp:1860
0ms start(2ms) optimizeAndCodegenGPU NativeCodegen.cpp:1336
0ms start(2ms) ExecutionKernel::run ExecutionKernel.cpp:126
0ms start(2ms) fetchChunks Execute.cpp:2927
0ms start(2ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:696
0ms start(2ms) executePlanWithoutGroupBy Execute.cpp:3314
0ms start(2ms) launchGpuCode QueryExecutionContext.cpp:225
0ms start(3ms) collectAllDeviceResults Execute.cpp:2319
0ms start(3ms) reduceMultiDeviceResults Execute.cpp:1294
0ms start(3ms) reduceMultiDeviceResultSets Execute.cpp:1340
1ms start(3ms) compileWorkUnit NativeCodegen.cpp:2672
0ms start(5ms) markDeadRuntimeFuncs NativeCodegen.cpp:1860
0ms start(5ms) optimizeAndCodegenGPU NativeCodegen.cpp:1336
0ms start(5ms) ExecutionKernel::run ExecutionKernel.cpp:126
0ms start(5ms) fetchChunks Execute.cpp:2927
0ms start(5ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:696
1ms start(5ms) executePlanWithGroupBy Execute.cpp:3536
1ms start(5ms) launchGpuCode QueryExecutionContext.cpp:225
0ms start(7ms) getRowSet QueryExecutionContext.cpp:160
0ms start(7ms) reduceMultiDeviceResults Execute.cpp:1294
0ms start(7ms) reduceMultiDeviceResultSets Execute.cpp:1340
0ms start(7ms) resultsUnion Execute.cpp:1264
6446ms start(7ms) executeRelAlgSeq RelAlgExecutor.cpp:895
6446ms start(7ms) executeRelAlgStep RelAlgExecutor.cpp:1043
6446ms start(7ms) executeSort RelAlgExecutor.cpp:3149
1864ms start(8ms) executeWorkUnit RelAlgExecutor.cpp:3552
3ms start(8ms) compileWorkUnit NativeCodegen.cpp:2672
0ms start(9ms) getInstance HashJoin.cpp:295
0ms start(9ms) reify PerfectJoinHashTable.cpp:354
0ms start(9ms) getOneColumnFragment ColumnFetcher.cpp:78
New thread(17)
0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:727
0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:90
End thread(17)
0ms start(10ms) markDeadRuntimeFuncs NativeCodegen.cpp:1860
0ms start(10ms) optimizeAndCodegenGPU NativeCodegen.cpp:1336
0ms start(11ms) ExecutionKernel::run ExecutionKernel.cpp:126
0ms start(11ms) fetchChunks Execute.cpp:2927
44ms start(11ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:696
1815ms start(56ms) executePlanWithGroupBy Execute.cpp:3536
1814ms start(56ms) launchGpuCode QueryExecutionContext.cpp:225
0ms start(1870ms) getRowSet QueryExecutionContext.cpp:160
0ms start(1870ms) reduceMultiDeviceResults Execute.cpp:1294
0ms start(1870ms) reduceMultiDeviceResultSets Execute.cpp:1340
0ms start(1872ms) collectAllDeviceResults Execute.cpp:2319
0ms start(1872ms) reduceMultiDeviceResults Execute.cpp:1294
0ms start(1872ms) reduceMultiDeviceResultSets Execute.cpp:1340
4581ms start(1872ms) sort ResultSet.cpp:771
3ms start(1872ms) initPermutationBuffer ResultSet.cpp:849
36ms start(1875ms) createComparator ResultSet.h:838
4541ms start(1912ms) topPermutation ResultSet.cpp:1211
I have some observations: 1. We have the sort operator here because of the order by clause. Sort is very slow and it seems sort happens on CPU instead of GPU. I'm wondering why sort is performed on the CPU? Moreover, the query returns 27K records. Sorting 27K queries should not be that slow even on CPU? 2. I observe two executeRelAlgStep here, it seems output from the first step is moved out of GPU after the first step, before executing the second step, the intermediate result is moved from CPU to GPU. Why moving the data back and forth, is it because of multi-device reduction? Since reduction is happened on CPU, therefore all the intermediate results need to be moved out of GPU after a single step? But here, I only have one GPU, it still moves out the intermediate results, any explanation for that?
Thanks! Lily
-
Hi,
Yes probably the last sort operation is limiting your performance, Anyway also for an SF of 10, the 1800ms figure isn't exactly a stellar performer.
The join and the semijoin itself should go thru a perfect hash join, and after the hash index is created, they shouldn't take so much (on my workstation with a SF of 100 and 1 GPU it's taking around
100ms) Removing the distinct in the count(distinct ps_suppkey) and the operation from the sort, the query is taking around 400ms (with the sort of the first 3 fields taking 329ms), and using the approx_count_distinct function is taking around 1000ms (the sort is 327ms).So I guess we have some troubles while trying to sort the results of a reduced operation; we have an optimized path if just the top n results are needed, so if you try to run
heavysql> select p_brand, p_type, p_size, approx_count_distinct(ps_suppkey) as supplier_cnt from partsupp, part where p_partkey = ps_partkey and p_brand <> 'Brand#41' and p_type not like 'PROMO ANODIZED%' and p_size in (34, 10, 44, 45, 19, 26, 18, 47) and ps_suppkey not in ( select s_suppkey from supplier where s_comment like '%Customer%Complaints%' ) group by p_brand, p_type, p_size order by supplier_cnt desc,p_brand, p_type, p_size limit 10; p_brand|p_type|p_size|supplier_cnt Brand#33|MEDIUM BRUSHED TIN|34|611 Brand#11|STANDARD BURNISHED NICKEL|26|591 Brand#55|LARGE BURNISHED STEEL|10|590 Brand#14|SMALL BRUSHED BRASS|45|587 Brand#44|LARGE BRUSHED BRASS|18|578 Brand#44|STANDARD BURNISHED BRASS|18|577 Brand#14|ECONOMY BRUSHED COPPER|19|575 Brand#54|ECONOMY PLATED COPPER|18|575 Brand#24|STANDARD PLATED COPPER|34|573 Brand#25|PROMO BRUSHED COPPER|47|571 10 rows returned. Execution time: 833 ms, Total time: 836 ms
and in this case, the sort is done in parallel, and it takes 64ms
without limit, the query returns the results in more than 3000ms with the sort alone taking 2464ms.
sorting for the dimension only 1100ms (sort 334)
sorting the count distinct only so writing the query this way
select p_brand, p_type, p_size, approx_count_distinct(ps_suppkey) as supplier_cnt from partsupp, part where p_partkey = ps_partkey and p_brand <> 'Brand#41' and p_type not like 'PROMO ANODIZED%' and p_size in (34, 10, 44, 45, 19, 26, 18, 47) and ps_suppkey not in ( select s_suppkey from supplier where s_comment like '%Customer%Complaints%' ) group by p_brand, p_type, p_size order by supplier_cnt desc;
just the sort itself takes 1960ms, so there is something we are doing bad, but I don't know what can be the reason. I opened an issue about something similar (sort with top n) but it hasn't be prioritized.
Candido
Please sign in to leave a comment.
Comments
1 comment