How to find out, what takes most time in query processing
Hello , where can i find out, what takes most time in query proccessing ?
I have following query, but it takes almost 40 sec to proccess on 10GB dataset
select s_acctbal,s_name,n_name,p_partkey,p_mfgr,s_address,s_phone,s_comment from part,supplier,partsupp,nation,region where p_partkey = ps_partkey and s_suppkey = ps_suppkey and p_size = 15 * and p_type like '%BRASS' and s_nationkey = n_nationkey and n_regionkey = r_regionkey and r_name = 'EUROPE' and ps_supplycost = (select min(ps_supplycost) from partsupp,supplier,nation,region * where p_partkey = ps_partkey and s_suppkey = ps_suppkey and s_nationkey = n_nationkey and n_regionkey = r_regionkey and r_name = 'EUROPE') order by s_acctbal desc,n_name,s_name,p_partkey fetch next 100 rows only;
-
Hi @Sipy,
You can start the database by setting the parameter enable-debug-timer to true. The timings of macro steps of the query, like the execution in CPU/GPU, projection, and the "order by" will be written to the log. .
We haven't anything to return the timings of single steps of the query, like group by, joins.
It would be difficult considering the extreme parallelism degree of GPU execution.
I tried query on 100GB, using 1 or 2 GPUs, and the response time is between 4000ms and 1200ms.
heavysql> select s_acctbal,s_name,n_name,p_partkey,p_mfgr,s_address,s_phone,s_comment from part,supplier,partsupp,nation,region where p_partkey = ps_partkey and s_suppkey = ps_suppkey and p_size = 15 and p_type like '%BRASS' and s_nationkey = n_nationkey and n_regionkey = r_regionkey and r_name = 'EUROPE' and ps_supplycost = (select min(ps_supplycost) from partsupp,supplier,nation,region where p_partkey = ps_partkey and s_suppkey = ps_suppkey and s_nationkey = n_nationkey and n_regionkey = r_regionkey and r_name = 'EUROPE') order by s_acctbal desc,n_name,s_name,p_partkey limit 10; s_acctbal|s_name|n_name|p_partkey|p_mfgr|s_address|s_phone|s_comment 9999.70|Supplier#000239544|UNITED KINGDOM|6739531|Manufacturer#4|,6oexY7ft68JKJCN1SQ LUO5Xf|33-509-584-9496|ets are. blithely special accounts wake across t 9999.65|Supplier#000143654|FRANCE|2393647|Manufacturer#3|HhHCZ,RosE8He4uYvyIDqsPZe,7cZiJh1y9,|16-166-504-5864|special accounts. never dogged deposits 9999.49|Supplier#000615014|GERMANY|12365001|Manufacturer#3|4C,uOuJDig7z,4e|17-780-902-4027|pinto beans affix along th 9999.28|Supplier#000494480|ROMANIA|17244462|Manufacturer#3|hNB2qcYmnyeqrd em,o29TxzLwwrl|29-756-312-1779|ly. slyly final pains detect furiously qu 9998.87|Supplier#000826281|ROMANIA|14326252|Manufacturer#3|LkGmLSAjbjgKS5ZRepL|29-775-451-4774|slyly silent accounts affix furiously across the final, even ideas; tithe 9998.56|Supplier#000039514|UNITED KINGDOM|7539499|Manufacturer#2|fUN95rpikfiqeGo,okl,27ItSq9fMpr|33-317-828-7758|olites use about the blithely regular warhorses. carefully final instructions a 9997.89|Supplier#000319666|RUSSIA|19319665|Manufacturer#1|SZtdx3rzXnFuiVuJzKathV9|32-972-151-6038|furiously ironic pearls. furiously regular foxes along the furiously 9997.85|Supplier#000718770|ROMANIA|718769|Manufacturer#1|2173woKsCg7zpHCqhiEXFTmYYqpT4XONnRAsUryE|29-663-409-2865|posits sleep across the blithely express requests. 9997.83|Supplier#000348318|FRANCE|9348317|Manufacturer#4|AvfnvjCFAg1aNpj|16-718-707-5676|e pending requests: furiously even deposits boost furiously slyly even requests. 9997.73|Supplier#000329974|RUSSIA|4079969|Manufacturer#3|eKtIadopsE|32-268-488-7178|uickly. furiously ironic requests are slyly above the regular 10 rows returned. Execution time: 1133 ms, Total time: 1179 ms
Could you enable the debug timer and post the log of your execution? If the default storage directory is used the log file is
/var/lib/heavyai/log/omnisci_server.INFO
the result of my execution is
1107ms total duration for executeRelAlgQuery 1107ms start(0ms) executeRelAlgQueryNoRetry RelAlgExecutor.cpp:516 0ms start(0ms) Query pre-execution steps RelAlgExecutor.cpp:517 1106ms start(0ms) executeRelAlgSeq RelAlgExecutor.cpp:814 4ms start(0ms) executeRelAlgStep RelAlgExecutor.cpp:962 4ms start(0ms) executeCompound RelAlgExecutor.cpp:2116 4ms start(0ms) executeWorkUnit RelAlgExecutor.cpp:3388 1ms start(0ms) compileWorkUnit NativeCodegen.cpp:2589 0ms start(2ms) markDeadRuntimeFuncs NativeCodegen.cpp:1849 0ms start(2ms) optimizeAndCodegenGPU NativeCodegen.cpp:1326 0ms start(2ms) ExecutionKernel::run ExecutionKernel.cpp:125 0ms start(2ms) fetchChunks Execute.cpp:2941 0ms start(2ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:781 0ms start(2ms) executePlanWithoutGroupBy Execute.cpp:3362 0ms start(2ms) launchGpuCode QueryExecutionContext.cpp:223 0ms start(3ms) collectAllDeviceResults Execute.cpp:2333 0ms start(3ms) reduceMultiDeviceResults Execute.cpp:1309 0ms start(3ms) reduceMultiDeviceResultSets Execute.cpp:1355 1ms start(3ms) compileWorkUnit NativeCodegen.cpp:2589 0ms start(4ms) markDeadRuntimeFuncs NativeCodegen.cpp:1849 0ms start(4ms) optimizeAndCodegenGPU NativeCodegen.cpp:1326 0ms start(4ms) ExecutionKernel::run ExecutionKernel.cpp:125 0ms start(4ms) fetchChunks Execute.cpp:2941 0ms start(4ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:781 0ms start(4ms) executePlanWithGroupBy Execute.cpp:3583 0ms start(4ms) launchGpuCode QueryExecutionContext.cpp:223 0ms start(5ms) getRowSet QueryExecutionContext.cpp:158 0ms start(5ms) reduceMultiDeviceResults Execute.cpp:1309 0ms start(5ms) reduceMultiDeviceResultSets Execute.cpp:1355 0ms start(5ms) resultsUnion Execute.cpp:1279 156ms start(5ms) executeRelAlgStep RelAlgExecutor.cpp:962 156ms start(5ms) executeCompound RelAlgExecutor.cpp:2116 156ms start(6ms) executeWorkUnit RelAlgExecutor.cpp:3388 2ms start(6ms) compileWorkUnit NativeCodegen.cpp:2589 0ms start(6ms) getInstance HashJoin.cpp:295 0ms start(6ms) reify PerfectJoinHashTable.cpp:354 0ms start(6ms) getOneColumnFragment ColumnFetcher.cpp:78 0ms start(6ms) getOneColumnFragment ColumnFetcher.cpp:78 New thread(170) 0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721 0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89 End thread(170) New thread(171) 0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721 0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89 End thread(171) 0ms start(7ms) getInstance HashJoin.cpp:295 0ms start(7ms) reify PerfectJoinHashTable.cpp:354 0ms start(7ms) getOneColumnFragment ColumnFetcher.cpp:78 0ms start(7ms) getOneColumnFragment ColumnFetcher.cpp:78 New thread(172) 0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721 0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89 End thread(172) New thread(173) 0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721 0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89 End thread(173) 0ms start(7ms) getInstance HashJoin.cpp:295 0ms start(7ms) synthesize_metadata InputMetadata.cpp:306 0ms start(7ms) reify PerfectJoinHashTable.cpp:354 0ms start(7ms) getOneColumnFragment ColumnFetcher.cpp:78 0ms start(7ms) ColumnarResults ColumnarResults.cpp:64 0ms start(7ms) getOneColumnFragment ColumnFetcher.cpp:78 New thread(174) 0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721 0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89 End thread(174) New thread(175) 0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721 0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89 End thread(175) 0ms start(8ms) markDeadRuntimeFuncs NativeCodegen.cpp:1849 0ms start(8ms) optimizeAndCodegenGPU NativeCodegen.cpp:1326 0ms start(8ms) ExecutionKernel::run ExecutionKernel.cpp:125 0ms start(8ms) fetchChunks Execute.cpp:2941 New thread(47) 0ms start(0ms) ExecutionKernel::run ExecutionKernel.cpp:125 0ms start(0ms) fetchChunks Execute.cpp:2941 0ms start(0ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:781 131ms start(0ms) executePlanWithGroupBy Execute.cpp:3583 131ms start(0ms) launchGpuCode QueryExecutionContext.cpp:223 0ms start(131ms) getRowSet QueryExecutionContext.cpp:158 0ms start(131ms) reduceMultiDeviceResults Execute.cpp:1309 0ms start(131ms) reduceMultiDeviceResultSets Execute.cpp:1355 End thread(47) 0ms start(8ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:781 130ms start(8ms) executePlanWithGroupBy Execute.cpp:3583 130ms start(8ms) launchGpuCode QueryExecutionContext.cpp:223 0ms start(139ms) getRowSet QueryExecutionContext.cpp:158 0ms start(139ms) reduceMultiDeviceResults Execute.cpp:1309 0ms start(139ms) reduceMultiDeviceResultSets Execute.cpp:1355 21ms start(140ms) collectAllDeviceResults Execute.cpp:2333 21ms start(140ms) reduceMultiDeviceResults Execute.cpp:1309 21ms start(140ms) reduceMultiDeviceResultSets Execute.cpp:1355 4ms start(162ms) executeRelAlgStep RelAlgExecutor.cpp:962 4ms start(162ms) executeCompound RelAlgExecutor.cpp:2116 4ms start(162ms) executeWorkUnit RelAlgExecutor.cpp:3388 1ms start(162ms) compileWorkUnit NativeCodegen.cpp:2589 0ms start(163ms) markDeadRuntimeFuncs NativeCodegen.cpp:1849 0ms start(163ms) optimizeAndCodegenGPU NativeCodegen.cpp:1326 0ms start(163ms) ExecutionKernel::run ExecutionKernel.cpp:125 0ms start(163ms) fetchChunks Execute.cpp:2941 0ms start(163ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:781 0ms start(163ms) executePlanWithoutGroupBy Execute.cpp:3362 0ms start(163ms) launchGpuCode QueryExecutionContext.cpp:223 0ms start(164ms) collectAllDeviceResults Execute.cpp:2333 0ms start(164ms) reduceMultiDeviceResults Execute.cpp:1309 0ms start(164ms) reduceMultiDeviceResultSets Execute.cpp:1355 1ms start(164ms) compileWorkUnit NativeCodegen.cpp:2589 0ms start(165ms) markDeadRuntimeFuncs NativeCodegen.cpp:1849 0ms start(165ms) optimizeAndCodegenGPU NativeCodegen.cpp:1326 0ms start(165ms) ExecutionKernel::run ExecutionKernel.cpp:125 0ms start(165ms) fetchChunks Execute.cpp:2941 0ms start(165ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:781 0ms start(165ms) executePlanWithGroupBy Execute.cpp:3583 0ms start(165ms) launchGpuCode QueryExecutionContext.cpp:223 0ms start(166ms) getRowSet QueryExecutionContext.cpp:158 0ms start(166ms) reduceMultiDeviceResults Execute.cpp:1309 0ms start(166ms) reduceMultiDeviceResultSets Execute.cpp:1355 0ms start(166ms) resultsUnion Execute.cpp:1279 5ms start(166ms) executeRelAlgStep RelAlgExecutor.cpp:962 5ms start(166ms) executeCompound RelAlgExecutor.cpp:2116 5ms start(166ms) executeWorkUnit RelAlgExecutor.cpp:3388 1ms start(166ms) compileWorkUnit NativeCodegen.cpp:2589 0ms start(167ms) markDeadRuntimeFuncs NativeCodegen.cpp:1849 0ms start(167ms) optimizeAndCodegenGPU NativeCodegen.cpp:1326 0ms start(167ms) ExecutionKernel::run ExecutionKernel.cpp:125 0ms start(167ms) fetchChunks Execute.cpp:2941 0ms start(167ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:781 0ms start(168ms) executePlanWithoutGroupBy Execute.cpp:3362 0ms start(168ms) launchGpuCode QueryExecutionContext.cpp:223 0ms start(168ms) collectAllDeviceResults Execute.cpp:2333 0ms start(168ms) reduceMultiDeviceResults Execute.cpp:1309 0ms start(168ms) reduceMultiDeviceResultSets Execute.cpp:1355 1ms start(169ms) compileWorkUnit NativeCodegen.cpp:2589 0ms start(170ms) markDeadRuntimeFuncs NativeCodegen.cpp:1849 0ms start(170ms) optimizeAndCodegenGPU NativeCodegen.cpp:1326 0ms start(170ms) ExecutionKernel::run ExecutionKernel.cpp:125 0ms start(170ms) fetchChunks Execute.cpp:2941 0ms start(170ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:781 1ms start(170ms) executePlanWithGroupBy Execute.cpp:3583 1ms start(170ms) launchGpuCode QueryExecutionContext.cpp:223 0ms start(172ms) getRowSet QueryExecutionContext.cpp:158 0ms start(172ms) reduceMultiDeviceResults Execute.cpp:1309 0ms start(172ms) reduceMultiDeviceResultSets Execute.cpp:1355 0ms start(172ms) resultsUnion Execute.cpp:1279 893ms start(172ms) executeRelAlgStep RelAlgExecutor.cpp:962 893ms start(172ms) executeProject RelAlgExecutor.cpp:2166 892ms start(173ms) executeWorkUnit RelAlgExecutor.cpp:3388 376ms start(204ms) compileWorkUnit NativeCodegen.cpp:2589 0ms start(205ms) getInstance HashJoin.cpp:295 0ms start(205ms) reify PerfectJoinHashTable.cpp:354 0ms start(205ms) getOneColumnFragment ColumnFetcher.cpp:78 0ms start(205ms) getOneColumnFragment ColumnFetcher.cpp:78 New thread(176) 0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721 0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89 End thread(176) New thread(177) 0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721 0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89 End thread(177) 13ms start(206ms) getInstance HashJoin.cpp:295 4ms start(206ms) synthesize_metadata InputMetadata.cpp:306 8ms start(210ms) reify PerfectJoinHashTable.cpp:354 8ms start(210ms) getOneColumnFragment ColumnFetcher.cpp:78 7ms start(210ms) ColumnarResults ColumnarResults.cpp:64 0ms start(219ms) getOneColumnFragment ColumnFetcher.cpp:78 New thread(178) 0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721 0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89 End thread(178) New thread(179) 0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721 0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89 End thread(179) 0ms start(219ms) getInstance HashJoin.cpp:295 0ms start(219ms) reify PerfectJoinHashTable.cpp:354 0ms start(219ms) getOneColumnFragment ColumnFetcher.cpp:78 0ms start(219ms) getOneColumnFragment ColumnFetcher.cpp:78 New thread(180) 0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721 0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89 End thread(180) New thread(181) 0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721 0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89 End thread(181) 0ms start(219ms) getInstance HashJoin.cpp:295 0ms start(219ms) synthesize_metadata InputMetadata.cpp:306 0ms start(219ms) reify PerfectJoinHashTable.cpp:354 0ms start(219ms) getOneColumnFragment ColumnFetcher.cpp:78 0ms start(219ms) ColumnarResults ColumnarResults.cpp:64 0ms start(220ms) getOneColumnFragment ColumnFetcher.cpp:78 New thread(182) 0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721 0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89 End thread(182) New thread(183) 0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721 0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89 End thread(183) 360ms start(220ms) getInstance HashJoin.cpp:295 235ms start(220ms) synthesize_metadata InputMetadata.cpp:306 124ms start(456ms) reify PerfectJoinHashTable.cpp:354 115ms start(456ms) getOneColumnFragment ColumnFetcher.cpp:78 107ms start(456ms) ColumnarResults ColumnarResults.cpp:64 6ms start(572ms) getOneColumnFragment ColumnFetcher.cpp:78 New thread(184) 0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721 0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89 End thread(184) New thread(185) 0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721 0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89 End thread(185) 0ms start(580ms) markDeadRuntimeFuncs NativeCodegen.cpp:1849 0ms start(580ms) optimizeAndCodegenGPU NativeCodegen.cpp:1326 0ms start(581ms) ExecutionKernel::run ExecutionKernel.cpp:125 24ms start(581ms) fetchChunks Execute.cpp:2941 New thread(88) 0ms start(0ms) ExecutionKernel::run ExecutionKernel.cpp:125 37ms start(0ms) fetchChunks Execute.cpp:2941 0ms start(37ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:781 2ms start(37ms) executePlanWithoutGroupBy Execute.cpp:3362 2ms start(37ms) launchGpuCode QueryExecutionContext.cpp:223 End thread(88) 0ms start(605ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:781 2ms start(605ms) executePlanWithoutGroupBy Execute.cpp:3362 2ms start(605ms) launchGpuCode QueryExecutionContext.cpp:223 0ms start(622ms) collectAllDeviceResults Execute.cpp:2333 0ms start(622ms) reduceMultiDeviceResults Execute.cpp:1309 0ms start(622ms) reduceMultiDeviceResultSets Execute.cpp:1355 394ms start(622ms) compileWorkUnit NativeCodegen.cpp:2589 1ms start(623ms) getInstance HashJoin.cpp:295 1ms start(623ms) reify PerfectJoinHashTable.cpp:354 0ms start(623ms) getOneColumnFragment ColumnFetcher.cpp:78 0ms start(623ms) getOneColumnFragment ColumnFetcher.cpp:78 New thread(186) 0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721 0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89 End thread(186) New thread(187) 1ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721 1ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89 End thread(187) 17ms start(625ms) getInstance HashJoin.cpp:295 7ms start(625ms) synthesize_metadata InputMetadata.cpp:306 9ms start(633ms) reify PerfectJoinHashTable.cpp:354 8ms start(633ms) getOneColumnFragment ColumnFetcher.cpp:78 8ms start(633ms) ColumnarResults ColumnarResults.cpp:64 0ms start(642ms) getOneColumnFragment ColumnFetcher.cpp:78 New thread(188) 0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721 0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89 End thread(188) New thread(189) 0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721 0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89 End thread(189) 0ms start(642ms) getInstance HashJoin.cpp:295 0ms start(642ms) reify PerfectJoinHashTable.cpp:354 0ms start(643ms) getOneColumnFragment ColumnFetcher.cpp:78 0ms start(643ms) getOneColumnFragment ColumnFetcher.cpp:78 New thread(190) 0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721 0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89 End thread(190) New thread(191) 0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721 0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89 End thread(191) 0ms start(643ms) getInstance HashJoin.cpp:295 0ms start(643ms) synthesize_metadata InputMetadata.cpp:306 0ms start(643ms) reify PerfectJoinHashTable.cpp:354 0ms start(643ms) getOneColumnFragment ColumnFetcher.cpp:78 0ms start(643ms) ColumnarResults ColumnarResults.cpp:64 0ms start(643ms) getOneColumnFragment ColumnFetcher.cpp:78 New thread(192) 0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721 0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89 End thread(192) New thread(193) 0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721 0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89 End thread(193) 371ms start(643ms) getInstance HashJoin.cpp:295 238ms start(643ms) synthesize_metadata InputMetadata.cpp:306 133ms start(881ms) reify PerfectJoinHashTable.cpp:354 124ms start(881ms) getOneColumnFragment ColumnFetcher.cpp:78 111ms start(881ms) ColumnarResults ColumnarResults.cpp:64 7ms start(1006ms) getOneColumnFragment ColumnFetcher.cpp:78 New thread(194) 0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721 0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89 End thread(194) New thread(195) 0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721 0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89 End thread(195) 0ms start(1015ms) markDeadRuntimeFuncs NativeCodegen.cpp:1849 0ms start(1016ms) optimizeAndCodegenGPU NativeCodegen.cpp:1326 0ms start(1016ms) ExecutionKernel::run ExecutionKernel.cpp:125 New thread(196) 0ms start(0ms) ExecutionKernel::run ExecutionKernel.cpp:125 16ms start(27ms) fetchChunks Execute.cpp:2941 0ms start(44ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:781 4ms start(44ms) executePlanWithGroupBy Execute.cpp:3583 4ms start(44ms) launchGpuCode QueryExecutionContext.cpp:223 0ms start(49ms) getRowSet QueryExecutionContext.cpp:158 0ms start(49ms) reduceMultiDeviceResults Execute.cpp:1309 0ms start(49ms) reduceMultiDeviceResultSets Execute.cpp:1355 End thread(196) New thread(197) 0ms start(0ms) ExecutionKernel::run ExecutionKernel.cpp:125 23ms start(0ms) fetchChunks Execute.cpp:2941 0ms start(24ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:781 6ms start(24ms) executePlanWithGroupBy Execute.cpp:3583 6ms start(24ms) launchGpuCode QueryExecutionContext.cpp:223 0ms start(30ms) getRowSet QueryExecutionContext.cpp:158 0ms start(30ms) reduceMultiDeviceResults Execute.cpp:1309 0ms start(30ms) reduceMultiDeviceResultSets Execute.cpp:1355 End thread(197) 23ms start(1016ms) fetchChunks Execute.cpp:2941 0ms start(1040ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:781 3ms start(1040ms) executePlanWithGroupBy Execute.cpp:3583 3ms start(1040ms) launchGpuCode QueryExecutionContext.cpp:223 0ms start(1043ms) getRowSet QueryExecutionContext.cpp:158 0ms start(1043ms) reduceMultiDeviceResults Execute.cpp:1309 0ms start(1043ms) reduceMultiDeviceResultSets Execute.cpp:1355 0ms start(1065ms) resultsUnion Execute.cpp:1279 41ms start(1066ms) executeRelAlgStep RelAlgExecutor.cpp:962 41ms start(1066ms) executeSort RelAlgExecutor.cpp:2985 37ms start(1066ms) executeWorkUnit RelAlgExecutor.cpp:3388 1ms start(1066ms) compileWorkUnit NativeCodegen.cpp:2589 0ms start(1067ms) markDeadRuntimeFuncs NativeCodegen.cpp:1849 0ms start(1067ms) optimizeAndCodegenGPU NativeCodegen.cpp:1326 0ms start(1067ms) ExecutionKernel::run ExecutionKernel.cpp:125 16ms start(1067ms) fetchChunks Execute.cpp:2941 15ms start(1067ms) ColumnarResults ColumnarResults.cpp:64 0ms start(1083ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:781 0ms start(1083ms) executePlanWithoutGroupBy Execute.cpp:3362 0ms start(1083ms) launchGpuCode QueryExecutionContext.cpp:223 0ms start(1084ms) collectAllDeviceResults Execute.cpp:2333 0ms start(1084ms) reduceMultiDeviceResults Execute.cpp:1309 0ms start(1084ms) reduceMultiDeviceResultSets Execute.cpp:1355 1ms start(1084ms) compileWorkUnit NativeCodegen.cpp:2589 0ms start(1085ms) markDeadRuntimeFuncs NativeCodegen.cpp:1849 0ms start(1085ms) optimizeAndCodegenGPU NativeCodegen.cpp:1326 0ms start(1085ms) ExecutionKernel::run ExecutionKernel.cpp:125 15ms start(1085ms) fetchChunks Execute.cpp:2941 14ms start(1085ms) ColumnarResults ColumnarResults.cpp:64 0ms start(1100ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:781 2ms start(1101ms) executePlanWithGroupBy Execute.cpp:3583 2ms start(1101ms) launchGpuCode QueryExecutionContext.cpp:223 0ms start(1103ms) getRowSet QueryExecutionContext.cpp:158 0ms start(1103ms) reduceMultiDeviceResults Execute.cpp:1309 0ms start(1103ms) reduceMultiDeviceResultSets Execute.cpp:1355 0ms start(1103ms) resultsUnion Execute.cpp:1279 3ms start(1103ms) sort ResultSet.cpp:768 0ms start(1103ms) initPermutationBuffer ResultSet.cpp:846 0ms start(1104ms) createComparator ResultSet.h:840 3ms start(1104ms) topPermutation ResultSet.cpp:1208
Quite complex ;)
Regards, Candido
-
Hi,
It was an educated guess, but having an MPP database engine that doesn't use intermediate results is almost impossible.
As an example, take a simple group-by query like that.
select payment_type,passenger_count,trip_distance,tip_amount,avg(tip_amount) from yellow_tripdata_p group by 1,2,3,4
is generating this
59ms total duration for executeRelAlgQuery 58ms start(0ms) executeRelAlgQueryNoRetry RelAlgExecutor.cpp:588 0ms start(0ms) Query pre-execution steps RelAlgExecutor.cpp:589 58ms start(0ms) executeRelAlgSeq RelAlgExecutor.cpp:895 58ms start(0ms) executeRelAlgStep RelAlgExecutor.cpp:1043 58ms start(0ms) executeSort RelAlgExecutor.cpp:3181 58ms start(0ms) executeWorkUnit RelAlgExecutor.cpp:3585 3ms start(0ms) compileWorkUnit NativeCodegen.cpp:2676 0ms start(3ms) markDeadRuntimeFuncs NativeCodegen.cpp:1864 0ms start(4ms) optimizeAndCodegenGPU NativeCodegen.cpp:1340 0ms start(4ms) ExecutionKernel::run ExecutionKernel.cpp:126 0ms start(4ms) fetchChunks Execute.cpp:2928 0ms start(4ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:780 53ms start(4ms) executePlanWithGroupBy Execute.cpp:3537 51ms start(5ms) launchGpuCode QueryExecutionContext.cpp:225 0ms start(56ms) getRowSet QueryExecutionContext.cpp:160 0ms start(56ms) reduceMultiDeviceResults Execute.cpp:1294 0ms start(56ms) reduceMultiDeviceResultSets Execute.cpp:1340 0ms start(58ms) collectAllDeviceResults Execute.cpp:2320 0ms start(58ms) reduceMultiDeviceResults Execute.cpp:1294 0ms start(58ms) reduceMultiDeviceResultSets Execute.cpp:1340
then we add a simple HAVING count(*) < 10
126ms total duration for executeRelAlgQuery 126ms start(0ms) executeRelAlgQueryNoRetry RelAlgExecutor.cpp:588 0ms start(0ms) Query pre-execution steps RelAlgExecutor.cpp:589 125ms start(0ms) executeRelAlgSeq RelAlgExecutor.cpp:895 47ms start(0ms) executeRelAlgStep RelAlgExecutor.cpp:1043 47ms start(0ms) executeCompound RelAlgExecutor.cpp:2238 47ms start(0ms) executeWorkUnit RelAlgExecutor.cpp:3585 3ms start(0ms) compileWorkUnit NativeCodegen.cpp:2676 0ms start(3ms) markDeadRuntimeFuncs NativeCodegen.cpp:1864 0ms start(3ms) optimizeAndCodegenGPU NativeCodegen.cpp:1340 0ms start(4ms) ExecutionKernel::run ExecutionKernel.cpp:126 0ms start(4ms) fetchChunks Execute.cpp:2928 0ms start(4ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:780 42ms start(4ms) executePlanWithGroupBy Execute.cpp:3537 39ms start(4ms) launchGpuCode QueryExecutionContext.cpp:225 0ms start(44ms) getRowSet QueryExecutionContext.cpp:160 0ms start(44ms) reduceMultiDeviceResults Execute.cpp:1294 0ms start(44ms) reduceMultiDeviceResultSets Execute.cpp:1340 0ms start(47ms) collectAllDeviceResults Execute.cpp:2320 0ms start(47ms) reduceMultiDeviceResults Execute.cpp:1294 0ms start(47ms) reduceMultiDeviceResultSets Execute.cpp:1340 78ms start(48ms) executeRelAlgStep RelAlgExecutor.cpp:1043 77ms start(48ms) executeSort RelAlgExecutor.cpp:3181 77ms start(48ms) executeWorkUnit RelAlgExecutor.cpp:3585 1ms start(48ms) compileWorkUnit NativeCodegen.cpp:2676 0ms start(49ms) markDeadRuntimeFuncs NativeCodegen.cpp:1864 0ms start(50ms) optimizeAndCodegenGPU NativeCodegen.cpp:1340 35ms start(50ms) synthesize_metadata InputMetadata.cpp:307 0ms start(85ms) ExecutionKernel::run ExecutionKernel.cpp:126 37ms start(85ms) fetchChunks Execute.cpp:2928 36ms start(85ms) ColumnarResults ColumnarResults.cpp:167 0ms start(122ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:780 2ms start(123ms) executePlanWithGroupBy Execute.cpp:3537 2ms start(123ms) launchGpuCode QueryExecutionContext.cpp:225 0ms start(125ms) getRowSet QueryExecutionContext.cpp:160 0ms start(125ms) reduceMultiDeviceResults Execute.cpp:1294 0ms start(125ms) reduceMultiDeviceResultSets Execute.cpp:1340
And this is adding some other steps, including a ColumnmarResult so intermediate results. Fusing a filter that needs reduced results would be impossible, so the intermediate result.
I'm sorry if this isn't the best example, but I don't know every scenario where intermediate results are used.
Regards, Cadido
-
[quote="candido.dessanti, post:6, topic:2983"] Fusing a filter that needs reduced results would be impossible, so the intermediate result. [/quote]
Hi thanks for the reply! Should the query still be fusable if there is only one GPU? you can perform filtering directly after groupby, you don't need reduce results if there is only one GPU?
Please sign in to leave a comment.
Comments
7 comments