How to find out, what takes most time in query processing

Comments

7 comments

  • Avatar
    Candido Dessanti

    Hi @Sipy,

    You can start the database by setting the parameter enable-debug-timer to true. The timings of macro steps of the query, like the execution in CPU/GPU, projection, and the "order by" will be written to the log. .

    We haven't anything to return the timings of single steps of the query, like group by, joins.

    It would be difficult considering the extreme parallelism degree of GPU execution.

    I tried query on 100GB, using 1 or 2 GPUs, and the response time is between 4000ms and 1200ms.

    heavysql> select s_acctbal,s_name,n_name,p_partkey,p_mfgr,s_address,s_phone,s_comment from part,supplier,partsupp,nation,region where p_partkey = ps_partkey and s_suppkey = ps_suppkey and p_size = 15 and p_type like '%BRASS' and s_nationkey = n_nationkey and n_regionkey = r_regionkey and r_name = 'EUROPE' and ps_supplycost = (select min(ps_supplycost) from partsupp,supplier,nation,region where p_partkey = ps_partkey and s_suppkey = ps_suppkey and s_nationkey = n_nationkey and n_regionkey = r_regionkey and r_name = 'EUROPE') order by s_acctbal desc,n_name,s_name,p_partkey limit 10;
    s_acctbal|s_name|n_name|p_partkey|p_mfgr|s_address|s_phone|s_comment
    9999.70|Supplier#000239544|UNITED KINGDOM|6739531|Manufacturer#4|,6oexY7ft68JKJCN1SQ LUO5Xf|33-509-584-9496|ets are. blithely special accounts wake across t
    9999.65|Supplier#000143654|FRANCE|2393647|Manufacturer#3|HhHCZ,RosE8He4uYvyIDqsPZe,7cZiJh1y9,|16-166-504-5864|special accounts. never dogged deposits
    9999.49|Supplier#000615014|GERMANY|12365001|Manufacturer#3|4C,uOuJDig7z,4e|17-780-902-4027|pinto beans affix along th
    9999.28|Supplier#000494480|ROMANIA|17244462|Manufacturer#3|hNB2qcYmnyeqrd em,o29TxzLwwrl|29-756-312-1779|ly. slyly final pains detect furiously qu
    9998.87|Supplier#000826281|ROMANIA|14326252|Manufacturer#3|LkGmLSAjbjgKS5ZRepL|29-775-451-4774|slyly silent accounts affix furiously across the final, even ideas; tithe
    9998.56|Supplier#000039514|UNITED KINGDOM|7539499|Manufacturer#2|fUN95rpikfiqeGo,okl,27ItSq9fMpr|33-317-828-7758|olites use about the blithely regular warhorses. carefully final instructions a
    9997.89|Supplier#000319666|RUSSIA|19319665|Manufacturer#1|SZtdx3rzXnFuiVuJzKathV9|32-972-151-6038|furiously ironic pearls. furiously regular foxes along the furiously
    9997.85|Supplier#000718770|ROMANIA|718769|Manufacturer#1|2173woKsCg7zpHCqhiEXFTmYYqpT4XONnRAsUryE|29-663-409-2865|posits sleep across the blithely express requests.
    9997.83|Supplier#000348318|FRANCE|9348317|Manufacturer#4|AvfnvjCFAg1aNpj|16-718-707-5676|e pending requests: furiously even deposits boost furiously slyly even requests.
    9997.73|Supplier#000329974|RUSSIA|4079969|Manufacturer#3|eKtIadopsE|32-268-488-7178|uickly. furiously ironic requests are slyly above the regular
    10 rows returned.
    Execution time: 1133 ms, Total time: 1179 ms
    

    Could you enable the debug timer and post the log of your execution? If the default storage directory is used the log file is /var/lib/heavyai/log/omnisci_server.INFO

    the result of my execution is

    1107ms total duration for executeRelAlgQuery
      1107ms start(0ms) executeRelAlgQueryNoRetry RelAlgExecutor.cpp:516
        0ms start(0ms) Query pre-execution steps RelAlgExecutor.cpp:517
        1106ms start(0ms) executeRelAlgSeq RelAlgExecutor.cpp:814
          4ms start(0ms) executeRelAlgStep RelAlgExecutor.cpp:962
            4ms start(0ms) executeCompound RelAlgExecutor.cpp:2116
              4ms start(0ms) executeWorkUnit RelAlgExecutor.cpp:3388
                1ms start(0ms) compileWorkUnit NativeCodegen.cpp:2589
                  0ms start(2ms) markDeadRuntimeFuncs NativeCodegen.cpp:1849
                  0ms start(2ms) optimizeAndCodegenGPU NativeCodegen.cpp:1326
                0ms start(2ms) ExecutionKernel::run ExecutionKernel.cpp:125
                0ms start(2ms) fetchChunks Execute.cpp:2941
                0ms start(2ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:781
                0ms start(2ms) executePlanWithoutGroupBy Execute.cpp:3362
                  0ms start(2ms) launchGpuCode QueryExecutionContext.cpp:223
                0ms start(3ms) collectAllDeviceResults Execute.cpp:2333
                  0ms start(3ms) reduceMultiDeviceResults Execute.cpp:1309
                    0ms start(3ms) reduceMultiDeviceResultSets Execute.cpp:1355
                1ms start(3ms) compileWorkUnit NativeCodegen.cpp:2589
                  0ms start(4ms) markDeadRuntimeFuncs NativeCodegen.cpp:1849
                  0ms start(4ms) optimizeAndCodegenGPU NativeCodegen.cpp:1326
                0ms start(4ms) ExecutionKernel::run ExecutionKernel.cpp:125
                0ms start(4ms) fetchChunks Execute.cpp:2941
                0ms start(4ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:781
                0ms start(4ms) executePlanWithGroupBy Execute.cpp:3583
                  0ms start(4ms) launchGpuCode QueryExecutionContext.cpp:223
                  0ms start(5ms) getRowSet QueryExecutionContext.cpp:158
                    0ms start(5ms) reduceMultiDeviceResults Execute.cpp:1309
                      0ms start(5ms) reduceMultiDeviceResultSets Execute.cpp:1355
                0ms start(5ms) resultsUnion Execute.cpp:1279
          156ms start(5ms) executeRelAlgStep RelAlgExecutor.cpp:962
            156ms start(5ms) executeCompound RelAlgExecutor.cpp:2116
              156ms start(6ms) executeWorkUnit RelAlgExecutor.cpp:3388
                2ms start(6ms) compileWorkUnit NativeCodegen.cpp:2589
                  0ms start(6ms) getInstance HashJoin.cpp:295
                    0ms start(6ms) reify PerfectJoinHashTable.cpp:354
                      0ms start(6ms) getOneColumnFragment ColumnFetcher.cpp:78
                      0ms start(6ms) getOneColumnFragment ColumnFetcher.cpp:78
                        New thread(170)
                          0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
                            0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
                        End thread(170)
                        New thread(171)
                          0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
                            0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
                        End thread(171)
                  0ms start(7ms) getInstance HashJoin.cpp:295
                    0ms start(7ms) reify PerfectJoinHashTable.cpp:354
                      0ms start(7ms) getOneColumnFragment ColumnFetcher.cpp:78
                      0ms start(7ms) getOneColumnFragment ColumnFetcher.cpp:78
                        New thread(172)
                          0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
                            0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
                        End thread(172)
                        New thread(173)
                          0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
                            0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
                        End thread(173)
                  0ms start(7ms) getInstance HashJoin.cpp:295
                    0ms start(7ms) synthesize_metadata InputMetadata.cpp:306
                    0ms start(7ms) reify PerfectJoinHashTable.cpp:354
                      0ms start(7ms) getOneColumnFragment ColumnFetcher.cpp:78
                        0ms start(7ms) ColumnarResults ColumnarResults.cpp:64
                      0ms start(7ms) getOneColumnFragment ColumnFetcher.cpp:78
                        New thread(174)
                          0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
                            0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
                        End thread(174)
                        New thread(175)
                          0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
                            0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
                        End thread(175)
                  0ms start(8ms) markDeadRuntimeFuncs NativeCodegen.cpp:1849
                  0ms start(8ms) optimizeAndCodegenGPU NativeCodegen.cpp:1326
                0ms start(8ms) ExecutionKernel::run ExecutionKernel.cpp:125
                0ms start(8ms) fetchChunks Execute.cpp:2941
                    New thread(47)
                      0ms start(0ms) ExecutionKernel::run ExecutionKernel.cpp:125
                      0ms start(0ms) fetchChunks Execute.cpp:2941
                      0ms start(0ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:781
                      131ms start(0ms) executePlanWithGroupBy Execute.cpp:3583
                        131ms start(0ms) launchGpuCode QueryExecutionContext.cpp:223
                        0ms start(131ms) getRowSet QueryExecutionContext.cpp:158
                          0ms start(131ms) reduceMultiDeviceResults Execute.cpp:1309
                            0ms start(131ms) reduceMultiDeviceResultSets Execute.cpp:1355
                    End thread(47)
                0ms start(8ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:781
                130ms start(8ms) executePlanWithGroupBy Execute.cpp:3583
                  130ms start(8ms) launchGpuCode QueryExecutionContext.cpp:223
                  0ms start(139ms) getRowSet QueryExecutionContext.cpp:158
                    0ms start(139ms) reduceMultiDeviceResults Execute.cpp:1309
                      0ms start(139ms) reduceMultiDeviceResultSets Execute.cpp:1355
                21ms start(140ms) collectAllDeviceResults Execute.cpp:2333
                  21ms start(140ms) reduceMultiDeviceResults Execute.cpp:1309
                    21ms start(140ms) reduceMultiDeviceResultSets Execute.cpp:1355
          4ms start(162ms) executeRelAlgStep RelAlgExecutor.cpp:962
            4ms start(162ms) executeCompound RelAlgExecutor.cpp:2116
              4ms start(162ms) executeWorkUnit RelAlgExecutor.cpp:3388
                1ms start(162ms) compileWorkUnit NativeCodegen.cpp:2589
                  0ms start(163ms) markDeadRuntimeFuncs NativeCodegen.cpp:1849
                  0ms start(163ms) optimizeAndCodegenGPU NativeCodegen.cpp:1326
                0ms start(163ms) ExecutionKernel::run ExecutionKernel.cpp:125
                0ms start(163ms) fetchChunks Execute.cpp:2941
                0ms start(163ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:781
                0ms start(163ms) executePlanWithoutGroupBy Execute.cpp:3362
                  0ms start(163ms) launchGpuCode QueryExecutionContext.cpp:223
                0ms start(164ms) collectAllDeviceResults Execute.cpp:2333
                  0ms start(164ms) reduceMultiDeviceResults Execute.cpp:1309
                    0ms start(164ms) reduceMultiDeviceResultSets Execute.cpp:1355
                1ms start(164ms) compileWorkUnit NativeCodegen.cpp:2589
                  0ms start(165ms) markDeadRuntimeFuncs NativeCodegen.cpp:1849
                  0ms start(165ms) optimizeAndCodegenGPU NativeCodegen.cpp:1326
                0ms start(165ms) ExecutionKernel::run ExecutionKernel.cpp:125
                0ms start(165ms) fetchChunks Execute.cpp:2941
                0ms start(165ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:781
                0ms start(165ms) executePlanWithGroupBy Execute.cpp:3583
                  0ms start(165ms) launchGpuCode QueryExecutionContext.cpp:223
                  0ms start(166ms) getRowSet QueryExecutionContext.cpp:158
                    0ms start(166ms) reduceMultiDeviceResults Execute.cpp:1309
                      0ms start(166ms) reduceMultiDeviceResultSets Execute.cpp:1355
                0ms start(166ms) resultsUnion Execute.cpp:1279
          5ms start(166ms) executeRelAlgStep RelAlgExecutor.cpp:962
            5ms start(166ms) executeCompound RelAlgExecutor.cpp:2116
              5ms start(166ms) executeWorkUnit RelAlgExecutor.cpp:3388
                1ms start(166ms) compileWorkUnit NativeCodegen.cpp:2589
                  0ms start(167ms) markDeadRuntimeFuncs NativeCodegen.cpp:1849
                  0ms start(167ms) optimizeAndCodegenGPU NativeCodegen.cpp:1326
                0ms start(167ms) ExecutionKernel::run ExecutionKernel.cpp:125
                0ms start(167ms) fetchChunks Execute.cpp:2941
                0ms start(167ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:781
                0ms start(168ms) executePlanWithoutGroupBy Execute.cpp:3362
                  0ms start(168ms) launchGpuCode QueryExecutionContext.cpp:223
                0ms start(168ms) collectAllDeviceResults Execute.cpp:2333
                  0ms start(168ms) reduceMultiDeviceResults Execute.cpp:1309
                    0ms start(168ms) reduceMultiDeviceResultSets Execute.cpp:1355
                1ms start(169ms) compileWorkUnit NativeCodegen.cpp:2589
                  0ms start(170ms) markDeadRuntimeFuncs NativeCodegen.cpp:1849
                  0ms start(170ms) optimizeAndCodegenGPU NativeCodegen.cpp:1326
                0ms start(170ms) ExecutionKernel::run ExecutionKernel.cpp:125
                0ms start(170ms) fetchChunks Execute.cpp:2941
                0ms start(170ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:781
                1ms start(170ms) executePlanWithGroupBy Execute.cpp:3583
                  1ms start(170ms) launchGpuCode QueryExecutionContext.cpp:223
                  0ms start(172ms) getRowSet QueryExecutionContext.cpp:158
                    0ms start(172ms) reduceMultiDeviceResults Execute.cpp:1309
                      0ms start(172ms) reduceMultiDeviceResultSets Execute.cpp:1355
                0ms start(172ms) resultsUnion Execute.cpp:1279
          893ms start(172ms) executeRelAlgStep RelAlgExecutor.cpp:962
            893ms start(172ms) executeProject RelAlgExecutor.cpp:2166
              892ms start(173ms) executeWorkUnit RelAlgExecutor.cpp:3388
                376ms start(204ms) compileWorkUnit NativeCodegen.cpp:2589
                  0ms start(205ms) getInstance HashJoin.cpp:295
                    0ms start(205ms) reify PerfectJoinHashTable.cpp:354
                      0ms start(205ms) getOneColumnFragment ColumnFetcher.cpp:78
                      0ms start(205ms) getOneColumnFragment ColumnFetcher.cpp:78
                        New thread(176)
                          0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
                            0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
                        End thread(176)
                        New thread(177)
                          0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
                            0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
                        End thread(177)
                  13ms start(206ms) getInstance HashJoin.cpp:295
                    4ms start(206ms) synthesize_metadata InputMetadata.cpp:306
                    8ms start(210ms) reify PerfectJoinHashTable.cpp:354
                      8ms start(210ms) getOneColumnFragment ColumnFetcher.cpp:78
                        7ms start(210ms) ColumnarResults ColumnarResults.cpp:64
                      0ms start(219ms) getOneColumnFragment ColumnFetcher.cpp:78
                        New thread(178)
                          0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
                            0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
                        End thread(178)
                        New thread(179)
                          0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
                            0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
                        End thread(179)
                  0ms start(219ms) getInstance HashJoin.cpp:295
                    0ms start(219ms) reify PerfectJoinHashTable.cpp:354
                      0ms start(219ms) getOneColumnFragment ColumnFetcher.cpp:78
                      0ms start(219ms) getOneColumnFragment ColumnFetcher.cpp:78
                        New thread(180)
                          0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
                            0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
                        End thread(180)
                        New thread(181)
                          0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
                            0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
                        End thread(181)
                  0ms start(219ms) getInstance HashJoin.cpp:295
                    0ms start(219ms) synthesize_metadata InputMetadata.cpp:306
                    0ms start(219ms) reify PerfectJoinHashTable.cpp:354
                      0ms start(219ms) getOneColumnFragment ColumnFetcher.cpp:78
                        0ms start(219ms) ColumnarResults ColumnarResults.cpp:64
                      0ms start(220ms) getOneColumnFragment ColumnFetcher.cpp:78
                        New thread(182)
                          0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
                            0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
                        End thread(182)
                        New thread(183)
                          0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
                            0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
                        End thread(183)
                  360ms start(220ms) getInstance HashJoin.cpp:295
                    235ms start(220ms) synthesize_metadata InputMetadata.cpp:306
                    124ms start(456ms) reify PerfectJoinHashTable.cpp:354
                      115ms start(456ms) getOneColumnFragment ColumnFetcher.cpp:78
                        107ms start(456ms) ColumnarResults ColumnarResults.cpp:64
                      6ms start(572ms) getOneColumnFragment ColumnFetcher.cpp:78
                        New thread(184)
                          0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
                            0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
                        End thread(184)
                        New thread(185)
                          0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
                            0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
                        End thread(185)
                  0ms start(580ms) markDeadRuntimeFuncs NativeCodegen.cpp:1849
                  0ms start(580ms) optimizeAndCodegenGPU NativeCodegen.cpp:1326
                0ms start(581ms) ExecutionKernel::run ExecutionKernel.cpp:125
                24ms start(581ms) fetchChunks Execute.cpp:2941
                    New thread(88)
                      0ms start(0ms) ExecutionKernel::run ExecutionKernel.cpp:125
                      37ms start(0ms) fetchChunks Execute.cpp:2941
                      0ms start(37ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:781
                      2ms start(37ms) executePlanWithoutGroupBy Execute.cpp:3362
                        2ms start(37ms) launchGpuCode QueryExecutionContext.cpp:223
                    End thread(88)
                0ms start(605ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:781
                2ms start(605ms) executePlanWithoutGroupBy Execute.cpp:3362
                  2ms start(605ms) launchGpuCode QueryExecutionContext.cpp:223
                0ms start(622ms) collectAllDeviceResults Execute.cpp:2333
                  0ms start(622ms) reduceMultiDeviceResults Execute.cpp:1309
                    0ms start(622ms) reduceMultiDeviceResultSets Execute.cpp:1355
                394ms start(622ms) compileWorkUnit NativeCodegen.cpp:2589
                  1ms start(623ms) getInstance HashJoin.cpp:295
                    1ms start(623ms) reify PerfectJoinHashTable.cpp:354
                      0ms start(623ms) getOneColumnFragment ColumnFetcher.cpp:78
                      0ms start(623ms) getOneColumnFragment ColumnFetcher.cpp:78
                        New thread(186)
                          0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
                            0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
                        End thread(186)
                        New thread(187)
                          1ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
                            1ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
                        End thread(187)
                  17ms start(625ms) getInstance HashJoin.cpp:295
                    7ms start(625ms) synthesize_metadata InputMetadata.cpp:306
                    9ms start(633ms) reify PerfectJoinHashTable.cpp:354
                      8ms start(633ms) getOneColumnFragment ColumnFetcher.cpp:78
                        8ms start(633ms) ColumnarResults ColumnarResults.cpp:64
                      0ms start(642ms) getOneColumnFragment ColumnFetcher.cpp:78
                        New thread(188)
                          0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
                            0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
                        End thread(188)
                        New thread(189)
                          0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
                            0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
                        End thread(189)
                  0ms start(642ms) getInstance HashJoin.cpp:295
                    0ms start(642ms) reify PerfectJoinHashTable.cpp:354
                      0ms start(643ms) getOneColumnFragment ColumnFetcher.cpp:78
                      0ms start(643ms) getOneColumnFragment ColumnFetcher.cpp:78
                        New thread(190)
                          0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
                            0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
                        End thread(190)
                        New thread(191)
                          0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
                            0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
                        End thread(191)
                  0ms start(643ms) getInstance HashJoin.cpp:295
                    0ms start(643ms) synthesize_metadata InputMetadata.cpp:306
                    0ms start(643ms) reify PerfectJoinHashTable.cpp:354
                      0ms start(643ms) getOneColumnFragment ColumnFetcher.cpp:78
                        0ms start(643ms) ColumnarResults ColumnarResults.cpp:64
                      0ms start(643ms) getOneColumnFragment ColumnFetcher.cpp:78
                        New thread(192)
                          0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
                            0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
                        End thread(192)
                        New thread(193)
                          0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
                            0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
                        End thread(193)
                  371ms start(643ms) getInstance HashJoin.cpp:295
                    238ms start(643ms) synthesize_metadata InputMetadata.cpp:306
                    133ms start(881ms) reify PerfectJoinHashTable.cpp:354
                      124ms start(881ms) getOneColumnFragment ColumnFetcher.cpp:78
                        111ms start(881ms) ColumnarResults ColumnarResults.cpp:64
                      7ms start(1006ms) getOneColumnFragment ColumnFetcher.cpp:78
                        New thread(194)
                          0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
                            0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
                        End thread(194)
                        New thread(195)
                          0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
                            0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
                        End thread(195)
                  0ms start(1015ms) markDeadRuntimeFuncs NativeCodegen.cpp:1849
                  0ms start(1016ms) optimizeAndCodegenGPU NativeCodegen.cpp:1326
                0ms start(1016ms) ExecutionKernel::run ExecutionKernel.cpp:125
                  New thread(196)
                    0ms start(0ms) ExecutionKernel::run ExecutionKernel.cpp:125
                    16ms start(27ms) fetchChunks Execute.cpp:2941
                    0ms start(44ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:781
                    4ms start(44ms) executePlanWithGroupBy Execute.cpp:3583
                      4ms start(44ms) launchGpuCode QueryExecutionContext.cpp:223
                      0ms start(49ms) getRowSet QueryExecutionContext.cpp:158
                        0ms start(49ms) reduceMultiDeviceResults Execute.cpp:1309
                          0ms start(49ms) reduceMultiDeviceResultSets Execute.cpp:1355
                  End thread(196)
                  New thread(197)
                    0ms start(0ms) ExecutionKernel::run ExecutionKernel.cpp:125
                    23ms start(0ms) fetchChunks Execute.cpp:2941
                    0ms start(24ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:781
                    6ms start(24ms) executePlanWithGroupBy Execute.cpp:3583
                      6ms start(24ms) launchGpuCode QueryExecutionContext.cpp:223
                      0ms start(30ms) getRowSet QueryExecutionContext.cpp:158
                        0ms start(30ms) reduceMultiDeviceResults Execute.cpp:1309
                          0ms start(30ms) reduceMultiDeviceResultSets Execute.cpp:1355
                  End thread(197)
                23ms start(1016ms) fetchChunks Execute.cpp:2941
                0ms start(1040ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:781
                3ms start(1040ms) executePlanWithGroupBy Execute.cpp:3583
                  3ms start(1040ms) launchGpuCode QueryExecutionContext.cpp:223
                  0ms start(1043ms) getRowSet QueryExecutionContext.cpp:158
                    0ms start(1043ms) reduceMultiDeviceResults Execute.cpp:1309
                      0ms start(1043ms) reduceMultiDeviceResultSets Execute.cpp:1355
                0ms start(1065ms) resultsUnion Execute.cpp:1279
          41ms start(1066ms) executeRelAlgStep RelAlgExecutor.cpp:962
            41ms start(1066ms) executeSort RelAlgExecutor.cpp:2985
              37ms start(1066ms) executeWorkUnit RelAlgExecutor.cpp:3388
                1ms start(1066ms) compileWorkUnit NativeCodegen.cpp:2589
                  0ms start(1067ms) markDeadRuntimeFuncs NativeCodegen.cpp:1849
                  0ms start(1067ms) optimizeAndCodegenGPU NativeCodegen.cpp:1326
                0ms start(1067ms) ExecutionKernel::run ExecutionKernel.cpp:125
                16ms start(1067ms) fetchChunks Execute.cpp:2941
                  15ms start(1067ms) ColumnarResults ColumnarResults.cpp:64
                0ms start(1083ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:781
                0ms start(1083ms) executePlanWithoutGroupBy Execute.cpp:3362
                  0ms start(1083ms) launchGpuCode QueryExecutionContext.cpp:223
                0ms start(1084ms) collectAllDeviceResults Execute.cpp:2333
                  0ms start(1084ms) reduceMultiDeviceResults Execute.cpp:1309
                    0ms start(1084ms) reduceMultiDeviceResultSets Execute.cpp:1355
                1ms start(1084ms) compileWorkUnit NativeCodegen.cpp:2589
                  0ms start(1085ms) markDeadRuntimeFuncs NativeCodegen.cpp:1849
                  0ms start(1085ms) optimizeAndCodegenGPU NativeCodegen.cpp:1326
                0ms start(1085ms) ExecutionKernel::run ExecutionKernel.cpp:125
                15ms start(1085ms) fetchChunks Execute.cpp:2941
                  14ms start(1085ms) ColumnarResults ColumnarResults.cpp:64
                0ms start(1100ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:781
                2ms start(1101ms) executePlanWithGroupBy Execute.cpp:3583
                  2ms start(1101ms) launchGpuCode QueryExecutionContext.cpp:223
                  0ms start(1103ms) getRowSet QueryExecutionContext.cpp:158
                    0ms start(1103ms) reduceMultiDeviceResults Execute.cpp:1309
                      0ms start(1103ms) reduceMultiDeviceResultSets Execute.cpp:1355
                0ms start(1103ms) resultsUnion Execute.cpp:1279
              3ms start(1103ms) sort ResultSet.cpp:768
                0ms start(1103ms) initPermutationBuffer ResultSet.cpp:846
                0ms start(1104ms) createComparator ResultSet.h:840
                3ms start(1104ms) topPermutation ResultSet.cpp:1208
    

    Quite complex ;)

    Regards, Candido

    1
    Comment actions Permalink
  • Avatar
    Lily Liu

    Hi Candido, I notice ColumnarResults takes quite a few amount of time, can I have a rough understanding about what ColumnarResults is doing? Thanks!

    0
    Comment actions Permalink
  • Avatar
    Candido Dessanti

    Hi,

    Do you mean the 111ms? Probably some intermediate result is materialized and passed to the next query step.

    I'm not sure about that

    0
    Comment actions Permalink
  • Avatar
    Lily Liu

    Thanks for the reply! Since HeavyDB do aggressive coalesce (fuse project, groupby and filer), why still are multiple steps? Could you give me a example that a query has to be divided into multiple steps? Thanks!

    0
    Comment actions Permalink
  • Avatar
    Candido Dessanti

    Hi,

    It was an educated guess, but having an MPP database engine that doesn't use intermediate results is almost impossible.

    As an example, take a simple group-by query like that.

    select payment_type,passenger_count,trip_distance,tip_amount,avg(tip_amount) 
    from yellow_tripdata_p
    group by 1,2,3,4
    

    is generating this

    59ms total duration for executeRelAlgQuery
      58ms start(0ms) executeRelAlgQueryNoRetry RelAlgExecutor.cpp:588
        0ms start(0ms) Query pre-execution steps RelAlgExecutor.cpp:589
        58ms start(0ms) executeRelAlgSeq RelAlgExecutor.cpp:895
          58ms start(0ms) executeRelAlgStep RelAlgExecutor.cpp:1043
            58ms start(0ms) executeSort RelAlgExecutor.cpp:3181
              58ms start(0ms) executeWorkUnit RelAlgExecutor.cpp:3585
                3ms start(0ms) compileWorkUnit NativeCodegen.cpp:2676
                  0ms start(3ms) markDeadRuntimeFuncs NativeCodegen.cpp:1864
                  0ms start(4ms) optimizeAndCodegenGPU NativeCodegen.cpp:1340
                0ms start(4ms) ExecutionKernel::run ExecutionKernel.cpp:126
                0ms start(4ms) fetchChunks Execute.cpp:2928
                0ms start(4ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:780
                53ms start(4ms) executePlanWithGroupBy Execute.cpp:3537
                  51ms start(5ms) launchGpuCode QueryExecutionContext.cpp:225
                  0ms start(56ms) getRowSet QueryExecutionContext.cpp:160
                    0ms start(56ms) reduceMultiDeviceResults Execute.cpp:1294
                      0ms start(56ms) reduceMultiDeviceResultSets Execute.cpp:1340
                0ms start(58ms) collectAllDeviceResults Execute.cpp:2320
                  0ms start(58ms) reduceMultiDeviceResults Execute.cpp:1294
                    0ms start(58ms) reduceMultiDeviceResultSets Execute.cpp:1340
    

    then we add a simple HAVING count(*) < 10

    126ms total duration for executeRelAlgQuery
      126ms start(0ms) executeRelAlgQueryNoRetry RelAlgExecutor.cpp:588
        0ms start(0ms) Query pre-execution steps RelAlgExecutor.cpp:589
        125ms start(0ms) executeRelAlgSeq RelAlgExecutor.cpp:895
          47ms start(0ms) executeRelAlgStep RelAlgExecutor.cpp:1043
            47ms start(0ms) executeCompound RelAlgExecutor.cpp:2238
              47ms start(0ms) executeWorkUnit RelAlgExecutor.cpp:3585
                3ms start(0ms) compileWorkUnit NativeCodegen.cpp:2676
                  0ms start(3ms) markDeadRuntimeFuncs NativeCodegen.cpp:1864
                  0ms start(3ms) optimizeAndCodegenGPU NativeCodegen.cpp:1340
                0ms start(4ms) ExecutionKernel::run ExecutionKernel.cpp:126
                0ms start(4ms) fetchChunks Execute.cpp:2928
                0ms start(4ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:780
                42ms start(4ms) executePlanWithGroupBy Execute.cpp:3537
                  39ms start(4ms) launchGpuCode QueryExecutionContext.cpp:225
                  0ms start(44ms) getRowSet QueryExecutionContext.cpp:160
                    0ms start(44ms) reduceMultiDeviceResults Execute.cpp:1294
                      0ms start(44ms) reduceMultiDeviceResultSets Execute.cpp:1340
                0ms start(47ms) collectAllDeviceResults Execute.cpp:2320
                  0ms start(47ms) reduceMultiDeviceResults Execute.cpp:1294
                    0ms start(47ms) reduceMultiDeviceResultSets Execute.cpp:1340
          78ms start(48ms) executeRelAlgStep RelAlgExecutor.cpp:1043
            77ms start(48ms) executeSort RelAlgExecutor.cpp:3181
              77ms start(48ms) executeWorkUnit RelAlgExecutor.cpp:3585
                1ms start(48ms) compileWorkUnit NativeCodegen.cpp:2676
                  0ms start(49ms) markDeadRuntimeFuncs NativeCodegen.cpp:1864
                  0ms start(50ms) optimizeAndCodegenGPU NativeCodegen.cpp:1340
                35ms start(50ms) synthesize_metadata InputMetadata.cpp:307
                0ms start(85ms) ExecutionKernel::run ExecutionKernel.cpp:126
                37ms start(85ms) fetchChunks Execute.cpp:2928
                  36ms start(85ms) ColumnarResults ColumnarResults.cpp:167
                0ms start(122ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:780
                2ms start(123ms) executePlanWithGroupBy Execute.cpp:3537
                  2ms start(123ms) launchGpuCode QueryExecutionContext.cpp:225
                  0ms start(125ms) getRowSet QueryExecutionContext.cpp:160
                    0ms start(125ms) reduceMultiDeviceResults Execute.cpp:1294
                      0ms start(125ms) reduceMultiDeviceResultSets Execute.cpp:1340
    

    And this is adding some other steps, including a ColumnmarResult so intermediate results. Fusing a filter that needs reduced results would be impossible, so the intermediate result.

    I'm sorry if this isn't the best example, but I don't know every scenario where intermediate results are used.

    Regards, Cadido

    0
    Comment actions Permalink
  • Avatar
    Lily Liu

    [quote="candido.dessanti, post:6, topic:2983"] Fusing a filter that needs reduced results would be impossible, so the intermediate result. [/quote]

    Hi thanks for the reply! Should the query still be fusable if there is only one GPU? you can perform filtering directly after groupby, you don't need reduce results if there is only one GPU?

    0
    Comment actions Permalink
  • Avatar
    Candido Dessanti

    Hi,

    Nope, also with just one GPU, you need to reduce the results. The GPUs are the hardware that needs to be used in an MPP way, so 1 or more GPUs in a system is going to need a reduction after a scan-join-filter-aggregate, filter on the reduced aggregate

    Candido

    0
    Comment actions Permalink

Please sign in to leave a comment.