Analyzing the time of queries sub-steps
Hi @jiashenC,
I created a thread to discuss about hot to interpret the output of the debug timer command
You said:
One thing puzzles me is for some queries, the time reported in debug timer is different from time profiled from the client. Please see the following output. Thread 6 only reports 460 ms, but clients gets ~1300 ms. This is same as what reported by thread 1, but huge difference than thread 6. How should I interpret this?
I guess there is something that's initializing the first time you run a query after the cold of the database
Try to run another query on another table like a select distinct f1 from table1 limit 10;
I guess most of the overhead will go away.
About the incomplete out, it looks that when you use CTEs a chunk of the debug timer output is omitted, and that seems the reason why aren't you getting a lòarge chunk of the outìput comprising the memory transfers and the GPU execution.
Removing the CTE part of the query and running a simple join and aggregate query using the TPC-H schema
select extract(month from O_ORDERDATE),
O_ORDERSTATUS,
sum(L_EXTENDEDPRICE*L_QUANTITY)
from lineitem
join orders on L_ORDERKEY=O_ORDERKEY
where L_COMMITDATE between '1992-03-03' and '1992-12-03'
group by 1,2;
and that's the full output
242ms total duration for executeRelAlgQuery
242ms start(0ms) executeRelAlgQueryNoRetry RelAlgExecutor.cpp:516
0ms start(0ms) Query pre-execution steps RelAlgExecutor.cpp:517
241ms start(0ms) executeRelAlgSeq RelAlgExecutor.cpp:814
241ms start(0ms) executeRelAlgStep RelAlgExecutor.cpp:962
241ms start(0ms) executeSort RelAlgExecutor.cpp:2985
241ms start(1ms) executeWorkUnit RelAlgExecutor.cpp:3388
131ms start(1ms) compileWorkUnit NativeCodegen.cpp:2589
11ms start(2ms) getInstance HashJoin.cpp:295
11ms start(2ms) reify PerfectJoinHashTable.cpp:354
10ms start(2ms) getOneColumnFragment ColumnFetcher.cpp:78
New thread(9)
0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
End thread(9)
2ms start(14ms) codegen GpuSharedMemoryUtils.cpp:55
0ms start(16ms) markDeadRuntimeFuncs NativeCodegen.cpp:1849
116ms start(16ms) optimizeAndCodegenGPU NativeCodegen.cpp:1326
0ms start(17ms) initializeNVPTXBackend NativeCodegen.cpp:1442
115ms start(17ms) generateNativeGPUCode NativeCodegen.cpp:1126
0ms start(17ms) check_module_requires_libdevice NativeCodegen.cpp:272
6ms start(17ms) optimize_ir NativeCodegen.cpp:310
13ms start(24ms) generatePTX NativeCodegen.cpp:1405
94ms start(38ms) ptx_to_cubin NvidiaKernel.cpp:128
0ms start(133ms) ExecutionKernel::run ExecutionKernel.cpp:125
97ms start(133ms) fetchChunks Execute.cpp:2941
0ms start(231ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:781
10ms start(231ms) executePlanWithGroupBy Execute.cpp:3583
5ms start(231ms) launchGpuCode QueryExecutionContext.cpp:223
4ms start(237ms) getRowSet QueryExecutionContext.cpp:158
4ms start(237ms) reduceMultiDeviceResults Execute.cpp:1309
4ms start(237ms) reduceMultiDeviceResultSets Execute.cpp:1355
3ms start(237ms) generateNativeCPUCode NativeCodegen.cpp:443
2ms start(237ms) optimize_ir NativeCodegen.cpp:310
1ms start(240ms) create_execution_engine NativeCodegen.cpp:419
0ms start(241ms) collectAllDeviceResults Execute.cpp:2333
0ms start(241ms) reduceMultiDeviceResults Execute.cpp:1309
0ms start(241ms) reduceMultiDeviceResultSets Execute.cpp:1355
the runtime is higher than the 242ms because for whatever reason 65ms has been spent in the parser
2022-06-09T14:09:03.923738 I 3835849 0 1 Calcite.cpp:635 Time in Thrift 2 (ms), Time in Java Calcite server 65 (ms)
Analyzing the out we can notice that we spent around 11ms to create a hash-index to manage the hash-join
11ms start(2ms) getInstance HashJoin.cpp:295
11ms start(2ms) reify PerfectJoinHashTable.cpp:354
10ms start(2ms) getOneColumnFragment ColumnFetcher.cpp:78
New thread(9)
0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
End thread(9)
then around 131ms to generate and compile the llvm code needed to run the query
131ms start(1ms) compileWorkUnit NativeCodegen.cpp:2589
[cut]
2ms start(14ms) codegen GpuSharedMemoryUtils.cpp:55
0ms start(16ms) markDeadRuntimeFuncs NativeCodegen.cpp:1849
116ms start(16ms) optimizeAndCodegenGPU NativeCodegen.cpp:1326
0ms start(17ms) initializeNVPTXBackend NativeCodegen.cpp:1442
115ms start(17ms) generateNativeGPUCode NativeCodegen.cpp:1126
0ms start(17ms) check_module_requires_libdevice NativeCodegen.cpp:272
6ms start(17ms) optimize_ir NativeCodegen.cpp:310
13ms start(24ms) generatePTX NativeCodegen.cpp:1405
94ms start(38ms) ptx_to_cubin NvidiaKernel.cpp:128
another 97ms to fetch data from DISK into CPU, then into GPU memory
97ms start(133ms) fetchChunks Execute.cpp:2941
and 10ms to execute the kernel, fetch and reduce the results
10ms start(231ms) executePlanWithGroupBy Execute.cpp:3583
5ms start(231ms) launchGpuCode QueryExecutionContext.cpp:223
4ms start(237ms) getRowSet QueryExecutionContext.cpp:158
4ms start(237ms) reduceMultiDeviceResults Execute.cpp:1309
4ms start(237ms) reduceMultiDeviceResultSets Execute.cpp:1355
3ms start(237ms) generateNativeCPUCode NativeCodegen.cpp:443
2ms start(237ms) optimize_ir NativeCodegen.cpp:310
1ms start(240ms) create_execution_engine NativeCodegen.cpp:419
0ms start(241ms) collectAllDeviceResults Execute.cpp:2333
0ms start(241ms) reduceMultiDeviceResults Execute.cpp:1309
0ms start(241ms) reduceMultiDeviceResultSets Execute.cpp:1355
Running the same query another time, changing the filtered values, making the optimizer reuse the same plan, so the faster execution
12ms total duration for executeRelAlgQuery
12ms start(0ms) executeRelAlgQueryNoRetry RelAlgExecutor.cpp:516
0ms start(0ms) Query pre-execution steps RelAlgExecutor.cpp:517
12ms start(0ms) executeRelAlgSeq RelAlgExecutor.cpp:814
12ms start(0ms) executeRelAlgStep RelAlgExecutor.cpp:962
12ms start(0ms) executeSort RelAlgExecutor.cpp:2985
11ms start(0ms) executeWorkUnit RelAlgExecutor.cpp:3388
3ms start(0ms) compileWorkUnit NativeCodegen.cpp:2589
0ms start(1ms) getInstance HashJoin.cpp:295
0ms start(1ms) reify PerfectJoinHashTable.cpp:354
0ms start(1ms) getOneColumnFragment ColumnFetcher.cpp:78
New thread(33)
0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
End thread(33)
1ms start(2ms) codegen GpuSharedMemoryUtils.cpp:55
0ms start(4ms) markDeadRuntimeFuncs NativeCodegen.cpp:1849
0ms start(4ms) optimizeAndCodegenGPU NativeCodegen.cpp:1326
0ms start(4ms) ExecutionKernel::run ExecutionKernel.cpp:125
0ms start(4ms) fetchChunks Execute.cpp:2941
0ms start(4ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:781
7ms start(4ms) executePlanWithGroupBy Execute.cpp:3583
7ms start(4ms) launchGpuCode QueryExecutionContext.cpp:223
0ms start(11ms) getRowSet QueryExecutionContext.cpp:158
0ms start(11ms) reduceMultiDeviceResults Execute.cpp:1309
0ms start(11ms) reduceMultiDeviceResultSets Execute.cpp:1355
0ms start(12ms) collectAllDeviceResults Execute.cpp:2333
0ms start(12ms) reduceMultiDeviceResults Execute.cpp:1309
0ms start(12ms) reduceMultiDeviceResultSets Execute.cpp:1355
2022-06-09T14:33:46.299971 I 3835849 0 1 DBHandler.cpp:1556 DEBUG_TIMER thread_id(1)
29ms total duration for sql_execute
Also in this query, the calcite is taking a big chunk of query duration (13ms), but it's not important because with the 6.0 release the Calcite, generation, and compilation of the query's code are almost parallelized, so when running queries concurrently, this overhead is less important.
Anyway enclosing the query in a select * from (select ... ) limit 0;
is returning a cutted down debug timer output
12ms total duration for executeRelAlgQuery
12ms start(0ms) executeRelAlgQueryNoRetry RelAlgExecutor.cpp:516
0ms start(0ms) Query pre-execution steps RelAlgExecutor.cpp:517
12ms start(0ms) executeRelAlgSeq RelAlgExecutor.cpp:814
11ms start(0ms) executeRelAlgStep RelAlgExecutor.cpp:962
11ms start(0ms) executeSort RelAlgExecutor.cpp:2985
11ms start(0ms) executeWorkUnit RelAlgExecutor.cpp:3388
11ms start(0ms) compileWorkUnit NativeCodegen.cpp:2589
7ms start(1ms) getInstance HashJoin.cpp:295
7ms start(1ms) reify PerfectJoinHashTable.cpp:354
0ms start(1ms) getOneColumnFragment ColumnFetcher.cpp:78
New thread(15)
0ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
0ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
End thread(15)
1ms start(10ms) codegen GpuSharedMemoryUtils.cpp:55
0ms start(11ms) markDeadRuntimeFuncs NativeCodegen.cpp:1849
0ms start(11ms) optimizeAndCodegenGPU NativeCodegen.cpp:1326
0ms start(12ms) collectAllDeviceResults Execute.cpp:2333
0ms start(12ms) reduceMultiDeviceResults Execute.cpp:1309
while adding an additional limit in the inner query
select * from (
select extract(month from O_ORDERDATE),
O_ORDERSTATUS,
sum(L_EXTENDEDPRICE*L_QUANTITY)
from lineitem
join orders on L_ORDERKEY=O_ORDERKEY
where L_COMMITDATE between '1992-03-03' and '1992-12-03'
group by 1,2
limit 100000)
limit 0;
this way is returning a more complete output, but I would go with the inner query (with a limit?).
19ms total duration for executeRelAlgQuery
19ms start(0ms) executeRelAlgQueryNoRetry RelAlgExecutor.cpp:516
0ms start(0ms) Query pre-execution steps RelAlgExecutor.cpp:517
19ms start(0ms) executeRelAlgSeq RelAlgExecutor.cpp:814
16ms start(0ms) executeRelAlgStep RelAlgExecutor.cpp:962
16ms start(0ms) executeSort RelAlgExecutor.cpp:2985
16ms start(0ms) executeWorkUnit RelAlgExecutor.cpp:3388
11ms start(0ms) compileWorkUnit NativeCodegen.cpp:2589
8ms start(1ms) getInstance HashJoin.cpp:295
8ms start(1ms) reify PerfectJoinHashTable.cpp:354
0ms start(1ms) getOneColumnFragment ColumnFetcher.cpp:78
New thread(18)
1ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
1ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
End thread(18)
1ms start(10ms) codegen GpuSharedMemoryUtils.cpp:55
0ms start(12ms) markDeadRuntimeFuncs NativeCodegen.cpp:1849
0ms start(12ms) optimizeAndCodegenGPU NativeCodegen.cpp:1326
0ms start(12ms) ExecutionKernel::run ExecutionKernel.cpp:125
0ms start(12ms) fetchChunks Execute.cpp:2941
0ms start(12ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:781
4ms start(12ms) executePlanWithGroupBy Execute.cpp:3583
3ms start(12ms) launchGpuCode QueryExecutionContext.cpp:223
0ms start(16ms) getRowSet QueryExecutionContext.cpp:158
0ms start(16ms) reduceMultiDeviceResults Execute.cpp:1309
0ms start(16ms) reduceMultiDeviceResultSets Execute.cpp:1355
0ms start(17ms) collectAllDeviceResults Execute.cpp:2333
0ms start(17ms) reduceMultiDeviceResults Execute.cpp:1309
0ms start(17ms) reduceMultiDeviceResultSets Execute.cpp:1355
2ms start(17ms) executeRelAlgStep RelAlgExecutor.cpp:962
2ms start(17ms) executeSort RelAlgExecutor.cpp:2985
2ms start(17ms) executeWorkUnit RelAlgExecutor.cpp:3388
1ms start(17ms) compileWorkUnit NativeCodegen.cpp:2589
0ms start(18ms) markDeadRuntimeFuncs NativeCodegen.cpp:1849
0ms start(18ms) optimizeAndCodegenGPU NativeCodegen.cpp:1326
0ms start(18ms) collectAllDeviceResults Execute.cpp:2333
1ms start(18ms) compileWorkUnit NativeCodegen.cpp:2589
0ms start(19ms) markDeadRuntimeFuncs NativeCodegen.cpp:1849
0ms start(19ms) optimizeAndCodegenGPU NativeCodegen.cpp:1326
0ms start(19ms) resultsUnion Execute.cpp:1279
Besides that, I also noticed some discrepancies between the runtime and the total time, but at this time I haven't if it's something introduced in the 6.0 release, or something else.
Regards, Candido
-
Thanks!
In addition, putting the query inside
select * from (...) limit 0
seems to affect the query execution time in debug timer as well. My inner query is just a join between two tablesselect * from tb1,tb2 where tb1.k=tb2.k
. Here, I compare both runs under the system is warm up.The following log is when I just run the inner query, which indicates ~400ms.
439ms total duration for executeRelAlgQuery 439ms start(0ms) executeRelAlgQueryNoRetry RelAlgExecutor.cpp:516 0ms start(0ms) Query pre-execution steps RelAlgExecutor.cpp:517 439ms start(0ms) executeRelAlgSeq RelAlgExecutor.cpp:814 439ms start(0ms) executeRelAlgStep RelAlgExecutor.cpp:962 439ms start(0ms) executeProject RelAlgExecutor.cpp:2166 438ms start(0ms) executeWorkUnit RelAlgExecutor.cpp:3388 26ms start(0ms) compileWorkUnit NativeCodegen.cpp:2589 24ms start(2ms) getInstance HashJoin.cpp:295 24ms start(2ms) reify PerfectJoinHashTable.cpp:354 0ms start(2ms) getOneColumnFragment ColumnFetcher.cpp:78 New thread(18) 24ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721 23ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89 End thread(18) 0ms start(26ms) markDeadRuntimeFuncs NativeCodegen.cpp:1849 0ms start(26ms) optimizeAndCodegenGPU NativeCodegen.cpp:1326 0ms start(27ms) ExecutionKernel::run ExecutionKernel.cpp:125 0ms start(27ms) fetchChunks Execute.cpp:2941 0ms start(27ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:696 41ms start(27ms) executePlanWithoutGroupBy Execute.cpp:3362 41ms start(27ms) launchGpuCode QueryExecutionContext.cpp:223 0ms start(69ms) collectAllDeviceResults Execute.cpp:2333 0ms start(69ms) reduceMultiDeviceResults Execute.cpp:1309 0ms start(69ms) reduceMultiDeviceResultSets Execute.cpp:1355 25ms start(69ms) compileWorkUnit NativeCodegen.cpp:2589 24ms start(70ms) getInstance HashJoin.cpp:295 24ms start(70ms) reify PerfectJoinHashTable.cpp:354 0ms start(70ms) getOneColumnFragment ColumnFetcher.cpp:78 New thread(19) 24ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721 23ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89 End thread(19) 0ms start(94ms) markDeadRuntimeFuncs NativeCodegen.cpp:1849 0ms start(95ms) optimizeAndCodegenGPU NativeCodegen.cpp:1326 0ms start(95ms) ExecutionKernel::run ExecutionKernel.cpp:125 0ms start(95ms) fetchChunks Execute.cpp:2941 0ms start(95ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:696 New thread(20) 0ms start(0ms) ExecutionKernel::run ExecutionKernel.cpp:125 0ms start(17ms) fetchChunks Execute.cpp:2941 0ms start(17ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:696 42ms start(17ms) executePlanWithGroupBy Execute.cpp:3583 42ms start(17ms) launchGpuCode QueryExecutionContext.cpp:223 0ms start(59ms) getRowSet QueryExecutionContext.cpp:158 0ms start(59ms) reduceMultiDeviceResults Execute.cpp:1309 0ms start(59ms) reduceMultiDeviceResultSets Execute.cpp:1355 End thread(20) New thread(21) 0ms start(0ms) ExecutionKernel::run ExecutionKernel.cpp:125 0ms start(59ms) fetchChunks Execute.cpp:2941 0ms start(59ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:696 38ms start(59ms) executePlanWithGroupBy Execute.cpp:3583 38ms start(59ms) launchGpuCode QueryExecutionContext.cpp:223 0ms start(97ms) getRowSet QueryExecutionContext.cpp:158 0ms start(97ms) reduceMultiDeviceResults Execute.cpp:1309 0ms start(97ms) reduceMultiDeviceResultSets Execute.cpp:1355 End thread(21) 16ms start(95ms) executePlanWithGroupBy Execute.cpp:3583 New thread(22) 0ms start(0ms) ExecutionKernel::run ExecutionKernel.cpp:125 0ms start(303ms) fetchChunks Execute.cpp:2941 0ms start(303ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:696 39ms start(304ms) executePlanWithGroupBy Execute.cpp:3583 39ms start(304ms) launchGpuCode QueryExecutionContext.cpp:223 0ms start(343ms) getRowSet QueryExecutionContext.cpp:158 0ms start(343ms) reduceMultiDeviceResults Execute.cpp:1309 0ms start(343ms) reduceMultiDeviceResultSets Execute.cpp:1355 End thread(22) New thread(23) 0ms start(0ms) ExecutionKernel::run ExecutionKernel.cpp:125 0ms start(179ms) fetchChunks Execute.cpp:2941 0ms start(179ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:696 43ms start(179ms) executePlanWithGroupBy Execute.cpp:3583 43ms start(179ms) launchGpuCode QueryExecutionContext.cpp:223 0ms start(222ms) getRowSet QueryExecutionContext.cpp:158 0ms start(222ms) reduceMultiDeviceResults Execute.cpp:1309 0ms start(222ms) reduceMultiDeviceResultSets Execute.cpp:1355 End thread(23) New thread(24) 0ms start(0ms) ExecutionKernel::run ExecutionKernel.cpp:125 0ms start(97ms) fetchChunks Execute.cpp:2941 0ms start(97ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:696 39ms start(97ms) executePlanWithGroupBy Execute.cpp:3583 39ms start(97ms) launchGpuCode QueryExecutionContext.cpp:223 0ms start(136ms) getRowSet QueryExecutionContext.cpp:158 0ms start(136ms) reduceMultiDeviceResults Execute.cpp:1309 0ms start(136ms) reduceMultiDeviceResultSets Execute.cpp:1355 End thread(24) New thread(25) 0ms start(0ms) ExecutionKernel::run ExecutionKernel.cpp:125 0ms start(137ms) fetchChunks Execute.cpp:2941 0ms start(137ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:696 42ms start(137ms) executePlanWithGroupBy Execute.cpp:3583 42ms start(137ms) launchGpuCode QueryExecutionContext.cpp:223 0ms start(179ms) getRowSet QueryExecutionContext.cpp:158 0ms start(179ms) reduceMultiDeviceResults Execute.cpp:1309 0ms start(179ms) reduceMultiDeviceResultSets Execute.cpp:1355 End thread(25) New thread(26) 0ms start(0ms) ExecutionKernel::run ExecutionKernel.cpp:125 0ms start(264ms) fetchChunks Execute.cpp:2941 0ms start(264ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:696 39ms start(264ms) executePlanWithGroupBy Execute.cpp:3583 39ms start(264ms) launchGpuCode QueryExecutionContext.cpp:223 0ms start(303ms) getRowSet QueryExecutionContext.cpp:158 0ms start(303ms) reduceMultiDeviceResults Execute.cpp:1309 0ms start(303ms) reduceMultiDeviceResultSets Execute.cpp:1355 End thread(26) New thread(27) 0ms start(0ms) ExecutionKernel::run ExecutionKernel.cpp:125 0ms start(222ms) fetchChunks Execute.cpp:2941 0ms start(222ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:696 41ms start(222ms) executePlanWithGroupBy Execute.cpp:3583 41ms start(222ms) launchGpuCode QueryExecutionContext.cpp:223 0ms start(264ms) getRowSet QueryExecutionContext.cpp:158 0ms start(264ms) reduceMultiDeviceResults Execute.cpp:1309 0ms start(264ms) reduceMultiDeviceResultSets Execute.cpp:1355 End thread(27) 16ms start(95ms) launchGpuCode QueryExecutionContext.cpp:223 0ms start(112ms) getRowSet QueryExecutionContext.cpp:158 0ms start(112ms) reduceMultiDeviceResults Execute.cpp:1309 0ms start(112ms) reduceMultiDeviceResultSets Execute.cpp:1355 0ms start(439ms) resultsUnion Execute.cpp:1279
But, if I add the
limit
statement, the execution time becomes much less (~50ms)52ms total duration for executeRelAlgQuery 52ms start(0ms) executeRelAlgQueryNoRetry RelAlgExecutor.cpp:516 0ms start(0ms) Query pre-execution steps RelAlgExecutor.cpp:517 52ms start(0ms) executeRelAlgSeq RelAlgExecutor.cpp:814 52ms start(0ms) executeRelAlgStep RelAlgExecutor.cpp:962 52ms start(0ms) executeSort RelAlgExecutor.cpp:2985 52ms start(0ms) executeWorkUnit RelAlgExecutor.cpp:3388 25ms start(0ms) compileWorkUnit NativeCodegen.cpp:2589 24ms start(1ms) getInstance HashJoin.cpp:295 24ms start(1ms) reify PerfectJoinHashTable.cpp:354 0ms start(1ms) getOneColumnFragment ColumnFetcher.cpp:78 New thread(13) 24ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721 23ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89 End thread(13) 0ms start(25ms) markDeadRuntimeFuncs NativeCodegen.cpp:1849 0ms start(25ms) optimizeAndCodegenGPU NativeCodegen.cpp:1326 0ms start(26ms) collectAllDeviceResults Execute.cpp:2333 26ms start(26ms) compileWorkUnit NativeCodegen.cpp:2589 24ms start(27ms) getInstance HashJoin.cpp:295 24ms start(27ms) reify PerfectJoinHashTable.cpp:354 0ms start(27ms) getOneColumnFragment ColumnFetcher.cpp:78 New thread(14) 23ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721 23ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89 End thread(14) 0ms start(52ms) markDeadRuntimeFuncs NativeCodegen.cpp:1849 0ms start(52ms) optimizeAndCodegenGPU NativeCodegen.cpp:1326 0ms start(52ms) resultsUnion Execute.cpp:1279
I wonder which represents the closest query execution time?
-
Hi,
Probably the time difference is the time spent on projection.
In the earlier release we had some timings on that, but lately.they are disappeared.
Yes in the client everything in the server is included. The client is calling a thrifth endpoint, and ita reporting the time of a execution in the server reporter by the thrifth call and a Total time spent ti return the result back to the user.
Please sign in to leave a comment.
Comments
3 comments