TPC-H Data Loading
Hi, I'm a new user of omnisci. My current goal is to run the TPC-H benchmark on omnisci.
However, I have no idea how to load the TPC-H data into omnisci. Is there any related script, or any hint?
Currently, I plan to do it in a stupid way:
1. Prepare the TPC-H tables on the local machine.
2. Load the tables into omnisci.
3. Run the query on omnisci.
I'm worried that this stupid way would miss optimizations such as the pymapd data format, and so on.
-
Hi @Eilowangfang,
I wouldn't say you planned anything in a stupid way.
To run regular TPC-H queries on omnisci, you have to start the server with the watchdog disabled, i.e. with the parameter enable-watchdog turned to false, and to be able to run all queries except one, you have to do some rewriting.
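For example, a minimal omnisci.conf sketch (the port and data path here are illustrative defaults; adjust them to your install):

port = 6274
data = "/var/lib/omnisci/data"
enable-watchdog = false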
I can supply everything you need, so contact me by PM or by the email I supplied on GitHub.
-
Yes. Thanks for your help. An email has been sent to candido.dessanti@omnisci.com from wangfang722@outlook.com. Looking forward to your reply.
-
Thanks for your help with TPC-H benchmarking! Now I can execute the TPC-H queries on omnisci.
Well, sorry, I still need your help with: 1) is there any timer that indicates the end-to-end/elapsed time of one query execution? (I couldn't find any in the documentation)
2) I'm using omnisci on GPU. I noticed that once I start the omnisci server, GPU memory is occupied, as shown in the figure, and it stays occupied until I stop the server.
Then there are three questions:
* The data is loaded to GPU memory once when the omnisci server starts, and then the data resides in GPU memory all the time. Is this understanding correct?
* If so, for each query execution, the PCIe transfer of data from CPU to GPU is not included; the running time is only GPU in-device computing. Is this understanding correct?
* Now I want the measured running time to include the PCIe transfer time (as it is not fair for me to ignore data transfer). How can I achieve this time measurement?
-
I have found the execution time in the log files in /mapd_log. The figure shows the execution time of TPC-H Q10 (repeated 5 times) from the log file.
I have two questions: 1) Does the “execution_time_ms” include PCIe data transfer time?
2) I repeated the query five times; the first execution is always much longer, and the following ones are fast. The long first execution might be:
A. due to PCIe data transfer for the first query; subsequent queries (with the same content) might not need the PCIe transfer. In this case, 2857 ms includes PCIe transfer, and 154/149/142/139 ms do not.
B. due to GPU startup and initialization time at the first query; for each subsequent query, the PCIe transfer is still required. In this case, all of 2857/154/149/142/139 ms include PCIe transfer.
Which do you think is the actual reason?
The query result looks great to me, if the short execution times (e.g., 154/149/142/139 ms) already include PCIe transfer time.
-
Hi @Eilowangfang,
To your question about whether there is a timer that indicates the end-to-end time: the answer is YES. To activate it, you have to start the server with the parameter enable-debug-timer turned to true. It is listed in the docs, and it's supposed to:
Enable fine-grained query execution timers for debug. For debugging, logs verbose timing information for query execution (time to load data, time to compile code, and so on).
The info will be written into the logs at INFO level, so it will be available in [/mapd_storage]/data/mapd_logs/omnisci.INFO.
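For example, a minimal way to turn it on from the command line (a sketch; the storage and log paths are placeholders and vary by install, and the flag can equally go into omnisci.conf):

omnisci_server --data /mapd_storage/data --enable-debug-timer=true
tail -f /mapd_storage/data/mapd_logs/omnisci.INFO   # follow the timers while a query runs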
Running the TPC-H Q10 query after a server cold start, the output is this one:
2020-12-18T16:26:00.744844 I 4773 1 measure.h:74 Timer start getExecutor getExecutor: 158
2020-12-18T16:26:00.744860 I 4773 1 measure.h:81 Timer end getExecutor getExecutor: 158 elapsed 0 ms
2020-12-18T16:26:00.744900 I 4773 1 QueryDispatchQueue.h:61 Dispatching query with 0 queries in the queue.
2020-12-18T16:26:00.744953 I 4773 7 QueryDispatchQueue.h:92 Worker 1 running the query and returning control. There are now 0 queries in the queue.
2020-12-18T16:26:00.744998 I 4773 7 measure.h:74 Timer start getExecutor getExecutor: 158
2020-12-18T16:26:00.745018 I 4773 7 measure.h:81 Timer end getExecutor getExecutor: 158 elapsed 0 ms
2020-12-18T16:26:00.745647 I 4773 7 measure.h:74 Timer start executeRelAlgQuery executeRelAlgQuery: 222
2020-12-18T16:26:00.745668 I 4773 7 measure.h:74 Timer start executeRelAlgQueryNoRetry executeRelAlgQueryNoRetry: 253
2020-12-18T16:26:00.943161 I 4773 7 Catalog.cpp:1485 Time to load Dictionary 3_28 was 197ms
2020-12-18T16:26:01.140584 I 4773 7 Catalog.cpp:1485 Time to load Dictionary 3_30 was 197ms
2020-12-18T16:26:01.313512 I 4773 7 Catalog.cpp:1485 Time to load Dictionary 3_26 was 172ms
2020-12-18T16:26:01.505229 I 4773 7 Catalog.cpp:1485 Time to load Dictionary 3_27 was 191ms
2020-12-18T16:26:01.505494 I 4773 7 Catalog.cpp:1485 Time to load Dictionary 3_22 was 0ms
2020-12-18T16:26:01.505611 I 4773 7 Catalog.cpp:1485 Time to load Dictionary 3_31 was 0ms
2020-12-18T16:26:01.505698 I 4773 7 measure.h:74 Timer start executeRelAlgSeq executeRelAlgSeq: 598
2020-12-18T16:26:01.505714 I 4773 7 measure.h:74 Timer start executeRelAlgStep executeRelAlgStep: 677
2020-12-18T16:26:01.505870 I 4773 7 RelAlgExecutor.cpp:1042 Assigned input RelScan(lineitem, ["L_ORDERKEY", "L_PARTKEY", "L_SUPPKEY", "L_LINENUMBER", "L_QUANTITY", "L_EXTENDEDPRICE", "L_DISCOUNT", "L_TAX", "L_RETURNFLAG", "L_LINESTATUS", "L_SHIPDATE", "L_COMMITDATE", "L_RECEIPTDATE", "L_SHIPINSTRUCT", "L_SHIPMODE", "L_COMMENT", "rowid"]) to nest level 0
2020-12-18T16:26:01.505891 I 4773 7 RelAlgExecutor.cpp:1042 Assigned input RelScan(orders, ["O_ORDERKEY", "O_CUSTKEY", "O_ORDERSTATUS", "O_TOTALPRICE", "O_ORDERDATE", "O_ORDERPRIORITY", "O_CLERK", "O_SHIPPRIORITY", "O_COMMENT", "rowid"]) to nest level 1
2020-12-18T16:26:01.505898 I 4773 7 RelAlgExecutor.cpp:1042 Assigned input RelScan(customer, ["C_CUSTKEY", "C_NAME", "C_ADDRESS", "C_NATIONKEY", "C_PHONE", "C_ACCTBAL", "C_MKTSEGMENT", "C_COMMENT", "rowid"]) to nest level 2
2020-12-18T16:26:01.505905 I 4773 7 RelAlgExecutor.cpp:1042 Assigned input RelScan(nation, ["N_NATIONKEY", "N_NAME", "N_REGIONKEY", "N_COMMENT", "rowid"]) to nest level 3
2020-12-18T16:26:01.505929 I 4773 7 RelAlgExecutor.cpp:1042 Assigned input RelScan(lineitem, ["L_ORDERKEY", "L_PARTKEY", "L_SUPPKEY", "L_LINENUMBER", "L_QUANTITY", "L_EXTENDEDPRICE", "L_DISCOUNT", "L_TAX", "L_RETURNFLAG", "L_LINESTATUS", "L_SHIPDATE", "L_COMMITDATE", "L_RECEIPTDATE", "L_SHIPINSTRUCT", "L_SHIPMODE", "L_COMMENT", "rowid"]) to nest level 0
2020-12-18T16:26:01.505947 I 4773 7 RelAlgExecutor.cpp:1042 Assigned input RelScan(orders, ["O_ORDERKEY", "O_CUSTKEY", "O_ORDERSTATUS", "O_TOTALPRICE", "O_ORDERDATE", "O_ORDERPRIORITY", "O_CLERK", "O_SHIPPRIORITY", "O_COMMENT", "rowid"]) to nest level 1
2020-12-18T16:26:01.505953 I 4773 7 RelAlgExecutor.cpp:1042 Assigned input RelScan(customer, ["C_CUSTKEY", "C_NAME", "C_ADDRESS", "C_NATIONKEY", "C_PHONE", "C_ACCTBAL", "C_MKTSEGMENT", "C_COMMENT", "rowid"]) to nest level 2
2020-12-18T16:26:01.505959 I 4773 7 RelAlgExecutor.cpp:1042 Assigned input RelScan(nation, ["N_NATIONKEY", "N_NAME", "N_REGIONKEY", "N_COMMENT", "rowid"]) to nest level 3
2020-12-18T16:26:01.506037 I 4773 7 measure.h:74 Timer start executeWorkUnit executeWorkUnit: 2769
2020-12-18T16:26:01.506046 I 4773 7 measure.h:74 Timer start get_table_infos get_table_infos: 263
2020-12-18T16:26:01.506053 I 4773 7 measure.h:81 Timer end get_table_infos get_table_infos: 263 elapsed 0 ms
2020-12-18T16:26:01.506113 I 4773 7 measure.h:74 Timer start Exec_executeWorkUnit executeWorkUnitImpl: 1405
2020-12-18T16:26:01.506138 I 4773 7 measure.h:74 Timer start query_step_compilation executeWorkUnitImpl: 1425
2020-12-18T16:26:01.506247 I 4773 7 measure.h:81 Timer end query_step_compilation executeWorkUnitImpl: 1425 elapsed 0 ms
2020-12-18T16:26:01.506255 I 4773 7 measure.h:81 Timer end Exec_executeWorkUnit executeWorkUnitImpl: 1405 elapsed 0 ms
2020-12-18T16:26:01.506268 I 4773 7 measure.h:74 Timer start get_table_infos get_table_infos: 263
2020-12-18T16:26:01.506276 I 4773 7 measure.h:81 Timer end get_table_infos get_table_infos: 263 elapsed 0 ms
2020-12-18T16:26:01.506280 I 4773 7 measure.h:74 Timer start Exec_executeWorkUnit executeWorkUnitImpl: 1405
2020-12-18T16:26:01.506289 I 4773 7 measure.h:74 Timer start query_step_compilation executeWorkUnitImpl: 1425
2020-12-18T16:26:01.507720 I 4773 7 measure.h:74 Timer start buildJoinLoops buildJoinLoops: 266
2020-12-18T16:26:01.510955 I 4773 7 BufferMgr.cpp:304 ALLOCATION slab of 8388608 pages (4294967296B) created in 3 ms GPU_MGR:0
2020-12-18T16:26:01.510986 I 4773 7 BufferMgr.cpp:304 ALLOCATION slab of 8388608 pages (4294967296B) created in 0 ms CPU_MGR:0
2020-12-18T16:26:01.584934 I 4773 7 BufferMgr.cpp:304 ALLOCATION slab of 8388608 pages (4294967296B) created in 3 ms GPU_MGR:1
2020-12-18T16:26:01.996789 I 4773 7 measure.h:81 Timer end buildJoinLoops buildJoinLoops: 266 elapsed 489 ms
2020-12-18T16:26:01.998070 I 4773 7 NativeCodegen.cpp:283 module does not require linking against libdevice
2020-12-18T16:26:01.998105 I 4773 7 NativeCodegen.cpp:283 module does not require linking against libdevice
2020-12-18T16:26:02.083164 I 4773 7 measure.h:81 Timer end query_step_compilation executeWorkUnitImpl: 1425 elapsed 576 ms
2020-12-18T16:26:02.083338 I 4773 14 measure.h:74 Timer start kernel_run run: 92
2020-12-18T16:26:02.083397 I 4773 14 measure.h:74 Timer start fetchChunks fetchChunks: 2342
2020-12-18T16:26:02.083890 I 4773 15 measure.h:74 Timer start kernel_run run: 92
2020-12-18T16:26:02.083909 I 4773 15 measure.h:74 Timer start fetchChunks fetchChunks: 2342
2020-12-18T16:26:02.634163 I 4773 14 measure.h:81 Timer end fetchChunks fetchChunks: 2342 elapsed 550 ms
2020-12-18T16:26:02.634241 I 4773 14 measure.h:74 Timer start executePlanWithoutGroupBy executePlanWithoutGroupBy: 2693
2020-12-18T16:26:02.634276 I 4773 14 measure.h:74 Timer start lauchGpuCode launchGpuCode: 200
2020-12-18T16:26:02.727827 I 4773 15 measure.h:81 Timer end fetchChunks fetchChunks: 2342 elapsed 643 ms
2020-12-18T16:26:02.727893 I 4773 15 measure.h:74 Timer start executePlanWithoutGroupBy executePlanWithoutGroupBy: 2693
2020-12-18T16:26:02.727910 I 4773 15 measure.h:74 Timer start lauchGpuCode launchGpuCode: 200
2020-12-18T16:26:02.771668 I 4773 15 measure.h:81 Timer end lauchGpuCode launchGpuCode: 200 elapsed 43 ms
2020-12-18T16:26:02.771713 I 4773 15 measure.h:81 Timer end executePlanWithoutGroupBy executePlanWithoutGroupBy: 2693 elapsed 43 ms
2020-12-18T16:26:02.771734 I 4773 15 measure.h:81 Timer end kernel_run run: 92 elapsed 687 ms
2020-12-18T16:26:02.775700 I 4773 14 measure.h:81 Timer end lauchGpuCode launchGpuCode: 200 elapsed 141 ms
2020-12-18T16:26:02.775716 I 4773 14 measure.h:81 Timer end executePlanWithoutGroupBy executePlanWithoutGroupBy: 2693 elapsed 141 ms
2020-12-18T16:26:02.775736 I 4773 14 measure.h:81 Timer end kernel_run run: 92 elapsed 692 ms
2020-12-18T16:26:02.779040 I 4773 7 measure.h:81 Timer end Exec_executeWorkUnit executeWorkUnitImpl: 1405 elapsed 1272 ms
2020-12-18T16:26:02.779554 I 4773 7 measure.h:74 Timer start Exec_executeWorkUnit executeWorkUnitImpl: 1405
2020-12-18T16:26:02.779595 I 4773 7 measure.h:74 Timer start query_step_compilation executeWorkUnitImpl: 1425
2020-12-18T16:26:02.781268 I 4773 7 measure.h:74 Timer start buildJoinLoops buildJoinLoops: 266
2020-12-18T16:26:02.790142 I 4773 7 measure.h:81 Timer end buildJoinLoops buildJoinLoops: 266 elapsed 8 ms
2020-12-18T16:26:02.791306 I 4773 7 NativeCodegen.cpp:283 module does not require linking against libdevice
2020-12-18T16:26:02.791374 I 4773 7 NativeCodegen.cpp:283 module does not require linking against libdevice
2020-12-18T16:26:02.879975 I 4773 7 measure.h:81 Timer end query_step_compilation executeWorkUnitImpl: 1425 elapsed 100 ms
2020-12-18T16:26:02.880165 I 4773 23 measure.h:74 Timer start kernel_run run: 92
2020-12-18T16:26:02.880199 I 4773 23 measure.h:74 Timer start fetchChunks fetchChunks: 2342
2020-12-18T16:26:02.880148 I 4773 22 measure.h:74 Timer start kernel_run run: 92
2020-12-18T16:26:02.880376 I 4773 22 measure.h:74 Timer start fetchChunks fetchChunks: 2342
2020-12-18T16:26:02.956186 I 4773 23 measure.h:81 Timer end fetchChunks fetchChunks: 2342 elapsed 76 ms
2020-12-18T16:26:02.956326 I 4773 23 measure.h:74 Timer start executePlanWithGroupBy executePlanWithGroupBy: 2896
2020-12-18T16:26:02.956351 I 4773 23 measure.h:74 Timer start lauchGpuCode launchGpuCode: 200
2020-12-18T16:26:03.024190 I 4773 22 measure.h:81 Timer end fetchChunks fetchChunks: 2342 elapsed 143 ms
2020-12-18T16:26:03.024304 I 4773 22 measure.h:74 Timer start executePlanWithGroupBy executePlanWithGroupBy: 2896
2020-12-18T16:26:03.024330 I 4773 22 measure.h:74 Timer start lauchGpuCode launchGpuCode: 200
2020-12-18T16:26:03.115051 I 4773 23 measure.h:81 Timer end lauchGpuCode launchGpuCode: 200 elapsed 158 ms
2020-12-18T16:26:03.116177 I 4773 22 measure.h:81 Timer end lauchGpuCode launchGpuCode: 200 elapsed 91 ms
2020-12-18T16:26:03.119799 I 4773 23 measure.h:81 Timer end executePlanWithGroupBy executePlanWithGroupBy: 2896 elapsed 163 ms
2020-12-18T16:26:03.119840 I 4773 23 measure.h:81 Timer end kernel_run run: 92 elapsed 239 ms
2020-12-18T16:26:03.119846 I 4773 22 measure.h:81 Timer end executePlanWithGroupBy executePlanWithGroupBy: 2896 elapsed 95 ms
2020-12-18T16:26:03.119875 I 4773 22 measure.h:81 Timer end kernel_run run: 92 elapsed 239 ms
2020-12-18T16:26:03.217164 I 4773 7 measure.h:81 Timer end Exec_executeWorkUnit executeWorkUnitImpl: 1405 elapsed 437 ms
2020-12-18T16:26:03.217273 I 4773 7 measure.h:81 Timer end executeWorkUnit executeWorkUnit: 2769 elapsed 1711 ms
2020-12-18T16:26:03.359704 I 4773 7 measure.h:81 Timer end executeRelAlgStep executeRelAlgStep: 677 elapsed 1854 ms
2020-12-18T16:26:03.359741 I 4773 7 measure.h:81 Timer end executeRelAlgSeq executeRelAlgSeq: 598 elapsed 1854 ms
2020-12-18T16:26:03.359769 I 4773 7 measure.h:81 Timer end executeRelAlgQueryNoRetry executeRelAlgQueryNoRetry: 253 elapsed 2614 ms
2020-12-18T16:26:03.359778 I 4773 7 measure.h:81 Timer end executeRelAlgQuery executeRelAlgQuery: 222 elapsed 2614 ms
The most relevant entry is:
2020-12-18T16:26:02.083164 I 4773 7 measure.h:81 Timer end query_step_compilation executeWorkUnitImpl: 1425 elapsed 576 ms
The system didn't know the query, so it had to parse, optimize, and generate the LLVM code for each step of the query (you can get the LLVM code with the explain [your_statement] command).
To build the code, the system took 576 ms (if you run the same query with different parameters, the old code is reused).
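For instance, from omnisql (the statement here is only an illustration; any query works):

omnisql> explain SELECT COUNT(*) FROM lineitem;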
Then, for the time to load the data from disk into system RAM and GPU RAM, you have to look at fetchChunks:
2020-12-18T16:26:02.634163 I 4773 14 measure.h:81 Timer end fetchChunks fetchChunks: 2342 elapsed **550** ms
2020-12-18T16:26:02.727827 I 4773 15 measure.h:81 Timer end fetchChunks fetchChunks: 2342 elapsed **643** ms
Those threads run in parallel, so the entire operation likely lasted 643 ms. I have a disk subsystem capable of 10 GB/sec, so the reads from disk are really fast, and the bottleneck is CPU power and the PCIe bus. You also get other data, like the time the query spent on GPU or CPU, depending on the device the executor chooses for your query.
You will also get a summary in a more readable format at the end of the log, something like this:
2614ms total duration for executeRelAlgQuery
2614ms start(0ms) executeRelAlgQueryNoRetry RelAlgExecutor.cpp:254
760ms start(0ms) Query pre-execution steps RelAlgExecutor.cpp:255
1854ms start(760ms) executeRelAlgSeq RelAlgExecutor.cpp:599
1853ms start(760ms) executeRelAlgStep RelAlgExecutor.cpp:678
1853ms start(760ms) executeSort RelAlgExecutor.cpp:2446
1711ms start(760ms) executeWorkUnit RelAlgExecutor.cpp:2770
0ms start(760ms) compileWorkUnit NativeCodegen.cpp:2409
576ms start(760ms) compileWorkUnit NativeCodegen.cpp:2409
474ms start(762ms) getInstance HashJoin.cpp:480
474ms start(762ms) reify JoinHashTable.cpp:487
New thread(8)
214ms start(0ms) initHashTableForDevice JoinHashTable.cpp:673
End thread(8)
New thread(9)
374ms start(0ms) initHashTableForDevice JoinHashTable.cpp:673
End thread(9)
14ms start(1236ms) getInstance HashJoin.cpp:480
14ms start(1236ms) reify JoinHashTable.cpp:487
New thread(10)
2ms start(0ms) initHashTableForDevice JoinHashTable.cpp:673
End thread(10)
New thread(11)
0ms start(0ms) initHashTableForDevice JoinHashTable.cpp:673
End thread(11)
0ms start(1250ms) getInstance HashJoin.cpp:480
0ms start(1250ms) reify JoinHashTable.cpp:487
New thread(12)
0ms start(0ms) initHashTableForDevice JoinHashTable.cpp:673
End thread(12)
New thread(13)
0ms start(0ms) initHashTableForDevice JoinHashTable.cpp:673
End thread(13)
New thread(14)
0ms start(0ms) ExecutionKernel::run ExecutionKernel.cpp:91
550ms start(0ms) fetchChunks Execute.cpp:2341
0ms start(550ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:703
141ms start(550ms) executePlanWithoutGroupBy Execute.cpp:2694
141ms start(550ms) launchGpuCode QueryExecutionContext.cpp:199
End thread(14)
New thread(15)
0ms start(0ms) ExecutionKernel::run ExecutionKernel.cpp:91
643ms start(0ms) fetchChunks Execute.cpp:2341
0ms start(644ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:703
43ms start(644ms) executePlanWithoutGroupBy Execute.cpp:2694
43ms start(644ms) launchGpuCode QueryExecutionContext.cpp:199
End thread(15)
3ms start(2030ms) collectAllDeviceResults Execute.cpp:1783
3ms start(2030ms) reduceMultiDeviceResults Execute.cpp:887
100ms start(2033ms) compileWorkUnit NativeCodegen.cpp:2409
7ms start(2035ms) getInstance HashJoin.cpp:480
7ms start(2035ms) reify JoinHashTable.cpp:487
New thread(16)
6ms start(0ms) initHashTableForDevice JoinHashTable.cpp:673
End thread(16)
New thread(17)
6ms start(0ms) initHashTableForDevice JoinHashTable.cpp:673
End thread(17)
0ms start(2043ms) getInstance HashJoin.cpp:480
0ms start(2043ms) reify JoinHashTable.cpp:487
New thread(18)
0ms start(0ms) initHashTableForDevice JoinHashTable.cpp:673
End thread(18)
New thread(19)
0ms start(0ms) initHashTableForDevice JoinHashTable.cpp:673
End thread(19)
0ms start(2044ms) getInstance HashJoin.cpp:480
0ms start(2044ms) reify JoinHashTable.cpp:487
New thread(20)
0ms start(0ms) initHashTableForDevice JoinHashTable.cpp:673
End thread(20)
New thread(21)
0ms start(0ms) initHashTableForDevice JoinHashTable.cpp:673
End thread(21)
New thread(22)
0ms start(0ms) ExecutionKernel::run ExecutionKernel.cpp:91
143ms start(0ms) fetchChunks Execute.cpp:2341
0ms start(144ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:703
95ms start(144ms) executePlanWithGroupBy Execute.cpp:2895
91ms start(144ms) launchGpuCode QueryExecutionContext.cpp:199
3ms start(236ms) getRowSet QueryExecutionContext.cpp:136
3ms start(236ms) reduceMultiDeviceResults Execute.cpp:887
3ms start(236ms) reduceMultiDeviceResultSets Execute.cpp:933
End thread(22)
New thread(23)
0ms start(0ms) ExecutionKernel::run ExecutionKernel.cpp:91
76ms start(0ms) fetchChunks Execute.cpp:2341
0ms start(76ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:703
163ms start(76ms) executePlanWithGroupBy Execute.cpp:2895
158ms start(76ms) launchGpuCode QueryExecutionContext.cpp:199
4ms start(234ms) getRowSet QueryExecutionContext.cpp:136
4ms start(234ms) reduceMultiDeviceResults Execute.cpp:887
4ms start(234ms) reduceMultiDeviceResultSets Execute.cpp:933
End thread(23)
97ms start(2374ms) collectAllDeviceResults Execute.cpp:1783
97ms start(2374ms) reduceMultiDeviceResults Execute.cpp:887
97ms start(2374ms) reduceMultiDeviceResultSets Execute.cpp:933
142ms start(2471ms) sort ResultSet.cpp:484
142ms start(2471ms) baselineSort ResultSet.cpp:544
0ms start(2613ms) createComparator ResultSet.h:638
0ms start(2614ms) topPermutation ResultSet.cpp:846
Re-running the query changing the year from 1993 to 1994, the total time is 441 ms.
Clearing the GPU memory cache with the command \clear_gpu and re-running the query, the total time went up to 874 ms, so to transfer the data and populate the GPU caches, the system is likely spending over 400 ms.
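For reference, a quick way to reproduce this comparison from the omnisql client (a session sketch; \timing toggles the client-side timing display):

omnisql> \timing
omnisql> \clear_gpu
(re-run the same query twice and compare the two reported times: the first run pays the transfer, the second runs from the GPU cache)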
So your guesses were mostly right, except that we don't load anything automatically when the OmniSci server starts (there is a parameter that can be used to warm up some caches), and the data resides on the GPU as needed. For a complete GPU execution using the default parameters, all the data must fit in GPU memory; if there isn't enough GPU RAM, the query will be executed on CPU, and if there isn't enough CPU memory, it's likely to fail.
The GPU memory is allocated when needed, and you can limit it using gpu-buffer-mem-bytes which, by default, is 0 and will allocate all available GPU memory in your system (if needed).
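For instance, a one-line omnisci.conf sketch capping the GPU buffer pool (the 16 GB value is only an example):

gpu-buffer-mem-bytes = 16000000000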
I think I replied to your questions, but if I forgot something, please tell me.
Best Regards, Candido ;)
-
Thank you. Your answer is clear to me.
I have configured the file by adding:
I did get the running time info, but only "total_time_ms" and "execution_time_ms", not detailed like what you showed.
BTW, I'm curious about the usage of "num-gpus = 1". My understanding is that it is the number of GPUs used to execute the query. I have two GPUs, and I just want to use one of them, so I set "num-gpus = 1". However, when I run the TPC-H query, both GPU0 and GPU1 are loaded with data, but it seems only GPU0 works. I can't quite understand what happens to the GPUs. Are GPU0 and GPU1 working together on the query? Or does just GPU0 work, while GPU1 only holds the data and does nothing?
-
Hi @Eilowangfang,
You have placed the parameters num-gpus and enable-debug-timer into the [web] section, so they are ignored by the DB and used by the omnisci_web_server, which is likely to fail; thus the database is running on 2 GPUs, and you won't find the new entries in the logfile.
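For reference, a sketch of where those lines belong in omnisci.conf: everything above the [web] header is read by the database, everything below it by the web server (the [web] entries here are illustrative defaults).

num-gpus = 1
enable-debug-timer = true

[web]
port = 6273
frontend = "/opt/omnisci/frontend"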
Looking at the nvidia-smi screen you posted, it looks like it's not running anything on the GPU; while GPU0 is showing 10%, the power consumption is so low that it's unlikely to be running a heavy kernel. Probably the query you are running has fallen back to CPU execution, or there isn't any query running.
While benchmarking, I suggest you collect data in a fine-grained way, e.g. sampling with an interval of 100 ms or 200 ms using the switch -lms, which takes its argument in milliseconds, and collecting the data in CSV format, something like this:
nvidia-smi -lms 200 --query-gpu=gpu_name,index,utilization.gpu,utilization.memory,clocks.current.sm,power.draw,memory.used --format=csv
So having multiple GPUs, you will get an idea about how much they are used.
About your question of how many GPUs are used during a query, the correct answer is that all of them will be used: at some point in time every GPU will be used, and if everything is properly tuned, all the GPUs in the system will be busy during query execution.
The tables in OmniSciDB consist of macro-partitions that we call fragments; at the time of writing, a fragment is assigned to a single GPU, so if the table has 2 fragments and you have 2 GPUs, each GPU will have a fragment to work with, so both will be busy. Still, if a table has 3 fragments, the first GPU will work on two fragments while the second one works on just 1, so it's likely that the second GPU will be idle while the first one is still working.
In the same scenario, if the first fragment has more records than the second one, the first GPU will work more than the second one.
You can mitigate this behavior by manually balancing the number of records across fragments, or by creating the tables with a number of shards equal to the number of GPUs you have in the system.
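As an illustration, a cut-down DDL sketch for a 2-GPU system (the column list is abbreviated and the shard key choice is just an example; to be useful, the shard key should be the column you join on):

CREATE TABLE customer_sharded (
  c_custkey BIGINT NOT NULL,
  c_name TEXT ENCODING DICT,
  c_acctbal DECIMAL(12,2),
  SHARD KEY (c_custkey))
WITH (shard_count = 2);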
Regards, Candido
-
Hi,
Thanks. It was my mistake to put the configuration lines in the wrong place. Now I get the detailed timer results.
When I run TPC-H SF=10 Q10, the GPU utilization is shown in the figure below:
I think the utilization is perhaps reasonable: as my workload is simple (SF=10), the kernel execution only takes around 104 ms, while compilation and other things, such as result printing, take more than 160 ms.
The nvidia-smi GPU utilization denotes "the percentage of time over the last second that any of the SMs was busy". Hence, since the kernel execution time is shorter than the other time components, I think 13% is reasonable. As the nvidia-smi GPU memory utilization actually denotes the percentage of time the memory controller (PCIe) was busy, and the query is executed while the data resides in GPU memory, 1% utilization is also reasonable to me.
Regards, WANG Fang
-
Hi,
The frequencies look a little lower than expected, and so does the power consumption, while the 104 ms kernel runtime is reasonable (400 ms of total time on an RTX 2080 Ti with SF=40).
The memory utilization shouldn't be the PCIe usage, but how busy the GPU memory is; at that rate, the memory scan would have read about 10 GB in the last second... which looks quite weird.
To get the PCIe usage you have to use other tools.
Regards
-
Hi,
Yes, I agree that the memory time is weird. I plan to profile further with nvprof/Nsight Compute.
Could you give some hints on how to use a profiling tool on omnisci? nvprof is normally run against an executable file, while omnisci queries are issued through an interactive command.
I tried packaging the command "$OMNISCI_PATH/bin/omnisql -p HyperInteractive omnisci < scripts.sql" into a shell file query.sh and then running nvprof query.sh, but that doesn't seem to work; no profile data was collected.
-
Hi @Eilowangfang,
You have to run nvprof on the omnisci_server executable.
So you have to stop the service and run the server from the command line:
nvprof [your flags] omnisci_server --data /path_to_omnisci_storage [other omnisci server flags]
Then run some queries and stop the server with Ctrl+C. You should get the output, or a binary file, depending on the parameters you specified.
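A concrete sketch of the whole sequence (the paths and the service name are placeholders that depend on your install):

sudo systemctl stop omnisci_server            # stop the service first
nvprof --log-file /tmp/omnisci_nvprof.log \
    ./bin/omnisci_server --data /path_to_omnisci_storage
# run your queries from another terminal with omnisql, then Ctrl+C the server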
-
I have an update on the discussion of memory utilization here. Previously, the memory utilization reported by nvidia-smi was just 1%, which was quite weird.
I just repeated the profiling using Nsight Compute, and the memory utilization reported is around 86%.
@candido.dessanti, the memory utilization here denotes the time spent on L1/L2 and global memory accesses; 86% might be reasonable (on TPC-H SF=10).
-
@Eilowangfang,
It looks to be the profile of init_hash_join, so it's not the query itself, but a kernel called to initialize the hash join buffers. The duration is 240 nanoseconds... well, it would be nice to be so fast ;)
I will try to cook something up for you during the day, though I am having some trouble with the new driver on my workstation.
It would be nice to recover the posts from the previous version of the community forum, but I think it's not possible.
Regards
-
Hi, yes, the running time of omnisci is good for me.
As mentioned in the email, my next step is to survey the multi-query performance of omnisci. I understand that omnisci currently does not support concurrent multi-query execution, but my scenario needs to support more users, achieving high query throughput.
Could you give me some hints on how to implement this? Thank you so much.
-
I wanted to say that you haven't profiled the query runtime, but a kernel that initializes some buffers to perform hash joins, and Q10 is doing some of them.
To run queries in parallel, you have to create 2 instances of omniscidb, load the same data using the same table names, and connect to the instances in a round-robin manner.
So, as the first step, you have to initialize the storage directories for the two DBs:
./bin/initdb /data_storage_1
./bin/initdb /data_storage_2
Then run two instances of the software, changing storage and ports, and limiting the system and GPU memory accessible to each instance; I allocated 16 GB per instance because you are using V100S GPUs:
./bin/omnisci_server --data /data_storage_1 --gpu-buffer-mem-bytes 16000000000 --cpu-buffer-mem-bytes 32000000000
./bin/omnisci_server --data /data_storage_2 --gpu-buffer-mem-bytes 16000000000 --cpu-buffer-mem-bytes 32000000000 --calcite-port 16279 --http-port 16278 --port 16274
Then create the tables on both instances and populate them; from then on, you are ready to run queries in parallel on your GPUs. Depending on the queries, you will get some performance boost compared to the usual intra-query parallelism we use in the database, but I'm not sure the waste of resources can justify it.
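As a sketch of the round-robin client side, since pymapd was mentioned earlier in this thread (assumptions: pymapd is installed, both instances run locally with the default credentials, the second one on the shifted port 16274, and the database name is tpch):

from itertools import cycle
from pymapd import connect

# one connection per instance; each call to run_query is served by the next one in turn
conns = cycle([
    connect(user="admin", password="HyperInteractive", host="localhost",
            dbname="tpch", port=6274),
    connect(user="admin", password="HyperInteractive", host="localhost",
            dbname="tpch", port=16274),
])

def run_query(sql):
    return next(conns).execute(sql)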
Maybe it is better to add GPUs and rely on the num-executors parameter discussed in the thread https://community.heavy.ai/t/concurrent-queries/2399, turning on the dynamic-watchdog to stop the queries that would stall the system; a sketch of the relevant settings follows.
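A sketch of the corresponding omnisci.conf entries (the values are examples to experiment with, not tuned recommendations):

num-executors = 2
enable-dynamic-watchdog = true
# time limit in milliseconds
dynamic-watchdog-time-limit = 10000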
Regards, Candido
-
> I wanted to say that you haven't profiled the query runtime, but a kernel that initializes some buffers to perform hash joins, and Q10 is doing some of them.
Yes. Sorry, the previous figure made the profiling result confusing. I did get a detailed profile of Q10, which I think includes all the kernels executed for Q10. The figure below shows part of the kernels:
Thanks for your help with concurrent query implementation. I will try soon and update the performance.
You mentioned that "I'm not sure the waste of resources can justify the performance boost." What kind of resources do you think would probably be wasted (global memory, SMs, or others)?
My motivation for running multi-query is that: 1. my dataset is small enough to fit in GPU memory; 2. running a single query cannot fully utilize the GPU resources; 3. running multiple queries can improve resource utilization and also achieve high query throughput. I got this motivation only because the community says so. However, I'm not clear on what specific kinds of GPU resources I should try to use better. The resources I can recognize are:
* GPU global memory
* SM utilization
* registers
* shared memory
I still know very little about GPUs. Is there any other kind of resource that could be better used by concurrent multi-query?
-
Hi,
Yep, the multifrag_query kernel is the one that runs for most of the query runtime.
Well, if you are doing what I did, you are cutting your memory resources in half and doubling the space allocated on disk. When I briefly tried doing that, I got some performance improvement versus a classical serial execution, but not as big as I expected.
Maybe using a Tesla-class card is better, because I tried with an RTX gaming-class card. I am not a great GPU programmer, but maybe you could try running queries while changing cuda-block-size and cuda-grid-size; in the past, the default values were 1024 and 2 times the number of SMs on the GPU (you can get the value from the log or by looking at the specs of your card on the internet).
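As a sketch, both can be overridden at server start (the values here are arbitrary starting points for experimentation, not recommendations):

omnisci_server --data /path_to_omnisci_storage --cuda-block-size 512 --cuda-grid-size 16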
-
Hi. Merry Christmas.
I just got a chance to access an Nvidia DGX-2 server, which has 10 V100-SXM3-32GB GPUs. So I revisited the cost of moving data between CPU memory and GPU memory. However, I found that the data transfer between CPU memory and GPU memory is still slow.
After reading some documents, I think the DGX-2 only provides high bandwidth via NVLink 2.0 for GPU-to-GPU data transfers. Host-to-GPU and GPU-to-host transfers still go via PCIe 3.0. The reason is that the CPU in the DGX-2 is an Intel(R) Xeon(R) Platinum 8168, which does not support NVLink between the host and the GPUs. Do you agree with this?
Seems now I have no chance to use NVLink 2.0 unless I can access IBM Power servers.
-
Hi,
I hope you had a great time these holidays.
Yes, you are right, and I bet that's the main reason why Nvidia switched to AMD's EPYC CPUs with PCIe 4.0 for their new system. In fact, the x86 CPUs are connected to the GPUs with a PCIe switch that's going to limit performance for queries that need to transfer data through the bus.
We bought a DGX A100 to experiment a little with the new architecture, and we are experiencing somewhat low system-to-GPU transfer speeds, as anticipated; I think our choice to rely on caching data in GPU RAM is the right one, right now, to get maximum performance.
I tried out other databases that use the GPU as an accelerator, and they are not as efficient as us at caching.
I never tried OmniSciDB on IBM POWER, but you should get a decent speedup on transfers, though I can't say by how much.
I saw a benchmark somewhere, but the speedup wasn't like night and day; I will try to find it and post it here.
It's a white paper; you can access it here.
There are also some queries that use the TPC-H schema.
The speedup of NVLink over PCIe 3.0 is 78% on average, so over PCIe 4.0 it would be lower (I am saying that because the speed of CPU RAM is going to limit the performance of NVLink 2.0 more than PCIe 4.0).
Have you tried running concurrent queries by duplicating the database?
Regards, Candido
-
Hi, thanks. The white paper shows significant improvement for the example queries. I can see the queries on the TPC-H schema show varying improvements (e.g., from 12% to 198%).
Yes, I tried concurrent queries. My observations so far: 1) the performance degradation of co-executing queries varies, depending on the concurrent workload (i.e., which queries run together); 2) for some queries, concurrent execution can indeed improve throughput while the performance of each individual query is not degraded severely; 3) some queries can easily cause performance interference to other concurrent queries; examples of such queries are Q17, Q19, and Q9.
I listed the experiments with Q5 here. When running Q5 alone (TPC-H SF=10), the execution time is 68.4 ms. For Q5 co-executing with three other queries (time unit: ms):

| Concurrent workload | Companion query times (ms) | Q5 (ms) |
|---|---|---|
| Q1 x Q2 x Q3 x Q5 | Q1: 95.83, Q2: 7247.83, Q3: 235 | 119.23 |
| Q2 x Q3 x Q4 x Q5 | Q2: 7298.5, Q3: 220.16, Q4: 5013.16 | 82.16 |
| Q3 x Q4 x Q6 x Q5 | Q3: 225.83, Q4: 2501.16, Q6: 38.5 | 89.33 |
| Q4 x Q6 x Q7 x Q5 | Q4: 2489.83, Q6: 32.33, Q7: 74 | 92.83 |
| Q6 x Q7 x Q8 x Q5 | Q6: 26.16, Q7: 60.17, Q8: 136.66 | 122.33 |
| Q7 x Q8 x Q9 x Q5 | Q7: 87.16, Q8: 227.33, Q9: 228.66 | 205 |
| Q8 x Q9 x Q10 x Q5 | Q8: 239.16, Q9: 238.5, Q10: 124.83 | 218.66 |
| Q9 x Q10 x Q11 x Q5 | Q9: 192.5, Q10: 134.33, Q11: 783.33 | 145 |
| Q10 x Q11 x Q12 x Q5 | Q10: 122.33, Q11: 719.67, Q12: 48.17 | 98.33 |
| Q11 x Q12 x Q13 x Q5 | Q11: 736, Q12: 60.16, Q13: 199.66 | 97.16 |
| Q12 x Q13 x Q14 x Q5 | Q12: 55, Q13: 175.16, Q14: 507.16 | 92.33 |
| Q13 x Q14 x Q15 x Q5 | Q13: 182, Q14: 495.83, Q15: 243.66 | 84.66 |
| Q14 x Q15 x Q17 x Q5 | Q14: 600.5, Q15: 270.16, Q17: 625.5 | 146.83 |
| Q15 x Q17 x Q18 x Q5 | Q15: 306.5, Q17: 700.83, Q18: 1058.66 | 171.5 |
| Q17 x Q18 x Q19 x Q5 | Q17: 716.83, Q18: 1023.33, Q19: 165.66 | 206 |
| Q18 x Q19 x Q20 x Q5 | Q18: 1286.16, Q19: 118.5, Q20: 1475.83 | 141.5 |
| Q1 x Q3 x Q6 x Q5 | Q1: 70, Q3: 170.8, Q6: 26.2 | 98.8 |
| Q2 x Q4 x Q8 x Q5 | Q2: 7279.6, Q4: 4895.4, Q8: 141.4 | 128.2 |
| Q3 x Q6 x Q9 x Q5 | Q3: 225.5, Q6: 28.33, Q9: 167.83 | 144.16 |
| Q1 x Q10 x Q15 x Q5 | Q1: 96.8, Q10: 119.8, Q15: 248.16 | 113.5 |
| Q7 x Q17 x Q20 x Q5 | Q7: 135.8, Q17: 641, Q20: 1490.8 | 165.4 |
| Q9 x Q19 x Q20 x Q5 | Q9: 223, Q19: 126.6, Q20: 1462.4 | 202.4 |
| Q8 x Q9 x Q17 x Q5 | Q8: 282.4, Q9: 283, Q17: 633.8 | 254 |
| Q8 x Q9 x Q19 x Q5 | Q8: 247.8, Q9: 252, Q19: 118.4 | 220.2 |
| Q8 x Q17 x Q19 x Q5 | Q8: 230.6, Q17: 695, Q19: 136 | 215.8 |
| Q9 x Q17 x Q19 x Q5 | Q9: 282, Q17: 699, Q19: 197 | 241.2 |
As we can see, Q5 co-executes well with some queries, such as Q3, Q6, Q7, and so on.
However, Q17, Q19, and Q9 can easily cause significant performance interference to Q5 (i.e., it becomes 2x slower). Though not shown here, Q17, Q19, and Q9 also cause significant performance interference to Q1, Q6, and Q7. I'm trying to understand what happens there. After reading the query texts, I identified the common features of Q17, Q19, and Q9: 1) many join operations; 2) the query contains more than one subquery.
So I think the reason for the performance interference might be global memory access contention: 1) hash join operations frequently access hash tables residing in GPU memory; 2) subqueries might create intermediate tables and then materialize them in GPU memory, which causes further global memory traffic.
Do you agree? What is your opinion?
Best, WANG Fang
-
Hi @Lily_Liu
To run the TPC-H benchmark, I usually create a dedicated database on my heavydb using heavysql.
candido@zion-legion:~$ heavysql -p HyperInteractive
heavysql> create database tpch;
heavysql> \db tpch
User admin switched to database tpch
heavysql> \q
Create the tables using the DDL script that best fits your config (in this example, something unoptimized using 1 GPU only):
candido@zion-legion:~/tpch$ cat ddl_tpch_1gpu.sql | heavysql -p HyperInteractive tpch
User admin connected to database tpch
Table or View with the name "revenue" already exists.
User admin disconnected from database tpch
To generate the data, I suggest using the tpch-kit from Greg Rahn.
You can download and compile it from tpch-kit
candido@zion-legion:~/tpch$ git clone https://github.com/gregrahn/tpch-kit.git
Cloning into 'tpch-kit'...
remote: Enumerating objects: 8883, done.
remote: Total 8883 (delta 0), reused 0 (delta 0), pack-reused 8883
Receiving objects: 100% (8883/8883), 29.24 MiB | 8.23 MiB/s, done.
Resolving deltas: 100% (1130/1130), done.
candido@zion-legion:~/tpch/tpch-kit/dbgen$ make clean
rm -f dbgen qgen build.o driver.o bm_utils.o rnd.o print.o load_stub.o bcd2.o speed_seed.o text.o permute.o rng64.o build.o bm_utils.o qgen.o rnd.o varsub.o text.o bcd2.o permute.o speed_seed.o rng64.o
candido@zion-legion:~/tpch/tpch-kit/dbgen$ make
gcc -g -DDBNAME=\"dss\" -DLINUX -DPOSTGRESQL -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64 -c -o build.o build.c
gcc -g -DDBNAME=\"dss\" -DLINUX -DPOSTGRESQL -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64 -c -o driver.o driver.c
gcc -g -DDBNAME=\"dss\" -DLINUX -DPOSTGRESQL -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64 -c -o rnd.o rnd.c
gcc -g -DDBNAME=\"dss\" -DLINUX -DPOSTGRESQL -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64 -c -o print.o print.c
gcc -g -DDBNAME=\"dss\" -DLINUX -DPOSTGRESQL -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64 -c -o load_stub.o load_stub.c
gcc -g -DDBNAME=\"dss\" -DLINUX -DPOSTGRESQL -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64 -c -o bcd2.o bcd2.c
gcc -g -DDBNAME=\"dss\" -DLINUX -DPOSTGRESQL -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64 -c -o speed_seed.o speed_seed.c
gcc -g -DDBNAME=\"dss\" -DLINUX -DPOSTGRESQL -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64 -c -o text.o text.c
gcc -g -DDBNAME=\"dss\" -DLINUX -DPOSTGRESQL -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64 -c -o permute.o permute.c
gcc -g -DDBNAME=\"dss\" -DLINUX -DPOSTGRESQL -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64 -c -o rng64.o rng64.c
gcc -g -DDBNAME=\"dss\" -DLINUX -DPOSTGRESQL -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64 -O -o dbgen build.o driver.o bm_utils.o rnd.o print.o load_stub.o bcd2.o speed_seed.o text.o permute.o rng64.o -lm
gcc -g -DDBNAME=\"dss\" -DLINUX -DPOSTGRESQL -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64 -c -o qgen.o qgen.c
gcc -g -DDBNAME=\"dss\" -DLINUX -DPOSTGRESQL -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64 -c -o varsub.o varsub.c
gcc -g -DDBNAME=\"dss\" -DLINUX -DPOSTGRESQL -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64 -O -o qgen build.o bm_utils.o qgen.o rnd.o varsub.o text.o bcd2.o permute.o speed_seed.o rng64.o -lm
Or you can download the one I already compiled from my drive.
Then you can use my script generate_tpch_data.sh to automate the generation of the TPC-H data.
The script syntax is:
generate_tpch_data.sh (number of threads) (scaling factor) (table datafile to build: c|L|S|s|O|P|r|o) (target directory) (dbgen install directory)
For example, to create the data for the customer table using 4 CPU cores (so generating four files) with an SF of just 1 GB, with the tpch-kit located in /home/candido/tpch-kit/dbgen, and /var/lib/heavyai/storage/import as the target directory (because it's a directory the server is allowed to read from):
candido@zion-legion:~/tpch$ ./generate_tpch_data.sh 4 1 c /var/lib/heavyai/storage/import /home/candido/tpch-kit/dbgen
starting 4 thread to build table datafile customer into directory /var/lib/heavyai/storage/import with scaling factor of 1 GB
TPC-H Population Generator (Version 2.17.3)
Copyright Transaction Processing Performance Council 1994 - 2010
TPC-H Population Generator (Version 2.17.3)
Copyright Transaction Processing Performance Council 1994 - 2010
TPC-H Population Generator (Version 2.17.3)
Copyright Transaction Processing Performance Council 1994 - 2010
TPC-H Population Generator (Version 2.17.3)
Copyright Transaction Processing Performance Council 1994 - 2010
candido@zion-legion:~/tpch$ ls -ltr /var/lib/heavyai/storage/import
total 4108
drwxrwxr-x 3 candido candido    4096 mag 29 09:56 sample_datasets
-rw-rw-r-- 1 candido candido 1048741 giu  3 15:05 customer.2.tbl.gz
-rw-rw-r-- 1 candido candido 1049425 giu  3 15:05 customer.1.tbl.gz
-rw-rw-r-- 1 candido candido 1048257 giu  3 15:05 customer.3.tbl.gz
-rw-rw-r-- 1 candido candido 1047675 giu  3 15:05 customer.4.tbl.gz
To create the data files for all the tables, you can run this loop:
for t in $(echo "c L S s O P r o")
do
  ./generate_tpch_data.sh 4 1 $t /var/lib/heavyai/storage/import /home/candido/tpch-kit/dbgen
done;
Now you can load the tables using the copy command.
candido@zion-legion:~/tpch$ heavysql -p HyperInteractive tpch
User admin connected to database tpch
heavysql> copy customer from '/var/lib/heavyai/storage/import/customer.*.tbl.gz' with (delimiter = '|', header='false');
Result
Loaded: 150000 recs, Rejected: 0 recs in 0.458000 secs
heavysql> copy orders from '/var/lib/heavyai/storage/import/orders.*.tbl.gz' with (delimiter = '|', header='false');
Result
Loaded: 1500000 recs, Rejected: 0 recs in 1.410000 secs
heavysql> copy lineitem from '/var/lib/heavyai/storage/import/lineitem.*.tbl.gz' with (delimiter = '|', header='false');
Result
Loaded: 6001215 recs, Rejected: 0 recs in 6.176000 secs
heavysql> copy supplier from '/var/lib/heavyai/storage/import/supplier.*.tbl.gz' with (delimiter = '|', header='false');
Result
Loaded: 10000 recs, Rejected: 0 recs in 0.247000 secs
heavysql> copy part from '/var/lib/heavyai/storage/import/part.*.tbl.gz' with (delimiter = '|', header='false');
Result
Loaded: 200000 recs, Rejected: 0 recs in 0.511000 secs
heavysql> copy partsupp from '/var/lib/heavyai/storage/import/partsupp.*.tbl.gz' with (delimiter = '|', header='false');
Result
Loaded: 800000 recs, Rejected: 0 recs in 1.138000 secs
heavysql> copy region from '/var/lib/heavyai/storage/import/region.*.tbl.gz' with (delimiter = '|', header='false');
Result
Loaded: 20 recs, Rejected: 0 recs in 0.162000 secs
heavysql> copy nation from '/var/lib/heavyai/storage/import/nation.*.tbl.gz' with (delimiter = '|', header='false');
Result
Loaded: 100 recs, Rejected: 0 recs in 0.156000 secs
heavysql> \q
After that, you can run the TPC-H queries. Except for Q21, everything should run.
I tried with this script, which I generated with qgen and slightly modified because it used reserved words.
As for the DDL, I can share two files, one per configuration (1 GPU and multi-GPU).
Let me know if everything is working and if you need some help.
Candido
-
Hi @candido.dessanti ,
I am also looking into the detailed time breakdown in HeavyDB. One thing that puzzles me is that for some queries, the time reported by the debug timer is different from the time profiled at the client. Please see the following output: thread 6 only reports 460 ms, but the client gets ~1300 ms, which matches what thread 1 reports but is hugely different from thread 6. How should I interpret this?
Another thing is that I do not really see any launchGpuCode block here, as in others' outputs, but I do see the query being executed on GPU in nvidia-smi.
2022-06-08T21:57:21.808907 I 151753 0 6 RelAlgExecutor.cpp:476 DEBUG_TIMER thread_id(6)
460ms total duration for executeRelAlgQuery
460ms start(0ms) executeRelAlgQueryNoRetry RelAlgExecutor.cpp:516
0ms start(0ms) Query pre-execution steps RelAlgExecutor.cpp:517
460ms start(0ms) executeRelAlgSeq RelAlgExecutor.cpp:814
460ms start(0ms) executeRelAlgStep RelAlgExecutor.cpp:962
460ms start(0ms) executeSort RelAlgExecutor.cpp:2985
460ms start(0ms) executeWorkUnit RelAlgExecutor.cpp:3388
338ms start(0ms) compileWorkUnit NativeCodegen.cpp:2589
210ms start(1ms) getInstance HashJoin.cpp:295
210ms start(1ms) reify PerfectJoinHashTable.cpp:354
112ms start(1ms) getOneColumnFragment ColumnFetcher.cpp:78
New thread(7)
73ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
72ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
End thread(7)
New thread(8)
24ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
24ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
End thread(8)
0ms start(212ms) markDeadRuntimeFuncs NativeCodegen.cpp:1849
125ms start(213ms) optimizeAndCodegenGPU NativeCodegen.cpp:1326
3ms start(214ms) initializeNVPTXBackend NativeCodegen.cpp:1442
121ms start(217ms) generateNativeGPUCode NativeCodegen.cpp:1126
0ms start(217ms) check_module_requires_libdevice NativeCodegen.cpp:272
9ms start(217ms) optimize_ir NativeCodegen.cpp:310
12ms start(227ms) generatePTX NativeCodegen.cpp:1405
98ms start(239ms) ptx_to_cubin NvidiaKernel.cpp:128
0ms start(339ms) collectAllDeviceResults Execute.cpp:2333
121ms start(339ms) compileWorkUnit NativeCodegen.cpp:2589
24ms start(340ms) getInstance HashJoin.cpp:295
24ms start(340ms) reify PerfectJoinHashTable.cpp:354
0ms start(340ms) getOneColumnFragment ColumnFetcher.cpp:78
New thread(9)
23ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
23ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
End thread(9)
0ms start(365ms) markDeadRuntimeFuncs NativeCodegen.cpp:1849
95ms start(365ms) optimizeAndCodegenGPU NativeCodegen.cpp:1326
95ms start(365ms) generateNativeGPUCode NativeCodegen.cpp:1126
0ms start(365ms) check_module_requires_libdevice NativeCodegen.cpp:272
6ms start(365ms) optimize_ir NativeCodegen.cpp:310
10ms start(372ms) generatePTX NativeCodegen.cpp:1405
77ms start(382ms) ptx_to_cubin NvidiaKernel.cpp:128
0ms start(460ms) resultsUnion Execute.cpp:1279
2022-06-08T21:57:21.809143 I 151753 0 1 DBHandler.cpp:1260 DEBUG_TIMER thread_id(1)
1388ms total duration for sql_execute