TPC-H Data Loading

Comments

26 comments

  • Candido Dessanti

    Hi @Eilowangfang,

    I wouldn't say you planned anything in a stupid way.

    To run regular TPC-H queries on OmniSci, you have to start the server with the watchdog disabled, i.e. with the parameter enable-watchdog turned to false, and to be able to run all the queries except one you have to do some rewriting.
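
    As a minimal sketch (the storage path is just a placeholder, and the same setting can also go into omnisci.conf), starting the server with the watchdog disabled looks like this:

    # command line (adjust the storage path to your installation)
    omnisci_server --data /var/lib/omnisci/data --enable-watchdog=false
    # or the equivalent line in omnisci.conf
    enable-watchdog = false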

    I can supply everything you need, so contact me by PM or by the email I supplied on GitHub.

  • Eilowangfang

    Yes. Thanks for your help. An email has been sent to candido.dessanti@omnisci.com from wangfang722@outlook.com. Looking forward to your reply.

  • Eilowangfang

    Thanks for your help with TPC-H benchmarking! Now I can execute the TPC-H queries on omnisci.

    Well, sorry, I still need your help with two things: 1) Is there any timer that indicates the end-to-end/elapsed time of one query execution? (I couldn't find any in the documentation.)

    2) I'm using OmniSci on GPU. I noticed that once I start the server, GPU memory is occupied, as shown in the figure, and it stays occupied until I stop the server. [screenshot]

    Then there are three questions:
    * The data is loaded into GPU memory once, when the server starts, and then resides in GPU memory the whole time. Is that understanding correct?
    * If so, each query execution does not include the PCIe transfer of data from CPU to GPU, and the running time is only GPU in-device computing. Is that correct?
    * I want to measure a running time that includes the PCIe transfer time (since it would not be fair for me to ignore data transfer). How can I achieve this measurement?

  • Eilowangfang

    I have found the execution time in the log files under /mapd_log. The figure shows the execution time of TPC-H Q10 (repeated 5 times) taken from the log file.

    I have two questions: 1) Does the “execution_time_ms” include PCIe data transfer time?

    2) I repeated the query five times; the first execution is always much longer, and the following ones are fast. The long first execution might be:
    A. due to PCIe data transfer for the first query, while subsequent runs (of the same query) do not need the transfer. In this case, the 2857 ms includes PCIe transfer and the 154/149/142/139 ms do not.
    B. due to GPU startup and initialization at the first query, with the PCIe transfer still needed for every run. In this case, all of 2857/154/149/142/139 ms include PCIe transfer.

    Which do you think is the actual reason? [screenshot]

    The query result looks great to me if the short execution times (e.g., 153/149/143/139 ms) already include PCIe transfer time.

  • Candido Dessanti

    Hi @Eilowangfang,

    To your question about whether there is a timer that indicates the end-to-end elapsed time: the answer is YES. To activate it, you have to start the server with the parameter enable-debug-timer turned to true. It is listed in the docs, and it's described as:

    Enable fine-grained query execution timers for debug. For debugging, logs verbose timing information for query execution (time to load data, time to compile code, and so on).
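
    As a hedged sketch of how to turn it on (paths are placeholders), the flag can be set on the command line or in omnisci.conf:

    # command line
    omnisci_server --data /var/lib/omnisci/data --enable-debug-timer=true
    # or in omnisci.conf (top-level section, not [web])
    enable-debug-timer = true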

    The info will be written into the logs at INFO level, so it will be available at [/mapd_storage]/data/mapd_logs/omnisci.INFO.
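
    A quick, hedged way to pull just the timer lines out of that log (substitute your actual storage path for [/mapd_storage]):

    # show only the debug-timer lines
    grep "Timer" [/mapd_storage]/data/mapd_logs/omnisci.INFO
    # or follow them live while a query runs
    tail -f [/mapd_storage]/data/mapd_logs/omnisci.INFO | grep --line-buffered "Timer end"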

    Running the TPC-H Q10 query after a server cold start, the output is this one:

    2020-12-18T16:26:00.744844 I 4773 1 measure.h:74 Timer start                         getExecutor                         getExecutor:  158
    2020-12-18T16:26:00.744860 I 4773 1 measure.h:81 Timer end                           getExecutor                         getExecutor:  158 elapsed 0 ms
    2020-12-18T16:26:00.744900 I 4773 1 QueryDispatchQueue.h:61 Dispatching query with 0 queries in the queue.
    2020-12-18T16:26:00.744953 I 4773 7 QueryDispatchQueue.h:92 Worker 1 running the query and returning control. There are now 0 queries in the queue.
    2020-12-18T16:26:00.744998 I 4773 7 measure.h:74 Timer start                         getExecutor                         getExecutor:  158
    2020-12-18T16:26:00.745018 I 4773 7 measure.h:81 Timer end                           getExecutor                         getExecutor:  158 elapsed 0 ms
    2020-12-18T16:26:00.745647 I 4773 7 measure.h:74 Timer start                  executeRelAlgQuery                  executeRelAlgQuery:  222
    2020-12-18T16:26:00.745668 I 4773 7 measure.h:74 Timer start           executeRelAlgQueryNoRetry           executeRelAlgQueryNoRetry:  253
    2020-12-18T16:26:00.943161 I 4773 7 Catalog.cpp:1485 Time to load Dictionary 3_28 was 197ms
    2020-12-18T16:26:01.140584 I 4773 7 Catalog.cpp:1485 Time to load Dictionary 3_30 was 197ms
    2020-12-18T16:26:01.313512 I 4773 7 Catalog.cpp:1485 Time to load Dictionary 3_26 was 172ms
    2020-12-18T16:26:01.505229 I 4773 7 Catalog.cpp:1485 Time to load Dictionary 3_27 was 191ms
    2020-12-18T16:26:01.505494 I 4773 7 Catalog.cpp:1485 Time to load Dictionary 3_22 was 0ms
    2020-12-18T16:26:01.505611 I 4773 7 Catalog.cpp:1485 Time to load Dictionary 3_31 was 0ms
    2020-12-18T16:26:01.505698 I 4773 7 measure.h:74 Timer start                    executeRelAlgSeq                    executeRelAlgSeq:  598
    2020-12-18T16:26:01.505714 I 4773 7 measure.h:74 Timer start                   executeRelAlgStep                   executeRelAlgStep:  677
    2020-12-18T16:26:01.505870 I 4773 7 RelAlgExecutor.cpp:1042 Assigned input RelScan(lineitem, ["L_ORDERKEY", "L_PARTKEY", "L_SUPPKEY", "L_LINENUMBER", "L_QUANTITY", "L_EXTENDEDPRICE", "L_DISCOUNT", "L_TAX", "L_RETURNFLAG", "L_LINESTATUS", "L_SHIPDATE", "L_COMMITDATE", "L_RECEIPTDATE", "L_SHIPINSTRUCT", "L_SHIPMODE", "L_COMMENT", "rowid"]) to nest level 0
    2020-12-18T16:26:01.505891 I 4773 7 RelAlgExecutor.cpp:1042 Assigned input RelScan(orders, ["O_ORDERKEY", "O_CUSTKEY", "O_ORDERSTATUS", "O_TOTALPRICE", "O_ORDERDATE", "O_ORDERPRIORITY", "O_CLERK", "O_SHIPPRIORITY", "O_COMMENT", "rowid"]) to nest level 1
    2020-12-18T16:26:01.505898 I 4773 7 RelAlgExecutor.cpp:1042 Assigned input RelScan(customer, ["C_CUSTKEY", "C_NAME", "C_ADDRESS", "C_NATIONKEY", "C_PHONE", "C_ACCTBAL", "C_MKTSEGMENT", "C_COMMENT", "rowid"]) to nest level 2
    2020-12-18T16:26:01.505905 I 4773 7 RelAlgExecutor.cpp:1042 Assigned input RelScan(nation, ["N_NATIONKEY", "N_NAME", "N_REGIONKEY", "N_COMMENT", "rowid"]) to nest level 3
    2020-12-18T16:26:01.505929 I 4773 7 RelAlgExecutor.cpp:1042 Assigned input RelScan(lineitem, ["L_ORDERKEY", "L_PARTKEY", "L_SUPPKEY", "L_LINENUMBER", "L_QUANTITY", "L_EXTENDEDPRICE", "L_DISCOUNT", "L_TAX", "L_RETURNFLAG", "L_LINESTATUS", "L_SHIPDATE", "L_COMMITDATE", "L_RECEIPTDATE", "L_SHIPINSTRUCT", "L_SHIPMODE", "L_COMMENT", "rowid"]) to nest level 0
    2020-12-18T16:26:01.505947 I 4773 7 RelAlgExecutor.cpp:1042 Assigned input RelScan(orders, ["O_ORDERKEY", "O_CUSTKEY", "O_ORDERSTATUS", "O_TOTALPRICE", "O_ORDERDATE", "O_ORDERPRIORITY", "O_CLERK", "O_SHIPPRIORITY", "O_COMMENT", "rowid"]) to nest level 1
    2020-12-18T16:26:01.505953 I 4773 7 RelAlgExecutor.cpp:1042 Assigned input RelScan(customer, ["C_CUSTKEY", "C_NAME", "C_ADDRESS", "C_NATIONKEY", "C_PHONE", "C_ACCTBAL", "C_MKTSEGMENT", "C_COMMENT", "rowid"]) to nest level 2
    2020-12-18T16:26:01.505959 I 4773 7 RelAlgExecutor.cpp:1042 Assigned input RelScan(nation, ["N_NATIONKEY", "N_NAME", "N_REGIONKEY", "N_COMMENT", "rowid"]) to nest level 3
    2020-12-18T16:26:01.506037 I 4773 7 measure.h:74 Timer start                     executeWorkUnit                     executeWorkUnit: 2769
    2020-12-18T16:26:01.506046 I 4773 7 measure.h:74 Timer start                     get_table_infos                     get_table_infos:  263
    2020-12-18T16:26:01.506053 I 4773 7 measure.h:81 Timer end                       get_table_infos                     get_table_infos:  263 elapsed 0 ms
    2020-12-18T16:26:01.506113 I 4773 7 measure.h:74 Timer start                Exec_executeWorkUnit                 executeWorkUnitImpl: 1405
    2020-12-18T16:26:01.506138 I 4773 7 measure.h:74 Timer start              query_step_compilation                 executeWorkUnitImpl: 1425
    2020-12-18T16:26:01.506247 I 4773 7 measure.h:81 Timer end                query_step_compilation                 executeWorkUnitImpl: 1425 elapsed 0 ms
    2020-12-18T16:26:01.506255 I 4773 7 measure.h:81 Timer end                  Exec_executeWorkUnit                 executeWorkUnitImpl: 1405 elapsed 0 ms
    2020-12-18T16:26:01.506268 I 4773 7 measure.h:74 Timer start                     get_table_infos                     get_table_infos:  263
    2020-12-18T16:26:01.506276 I 4773 7 measure.h:81 Timer end                       get_table_infos                     get_table_infos:  263 elapsed 0 ms
    2020-12-18T16:26:01.506280 I 4773 7 measure.h:74 Timer start                Exec_executeWorkUnit                 executeWorkUnitImpl: 1405
    2020-12-18T16:26:01.506289 I 4773 7 measure.h:74 Timer start              query_step_compilation                 executeWorkUnitImpl: 1425
    2020-12-18T16:26:01.507720 I 4773 7 measure.h:74 Timer start                      buildJoinLoops                      buildJoinLoops:  266
    2020-12-18T16:26:01.510955 I 4773 7 BufferMgr.cpp:304 ALLOCATION slab of 8388608 pages (4294967296B) created in 3 ms GPU_MGR:0
    2020-12-18T16:26:01.510986 I 4773 7 BufferMgr.cpp:304 ALLOCATION slab of 8388608 pages (4294967296B) created in 0 ms CPU_MGR:0
    2020-12-18T16:26:01.584934 I 4773 7 BufferMgr.cpp:304 ALLOCATION slab of 8388608 pages (4294967296B) created in 3 ms GPU_MGR:1
    2020-12-18T16:26:01.996789 I 4773 7 measure.h:81 Timer end                        buildJoinLoops                      buildJoinLoops:  266 elapsed 489 ms
    2020-12-18T16:26:01.998070 I 4773 7 NativeCodegen.cpp:283 module does not require linking against libdevice
    2020-12-18T16:26:01.998105 I 4773 7 NativeCodegen.cpp:283 module does not require linking against libdevice
    2020-12-18T16:26:02.083164 I 4773 7 measure.h:81 Timer end                query_step_compilation                 executeWorkUnitImpl: 1425 elapsed 576 ms
    2020-12-18T16:26:02.083338 I 4773 14 measure.h:74 Timer start                          kernel_run                                 run:   92
    2020-12-18T16:26:02.083397 I 4773 14 measure.h:74 Timer start                         fetchChunks                         fetchChunks: 2342
    2020-12-18T16:26:02.083890 I 4773 15 measure.h:74 Timer start                          kernel_run                                 run:   92
    2020-12-18T16:26:02.083909 I 4773 15 measure.h:74 Timer start                         fetchChunks                         fetchChunks: 2342
    2020-12-18T16:26:02.634163 I 4773 14 measure.h:81 Timer end                           fetchChunks                         fetchChunks: 2342 elapsed 550 ms
    2020-12-18T16:26:02.634241 I 4773 14 measure.h:74 Timer start           executePlanWithoutGroupBy           executePlanWithoutGroupBy: 2693
    2020-12-18T16:26:02.634276 I 4773 14 measure.h:74 Timer start                        lauchGpuCode                       launchGpuCode:  200
    2020-12-18T16:26:02.727827 I 4773 15 measure.h:81 Timer end                           fetchChunks                         fetchChunks: 2342 elapsed 643 ms
    2020-12-18T16:26:02.727893 I 4773 15 measure.h:74 Timer start           executePlanWithoutGroupBy           executePlanWithoutGroupBy: 2693
    2020-12-18T16:26:02.727910 I 4773 15 measure.h:74 Timer start                        lauchGpuCode                       launchGpuCode:  200
    2020-12-18T16:26:02.771668 I 4773 15 measure.h:81 Timer end                          lauchGpuCode                       launchGpuCode:  200 elapsed 43 ms
    2020-12-18T16:26:02.771713 I 4773 15 measure.h:81 Timer end             executePlanWithoutGroupBy           executePlanWithoutGroupBy: 2693 elapsed 43 ms
    2020-12-18T16:26:02.771734 I 4773 15 measure.h:81 Timer end                            kernel_run                                 run:   92 elapsed 687 ms
    2020-12-18T16:26:02.775700 I 4773 14 measure.h:81 Timer end                          lauchGpuCode                       launchGpuCode:  200 elapsed 141 ms
    2020-12-18T16:26:02.775716 I 4773 14 measure.h:81 Timer end             executePlanWithoutGroupBy           executePlanWithoutGroupBy: 2693 elapsed 141 ms
    2020-12-18T16:26:02.775736 I 4773 14 measure.h:81 Timer end                            kernel_run                                 run:   92 elapsed 692 ms
    2020-12-18T16:26:02.779040 I 4773 7 measure.h:81 Timer end                  Exec_executeWorkUnit                 executeWorkUnitImpl: 1405 elapsed 1272 ms
    2020-12-18T16:26:02.779554 I 4773 7 measure.h:74 Timer start                Exec_executeWorkUnit                 executeWorkUnitImpl: 1405
    2020-12-18T16:26:02.779595 I 4773 7 measure.h:74 Timer start              query_step_compilation                 executeWorkUnitImpl: 1425
    2020-12-18T16:26:02.781268 I 4773 7 measure.h:74 Timer start                      buildJoinLoops                      buildJoinLoops:  266
    2020-12-18T16:26:02.790142 I 4773 7 measure.h:81 Timer end                        buildJoinLoops                      buildJoinLoops:  266 elapsed 8 ms
    2020-12-18T16:26:02.791306 I 4773 7 NativeCodegen.cpp:283 module does not require linking against libdevice
    2020-12-18T16:26:02.791374 I 4773 7 NativeCodegen.cpp:283 module does not require linking against libdevice
    2020-12-18T16:26:02.879975 I 4773 7 measure.h:81 Timer end                query_step_compilation                 executeWorkUnitImpl: 1425 elapsed 100 ms
    2020-12-18T16:26:02.880165 I 4773 23 measure.h:74 Timer start                          kernel_run                                 run:   92
    2020-12-18T16:26:02.880199 I 4773 23 measure.h:74 Timer start                         fetchChunks                         fetchChunks: 2342
    2020-12-18T16:26:02.880148 I 4773 22 measure.h:74 Timer start                          kernel_run                                 run:   92
    2020-12-18T16:26:02.880376 I 4773 22 measure.h:74 Timer start                         fetchChunks                         fetchChunks: 2342
    2020-12-18T16:26:02.956186 I 4773 23 measure.h:81 Timer end                           fetchChunks                         fetchChunks: 2342 elapsed 76 ms
    2020-12-18T16:26:02.956326 I 4773 23 measure.h:74 Timer start              executePlanWithGroupBy              executePlanWithGroupBy: 2896
    2020-12-18T16:26:02.956351 I 4773 23 measure.h:74 Timer start                        lauchGpuCode                       launchGpuCode:  200
    2020-12-18T16:26:03.024190 I 4773 22 measure.h:81 Timer end                           fetchChunks                         fetchChunks: 2342 elapsed 143 ms
    2020-12-18T16:26:03.024304 I 4773 22 measure.h:74 Timer start              executePlanWithGroupBy              executePlanWithGroupBy: 2896
    2020-12-18T16:26:03.024330 I 4773 22 measure.h:74 Timer start                        lauchGpuCode                       launchGpuCode:  200
    2020-12-18T16:26:03.115051 I 4773 23 measure.h:81 Timer end                          lauchGpuCode                       launchGpuCode:  200 elapsed 158 ms
    2020-12-18T16:26:03.116177 I 4773 22 measure.h:81 Timer end                          lauchGpuCode                       launchGpuCode:  200 elapsed 91 ms
    2020-12-18T16:26:03.119799 I 4773 23 measure.h:81 Timer end                executePlanWithGroupBy              executePlanWithGroupBy: 2896 elapsed 163 ms
    2020-12-18T16:26:03.119840 I 4773 23 measure.h:81 Timer end                            kernel_run                                 run:   92 elapsed 239 ms
    2020-12-18T16:26:03.119846 I 4773 22 measure.h:81 Timer end                executePlanWithGroupBy              executePlanWithGroupBy: 2896 elapsed 95 ms
    2020-12-18T16:26:03.119875 I 4773 22 measure.h:81 Timer end                            kernel_run                                 run:   92 elapsed 239 ms
    2020-12-18T16:26:03.217164 I 4773 7 measure.h:81 Timer end                  Exec_executeWorkUnit                 executeWorkUnitImpl: 1405 elapsed 437 ms
    2020-12-18T16:26:03.217273 I 4773 7 measure.h:81 Timer end                       executeWorkUnit                     executeWorkUnit: 2769 elapsed 1711 ms
    2020-12-18T16:26:03.359704 I 4773 7 measure.h:81 Timer end                     executeRelAlgStep                   executeRelAlgStep:  677 elapsed 1854 ms
    2020-12-18T16:26:03.359741 I 4773 7 measure.h:81 Timer end                      executeRelAlgSeq                    executeRelAlgSeq:  598 elapsed 1854 ms
    2020-12-18T16:26:03.359769 I 4773 7 measure.h:81 Timer end             executeRelAlgQueryNoRetry           executeRelAlgQueryNoRetry:  253 elapsed 2614 ms
    2020-12-18T16:26:03.359778 I 4773 7 measure.h:81 Timer end                    executeRelAlgQuery                  executeRelAlgQuery:  222 elapsed 2614 ms
    

    The most relevant info is:

    2020-12-18T16:26:02.083164 I 4773 7 measure.h:81 Timer end query_step_compilation executeWorkUnitImpl: 1425 elapsed 576 ms

    The system didn't know the query, so it had to parse, optimize, and generate the LLVM code for each step of the query (you can get the LLVM code with the explain [your_statement] command).

    To build the code the system took 576 ms (if you run the same query with different parameters, the old code is reused).

    Then, to see the time taken to load the data from disk into system RAM and GPU RAM, you have to look at fetchChunks:

    2020-12-18T16:26:02.634163 I 4773 14 measure.h:81 Timer end                           fetchChunks                         fetchChunks: 2342 elapsed 550 ms
    2020-12-18T16:26:02.727827 I 4773 15 measure.h:81 Timer end                           fetchChunks                         fetchChunks: 2342 elapsed 643 ms
    

    Those threads run in parallel, so the entire operation likely lasted 643 ms. I have a disk subsystem capable of 10 GB/sec, so the reads from disk are really fast, and the bottleneck is the CPU power and the PCIe bus. You also get other data, like the time the query spent on GPU or on CPU, depending on the executor chosen for your query.

    You will also get a summary in a more readable format at the end of the log, something like this:

    2614ms total duration for executeRelAlgQuery
      2614ms start(0ms) executeRelAlgQueryNoRetry RelAlgExecutor.cpp:254
        760ms start(0ms) Query pre-execution steps RelAlgExecutor.cpp:255
        1854ms start(760ms) executeRelAlgSeq RelAlgExecutor.cpp:599
          1853ms start(760ms) executeRelAlgStep RelAlgExecutor.cpp:678
            1853ms start(760ms) executeSort RelAlgExecutor.cpp:2446
              1711ms start(760ms) executeWorkUnit RelAlgExecutor.cpp:2770
                0ms start(760ms) compileWorkUnit NativeCodegen.cpp:2409
                576ms start(760ms) compileWorkUnit NativeCodegen.cpp:2409
                  474ms start(762ms) getInstance HashJoin.cpp:480
                    474ms start(762ms) reify JoinHashTable.cpp:487
                        New thread(8)
                          214ms start(0ms) initHashTableForDevice JoinHashTable.cpp:673
                        End thread(8)
                        New thread(9)
                          374ms start(0ms) initHashTableForDevice JoinHashTable.cpp:673
                        End thread(9)
                  14ms start(1236ms) getInstance HashJoin.cpp:480
                    14ms start(1236ms) reify JoinHashTable.cpp:487
                        New thread(10)
                          2ms start(0ms) initHashTableForDevice JoinHashTable.cpp:673
                        End thread(10)
                        New thread(11)
                          0ms start(0ms) initHashTableForDevice JoinHashTable.cpp:673
                        End thread(11)
                  0ms start(1250ms) getInstance HashJoin.cpp:480
                    0ms start(1250ms) reify JoinHashTable.cpp:487
                        New thread(12)
                          0ms start(0ms) initHashTableForDevice JoinHashTable.cpp:673
                        End thread(12)
                        New thread(13)
                          0ms start(0ms) initHashTableForDevice JoinHashTable.cpp:673
                        End thread(13)
                  New thread(14)
                    0ms start(0ms) ExecutionKernel::run ExecutionKernel.cpp:91
                    550ms start(0ms) fetchChunks Execute.cpp:2341
                    0ms start(550ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:703
                    141ms start(550ms) executePlanWithoutGroupBy Execute.cpp:2694
                      141ms start(550ms) launchGpuCode QueryExecutionContext.cpp:199
                  End thread(14)
                  New thread(15)
                    0ms start(0ms) ExecutionKernel::run ExecutionKernel.cpp:91
                    643ms start(0ms) fetchChunks Execute.cpp:2341
                    0ms start(644ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:703
                    43ms start(644ms) executePlanWithoutGroupBy Execute.cpp:2694
                      43ms start(644ms) launchGpuCode QueryExecutionContext.cpp:199
                  End thread(15)
                3ms start(2030ms) collectAllDeviceResults Execute.cpp:1783
                  3ms start(2030ms) reduceMultiDeviceResults Execute.cpp:887
                100ms start(2033ms) compileWorkUnit NativeCodegen.cpp:2409
                  7ms start(2035ms) getInstance HashJoin.cpp:480
                    7ms start(2035ms) reify JoinHashTable.cpp:487
                        New thread(16)
                          6ms start(0ms) initHashTableForDevice JoinHashTable.cpp:673
                        End thread(16)
                        New thread(17)
                          6ms start(0ms) initHashTableForDevice JoinHashTable.cpp:673
                        End thread(17)
                  0ms start(2043ms) getInstance HashJoin.cpp:480
                    0ms start(2043ms) reify JoinHashTable.cpp:487
                        New thread(18)
                          0ms start(0ms) initHashTableForDevice JoinHashTable.cpp:673
                        End thread(18)
                        New thread(19)
                          0ms start(0ms) initHashTableForDevice JoinHashTable.cpp:673
                        End thread(19)
                  0ms start(2044ms) getInstance HashJoin.cpp:480
                    0ms start(2044ms) reify JoinHashTable.cpp:487
                        New thread(20)
                          0ms start(0ms) initHashTableForDevice JoinHashTable.cpp:673
                        End thread(20)
                        New thread(21)
                          0ms start(0ms) initHashTableForDevice JoinHashTable.cpp:673
                        End thread(21)
                  New thread(22)
                    0ms start(0ms) ExecutionKernel::run ExecutionKernel.cpp:91
                    143ms start(0ms) fetchChunks Execute.cpp:2341
                    0ms start(144ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:703
                    95ms start(144ms) executePlanWithGroupBy Execute.cpp:2895
                      91ms start(144ms) launchGpuCode QueryExecutionContext.cpp:199
                      3ms start(236ms) getRowSet QueryExecutionContext.cpp:136
                        3ms start(236ms) reduceMultiDeviceResults Execute.cpp:887
                          3ms start(236ms) reduceMultiDeviceResultSets Execute.cpp:933
                  End thread(22)
                  New thread(23)
                    0ms start(0ms) ExecutionKernel::run ExecutionKernel.cpp:91
                    76ms start(0ms) fetchChunks Execute.cpp:2341
                    0ms start(76ms) getQueryExecutionContext QueryMemoryDescriptor.cpp:703
                    163ms start(76ms) executePlanWithGroupBy Execute.cpp:2895
                      158ms start(76ms) launchGpuCode QueryExecutionContext.cpp:199
                      4ms start(234ms) getRowSet QueryExecutionContext.cpp:136
                        4ms start(234ms) reduceMultiDeviceResults Execute.cpp:887
                          4ms start(234ms) reduceMultiDeviceResultSets Execute.cpp:933
                  End thread(23)
                97ms start(2374ms) collectAllDeviceResults Execute.cpp:1783
                  97ms start(2374ms) reduceMultiDeviceResults Execute.cpp:887
                    97ms start(2374ms) reduceMultiDeviceResultSets Execute.cpp:933
              142ms start(2471ms) sort ResultSet.cpp:484
                142ms start(2471ms) baselineSort ResultSet.cpp:544
                  0ms start(2613ms) createComparator ResultSet.h:638
                  0ms start(2614ms) topPermutation ResultSet.cpp:846
    

    Re-running the query changing the year from 1993 to 1994, the total time is 441 ms.

    Clearing the GPU memory cache with the command \clear_gpu and re-running the query, the total time went up to 874 ms, so to transfer the data and repopulate the GPU caches the system is likely taking over 400 ms.

    So your guesses were mostly right, except that we don't load anything automatically when the OmniSci server starts (there is a parameter that can be used to warm up some caches), and the data resides on the GPU only if needed. For a complete GPU execution with the default parameters, all the data has to fit in GPU memory; if there isn't enough GPU RAM the query will be executed on CPU, and if there isn't enough CPU memory either, it's likely to fail.

    GPU memory is allocated when needed, and you can limit it with gpu-buffer-mem-bytes, which by default is 0 and will allocate all the available GPU memory in your system (if needed).
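
    For example (a hedged sketch; the 8 GB value and the storage path are just placeholders), you could cap it like this:

    omnisci_server --data /var/lib/omnisci/data --gpu-buffer-mem-bytes 8000000000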

    I think I replied to your questions, but if I forgot something, please tell me.

    Best Regards, Candido ;)

  • Eilowangfang

    Thank you. Your answer is clear to me.

    I have configured the file by adding the lines shown here: [screenshot]

    I did get the running-time info, but only "total_time_ms" and "execution_time_ms", not the detailed breakdown like the one you showed. [screenshot]

    BTW, I'm curious about the usage of "num-gpus = 1". My understanding is that it is the number of GPUs used to execute the query. I have two GPUs and I just want to use one of them, so I set "num-gpus = 1". However, when I run the TPC-H query, data is loaded on both GPU0 and GPU1, but it seems only GPU0 works. I can't quite understand what happens to the GPUs. Are GPU0 and GPU1 working together on the query, or does only GPU0 work while GPU1 just holds the data and does nothing? [screenshot]

  • Candido Dessanti

    Hi @Eilowangfang,

    You have placed the parameters num-gpus and enable-debug-timer into the [web] section, so they are ignored by the DB and passed to the omnisci_web_server, which is likely to fail. As a result the database is running on 2 GPUs, and you won't find the new entries in the logfile.
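
    As a hedged sketch of how omnisci.conf should be laid out (paths and ports are placeholders), the database parameters go in the top-level section and only web-server settings go under [web]:

    # top-level section: read by omnisci_server
    data = "/var/lib/omnisci/data"
    num-gpus = 1
    enable-debug-timer = true

    [web]
    # only omnisci_web_server options belong here
    port = 6273
    frontend = "/opt/omnisci/frontend"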

    Looking at the nvidia-smi screen you posted, it looks like nothing is really running on the GPU; while GPU0 is showing 10%, the power consumption is so low that it's unlikely to be running a heavy kernel. Probably the query you are running has fallen back to CPU execution, or there isn't any query running.

    While benchmarking, I suggest you collect data in a fine-grained way, e.g. sampling with an interval of 100 ms or 200 ms using the switch -lms, which takes its input in milliseconds, and collecting the data in CSV format.

    Something like this:

    nvidia-smi -lms 200 --query-gpu=gpu_name,index,utilization.gpu,utilization.memory,clocks.current.sm,power.draw,memory.used --format=csv

    So having multiple GPUs, you will get an idea about how much they are used.

    About your question of how many GPUs are used during a query: the correct answer is that all of them will be used. At some point in time every GPU will be used, and if everything is properly tuned, all the GPUs in the system will be busy during query execution.

    The tables in OmniSciDB consist of macro-partitions that we call fragments; at the time of writing, a fragment is assigned to a single GPU, so if a table has 2 fragments and you have 2 GPUs, each GPU has a fragment to work with and both will be busy. If a table has 3 fragments, the first GPU will work on two fragments while the second works on just one, so it's likely that the second GPU will sit idle while the first is still working.

    In the same scenario, if the first fragment has more records than the second one, the first GPU will work more than the second one.

    You can mitigate this behavior by manually balancing the number of records and the number of fragments, or by creating the tables with a number of shards equal to the number of GPUs you have in the system, as sketched below.
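
    As an illustration (a hedged, abbreviated sketch rather than the exact DDL I use; pick the shard key that matches your join columns), a sharded table for a 2-GPU box could be declared like this:

    -- hypothetical example: a small table sharded across 2 GPUs on its key column
    CREATE TABLE customer_sharded (
      C_CUSTKEY  BIGINT,
      C_NAME     TEXT ENCODING DICT(32),
      C_ACCTBAL  DECIMAL(12,2),
      SHARD KEY (C_CUSTKEY)
    ) WITH (SHARD_COUNT = 2);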

    Regards, Candido

  • Eilowangfang

    Hi,

    Thanks. It was my mistake to put the configuration lines in the wrong place. Now I get the detailed timer results. [screenshot]

    When I run TPC-H SF=10 Q10, the GPU utilization looks like the figure below: [screenshot]

    I think the utilization is perhaps reasonable: as my workload is simple (SF=10), the kernel execution only takes around 104 ms, while compilation and other things, such as result printing, take more than 160 ms.

    The nvidia-smi GPU utilization denotes "the percentage of time over the last second that any of the SMs was busy"; since the kernel execution time is shorter than the other time components, I think 13% is reasonable. The nvidia-smi memory utilization actually denotes the percentage of time the memory controller (PCIe) was busy; as the query is executed while the data already resides in GPU memory, 1% utilization also seems reasonable to me.

    Regards, WANG Fang

  • Candido Dessanti

    Hi,

    The frequencies and the power consumption look a little lower than expected, while the 104 ms kernel runtime is reasonable (around 400 ms of total time on an RTX 2080 Ti with SF=40).

    The memory utilization shouldn't be the PCIe usage but how busy the GPU memory is, so it would mean the memory scan read only about 10 GB in the last second... which looks quite weird.

    To get the PCIe usage you have to use other tools.

    Regards

  • Eilowangfang

    Hi,

    Yes, I agree that the memory time is weird. I plan to profile further with nvprof/Nsight Compute.

    Could you give some hints on how to use the profiling tools on OmniSci? nvprof is normally run against an executable, while OmniSci queries are issued through an interactive client.

    I tried packaging the command "$OMNISCI_PATH/bin/omnisql -p HyperInteractive omnisci < scripts.sql" into a shell file query.sh

    and then running nvprof query.sh, but this does not seem to work; no profile data is collected. [screenshot]

  • Candido Dessanti

    Hi @Eilowangfang,

    You have to run nvprof on the omnisci_server executable.

    So you have to stop the service and run the server from the command line:

    nvprof [your flags] omnisci_server --data /path_to_omnisci_storage [other omnisci server flags]
    

    Then run some queries and stop the server with Ctrl+C. You should get the output, or a binary file, depending on the parameters you specified.
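
    For example (a hedged sketch; the output file name is a placeholder and [other omnisci server flags] stands for whatever you normally pass):

    # write a plain-text profile report to q10_profile.txt while queries run
    nvprof --log-file q10_profile.txt \
      omnisci_server --data /path_to_omnisci_storage [other omnisci server flags]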

  • Eilowangfang

    Thank you. I tried it and it works. [screenshot]

  • Eilowangfang

    I'm updating the discussion of memory utilization here. Previously, the memory utilization reported by nvidia-smi was just 1%, which is quite weird.

    I just repeated the profiling using Nsight Compute, and the memory utilization reported is around 86%. [screenshot]

    @candido.dessanti, the memory utilization here denotes the time spent on L1/L2 and global memory access. 86% might be reasonable (on TPC-H SF=10).

  • Candido Dessanti

    @Eilowangfang,

    It looks to be the profile of init_hash_join, so it's not the query itself but a kernel called to initialize the hash-join buffers. The duration is 240 nanoseconds... well, it would be nice to be so fast ;)

    I will try to cook up something for you during the day, though I am having some trouble with the new driver on my workstation.

    It would be nice to recover the posts from the previous version of the community forum, but I think it's not possible.

    Regards

  • Eilowangfang

    Hi, yes, the running time of OmniSci looks good to me.

    As mentioned in the email, my next step is to survey the multi-query performance of OmniSci. I understand that OmniSci currently does not support concurrent multi-query execution, but my scenario needs to support more users and achieve high query throughput.

    Could you give me some hints on how to implement this? Thank you so much.

  • Candido Dessanti

    I wanted to say that you haven't profiled the query runtime, but a kernel that initializes some buffers to perform hash joins, and Q10 is doing some of them.

    To run queries in parallel, you have to create 2 instances of OmniSciDB, load the same data using the same table names, and connect to the instances in a round-robin manner.

    So, as the first step, you have to initialize the storage directories you are using for the two DBs:

    ./bin/initdb /data_storage_1
    ./bin/initdb /data_storage_2
    

    Then run two instances of the software, changing the storage and ports and limiting the system and GPU memory accessible to each instance; I allocated 16 GB of GPU memory per instance because you are using V100S GPUs:

    ./bin/omnisci_server --data  /data_storage_1 --gpu-buffer-mem-bytes 16000000000 --cpu-buffer-mem-bytes 32000000000 
    ./bin/omnisci_server --data  /data_storage_2 --gpu-buffer-mem-bytes 16000000000 --cpu-buffer-mem-bytes 32000000000 --calcite-port 16279 --http-port 16278 --port 16274
    

    Then create the tables on both instances and populate them; from then on you are ready to run queries in parallel on your GPUs, connecting the clients to the two instances in a round-robin manner, as sketched below. Depending on the queries, you will get some performance boost compared to the usual intra-query parallelism we use in the database, but I'm not sure the waste of resources justifies it.
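
    For example (a hedged sketch; the database name tpch and the query file q10.sql are placeholders, and the ports are the ones used in the commands above), two clients would simply target different ports:

    # client 1 -> first instance (default port 6274)
    omnisql -p HyperInteractive --port 6274 tpch < q10.sql
    # client 2 -> second instance started with --port 16274
    omnisql -p HyperInteractive --port 16274 tpch < q10.sql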

    Maybe it's better to add GPUs and rely on the num-executors parameter discussed in the thread https://community.heavy.ai/t/concurrent-queries/2399, turning on the dynamic watchdog to stop queries that would stall the system.
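
    Something along these lines (a hedged sketch; the executor count and time limit are placeholders, and the exact flag spellings should be checked against your server version and the linked thread):

    omnisci_server --data /var/lib/omnisci/data \
      --num-executors 4 \
      --enable-dynamic-watchdog true \
      --dynamic-watchdog-time-limit 10000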

    Regards,

    Candido

  • Eilowangfang

    "I wanted to say that you haven’t profiled the query runtime, but a kernel that initializes some buffers to perform hash joins, and Q10 is doing some of them."

    Yes, sorry that the previous figure made the profiling result confusing. I did get a detailed profile of Q10, which I think includes all the kernels executed for Q10. The figure below shows part of the kernels: [screenshot]

    Thanks for your help with the concurrent-query setup. I will try it soon and report the performance.

    You mentioned that "I’m not sure the waste of resources can justify the performance boost." What kind of resources do you think would probably be wasted (global memory, SMs, or others)?

    My motivation for running multiple queries is: 1. my dataset is small enough to fit in the GPU; 2. running a single query cannot fully utilize the GPU resources; 3. running multiple queries can improve resource utilization and also give high query throughput. I got this motivation only because the community says so; however, I'm not clear on which specific GPU resources I should try to use better. The resources I can recognize are:
    * GPU global memory
    * SM utilization
    * registers
    * shared memory
    I still know very little about GPUs. Is there any other kind of resource that could be better used by concurrent multi-query execution?

  • Candido Dessanti

    Hi,

    Yep, the multifrag_query kernel is the one that runs for most of the query runtime.

    Well, if you are doing what I did, you are cutting your memory resources in half and doubling the space allocated on disk. When I briefly tried it, I got some performance improvement versus classical serial execution, but not as big as I expected.

    Maybe using a Tesla-class card works better, because I tried with RTX gaming-class cards. I am not a great GPU programmer, but maybe you could try running queries while changing cuda-block-size and cuda-grid-size; in the past the default values were 1024 and 2 times the number of SMs on the GPU (you can get that value from the log or by looking at the specs of your cards on the internet).
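
    For example (a hedged sketch; the grid value of 160 assumes a GPU with 80 SMs, i.e. 2x the SM count, and the path is a placeholder):

    omnisci_server --data /var/lib/omnisci/data \
      --cuda-block-size 1024 \
      --cuda-grid-size 160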

  • Eilowangfang

    Hi. Merry Christmas.

    I just got a chance to access an Nvidia DGX-2 server, which has 10 V100-SXM3-32GB GPUs, so I revisited moving data between CPU memory and GPU memory. However, I found that the data transfer between CPU memory and GPU memory is still slow.

    After reading some documents, I think the DGX-2 only provides high bandwidth via NVLink 2.0 for GPU-to-GPU data transfers; host-to-GPU and GPU-to-host transfers still go through PCIe 3.0. The reason is that the CPU in the DGX-2 is an Intel(R) Xeon(R) Platinum 8168, which does not support NVLink between the host and the GPUs. Do you agree with this?

    It seems I have no chance to use NVLink 2.0 for host transfers unless I can access IBM Power servers.

  • Candido Dessanti

    Hi,

    I hope you had a great time over the holidays.

    Yes, you are right, and I bet that's the main reason why Nvidia switched to AMD's EPYC CPUs with PCIe 4.0 for their new systems. In fact, the x86 CPUs are connected to the GPUs through a PCIe switch, which is going to limit the performance of queries that need to transfer data over the bus.

    We bought a DGX A100 to experiment a little with the new architecture, and we are seeing somewhat low system-to-GPU transfer speeds, as anticipated; I think our choice to rely on caching data in GPU RAM is the right one right now to get maximum performance.

    I tried other databases that use the GPU as an accelerator, and they are not as efficient as we are at caching.

    I never tried OmniSciDB on IBM POWER, but you should get a decent speedup on transfers, though I can't say by how much.

    I saw a benchmark somewhere, but the speedup wasn't night and day; I will try to find it and post it here.

    It's a white paper; you can access it here.

    There are also some queries that use the TPC-H schema.

    The speedup of NVLink over PCIe 3.0 is 78% on average, so over PCIe 4.0 it would be lower (I'm saying that because the speed of CPU RAM is going to limit the performance of NVLink 2.0 more than that of PCIe 4.0).

    Have you tried running concurrent queries by duplicating the database?

    Regards, Candido

  • Eilowangfang

    Hi, thanks. The white paper shows significant improvements for the example queries. I can see the TPC-H schema queries show varying improvements (e.g., from 12% to 198%).

    Yes, I tried concurrent queries. My observations so far: 1) the performance degradation of co-executing queries varies, depending on the concurrent workload (i.e., which queries run together); 2) for some query mixes, concurrent execution can indeed improve throughput while the performance of each individual query is not degraded severely; 3) some queries easily cause performance interference to the other concurrent queries; examples of such queries are Q17, Q19, and Q9.

    I listed the experiments with Q5 here. When running Q5 alone (TPC-H SF=10), the execution time is 68.4 ms. The table below shows Q5 co-executed with three other queries (all times in ms):

    | 4-query concurrent mix | 1st query | 2nd query | 3rd query | Q5 |
    |---|---|---|---|---|
    | Q1 x Q2 x Q3 x Q5 | Q1: 95.83 | Q2: 7247.83 | Q3: 235 | 119.23 |
    | Q2 x Q3 x Q4 x Q5 | Q2: 7298.5 | Q3: 220.16 | Q4: 5013.16 | 82.16 |
    | Q3 x Q4 x Q6 x Q5 | Q3: 225.83 | Q4: 2501.16 | Q6: 38.5 | 89.33 |
    | Q4 x Q6 x Q7 x Q5 | Q4: 2489.83 | Q6: 32.33 | Q7: 74 | 92.83 |
    | Q6 x Q7 x Q8 x Q5 | Q6: 26.16 | Q7: 60.17 | Q8: 136.66 | 122.33 |
    | Q7 x Q8 x Q9 x Q5 | Q7: 87.16 | Q8: 227.33 | Q9: 228.66 | 205 |
    | Q8 x Q9 x Q10 x Q5 | Q8: 239.16 | Q9: 238.5 | Q10: 124.83 | 218.66 |
    | Q9 x Q10 x Q11 x Q5 | Q9: 192.5 | Q10: 134.33 | Q11: 783.33 | 145 |
    | Q10 x Q11 x Q12 x Q5 | Q10: 122.33 | Q11: 719.67 | Q12: 48.17 | 98.33 |
    | Q11 x Q12 x Q13 x Q5 | Q11: 736 | Q12: 60.16 | Q13: 199.66 | 97.16 |
    | Q12 x Q13 x Q14 x Q5 | Q12: 55 | Q13: 175.16 | Q14: 507.16 | 92.33 |
    | Q13 x Q14 x Q15 x Q5 | Q13: 182 | Q14: 495.83 | Q15: 243.66 | 84.66 |
    | Q14 x Q15 x Q17 x Q5 | Q14: 600.5 | Q15: 270.16 | Q17: 625.5 | 146.83 |
    | Q15 x Q17 x Q18 x Q5 | Q15: 306.5 | Q17: 700.83 | Q18: 1058.66 | 171.5 |
    | Q17 x Q18 x Q19 x Q5 | Q17: 716.83 | Q18: 1023.33 | Q19: 165.66 | 206 |
    | Q18 x Q19 x Q20 x Q5 | Q18: 1286.16 | Q19: 118.5 | Q20: 1475.83 | 141.5 |
    | Q1 x Q3 x Q6 x Q5 | Q1: 70 | Q3: 170.8 | Q6: 26.2 | 98.8 |
    | Q2 x Q4 x Q8 x Q5 | Q2: 7279.6 | Q4: 4895.4 | Q8: 141.4 | 128.2 |
    | Q3 x Q6 x Q9 x Q5 | Q3: 225.5 | Q6: 28.33 | Q9: 167.83 | 144.16 |
    | Q1 x Q10 x Q15 x Q5 | Q1: 96.8 | Q10: 119.8 | Q15: 248.16 | 113.5 |
    | Q7 x Q17 x Q20 x Q5 | Q7: 135.8 | Q17: 641 | Q20: 1490.8 | 165.4 |
    | Q9 x Q19 x Q20 x Q5 | Q9: 223 | Q19: 126.6 | Q20: 1462.4 | 202.4 |
    | Q8 x Q9 x Q17 x Q5 | Q8: 282.4 | Q9: 283 | Q17: 633.8 | 254 |
    | Q8 x Q9 x Q19 x Q5 | Q8: 247.8 | Q9: 252 | Q19: 118.4 | 220.2 |
    | Q8 x Q17 x Q19 x Q5 | Q8: 230.6 | Q17: 695 | Q19: 136 | 215.8 |
    | Q9 x Q17 x Q19 x Q5 | Q9: 282 | Q17: 699 | Q19: 197 | 241.2 |

    As we can see, Q5 co-executes well with some queries, such as Q3, Q6, Q7, and so on.

    However, Q17, Q19, and Q9 can easily cause significant performance interference to Q5 (i.e., 2x slower). Though not shown here, Q17, Q19, and Q9 also cause significant interference to Q1, Q6, and Q7. I'm trying to understand what happens there. After reading the query texts, I identified the common features of Q17, Q19, and Q9 as: 1) many join operations, and 2) more than one subquery per query.

    So I think the reason for the performance interference might be global-memory access contention: 1) hash-join operations frequently access hash tables residing in GPU memory, and 2) subqueries may create intermediate tables that are materialized in GPU memory, which also causes global-memory accesses.

    Do you agree? What is your opinion?

    Best, WANG Fang

  • Lily Liu

    I have exactly the same question (how to run TPC-H in HeavyDB). Can I find the instructions anywhere, or should I also email you? Thanks a lot!

  • Candido Dessanti

    Hi @Lily_Liu,

    thanks for joining our community.

    If you are interested I can post the DDLs and some scripts to run the TPC-H benchmark on HeavyDB.

  • Lily Liu

    Thanks for the reply! It would be very helpful if you posted it; I really appreciate your help!

  • Candido Dessanti

    Hi @Lily_Liu

    To run the TPC-H benchmark, I usually create a dedicated database on my HeavyDB instance using heavysql:

    candido@zion-legion:~$ heavysql -p HyperInteractive
    heavysql> create database tpch;
    heavysql> \db tpch
    User admin switched to database tpch
    \q
    

    Create the tables using the DDL script that best fits your config (in this example, something unoptimized using only 1 GPU):

    cat ddl_tpch_1gpu.sql | heavysql -p HyperInteractive tpch
    candido@zion-legion:~/tpch$ cat ddl_tpch_1gpu.sql | heavysql -p HyperInteractive tpch
    User admin connected to database tpch
    Table or View with the name "revenue" already exists.
    User admin disconnected from database tpch
    

    To generate the data, I suggest using the tpch-kit from Greg Rahn.

    You can download and compile it from tpch-kit

    candido@zion-legion:~/tpch$ git clone https://github.com/gregrahn/tpch-kit.git
    Cloning into 'tpch-kit'...
    remote: Enumerating objects: 8883, done.
    remote: Total 8883 (delta 0), reused 0 (delta 0), pack-reused 8883
    Receiving objects: 100% (8883/8883), 29.24 MiB | 8.23 MiB/s, done.
    Resolving deltas: 100% (1130/1130), done.
    candido@zion-legion:~/tpch/tpch-kit/dbgen$ make clean
    rm -f dbgen qgen build.o driver.o bm_utils.o rnd.o print.o load_stub.o bcd2.o speed_seed.o text.o permute.o rng64.o build.o bm_utils.o qgen.o rnd.o varsub.o text.o bcd2.o permute.o speed_seed.o rng64.o 
    candido@zion-legion:~/tpch/tpch-kit/dbgen$ make 
    gcc -g -DDBNAME=\"dss\" -DLINUX -DPOSTGRESQL -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64   -c -o build.o build.c
    gcc -g -DDBNAME=\"dss\" -DLINUX -DPOSTGRESQL -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64   -c -o driver.o driver.c
    gcc -g -DDBNAME=\"dss\" -DLINUX -DPOSTGRESQL -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64   -c -o rnd.o rnd.c
    gcc -g -DDBNAME=\"dss\" -DLINUX -DPOSTGRESQL -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64   -c -o print.o print.c
    gcc -g -DDBNAME=\"dss\" -DLINUX -DPOSTGRESQL -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64   -c -o load_stub.o load_stub.c
    gcc -g -DDBNAME=\"dss\" -DLINUX -DPOSTGRESQL -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64   -c -o bcd2.o bcd2.c
    gcc -g -DDBNAME=\"dss\" -DLINUX -DPOSTGRESQL -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64   -c -o speed_seed.o speed_seed.c
    gcc -g -DDBNAME=\"dss\" -DLINUX -DPOSTGRESQL -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64   -c -o text.o text.c
    gcc -g -DDBNAME=\"dss\" -DLINUX -DPOSTGRESQL -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64   -c -o permute.o permute.c
    gcc -g -DDBNAME=\"dss\" -DLINUX -DPOSTGRESQL -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64   -c -o rng64.o rng64.c
    gcc -g -DDBNAME=\"dss\" -DLINUX -DPOSTGRESQL -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64 -O -o dbgen build.o driver.o bm_utils.o rnd.o print.o load_stub.o bcd2.o speed_seed.o text.o permute.o rng64.o -lm
    gcc -g -DDBNAME=\"dss\" -DLINUX -DPOSTGRESQL -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64   -c -o qgen.o qgen.c
    gcc -g -DDBNAME=\"dss\" -DLINUX -DPOSTGRESQL -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64   -c -o varsub.o varsub.c
    gcc -g -DDBNAME=\"dss\" -DLINUX -DPOSTGRESQL -DTPCH -DRNG_TEST -D_FILE_OFFSET_BITS=64 -O -o qgen build.o bm_utils.o qgen.o rnd.o varsub.o text.o bcd2.o permute.o speed_seed.o rng64.o -lm
    
    

    or you can download the one I already compiled in my drive

    Then you can use my script generate_tpch_data.sh to automate the generation of the TPC-H data.

    The script syntax is:

    generate_tpch_data.sh (number of threads) (scaling factor) (table datafiles to build: c|L|S|s|O|P|r|o) (target directory) (dbgen install dir)

    For example, if you want to create the data for the customer table using 4 CPU cores (so generating four files) with an SF of just 1 GB, with the tpch-kit located in /home/candido/tpch-kit/dbgen, and with /var/lib/heavyai/storage/import as the target directory (because it's a directory the server is allowed to read from), run:

    candido@zion-legion:~/tpch$ ./generate_tpch_data.sh 4 1 c /var/lib/heavyai/storage/import /home/candido/tpch-kit/dbgen 
    starting 4 thread to build table datafile customer into directory /var/lib/heavyai/storage/import with scaling factor of 1 GB
    TPC-H Population Generator (Version 2.17.3)
    Copyright Transaction Processing Performance Council 1994 - 2010
    TPC-H Population Generator (Version 2.17.3)
    Copyright Transaction Processing Performance Council 1994 - 2010
    TPC-H Population Generator (Version 2.17.3)
    Copyright Transaction Processing Performance Council 1994 - 2010
    TPC-H Population Generator (Version 2.17.3)
    Copyright Transaction Processing Performance Council 1994 - 2010
    candido@zion-legion:~/tpch$ ls -ltr /var/lib/heavyai/storage/import
    total 4108
    drwxrwxr-x 3 candido candido    4096 mag 29 09:56 sample_datasets
    -rw-rw-r-- 1 candido candido 1048741 giu  3 15:05 customer.2.tbl.gz
    -rw-rw-r-- 1 candido candido 1049425 giu  3 15:05 customer.1.tbl.gz
    -rw-rw-r-- 1 candido candido 1048257 giu  3 15:05 customer.3.tbl.gz
    -rw-rw-r-- 1 candido candido 1047675 giu  3 15:05 customer.4.tbl.gz
    

    To create the data files for all the tables, you can run this loop:

    for t in $(echo "c L S s O P r o")
    do 
    ./generate_tpch_data.sh 4 1 $t /var/lib/heavyai/storage/import /home/candido/tpch-kit/dbgen
    done;
    

    Now you can load the tables using the copy command.

    candido@zion-legion:~/tpch$ heavysql -p HyperInteractive tpch
    User admin connected to database tpch
    heavysql> copy customer from '/var/lib/heavyai/storage/import/customer.*.tbl.gz' with (delimiter = '|', header='false');
    Result
    Loaded: 150000 recs, Rejected: 0 recs in 0.458000 secs
    heavysql> copy orders from '/var/lib/heavyai/storage/import/orders.*.tbl.gz' with (delimiter = '|', header='false');
    Result
    Loaded: 1500000 recs, Rejected: 0 recs in 1.410000 secs
    heavysql> copy lineitem from '/var/lib/heavyai/storage/import/lineitem.*.tbl.gz' with (delimiter = '|', header='false');
    Result
    Loaded: 6001215 recs, Rejected: 0 recs in 6.176000 secs
    heavysql> copy supplier from '/var/lib/heavyai/storage/import/supplier.*.tbl.gz' with (delimiter = '|', header='false');
    Result
    Loaded: 10000 recs, Rejected: 0 recs in 0.247000 secs
    heavysql> copy part from '/var/lib/heavyai/storage/import/part.*.tbl.gz' with (delimiter = '|', header='false');
    Result
    Loaded: 200000 recs, Rejected: 0 recs in 0.511000 secs
    heavysql> copy partsupp from '/var/lib/heavyai/storage/import/partsupp.*.tbl.gz' with (delimiter = '|', header='false');
    Result
    Loaded: 800000 recs, Rejected: 0 recs in 1.138000 secs
    heavysql> copy region from '/var/lib/heavyai/storage/import/region.*.tbl.gz' with (delimiter = '|', header='false');
    Result
    Loaded: 20 recs, Rejected: 0 recs in 0.162000 secs
    heavysql> copy nation from '/var/lib/heavyai/storage/import/nation.*.tbl.gz' with (delimiter = '|', header='false');
    Result
    Loaded: 100 recs, Rejected: 0 recs in 0.156000 secs
    \q
    

    After that, you can run the TPC-H queries. Except for Q21, everything should run.

    I tested with this script, which I generated with qgen and slightly modified because it used reserved words.
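
    If you want to regenerate the queries yourself, here is a hedged sketch based on the standard tpch-kit layout (adjust the path to your checkout; q10.sql is just a placeholder output name):

    cd /home/candido/tpch-kit/dbgen
    # DSS_QUERY points qgen at the query templates; this emits TPC-H Q10 with parameters filled in
    DSS_QUERY=./queries ./qgen 10 > q10.sql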

    As for the DDL, I can share two files, for

    1GPU and 2GPU

    Let me know if everything is working and if you need some help.

    Candido

  • Jiashen Cao

    Hi @candido.dessanti ,

    I am also looking into the detailed time breakdown in HeavyDB. One thing that puzzles me is that for some queries, the time reported by the debug timer is different from the time profiled at the client. Please see the following output: thread 6 only reports 460 ms, but the client gets ~1300 ms, which matches what thread 1 reports but is hugely different from thread 6. How should I interpret this?

    Another thing is that I do not see any launchGpuCode block here, as in others' outputs, but I do see the query being executed on the GPU in nvidia-smi.

    2022-06-08T21:57:21.808907 I 151753 0 6 RelAlgExecutor.cpp:476 DEBUG_TIMER thread_id(6)
    460ms total duration for executeRelAlgQuery
      460ms start(0ms) executeRelAlgQueryNoRetry RelAlgExecutor.cpp:516
        0ms start(0ms) Query pre-execution steps RelAlgExecutor.cpp:517
        460ms start(0ms) executeRelAlgSeq RelAlgExecutor.cpp:814
          460ms start(0ms) executeRelAlgStep RelAlgExecutor.cpp:962
            460ms start(0ms) executeSort RelAlgExecutor.cpp:2985
              460ms start(0ms) executeWorkUnit RelAlgExecutor.cpp:3388
                338ms start(0ms) compileWorkUnit NativeCodegen.cpp:2589
                  210ms start(1ms) getInstance HashJoin.cpp:295
                    210ms start(1ms) reify PerfectJoinHashTable.cpp:354
                      112ms start(1ms) getOneColumnFragment ColumnFetcher.cpp:78
                        New thread(7)
                          73ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
                            72ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
                        End thread(7)
                        New thread(8)
                          24ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
                            24ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
                        End thread(8)
                  0ms start(212ms) markDeadRuntimeFuncs NativeCodegen.cpp:1849
                  125ms start(213ms) optimizeAndCodegenGPU NativeCodegen.cpp:1326
                    3ms start(214ms) initializeNVPTXBackend NativeCodegen.cpp:1442
                    121ms start(217ms) generateNativeGPUCode NativeCodegen.cpp:1126
                      0ms start(217ms) check_module_requires_libdevice NativeCodegen.cpp:272
                      9ms start(217ms) optimize_ir NativeCodegen.cpp:310
                      12ms start(227ms) generatePTX NativeCodegen.cpp:1405
                      98ms start(239ms) ptx_to_cubin NvidiaKernel.cpp:128
                0ms start(339ms) collectAllDeviceResults Execute.cpp:2333
                121ms start(339ms) compileWorkUnit NativeCodegen.cpp:2589
                  24ms start(340ms) getInstance HashJoin.cpp:295
                    24ms start(340ms) reify PerfectJoinHashTable.cpp:354
                      0ms start(340ms) getOneColumnFragment ColumnFetcher.cpp:78
                        New thread(9)
                          23ms start(0ms) initHashTableForDevice PerfectJoinHashTable.cpp:721
                            23ms start(0ms) initHashTableOnGpu PerfectHashTableBuilder.h:89
                        End thread(9)
                  0ms start(365ms) markDeadRuntimeFuncs NativeCodegen.cpp:1849
                  95ms start(365ms) optimizeAndCodegenGPU NativeCodegen.cpp:1326
                    95ms start(365ms) generateNativeGPUCode NativeCodegen.cpp:1126
                      0ms start(365ms) check_module_requires_libdevice NativeCodegen.cpp:272
                      6ms start(365ms) optimize_ir NativeCodegen.cpp:310
                      10ms start(372ms) generatePTX NativeCodegen.cpp:1405
                      77ms start(382ms) ptx_to_cubin NvidiaKernel.cpp:128
                0ms start(460ms) resultsUnion Execute.cpp:1279
    2022-06-08T21:57:21.809143 I 151753 0 1 DBHandler.cpp:1260 DEBUG_TIMER thread_id(1)
    1388ms total duration for sql_execute
    