Error Running HeavyDB with Nvidia Nsight Compute: Broken Pipe in Thrift Connection
I am encountering an issue when attempting to run Star Schema Benchmark (SSB) queries on HeavyDB, with profiling using Nvidia's [Nsight Compute (ncu)](https://docs.nvidia.com/nsight-compute/NsightComputeCli/index.html). The queries run without issues when ncu is not involved, but running the HeavyDB server with ncu leads to a broken pipe error in the Thrift connection.
**Environment:**
HeavyDB version: latest
CUDA version: 12.1
GPU driver version: 530.30.02
Operating System: Ubuntu 20.04
**Steps to Reproduce**
Start HeavyDB server with Nvidia Nsight Compute:
`ncu ./heavydb/build/bin/heavydb`
Note: I also tried the above command with sudo, to give ncu access to the hardware performance counters.
In a separate terminal, open the HeavyDB client:
`./heavydb/build/bin/heavysql -p HyperInteractive`
Example query (tables are pre-populated):
`select sum(lo_extendedprice * lo_discount) as revenue from lineorder, ddate where lo_orderdate = d_datekey and d_year = 1993 and lo_discount between 1 and 3 and lo_quantity < 25;`
**Expected Behavior**
The query executes smoothly with the HeavyDB server running under Nvidia Nsight Compute profiling.
**Actual Behavior**
When the HeavyDB server is launched with Nvidia Nsight Compute, the following error is encountered:
```
Thrift error: write() send(): Broken pipe
Thrift connection error: write() send(): Broken pipe
Retrying connection
Thrift: [date and time] TSocket::write_partial() send() <Host: localhost Port: 6274>: Broken pipe
```
**Additional Information**
The error seems to be related to the Thrift transport layer, specifically when the server is profiled with Nvidia Nsight Compute.
This issue does not occur when the HeavyDB server runs without ncu profiling.
**Request**
Any insights or solutions to resolve this broken pipe error when profiling HeavyDB with Nvidia Nsight Compute would be greatly appreciated.
-
We previously discussed this on GitHub. After further tests profiling an SSB SF100 query on a single card used for both display and compute, we observed that this shared GPU usage causes instability.
Our suggestions are as follows:
1. Use driver version 535 at a minimum. In the 7.1 release we changed the way memory is allocated on the GPU, so a driver older than 535 can cause performance issues.
2. Check whether the system has enough memory to accommodate the GPU memory dump during profiling. If memory is short, close memory-intensive programs such as VSCode and retry, or reduce the scale factor (SF) of the benchmark.
3. Switch to text mode and run the profiling from there. On Ubuntu, switch to a text virtual terminal by pressing Ctrl + Alt + F3 (other virtual terminals are available by replacing F3 with F1, F2, and so on).
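The first two suggestions can be checked from a terminal before launching ncu. The script below is only a rough sketch: `driver_at_least` is a hypothetical helper written for this post (not part of HeavyDB or the NVIDIA tools), and it assumes `nvidia-smi` and `free` are available on the PATH. The 535 threshold comes from suggestion 1.

```shell
#!/usr/bin/env bash

# Hypothetical helper: succeeds when the major part of a driver version
# string (e.g. "530.30.02") is at least the given minimum major version.
driver_at_least() {
    local version="$1" minimum="$2"
    local major="${version%%.*}"   # "530.30.02" -> "530"
    [ "$major" -ge "$minimum" ]
}

if command -v nvidia-smi >/dev/null 2>&1; then
    # Driver version of the first GPU reported by nvidia-smi.
    installed="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
    if driver_at_least "$installed" 535; then
        echo "driver $installed: meets the 535 minimum"
    else
        echo "driver $installed: older than 535, consider upgrading"
    fi
    # GPU memory already in use (for example by the desktop session).
    nvidia-smi --query-gpu=memory.total,memory.used --format=csv
fi

# Host memory available for the profiler.
if command -v free >/dev/null 2>&1; then
    free -h
fi
```

With the 530.30.02 driver from the environment listed in the question, the helper would report the driver as older than 535.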