Unable to start Heavy 6.0 in Docker
Hi, we tried to upgrade from OmniSci 5.10.2 to Heavy 6.0 but got stuck at the server initialization.
We normally use a customized image based on heavyai/heavyai-ee-cuda:v6.0.0, but to narrow the problem down I reproduced it with the unmodified base image.
In GCP, we have a VM with an NVIDIA Tesla V100 GPU. On this instance we executed the following commands.
To run the container, I executed:
docker run -it --rm \
--name heavyai-test \
--gpus all \
--entrypoint /bin/bash \
-v /var/lib/omnisci/omnisci-storage:/omnisci-storage \
heavyai/heavyai-ee-cuda:v6.0.0
Then I tried to start the server, but it aborted with a CUDA error that appears to be related to a library version mismatch.
root@e309cd4ea29c:/opt/heavyai# /opt/heavyai/bin/heavydb /omnisci-storage/storage --config /omnisci-storage/heavy.conf --log-severity-clog INFO
2022-06-21T05:36:58.420016 I 38 0 0 CommandLineOptions.cpp:2009 Max import threads 32
2022-06-21T05:36:58.420680 I 38 0 0 CommandLineOptions.cpp:2012 cuda block size 0
2022-06-21T05:36:58.420699 I 38 0 0 CommandLineOptions.cpp:2013 cuda grid size 0
2022-06-21T05:36:58.420708 I 38 0 0 CommandLineOptions.cpp:2014 Min CPU buffer pool slab size 268435456
2022-06-21T05:36:58.420717 I 38 0 0 CommandLineOptions.cpp:2015 Max CPU buffer pool slab size 4294967296
2022-06-21T05:36:58.420725 I 38 0 0 CommandLineOptions.cpp:2016 Min GPU buffer pool slab size 268435456
2022-06-21T05:36:58.420733 I 38 0 0 CommandLineOptions.cpp:2017 Max GPU buffer pool slab size 4294967296
2022-06-21T05:36:58.420741 I 38 0 0 CommandLineOptions.cpp:2018 calcite JVM max memory 1024
2022-06-21T05:36:58.420750 I 38 0 0 CommandLineOptions.cpp:2019 HeavyDB Server Port 6274
2022-06-21T05:36:58.420758 I 38 0 0 CommandLineOptions.cpp:2020 HeavyDB Calcite Port 6279
2022-06-21T05:36:58.420766 I 38 0 0 CommandLineOptions.cpp:2021 Enable Calcite view optimize true
2022-06-21T05:36:58.420775 I 38 0 0 CommandLineOptions.cpp:2023 Allow Local Auth Fallback: enabled
2022-06-21T05:36:58.420787 I 38 0 0 CommandLineOptions.cpp:2025 ParallelTop min threshold: 100000
2022-06-21T05:36:58.420795 I 38 0 0 CommandLineOptions.cpp:2026 ParallelTop watchdog max: 20000000
2022-06-21T05:36:58.420803 I 38 0 0 CommandLineOptions.cpp:2028 Enable Data Recycler: enabled
2022-06-21T05:36:58.420811 I 38 0 0 CommandLineOptions.cpp:2031 Use hashtable cache: enabled
2022-06-21T05:36:58.420820 I 38 0 0 CommandLineOptions.cpp:2034 Total amount of bytes that hashtable cache keeps: 4096 MB.
2022-06-21T05:36:58.420830 I 38 0 0 CommandLineOptions.cpp:2036 Per-hashtable size limit: 2048 MB.
2022-06-21T05:36:58.420839 I 38 0 0 CommandLineOptions.cpp:2039 Use query resultset cache: enabled
2022-06-21T05:36:58.420848 I 38 0 0 CommandLineOptions.cpp:2042 Total amount of bytes that query resultset cache keeps: 4096 MB.
2022-06-21T05:36:58.420857 I 38 0 0 CommandLineOptions.cpp:2044 Per-query resultset size limit: 2048 MB.
2022-06-21T05:36:58.420866 I 38 0 0 CommandLineOptions.cpp:2047 Use auto query resultset caching: disabled
2022-06-21T05:36:58.420875 I 38 0 0 CommandLineOptions.cpp:2054 Use query step skipping: enabled
2022-06-21T05:36:58.420884 I 38 0 0 CommandLineOptions.cpp:2056 Use chunk metadata cache: enabled
2022-06-21T05:36:58.420893 I 38 0 0 CommandLineOptions.cpp:2059 Use chunk metadata cache: enabled
2022-06-21T05:36:58.420902 I 38 0 0 CommandLineOptions.cpp:2070 Runtime UDF/UDTF Registration Policy: ALLOWED for superusers only
2022-06-21T05:36:58.421933 I 38 0 0 CommandLineOptions.cpp:1503 License will expire at: 2999-12-31 23:59:59+0000 [MODIFIED]
2022-06-21T05:36:58.421983 I 38 0 0 CommandLineOptions.cpp:1514 HeavyDB started with data directory at '/omnisci-storage/storage'
2022-06-21T05:36:58.421997 I 38 0 0 CommandLineOptions.cpp:1524 Server read-only mode is false
2022-06-21T05:36:58.422008 I 38 0 0 CommandLineOptions.cpp:1528 Threading layer: TBB
2022-06-21T05:36:58.422018 I 38 0 0 CommandLineOptions.cpp:1532 Watchdog is set to true
2022-06-21T05:36:58.422027 I 38 0 0 CommandLineOptions.cpp:1533 Dynamic Watchdog is set to false
2022-06-21T05:36:58.422037 I 38 0 0 CommandLineOptions.cpp:1537 Runtime query interrupt is set to true
2022-06-21T05:36:58.422046 I 38 0 0 CommandLineOptions.cpp:1539 A frequency of checking pending query interrupt request is set to 1000 (in ms.)
2022-06-21T05:36:58.422057 I 38 0 0 CommandLineOptions.cpp:1541 A frequency of checking running query interrupt request is set to 0.1 (0.0 ~ 1.0)
2022-06-21T05:36:58.422075 I 38 0 0 CommandLineOptions.cpp:1544 Non-kernel time query interrupt is set to true
2022-06-21T05:36:58.422085 I 38 0 0 CommandLineOptions.cpp:1547 Debug Timer is set to false
2022-06-21T05:36:58.422094 I 38 0 0 CommandLineOptions.cpp:1548 LogUserId is set to false
2022-06-21T05:36:58.422104 I 38 0 0 CommandLineOptions.cpp:1549 Maximum idle session duration 60
2022-06-21T05:36:58.422114 I 38 0 0 CommandLineOptions.cpp:1550 Maximum active session duration 43200
2022-06-21T05:36:58.422129 I 38 0 0 CommandLineOptions.cpp:1551 Maximum number of sessions -1
2022-06-21T05:36:58.422139 I 38 0 0 CommandLineOptions.cpp:1553 Legacy delimited import is set to true
2022-06-21T05:36:58.422149 I 38 0 0 CommandLineOptions.cpp:1555 Legacy parquet import is set to false
2022-06-21T05:36:58.422158 I 38 0 0 CommandLineOptions.cpp:1558 FSI ODBC import is set to true
2022-06-21T05:36:58.422168 I 38 0 0 CommandLineOptions.cpp:1560 FSI regex parsed import is set to true
2022-06-21T05:36:58.422178 I 38 0 0 CommandLineOptions.cpp:1562 Allowed import paths is set to ["/omnisci-storage"]
2022-06-21T05:36:58.422187 I 38 0 0 CommandLineOptions.cpp:1563 Allowed export paths is set to ["/omnisci-storage"]
2022-06-21T05:36:58.422264 I 38 0 0 DdlUtils.cpp:823 Parsed allowed-import-paths: (/omnisci-storage/storage/import /omnisci-storage)
2022-06-21T05:36:58.422294 I 38 0 0 DdlUtils.cpp:823 Parsed allowed-export-paths: (/omnisci-storage/storage/export /omnisci-storage)
2022-06-21T05:36:58.422337 I 38 0 0 CommandLineOptions.cpp:1634 Disk cache enabled for foreign tables only
2022-06-21T05:36:58.422350 I 38 0 0 CommandLineOptions.cpp:1688 Vacuum Min Selectivity: 0.1
2022-06-21T05:36:58.422362 I 38 0 0 CommandLineOptions.cpp:1690 Enable system tables is set to true
2022-06-21T05:36:58.422371 I 38 0 0 CommandLineOptions.cpp:1699 Enable FSI is set to true
2022-06-21T05:36:58.422386 I 38 0 0 HeavyDB.cpp:430 HeavyDB starting up
2022-06-21T05:36:58.426400 I 38 0 0 DBHandler.cpp:376 OmniSci Server 6.0.0-20220418-d4d1c2a42c
2022-06-21T05:36:58.539412 I 38 0 0 CudaMgr.cpp:369 Using 1 Gpus.
2022-06-21T05:36:58.539662 I 38 0 0 CudaMgr.cpp:68 Warming up the GPU JIT Compiler... (this may take several seconds)
2022-06-21T05:36:58.644111 F 38 0 0 NvidiaKernel.cpp:95 Check failed: cuLinkAddFile_v2( link_state, CU_JIT_INPUT_FATBINARY, gpu_rt_path.c_str(), 0, nullptr, nullptr) == CUDA_SUCCESS (222 == 0) ptxas application ptx input, line 9; fatal : Unsupported .version 7.4; current version is '7.3'
2022-06-21T05:36:59.425464 I 38 0 1 HeavyDB.cpp:380 Interrupt signal (6) received.
Aborted (core dumped)
Any idea why this is happening? I would expect the libraries bundled in the base image to be compatible and tested, unless this problem is related to the underlying hardware, namely the GPU.
-
Hi,
This kind of error
[quote="Joshua_Mendoza, post:1, topic:3028"]
Unsupported .version 7.4; current version is '7.3'
[/quote]
generally means that you are using an outdated driver, or that there is a mismatch between CUDA and the driver itself.
The minimum supported driver version is 470, which has probably changed since 5.10. Could you give me the output of nvidia-smi?
Depending on the image and the OS, this should be the right command:
sudo docker run --gpus=all --rm nvidia/cuda:11.0-runtime-ubuntu20.04 nvidia-smi
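If it helps, the comparison against the minimum driver can be scripted. The following is only a minimal sketch: the threshold of 470 is the figure mentioned in this thread, and the helper names driver_major and driver_is_supported are made up for illustration. It reads the version with nvidia-smi's --query-gpu option and compares only the major number.

```shell
#!/usr/bin/env bash
# Sketch: compare the host NVIDIA driver's major version with a minimum.
# MIN_DRIVER_MAJOR=470 is the value quoted in this thread (an assumption
# for any other release); the function names below are hypothetical.
MIN_DRIVER_MAJOR=470

# Print the major component of a version string such as "465.19.01".
driver_major() {
  printf '%s\n' "${1%%.*}"
}

# Return success when the given driver version meets the minimum.
driver_is_supported() {
  [ "$(driver_major "$1")" -ge "$MIN_DRIVER_MAJOR" ]
}

# Query the first GPU only when nvidia-smi is available on this machine.
if command -v nvidia-smi >/dev/null 2>&1; then
  version="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
  if driver_is_supported "$version"; then
    echo "Driver $version meets the minimum ($MIN_DRIVER_MAJOR)"
  else
    echo "Driver $version is older than $MIN_DRIVER_MAJOR; upgrade the host driver"
  fi
fi
```

Run it on the host rather than inside the container, since the container reports whatever driver the host provides.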
-
Thanks for your reply. Here's the output of the command you suggested:
[root@omnisci-prod-0-vm omnisci-storage]# sudo docker run --gpus=all \
> --rm nvidia/cuda:11.0-runtime-ubuntu20.04 nvidia-smi
Unable to find image 'nvidia/cuda:11.0-runtime-ubuntu20.04' locally
11.0-runtime-ubuntu20.04: Pulling from nvidia/cuda
d72e567cc804: Pull complete
0f3630e5ff08: Pull complete
b6a83d81d1f4: Pull complete
651c4abefb41: Pull complete
dfde59c9d941: Pull complete
9b2bcdc98b8a: Pull complete
3c0d268a007b: Pull complete
598190a71a49: Pull complete
Digest: sha256:74be12403e480fe1120f2fc16efef36fa4cb0165d3a3c96d2c09d8652b7312ef
Status: Downloaded newer image for nvidia/cuda:11.0-runtime-ubuntu20.04
Tue Jun 21 15:10:09 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 465.19.01    CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA Tesla V1...  On   | 00000000:00:04.0 Off |                    0 |
| N/A   36C    P0    22W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Additionally, here are the OS details of the VM instance.
[root@omnisci-prod-0-vm omnisci-storage]# cat /etc/os-release
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
-
Hi,
Unfortunately the 465 drivers aren't supported anymore, so you need to upgrade to at least the 470 (CUDA version 11.4) to run version 6.0 of the software.
We tested 6.0 with the 495 and the 510, but the 470 and 510 are somewhat more popular.
Can you upgrade your drivers, or are you using specific software that needs the 465?
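To make the link between the error message and the driver versions concrete, here is a small sketch that pulls the two PTX versions out of the ptxas message and translates them to CUDA releases. It relies on the assumption, based on NVIDIA's PTX release notes, that within CUDA 11.x PTX ISA 7.N shipped with CUDA 11.N; the function names are hypothetical and the message is the one from the log in the first post.

```shell
#!/usr/bin/env bash
# Sketch: map the PTX ISA versions in the ptxas error to CUDA releases.
# Assumption: for CUDA 11.x, PTX ISA 7.N corresponds to CUDA 11.N
# (7.0 through 7.8). Anything else is reported as "unknown".
ptx_to_cuda() {
  case "$1" in
    7.[0-8]) printf '11.%s\n' "${1#7.}" ;;
    *)       printf 'unknown\n' ;;
  esac
}

# Print "<required> <supported>" as found in an "Unsupported .version" error.
ptx_versions_from_error() {
  printf '%s\n' "$1" |
    sed -n "s/.*Unsupported \.version \([0-9.]*\); current version is '\([0-9.]*\)'.*/\1 \2/p"
}

# Error text copied from the HeavyDB log earlier in this thread.
msg="fatal : Unsupported .version 7.4; current version is '7.3'"

read -r required supported <<EOF
$(ptx_versions_from_error "$msg")
EOF

echo "Binary needs PTX $required (CUDA $(ptx_to_cuda "$required"))"
echo "Driver supports PTX $supported (CUDA $(ptx_to_cuda "$supported"))"
```

Read this way, the message says the bundled GPU runtime was built for CUDA 11.4 while the installed driver only goes up to CUDA 11.3, which matches the nvidia-smi output posted above.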
Regards, Candido
-
I see. No, we only use this instance to run Heavy software, so I'll proceed to upgrade the drivers now. I'm not very familiar with Nvidia or GPU resource management in the cloud, so I had assumed the drivers in use came from the container libraries, not from the ones shipped with the OS.
By any chance, do you know which packages need to be upgraded in CentOS 7? In either case, I'll read through the GCP documentation and post the solution if I find it first.
Thanks for your help!
-
Hi,
There are some instructions for upgrading the drivers on our website, but the best source would be the Nvidia website.
On Ubuntu, I personally use the apt command, and I think you can do the same on CentOS with yum.
While discussing your issue with a colleague who was working on it in Zendesk, we noticed that you are using omnisci-storage, which has been changed to /var/lib/heavyai in version 6.0,
so you will probably have to change
-v /var/lib/omnisci/omnisci-storage:/omnisci-storage \
into
-v /var/lib/omnisci/omnisci-storage:/var/lib/heavyai \
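Put together, the adjusted command would look roughly like the sketch below. The host path, image tag and other flags are copied from the first post and only the container-side mount point is changed. The helper build_run_cmd is hypothetical and only prints the command so it can be reviewed before running it.

```shell
#!/usr/bin/env bash
# Sketch: assemble the docker run command with the 6.0 mount target.
# All values except CONTAINER_DIR come from the first post in this thread.
HOST_DIR=/var/lib/omnisci/omnisci-storage
CONTAINER_DIR=/var/lib/heavyai
IMAGE=heavyai/heavyai-ee-cuda:v6.0.0

# build_run_cmd is a hypothetical helper: it prints the command, it does not run it.
build_run_cmd() {
  printf 'docker run -it --rm --name heavyai-test --gpus all --entrypoint /bin/bash -v %s:%s %s\n' \
    "$HOST_DIR" "$CONTAINER_DIR" "$IMAGE"
}

build_run_cmd
```

Any path inside the container that still points at /omnisci-storage, such as the data directory and the --config file passed to heavydb in the first post, would presumably need to follow the new mount point as well.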
Regards, Candido.
-
Thank you Candido, the Heavy server is running successfully now.
In the end, I needed to upgrade the host OS and the Nvidia drivers got updated to a very recent version. After a reboot, the new drivers were recognized.
[root@omnisci-prod-0-vm omnisci-storage]# docker run --gpus=all --rm nvidia/cuda:11.0-runtime-ubuntu20.04 nvidia-smi
Tue Jun 21 15:54:56 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P0    40W / 300W |    312MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
By the way, I had some problems trying to upgrade the Nvidia packages in CentOS. I found out that I needed to update the public GPG keys from Nvidia and found this post useful: https://forums.developer.nvidia.com/t/updating-the-cuda-linux-gpg-repository-key/212897/49