Unable to start Heavy 6.0 in Docker

Comments

7 comments

  • Candido Dessanti

    Hi,

    Errors like this one (quoted from Joshua_Mendoza's original post)

        Unsupported .version 7.4; current version is '7.3

    generally mean that you are using an outdated driver, or that CUDA and the driver itself are mismatched.

    The minimum supported driver version is 470, but it should be changed from 5.10. Could you send me the output of nvidia-smi?

    Depending on the image and the OS, this command should be the right one:

    sudo docker run --gpus=all \
    --rm nvidia/cuda:11.0-runtime-ubuntu20.04 nvidia-smi
    
  • Joshua

    Thanks for your reply. Here's the output of the command you suggested:

    [root@omnisci-prod-0-vm omnisci-storage]# sudo docker run --gpus=all \
    > --rm nvidia/cuda:11.0-runtime-ubuntu20.04 nvidia-smi
    Unable to find image 'nvidia/cuda:11.0-runtime-ubuntu20.04' locally
    11.0-runtime-ubuntu20.04: Pulling from nvidia/cuda
    d72e567cc804: Pull complete
    0f3630e5ff08: Pull complete
    b6a83d81d1f4: Pull complete
    651c4abefb41: Pull complete
    dfde59c9d941: Pull complete
    9b2bcdc98b8a: Pull complete
    3c0d268a007b: Pull complete
    598190a71a49: Pull complete
    Digest: sha256:74be12403e480fe1120f2fc16efef36fa4cb0165d3a3c96d2c09d8652b7312ef
    Status: Downloaded newer image for nvidia/cuda:11.0-runtime-ubuntu20.04
    Tue Jun 21 15:10:09 2022
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 465.19.01    Driver Version: 465.19.01    CUDA Version: 11.3     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  NVIDIA Tesla V1...  On   | 00000000:00:04.0 Off |                    0 |
    | N/A   36C    P0    22W / 300W |      0MiB / 16160MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+
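
    The two numbers that matter in this output are the driver version and the CUDA version on the banner line. As a minimal sketch, assuming the banner keeps the "Driver Version: ... CUDA Version: ..." layout shown above, they can be pulled out with sed. The function name smi_versions is hypothetical and only used for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read nvidia-smi output on stdin and print
# "<driver version> <CUDA version>" taken from the banner line.
smi_versions() {
    sed -n 's/.*Driver Version: *\([0-9.]*\).*CUDA Version: *\([0-9.]*\).*/\1 \2/p' | head -n 1
}

# Example using the banner line from the output above. On a GPU host the
# real command could be piped in instead: nvidia-smi | smi_versions
banner='| NVIDIA-SMI 465.19.01    Driver Version: 465.19.01    CUDA Version: 11.3     |'
printf '%s\n' "$banner" | smi_versions
```

    The versions reported inside the container come from the host driver, which is why pulling a newer CUDA image does not change them.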
    

    Additionally, here are the OS details of the VM instance.

    [root@omnisci-prod-0-vm omnisci-storage]# cat /etc/os-release
    NAME="CentOS Linux"
    VERSION="7 (Core)"
    ID="centos"
    ID_LIKE="rhel fedora"
    VERSION_ID="7"
    PRETTY_NAME="CentOS Linux 7 (Core)"
    ANSI_COLOR="0;31"
    CPE_NAME="cpe:/o:centos:centos:7"
    HOME_URL="https://www.centos.org/"
    BUG_REPORT_URL="https://bugs.centos.org/"
    
    CENTOS_MANTISBT_PROJECT="CentOS-7"
    CENTOS_MANTISBT_PROJECT_VERSION="7"
    REDHAT_SUPPORT_PRODUCT="centos"
    REDHAT_SUPPORT_PRODUCT_VERSION="7"
    
  • Candido Dessanti

    Hi,

    Unfortunately, the 465 drivers are no longer supported, so you need to upgrade to at least 470 (CUDA version 11.4) to run version 6.0 of the software.

    We tested 6.0 with the 495 and 510 drivers, but 470 and 510 are somewhat more popular.

    Can you upgrade your drivers, or are you using specific software that needs 465?

    Regards, Candido
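
    The minimum-driver rule above can be sketched as a small shell check. This is only an illustration: the helper name driver_meets_minimum is hypothetical, and it compares just the major component of the version against the 470 minimum quoted in this thread.

```shell
#!/bin/sh
# Minimum driver branch quoted in this thread for Heavy 6.0.
MIN_DRIVER_MAJOR=470

# Hypothetical helper: succeed if the given driver version (e.g. 465.19.01)
# is at or above the minimum branch. Only the major component is compared.
driver_meets_minimum() {
    major=${1%%.*}
    [ "$major" -ge "$MIN_DRIVER_MAJOR" ]
}

# The two driver versions that appear in this thread.
for v in 465.19.01 515.48.07; do
    if driver_meets_minimum "$v"; then
        echo "$v: ok"
    else
        echo "$v: too old, upgrade to $MIN_DRIVER_MAJOR or newer"
    fi
done
```

    On a GPU host, the version string could come from nvidia-smi --query-gpu=driver_version --format=csv,noheader rather than being typed in by hand.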

  • Joshua

    I see. No, we only use this instance to run Heavy software, so I'll proceed to upgrade the drivers now. I'm not very familiar with Nvidia or GPU resource management in the cloud, so I had assumed the drivers in use came from the container libraries, not from the ones shipped with the OS.

    By any chance, do you know which packages need to be upgraded in CentOS 7? In either case, I'll read through the GCP documentation and post the solution if I find it first.

    Thanks for your help!

  • Candido Dessanti

    Hi,

    There are some instructions for upgrading the drivers on our website, but the best source would be the Nvidia website.

    On Ubuntu, I personally use the apt command, and I think you can do the same on CentOS with yum.

    While talking with a colleague who was working on your issue in Zendesk, we noticed that you are using the omnisci-storage path, which was changed to /var/lib/heavyai in version 6.0,

    so you will probably have to change

    -v /var/lib/omnisci/omnisci-storage:/omnisci-storage \
    

    into

    -v /var/lib/omnisci/omnisci-storage:/var/lib/heavyai \
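
    For scripts that still carry the old mount, the change above amounts to rewriting only the container side of the bind mount. A minimal sketch, assuming the mount is written as host_path:container_path as in the lines above; the helper name migrate_mount is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: rewrite the container side of a bind mount from the
# pre-6.0 path (/omnisci-storage) to the 6.0 path (/var/lib/heavyai).
# The host side of the mount is left untouched.
migrate_mount() {
    printf '%s\n' "$1" | sed 's#:/omnisci-storage$#:/var/lib/heavyai#'
}

old='/var/lib/omnisci/omnisci-storage:/omnisci-storage'
new=$(migrate_mount "$old")
echo "-v $new"
```

    Because only the part after the colon changes, the existing data directory on the host stays where it is and is simply exposed to the container under the new path.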
    

    Regards, Candido.

  • Joshua

    Thank you Candido, the Heavy server is running successfully now.

    In the end, I needed to upgrade the host OS and the Nvidia drivers got updated to a very recent version. After a reboot, the new drivers were recognized.

    [root@omnisci-prod-0-vm omnisci-storage]# docker run --gpus=all --rm nvidia/cuda:11.0-runtime-ubuntu20.04 nvidia-smi
    Tue Jun 21 15:54:56 2022
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  Tesla V100-SXM2...  On   | 00000000:00:04.0 Off |                    0 |
    | N/A   35C    P0    40W / 300W |    312MiB / 16384MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    +-----------------------------------------------------------------------------+
    

    By the way, I had some problems trying to upgrade the Nvidia packages in CentOS. I found out that I needed to update the public GPG keys from Nvidia and found this post useful: https://forums.developer.nvidia.com/t/updating-the-cuda-linux-gpg-repository-key/212897/49

  • Candido Dessanti

    Hi,

    I thought I had sent you a private message with instructions on how to upgrade the drivers, but I was probably wrong.

    That said, I'm happy that everything is working now.

    Candido
