Unable to instantiate CudaMgr

Comments

7 comments

  • Candido Dessanti

    Hi,

    This happens because the system cannot correctly detect the GPUs.

    Could you post the output of the nvidia-smi command? What system are you on (OS, hardware)? Which version of OmniSciDB have you installed? After the CUDA unknown error, are you getting something like "no GPUs detected"?

    I am sorry to ask so many questions, but the 999 error is quite generic.

    Regards, Candido
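
For anyone gathering the details requested above, the following is a minimal Python sketch that collects the OS description and a short nvidia-smi summary in one place. The function name `collect_gpu_diagnostics` is hypothetical, and the sketch assumes nvidia-smi is on the PATH; if it is not, the report says so instead of failing.

```python
import platform
import shutil
import subprocess


def collect_gpu_diagnostics():
    """Gather OS details and a short nvidia-smi summary for a support post."""
    report = {
        "os": platform.platform(),
        "machine": platform.machine(),
    }
    smi = shutil.which("nvidia-smi")
    if smi is None:
        # Assumption: nvidia-smi ships with the driver; absence is itself a clue.
        report["nvidia_smi"] = "nvidia-smi not found on PATH"
        return report
    try:
        result = subprocess.run(
            [smi, "--query-gpu=index,name,driver_version", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            timeout=30,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        report["nvidia_smi"] = f"nvidia-smi could not be run: {exc}"
        return report
    # Keep stderr when stdout is empty, since driver failures are reported there.
    report["nvidia_smi"] = (result.stdout or result.stderr).strip()
    report["nvidia_smi_exit_code"] = result.returncode
    return report


if __name__ == "__main__":
    for key, value in collect_gpu_diagnostics().items():
        print(f"{key}: {value}")
```

Pasting the printed lines together with the OmniSciDB version should answer most of the questions above in a single reply.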

  • Missasma

    [quote="candido.dessanti, post:2, topic:2731"] nvidia-smi [/quote]

    Hi Candido,

    I am on CentOS, and this is the output of nvidia-smi: [screenshot: Screen Shot 2021-06-18 at 1.33.20 PM]

    No, I did not get a "no GPUs detected" error.

    Thanks for your help, I really appreciate it.

    My best regards, Sama

  • Candido Dessanti

    Hi,

    I did some tests, also using a driver similar to yours (455.23.04; I can't find the 05 anywhere), and I can't reproduce your issue.

    It looks like something is preventing you from using the GPUs. We recently had trouble with NVIDIA Fabric Manager on DGX and HGX systems, but I don't think your system has an NVLink switch; maybe I'm wrong. What kind of hardware are you using? Is it an on-premises physical machine or an AWS instance? (On an AWS instance I could try to reproduce it.)

    Also, the 999 error could mean that the NVIDIA driver is in a bad state and a reboot (or a driver reset) is needed. Can you reboot the machine and try again?
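
One way to check whether the driver itself is unhealthy, independently of OmniSciDB, is to initialise CUDA directly. The sketch below is a hedged illustration: it assumes the reported 999 corresponds to `CUDA_ERROR_UNKNOWN` in the CUDA driver API, that the driver library is loadable as `libcuda.so.1`, and it lists only a handful of result codes. The helper names `describe_cu_result` and `probe_cuda_driver` are hypothetical.

```python
import ctypes

# A few CUresult values from the CUDA driver API; this is not an exhaustive list.
CU_RESULT_NAMES = {
    0: "CUDA_SUCCESS",
    100: "CUDA_ERROR_NO_DEVICE",
    802: "CUDA_ERROR_SYSTEM_NOT_READY",
    999: "CUDA_ERROR_UNKNOWN",
}


def describe_cu_result(code):
    """Map a numeric CUresult to its name, if it is one of the codes listed above."""
    return CU_RESULT_NAMES.get(code, f"unlisted CUresult {code}")


def probe_cuda_driver(library_name="libcuda.so.1"):
    """Call cuInit(0) and cuDeviceGetCount through ctypes.

    Returns (code, name, device_count), or None when the driver library
    cannot be loaded at all (for example, no NVIDIA driver installed).
    """
    try:
        libcuda = ctypes.CDLL(library_name)
    except OSError:
        return None

    libcuda.cuInit.argtypes = [ctypes.c_uint]
    libcuda.cuInit.restype = ctypes.c_int
    code = libcuda.cuInit(0)
    if code != 0:
        # Initialisation failed; a 999 here points at the driver, not the database.
        return code, describe_cu_result(code), 0

    count = ctypes.c_int(0)
    libcuda.cuDeviceGetCount.argtypes = [ctypes.POINTER(ctypes.c_int)]
    libcuda.cuDeviceGetCount.restype = ctypes.c_int
    code = libcuda.cuDeviceGetCount(ctypes.byref(count))
    return code, describe_cu_result(code), count.value


if __name__ == "__main__":
    print(probe_cuda_driver())
```

If this probe also returns 999 outside the database, the problem most likely lies with the driver or a related NVIDIA service rather than with the server, which is consistent with the reboot suggestion above.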

  • mjj203

    I had the same issue on a Dell Cauldron with 8x T4s and on an HPE Apollo 6500 with 8x A100s (SXM2), running Ubuntu 20.04. What worked for me: I uninstalled all nvidia-* and cuda-* packages that had been installed with apt, then rebooted. Next I installed the latest 460.73.01 driver from the run file, and then cuda-toolkit-11.2 from its run file so that it did not install its own drivers. After that I installed nvidia-fabricmanager and started the service daemon, started the nv-hostengine and persistenced daemons, and made sure the post-installation actions were completed: https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#post-installation-actions. Finally I rebooted and started omnisci_server.
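
Before starting omnisci_server after a procedure like the one above, it can help to confirm that the supporting daemons are actually running. The following is a rough sketch only: the systemd unit names in `ASSUMED_UNITS` are assumptions (they vary with driver, DCGM version, and distribution), and the helpers `unit_state` and `check_units` are hypothetical names, not part of any NVIDIA or OmniSci tooling.

```python
import subprocess

# Assumed unit names; adjust them to whatever `systemctl list-units` shows locally.
ASSUMED_UNITS = ["nvidia-fabricmanager", "nvidia-persistenced", "nvidia-dcgm"]


def unit_state(unit, run=subprocess.run):
    """Return the state printed by `systemctl is-active`, or 'unavailable'."""
    try:
        result = run(
            ["systemctl", "is-active", unit],
            capture_output=True,
            text=True,
            timeout=15,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        # systemctl is missing or did not answer in time.
        return "unavailable"
    return result.stdout.strip() or "unknown"


def check_units(units=ASSUMED_UNITS, run=subprocess.run):
    """Map each unit name to its reported state."""
    return {unit: unit_state(unit, run) for unit in units}


if __name__ == "__main__":
    for unit, state in check_units().items():
        print(f"{unit}: {state}")
```

The `run` parameter exists only so the check can be exercised without systemd; in normal use the defaults are enough.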

  • Candido Dessanti

    Thanks @mjj203,

    We also had a similar issue with DGX and HGX systems. That is why I asked @missasma which system they are on.

  • Missasma

    I have fixed the issue. It turns out the NVIDIA MPS service was causing the problem with detecting the GPU.
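
The thread does not say how MPS was disabled, so the following is only a sketch of how one might detect a running MPS daemon before starting the server. It assumes a Linux /proc filesystem and that the MPS processes have "nvidia-cuda-mps" in their command line (as with nvidia-cuda-mps-control and nvidia-cuda-mps-server); the function name `find_mps_processes` is hypothetical.

```python
import os


def find_mps_processes(proc_root="/proc"):
    """Return sorted (pid, command line) pairs for processes that look like MPS."""
    matches = []
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return matches  # no /proc available (non-Linux system)
    for entry in entries:
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as handle:
                raw = handle.read()
        except OSError:
            continue  # the process exited or is not readable
        # Arguments in cmdline are NUL-separated; join them with spaces.
        cmdline = raw.replace(b"\0", b" ").decode("utf-8", "replace").strip()
        if "nvidia-cuda-mps" in cmdline:
            matches.append((int(entry), cmdline))
    return sorted(matches)


if __name__ == "__main__":
    found = find_mps_processes()
    if found:
        for pid, cmdline in found:
            print(f"MPS process {pid}: {cmdline}")
    else:
        print("No MPS processes found")
```

If an MPS control daemon shows up, NVIDIA's documented way to stop it is to send it the quit command (`echo quit | nvidia-cuda-mps-control`); whether that is what resolved the issue here is not stated in the thread.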

  • Candido Dessanti

    Hi @missasma,

    That's great news. I hope you will be satisfied with OmniSciDB.

    It would be great if you could share what you did to get the software working in your environment; it could help other community users who run into the same problem.

    Best Regards, Candido
