An issue with approx_count_distinct operator in 6.X version

Comments

12 comments

  • Official comment
    Avatar
    Candido Dessanti

    Hello,

    If you're interested in assessing whether version 7.x will address your concern (though I'm uncertain), you may want to consider trying the free version by registering and requesting a key through our website:

    https://www.heavy.ai/product/downloads

    It's possible that you might be overlooking the impact of the improvements detailed in the release notes.

    Should you wish to resolve your issues, kindly furnish the requested data that we have repeatedly asked for in order to identify and rectify the problem. If we're unable to reproduce your issue, it becomes less likely that we can provide a solution.

    Regards,
    Candido

    Comment actions Permalink
  • Avatar
    Neill Lewis

    Hi Kaylani,

    Please provide the exact version you're using, by going to http(s)://yourImmerseServer/version.txt and reporting the results. TOmniSci exception is surprising to observe in currently supported versions of the application, so confirming the version is helpful.


    The error message is warning that your system is likely unable to process your query without running out of memory. The messaging stems from the use of our watchdog process which is enabled by default. You can disable this by setting `enable-watchdog = false` in your heavy.conf file and restarting services. However, if your system does not have adequate memory to process the query, you will instead encounter the following

    ERR_OUT_OF_CPU_MEM: Not enough host memory to execute the query

    However, you should definitely consider the suggestion provided by the error message to, replace within your view all instances of `COUNT(DISTINCT expression) with APPROX_COUNT_DISTINCT(expression). Count Distinct is indeed a very memory intensive operation, and replacing with APPROX_COUNT_DISTINCT will significantly improve performance and reduce memory requirements.

     

    Here's some further background on APPROX_COUNT_DISTINCT function:

    APPROX_COUNT_DISTINCT(x, e) gives an approximate count of the value x, based on an expected error rate defined in e. The error rate is an integer value from 1 to 100. The lower the value of e, the higher the precision, and the higher the memory cost. Select a value for e based on the level of precision required. On large tables with large cardinalities, consider using APPROX_COUNT_DISTINCT when possible to preserve memory. When data cardinalities permit, HEAVY.AI uses the precise implementation of COUNT(DISTINCT x) for APPROX_COUNT_DISTINCT.
    1
    Comment actions Permalink
  • Avatar
    Candido Dessanti

    HI Kayalani,

    In the 6.x the engineering did some changes on count distinct, so probably your case is a taking a new code path.

     

    could you run the query

     

    select min(m_645b86c1_location.m_id=,max(m_645b86c1_location."m_id"),approx_count_distinct(m_645b86c1_location."m_id",10) from m_645b86c1_location

    This could help us to reproduce what's going on with your data and fix the problem.

    Candido

    0
    Comment actions Permalink
  • Avatar
    Kalyani N

    Sure Candido.

    Will do that.

    Thanks,
    Kalyani

    0
    Comment actions Permalink
  • Avatar
    Candido Dessanti

    I'm sorry but I forgot to say that you should gove us the result of the query.

    Candido

    0
    Comment actions Permalink
  • Avatar
    rekha h

    Hello Team,

    At present, we are facing the same issue in version 6.4.0. To resolve this bottleneck and continue with our work smoothly, we are contemplating an upgrade to a newer version. We would greatly appreciate any potential workarounds you can suggest for this particular issue.

    Interestingly, we did not encounter any such problems in version 5.9.4. It appears that the newer version, 6.4.0, introduced this issue. We are keen to understand why this problem has surfaced in the newer version. Your insights would be valuable in helping us make an informed decision regarding the upgrade.

    Thank you for your assistance and support.

    0
    Comment actions Permalink
  • Avatar
    Candido Dessanti

    Hello @re,

    Are you encountering the same error as @Kalyani N while running a COUNT(DISTINCT ...) query?

    Could you please provide the SQL of the problematic query, as well as the minimum and maximum values of the column you are using in the COUNT(DISTINCT ...) function? Additionally, it would be helpful to know the approximate count of distinct values for that column and the number of groups that the query would return.

    To retrieve the minimum and maximum values, as well as the approximate distinct count of a column "column_name" from the table "table_name," you can use the following SQL query:

    SELECT MAX(field_distinct), 
    MIN(field_distinct),
          APPROX_COUNT_DISTINCT(field_distinct) FROM table_name;

    To get the number of groups that the query would return based on the column "column_name," use the following SQL (replace "table_name" with the actual table name and "column_name" with the name of the column used in the COUNT(DISTINCT ...) part of your problematic query):

    SELECT COUNT(*) 
    FROM (SELECT field1, field2, ..., fieldn 
    FROM table_name
    GROUP BY field1, field2, ..., fieldn) AS grouped_data;

    As I mentioned earlier, some changes have been made in version 6.x to improve performance. However, it seems that there might be some regression in certain scenarios.

    Thanks for your patience.

    Candido

    0
    Comment actions Permalink
  • Avatar
    Gopinath Jaganmohan

    Hi Candido, Facing the same issue, seems these were fixed in version 7 but there are known tags for version 7 in Git Hub.

    Would like to know the planned release date for version 7. x.x in github.

     

    Regards

    Gopinath

    0
    Comment actions Permalink
  • Avatar
    rekha h

    Hi Candido Dessanti / Neill Lewis

    Greetings Team,

    We're eagerly awaiting your reply. Kindly inform us about the anticipated release date of the 7.x.x version of HeavyDB on GitHub. This release will greatly assist us in addressing numerous issues and facilitate the seamless upgrade of our current 5.9.x version to the latest iteration.

    These are the fixes we are anticipating which will solve most of our issues.

    1.Fixes an out of CPU memory error that could occur when executing a query with a count distinct function call on a high cardinality column.
    2.Fix issue where “Cannot use fast path for COUNT DISTINCT” could be reported from a count distinct operation.
    3.Fixed an issue where COUNT DISTINCT or APPROX_COUNT_DISTINCT, when run on a CASE statement that outputs literal strings, could cause a crash. 
    4.Fixes a crash when using COUNT () or COUNT (1) with the window function, i.e., COUNT (*) OVER (PARTITION BY x). 
    5.Fixes an issue that cause incorrect results in multiple aggregation of date columns that include COUNT DISTINCT. 

    Only till version 6.4.0 we are able to see in github.

    Looking forward to your much-appreciated response.

    0
    Comment actions Permalink
  • Avatar
    Gopinath Jaganmohan

    Team, Whenever I reached out in the past, you guys have always replied back with solutions, workarounds, or issues. No reply is not giving us confident this time.

    Pls, can you reply back with whats happening and the timeline would be great?

     

    Regards

    Gopinath

    0
    Comment actions Permalink
  • Avatar
    Candido Dessanti

    I trust you're doing well.irstly, I'm yet to receive any feedback regarding the release of a 7.x OS version of the software. Consequently, I find myself in a bit of a bind as I'm unable to provide you with information on this matter. The master OS repository is nearly in sync with the 7.0 release, we are laking just some commits, and opting to build from the source would yield almost identical executables for both the EE/FREE versions. If you find yourself unable to perform the build on your own, or if you're more comfortable with me handling it, I'd be more than willing to build the executables and share them with you. Do let me know if this would be of assistance to you.

    Moving on to the second issue at hand, it appears that I've made several requests to all parties involved for the results of simple queries. Unfortunately, I haven't received any feedback in this regard. It's important to highlight that while the fixes implemented in the 7.0 version potentially address specific issues experienced by our customers, I'm uncertain if these fixes will resolve the error you've encountered.
    We can offer solutions and workarounds when we have some data to work with, but we can't do so much if we aren't getting any

    There is nothing particular going on, we are just too busy to update the OS version on a regular basis·

    Best Regards,
    Candido

    0
    Comment actions Permalink
  • Avatar
    Candido Dessanti

    Hi guys,

    The OS version of release 7.x is available for download in the tarball, yum/apt, and docker format.

    You can download the tarballs at these links

    GPU accelerated: https://releases.heavy.ai/os/tar/heavyai-os-latest-Linux-x86_64.tar.gz
    CPU: https://releases.heavy.ai/os/tar/heavyai-os-latest-Linux-x86_64-cpu.tar.gz

    or you can install/upgrade using Docker and other package managers as described on our docs.

    Please note that the software requirements have changed: revise the new system requirements
    https://docs.heavy.ai/installation-and-configuration/system-requirements/software-requirements

     

    Thank you.

    0
    Comment actions Permalink

Please sign in to leave a comment.