Version v5.6.0 introduced a new aggregate function, APPROX_MEDIAN. This is the underlying function used when selecting "Median" on a measure in OmniSci Immerse. At the time this article was published, APPROX_MEDIAN() is the only median function offered in OmniSciDB.
Below, we'll answer the most frequently asked questions about this function.
How accurate is APPROX_MEDIAN?
The short answer is that for nearly all practical use cases, APPROX_MEDIAN delivers a very accurate result, insofar as one is interested in calculating a central value in which nearly 50% of the values are both less than and greater than it. In other words, for most users and use cases, the "approximate" aspect should be de-emphasized.
The longer answer is that, certain characteristics of your data may lead to greater or reduced levels of accuracy. it doesn't appear to be possible to have a simple formula to quantify the accuracy of APPROX_MEDIAN due to its dependency upon the distribution of data itself. Two polar examples help to illustrate this:
100,000,000 integers 1, 2, 3, ... 100M in random order. APPROX_MEDIAN can give a very accurate answer to this to 5+ significant figures.
100,000,001 integers: 50,000,000 x 0 and 50,000,001 x 1. (In other words, fifty million 0s, and fifty million and one 1s.) Clearly, the exact median is 1, but APPROX_MEDIAN will return a value close to 0.5 due to the way it approximates the distribution.
How does APPROX_MEDIAN work?
APPROX_MEDIAN works by creating a histogram-like data structure that attempts to approximate, with limited memory, the original distribution of numbers. This is why the first example (from the previous question) works very well (random values from 1-to-100M) but the second, which is sensitive to a single element being 0 or 1, does poorly.
When might APPROX_MEDIAN produce a less-than-perfect result?
When the exact median is sensitive to a small number of changes in the data, such as the second example above, then one should expect a similar magnitude of error given by APPROX_MEDIAN. In contrast, "natural" data distribution medians are stable against changes to a small number of its data points.
Do I have any controls over the accuracy of APPROX_MEDIAN?
The are two OmniSciDB configuration parameters that control the memory usage of APPROX_MEDIAN, which can have an impact on the accuracy.
approx_quantile_centroids - The size of an internal buffer used to approximate the data distribution for which the median is taken. The larger the value, the greater the accuracy of the answer. Default: 300.
approx_quantile_buffer - The size of a temporary buffer used to copy in the data, which, when full gets sorted, before being merged into the above distribution buffer. Default: 1000.
Unless you have a specific concern with results from APPROX_MEDIAN, we encourage using the default values.
Why does HeavyDB provide APPROX_MEDIAN but not MEDIAN?
OmniSci is built to provide performant data analysis at a massive scale. To provide a performant experience, it's essential that the core functions in OmniSciDB can be relied upon whether you're analyzing a thousand or a billion++ records. In contrast, the best-known algorithm for calculating the exact MEDIAN of unsorted data requires nearly as much extra memory as the data size, which would not be performant at scale. Instead, we've provided a function that is very accurate for nearly all use cases and yet performant at scale (and does in fact provide the exact median for sufficiently small data sets.) We're confident this is a compromise you'll appreciate as an end user.
OmniSciDB's APPROX_MEDIAN function is highly accurate and can produce reliable/accurate median calculations while enabling analysis at the speed of curiosity. If you have further questions, please don't hesitate to reach out to our Enterprise Support team by clicking "Create request" above or have a chat with your designated Customer Success Engineer.
Please sign in to leave a comment.