Index on table


  • Candido Dessanti

    Hi Mohammad Sohail,

    Right now, it's impossible to create indexes on HeavyDB tables. We have one optimization that prunes some data by relying on statistics collected for each field at the fragment level, which can act like an index. Still, the actual filtering is done by a CPU or GPU scan after the data has been cached in the CPU or GPU cache.

    For example, let's take the flights_2008_7m table that we ship as a free example. The data there is loaded more or less ordered by the dep_timestamp and arr_timestamp fields. We cannot know precisely what is stored in the statistics, but we can get an idea with this query, which creates buckets based on the rowid of the table using the fragment size (in this case 2,000,000, while the default is 32,000,000):

    select cast(rowid/2000000 as int),min(dep_timestamp),
    max(dep_timestamp),min(depdelay),max(depdelay)
    from flights_2008_7M group by 1;

    0|2008-01-01 00:01:00|2008-04-30 23:55:00|-92|2457
    1|2008-04-01 00:01:00|2008-08-01 00:00:00|-71|2467
    2|2008-07-01 00:02:00|2008-11-30 23:45:00|-534|1552
    3|2008-10-01 05:36:00|2008-12-31 23:59:00|-49|1597

    So we know that the first fragment contains timestamps ranging from 2008-01-01 to 2008-04-30; the second, from 2008-04-01 to 2008-08-01; and so on.
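    The pruning works like a zone map: a fragment can be skipped whenever its [min, max] range for the filtered column cannot intersect the query predicate. A minimal sketch in Python of the idea (the fragment ranges are the ones from the bucketed query output above; HeavyDB does this internally, this is only an illustration):

    ```python
    from datetime import datetime

    # Per-fragment [min, max] statistics for dep_timestamp,
    # taken from the bucketed query output above.
    FRAGMENT_STATS = [
        (datetime(2008, 1, 1, 0, 1), datetime(2008, 4, 30, 23, 55)),    # fragment 0
        (datetime(2008, 4, 1, 0, 1), datetime(2008, 8, 1, 0, 0)),       # fragment 1
        (datetime(2008, 7, 1, 0, 2), datetime(2008, 11, 30, 23, 45)),   # fragment 2
        (datetime(2008, 10, 1, 5, 36), datetime(2008, 12, 31, 23, 59)), # fragment 3
    ]

    def fragments_to_scan(lo, hi, stats=FRAGMENT_STATS):
        """Return the fragment ids whose [min, max] range overlaps [lo, hi];
        every other fragment can be skipped without reading its data."""
        return [i for i, (fmin, fmax) in enumerate(stats)
                if fmin <= hi and fmax >= lo]

    # A filter between 2008-02-10 and 2008-02-18 touches only fragment 0:
    print(fragments_to_scan(datetime(2008, 2, 10), datetime(2008, 2, 18)))  # [0]
    ```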

    Filtering the table on the timestamp, asking, for example, for the average departure and arrival delay for the week from 2008-02-10 to 2008-02-17, will load and process just the data contained in the first fragment, saving disk, memory, and processing time, and giving a faster query:

    select plane_model,avg(depdelay),avg(arrdelay) 
    from flights_2008_7m
    where dep_timestamp between '2008-02-10 00:00:00' and '2008-02-18 00:00:00'
    group by 1 order by 2 limit 10;

    We can check this using the \memory_summary command.

    HeavyDB Server CPU Memory Summary:
                MAX            USE      ALLOCATED           FREE
        98304.00 MB       30.52 MB     4096.00 MB     4065.48 MB

    HeavyDB Server GPU Memory Summary:
    [GPU]            MAX            USE      ALLOCATED           FREE
    [0]    32768.00 MB       30.52 MB     4096.00 MB     4065.48 MB

    Running a query filtering on the carrier, which has many overlapping values across fragments, fragment skipping is unlikely to occur; this can easily be checked by running the query and looking at the amount of memory used:

    select plane_model,avg(depdelay),avg(arrdelay) 
    from flights_2008_7m
    where uniquecarrier ='NW'  
    group by 1 order by 2 limit 10;

    HeavyDB Server CPU Memory Summary:
                MAX            USE      ALLOCATED           FREE
        98304.00 MB       80.22 MB     4096.00 MB     4015.78 MB
    HeavyDB Server GPU Memory Summary:
    [GPU]            MAX            USE      ALLOCATED           FREE
    [0]    32768.00 MB       80.22 MB     4096.00 MB     4015.78 MB

    In this case, all the data of the columns involved in the query has been loaded into memory.
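    The same check shows why skipping fails here: when every fragment's min/max range spans (almost) the whole domain of a column, every fragment overlaps every predicate. A sketch reusing the idea above, with hypothetical carrier ranges (illustrative values, not measured from the table):

    ```python
    # Hypothetical per-fragment [min, max] statistics for uniquecarrier.
    # Because carriers appear throughout the table, every fragment spans
    # nearly the whole alphabetical range of carrier codes.
    CARRIER_STATS = [("AA", "YV"), ("AA", "YV"), ("AA", "YV"), ("AA", "YV")]

    def fragments_to_scan(lo, hi, stats):
        """Fragments whose [min, max] range overlaps the predicate [lo, hi]."""
        return [i for i, (fmin, fmax) in enumerate(stats)
                if fmin <= hi and fmax >= lo]

    # The uniquecarrier = 'NW' predicate overlaps every fragment,
    # so nothing can be skipped and all four fragments are loaded:
    print(fragments_to_scan("NW", "NW", CARRIER_STATS))  # [0, 1, 2, 3]
    ```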

    In the end, you cannot create indexes, but you can organize the data with sorting and adjust the fragment size (avoiding being too aggressive, because performance could suffer).

    Anyway, could you describe your use case in more detail? How much data are you inserting, and what kind of queries are you running against the database?

    Candido

     

  • Mohammad Sohail

    Thanks Candido for your detailed explanation. 

