Index on table
We have huge data coming on OmnisciDb table every minutes, should we create index for better performance? And before that, does OmnisciDb allows Index creation on table? If yes then how? I tried normal way of creating index but it didn't allow.
-
Hi Mohammad Sohail,
Right now, it's impossible to create indexes on HeavyDB tables; we have one optimization pruning some data, and relying on statistics collected for each field at the fragment level, which could act like an index. Still, the actual filtering is done by the CPU or GPU scan after the data has been cached in the CPU or GPU cache.
For example, let's take the flights_2008_7m table that we give as a free example. The data there is loaded more or less ordered by the field dep_timestamp and arr_timestamp; we cannon know precisely what is stored in the statistics, but we can get an idea with this query that creates buckets based on the rowid of the table using the fragment size (in this case is 20000000, while the default is 32000000)select cast(rowid/2000000 as int),min(dep_timestamp),
max(dep_timestamp),min(depdelay),max(depdelay)
from flights_2008_7M group by 1;
0|2008-01-01 00:01:00|2008-04-30 23:55:00|-92|2457
1|2008-04-01 00:01:00|2008-08-01 00:00:00|-71|2467
2|2008-07-01 00:02:00|2008-11-30 23:45:00|-534|1552
3|2008-10-01 05:36:00|2008-12-31 23:59:00|-49|1597so we know that in the first fragment, we have a timestamp that ranges from 2008-01-01 and 2008-04-30; in the second, 2008-04-01 and 2008-08-10, and so on.
Filtering the table using the timestamp asking, as an example, the average departure and arrival delay for the week ranging from 2008-02-10 and 2002-02-17 will load and process just the data contained into the first fragment, saving disk, memory, and processing time, having at the same time a faster queryselect plane_model,avg(depdelay),avg(arrdelay)
from flights_2008_7m
where dep_timestamp between '2008-02-10 00:00:00' and '2008-02-18 00:00:00'
group by 1 order by 2 limit 10;We can check this using the \memory_summary command.
HeavyDB Server CPU Memory Summary:
MAX USE ALLOCATED FREE
98304.00 MB 30.52 MB 4096.00 MB 4065.48 MB
HeavyDB Server GPU Memory Summary:
[GPU] MAX USE ALLOCATED FREE
[0] 32768.00 MB 30.52 MB 4096.00 MB 4065.48 MBRunning a query filtering for the carrier that has a lot of overlapping values in the fragment, it's unlikely that fragment skipping occurs, and it can be easily checked by running a query and checking for the size of the used memory
select plane_model,avg(depdelay),avg(arrdelay)
from flights_2008_7m
where uniquecarrier ='NW'
group by 1 order by 2 limit 10;
HeavyDB Server CPU Memory Summary:
MAX USE ALLOCATED FREE
98304.00 MB 80.22 MB 4096.00 MB 4015.78 MB
HeavyDB Server GPU Memory Summary:
[GPU] MAX USE ALLOCATED FREE
[0] 32768.00 MB 80.22 MB 4096.00 MB 4015.78 MBIn this case, all the data of the columns involved in the query has been placed into memory.
In the end, you cannot create indexes, but you can organize sorting data and adjust the fragment size (trying to avoid being too much aggressive because the performances could suffer).
Anyway could you better describe your use case? How much data are you inserting and which kind of queries you are running against the db?
Candido
Please sign in to leave a comment.
Comments
2 comments