What is TTL?
TTL, or Time To Live, functionality limits the retention of records based on their creation timestamp. It is a common feature in row-wise databases such as Oracle, Microsoft SQL Server, and PostgreSQL.
Does HEAVY.AI support TTL features?
HEAVY.AI does not have a built-in "TTL" property that expires rows after a set time or caps a table at a specific number of rows.
What HEAVY.AI does have is the max_rows WITH clause property (see the CREATE TABLE documentation), which sets a cap on the number of rows in a table. When rows in excess of the cap are added, the oldest fragment is dropped; a fragment holds 32 million rows by default, unless you set a custom fragment_size (also a WITH clause property).
For example, if you configured a table with a max_rows of exactly 1,000,000,000 and then attempted to load another 500 million rows, you would end up with 988 million, because the oldest 16 fragments (512 million rows) are dropped to make room for the incoming 500 million. For this reason, it is best to set max_rows to an exact multiple of your fragment_size (for example, 1,024,000,000 rather than 1,000,000,000 with the default 32-million-row fragment size).
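For instance, a table definition along the lines of the following sketch sets both properties explicitly (the column names here are hypothetical placeholders); max_rows is chosen as exactly 32 times the default 32-million-row fragment_size, so whole fragments are dropped cleanly when the cap is exceeded.
-- sketch only: column names are placeholders, not from a real schema
create table events (
  event_ts timestamp(0),
  device_id bigint,
  value double
) with (fragment_size=32000000, max_rows=1024000000);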
How to implement TTL using max_rows
You can manage this through the table's max_rows property, which limits the number of fragments (and therefore rows) kept in the table.
To implement this, you need to know roughly how many records are loaded each day and ensure that the data is loaded in chronological order (if retention is based on a date).
Example: say you know that each load is roughly 30 million rows, and you want to retain at least the latest 7 loads.
create table my_table() with (max_rows=320000000);
Assuming the default fragment size, the table will grow to about 11 fragments; after that, the fragment containing the oldest data is dropped as new fragments are created for the incoming data, saving disk space (and processing power).
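As a quick sanity check after each load, you can count rows per day to confirm that the oldest day rolls off once the cap is reached; this sketch assumes my_table has a timestamp column named load_ts (a hypothetical name).
-- sketch: load_ts is a hypothetical timestamp column on my_table
select date_trunc(day, load_ts) as day_, count(*) as the_count
from my_table
group by 1
order by 1;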
Alternatively, you can load the data and then delete it using a subquery.
delete from my_table where my_date < (select timestampadd(day, -retention_period_in_days, max(my_date)) from my_table);
or, if the retention window is counted from today:
delete from my_table where my_date < timestampadd(day, -retention_period_in_days, now());
We implemented this approach with a client who has a daily data load of around 2 billion rows, and it performed efficiently.
Both of these approaches are effective, but the max_rows approach is recommended because it prevents any fragmentation of the data in the table. When using the delete method, you should consider periodically reorganizing the data in case fragmentation occurs.
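For that periodic reorganization, the OPTIMIZE command demonstrated later in this article is the relevant tool; a minimal sketch of the call is shown here, with the full walk-through and timings further down.
-- sketch: vacuum='true' removes rows marked as deleted and frees their pages
optimize table my_table with (vacuum='true');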
To reclaim disk space, the deleted records should be as contiguous as possible: below a certain threshold of deleted rows per fragment, the freed space is not reused, so you typically see space recovered only after a number of delete operations.
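One way to decide when such a cleanup is worthwhile is to compare used and free space for the table; this sketch simply reuses the information_schema.storage_details columns queried in the example below and, like those queries, assumes the default 2 MB data page size.
-- sketch: report used vs. free space in GB (assumes 2 MB data pages)
select cast(total_data_file_size as float)/1024/1024/1024 as used_gb,
       cast(total_free_data_page_count*2*1024*1024 as float)/1024/1024/1024 as free_gb
from information_schema.storage_details
where table_name='my_table';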
Example
Consider a table where data is loaded daily, or at various intervals throughout the day; the exact timing is not important. What matters is that we do not load data with mixed timestamps: at time t0 we load data with a timestamp of t0, at t1 with t1, and so on up to tn, as illustrated in the example table.
heavysql> select date_trunc(day,dropoff_datetime) as day_,count(*) the_count from timseries_table group by 1;
day_|the_count
2024-06-01 00:00:00|4984750
2024-06-02 00:00:00|7477125
2024-06-03 00:00:00|49847500
2024-06-04 00:00:00|19939000
2024-06-05 00:00:00|129603500
2024-06-06 00:00:00|82248375
2024-06-07 00:00:00|42370375
2024-06-08 00:00:00|7477125
2024-06-09 00:00:00|24923750
heavysql> select cast(total_data_file_size as float)/1024/1024/1024 as UsedGB,cast(total_free_data_page_count*2*1024*1024 as float)/1024/1024/1024 as FreeGB from information_schema.storage_details where table_name='timseries_table';
UsedGB|FreeGB
37.5|0.2441406
On the 9th of June, we will delete the data because we have a 7-day retention period for the table.
heavysql> delete from timseries_table where date_trunc(day,dropoff_datetime) < timestampadd(day, -7, timestamp '2024-06-09 00:00:00');
heavysql> select date_trunc(day,dropoff_datetime) as day_,count(*) the_count from timseries_table group by 1;
day_|the_count
2024-06-02 00:00:00|7477125
2024-06-03 00:00:00|49847500
2024-06-04 00:00:00|19939000
2024-06-05 00:00:00|129603500
2024-06-06 00:00:00|82248375
2024-06-07 00:00:00|42370375
2024-06-08 00:00:00|7477125
2024-06-09 00:00:00|24923750
heavysql> select cast(total_data_file_size as float)/1024/1024/1024 as UsedGB,cast(total_free_data_page_count*2*1024*1024 as float)/1024/1024/1024 as FreeGB from information_schema.storage_details where table_name='timseries_table';
UsedGB|FreeGB
40|0.046875
The space used by the table has increased by 2.5GB. Then we do the same for the 10th, 11th, 12th, and 13th of June.
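For reference, the statement for the 13th follows the same pattern as the June 9 delete, with only the reference timestamp changed; a sketch:
delete from timseries_table where date_trunc(day,dropoff_datetime) < timestampadd(day, -7, timestamp '2024-06-13 00:00:00');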
heavysql> select date_trunc(day,dropoff_datetime) as day_,count(*) the_count from timseries_table group by 1;
day_|the_count
2024-06-06 00:00:00|82248375
2024-06-07 00:00:00|42370375
2024-06-08 00:00:00|7477125
2024-06-09 00:00:00|24923750
2024-06-10 00:00:00|9969500
2024-06-11 00:00:00|19939000
2024-06-12 00:00:00|34893250
2024-06-13 00:00:00|4984750
heavysql> select cast(total_data_file_size as float)/1024/1024/1024 as UsedGB,cast(total_free_data_page_count*2*1024*1024 as float)/1024/1024/1024 as FreeGB from information_schema.storage_details where table_name='timseries_table';
UsedGB|FreeGB
45.5|6.339844
The table size has increased, but we also have over 6 GB available for the data. Doing the same for the 14th, 15th, 16th, and 17th, the table is still 45.5GB, with 22.4 GB of free space.
heavysql> select date_trunc(day,dropoff_datetime) as day_,count(*) the_count from timseries_table group by 1;
day_|the_count
2024-06-10 00:00:00|9969500
2024-06-11 00:00:00|19939000
2024-06-12 00:00:00|34893250
2024-06-13 00:00:00|4984750
2024-06-14 00:00:00|24923750
2024-06-15 00:00:00|54832250
2024-06-16 00:00:00|4984750
2024-06-17 00:00:00|9969500
heavysql> select cast(total_data_file_size as float)/1024/1024/1024 as UsedGB,cast(total_free_data_page_count*2*1024*1024 as float)/1024/1024/1024 as FreeGB from information_schema.storage_details where table_name='timseries_table';
UsedGB|FreeGB
45.5|22.44336
However, to actually shrink the table files and reclaim as much space as possible, you can run the OPTIMIZE command. It can take some time (43 seconds in this case), so running it once a week, or only when needed, is usually enough.
heavysql> optimize table timseries_table with (vacuum='true');
heavysql> select cast(total_data_file_size as float)/1024/1024/1024 as UsedGB,cast(total_free_data_page_count*2*1024*1024 as float)/1024/1024/1024 as FreeGB from information_schema.storage_details where table_name='timseries_table';
UsedGB|FreeGB
23.5|0.4824219
We successfully recovered almost all the space that was marked as free.
This approach works if the data is loaded in order, because the automatic vacuum and the OPTIMIZE command then recover contiguous chunks of free space. If the data is fragmented, the process is less efficient.