How to correctly limit a table to a given number of rows
Hi, we created a table with max_rows. We want the table to grow to 10 million rows, and then, when new data is loaded, the oldest data (some rows from the end of the table) should be dropped so that the total stays at 10 million rows. We create the table with:
CREATE TABLE IF NOT EXISTS geotweets (
  message_id BIGINT,
  tweet_date TIMESTAMP(0),
  tweet_text TEXT ENCODING NONE,
  tweet_lang TEXT ENCODING DICT(32),
  latitude DOUBLE,
  longitude DOUBLE,
  sentiment_score DOUBLE
) WITH (max_rows = 10000000);
We load data into this table every day with pymapd:
conn.load_table("geotweets",df,create='infer',method='arrow')
But we found that once the data exceeds 10 million rows, the next upload causes some existing data to be lost, instead of the oldest data being deleted and the table staying at 10 million rows.
Is there anything we did wrong here? Any suggestions for achieving this with HeavyDB itself?
-
Hi @Xiaokang_Fu,
As stated in the documentation, the max_rows parameter is intended as a maximum, not a constant. When max_rows is reached, the system drops the oldest fragment of the table; in your case (with the default fragment_size of 32M rows) the whole table sits in a single fragment, so it drops that fragment and then creates a new one for the records exceeding the 10M mark.
My suggestion is to create the table with a fragment size of 10M and a max_rows of 20M. This way the table will hold between 10 and 20 million rows, with just one or two fragments.
CREATE TABLE IF NOT EXISTS geotweets (
  message_id BIGINT,
  tweet_date TIMESTAMP(0),
  tweet_text TEXT ENCODING NONE,
  tweet_lang TEXT ENCODING DICT(32),
  latitude DOUBLE,
  longitude DOUBLE,
  sentiment_score DOUBLE
) WITH (max_rows = 20000000, fragment_size = 10000000);
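As a minimal pymapd sketch of this setup (the connection parameters and database name below are placeholders, not from this thread), you can create the table with these settings and check the row count after each daily load:

import pymapd

# Placeholder connection parameters; adjust to your server.
conn = pymapd.connect(user="admin", password="HyperInteractive",
                      host="localhost", dbname="omnisci")

# One fragment holds 10M rows and max_rows caps the table at two fragments,
# so the oldest fragment is dropped only after a second one has filled up.
conn.execute("""
    CREATE TABLE IF NOT EXISTS geotweets (
        message_id BIGINT,
        tweet_date TIMESTAMP(0),
        tweet_text TEXT ENCODING NONE,
        tweet_lang TEXT ENCODING DICT(32),
        latitude DOUBLE,
        longitude DOUBLE,
        sentiment_score DOUBLE
    ) WITH (max_rows = 20000000, fragment_size = 10000000)
""")

# After each daily load, the count should settle between 10M and 20M
# once the table has filled up.
print(list(conn.execute("SELECT COUNT(*) FROM geotweets")))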
We have a community discussion on this topic here, and there are also a FAQ entry and a description in the docs.
Do you have a particular reason to use DOUBLEs, BIGINTs, and so on? Inferring the schema is good for getting a first idea, but after that it is better to optimize the schema:
CREATE TABLE IF NOT EXISTS geotweets (
  message_id BIGINT,
  tweet_date TIMESTAMP ENCODING FIXED(32),
  tweet_text TEXT ENCODING NONE,
  tweet_lang TEXT ENCODING DICT(16),
  latitude FLOAT,
  longitude FLOAT,
  sentiment_score FLOAT  -- or DECIMAL(5,2)
);
With these data types, you save about half of the memory.
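If you go with this leaner schema, the pandas DataFrame should use matching dtypes before loading. A minimal sketch, with made-up sample values and the column names from this thread:

import pandas as pd

# Made-up sample row standing in for the daily batch of geotweets.
df = pd.DataFrame({
    "message_id": pd.Series([123456789], dtype="int64"),
    "tweet_date": pd.to_datetime(["2021-05-01 12:00:00"]),
    "tweet_text": ["some tweet text"],
    "tweet_lang": ["en"],
    "latitude": [42.36],
    "longitude": [-71.06],
    "sentiment_score": [0.42],
})

# Downcast the float64 columns so they line up with the FLOAT columns of
# the optimized schema; second-precision timestamps already fit into
# TIMESTAMP ENCODING FIXED(32), so tweet_date needs no change.
df = df.astype({
    "latitude": "float32",
    "longitude": "float32",
    "sentiment_score": "float32",
})
print(df.dtypes)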
Regards, Candido
-
For the schema,
Create table IF NOT EXISTS geotweets (message_id BIGINT , tweet_date TIMESTAMP ENCODING FIXED(32) , tweet_text TEXT ENCODING NONE, tweet_lang TEXT ENCODING DICT(16), latitude FLOAT,longitude FLOAT,sentiment_score FLOAT / DECIMAL(5,2) )
Is this a good schema for dealing with geo-tweets?
-
Hi,
you can try using load_table_columnar instead of the Arrow loader:
conn.load_table_columnar("geotweets",df, preserve_index=False)
Or you can try casting the pandas DataFrame columns that are float64 with
astype('float32')
on latitude, longitude, and sentiment_score; sometimes the Arrow loader complains about slightly different data types.
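Putting those two suggestions together, a hedged sketch assuming the conn and df objects from the snippets above:

# Option 1: use the columnar loader instead of the Arrow path.
conn.load_table_columnar("geotweets", df, preserve_index=False)

# Option 2 (an alternative, not in addition): keep the Arrow loader, but
# first cast the float64 columns to float32 so the DataFrame dtypes match
# FLOAT columns in the optimized schema.
df = df.astype({"latitude": "float32",
                "longitude": "float32",
                "sentiment_score": "float32"})
conn.load_table("geotweets", df, create=False, method="arrow")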