How to correctly limit a data table to a given number of rows

Comments


  • Candido Dessanti

    Hi @Xiaokang_Fu,

    As stated in the documentation, the max_rows parameter is a maximum, not a constant: when max_rows is reached, the system drops the oldest fragment of the table. So in your case it drops the current fragment and then creates a new one for the records exceeding the 10M mark.

    My suggestion is to create the table with a fragment_size of 10M and a max_rows of 20M. That way the table will always hold between 10 and 20 million rows, in one to two fragments: when the row count passes 20M, the oldest 10M-row fragment is dropped.

    CREATE TABLE IF NOT EXISTS geotweets (
      message_id BIGINT,
      tweet_date TIMESTAMP(0),
      tweet_text TEXT ENCODING NONE,
      tweet_lang TEXT ENCODING DICT(32),
      latitude DOUBLE,
      longitude DOUBLE,
      sentiment_score DOUBLE
    ) WITH (max_rows = 20000000, fragment_size = 10000000);
    

    We have a community discussion here, and we also have a FAQ and a description in the DOCS.

    Do you have a reason to use DOUBLEs, BIGINTs, and so on? Inferring the schema is good for getting a first idea, but after that it's better to optimize it:

    CREATE TABLE IF NOT EXISTS geotweets (
      message_id BIGINT,
      tweet_date TIMESTAMP ENCODING FIXED(32),
      tweet_text TEXT ENCODING NONE,
      tweet_lang TEXT ENCODING DICT(16),
      latitude FLOAT,
      longitude FLOAT,
      sentiment_score FLOAT -- or DECIMAL(5,2)
    );
    

    With these datatypes, you are saving half of your memory.
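    To put rough numbers on that, here is a back-of-the-envelope sketch of the per-row footprint of the fixed-width columns in each schema, assuming the standard byte widths for these encodings; tweet_text (TEXT ENCODING NONE) is variable-length and excluded:

    # Per-row bytes of the fixed-width columns in each schema
    # (standard widths: BIGINT/DOUBLE/TIMESTAMP(0) = 8 bytes,
    # FLOAT/FIXED(32)/DICT(32) = 4 bytes, DICT(16) = 2 bytes).
    original = {
        "message_id BIGINT": 8,
        "tweet_date TIMESTAMP(0)": 8,
        "tweet_lang TEXT DICT(32)": 4,
        "latitude DOUBLE": 8,
        "longitude DOUBLE": 8,
        "sentiment_score DOUBLE": 8,
    }
    optimized = {
        "message_id BIGINT": 8,
        "tweet_date TIMESTAMP FIXED(32)": 4,
        "tweet_lang TEXT DICT(16)": 2,
        "latitude FLOAT": 4,
        "longitude FLOAT": 4,
        "sentiment_score FLOAT": 4,
    }
    print(sum(original.values()), "vs", sum(optimized.values()), "bytes/row")
    # prints: 44 vs 26 bytes/row, i.e. roughly a 40% cut on the
    # fixed-width columns, before even counting tweet_text
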

    Regards, Candido

  • Xiaokang Fu

    For the schema,

    CREATE TABLE IF NOT EXISTS geotweets (
      message_id BIGINT,
      tweet_date TIMESTAMP ENCODING FIXED(32),
      tweet_text TEXT ENCODING NONE,
      tweet_lang TEXT ENCODING DICT(16),
      latitude FLOAT,
      longitude FLOAT,
      sentiment_score FLOAT -- or DECIMAL(5,2)
    );

    Is this a good schema for dealing with geo-tweets?

  • Candido Dessanti

    It's the best suggestion I can give you without seeing the data, and it's going to save you a lot of memory.

    I'm not sure what you are going to do with tweet_text, though.

    Regards, Candido

  • Xiaokang Fu

    We want to investigate the text to understand what people are talking about. I got some errors when following the suggestion:

    [screenshots of the error messages]

  • Candido Dessanti

    Hi,

    you can try using load_table_columnar instead of the Arrow loader, like so:

    conn.load_table_columnar("geotweets", df, preserve_index=False)

    or you can cast the pandas DataFrame columns that are float64 to float32 with astype('float32') on latitude, longitude, and sentiment_score.

    Sometimes the Arrow loader complains about slightly different datatypes.
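    A minimal sketch of the cast-then-load approach, assuming a pymapd connection and a DataFrame with the columns above (the credentials and the source file name are placeholders, not from the thread):

    import pandas as pd
    from pymapd import connect

    # Placeholder connection parameters; adjust to your server.
    conn = connect(user="admin", password="HyperInteractive",
                   host="localhost", dbname="omnisci")

    # Hypothetical source file holding the geotweets columns.
    df = pd.read_csv("geotweets.csv", parse_dates=["tweet_date"])

    # Downcast the float64 columns to float32 so they match the
    # FLOAT columns in the table schema.
    for col in ("latitude", "longitude", "sentiment_score"):
        df[col] = df[col].astype("float32")

    # Columnar load; sidesteps the Arrow path that complains
    # about slightly mismatched datatypes.
    conn.load_table_columnar("geotweets", df, preserve_index=False)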

