GROUP BY without dictionary encoding
I'm working with Foreign Storage in OmnisciDB and it takes too much time to open files with dictionary encoding. (more than 20s to open a million records) Without dictionary encoding, it takes less than 5s to create table but I can't use a GROUP BY function with non-encoded string columns. Is there any way to run a GROUP BY query without a dictionary encoded string column?
-
Hi @SJJ0211,
You might try enabling
enable-string-dict-hash-cache
. I have not personally tried it with the arrow csv foreign storage interface, but we have had good success with it on large import jobs to our own storage. The reason this is not on by default is for small imports you pay a slight penalty.Alex
-
I tried your suggestion but it didn't seem to work for foreign storage interface. There was no difference in the amount of time spent on creating a table. I also tried to find code to use a group by function with plain text format, but I think it doesn't exist. Anyway, thank you for your reply.
-
Hi @SJJ0211,
To follow up on @anon45925161's comment, would it be possible to time importing your data from CSV?
Also, in general you'll find the biggest wins from
enable-string-dict-hash-cache
when the cardinality of your column is large (i.e. has a high number of unique values), so if you have low cardinality columns, you may not see much benefit. We hope at some point to turn it on by default, perhaps using it selectively for columns that most benefit it when running big imports. -
Actually I slightly changed source code and I'm currently using parquet files instead of a single csv file. But I can assure that it didn't affect the performance when importing dictionary data. A table with 130M rows and 27 columns (including 19 dictionary encoding string columns) takes 160~170 seconds to import regardless of 'enable-string-dict-hash-cache' configuration (without dictionary encoding, it takes 30s). About cardinality, only two string columns have high cardinality ratio (over than 50%) and most columns are low (less than 2%). In my project, actual data could be 100 times bigger than that so I don't think 160 seconds are fast enough to load data. Thank you for your advice @todd.
-
Hi @SJJ0211,
I tried out the FSI, that's in an unfinished state and it looks there are some serialization, that are limiting the load performance, if compared with the copy command.
Have you tried to use a traditiona table and load using the copy command? with a lot of dictionary encoded string, assuming you have enough CPU processing (I'm using a AMD 1920Xwith 24 logical cores) it's 6-7 times faster than FSI
omnisql> copy flights_test_load from '/opt/opendata/flights/flights.csv.gz'; Result Loaded: 10000000 recs, Rejected: 0 recs in 18.785000 secs 1 rows returned. Execution time: 19215 ms, Total time: 19216 ms omnisql> select count(*) from flights_fsi; \\d flights_fsi EXPR$0 10000000 1 rows returned. Execution time: 131733 ms, Total time: 131734 ms
the first queries will be slower depending by the number of columns referenced, but I think right now is the best option.
I also tried without any string encoding and it ended in a slower import of data
omnisql> create foreign table flight_fsi_en ( flight_year SMALLINT, flight_mmonth SMALLINT, flight_dayofmonth SMALLINT, flight_dayofweek SMALLINT, deptime SMALLINT, crsdeptime SMALLINT, arrtime SMALLINT, crsarrtime SMALLINT, uniquecarrier TEXT ENCODING NONE, flightnum SMALLINT, tailnum TEXT ENCODING NONE, actualelapsedtime SMALLINT, crselapsedtime SMALLINT, airtime SMALLINT, arrdelay SMALLINT, depdelay SMALLINT, origin TEXT ENCODING NONE, dest TEXT ENCODING NONE, distance SMALLINT, taxiin SMALLINT, taxiout SMALLINT, cancelled SMALLINT, cancellationcode TEXT ENCODING NONE, diverted SMALLINT, carrierdelay SMALLINT, weatherdelay SMALLINT, nasdelay SMALLINT, securitydelay SMALLINT, lateaircraftdelay SMALLINT, dep_timestamp TIMESTAMP(0) ENCODING FIXED(32), arr_timestamp TIMESTAMP(0) ENCODING FIXED(32), carrier_name TEXT ENCODING NONE, plane_type TEXT ENCODING NONE, plane_manufacturer TEXT ENCODING NONE, plane_issue_date DATE ENCODING DAYS(16), plane_model TEXT ENCODING NONE, plane_status TEXT ENCODING NONE, plane_aircraft_type TEXT ENCODING NONE, plane_engine_type TEXT ENCODING NONE, plane_year SMALLINT, origin_name TEXT ENCODING NONE, origin_city TEXT ENCODING NONE, origin_state TEXT ENCODING NONE, origin_country TEXT ENCODING NONE, origin_lat FLOAT, origin_lon FLOAT, dest_name TEXT ENCODING NONE, dest_city TEXT ENCODING NONE, dest_state TEXT ENCODING NONE, dest_country TEXT ENCODING NONE, dest_lat FLOAT, dest_lon FLOAT) server test_server with (file_path='flights/flights.csv.gz'); omnisql> select count(*) from flights_fsi_en; Exception: Exception occurred: org.apache.calcite.runtime.CalciteContextException: From line 2, column 6 to line 2, column 21: Object 'flights_fsi_en' not found omnisql> select count(*) from flight_fsi_en; EXPR$0 10000000 1 rows returned. Execution time: 313834 ms, Total time: 313836 ms
Please sign in to leave a comment.
Comments
5 comments