Resources required to run queries

Comments

8 comments

  • Avatar
    Candido Dessanti

    Hi @bycxgto,

    To try, to figure how much memory is required to run a query you have to start the server with one or both those parameters turned to true

    Verbose and enable-debug-timer.

    Run your query and re-post the more detailed logs as an attachment. I will try to analyze.

    I am sorry for the short answer but I am on a plane right now.

    0
    Comment actions Permalink
  • Avatar
    Candido Dessanti

    Hi @bycxgto,

    I took a look at your queries, and I noticed that both of them are cartesian products, so the memory needed is going to be huge because, before the creation of the table, our engine needs to compute the entire result set, so the OOM Errors you are getting.

    The records generated by the second query is around 26 Billion so the memory needed is going to be huge; also the first one is generation around 4 Billion records and is needing 124360000272 bytes on my system.

    Have you tried to build the table doing a geojoin?

    create table yellowtaxiJan13_intersect_pickup as (
      SELECT
        ST_INTERSECTS(tx.omnisci_geo, ST_SetSRID(ST_Point(yt.pickup_longitude, yt.pickup_latitude), 4326)) as pickup_outcome,
        yt.trip_id as pickup_tripid, tx.borough as pickup_borough
    from yellowtaxiJan13 as yt, NYCtaxiZone as tx where ST_INTERSECTS(tx.omnisci_geo, ST_SetSRID(ST_Point(yt.pickup_longitude, yt.pickup_latitude), 4326)) = true
    

    to improve performance I suggest switching the pickup/dropoff lat/lon to POINT datatype, turn to true the parameters enable-overlaps-hashjoin and enable-hashjoin-many-to-many rewrite the query this way

    create table yellowtaxiJan13_intersect_pickup as (
      SELECT ST_INTERSECTS(tx.omnisci_geo, yt.pickup) as pickup_outcome, yt.rowid as pickup_tripid, tx.borough as pickup_borough 
    from yellowtaxiJan13_p as yt, nyctaxizones as tx 
    where ST_contains(tx.omnisci_geo, pickup) = true ;
    

    About the first query, I suggest running several batches to populate the target table using the rowid virtual column to define the range of records for each batch. As an example

    CREATE TABLE  yellowtaxiJan13_pickupsub_dist
    as 
    SELECT
    yt.trip_id, sb.Division, sb.Line, sb.StationName,
    ST_DISTANCE(ST_SetSRID(ST_Point(StationLongitude, StationLatitude), 4326), ST_SetSRID(ST_Point(EntranceLongitude, EntranceLatitude), 4326)) as dist_m
    from yellowtaxiJan13 as yt, NYCSubway as sb
    where rowid between 0 and 100000;
    
    INSERT INTO yellowtaxiJan13_intersect_pickupSELECT
    yt.trip_id, sb.Division, sb.Line, sb.StationName,
    ST_DISTANCE(ST_SetSRID(ST_Point(StationLongitude, StationLatitude), 4326), ST_SetSRID(ST_Point(EntranceLongitude, EntranceLatitude), 4326)) as dist_m
    from yellowtaxiJan13 as yt, NYCSubway as sb
    where rowid between 100001 and 200000;
    
    and so on..,
    

    Hopes this helps

    0
    Comment actions Permalink
  • Avatar
    bycxgto

    Hope you had a good flight! I managed to run what you suggested, want to share some findings here

    I created a omnisci.conf file in /omnisci-storage

    vi omnisci.conf
    enable-debug-timer = 1
    -v
    enable-overlaps-hashjoin = 1
    enable-hashjoin-many-to-many = 1
    

    I restarted my docker with the new configuration but failed. After I removed the .conf file it is up and running again. I don't have access to systemctl but I'm guessing the .conf file was doing something to the docker that prevented it to start. Also is enable-hashjoin-many-to-many available? I haven’t managed to locate this config in the website config page.

    I tried your first geojoin query without using the .conf file, and voila! It completed within seconds

    omnisci admin 419-7B3k {"query_str","client","nonce","execution_time_ms","total_time_ms"} {"create table yellowtaxiJan13_intersect_pickup as ( SELECT ST_INTERSECTS(tx.omnisci_geo, ST_SetSRID(ST_Point(yt.pickup_longitude, yt.pickup_latitude), 4326)) as pickup_outcome, yt.trip_id as pickup_tripid, tx.borough as pickup_borough from yellowtaxiJan13 as yt, NYCtaxiZone as tx where ST_INTERSECTS(tx.omnisci_geo, ST_SetSRID(ST_Point(yt.pickup_longitude, yt.pickup_latitude), 4326)) = true );","tcp:localhost:42522","","38943","38944"}
    

    As for ST_DISTANCE query, will increasing GPU/CPU slab size and GPU memory pool size help? Would you still suggest running queries in batches? Thanks :slight_smile:

    0
    Comment actions Permalink
  • Avatar
    Candido Dessanti

    Hi,

    Thanks for the feedback, and don't worry, I guess the problem relies on -v rather than the hash joins parameter.

    Anyway, if you want a complete list of available parameters on your release, run this command.

    docker exec -it omnisciserver bash -c "/omnisci/bin/omnisci_server /omnisci-storage/data --help "
    

    The list is huge, so if you look for something more specific, you can filter using the grep command.

    mapd@zion16:/var/lib/omnisci$ docker exec -it omnisciserver bash -c "/omnisci/bin/omnisci_server /omnisci-storage/data --help " | grep hashjoin
      --enable-overlaps-hashjoin [=arg(=1)] (=0)
      --enable-hashjoin-many-to-many [=arg(=1)] (=0)
    

    An omnisci.conf like that would work is this one

    enable-overlaps-hashjoin=true
    enable-hashjoin-many-to-many=true
    enable-debug-timer=true
    verbose=true
    

    Anyway, the most relevant parameter would be enable-overlaps-hash join. It will enable a kind of join specifically designed for geo-join with st_contains, but it's not going to work when generating a point at runtime with ST_point or similar.

    so to take advantage of it, I defined the taxi's table with pickup and dropoff as a point

    CREATE TABLE yellowtaxiJan13_p (
      vendorid TEXT ENCODING DICT(8),
      tpep_pickup_datetime TIMESTAMP(0) ENCODING FIXED(32),
      tpep_dropoff_datetime TIMESTAMP(0) ENCODING FIXED(32),
      passenger_count SMALLINT,
      trip_distance FLOAT,
      pickup GEOMETRY(POINT, 4326) ENCODING COMPRESSED(32),
      ratecodeid TEXT ENCODING DICT(8),
      store_and_fwd_flag TEXT ENCODING DICT(8),
      dropoff GEOMETRY(POINT, 4326) ENCODING COMPRESSED(32),
      payment_type TEXT ENCODING DICT(8),
      fare_amount FLOAT,
      extra FLOAT,
      mta_tax FLOAT,
      tip_amount FLOAT,
      tolls_amount FLOAT,
      improvement_surcharge FLOAT,
      total_amount FLOAT);
    

    Loaded the table with copy command, the lat/lon are converted into points.

    You are using a GPU with Volta archs to get a speedup of 3x with this feature enabled.

    About the second query, the problem is that's is going to produce 24 Billion records and I don't get you to want to calculate, so I can't help you so much (calculating the distance between entrance and center of each station, without any relationship with the dropoff or pickup point of the taxi's trip). Let me know what you want to get with the query.

    Regards

    0
    Comment actions Permalink
  • Avatar
    Candido Dessanti

    Hi @bycxgto,

    about your second query, maybe you wanted to get the trips within a certain distance from subway stations?

    a query would be this one

    SELECT yt.rowid as trip_id, sb.Division, sb.Line, sb.StationName, ST_DISTANCE(CastToGeography(dropoff), CastToGeography(station)) as dist_m 
    FROM  yellowtaxiJan13_p as yt, 
                NYCSubway as sb 
    WHERE ST_DISTANCE(CastToGeography(dropoff), CastToGeography(station)) < 50;
    

    This query returns all the trip_id with station names that have ended within 50 meters of a Subway Station.

    to run this query you would add this parameter to omnisci.conf trivial-loop-join-threshold=10000.

    you can further elaborate the query using group by to get as an example the subway stations that has more trips within 100 meters.

    SELECT sb.StationName, count(*) as the_count 
      FROM yellowtaxiJan13_p as yt, NYCSubway as sb 
      WHERE ST_DISTANCE(CastToGeography(dropoff), CastToGeography(station)) < 100  
      GROUP BY 1 
      ORDER BY 2 DESC 
      LIMIT 20;
    

    Hope this helps

    0
    Comment actions Permalink
  • Avatar
    mjj203

    I was provided this spreadsheet calculator to determine amount of GPU ram needed by a query based on the number and type of hot columns * number or records used. For my use case of 10 hot columns (3x linestring, 3x double, 2x int, 2x dict32) and 400m records it was going roughly need 23GB of GPU ram.GPU RAM Sizing 2020 (make a copy).zip|attachment (21.5 KB)

    1
    Comment actions Permalink
  • Avatar
    bycxgto

    Hi @candido.dessanti

    Thanks for the suggestions, the geojoin has helped a lot with reducing size of query. And yes I popped the wrong data fields in second query. It should be the taxi pickup/ drop-off coordinates and distance away from subway stations.

    My docker doesn’t seem to like the new .conf file though. I added trivial-loop-join-threshold=10000 and got Dictionary path /omnisci-storage/data/mapd_data/DB_1_DICT_48/DictPayload does not exist when I ran queries.

    Then I removed the .conf file, started the docker, it was up for a few seconds then shut down again (then I restarted a few times but failed). Looks like there is some docker rebuild in order :smiley:

    0
    Comment actions Permalink
  • Avatar
    Candido Dessanti

    Yap,

    the problem doesn't look to be the docker or the config file, but the filesystem where the database is stored is missing some files (a corruption?); never got a problem like that.

    as for performance, if you want to load all the stations with relative distance to pickup/dropoff trips, I suggest to split in batches of 25000 to get the best troughtput (at least on my system looks to be the sweet spot). I'm getting 5189096 rowd writter per seond using 25000 and just 3662781 using 100000.

    At the end of load the number of rows load will be of over 27 Billion so it's unlikely you will be able to query the table with GPU because of memory needed, and the query will fall back for CPU execution, so maybe it's better to narrow the selection.

    Another trick if you have just lon/lat coordinates insted of geometry objects like points, you can try t use the distance_in_meter function, expecially if you are on RTX or no tesla cards

    omnisql> select count(*) from NYCSubway_ll,yellowtaxiJan13 where distance_in_meters(station_longitude,station_latitude,pickup_longitude,pickup_latitude) < 100;

    0
    Comment actions Permalink

Please sign in to leave a comment.