How to downgrade the version

26 comments

  • Sumit

    I updated the system last month (February) and it was working perfectly fine. However, today, after updating to the latest version, I am not getting a response even after 2 hours.

    Please note: the data volume is less than a few GB and the system RAM is 192 GB.

  • Sumit

    I am not sure how to get out of this problem.

  • Randy Zwitch

    How did you install 5.1.2?

  • Sumit

    Upgrading the OmniSci version:

    systemctl stop omnisci_server
    yum update omnisci
    systemctl start omnisci_server
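
    (Since the thread title asks about downgrading: the reverse operation would look roughly like the following with yum, assuming the previously installed package version — shown here as 5.1.1, an example — is still available in the repository:)

    systemctl stop omnisci_server
    # roll back to a specific earlier package version (5.1.1 is an example)
    yum downgrade omnisci-5.1.1
    systemctl start omnisci_server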

  • Randy Zwitch

    What does systemctl status omnisci_server show?

  • Sumit

    omnisql> \status
    Server Version : 5.1.2-20200304-e118922084
    Server Start Time : 2020-03-06 : 16:37:37

    [root@utility1 ~]# systemctl status omnisci_server
    ● omnisci_server.service - OmniSci database server
       Loaded: loaded (/usr/lib/systemd/system/omnisci_server.service; enabled; vendor preset: disabled)
       Active: active (running) since Fri 2020-03-06 20:37:34 +04; 1h 18min ago
     Main PID: 199238 (omnisci_server)
        Tasks: 103
       CGroup: /system.slice/omnisci_server.service
               ├─199238 /opt/omnisci//bin/omnisci_server --config /opt/data/omnisci.conf
               └─199253 -Xmx1024m -DMAPD_LOG_DIR=/opt/data/data -jar /opt/omnisci/bin/calcite-1.0-SNAPSHOT-jar-with-dependencies.jar -e /opt/omnisci/QueryEngine/ -d /opt/data/data -p 6279 -m 6274 -T -P -Y -Z -c /opt/data/omnisci.conf

    Mar 06 20:37:34 utility1 systemd[1]: Started OmniSci database server.
    Mar 06 20:37:35 utility1 omnisci_server[199238]: 2020-03-06T20:37:35.483735 E 199238 MapDHandler.cpp:225 This build isn't CUDA enabled, will run on CPU
    Mar 06 20:37:35 utility1 omnisci_server[199238]: 2020-03-06T20:37:35.494217 E 199238 Calcite.cpp:216 Appears to be orphaned Calcite server already running, shutting it down
    Mar 06 20:37:35 utility1 omnisci_server[199238]: 2020-03-06T20:37:35.494328 E 199238 Calcite.cpp:218 Please check that you are not trying to run two servers on same port
    Mar 06 20:37:35 utility1 omnisci_server[199238]: 2020-03-06T20:37:35.494383 E 199238 Calcite.cpp:219 Attempting to shutdown orphaned Calcite server
    Mar 06 20:37:35 utility1 omnisci_server[199238]: 2020-03-06T20:37:35.549881 E 199238 Calcite.cpp:224 orphaned Calcite server shutdown
    [root@utility1 ~]#

    omnisql> \
    ^C Interrupt signal (2) received.
    [wicp@utility1 ~]$


    OmniSci Server CPU Memory Summary:
                MAX            USE      ALLOCATED           FREE
       102529.43 MB       44.43 MB     4096.00 MB     4051.57 MB

    OmniSci Server GPU Memory Summary:
    [GPU]            MAX            USE      ALLOCATED           FREE

    omnisql>

  • Sumit

    Please let us know how I can get out of this problem. I am stuck and need help.

    If required, I can restart and share more detailed information.

  • Randy Zwitch

    If you have gotten to this point, it feels like you have a running server:

    OmniSci Server CPU Memory Summary:
    MAX USE ALLOCATED FREE
    102529.43 MB 44.43 MB 4096.00 MB 4051.57 MB
    
    OmniSci Server GPU Memory Summary:
    [GPU] MAX USE ALLOCATED FREE
    
    omnisql>
    

    The only thing that stands out to me from the snippet you posted is that another Calcite server was running, which OmniSciDB tried to shut down. Have you completely restarted the server (not just OmniSci) to see if that fixes the issues you are seeing?

  • Sumit

    [wicp@utility1 ~]$ ps -ef | grep -i omni
    omnisci  199238      1 11 Mar06 ?     00:43:00 /opt/omnisci//bin/omnisci_server --config /opt/data/omnisci.conf
    omnisci  199253 199238  0 Mar06 ?     00:00:42 -Xmx1024m -DMAPD_LOG_DIR=/opt/data/data -jar /opt/omnisci/bin/calcite-1.0-SNAPSHOT-jar-with-dependencies.jar -e /opt/omnisci/QueryEngine/ -d /opt/data/data -p 6279 -m 6274 -T -P -Y -Z -c /opt/data/omnisci.conf
    wicp     222119 221845  0 02:39 pts/0 00:00:00 grep --color=auto -i omni
    [wicp@utility1 ~]$

    Please let us know if there is any other way to check whether another OmniSci server is running.
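
    (One quick check, sketched here as a suggestion rather than an official procedure: the ps output above shows the server on port 6274 and Calcite on port 6279, so you can look for anything else bound to those ports:)

    # list listeners on the OmniSci server and Calcite ports
    ss -ltnp | grep -E ':(6274|6279)'
    # or, equivalently
    lsof -i :6274 -i :6279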

  • anon23740792

    Hi Sumit, sorry to hear you're stuck. If you've not tried this already:

    1. Do you have a backup of your OmniSci data directory? Can you make sure you do that first?
    2. Test with a clean install of 5.1.2 from a tarball: start omnisci from the bin directory there and see if it works on a separate data directory (i.e., without pointing it at your existing data directory). See the sketch after this list.
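
    (A rough sketch of that clean-install test, assuming the standard Linux tarball layout; the archive name and paths below are illustrative:)

    # unpack the 5.1.2 tarball and run it against a fresh data directory
    tar xzf omnisci-ee-5.1.2-Linux-x86_64.tar.gz   # archive name is an example
    cd omnisci-ee-5.1.2-Linux-x86_64
    mkdir /tmp/omnisci-test-data
    ./bin/initdb /tmp/omnisci-test-data            # initialize the new data directory
    ./bin/omnisci_server --data /tmp/omnisci-test-data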

  • Sumit

    Hi @anon23740792,

    a) Thank you for the kind words. I will check with the team whether they have taken a backup of the data directory. If not, we can create a clear procedure for backup and restoration. (Please let us know if such a procedure is available in the OmniSci documentation or any other location.)
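
    (For reference, a minimal cold-backup sketch, assuming the data directory is /opt/data as in the output earlier in this thread; adjust the paths to your setup:)

    systemctl stop omnisci_server
    # archive the whole data directory while the server is down
    tar czf /backup/omnisci-data-$(date +%F).tar.gz /opt/data
    systemctl start omnisci_server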

    b) On another note, I would like to highlight the following critical points for the OmniSci team, from a user's point of view:

    b1) Yesterday I struggled the whole day to get the system up and running, even after multiple process restarts and a server reboot. It suddenly started working after 5-6 hours, but I am still not sure what will happen if I restart again.

    b2) I am running the OmniSci database on a standalone server, with no other processes running on this high-end machine (40 cores, 192 GB RAM, 10 HDDs). The data volume is less than a few GB. However, yesterday, after restarting the OmniSci database, the server load average was touching 30-35 (on other days the load average does not cross 1-2).

    b3) My biggest area of concern is the OmniSci INFO log. As a technical person I could tell there was definitely some issue in the system or the database, but I was not able to deduce anything from this log: it looked normal even while I could not get a response to the simplest query (\t, to list the available tables). Maybe my knowledge of the OmniSci logs is not great, but it also meant I could not share useful feedback with the community so it could help me.

    b4) Please let us know whether any core dump / detailed trace option is available in the OmniSci database, so that we can run it for a short duration during an issue, analyze it, and share it with the OmniSci team to verify and advise. With no logs to go on during the issue, I feel I am in a clueless state and only god can safeguard me.

  • Randy Zwitch

    Thank you for the feedback, we will take the logging into consideration.

    Please note that there are actually 4 types of OmniSci logs, each with increasing severity:

    omnisci_server.INFO 
    omnisci_server.WARNING
    omnisci_server.ERROR
    omnisci_server.FATAL
    

    So while the INFO log may not show any issues, the other log levels might.
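
    (A quick way to check the higher-severity logs, sketched here under the assumption that they live in the mapd_log directory under your data path, as suggested by the MAPD_LOG_DIR setting shown earlier; the exact path may differ on your install:)

    # look for recent warnings and errors (log directory path is an assumption)
    tail -n 100 /opt/data/data/mapd_log/omnisci_server.WARNING
    tail -n 100 /opt/data/data/mapd_log/omnisci_server.ERROR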

    From your description of the server doing something for hours and then eventually coming back, it sounds like your table metadata might have issues. How is the data being loaded into your database: in batches, or as individual row inserts?

  • anon23740792

    Sumit - what version did you upgrade from?

    I understand your frustration. There are several large OmniSci customers running in production without encountering similar issues. We would like to help you get to the bottom of what happened, but we need additional information in the form of logs. It is hard to diagnose problems through forum messages alone. Can you upload your full set of logs somewhere for us to take a look?

  • Sumit

    Hi @randyzwitch

    Thanks for your feedback. I am planning to restart the omnisci server once again, to make sure that if it restarts again some day I will not face this issue.

    This time I will capture the logs mentioned above and share them with the OmniSci team to diagnose the issue, if any.

  • Sumit

    Hi @anon23740792,

    Thank you for your feedback. You mentioned my frustration, but there are other aspects to it.

    I personally feel connected to the OmniSci database, and I believe its design philosophy has the capability to change the world of analytical databases.

    However, when I face a challenge and am not able to resolve the issue, I feel alone and react accordingly.

    At present, the biggest challenge I am facing is the lack of log detail, especially while an issue is occurring. For example, when I issue \t to list the available tables, or select count(*), column_x from table_y, and am not able to get any output, I don't know what to do next.

  • anon23740792

    We appreciate your feeling as strongly as you do about OmniSci, @Sumit - thank you for that! While we're trying to figure out the problem, can you help answer the following questions:

    1) How frequently do you do any sort of update/delete/insert operations on the database? i.e., how many times a day/week/month, and how many records do you usually have in these batches?
    2) How long have you been doing these updates? e.g., the past year? A few months?
    3) What version did you most recently upgrade from?

    Another note: your server is definitely quite capable on compute capacity, but using HDDs may also be a contributing factor here (again, I'm guessing, given what you've described above). We usually recommend faster NVMe storage with OmniSci because, as we discussed earlier, we don't store the whole database in CPU memory, so you can actually run databases larger than available memory, unlike several in-memory databases. But for that to be fast, we prefer high-speed I/O in the form of SSDs/NVMe.

    My guess is that you are facing latency issues stemming from long-running metadata operations, as Randy points out. We can get on a call to discuss this further if you want.

  • Sumit

    Hi @anon23740792 and @randyzwitch,

    I am able to reproduce the reported issue and have connected many dots; however, a few points are still missing, and I would request your help with them.

    A) Today I again restarted the OmniSci database using the following commands (stop, then start):

    systemctl stop omnisci_server
    systemctl start omnisci_server

    Action 1
    1. I observed that, again, I was not getting responses to my simple queries after restarting the OmniSci database. I then started analyzing step by step.
    2. The observations were as follows:
       a. The disk partition was approximately 40% full (400 GB).
       b. Disk IOPS increased drastically while running one simple query (select count(*) from table X), touching 4000 TPS and reading 100 MB/sec.
       c. The \memory_summary output showed 100 GB of available memory, but used memory was 0.00 MB.
       d. No errors in the INFO or ERROR logs.

    Action 2
    1. Based on Action 1, I suspected an issue with table fragmentation and started analyzing the table sizes using sqlite queries (a filesystem-level size check is sketched after this list).
    2. During the analysis I observed that the tables are highly fragmented: the actual record counts are very low (a few thousand), yet the table sizes are on the order of 40 GB, 62 GB, 100 GB and 192 GB.
    3. I defragmented the tables in the following way:
       3.1. select count(*) from X;
       3.2. create table TEMP_X as select * from X;
       3.3. drop table X;
       3.4. create table X as select * from TEMP_X;
       3.5. drop table TEMP_X;
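
    (A complementary, hedged sketch of that size check: each table's data files live under mapd_data/table_<db_id>_<table_id> inside the data directory, as the table location in the log lines quoted below shows, so directory sizes can be compared directly; the path is taken from this thread and may differ on other installs:)

    # find the largest table directories under the OmniSci data path
    du -sh /opt/data/data/mapd_data/table_* | sort -h | tail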

    Observations on Action 2

    Case 1
    1. When I executed the defragmentation steps for a table where no data inserts were happening, system IOPS was 2000+ with 100 MB/s reads. However, the query execution took 6-7 minutes.
    2. During the select count() from X, I observed that the \memory_summary USE value was constantly 0.00 MB.
    3. There was no intermittent status update in the log between the query being accepted by Calcite and the output being returned, and the time gap between the two events was more than 6-7 minutes; I was clueless about what was happening in the system.
       a. 2020-03-08T12:42:01.182590 I 422998 MapDHandler.cpp:883 stdlog_begin sql_execute 54 0 WICP_MAPD_DB WICP_MAPD 239-eBLm {"query_str"} {"select count() from EVR;"}
       b. 2020-03-08T12:47:04.128277 I 422998 FileMgr.cpp:204 Completed Reading table's file metadata, Elapsed time : 302904ms Epoch: 424416 files read: 6011 table location: '/opt/data/data/mapd_data/table_2_34'

    Case 2
    1. When I executed a select query on a table where data inserts were happening, system IOPS was 4000+ with 100 MB/s reads. However, the query execution took more than 150 minutes. This time I was in a panic: no information in the INFO log, \memory_summary showing 0.00 MB memory usage, system load average above 30, and IOPS above 4000.

    Questions to OmniSci Team

    1. No progress information is available in the INFO log between "2020-03-08T12:42:01.217941 I 422998 Calcite.cpp:504 Time in Thrift 1 (ms), Time in Java Calcite server 32 (ms)" and "2020-03-08T12:47:04.128277 I 422998 FileMgr.cpp:204 Completed Reading table's file metadata, Elapsed time : 302904ms Epoch: 424416 files read: 6011 table location: '/opt/data/data/mapd_data/table_2_34'". This is a gap of more than 5 minutes, and it is not good for an administrator to be clueless about the progress of a query.
    2. Why was data not being loaded into memory for "select count(*) from X" or "select count(column1) from X"? Is this correct behavior?
    3. While data loading was in progress for table X, "select count(*) from Y" took more than 150 minutes, and during this time I was not able to get the output of any query. This was a system outage.
    4. How can a fragmented table create such a big issue and force X hours of service outage? What methods are available to avoid such scenarios?
    5. The system was doing 4000 IOPS and reading 125 MB/sec; how, then, did the reading take more than 150 minutes? At that speed, more than 200 GB of data could be read in less than 30 minutes.
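
    (On question 5: the IOPS and throughput figures quoted above can be watched live with iostat from the sysstat package, e.g.:)

    # extended per-device statistics in MB/s, refreshed every second
    iostat -xm 1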
  • Sumit

    Waiting for expert advice.

  • Randy Zwitch

    [quote="anon23740792, post:17, topic:2269"]
    * How frequently do you do any sort of update/delete/insert operations on the database? i.e., how many times a day/week/month, and how many records do you usually have in these batches?
    * How long have you been doing these updates? e.g., the past year? A few months?
    [/quote]

    @Sumit, please see above. Both @anon23740792 and I believe that metadata could be the issue you are seeing, but without knowing your insert pattern, it's hard for us to provide any other advice.

  • anon23740792

    Hey @Sumit - sorry for the delay. Let me take this back to our engineering team. It seems (though I'm still not entirely sure) that a number of the issues relate to your update/insert history and the amount of resulting metadata that needs to be tracked (i.e., if you did many small batch updates/inserts/deletes in the past, we keep a fair amount of metadata in order to allow rollbacks). This is why we still need the answer to my earlier question on the frequency of updates/deletes/inserts.

    Does any other query also take this long, or is it just the select count(*)? What about group-by queries, like select count(1), x ... group by x?

  • Sumit

    OmniSci question: how many times a day/week/month, and how many records do you usually have in these batches?
    Sumit's feedback: we are loading data in two ways:
    1. Text file data loading (in batches of 100+ records into different tables, at a frequency of 15 minutes).
    2. Kafka importer: continuous loading, of approximately one to five million records.

    OmniSci question: how long have you been doing these updates? e.g., the past year? A few months?
    Sumit's feedback: we do updates rarely. However, the system does deletion and reloading in the same tables, and this happens on a daily basis. This has been going on for the last few (2-3) months.
  • anon45925161

    Hi @Sumit,

    Yes, the issue here is around table metadata generated during your imports. OmniSciDB uses a checkpoint system to ensure failed imports are cleanly rolled back without corrupting a table. Each checkpoint creates some metadata on disk which must then be read at startup, or when the table is first touched. Due to your continuous loading with kafka importer as well as the small text file batches, I suspect that you have many, many pages of metadata which is causing a slow load (particularly off HDDs -- these are not one shot reads but multiple ordered reads, as the metadata tells us what to read from the data files).

    This is a problem we are aware of and are currently working on. In the interim, you will need to copy out the table using COPY TO, drop it, and reimport it using COPY FROM (see the sketch below). For the near future, I suggest increasing your Kafka batch size following the parameters here: https://docs.omnisci.com/latest/6_loading_data.html#kafkaimporter - that will reduce the number of checkpoints (and subsequent metadata pages). You may also want to look at StreamInserter: https://docs.omnisci.com/latest/6_loading_data.html#streaminsert - the batch size is configurable there as well.
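
    (A minimal sketch of that interim workaround; my_table and the file path are placeholder names:)

    -- export the current rows (path and table name are illustrative)
    COPY (SELECT * FROM my_table) TO '/tmp/my_table.csv';
    -- drop and recreate the table, then reload
    DROP TABLE my_table;
    -- CREATE TABLE my_table (...);  -- original schema goes here
    COPY my_table FROM '/tmp/my_table.csv';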

    Let us know if that helps, and if you have any other questions. Thanks!

    Alex

  • Candido Dessanti

    @Sumit

    As @anon45925161 pointed out, I also believe the problem lies in the metadata generated during the DMLs.

    This should also explain the Calcite problems you recently got that we discussed in this thread; it's likely that, with a lot of metadata in the table, you can hit the Calcite time-out error even with limited concurrency on parsing.

    So I suggest scheduling, during non-working hours, a periodic rebuild of the tables that receive the most DMLs. You can create a new table and load it with a batch INSERT ... SELECT (IAS), drop the old table, then rename the new one to the old name; during this operation, DML on the table should be suspended. A sketch follows.
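
    (A minimal SQL sketch of that rebuild, with my_table as a placeholder name:)

    -- build the replacement table while DML on my_table is suspended
    CREATE TABLE my_table_new AS SELECT * FROM my_table;
    -- swap it in place of the old table
    DROP TABLE my_table;
    ALTER TABLE my_table_new RENAME TO my_table;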

  • Candido Dessanti

    Hi @Sumit

    I forgot to reply to 2-3 of your questions.

    When you run a simple select count(*), only metadata is used, so no data is loaded into the cache. Almost all of the runtime is the time needed to load the metadata and compile the query.

    On a cold-started instance, with no cache allocated and no metadata cached:

    omnisql> select count(*) from adsb_airplanes_test;
    EXPR$0
    497551202
    1 rows returned.
    Execution time: 853 ms, Total time: 943 ms
    omnisql> \memory_summary
    OmniSci Server CPU Memory Summary:
                MAX            USE      ALLOCATED           FREE
        51485.61 MB        0.00 MB        0.00 MB        0.00 MB
    
    OmniSci Server GPU Memory Summary:
    [GPU]            MAX            USE      ALLOCATED           FREE
      [0]     5245.21 MB        0.00 MB     2048.00 MB     2048.00 MB
      [1]     5245.21 MB        0.00 MB     2048.00 MB     2048.00 MB
    

    As you can see, the runtime isn't that fast, because of the time needed to load the table metadata and to generate and compile the query. But if you run a select count(nullable_field), the buffers are populated at every level of caching:

    omnisql> select count(Altitude) from adsb_airplanes_test;
    EXPR$0
    497551202
    1 rows returned.
    Execution time: 290 ms, Total time: 291 ms
    omnisql> \memory_summary
    OmniSci Server CPU Memory Summary:
                MAX            USE      ALLOCATED           FREE
        51485.61 MB     1898.01 MB     4096.00 MB     2197.99 MB
    
    OmniSci Server GPU Memory Summary:
    [GPU]            MAX            USE      ALLOCATED           FREE
      [0]     5245.21 MB     1135.07 MB     2048.00 MB      912.93 MB
      [1]     5245.21 MB      762.94 MB     2048.00 MB     1285.06 MB
    

    So 290 ms is the time needed to read the data from disk, populate the caches, and run the query.

    When you run a select count(not_nullable_field), the query runs against CPU only, and just the CPU caches are used:

    omnisql> select count(Altitude) from test_airplane_altitude;
    EXPR$0
    497551202
    1 rows returned.
    Execution time: 665 ms, Total time: 666 ms
    omnisql> \memory_summary
    OmniSci Server CPU Memory Summary:
                MAX            USE      ALLOCATED           FREE
        51485.61 MB     1898.01 MB     4096.00 MB     2197.99 MB
    
    OmniSci Server GPU Memory Summary:
    [GPU]            MAX            USE      ALLOCATED           FREE
      [0]     5245.21 MB        0.00 MB        0.00 MB        0.00 MB
      [1]     5245.21 MB        0.00 MB        0.00 MB        0.00 MB
    

    This behavior changes if the table has deleted records.

    So what you observed is correct: the query took so much time because of the high fragmentation of the table's metadata.

  • Sumit

    Hi @candido.dessanti,

    Thank you, this information is very valuable and useful.

    One question remains: as a system administrator, how can I get to know that a table is now highly fragmented and that it is time to shrink it (by creating a new table and dropping the old one)?

  • Candido Dessanti

    Hi @Sumit,

    I don't know if there is a direct way to get that information, but I would try the \gte command from omnisql; my rule of thumb would be: the more epochs, the more fragmented the table.

    The \gte command isn't documented, but it returns the actual epoch of the table, e.g.:

    omnisql> \gte adsb_airplanes_test
    table epoch is 53
    omnisql> insert into adsb_airplanes_test select * from adsb_airplanes_source;
    omnisql> \gte adsb_airplanes_test
    table epoch is 54
    omnisql> insert into adsb_airplanes_test select * from adsb_airplanes_source;
    omnisql> \gte adsb_airplanes_test
    table epoch is 55
    omnisql> insert into adsb_airplanes_test select * from adsb_airplanes_source;
    omnisql> \gte adsb_airplanes_test
    table epoch is 56
    omnisql> delete from adsb_airplanes_test where rowid between 7000 and 8000;
    omnisql> \gte adsb_airplanes_test
    table epoch is 57
    omnisql> delete from adsb_airplanes_test where rowid between 53000 and 53001;
    omnisql> \gte adsb_airplanes_test
    table epoch is 58
    

    As you can see, each operation that affects the number of records increments the epoch (an update statement wouldn't change the epoch).

    I did a brief test inserting 2k rows into the adsb_airplanes_test table 1k and then 2k times, raising the epoch to 3075, and I observed that around 40 small files (16 MB each), presumably containing metadata, had been created.

    This has an impact, albeit a small one, on performance while reading table metadata, as you can see here:

    2020-30-67T14:24:57.130767 I 10631 FileMgr.cpp:204 Completed Reading table's file metadata, Elapsed time : 43ms Epoch: 58 files read: 132 table location: '/opt/mapd_storage/data50/mapd_data/table_2_3'
    
    2020-30-67T14:37:48.165981 I 70584 FileMgr.cpp:204 Completed Reading table's file metadata, Elapsed time : 284ms Epoch: 1075 files read: 145 table location: '/opt/mapd_storage/data50/mapd_data/table_2_3'
    nvme2n1           0.00     0.00 10723.00    0.00 106492.00     0.00    19.86     1.51    0.14    0.14    0.00   0.02  23.60
    nvme1n1           0.00     0.00 10701.00    0.00 106172.00     0.00    19.84     1.30    0.12    0.12    0.00   0.02  22.40
    nvme0n1           0.00     0.00 10778.00    0.00 106708.00     0.00    19.80     1.41    0.13    0.13    0.00   0.02  22.00
    
    2019-06-30T20:07:53.103724 I 57426 FileMgr.cpp:204 Completed Reading table's file metadata, Elapsed time : 374ms Epoch: 3075 files read: 171 table location: '/opt/mapd_storage/data50/mapd_data/table_2_3'
    nvme2n1           0.00     0.00 5929.00    0.00 104684.00     0.00    35.31     1.42    0.22    0.22    0.00   0.03  15.60
    nvme1n1           0.00     0.00 5810.00    0.00 103744.00     0.00    35.71     1.40    0.22    0.22    0.00   0.03  15.20
    nvme0n1           0.00     0.00 5892.00    0.00 104000.00     0.00    35.30     1.34    0.21    0.21    0.00   0.03  15.6
    
    nvme2n1           0.00     0.00 7399.00    0.00 159556.00     0.00    43.13     2.23    0.32    0.32    0.00   0.02  17.60
    nvme1n1           0.00     0.00 7362.00    0.00 159704.00     0.00    43.39     2.06    0.30    0.30    0.00   0.03  19.60
    nvme0n1           0.00     0.00 7383.00    0.00 159856.00     0.00    43.30     2.19    0.31    0.31    0.00   0.03  19.20
    

    The performance isn't compromised (going from 43 ms to 374 ms isn't a big issue), but with a much larger number of metadata files the degradation in performance could be a lot bigger.

    Could you try the \gte command against one of the offending tables?
