Unable to get Output from MAPD
Hello Mapd Team
We are facing a strange issue in our MapD system whereby we are unable to get a response from MapD. When we run the query below, this is what we see on the MapD command-line terminal.
MapD version:
omnisql> \version
OmniSci Server Version: 5.10.2-20220218-4112053580

Thrift: Thu Jul 21 11:31:22 2022 TSocket::open() connect() : Connection refused
User wdbsreport connected to database wdbsreportdb
omnisql> select count(1) from WDBS_ZONE;
Thrift error: No more data to read.
Thrift connection error: No more data to read.
Retrying connection
Thrift error: No more data to read.
Thrift connection error: No more data to read.
Retrying connection
Thrift: Thu Jul 21 11:31:38 2022 TSocket::write_partial() send() : Broken pipe
Thrift error: write() send(): Broken pipe
Thrift connection error: write() send(): Broken pipe
Retrying connection
When we check the logs, we see the following:
==> omnisci_server.INFO <==
2022-07-21T11:31:29.562848 I 89496 0 4 DBHandler.cpp:2503 stdlog get_tables_for_database 7 0 wdbsreportdb calcite 128-19xu {"client"} {"tcp:localhost:11048"}
2022-07-21T11:31:29.570516 I 89496 0 5 DBHandler.cpp:2327 stdlog get_internal_table_details_for_database 8 0 wdbsreportdb calcite 128-19xu {"table_name","client"} {"WDBS_ZONE","tcp:localhost:11050"}

==> omnisci_server.INFO.20220721-113003.log <==
2022-07-21T11:31:29.562848 I 89496 0 4 DBHandler.cpp:2503 stdlog get_tables_for_database 7 0 wdbsreportdb calcite 128-19xu {"client"} {"tcp:localhost:11048"}
2022-07-21T11:31:29.570516 I 89496 0 5 DBHandler.cpp:2327 stdlog get_internal_table_details_for_database 8 0 wdbsreportdb calcite 128-19xu {"table_name","client"} {"WDBS_ZONE","tcp:localhost:11050"}

==> omnisci_server.INFO <==
2022-07-21T11:31:29.961273 I 89496 0 2 Calcite.cpp:573 Time in Thrift 19 (ms), Time in Java Calcite server 1271 (ms)
2022-07-21T11:31:29.961596 F 89496 0 2 FileMgr.cpp:1118 UNREACHABLE
2022-07-21T11:31:30.728515 I 89496 0 6 MapDServer.cpp:323 Interrupt signal (6) received.

==> omnisci_server.INFO.20220721-113003.log <==
2022-07-21T11:31:29.961273 I 89496 0 2 Calcite.cpp:573 Time in Thrift 19 (ms), Time in Java Calcite server 1271 (ms)
2022-07-21T11:31:29.961596 F 89496 0 2 FileMgr.cpp:1118 UNREACHABLE
2022-07-21T11:31:30.728515 I 89496 0 6 MapDServer.cpp:323 Interrupt signal (6) received.

==> omnisci_server.WARNING <==
2022-07-21T11:31:29.961596 F 89496 0 2 FileMgr.cpp:1118 UNREACHABLE
Any feedback/help will be much appreciated.
-
Hi @Raj_Kiran,
To get an idea: when did you start having this issue? Is the issue confined to this particular table, or to a particular database?
A select count(*) from the table wouldn't even access the data, just the metadata. What happens if you run select(field_name_nullable) from the table?
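For example (a sketch assuming the WDBS_ZONE table above and a hypothetical nullable column zone_name; substitute one of your own columns), the first query only touches chunk metadata while the second forces the column data to be read:

```
omnisql> select count(1) from WDBS_ZONE;
omnisql> select count(zone_name) from WDBS_ZONE;
```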
Thanks in advance, Candido
-
Then you can check the status of the files in the filesystem this way.
run
heavysql> show databases;
Database|Owner
omnisci|admin
adsb|admin
asof|admin
Starting from the omnisci database, which has an id of 1, count down until you reach your database. In my example I connected to the database asof, which has an id of 3.
then run
show table details WDBS_ZONE, then get the first number; that's the table_id. Then go to your data directory (typically /var/lib/omnisci) and check the status of the directory and its files with the ls command. So if the table_id is 10:
ls -la /var/lib/omnisci/data/mapd_data/table_3_10
you should get an output like this
drwxr-xr-x   2 mapd mapd      4096 lug 21 12:02 .
drwxrwxr-x 401 mapd mapd     20480 lug 21 11:53 ..
-rw-r--r--   1 mapd mapd 536870912 giu 30  2019 0.2097152.mapd
-rw-r--r--   1 mapd mapd  16777216 giu 30  2019 1.4096.mapd
-rw-r--r--   1 mapd mapd         4 giu 30  2019 epoch
-rw-rw-r--   1 mapd mapd        16 lug 21 12:02 epoch_metadata
-rw-rw-r--   1 mapd mapd         4 lug 21 12:02 filemgr_version
After that, try this command: xxd /var/lib/omnisci/data/mapd_data/table_3_10/filemgr_version
and share the output of the commands with us.
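Putting those steps together, here is a minimal sketch, assuming the default /var/lib/omnisci data directory and the example db id 3 / table id 10 from above (substitute your own values):

```bash
#!/usr/bin/env bash
# Assumed values: db id counted from "show databases" (omnisci = 1),
# table id taken from the first number of "show table details <table>".
DB_ID=3
TABLE_ID=10
DATA_DIR=/var/lib/omnisci/data/mapd_data   # default data directory

TABLE_DIR="${DATA_DIR}/table_${DB_ID}_${TABLE_ID}"

# Check the status of the directory and its files.
ls -la "${TABLE_DIR}"

# Dump the file-manager version file; a healthy one is 4 bytes
# reading "0100 0000".
xxd "${TABLE_DIR}/filemgr_version"
```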
Can I ask whether you tried an upgrade to 6.0 that failed, and then did a sort of rollback?
Regards, Candido
-
Hello @candido.dessanti
This happens only for a few tables. Sorry, we are unable to run even the show table details command, as below:
omnisql> show databases;
Database|Owner
mapd|mapd
wdbsreportdb|wdbsreport
omnisql>
omnisql> show table details WDBS_ZONE
..> ;
When we run the above commands, MapD freezes; it doesn't even allow login, and prints the error below when we try to log in from another terminal:
/opt/omnisci/bin/omnisql XXXX -u XXXXXX -p XXXXXXXX
Thrift: Thu Jul 21 16:18:01 2022 TSocket::open() connect() : Connection refused
Thrift error: No more data to read.
Thrift connection error: No more data to read.
Retrying connection
Thrift: Thu Jul 21 16:18:18 2022 TSocket::write_partial() send() : Broken pipe
Thrift error: write() send(): Broken pipe
Thrift connection error: write() send(): Broken pipe
Retrying connection
Thrift: Thu Jul 21 16:18:22 2022 TSocket::write_partial() send() : Broken pipe
Thrift error: write() send(): Broken pipe
Thrift connection error: write() send(): Broken pipe
Retrying connection
Thrift: Thu Jul 21 16:18:30 2022 TSocket::write_partial() send() : Broken pipe
Thrift error: write() send(): Broken pipe
Thrift connection error: write() send(): Broken pipe
Retrying connection
We have tried to extract the filesystem details as below:
sqlite> select dbid,name from mapd_databases;
1|mapd
2|wdbsreportdb
sqlite> select name,tableid from mapd_tables where name='WDBS_ZONE';
WDBS_ZONE|19
sqlite>
[wdbs@pcrfreporting mapd_data]$ ll|grep _19
drwxr-xr-x 2 root root 56 Feb 27  2019 DB_1_DICT_19
drwxr-xr-x 2 root root 63 May 26 11:30 table_2_19
[wdbs@pcrfreporting mapd_data]$
[wdbs@pcrfreporting DB_1_DICT_19]$ ls -la /opt/data/data/mapd_data/DB_1_DICT_19
total 8208
drwxr-xr-x   2 root root      56 Feb 27  2019 .
drwxr-xr-x 307 root root   12288 Jul 21 15:45 ..
-rw-r--r--   1 root root 4194304 Feb 27  2019 DictOffsets
-rw-r--r--   1 root root 4194304 Feb 27  2019 DictPayload
[wdbs@pcrfreporting DB_1_DICT_19]$ ls -la /opt/data/data/mapd_data/table_2_19
total 24
drwxr-xr-x   2 root root    63 May 26 11:30 .
drwxr-xr-x 307 root root 12288 Jul 21 15:45 ..
-rw-r--r--   1 root root    16 May 26 11:30 epoch_metadata
-rw-r--r--   1 root root     5 May 26 11:30 filemgr_version
[wdbs@pcrfreporting DB_1_DICT_19]$
-
We have uploaded the traces in the attachment: debug.txt (2.4 KB)
-
Well,
from what I can see here the table is empty (has it perhaps been truncated?) and the file filemgr_version is badly formed. Can you post the output of the command
xxd /opt/data/data/mapd_data/table_2_19/filemgr_version
(You can also try this: back up the directory containing the table, like this
cp -r /opt/data/data/mapd_data/table_2_19/ /opt/data/data/mapd_data/table_2_19_backup
and then run
echo -n -e '\x1\x0\x0\x0' > /opt/data/data/mapd_data/table_2_19/filemgr_version )
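A minimal sketch of that backup-and-fix, assuming the table_2_19 path above. Note that the quotes must be plain ASCII single quotes (smart quotes pasted from a browser get written into the file literally); printf is used here instead of echo -e to avoid shell differences:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Path assumed from the earlier posts; adjust db/table ids as needed.
TABLE_DIR=/opt/data/data/mapd_data/table_2_19

# Back up the whole table directory first.
cp -r "${TABLE_DIR}" "${TABLE_DIR}_backup"

# Write a 4-byte little-endian integer 1, the expected filemgr version.
printf '\x01\x00\x00\x00' > "${TABLE_DIR}/filemgr_version"

# Verify: the dump should read exactly "0100 0000".
xxd "${TABLE_DIR}/filemgr_version"
```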
-
Hello,
We have tried the suggested commands, as below:
[root@xxxxxxx ~]# cp -r /opt/data/data/mapd_data/table_2_19/ /opt/data/data/mapd_data/table_2_19_backup
[root@xxxxxxx ~]# echo -n -e ‘\x1\x0\x0\x0’ >/opt/data/data/mapd_data/table_2_19/filemgr_version
[wdbs@xxxxxxx ~]$ xxd /opt/data/data/mapd_data/table_2_19/filemgr_version
0000000: e280 9878 3178 3078 3078 30e2 8099       ...x1x0x0x0...
[wdbs@xxxxxxx ~]$
Sorry, we still have the same error.
-
Hi,
I have been able to reproduce the error by setting a negative number in filemgr_version, so with it set to 1 it's impossible to get the error.
Are you perhaps querying another table?
Could you run the command xxd /opt/data/data/mapd_data/table_2_19/filemgr_version
and then the same on another table that's working (18 maybe): xxd /opt/data/data/mapd_data/table_2_18/filemgr_version
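If it helps, a small sketch that dumps filemgr_version for every table directory of database 2, so the working and broken ones can be compared side by side (data path assumed from your earlier posts):

```bash
#!/usr/bin/env bash
# Dump filemgr_version for every table of database 2; a healthy file
# should be 4 bytes reading "0100 0000".
for f in /opt/data/data/mapd_data/table_2_*/filemgr_version; do
    echo "== ${f}"
    xxd "${f}"
done
```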
-
Hi @candido.dessanti,
Please see the attached doc with details of a working and a non-working table: debug1.txt (1.8 KB)
-
Hi @raj,
Looking at your data:
[wdbs@pcrfreporting log]$ xxd /opt/data/data/mapd_data/table_2_19/filemgr_version
0000000: e280 9878 3178 3078 3078 30e2 8099 ...x1x0x0x0...
this file looks corrupted. When you run the command
echo -n -e '\x1\x0\x0\x0' > /opt/data/data/mapd_data/table_2_19/filemgr_version
the resulting file should be 4 bytes, like this one:
0000000: 0100 0000
Have you moved the database to other disks lately? Could you try un-mounting and remounting the filesystem where your data is located?
-
Hi,
I am not sure the tables are corrupted, but it looks like the filesystem is, because when you run the echo command you should get a 4-byte file with the content 01000000, not the random bytes you are getting. You can try removing the filemgr_version of table 2_19, restarting the database, and seeing what happens.
It looks like filesystem corruption to me; maybe some SSDs are failing in some areas. It happened to me once.
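A minimal sketch of that remove-and-restart step, assuming the table_2_19 path from earlier and a systemd unit named omnisci_server (an assumption; adjust to however you start the server). The idea is that the server may recreate the file, though that is not guaranteed:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Path assumed from earlier posts in this thread.
TABLE_DIR=/opt/data/data/mapd_data/table_2_19

# Keep a copy of the suspect file outside the table directory before removing it.
cp "${TABLE_DIR}/filemgr_version" /tmp/filemgr_version.table_2_19.bad
rm "${TABLE_DIR}/filemgr_version"

# Restart the server so it can recreate the file on the next access.
# The unit name is an assumption.
sudo systemctl restart omnisci_server
```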
-
Also please see the attached txt file, where I have run xxd on the backup file I took before running the echo command, and then on the latest file on which the echo was run: debug2.txt (311 Bytes)
-
Did you run the echo on the table 2_19 file?
I'm seeing this. Backup:
[wdbs@pcrfreporting ~]$ xxd /opt/data/data/mapd_data/table_2_19/filemgr_version_210722
0000000: 0000 00ff ff                             .....
And after doing the echo:
[wdbs@pcrfreporting ~]$ xxd /opt/data/data/mapd_data/table_2_46/filemgr_version
0000000: 0100 0000
-
Hi, sorry.
Please refer to this one instead, debug3:
debug3.txt (385 Bytes)
-
Try to run
echo -n -e '\x1\x0\x0\x0' > /opt/data/data/mapd_data/table_2_19/filemgr_version
and then this on the same file:
xxd /opt/data/data/mapd_data/table_2_19/filemgr_version
The database is crashing because an unexpected value is read, and it aborts the server to limit possible corruption.
So the possible solutions are fixing the filemgr_version files with the echo -n -e '\x1\x0\x0\x0' command, or removing them and letting the system re-create them, but I'm not sure that's going to work, because the values in those files cannot have come from the software. So check your disk and the filesystem to be sure that you don't have corruption.
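For the disk and filesystem check, a generic sketch; the device and mount point below are hypothetical, so substitute whatever backs /opt/data, and only run fsck while the filesystem is unmounted:

```bash
#!/usr/bin/env bash
# Hypothetical device and mount point; replace with whatever backs /opt/data.
DEVICE=/dev/sdb1
MOUNT_POINT=/opt/data

# Recent kernel messages often surface I/O or filesystem errors first.
dmesg | grep -iE 'i/o error|ext4|xfs' | tail -n 50

# SMART health summary for the underlying disk (requires smartmontools).
sudo smartctl -H /dev/sdb

# Read-only filesystem check: unmount first; fsck -n only reports problems.
sudo umount "${MOUNT_POINT}"
sudo fsck -n "${DEVICE}"
sudo mount "${MOUNT_POINT}"   # assumes an /etc/fstab entry for the mount point
```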