Why is Chinese text in query results garbled in DBeaver?

Comments

5 comments

  • Candido Dessanti

    Hi,

    I believe it depends on how the data was encoded when it was ingested into the database. If it was encoded as UTF-8, everything should be displayed correctly. However, if a legacy encoding was used, you may encounter incorrect output. You can adjust the encoding settings in DBeaver. In my installation, you can select UTF-8, UTF-16, and a few others.

    Here's a small example with some UTF-encoded data inserted into the database:

     

    If you have a file with a legacy encoding like BIG5, you can use the iconv utility to convert it into UTF-8.

    When you load the original file with the legacy encoding into the database, you may encounter incorrect output due to the source file's encoding. For example:

    head -n 5 adminbk1.txt
    "id","title (WG)","title (Pinyin)","title (English)","title (Chinese)","author","Boundary","Name","Code","Other (specify)","Period","# of Pages","Pub_Info","Location","Call #","ISBN","Language","Description"
    1,"Ching tai ti li yen ko piao","Qing dai di li yan ge biao",,"�� �� �� �� �� �� ��","Zhao, Quan-cheng ( �� Ȫ  �� )",1,1,0,,"Qing Dynasty","204","�� �� �� �� �� 1940 �� �� �� �� ��, 1979 ��","UW East Asian Library","DS755 .S532 v.628",,"Chinese","China - historical - geography;   China - Administrative - and - political -divisions.   It  contains    descriptions   and   charts   about  the   administrative   boundary changes   in   Qing   Danasty."
    2,"Chung-kuo shih hsien shou tse","Zhong guo shi xian shou ce",,"�� �� �� �� �� ��","Wang, Yueh",0,1,0,,"-1986","641","�� �� ʡ �� �� �� �� �� 1987","UW East Asian Library","JS7351 A3 C59 1987",,"Chinese","China - administrative - and - political - divisions.    It   contains   the   name, geography,   and   other   information   about   Chinese   cities   and   counties   up   to 1986."
    3,"Ko sheng chu yu yen ko i lan piao","Ge sheng qu yu yan ge yi lan biao",,"�� ʡ �� �� �� �� һ �� ��",,0,1,0,,,"47","�� �� �� �� ӡ �� �� 1914","UW East Asian Library","DS737 .H7",,"Chinese","Names - geographical - China;   China - administrative - and - political - divisions.  Colophon  title,  errata  slip  inserted.    Changes   of   county   names   and   provincial   names."
    4,,"Zhong hua ren min gong he guo xing zheng qu hua jian ce","Simplified handbook on administrative divisions of the People's Republic of China,1977","�� �� �� �� �� �� �� �� �� �� �� �� ��","�� �� ��",0,1,0,,"-1976","154","Arlington, Va: Joint Publications Research Services.  Sold by NTIS, 1978.","UW East Asian Library","JS7351 .A3 1978",,"Chinese/English","The  report  contains   a   breakdown   of   all   administrative   divisions   of   the   PRC at   county   level   and   above   throughout  the  country.    It is   a   translation   of the   Chinese   version."

    However, by using iconv or similar utilities to convert the file to UTF-8 and then loading the converted file into the database, the data will be correctly displayed when queried. The conversion process ensures that the data is in the correct character encoding for proper display.

    mapd@zion-tr:~$ head -n 5 adminbk1.txt | iconv -f BIG-5 -t UTF-8 
    "id","title (WG)","title (Pinyin)","title (English)","title (Chinese)","author","Boundary","Name","Code","Other (specify)","Period","# of Pages","Pub_Info","Location","Call #","ISBN","Language","Description"
    1,"Ching tai ti li yen ko piao","Qing dai di li yan ge biao",," 測 華 燴 朓 賂 桶","Zhao, Quan-cheng ( 梊   割 )",1,1,0,,"Qing Dynasty","204","恅 漆 堤 唳 扦 1940 爛 唳 腔 婬 唳, 1979 爛","UW East Asian Library","DS755 .S532 v.628",,"Chinese","China - historical - geography;   China - Administrative - and - political -divisions.   It  contains    descriptions   and   charts   about  the   administrative   boundary changes   in   Qing   Danasty."
    2,"Chung-kuo shih hsien shou tse","Zhong guo shi xian shou ce",,"笢 弊 庈 瓮 忒 聊","Wang, Yueh",0,1,0,,"-1986","641","涳 蔬 吽 諒 郤 堤 唳 扦 1987","UW East Asian Library","JS7351 A3 C59 1987",,"Chinese","China - administrative - and - political - divisions.    It   contains   the   name, geography,   and   other   information   about   Chinese   cities   and   counties   up   to 1986."
    3,"Ko sheng chu yu yen ko i lan piao","Ge sheng qu yu yan ge yi lan biao",,"跪 吽  郖 朓 賂 珨 擬 桶",,0,1,0,,,"47","奻 漆 妀 昢 荂 抎 奩 1914","UW East Asian Library","DS737 .H7",,"Chinese","Names - geographical - China;   China - administrative - and - political - divisions.  Colophon  title,  errata  slip  inserted.    Changes   of   county   names   and   provincial   names."
    4,,"Zhong hua ren min gong he guo xing zheng qu hua jian ce","Simplified handbook on administrative divisions of the People's Republic of China,1977","笢 貌  鏍 僕 睿 弊 俴 淉  赫 潠 聊","囀 昢 窒",0,1,0,,"-1976","154","Arlington, Va: Joint Publications Research Services.  Sold by NTIS, 1978.","UW East Asian Library","JS7351 .A3 1978",,"Chinese/English","The  report  contains   a   breakdown   of   all   administrative   divisions   of   the   PRC at   county   level   and   above   throughout  the  country.    It is   a   translation   of the   Chinese   version."
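
The effect of the source encoding can be reproduced in memory. The following is a minimal Python sketch, not part of the original reply: the sample string and variable names are illustrative assumptions. It encodes a short Chinese string as GB2312 and decodes the same bytes three ways. Decoding as UTF-8 produces replacement characters, decoding with a different legacy codec can succeed silently and still give the wrong characters, and only the matching codec recovers the text.

```python
# Illustrative sample value ("China"); any Chinese text would do.
text = "中国"
raw = text.encode("gb2312")  # bytes as a GB2312 source file would store them

# Reading GB2312 bytes as UTF-8: invalid sequences become U+FFFD.
as_utf8 = raw.decode("utf-8", errors="replace")

# Reading them with another legacy codec: no replacement characters are
# required for the result to be wrong, it is simply different text.
as_big5 = raw.decode("big5", errors="replace")

# Reading them with the codec they were written in recovers the text.
as_gb2312 = raw.decode("gb2312")

print(repr(as_utf8))
print(repr(as_big5))
print(as_gb2312)
```

This is why the value passed to iconv's -f option has to match the encoding the file was actually written in.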

    If you have any further questions or need assistance, please feel free to ask.

    Regards,
    Candido

  • jieguo
    The source looks just like this, but I see the same problem with SQuirreL.


  • jieguo

    How do I adjust the encoding settings in DBeaver and SQuirreL?

    The source database character set is GB2312. Thanks!
  • jieguo
    I have solved the problem. Thanks a lot!
    You need to make sure that the exported CSV file is in UTF-8 format:
    heavyai@node13:/var/lib/heavyai/storage/import/sample_datasets$ enca -L zh_CN sid_latn1.csv 
    Simplified Chinese National Standard; GB2312
    heavyai@node13:/var/lib/heavyai/storage/import/sample_datasets$ enca -L zh_CN -x UTF-8 < sid_latn1.csv > sid_latn2.csv
    heavyai@node13:/var/lib/heavyai/storage/import/sample_datasets$ enca -L zh_CN sid_latn2.csv 
    Universal transformation format 8 bits; UTF-8
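
Where enca is not installed, a rough check can be sketched in Python. This is an assumption-laden heuristic rather than a replacement for enca: the function name guess_encoding and the candidate list are hypothetical, and it simply returns the first codec that decodes the bytes without errors. Because many byte sequences are valid in more than one legacy encoding, the order of the candidates matters and the result is only a hint.

```python
from typing import Iterable, Optional


def guess_encoding(data: bytes,
                   candidates: Iterable[str] = ("utf-8", "gb2312", "big5")) -> Optional[str]:
    """Return the first candidate codec that decodes data without errors."""
    for name in candidates:
        try:
            data.decode(name)  # strict decoding raises on invalid bytes
        except UnicodeDecodeError:
            continue
        return name
    return None


# Stand-in for the bytes of an exported CSV file (illustrative sample).
sample = "中国".encode("gb2312")
print(guess_encoding(sample))

# The same text after conversion to UTF-8.
converted = sample.decode("gb2312").encode("utf-8")
print(guess_encoding(converted))
```

Trying UTF-8 first is deliberate: text in a legacy multi-byte encoding rarely happens to be valid UTF-8, so a successful strict UTF-8 decode is a reasonably strong signal.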
  • Candido Dessanti

    Hi @jieguo,

    What matters is not what is set in your environment, but the encoding used by your terminal.

    So if I set my terminal this way, the characters are displayed as expected.

    However, almost every tool uses UTF-8 by default, so it is safer to convert data from legacy encodings like BIG5 or GB2312 to UTF-8 with a tool like iconv and load the converted file.

    e.g.

    mapd@zion-tr:~$ iconv -c -f GB2312 -t utf-8 -o adminbk1_202309151038_utf.csv adminbk1_202309151038.csv
    mapd@zion-tr:~$ /opt/heavyai/bin/heavysql -p HyperInteractive
    User admin connected to database heavyai
    heavysql> truncate table adminbk1;
    heavysql> copy adminbk1 from '/home/mapd/adminbk1_202309151038_utf.csv' with (header='true');
    Result
    Loaded: 73 recs, Rejected: 0 recs in 0.156000 secs

    Now using Squirrel, DBeaver, or other tools, you'll get the proper output.
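
If iconv is not available, the conversion step before the COPY can be approximated in Python. The sketch below is an assumption, not the method used above: convert_to_utf8 and the temporary file names are hypothetical, and errors="ignore" only roughly mirrors what iconv -c does when it drops characters it cannot convert.

```python
import tempfile
from pathlib import Path


def convert_to_utf8(src, dst, source_encoding="gb2312", drop_invalid=True):
    """Re-encode a text file from source_encoding to UTF-8.

    With drop_invalid=True, undecodable bytes are skipped instead of
    raising an error (roughly comparable to `iconv -c`).
    """
    errors = "ignore" if drop_invalid else "strict"
    # newline="" leaves the original line endings untouched.
    with open(src, "r", encoding=source_encoding, errors=errors, newline="") as fin, \
         open(dst, "w", encoding="utf-8", newline="") as fout:
        for line in fin:
            fout.write(line)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "sample_gb2312.csv"  # hypothetical file names
        dst = Path(tmp) / "sample_utf8.csv"
        src.write_bytes('1,"中国"\n'.encode("gb2312"))
        convert_to_utf8(src, dst)
        print(dst.read_text(encoding="utf-8"))
```

Silently dropping bytes can hide a wrong source_encoding, so running once with drop_invalid=False is a reasonable way to confirm the file really is in the encoding you expect.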


    AFAIK, there isn't an option to adjust the output encoding in DBeaver or SQuirreL SQL. I'm not a developer or maintainer of those tools, so I may be wrong; if you don't want to convert your data, you can ask their developers whether there is a way to do it.
    (I also tried other tools, and I wasn't able to change the charset encoding of the output.)

    Regards,
    Candido
