
Strange Java UTF-8 encoding behavior: new String(bytes, "UTF-8") gives different results on mostly similar setups

Stack Overflow user
Asked 2020-01-28 05:19:27
Answers: 1 · Views: 682 · Followers: 0 · Votes: 0

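My application needs to support Japanese characters, so we use UTF-8 as the default encoding across the whole stack. We are facing a weird problem where new String(bytes, "UTF-8") gives different results.

User input: 東京

Base64-encoded string generated by the browser and sent to the API: 5p2x5Lqs

Both systems produce the same byte array.

But only on System 1 does the decoded string come out as 東京.

On System 2 the decoded string comes out as ??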

System 1:

Container: TomEE 7.1.0

JDK: 1.8.0_201-b09

OS version: 3.10.0-957.12.2.el7.x86_64

Architecture: amd64

Locale:

[logs]$ locale
LANG=en_GB.UTF-8
LC_CTYPE="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_PAPER="en_GB.UTF-8"
LC_NAME="en_GB.UTF-8"
LC_ADDRESS="en_GB.UTF-8"
LC_TELEPHONE="en_GB.UTF-8"
LC_MEASUREMENT="en_GB.UTF-8"
LC_IDENTIFICATION="en_GB.UTF-8"
LC_ALL=
[logs]$ locale status
locale: unknown name "status"

System 2:

Container: TomEE 7.1.0

JDK: 1.8.0_201-b09

OS version: 2.6.32-696.18.7.el6.x86_64

Architecture: amd64

Locale:

[ logs]$ locale
LANG=en_GB
LC_CTYPE="en_GB"
LC_NUMERIC="en_GB"
LC_TIME="en_GB"
LC_COLLATE="en_GB"
LC_MONETARY="en_GB"
LC_MESSAGES="en_GB"
LC_PAPER="en_GB"
LC_NAME="en_GB"
LC_ADDRESS="en_GB"
LC_TELEPHONE="en_GB"
LC_MEASUREMENT="en_GB"
LC_IDENTIFICATION="en_GB"
LC_ALL=
[ logs]$ locale status
locale: unknown name "status"

Java code in use:

LogUtil.logMessage("searchString before decoding="+searchString);
                 //s =  new String (Base64.decodeBase64(searchString),utf8_test);
                 byte[] decodedBytes=Base64.getDecoder().decode(searchString);
                 byte[] decodedBaytesFromapache=org.apache.commons.codec.binary.Base64.decodeBase64(searchString);
                 System.out.println("java native array :: ");
                 for(byte b:decodedBytes)
                 {
                     System.out.print(b);
                 }
                 System.out.println("\njava apache array :: \n");
                 for(byte b:decodedBaytesFromapache)
                 {
                     System.out.print(b);
                 }
                 s=new String(decodedBytes,"UTF-8"); //Charset.forName("UTF-8") was also tried here
                 System.out.println("\n String post decode:: "+s);
                 System.out.println("");
            //String s = 
                 System.out.println("loaded charset is utf-8:: "+Charset.isSupported("UTF-8"));
                 Set<String> listOfCharsets=Charset.availableCharsets().keySet();
                 System.out.println("Listing supported charsets:: ");
                 for(String item: listOfCharsets)
                 {System.out.println(item); }

Output on System 1:

searchString before decoding=5p2x5Lqs
java native array ::
-26-99-79-28-70-84
java apache array ::

-26-99-79-28-70-84
 String post decode:: 東京

loaded charset is utf-8:: true
Listing supported charsets::
Big5
Big5-HKSCS
CESU-8
EUC-JP
EUC-KR
GB18030
GB2312
GBK
IBM-Thai
IBM00858
IBM01140
IBM01141
IBM01142
IBM01143
IBM01144
IBM01145
IBM01146
IBM01147
IBM01148
IBM01149
IBM037
IBM1026
IBM1047
IBM273
IBM277
IBM278
IBM280
IBM284
IBM285
IBM290
IBM297
IBM420
IBM424
IBM437
IBM500
IBM775
IBM850
IBM852
IBM855
IBM857
IBM860
IBM861
IBM862
IBM863
IBM864
IBM865
IBM866
IBM868
IBM869
IBM870
IBM871
IBM918
ISO-2022-CN
ISO-2022-JP
ISO-2022-JP-2
ISO-2022-KR
ISO-8859-1
ISO-8859-13
ISO-8859-15
ISO-8859-2
ISO-8859-3
ISO-8859-4
ISO-8859-5
ISO-8859-6
ISO-8859-7
ISO-8859-8
ISO-8859-9
JIS_X0201
JIS_X0212-1990
KOI8-R
KOI8-U
Shift_JIS
TIS-620
US-ASCII
UTF-16
UTF-16BE
UTF-16LE
UTF-32
UTF-32BE
UTF-32LE
UTF-8
windows-1250
windows-1251
windows-1252
windows-1253
windows-1254
windows-1255
windows-1256
windows-1257
windows-1258
windows-31j
x-Big5-HKSCS-2001
x-Big5-Solaris
x-COMPOUND_TEXT
x-euc-jp-linux
x-EUC-TW
x-eucJP-Open
x-IBM1006
x-IBM1025
x-IBM1046
x-IBM1097
x-IBM1098
x-IBM1112
x-IBM1122
x-IBM1123
x-IBM1124
x-IBM1166
x-IBM1364
x-IBM1381
x-IBM1383
x-IBM300
x-IBM33722
x-IBM737
x-IBM833
x-IBM834
x-IBM856
x-IBM874
x-IBM875
x-IBM921
x-IBM922
x-IBM930
x-IBM933
x-IBM935
x-IBM937
x-IBM939
x-IBM942
x-IBM942C
x-IBM943
x-IBM943C
x-IBM948
x-IBM949
x-IBM949C
x-IBM950
x-IBM964
x-IBM970
x-ISCII91
x-ISO-2022-CN-CNS
x-ISO-2022-CN-GB
x-iso-8859-11
x-JIS0208
x-JISAutoDetect
x-Johab
x-MacArabic
x-MacCentralEurope
x-MacCroatian
x-MacCyrillic
x-MacDingbat
x-MacGreek
x-MacHebrew
x-MacIceland
x-MacRoman
x-MacRomania
x-MacSymbol
x-MacThai
x-MacTurkish
x-MacUkraine
x-MS932_0213
x-MS950-HKSCS
x-MS950-HKSCS-XP
x-mswin-936
x-PCK
x-SJIS_0213
x-UTF-16LE-BOM
X-UTF-32BE-BOM
X-UTF-32LE-BOM
x-windows-50220
x-windows-50221
x-windows-874
x-windows-949
x-windows-950
x-windows-iso2022jp
searchString after decoding=東京

Output on System 2:

searchString before decoding=5p2x5Lqs
java native array ::
-26-99-79-28-70-84
java apache array ::

-26-99-79-28-70-84
 String post decode:: ??

loaded charset is utf-8:: true
Listing supported charsets::
Big5
Big5-HKSCS
CESU-8
EUC-JP
EUC-KR
GB18030
GB2312
GBK
IBM-Thai
IBM00858
IBM01140
IBM01141
IBM01142
IBM01143
IBM01144
IBM01145
IBM01146
IBM01147
IBM01148
IBM01149
IBM037
IBM1026
IBM1047
IBM273
IBM277
IBM278
IBM280
IBM284
IBM285
IBM290
IBM297
IBM420
IBM424
IBM437
IBM500
IBM775
IBM850
IBM852
IBM855
IBM857
IBM860
IBM861
IBM862
IBM863
IBM864
IBM865
IBM866
IBM868
IBM869
IBM870
IBM871
IBM918
ISO-2022-CN
ISO-2022-JP
ISO-2022-JP-2
ISO-2022-KR
ISO-8859-1
ISO-8859-13
ISO-8859-15
ISO-8859-2
ISO-8859-3
ISO-8859-4
ISO-8859-5
ISO-8859-6
ISO-8859-7
ISO-8859-8
ISO-8859-9
JIS_X0201
JIS_X0212-1990
KOI8-R
KOI8-U
Shift_JIS
TIS-620
US-ASCII
UTF-16
UTF-16BE
UTF-16LE
UTF-32
UTF-32BE
UTF-32LE
UTF-8
windows-1250
windows-1251
windows-1252
windows-1253
windows-1254
windows-1255
windows-1256
windows-1257
windows-1258
windows-31j
x-Big5-HKSCS-2001
x-Big5-Solaris
x-COMPOUND_TEXT
x-euc-jp-linux
x-EUC-TW
x-eucJP-Open
x-IBM1006
x-IBM1025
x-IBM1046
x-IBM1097
x-IBM1098
x-IBM1112
x-IBM1122
x-IBM1123
x-IBM1124
x-IBM1166
x-IBM1364
x-IBM1381
x-IBM1383
x-IBM300
x-IBM33722
x-IBM737
x-IBM833
x-IBM834
x-IBM856
x-IBM874
x-IBM875
x-IBM921
x-IBM922
x-IBM930
x-IBM933
x-IBM935
x-IBM937
x-IBM939
x-IBM942
x-IBM942C
x-IBM943
x-IBM943C
x-IBM948
x-IBM949
x-IBM949C
x-IBM950
x-IBM964
x-IBM970
x-ISCII91
x-ISO-2022-CN-CNS
x-ISO-2022-CN-GB
x-iso-8859-11
x-JIS0208
x-JISAutoDetect
x-Johab
x-MacArabic
x-MacCentralEurope
x-MacCroatian
x-MacCyrillic
x-MacDingbat
x-MacGreek
x-MacHebrew
x-MacIceland
x-MacRoman
x-MacRomania
x-MacSymbol
x-MacThai
x-MacTurkish
x-MacUkraine
x-MS932_0213
x-MS950-HKSCS
x-MS950-HKSCS-XP
x-mswin-936
x-PCK
x-SJIS_0213
x-UTF-16LE-BOM
X-UTF-32BE-BOM
X-UTF-32LE-BOM
x-windows-50220
x-windows-50221
x-windows-874
x-windows-949
x-windows-950
x-windows-iso2022jp
searchString after decoding=??

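The ?? is not caused by the terminal window, because both outputs were captured from the same PuTTY terminal with identical settings. The ?? is then passed to JdbcTemplate, which returns 0 results on System 2, whereas on System 1 we get the expected results. What are possible solutions to make the decoding consistent across all systems?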


1 Answer

Stack Overflow user

Answered 2020-02-25 08:31:52

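As one of the comments suggested, your problem is probably caused by your use of System.out. System.out is a PrintStream that may be using the JVM's default encoding, which may or may not be UTF-8. For more information on this issue, see the unresolved OpenJDK bug JDK-8187041, "Use UTF-8 as the default Charset". The summary of that bug report reads:

Use UTF-8 as the Java virtual machine's default charset so that APIs that depend on the default charset behave consistently across all platforms.

See also the SO question "Default character encoding for java console output".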

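Also note that the locale data differs between the two systems. For example, the system that renders the Japanese characters correctly uses LANG=en_GB.UTF-8, while the system that renders them incorrectly uses LANG=en_GB.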
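To check what the JVM actually picked up from each environment, a small diagnostic along these lines may help. This is only a sketch: the class name DefaultCharsetCheck and the helper roundTrip are made up for illustration, and the idea that a bare LANG=en_GB ends up as ISO-8859-1 inside the JVM is an assumption about System 2, not something shown in the question. The second half shows how a string that is perfectly fine in memory turns into ?? once it is encoded with a charset that has no mapping for it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DefaultCharsetCheck {

    /** Encodes text in the given charset and decodes it again (hypothetical helper). */
    static String roundTrip(String text, Charset charset) {
        // getBytes(Charset) replaces unmappable characters with the
        // charset's replacement byte, which is '?' for ISO-8859-1.
        byte[] encoded = text.getBytes(charset);
        return new String(encoded, charset);
    }

    public static void main(String[] args) {
        // 東京, written with escapes so the source file encoding does not matter
        String tokyo = "\u6771\u4EAC";

        // What the JVM derived from the environment it was started in
        System.out.println("default charset = " + Charset.defaultCharset());
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));

        // ISO-8859-1 cannot represent these characters, so each becomes '?'
        System.out.println("via ISO-8859-1  = " + roundTrip(tokyo, StandardCharsets.ISO_8859_1));

        // UTF-8 round-trips without loss
        System.out.println("via UTF-8 equal = " + roundTrip(tokyo, StandardCharsets.UTF_8).equals(tokyo));
    }
}
```

If the first line differs between System 1 and System 2 when run inside the same container, any API that silently falls back to the default charset (not only System.out) is a candidate for the difference.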

To avoid potential problems on systems where the JVM does not use UTF-8 by default, simply create your own PrintStream that explicitly uses UTF-8 for the output:

import java.io.PrintStream;
import java.nio.charset.StandardCharsets;

...

    // Write the output to a UTF-8 PrintStream:
    PrintStream ps = new PrintStream(System.out, true, StandardCharsets.UTF_8.name());
    ps.println("java native array :: ");
    // etc...
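For reference, a self-contained version of the same idea, applied to the Base64 value from the question, might look like the following. The class name Utf8ConsoleDemo and the helper decodeBase64Utf8 are hypothetical; only the PrintStream construction is taken from the snippet above.

```java
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class Utf8ConsoleDemo {

    /** Decodes a Base64 string whose payload is UTF-8 encoded text (hypothetical helper). */
    static String decodeBase64Utf8(String base64) {
        byte[] decoded = Base64.getDecoder().decode(base64);
        // Passing the Charset constant avoids any lookup by name
        return new String(decoded, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String s = decodeBase64Utf8("5p2x5Lqs");

        // Wrap System.out so characters are encoded as UTF-8 bytes
        // regardless of the JVM default charset; 'true' enables autoflush.
        PrintStream ps = new PrintStream(System.out, true, StandardCharsets.UTF_8.name());
        ps.println("String post decode:: " + s);
    }
}
```

The terminal still has to interpret the incoming bytes as UTF-8 for the characters to display; the wrapper only controls what the JVM writes.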

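Notes:

  • Creating a UTF-8 string is fine, but that alone does not guarantee that it will be rendered correctly.
  • One of the statements your code logs is loaded charset is utf-8:: true, but it logs true only because Charset.isSupported("UTF-8") returns true. Supporting a particular charset says nothing about whether it is being used (or "loaded", to borrow your term). As your output shows, you have dozens of supported charsets. What matters is that UTF-8 is actually used to render the Japanese characters.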
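One way to separate "the string is wrong" from "the string is rendered wrongly" is to log its code points using ASCII only, so that no console or log encoding can interfere. A minimal sketch, with the class name CodePointDump and the helper toCodePoints invented for illustration:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.stream.Collectors;

public class CodePointDump {

    /** Renders each code point as U+XXXX, using ASCII characters only (hypothetical helper). */
    static String toCodePoints(String text) {
        return text.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        byte[] decoded = Base64.getDecoder().decode("5p2x5Lqs");
        String s = new String(decoded, StandardCharsets.UTF_8);

        // Prints "U+6771 U+4EAC" when the decode produced the two expected characters
        System.out.println(toCodePoints(s));
    }
}
```

If System 2 prints U+6771 U+4EAC here, the decoded string is intact and the ?? is introduced later, when the string is written out. If it prints U+003F U+003F, the string really contains question marks and the problem lies before the output step.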

If changing the println() calls does not solve your problem, please update your question accordingly.

Votes: 1
Original page content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/59942542
