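My application needs to support Japanese characters, so we use UTF-8 as the default encoding across the whole stack. We are facing a weird issue where new String(bytes, "UTF-8") gives different results on two systems.
User input: 東京
Base64-encoded string generated by the browser and sent to the API: 5p2x5Lqs
Both systems produce the same byte array.
But only on System 1 does the decoded string come out as 東京.
On System 2, the decoded string comes out as ??
System 1:
Container: TomEE 7.1.0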
JDK: 1.8.0_201-b09
OS version: 3.10.0-957.12.2.el7.x86_64
Architecture: amd64
Locale:
[logs]$ locale
LANG=en_GB.UTF-8
LC_CTYPE="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_PAPER="en_GB.UTF-8"
LC_NAME="en_GB.UTF-8"
LC_ADDRESS="en_GB.UTF-8"
LC_TELEPHONE="en_GB.UTF-8"
LC_MEASUREMENT="en_GB.UTF-8"
LC_IDENTIFICATION="en_GB.UTF-8"
LC_ALL=
[logs]$ locale status
locale: unknown name "status"
System 2:
Container: TomEE 7.1.0
JDK: 1.8.0_201-b09
OS version: 2.6.32-696.18.7.el6.x86_64
Architecture: amd64
Locale:
[ logs]$ locale
LANG=en_GB
LC_CTYPE="en_GB"
LC_NUMERIC="en_GB"
LC_TIME="en_GB"
LC_COLLATE="en_GB"
LC_MONETARY="en_GB"
LC_MESSAGES="en_GB"
LC_PAPER="en_GB"
LC_NAME="en_GB"
LC_ADDRESS="en_GB"
LC_TELEPHONE="en_GB"
LC_MEASUREMENT="en_GB"
LC_IDENTIFICATION="en_GB"
LC_ALL=
[ logs]$ locale status
locale: unknown name "status"
Java code in use:
LogUtil.logMessage("searchString before decoding=" + searchString);
//s = new String (Base64.decodeBase64(searchString),utf8_test);
byte[] decodedBytes = Base64.getDecoder().decode(searchString);
byte[] decodedBaytesFromapache = org.apache.commons.codec.binary.Base64.decodeBase64(searchString);
System.out.println("java native array :: ");
for (byte b : decodedBytes)
{
    System.out.print(b);
}
System.out.println("\njava apache array :: \n");
for (byte b : decodedBaytesFromapache)
{
    System.out.print(b);
}
s = new String(decodedBytes, "UTF-8"); //Charset.forName("UTF-8") was also tried here
System.out.println("\n String post decode:: " + s);
System.out.println("");
//String s =
System.out.println("loaded charset is utf-8:: " + Charset.isSupported("UTF-8"));
Set<String> listOfCharsets = Charset.availableCharsets().keySet();
System.out.println("Listing supported charsets:: ");
for (String item : listOfCharsets)
{ System.out.println(item); }
Output on System 1:
searchString before decoding=5p2x5Lqs
java native array ::
-26-99-79-28-70-84
java apache array ::
-26-99-79-28-70-84
String post decode:: 東京
loaded charset is utf-8:: true
Listing supported charsets::
Big5
Big5-HKSCS
CESU-8
EUC-JP
EUC-KR
GB18030
GB2312
GBK
IBM-Thai
IBM00858
IBM01140
IBM01141
IBM01142
IBM01143
IBM01144
IBM01145
IBM01146
IBM01147
IBM01148
IBM01149
IBM037
IBM1026
IBM1047
IBM273
IBM277
IBM278
IBM280
IBM284
IBM285
IBM290
IBM297
IBM420
IBM424
IBM437
IBM500
IBM775
IBM850
IBM852
IBM855
IBM857
IBM860
IBM861
IBM862
IBM863
IBM864
IBM865
IBM866
IBM868
IBM869
IBM870
IBM871
IBM918
ISO-2022-CN
ISO-2022-JP
ISO-2022-JP-2
ISO-2022-KR
ISO-8859-1
ISO-8859-13
ISO-8859-15
ISO-8859-2
ISO-8859-3
ISO-8859-4
ISO-8859-5
ISO-8859-6
ISO-8859-7
ISO-8859-8
ISO-8859-9
JIS_X0201
JIS_X0212-1990
KOI8-R
KOI8-U
Shift_JIS
TIS-620
US-ASCII
UTF-16
UTF-16BE
UTF-16LE
UTF-32
UTF-32BE
UTF-32LE
UTF-8
windows-1250
windows-1251
windows-1252
windows-1253
windows-1254
windows-1255
windows-1256
windows-1257
windows-1258
windows-31j
x-Big5-HKSCS-2001
x-Big5-Solaris
x-COMPOUND_TEXT
x-euc-jp-linux
x-EUC-TW
x-eucJP-Open
x-IBM1006
x-IBM1025
x-IBM1046
x-IBM1097
x-IBM1098
x-IBM1112
x-IBM1122
x-IBM1123
x-IBM1124
x-IBM1166
x-IBM1364
x-IBM1381
x-IBM1383
x-IBM300
x-IBM33722
x-IBM737
x-IBM833
x-IBM834
x-IBM856
x-IBM874
x-IBM875
x-IBM921
x-IBM922
x-IBM930
x-IBM933
x-IBM935
x-IBM937
x-IBM939
x-IBM942
x-IBM942C
x-IBM943
x-IBM943C
x-IBM948
x-IBM949
x-IBM949C
x-IBM950
x-IBM964
x-IBM970
x-ISCII91
x-ISO-2022-CN-CNS
x-ISO-2022-CN-GB
x-iso-8859-11
x-JIS0208
x-JISAutoDetect
x-Johab
x-MacArabic
x-MacCentralEurope
x-MacCroatian
x-MacCyrillic
x-MacDingbat
x-MacGreek
x-MacHebrew
x-MacIceland
x-MacRoman
x-MacRomania
x-MacSymbol
x-MacThai
x-MacTurkish
x-MacUkraine
x-MS932_0213
x-MS950-HKSCS
x-MS950-HKSCS-XP
x-mswin-936
x-PCK
x-SJIS_0213
x-UTF-16LE-BOM
X-UTF-32BE-BOM
X-UTF-32LE-BOM
x-windows-50220
x-windows-50221
x-windows-874
x-windows-949
x-windows-950
x-windows-iso2022jp
searchString after decoding=東京
Output on System 2:
searchString before decoding=5p2x5Lqs
java native array ::
-26-99-79-28-70-84
java apache array ::
-26-99-79-28-70-84
String post decode:: ??
loaded charset is utf-8:: true
Listing supported charsets::
Big5
Big5-HKSCS
CESU-8
EUC-JP
EUC-KR
GB18030
GB2312
GBK
IBM-Thai
IBM00858
IBM01140
IBM01141
IBM01142
IBM01143
IBM01144
IBM01145
IBM01146
IBM01147
IBM01148
IBM01149
IBM037
IBM1026
IBM1047
IBM273
IBM277
IBM278
IBM280
IBM284
IBM285
IBM290
IBM297
IBM420
IBM424
IBM437
IBM500
IBM775
IBM850
IBM852
IBM855
IBM857
IBM860
IBM861
IBM862
IBM863
IBM864
IBM865
IBM866
IBM868
IBM869
IBM870
IBM871
IBM918
ISO-2022-CN
ISO-2022-JP
ISO-2022-JP-2
ISO-2022-KR
ISO-8859-1
ISO-8859-13
ISO-8859-15
ISO-8859-2
ISO-8859-3
ISO-8859-4
ISO-8859-5
ISO-8859-6
ISO-8859-7
ISO-8859-8
ISO-8859-9
JIS_X0201
JIS_X0212-1990
KOI8-R
KOI8-U
Shift_JIS
TIS-620
US-ASCII
UTF-16
UTF-16BE
UTF-16LE
UTF-32
UTF-32BE
UTF-32LE
UTF-8
windows-1250
windows-1251
windows-1252
windows-1253
windows-1254
windows-1255
windows-1256
windows-1257
windows-1258
windows-31j
x-Big5-HKSCS-2001
x-Big5-Solaris
x-COMPOUND_TEXT
x-euc-jp-linux
x-EUC-TW
x-eucJP-Open
x-IBM1006
x-IBM1025
x-IBM1046
x-IBM1097
x-IBM1098
x-IBM1112
x-IBM1122
x-IBM1123
x-IBM1124
x-IBM1166
x-IBM1364
x-IBM1381
x-IBM1383
x-IBM300
x-IBM33722
x-IBM737
x-IBM833
x-IBM834
x-IBM856
x-IBM874
x-IBM875
x-IBM921
x-IBM922
x-IBM930
x-IBM933
x-IBM935
x-IBM937
x-IBM939
x-IBM942
x-IBM942C
x-IBM943
x-IBM943C
x-IBM948
x-IBM949
x-IBM949C
x-IBM950
x-IBM964
x-IBM970
x-ISCII91
x-ISO-2022-CN-CNS
x-ISO-2022-CN-GB
x-iso-8859-11
x-JIS0208
x-JISAutoDetect
x-Johab
x-MacArabic
x-MacCentralEurope
x-MacCroatian
x-MacCyrillic
x-MacDingbat
x-MacGreek
x-MacHebrew
x-MacIceland
x-MacRoman
x-MacRomania
x-MacSymbol
x-MacThai
x-MacTurkish
x-MacUkraine
x-MS932_0213
x-MS950-HKSCS
x-MS950-HKSCS-XP
x-mswin-936
x-PCK
x-SJIS_0213
x-UTF-16LE-BOM
X-UTF-32BE-BOM
X-UTF-32LE-BOM
x-windows-50220
x-windows-50221
x-windows-874
x-windows-949
x-windows-950
x-windows-iso2022jp
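searchString after decoding=??
The ?? is not caused by the terminal window, because both outputs were captured from the same PuTTY terminal with identical settings. The ?? is then passed to JdbcTemplate, which returns 0 results on System 2, whereas on System 1 we get the expected results. What is a possible solution to make the decoding consistent across all systems?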
Posted on 2020-02-25 08:31:52
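As one of the comments suggested, your problem is probably caused by your use of System.out. The field System.out is a PrintStream, which uses the JVM's default encoding, and that may or may not be UTF-8. For more information on this issue, see the OpenJDK bug JDK-8187041, "Use UTF-8 as default Charset" (unresolved at the time of writing). The summary of that bug report reads:
Use UTF-8 as the Java virtual machine's default charset, so that APIs that depend on the default charset behave consistently across all platforms.
See also the SO question "Default character encoding for java console output".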
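Also note that the locale data differs between your two systems: the system where the Japanese characters render correctly has LANG=en_GB.UTF-8, while the system where they render incorrectly has LANG=en_GB.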
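To see which encoding each JVM actually picked up, a minimal diagnostic sketch like the following could be run on both systems. The class name DefaultCharsetCheck and the describe() helper are hypothetical names chosen for illustration; the sketch only reports the values and does not change anything.

```java
import java.nio.charset.Charset;

public class DefaultCharsetCheck {
    // Hypothetical helper: reports the encoding-related settings this JVM is using.
    static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        // The text printed here is plain ASCII, so it is readable
        // whatever encoding System.out happens to use.
        System.out.println(describe());
    }
}
```

If the two systems report different default charsets, that would be consistent with the difference in LANG shown above, although this sketch alone does not prove that the locale is the cause.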
To avoid possible problems on systems where the JVM does not use UTF-8 by default, simply create your own PrintStream that explicitly uses UTF-8 for the output:
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;
...
// Write the output to a UTF-8 PrintStream:
PrintStream ps = new PrintStream(System.out, true, StandardCharsets.UTF_8.name());
ps.println("java native array :: ");
// etc...
Notes:
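Your output shows "loaded charset is utf-8:: true", but it only logs true because Charset.isSupported("UTF-8") returns true. The fact that a particular charset is supported says nothing about whether it is being used (or "loaded", to borrow your term). As your output shows, you have dozens of supported charsets. The point is to actually use UTF-8 when rendering the Japanese characters.
If changing the println() calls does not solve your problem, please update your question accordingly.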
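One way to check whether the decoded String itself is correct, independently of how the console renders it, is to print its Unicode code points as plain ASCII. The following is a minimal sketch under that assumption; the class name DecodeCheck and the codePoints() helper are hypothetical and not part of the original code.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class DecodeCheck {
    // Hypothetical helper: formats each code point as U+XXXX, separated by spaces.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Same Base64 input as in the question.
        byte[] decoded = Base64.getDecoder().decode("5p2x5Lqs");
        String s = new String(decoded, StandardCharsets.UTF_8);
        // ASCII-only output, so the console encoding cannot distort it.
        System.out.println(codePoints(s)); // prints "U+6771 U+4EAC"
    }
}
```

If both systems print U+6771 U+4EAC, the decoding is consistent and only the output encoding differs. In that case the ?? seen on the console would not by itself explain the empty JdbcTemplate result, which would need to be investigated separately.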
https://stackoverflow.com/questions/59942542