Java - Read UTF8 bytes from a file into a String in a system-independent way

Stack Overflow user
Asked 2015-10-08 12:34:27
1 answer · 5.2K views · 0 followers · 0 votes

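How do I accurately read a UTF8-encoded file into a String?

When I change the encoding of this .java file to UTF-8 (Eclipse > right-click App.java > Properties > Resource > Text file encoding), it works fine from within Eclipse but not from the command line. It seems Eclipse sets the file.encoding parameter when running App.

Why should the encoding of the source file have any effect on creating a String from bytes? What is the reliable way to create a String from bytes when the encoding is known? I may have files with different encodings. Once the encoding of a file is known, I must be able to read it into a String regardless of the value of file.encoding.

The contents of the UTF8 file are as follows: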

English Hello World.
Korean 안녕하세요.
Japanese 世界こんにちは。
Russian Привет мир.
German Hallo Welt.
Spanish Hola mundo.
Hindi हैलो वर्ल्ड।
Gujarati હેલો વર્લ્ડ.
Thai สวัสดีชาวโลก.

- end of file -

The code is below. My observations are in the comments.

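import java.io.BufferedWriter;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class App {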
public static void main(String[] args) {
    String slash = System.getProperty("file.separator");
    File inputUtfFile = new File("C:" + slash + "sources" + slash + "TestUtfRead" + slash + "utf8text.txt");
    File outputUtfFile = new File("C:" + slash + "sources" + slash + "TestUtfRead" + slash + "utf8text_out.txt");
    File outputUtfByteWrittenFile = new File(
            "C:" + slash + "sources" + slash + "TestUtfRead" + slash + "utf8text_byteout.txt");
    outputUtfFile.delete();
    outputUtfByteWrittenFile.delete();

    try {

        /*
         * read a utf8 text file with internationalized strings into bytes.
         * there should be no information loss here, when read into raw bytes.
         * We are sure that this file is UTF-8 encoded. 
         * Input file created using Notepad++. Text copied from Google translate.
         */
        byte[] fileBytes = readBytes(inputUtfFile);

        /*
         * Create a string from these bytes. Specify that the bytes are UTF-8 bytes.
         */
        String str = new String(fileBytes, StandardCharsets.UTF_8);

        /*
         * The console is incapable of displaying this string.
         * So we write into another file. Open in notepad++ to check.
         */
        ArrayList<String> list = new ArrayList<>();
        list.add(str);
        writeLines(list, outputUtfFile);

        /*
         * Works fine when I read bytes and write bytes. 
         * Open the other output file in notepad++ and check. 
         */
        writeBytes(fileBytes, outputUtfByteWrittenFile);

        /*
         * I am using JDK 8u60.
         * I tried running this on command line instead of eclipse. Does not work.
         * I tried using apache commons io library. Does not work. 
         *  
         * This means that new String(bytes, charset); does not work correctly. 
         * There is no real effect of specifying charset to string.
         */
    } catch (IOException e) {
        e.printStackTrace();
    }

}

public static void writeLines(List<String> lines, File file) throws IOException {
    BufferedWriter writer = null;
    OutputStreamWriter osw = null;
    OutputStream fos = null;
    try {
        fos = new FileOutputStream(file);
        osw = new OutputStreamWriter(fos);
        writer = new BufferedWriter(osw);
        String lineSeparator = System.getProperty("line.separator");
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i);
            writer.write(line);
            if (i < lines.size() - 1) {
                writer.write(lineSeparator);
            }
        }
    } catch (IOException e) {
        throw e;
    } finally {
        close(writer);
        close(osw);
        close(fos);
    }
}

public static byte[] readBytes(File file) {
    FileInputStream fis = null;
    byte[] b = null;
    try {
        fis = new FileInputStream(file);
        b = readBytesFromStream(fis);
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        close(fis);
    }
    return b;
}

public static void writeBytes(byte[] inBytes, File file) {
    FileOutputStream fos = null;
    try {
        fos = new FileOutputStream(file);
        writeBytesToStream(inBytes, fos);
        fos.flush();
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        close(fos);
    }
}

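public static void close(InputStream inStream) {
    // Guard against null: the stream is null when opening it failed.
    if (inStream != null) {
        try {
            inStream.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

public static void close(OutputStream outStream) {
    // Guard against null: the stream is null when opening it failed.
    if (outStream != null) {
        try {
            outStream.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}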

public static void close(Writer writer) {
    if (writer != null) {
        try {
            writer.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
        writer = null;
    }
}

public static long copy(InputStream readStream, OutputStream writeStream) throws IOException {
    int bytesread = -1;
    byte[] b = new byte[4096]; //4096 is default cluster size in Windows for < 2TB NTFS partitions
    long count = 0;
    bytesread = readStream.read(b);
    while (bytesread != -1) {
        writeStream.write(b, 0, bytesread);
        count += bytesread;
        bytesread = readStream.read(b);
    }
    return count;
}
public static byte[] readBytesFromStream(InputStream readStream) throws IOException {
    ByteArrayOutputStream writeStream = null;
    byte[] byteArr = null;
    writeStream = new ByteArrayOutputStream();
    try {
        copy(readStream, writeStream);
        writeStream.flush();
        byteArr = writeStream.toByteArray();
    } finally {
        close(writeStream);
    }
    return byteArr;
}
public static void writeBytesToStream(byte[] inBytes, OutputStream writeStream) throws IOException {
    ByteArrayInputStream bis = null;
    bis = new ByteArrayInputStream(inBytes);
    try {
        copy(bis, writeStream);
    } finally {
        close(bis);
    }
}
}

Edit: for @JB and everyone :)

//writeLines(list, outputUtfFile, StandardCharsets.UTF_16BE); //does not work
//writeLines(list, outputUtfFile, Charset.defaultCharset()); //does not work. 
writeLines(list, outputUtfFile, StandardCharsets.UTF_16LE); //works

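When reading bytes into a String, I need to specify the encoding of the bytes. When writing bytes from a String to a file, I need to specify the encoding of the bytes.

Once I have a String in the JVM, I no longer need to remember the encoding of the source bytes, right?

When I write to a file, it should convert the String into the machine's default charset (be it UTF8, ASCII, or cp1252). That fails. UTF16 fails too. Why do some charsets fail?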


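1 Answer

Stack Overflow user

Accepted answer

Posted 2015-10-08 12:40:29

The encoding of the Java source file is indeed irrelevant. The reading part of your code is correct (although inefficient). What is incorrect is the writing part: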

osw = new OutputStreamWriter(fos);

should be replaced by

osw = new OutputStreamWriter(fos, StandardCharsets.UTF_8);

Otherwise, you write with the platform default encoding (which doesn't seem to be UTF8 on your system) rather than with UTF8.
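To make the fix concrete, here is a minimal sketch of writing with an explicit charset and checking the round trip. The class name `ExplicitCharsetWrite`, the helper `writeUtf8`, and the temp file are assumptions made for illustration and are not part of the question's code. The sample text uses Unicode escapes so that the result does not depend on how the source file itself is compiled.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ExplicitCharsetWrite {

    // Hypothetical helper: writes text as UTF-8 regardless of file.encoding.
    static void writeUtf8(Path target, String text) throws IOException {
        try (OutputStream out = Files.newOutputStream(target);
             Writer writer = new BufferedWriter(
                     new OutputStreamWriter(out, StandardCharsets.UTF_8))) {
            writer.write(text);
        }
    }

    public static void main(String[] args) throws IOException {
        // Temp file used only for this demonstration.
        Path target = Files.createTempFile("utf8text_out", ".txt");
        // Cyrillic "Privet mir." spelled with escapes.
        String text = "Russian \u041f\u0440\u0438\u0432\u0435\u0442 \u043c\u0438\u0440.";
        writeUtf8(target, text);

        // Decode the raw bytes explicitly as UTF-8 and compare with the original.
        String readBack = new String(Files.readAllBytes(target), StandardCharsets.UTF_8);
        System.out.println("round trip equal: " + text.equals(readBack));
        Files.delete(target);
    }
}
```

Because both the writer and the decoder name the charset, the result should not change when the program is launched with a different `-Dfile.encoding` value.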

Note that Java allows forward slashes in file paths, even on Windows. You could simply write

File inputUtfFile = new File("C:/sources/TestUtfRead/utf8text.txt");
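As for the reading part being correct but inefficient, one shorter option, assuming Java 7 or later, is to let `java.nio.file.Files` load the bytes. This is only a sketch: the class name `ReadUtf8File` and the helper `readUtf8` are made up for illustration, and the default path is simply the one from the question.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class ReadUtf8File {

    // Hypothetical helper: reads the whole file and decodes it as UTF-8,
    // independent of the platform default charset.
    static String readUtf8(Path source) throws IOException {
        return new String(Files.readAllBytes(source), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Forward slashes work on Windows too; pass another path as an argument if needed.
        Path source = Paths.get(args.length > 0 ? args[0] : "C:/sources/TestUtfRead/utf8text.txt");
        String content = readUtf8(source);
        System.out.println(content.length() + " UTF-16 code units read");
    }
}
```

Two caveats: `new String(bytes, charset)` silently replaces malformed input with U+FFFD, and if the editor saved the file as UTF-8 with a byte order mark, the mark stays in the String as a leading U+FEFF character.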

EDIT:

"Once I have a String in the JVM, I no longer need to remember the encoding of the source bytes, right?"

Yes, you're right.
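A small sketch of why that holds: the same text decoded from two different byte encodings yields equal String objects, because a String carries characters and not the bytes it came from. The class and method names here are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class SourceEncodingForgotten {

    // Encodes the text two different ways, decodes each with the matching
    // charset, and reports whether the resulting Strings are equal.
    static boolean sameAfterDecoding(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        byte[] utf16le = text.getBytes(StandardCharsets.UTF_16LE);
        String fromUtf8 = new String(utf8, StandardCharsets.UTF_8);
        String fromUtf16le = new String(utf16le, StandardCharsets.UTF_16LE);
        return fromUtf8.equals(fromUtf16le);
    }

    public static void main(String[] args) {
        // "Hola" followed by two CJK characters, written with escapes.
        String text = "Hola \u4e16\u754c";
        System.out.println("equal after decoding: " + sameAfterDecoding(text));
    }
}
```

The byte arrays differ in length and content, yet the decoded Strings compare equal, so the source encoding only matters at the moment of decoding.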

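"When I write to a file, it should convert the String into the machine's default charset (be it UTF8, ASCII, or cp1252). That fails."

If you don't specify any encoding, Java will indeed use the platform default encoding to transform characters to bytes. If you specify an encoding (as suggested at the beginning of this answer), then it uses the encoding you tell it to use.

But not every encoding can represent every Unicode character the way UTF8 does. ASCII, for example, supports only 128 different characters. Cp1252, AFAIK, supports at most 256. So the encoding succeeds, but each unencodable character is replaced with a special character (I don't remember which one; typically `?` for these charsets) which means: I can't encode this Thai or Russian character because it is not part of my supported character set.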
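The replacement behaviour can be observed directly. The sketch below uses US-ASCII from `StandardCharsets` as the limited charset (cp1252 is left out only to avoid depending on which extended charsets a given JVM ships). `String.getBytes(Charset)` substitutes the charset's replacement bytes for unmappable characters, and `CharsetEncoder.canEncode` lets you check beforehand. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class UnmappableCharacters {

    // Encodes and decodes with the same charset; unmappable characters
    // are replaced during encoding and stay replaced after decoding.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    // Reports whether every character of the text can be encoded losslessly.
    static boolean canEncode(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        // "Hi " followed by three Cyrillic letters, written with escapes.
        String text = "Hi \u043c\u0438\u0440";
        System.out.println(roundTrip(text, StandardCharsets.US_ASCII)); // prints "Hi ???"
        System.out.println(canEncode(text, StandardCharsets.US_ASCII)); // prints "false"
        System.out.println(canEncode(text, StandardCharsets.UTF_8));    // prints "true"
    }
}
```

Checking `canEncode` before writing is one way to fail loudly instead of silently producing question marks.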

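UTF16 encoding should be fine. But make sure to also configure your text editor to use UTF16 when reading and displaying the content of the file. If it is configured to use another encoding, the displayed content won't be correct.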
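One detail that may explain why an editor shows one UTF16 variant correctly and not another: Java's three UTF-16 charsets produce different bytes. The sketch below dumps them for a single character; the class name and the `hex` helper are hypothetical. In Java, `UTF_16` encodes big-endian and prepends a byte order mark, while `UTF_16BE` and `UTF_16LE` write no mark, so an editor has to guess or be told the byte order.

```java
import java.nio.charset.StandardCharsets;

class Utf16ByteOrder {

    // Formats bytes as space-separated uppercase hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16)));   // FE FF 00 41
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16BE))); // 00 41
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16LE))); // 41 00
    }
}
```

Whether a particular editor detects each variant automatically is not something this sketch can establish; setting the editor's encoding explicitly, as suggested above, avoids the guesswork.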

Votes: 6
Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud Xiaowei's IT-domain engine.
Original link:

https://stackoverflow.com/questions/33015959
