Q: Reading a UTF-8 file into a UCS-4 string

Stack Overflow user

Asked 2016-01-27 03:29:38

1 answer, 1.1K views, 0 followers, 3 votes

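I am trying to read a UTF-8 encoded file into a UTF-32 (UCS-4) string. Basically, I want a fixed-size character type inside the application.

Here I want to make sure the translation is done as part of the stream processing (because that is what the locale is supposed to be used for). Alternative questions have been posted that do the conversion on the string, but that is wasteful: you have to do a conversion phase in memory and then make a second pass to send the result to a stream. By doing it with the locale in the stream you only need a single pass, and there is no need to make a copy (assuming you want to keep a copy of the original).

This is what I tried: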

#include <iostream>
#include <fstream>
#include <locale>
#include <codecvt>

int main()
{
    std::locale     converter(std::locale(), new std::codecvt_utf8<char32_t>);
    std::basic_ifstream<char32_t>   iFile;
    iFile.imbue(converter);
    iFile.open("test.data");

    std::u32string     line;
    while(std::getline(iFile, line))
    {
    }
}

Since these are all standard types, I was surprised to get this compile error:

/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/istream:275:41:
error: no matching function for call to 'use_facet'

            const ctype<_CharT>& __ct = use_facet<ctype<_CharT> >(__is.getloc());
                                        ^~~~~~~~~~~~~~~~~~~~~~~~~

Compiled with:

g++ -std=c++14 test.cpp

c++

utf-8

ucs-4

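1 Answer

Stack Overflow user

Accepted answer

Answered 2016-01-27 20:08:18

It seems char32_t is not what I want. Simply switching to wchar_t worked for me. I suspect this only behaves the way I want on Linux-like systems, and that on Windows this conversion would produce UTF-16 (UCS-2) instead (but I cannot test that).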

#include <algorithm>
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>

int main()
{
    // Locale whose codecvt facet converts between UTF-8 bytes and wchar_t.
    std::locale           utf8_to_utf32(std::locale(), new std::codecvt_utf8<wchar_t>);

    // Input stream reads UTF-8 and converts to a UTF-32 (UCS-4) string
    std::wifstream        iFile("test.data");
    iFile.imbue(utf8_to_utf32);

    // Output stream converts the UTF-32 (UCS-4) string back to UTF-8
    std::wofstream        oFile("test.res");
    oFile.imbue(utf8_to_utf32);

    // Now just read like you would normally.
    std::wstring     line;
    while(std::getline(iFile, line))
    {
        // UTF-32 characters are fixed size.
        // So reverse is simple: just do it in-place.
        std::reverse(std::begin(line), std::end(line));

        // UTF-32 unfortunately also has grapheme clusters (these are groups of characters
        // that are displayed as a single glyph). By doing the reverse above we have split
        // these incorrectly. We need to do a second pass to reverse the characters inside
        // each cluster. This is beyond the scope of this question and left as an exercise
        // (but I may come back to it later).
        oFile << line << "\n";
    }
}

A comment above suggested that this would be slower than reading the data first and then converting it inline. So I ran some tests:

// read1.cpp: translation done in the stream, using codecvt and a locale

#include <algorithm>
#include <iostream>
#include <fstream>
#include <locale>
#include <codecvt>
#include <string>


int main()
{
    std::locale           utf8_to_utf32(std::locale(), new std::codecvt_utf8<wchar_t>);

    std::wifstream        iFile("test.data");
    iFile.imbue(utf8_to_utf32);

    std::wofstream        oFile("test.res1");
    oFile.imbue(utf8_to_utf32);

    std::wstring     line;
    while(std::getline(iFile, line))
    {
        std::reverse(std::begin(line), std::end(line));
        oFile << line << "\n";
    }
}

// read2.cpp: translation done after reading, using codecvt

#include <algorithm>
#include <iostream>
#include <fstream>
#include <locale>
#include <codecvt>
#include <string>

int main()
{
    std::ifstream        iFile("test.data");
    std::ofstream        oFile("test.res2");

    std::wstring_convert<std::codecvt_utf8<wchar_t>> utf8_to_utf32;

    std::string     line;
    std::wstring    wideline;
    while(std::getline(iFile, line))
    {
        wideline = utf8_to_utf32.from_bytes(line);
        std::reverse(std::begin(wideline), std::end(wideline));
        oFile << utf8_to_utf32.to_bytes(wideline) << "\n";
    }
}

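// read3.cpp: working on the UTF-8 bytes directly

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <string>
#include <fstream>

// A byte starts a character unless it is a continuation byte (0x80..0xBF).
static bool is_lead(std::uint8_t ch) { return ch < 0x80 || ch >= 0xc0; }

/* Reverse a utf-8 string in-place */
void reverse_utf8(std::string& s) {
  // Pass 1: reverse all bytes; each multi-byte character is now backwards.
  std::reverse(s.begin(), s.end());
  // Pass 2: re-reverse each run that ends at a lead byte.
  for (auto p = s.begin(), end = s.end(); p != end; ) {
    auto q = p;
    p = std::find_if(p, end, is_lead);
    if (p != end) ++p;   // guard against stray continuation bytes at the end
    std::reverse(q, p);
  }
}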

int main(int argc, char** argv)
{
    std::ifstream        iFile("test.data");
    std::ofstream        oFile("test.res3");

    std::string     line;
    while(std::getline(iFile, line))
    {
        reverse_utf8(line);
        oFile << line << "\n";
    }
    return 0;
}

The test file is 58M of Japanese Unicode text.

> ls -lah test.data
-rw-r--r--  1 loki  staff    58M Jan 28 11:28 test.data

> g++ -O3 -std=c++14 read1.cpp -o a1
> g++ -O3 -std=c++14 read2.cpp -o a2
> g++ -O3 -std=c++14 read3.cpp -o a3
>
> # This is the one using Locale in stream
> time ./a1

real    0m0.645s
user    0m0.521s
sys     0m0.108s
>
> # This is the one doing translation after reading.
> time ./a2

real    0m1.058s
user    0m0.916s
sys     0m0.123s
>
> # This is the one using UTF-8
> time ./a3

real    0m0.785s
user    0m0.663s
sys     0m0.104s

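Doing the conversion in the stream is faster, but not dramatically so (and this was not a huge amount of data): 0.645s, versus 1.058s for converting after reading and 0.785s for working on the UTF-8 directly. So pick the one that is easiest to read.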

Votes: 1

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/35028339
