Converting from UTF-8 to ISO 8859-15 in C++

Stack Overflow user
Asked on 2018-11-12 20:14:14
Answers: 1 · Views: 1K · Followers: 0 · Votes: 1

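I want to convert from UTF-8 to ISO 8859-15 in C/C++ without pulling in any additional libraries.

How can I do this?

I have already found the following code, which works for ISO 8859-1, but I don't know how to handle the differences between ISO 8859-1 and ISO 8859-15: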

std::string UTF8toISO8859_1(const char * in) {
    std::string out;
    if (in == NULL)
        return out;

    unsigned int codepoint = 0;  // initialized so a stray continuation byte never reads garbage
    while (*in != 0) {
        unsigned char ch = static_cast<unsigned char>(*in);
        if (ch <= 0x7f)
            codepoint = ch;
        else if (ch <= 0xbf)
            codepoint = (codepoint << 6) | (ch & 0x3f);
        else if (ch <= 0xdf)
            codepoint = ch & 0x1f;
        else if (ch <= 0xef)
            codepoint = ch & 0x0f;
        else
            codepoint = ch & 0x07;
        ++in;
        if (((*in & 0xc0) != 0x80) && (codepoint <= 0x10ffff)) {
            if (codepoint <= 255) {
                out.append(1, static_cast<char>(codepoint));
            }
            else {
                out.append("?");
            }
        }
    }
    return out;
}

1 Answer

Stack Overflow user

Accepted answer

Posted on 2018-11-12 21:55:59

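I like this code. It's so short. Most of it just deals with decoding multi-byte sequences into code points. Once a code point has been decoded, the conversion to ISO-8859-1 is trivial:

  • If it is less than or equal to 255, it is also a valid ISO-8859-1 character: out.append(1, static_cast<char>(codepoint));
  • If not, it cannot be represented in ISO-8859-1 and is replaced with a question mark: out.append("?");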
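
To make the decoding step concrete, here is a minimal sketch that pulls a single code point out of a UTF-8 string using the same bit masks as the code in the question. The helper name decodeFirst is hypothetical (it does not appear in the original code), and the sketch assumes well-formed input, since neither it nor the original validates the byte sequence.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not part of the original code: decodes the first
// UTF-8 sequence in `s` with the same masks as the conversion loop.
// Assumes well-formed UTF-8; malformed input is not detected.
std::uint32_t decodeFirst(const char* s) {
    if (s == nullptr || *s == 0)
        return 0;  // nothing to decode

    std::uint32_t codepoint = 0;
    do {
        unsigned char ch = static_cast<unsigned char>(*s);
        if (ch <= 0x7f)
            codepoint = ch;                              // single-byte (ASCII)
        else if (ch <= 0xbf)
            codepoint = (codepoint << 6) | (ch & 0x3f);  // continuation byte: append 6 bits
        else if (ch <= 0xdf)
            codepoint = ch & 0x1f;                       // lead byte of a 2-byte sequence
        else if (ch <= 0xef)
            codepoint = ch & 0x0f;                       // lead byte of a 3-byte sequence
        else
            codepoint = ch & 0x07;                       // lead byte of a 4-byte sequence
        ++s;
    } while ((*s & 0xc0) == 0x80);  // keep going while the next byte is a continuation
    return codepoint;
}
```

For the euro sign, encoded as the bytes E2 82 AC, the lead byte contributes 0x02, and the two continuation bytes extend it to 0x82 and then 0x20AC. That value is above 255, which is why the ISO-8859-1 version emits a question mark for it.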

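So, to make it work for ISO-8859-15, more code is needed to handle the characters that were replaced when ISO-8859-15 was introduced (see the comparison of ISO-8859-1 and ISO-8859-15). Unfortunately, this increases the code size considerably.

The following code should be easy to understand. It could be optimized for better performance if that is a major concern.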

std::string UTF8toISO8859_15(const char * in) {
    std::string out;
    if (in == NULL)
        return out;

    unsigned int codepoint = 0;  // initialized so a stray continuation byte never reads garbage
    while (*in != 0) {
        unsigned char ch = static_cast<unsigned char>(*in);
        if (ch <= 0x7f)
            codepoint = ch;
        else if (ch <= 0xbf)
            codepoint = (codepoint << 6) | (ch & 0x3f);
        else if (ch <= 0xdf)
            codepoint = ch & 0x1f;
        else if (ch <= 0xef)
            codepoint = ch & 0x0f;
        else
            codepoint = ch & 0x07;
        ++in;

        if (((*in & 0xc0) != 0x80) && (codepoint <= 0x10ffff)) {
            // a complete codepoint has been decoded; convert it to ISO-8859-15
            char outc;
            if (codepoint <= 255) {
                // codepoints up to 255 can be converted directly, with a few exceptions
                if (codepoint != 0xa4 && codepoint != 0xa6 && codepoint != 0xa8
                        && codepoint != 0xb4 && codepoint != 0xb8 && codepoint != 0xbc
                        && codepoint != 0xbd && codepoint != 0xbe) {
                    outc = static_cast<char>(codepoint);
                }
                else {
                    outc = '?';
                }
            }
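            else {
                // With a few exceptions, codepoints above 255 cannot be converted.
                // The casts avoid an implicit narrowing conversion where char is signed.
                if (codepoint == 0x20AC) {        // euro sign
                    outc = static_cast<char>(0xa4);
                }
                else if (codepoint == 0x0160) {   // S with caron
                    outc = static_cast<char>(0xa6);
                }
                else if (codepoint == 0x0161) {   // s with caron
                    outc = static_cast<char>(0xa8);
                }
                else if (codepoint == 0x017d) {   // Z with caron
                    outc = static_cast<char>(0xb4);
                }
                else if (codepoint == 0x017e) {   // z with caron
                    outc = static_cast<char>(0xb8);
                }
                else if (codepoint == 0x0152) {   // ligature OE
                    outc = static_cast<char>(0xbc);
                }
                else if (codepoint == 0x0153) {   // ligature oe
                    outc = static_cast<char>(0xbd);
                }
                else if (codepoint == 0x0178) {   // Y with diaeresis
                    outc = static_cast<char>(0xbe);
                }
                else {
                    outc = '?';
                }
            }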
            out.append(1, outc);
        }
    }
    return out;
}
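
The per-character mapping in the function above could also be written as a small lookup table, which keeps the eight positions that differ between the two encodings in one place. The following is only a sketch of that idea: the name toLatin9 and the table layout are assumptions rather than part of the original answer, and it is not claimed to be faster than the if/else chain.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not part of the original answer: maps one decoded
// code point to its ISO-8859-15 byte, or '?' if it has no representation.
char toLatin9(std::uint32_t codepoint) {
    struct Entry { std::uint32_t codepoint; unsigned char byte; };
    // The eight positions where ISO-8859-15 differs from ISO-8859-1.
    static const Entry kTable[] = {
        {0x20AC, 0xA4},  // euro sign
        {0x0160, 0xA6},  // S with caron
        {0x0161, 0xA8},  // s with caron
        {0x017D, 0xB4},  // Z with caron
        {0x017E, 0xB8},  // z with caron
        {0x0152, 0xBC},  // ligature OE
        {0x0153, 0xBD},  // ligature oe
        {0x0178, 0xBE},  // Y with diaeresis
    };
    for (const Entry& e : kTable) {
        if (e.codepoint == codepoint)
            return static_cast<char>(e.byte);  // character added in ISO-8859-15
        if (e.byte == codepoint)
            return '?';                        // Latin-1 character displaced by ISO-8859-15
    }
    // Every other code point up to 255 has the same value in both encodings.
    return codepoint <= 0xFF ? static_cast<char>(codepoint) : '?';
}
```

With such a helper, the body of the if block in the conversion loop would reduce to a single out.append(1, toLatin9(codepoint)); call.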
Votes: 2
Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei (腾讯云小微) IT-domain engine.
Original link:

https://stackoverflow.com/questions/53269432
