Q: Manually encoding a Unicode code point to UTF-8

Code Review user

Asked 2016-12-11 13:02:51

Answers: 2, Views: 864, Followers: 0, Votes: 4

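I want to manually encode Unicode code points to UTF-8, and I have written the following C# code. I tested it with a few examples I know, but I would like to know whether it is correct for all inputs. I know that Unicode code points are undefined beyond 0x10FFFF, but I don't care about that. As a result, the output of my method may exceed 4 bytes.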

private byte[] CodePointToUtf8 (int codepoint)
{
    if (codepoint < 0x80) {
        return new byte[]{ 
            (byte)(codepoint) 
        };
    } else if (codepoint < 0x800) {         
        return new byte[]{ 
            (byte)(0xC0 | (codepoint << 21 >> 27)), 
            (byte)(0x80 | (codepoint << 26 >> 26))
        };
    } else if (codepoint < 0x10000) {
        return new byte[] {
            (byte)(0xE0 | (codepoint << 16 >> 28)),
            (byte)(0x80 | (codepoint << 20 >> 26)) ,
            (byte)(0x80 | (codepoint << 26 >> 26))
        };
    } else if (codepoint < 0x200000) {
        return new byte[] {
            (byte)(0xF0 | (codepoint << 11 >> 29)),
            (byte)(0x80 | (codepoint << 14 >> 26)),
            (byte)(0x80 | (codepoint << 20 >> 26)) ,
            (byte)(0x80 | (codepoint << 26 >> 26))
        };
    } else if (codepoint < 0x4000000) {
        return new byte[] {
            (byte)(0xF8 | (codepoint << 6 >> 30)),
            (byte)(0x80 | (codepoint << 8 >> 26)),
            (byte)(0x80 | (codepoint << 14 >> 26)),
            (byte)(0x80 | (codepoint << 20 >> 26)) ,
            (byte)(0x80 | (codepoint << 26 >> 26))
        };
    } else {
        return new byte[] {
            (byte)(0xFC | (codepoint << 1 >> 31)),
            (byte)(0x80 | (codepoint << 2 >> 26)),
            (byte)(0x80 | (codepoint << 8 >> 26)),
            (byte)(0x80 | (codepoint << 14 >> 26)),
            (byte)(0x80 | (codepoint << 20 >> 26)) ,
            (byte)(0x80 | (codepoint << 26 >> 26))
        };
    }
}

Bonus question: is there a built-in way to do this?

unicode

utf-8

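2 Answers

Code Review user

Accepted answer

Posted 2016-12-11 20:25:42

Yes, the code is correct for all valid code points. The double shifts confused me at first, since I had never seen them before, but they do the job well. Other authors usually use a single >> followed by a bit mask. For example, (codepoint >> 12) & 0x3F skips 12 bits to the right and then takes the lower 6 bits. This makes the numbers easier to verify, because they are smaller. In addition, all 10xxxxxx bytes then share the same bit mask.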
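The shift-then-mask style described above can be sketched as follows. This is a minimal illustration, not the original poster's method: the names `Utf8MaskSketch`, `Encode` and `Continuation` are hypothetical, and the sketch deliberately stops at the four-byte form (up to 0x10FFFF) and does not check for surrogates.

```csharp
using System;

static class Utf8MaskSketch
{
    // Every continuation byte has the form 10xxxxxx:
    // the marker 0x80 plus six payload bits selected by the same 0x3F mask.
    private static byte Continuation(int codepoint, int shift)
    {
        return (byte)(0x80 | ((codepoint >> shift) & 0x3F));
    }

    // Hypothetical helper: encodes one code point with a single right shift
    // and a mask per byte. Surrogates are NOT rejected here (assumption:
    // the caller has filtered them out).
    public static byte[] Encode(int codepoint)
    {
        if (codepoint < 0 || codepoint > 0x10FFFF)
            throw new ArgumentOutOfRangeException(nameof(codepoint));

        if (codepoint < 0x80)
            return new[] { (byte)codepoint };               // 0xxxxxxx

        if (codepoint < 0x800)
            return new[]
            {
                (byte)(0xC0 | (codepoint >> 6)),            // 110xxxxx
                Continuation(codepoint, 0)
            };

        if (codepoint < 0x10000)
            return new[]
            {
                (byte)(0xE0 | (codepoint >> 12)),           // 1110xxxx
                Continuation(codepoint, 6),
                Continuation(codepoint, 0)
            };

        return new[]
        {
            (byte)(0xF0 | (codepoint >> 18)),               // 11110xxx
            Continuation(codepoint, 12),
            Continuation(codepoint, 6),
            Continuation(codepoint, 0)
        };
    }
}
```

Because the mask is applied after the shift, the extracted field is never negative, so the result does not depend on how >> treats the sign bit of an int.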

Your code omits some validity checks:

codepoint might be < 0
codepoint might be between 0xD800 and 0xDFFF

Other than that, it is perfect.
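The two checks listed above could be expressed as a small guard that runs before encoding. This is only a sketch: `CodePointChecks` and `IsEncodable` are hypothetical names, and the upper bound of 0x10FFFF is an extra assumption that the question explicitly chose not to enforce.

```csharp
static class CodePointChecks
{
    // Hypothetical helper: returns true when the value may be encoded.
    public static bool IsEncodable(int codepoint)
    {
        // Check 1 from the answer: negative values are not code points.
        // The 0x10FFFF upper bound is an added assumption, not part of the answer.
        bool inRange = codepoint >= 0 && codepoint <= 0x10FFFF;

        // Check 2 from the answer: 0xD800..0xDFFF are surrogates,
        // which are reserved for UTF-16 and have no UTF-8 encoding of their own.
        bool isSurrogate = codepoint >= 0xD800 && codepoint <= 0xDFFF;

        return inRange && !isSurrogate;
    }
}
```

A caller might test `CodePointChecks.IsEncodable(codepoint)` first and throw an ArgumentOutOfRangeException when it returns false.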

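I am sure this conversion is built into C#; I just don't know where. Try loading a file into a string using the UTF-8 encoding. The built-in conversion code will be invoked during loading.

Votes: 3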

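Code Review user

Posted 2019-07-14 11:30:10

UTF-8 verification with .NET

Bonus question: is there a built-in way to do this?

There is a built-in way to encode Unicode code points to UTF-8. I have checked some results against the 2003 UTF-8 specification, and I believe this method complies with it. Another interesting link is the UTF8Encoding reference source, which shows how this encoding works.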

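private byte[] CodePointToUtf8_BuiltIn(int codepoint)
{
    // char.ConvertFromUtf32 yields a surrogate pair for code points above 0xFFFF;
    // a plain (char) cast would silently truncate them to 16 bits.
    return new UTF8Encoding(true).GetBytes(char.ConvertFromUtf32(codepoint));
}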

If we loop over the code points and filter out the surrogates, we find some differences between the OP's algorithm and the built-in one.

internal const char HIGH_SURROGATE_START = '\ud800';
internal const char HIGH_SURROGATE_END = '\udbff';
internal const char LOW_SURROGATE_START = '\udc00';
internal const char LOW_SURROGATE_END = '\udfff';        

for (int i = 0; i <= 0x10FFFF; i++)
{
    if (i >= HIGH_SURROGATE_START && i <= HIGH_SURROGATE_END) continue;
    if (i >= LOW_SURROGATE_START && i <= LOW_SURROGATE_END) continue;

    var op = CodePointToUtf8(i);
    var net = CodePointToUtf8_BuiltIn(i);
    CollectionAssert.AreEqual(net, op);
}

Here is a way to display the differences (first the OP's bytes, then the built-in bytes):

var builder = new StringBuilder();
builder.AppendLine("0x" + i.ToString("X4"));
builder.AppendLine(string.Join(" - ", op.Select(x => Convert.ToString(x, 2).PadLeft(8, '0'))));
builder.AppendLine(string.Join(" - ", net.Select(x => Convert.ToString(x, 2).PadLeft(8, '0'))));
var text = builder.ToString();

And some of the differences:

0x00A0
11000010 - 11100000
11000010 - 10100000

0x0400
11110000 - 10000000
11010000 - 10000000

0x0720
11111100 - 11100000
11011100 - 10100000

..

Can you explain the differences?
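One explanation consistent with the listings above is sign extension. In C#, >> on an int is an arithmetic shift, so when codepoint << n moves a 1 into bit 31, the following >> fills the upper bits with ones, and those ones leak into the byte after the cast. The sketch below isolates the low-six-bits expression from the question and compares it with two alternatives. The class and method names are hypothetical and exist only for illustration.

```csharp
using System;

static class SignExtensionDemo
{
    // The double shift from the question, applied to a signed int.
    // unchecked makes the truncating cast explicit.
    public static byte LowSixViaShifts(int codepoint)
    {
        return unchecked((byte)(0x80 | (codepoint << 26 >> 26)));
    }

    // The same field extracted with a mask: the result is never negative.
    public static byte LowSixViaMask(int codepoint)
    {
        return (byte)(0x80 | (codepoint & 0x3F));
    }

    // The double shift kept, but performed on uint, where >> is a logical shift.
    public static byte LowSixViaUnsignedShifts(int codepoint)
    {
        return unchecked((byte)(0x80 | ((uint)codepoint << 26 >> 26)));
    }
}
```

For 0x00A0 the low six bits are 100000. The signed double shift turns them into 0xFFFFFFE0, so the second byte becomes 11100000 instead of 10100000, which matches the first difference listed above. The mask and the unsigned variant both give 10100000.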

Votes: 1

Original page content provided by Code Review. The Chinese translation was supported by Tencent Cloud's Xiaowei IT-domain engine.

Original link:

https://codereview.stackexchange.com/questions/149549
