Q: Getting and ungetting Unicode characters on a handle

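Stack Overflow user

Asked 2013-01-06 13:59:01

2 answers · 1.1K views · 0 followers · 2 votes

I think I'm running into a problem using Unicode with IO::Handle. Most likely I'm doing something wrong. I want to get and unget individual Unicode characters (not bytes) from an IO::Handle, but I get a surprising error.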

#!/usr/local/bin/perl

use 5.016;
use utf8;
use strict;
use warnings;
use IO::File;    # needed for IO::File->new() below

binmode(STDIN,  ':encoding(utf-8)');
binmode(STDOUT, ':encoding(utf-8)');
binmode(STDERR, ':encoding(utf-8)');

my $string = qq[a Å];
my $fh = IO::File->new();

$fh->open(\$string, '<:encoding(UTF-8)');

say $fh->getc(); # a
say $fh->getc(); # SPACE
say $fh->getc(); # Å LATIN CAPITAL LETTER A WITH RING ABOVE (U+00C5)
$fh->ungetc(ord("Å"));
say $fh->getc(); # should be A RING again.

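The error message from the ungetc() line is "Malformed UTF-8 character (unexpected end of string) at unicode.pl line 21. "\x{00c5}" does not map to utf8 at unicode.pl line 21." But that is the correct hex value for the character, and it should map to the character.

I used a hex editor to make sure the bytes for A-ring are correct UTF-8.

This seems to be a problem for any two-byte character.

The final say outputs '\xC5' (literally four characters: backslash, x, C, 5).

I have also tested this by reading from a file instead of a scalar variable. The result is the same.

This is perl 5, version 16, subversion 2 (v5.16.2) built for darwin-2level

The script is saved as UTF-8. That was the first thing I checked.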

Tags: getc, perl-io, ungetc, perl, unicode

2 Answers

Stack Overflow user

Accepted answer

Posted 2013-01-07 07:02:34

I'm fairly sure this demonstrates a serious bug in Unicode handling, given the following output:

perl5.16.0 ungettest
ungettest 98896 @ Sun Jan  6 16:01:08 2013: sending normal line to kid
ungettest 98896 @ Sun Jan  6 16:01:08 2013: await()ing kid
ungettest 98897 @ Sun Jan  6 16:01:08 2013: ungetting litte z
ungettest 98897 @ Sun Jan  6 16:01:08 2013: ungetting big sigma
ungettest 98897 @ Sun Jan  6 16:01:08 2013: kid looping on parental input
98897: Unexpected fatalized warning: utf8 "\xA3" does not map to Unicode at ungettest line 40, <STDIN> line 1.
 at ungettest line 10, <STDIN> line 1.
    main::__ANON__('utf8 "\xA3" does not map to Unicode at ungettest line 40, <ST...') called at ungettest line 40
98896: parent pclose failed: 65280,  at ungettest line 28.
Exit 255

which is produced by this program:

#!/usr/bin/env perl

use v5.16;
use strict;
use warnings;
use open qw( :utf8    :std );

use Carp;

$SIG{__WARN__} = sub {  confess "$$: Unexpected fatalized warning: @_" };

sub ungetchar($) {
    my $char = shift();
    confess "$$: expected single character pushback, not <$char>" if length($char) != 1;
    STDIN->ungetc(ord $char);
}

sub debug {
    my $now = localtime(time());
    print STDERR "$0 $$ \@ $now: @_\n";
}

if (open(STDOUT, "|-")                          // confess "$$: cannot fork: $!") {
    $| = 1;
    debug("sending normal line to kid");
    say "From \N{greek:alpha} to \N{greek:omega}.";
    debug("await()ing kid");
    close(STDOUT)                               || confess "$$: parent pclose failed: $?, $!";
    debug("child finished, parent exiting normally");
    exit(0);
}

debug("ungetting litte z");
ungetchar("z")                                  || confess "$$: ASCII ungetchar failed: $!";

debug("ungetting big sigma");
ungetchar("\N{greek:Sigma}")                    || confess "$$: Unicode ungetchar failed: $!";

debug("kid looping on parental input");
while (<STDIN>) {
    chomp;
    debug("kid got $_");
}
close(STDIN)                                    || confess "$$: child pclose failed: $?, $!";
debug("parent closed pipe, child exiting normally");
exit 0;

2 votes

Stack Overflow user

Posted 2013-01-06 14:17:32

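ungetc pushes a single byte onto the front of the underlying input stream. To get U+00C5 back, the stream must contain C3 85 (the UTF-8 encoding of that character), not C5 (ord("Å")). Use unread from IO::Unread instead.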
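The gap between the code point and the encoded bytes can be checked directly. The following is a minimal sketch using only the core Encode module; the helper name `utf8_bytes_hex` is made up for illustration and is not part of the original posts.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: return the UTF-8 encoding of one character
# as space-separated hex bytes.
sub utf8_bytes_hex {
    my ($char) = @_;
    my @bytes = unpack 'C*', encode('UTF-8', $char);
    return join ' ', map { sprintf '%02X', $_ } @bytes;
}

my $char = "\x{C5}";    # LATIN CAPITAL LETTER A WITH RING ABOVE

# ord() yields the code point, which is what the question passes to ungetc.
printf "code point:  U+%04X\n", ord($char);

# The encoded stream holds two bytes for this character, not one.
printf "UTF-8 bytes: %s\n", utf8_bytes_hex($char);
```

This prints U+00C5 for the code point and C3 85 for the encoded bytes (C3 A5 would be the lowercase å, U+00E5), so a single pushed-back byte C5 is not a complete UTF-8 sequence.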

1 vote
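If installing a non-core module is not an option, one possible workaround is to avoid the handle's byte-level ungetc altogether and keep pushed-back characters on the Perl side. The sketch below assumes all reads go through the wrapper. The class `CharPushback` and its methods `next_char` and `push_back` are hypothetical names invented for this example, not an existing API.

```perl
use strict;
use warnings;

{
    # Hypothetical helper: wraps a decoding read handle and stores
    # pushed-back characters in an array instead of calling ungetc.
    package CharPushback;

    sub new {
        my ($class, $fh) = @_;
        return bless { fh => $fh, pending => [] }, $class;
    }

    # Return the most recently pushed-back character if any,
    # otherwise the next decoded character from the handle.
    sub next_char {
        my ($self) = @_;
        return pop @{ $self->{pending} } if @{ $self->{pending} };
        my $fh = $self->{fh};
        return CORE::getc($fh);
    }

    # Push one character (not a byte, not a code point number) back.
    sub push_back {
        my ($self, $char) = @_;
        die "expected a single character"
            unless defined $char && length($char) == 1;
        push @{ $self->{pending} }, $char;
        return 1;
    }
}

# UTF-8 bytes for "a Å", read through a decoding in-memory handle.
my $bytes = "a \xC3\x85";
open(my $fh, '<:encoding(UTF-8)', \$bytes)
    or die "cannot open in-memory handle: $!";

my $in = CharPushback->new($fh);
my @seen;
push @seen, $in->next_char for 1 .. 3;   # a, SPACE, A-ring
$in->push_back($seen[-1]);               # unget A-ring as a character
push @seen, $in->next_char;              # A-ring again

printf "U+%04X\n", ord($_) for @seen;
```

The trade-off is that anything reading the raw handle directly (for example `<$fh>`) will not see the pushed-back characters, so this only fits code that reads character by character through the wrapper.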

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/14179751
