Q: Perl - File encoding and word comparison

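Stack Overflow user

Asked on 2011-05-05 17:13:22

Answers: 1, Views: 1.2K, Followers: 0, Votes: 5

I have a file with one phrase/term per line, which I read into Perl from STDIN. I have a list of words (such as "á", "são", "é") that I want to compare against each term and remove if they are equal. The problem is that I am not sure of the file's encoding.

This is what the file command reports: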

words.txt: Non-ISO extended-ASCII English text

My Linux terminal uses UTF-8, and it displays some of the words correctly and others not.

condi<E3>
conte<FA>dos
ajuda, mas não resolve
mo<E7>ambique
pedagógico são fenómenos

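As you can see, the third and fifth lines show the words with accents and special characters correctly, while the other lines do not. The correct output for the other lines should be: condição, conteúdos, and moçambique.

If I use binmode(STDOUT, utf8), the "incorrect" lines are now output correctly, while the others are not. For example, line 3: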

ajuda, mas nÃ£o resolve
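
The line-3 symptom is consistent with double encoding: the line is already UTF-8, but because the input was never decoded, Perl treats each octet as a Latin-1 character and the output layer encodes it a second time. The following is a minimal sketch of that effect, assuming the line arrives as raw UTF-8 octets; the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# "não" as UTF-8 octets, the way an already-UTF-8 line arrives on STDIN
# when no input layer is set.
my $octets = "n\xC3\xA3o";

# Without decoding, Perl treats each octet as one Latin-1 character,
# so encoding on output encodes the two octets of "ã" separately.
my $double = encode('UTF-8', $octets);
printf "undecoded: %vX\n", $double;    # octets that would be written

# Decoding first yields three characters, which encode back cleanly.
my $chars = decode('UTF-8', $octets);
my $clean = encode('UTF-8', $chars);
printf "decoded:   %vX\n", $clean;
```

The first line printed has six octets instead of four, which is what a UTF-8 terminal renders as the extra "Ã£" pair.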

What should I do, guys?

perl

unicode

character-encoding

1 Answer

Stack Overflow user

Accepted answer

Posted on 2011-05-05 18:21:49

It works like this:

C:\Dev\Perl :: chcp
Aktive Codepage: 1252.

C:\Dev\Perl :: type mixed-encoding.txt
eins zwei drei KÃ¤se vier fÃ¼nf Wurst
eins zwei drei Käse vier fünf Wurst

C:\Dev\Perl :: perl mixed-encoding.pl < mixed-encoding.txt
eins zwei drei vier fünf
eins zwei drei vier fünf

mixed-encoding.pl looks like this:

use strict;
use warnings;
use utf8; # source in UTF-8
use Encode 'decode_utf8';
use List::MoreUtils 'any';

my @stopwords = qw( Käse Wurst );

while ( <> ) { # read octets
    chomp;
    my @tokens;
    for ( split /\s+/ ) {
        # Try UTF-8 first. If that fails, assume legacy Latin-1.
        my $token = eval { decode_utf8 $_, Encode::FB_CROAK };
        $token = $_ if $@;
        push @tokens, $token unless any { $token eq $_ } @stopwords;
    }
    print "@tokens\n";
}

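Note that the script does not have to be encoded in UTF-8. It is just that if you have funky character data in your script, you must make sure the encodings match: if your source file is UTF-8, then use utf8; if it is not, then don't.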
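
To illustrate why the pragma has to match the source encoding, here is a small sketch that compares the same literal with and without use utf8. It assumes the file itself is saved as UTF-8; the variable names are made up for the example.

```perl
use strict;
use warnings;

# Assumes this file is saved as UTF-8.
my ($with, $without);
{
    use utf8;               # literal is decoded into characters
    $with = length "ä";
}
{
    no utf8;                # literal stays as raw octets
    $without = length "ä";
}
print "with use utf8: $with, without: $without\n";
```

If the two lengths differ, a stopword written as a literal will only compare equal to decoded input when the pragma matches how the file was saved.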

Update based on suggestions

use strict;
use warnings;
# source in Latin1
use Encode 'decode';
use List::MoreUtils 'any';

my @stopwords = qw( Käse Wurst );

while ( <> ) { # read octets
        chomp;
        my @tokens;
        for ( split /\s+/ ) {
                # Try UTF-8 first. If that fails, assume 8-bit encoding.
                my $token = eval { decode utf8 => $_, Encode::FB_CROAK };
                $token    = decode Windows1252 => $_, Encode::FB_CROAK if $@;
                push @tokens, uc $token unless any { $token eq $_ } @stopwords;
        }
        print "@tokens\n";
}
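
The "try UTF-8, then fall back" step can also be factored into a helper. The sketch below is one possible arrangement, not the answer's own code: decode_token is a hypothetical name, the sample lines are hard-coded instead of read from STDIN, and the fallback to cp1252 is an assumption about the legacy data. It passes LEAVE_SRC so the octets are untouched when the fallback runs, and it encodes output explicitly as UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# decode_token is a hypothetical helper name, not part of the original answer.
sub decode_token {
    my ($octets) = @_;
    # LEAVE_SRC keeps $octets intact so the fallback can reuse it.
    my $text = eval { decode('UTF-8', $octets, FB_CROAK | LEAVE_SRC) };
    return $text if defined $text;
    # Not valid UTF-8: assume a single-byte legacy encoding (cp1252 here).
    return decode('cp1252', $octets);
}

# Stopwords written with escapes, so the source file encoding does not matter.
my %stopwords = map { $_ => 1 } ("K\x{E4}se", "Wurst");

binmode STDOUT, ':encoding(UTF-8)';

# One UTF-8 line and one Latin-1 line, as raw octets.
for my $line ("drei K\xC3\xA4se vier", "drei K\xE4se vier") {
    my @tokens = grep { !$stopwords{$_} }
                 map  { decode_token($_) }
                 split ' ', $line;
    print "@tokens\n";
}
```

A hash lookup replaces the any { ... } scan here purely as a design choice; it avoids the List::MoreUtils dependency and stays fast for long stopword lists.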

Votes: 3

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/5901633
