Please take a look at this Perl code:
#!/usr/bin/perl -w -CS
use feature 'unicode_strings';
open IN, "<", "wiki.txt";
open OUT, ">", "wikicorpus.txt";
binmode( IN, ':utf8' );
binmode( OUT, ':utf8' );
## Condition plain text English sentences or word lists into a form suitable for constructing a vocabulary and language model
while (<IN>) {
# Remove starting and trailing tags (e.g. <s>)
# s/\<[a-z\/]+\>//g;
# Remove ellipses
s/\.\.\./ /g;
# Remove unicode 2500 (hex E2 94 80) used as something like an m-dash between words
# Unicode 2026 (horizontal ellipsis)
# Unicode 2013 and 2014 (m- and n-dash)
s/[\x{2500}\x{2026}\x{2013}\x{2014}]/ /g;
# Remove dashes surrounded by spaces (e.g. phrase - phrase)
s/\s-+\s/ /g;
# Remove dashes between words with no spaces (e.g. word--word)
s/([A-Za-z0-9])\-\-([A-Za-z0-9])/$1 $2/g;
# Remove dash at a word end (e.g. three- to five-year)
s/(\w)-\s/$1 /g;
# Remove some punctuation
s/([\"\?,;:%???!()\[\]{}<>_\.])/ /g;
# Remove quotes
s/[\p{Initial_Punctuation}\p{Final_Punctuation}]/ /g;
# Remove trailing space
s/ $//;
# Remove double single-quotes
s/'' / /g;
s/ ''/ /g;
# Replace accented e with normal e for consistency with the CMU pronunciation dictionary
s/?/e/g;
# Remove single quotes used as quotation marks (e.g. some 'phrase in quotes')
s/\s'([\w\s]+[\w])'\s/ $1 /g;
# Remove double spaces
s/\s+/ /g;
# Remove leading space
s/^\s+//;
chomp($_);
print OUT uc($_) . "\n";
# print uc($_) . " ";
}
print OUT "\n";
There seems to be a non-English character on line 49, namely the line s/?/e/g;. So when I run this code, Perl complains "Quantifier follows nothing in regex".
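How should I deal with this? How can I get Perl to recognize the character? I have to run this code with Perl 5.10.
A second, minor question: what does "-CS" in the first line mean?
Thanks, everyone.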
Posted on 2012-08-16 13:16:31
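I think your problem is that your editor doesn't handle Unicode characters, so the program gets mangled before it ever reaches perl. Since this evidently isn't your own program, it was probably already mangled before it reached you.
Until your whole toolchain handles Unicode correctly, you have to take care to encode non-ASCII characters in a way that preserves them. It's a pain, and there's no simple solution. See the Perl manual for how to embed Unicode characters safely.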
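To make the diagnosis concrete: once the non-ASCII character has been turned into a literal "?", the pattern consists of a bare quantifier with nothing in front of it, which Perl refuses to compile. The following is only a minimal sketch; the helper name regex_error is made up for illustration and is not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): compile a pattern at
# run time and return the error message, or '' if it compiles cleanly.
sub regex_error {
    my ($pattern) = @_;
    return eval { my $re = qr/$pattern/; 1 } ? '' : $@;
}

# What the script contains after the character was mangled into "?".
print "mangled:  ", regex_error('?'), "\n";

# What was presumably intended: a single e-acute character.
print "intended: ", ( regex_error("\x{00E9}") || 'compiles fine' ), "\n";
```

The first call reports "Quantifier follows nothing in regex", the same message the question quotes; the second pattern compiles without complaint.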
Posted on 2012-08-16 14:09:20
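Judging by the comment line just before the offending line, the character to be replaced is an accented "e", presumably e with an acute accent: "é". Assuming your input is Unicode, you can write it in Perl as \x{00E9}. See also http://www.fileformat.info/info/unicode/char/e9/index.htm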
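A minimal sketch of that substitution, assuming the lost character really was U+00E9 (the script itself no longer shows it). The subroutine name strip_e_acute is hypothetical; in the original script the fix would simply be to change the broken line to s/\x{00E9}/e/g;. Because the escape is plain ASCII, it survives editors and web pages that mangle non-ASCII text.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): replace e-acute with a
# plain "e", using an escape so the source file stays pure ASCII.
sub strip_e_acute {
    my ($text) = @_;
    $text =~ s/\x{00E9}/e/g;
    return $text;
}

binmode( STDOUT, ':encoding(UTF-8)' );
print strip_e_acute("caf\x{00E9} r\x{00E9}sum\x{00E9}"), "\n";   # prints "cafe resume"
```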
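My guess is that you copied and pasted this script from a web page on a server that wasn't properly configured to serve the required character encoding. See also http://en.wikipedia.org/wiki/Mojibake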
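On the side question about "-CS", which neither answer addresses: as far as I understand perlrun, -C controls Perl's Unicode features and the letter S stands for the three standard streams, so -CS tells perl to treat STDIN, STDOUT and STDERR as UTF-8. The documentation also notes that, starting with 5.10.1, -C on the #! line must be given on the command line as well. The sketch below shows a rough in-script equivalent; it is an illustration under those assumptions, not a drop-in replacement for the original first line.

```perl
use strict;
use warnings;

# Rough in-script counterpart of -CS: mark the three standard streams as UTF-8.
binmode( $_, ':encoding(UTF-8)' ) for \*STDIN, \*STDOUT, \*STDERR;

# ${^UNICODE} holds the numeric value of the -C switch (0 when -C was not given).
print "-C flags in effect: ", ${^UNICODE}, "\n";
```

The script in the question reads and writes its own filehandles IN and OUT, which it already sets to UTF-8 with binmode, so -CS would mainly matter for the commented-out print to STDOUT.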
https://stackoverflow.com/questions/11980994