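Suppose I have a sentence:
$body = 'the quick brown fox jumps over the lazy dog';

I want to turn this sentence into a hash of "keywords", but I also want to allow multi-word keywords. I have the following to get single-word keywords:
$words{$_}++ for $body =~ m/(\w+)/g;

After running this, I have a hash that looks like the following: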
'the' => 2,
'quick' => 1,
'brown' => 1,
'fox' => 1,
'jumps' => 1,
'over' => 1,
'lazy' => 1,
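'dog' => 1

As a next step, I can get two-word keywords like this:
$words{$_}++ for $body =~ m/(\w+ \w+)/g;

But this only gets every other pair, because each match resumes where the previous one ended. The result looks like this: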
'the quick' => 1,
'brown fox' => 1,
'jumps over' => 1,
'the lazy' => 1

I also need the pairs that start one word later:
'quick brown' => 1,
'fox jumps' => 1,
'over the' => 1

Is there an easier way to do this than the following?
my $orig_body = $body;
# single word keywords
$words{$_}++ for $body =~ m/(\w+)/g;
# double word keywords
$words{$_}++ for $body =~ m/(\w+ \w+)/g;
$body =~ s/^(\w+)//;
$words{$_}++ for $body =~ m/(\w+ \w+)/g;
$body = $orig_body;
# triple word keywords
$words{$_}++ for $body =~ m/(\w+ \w+ \w+)/g;
$body =~ s/^(\w+)//;
$words{$_}++ for $body =~ m/(\w+ \w+ \w+)/g;
$body = $orig_body;
$body =~ s/^(\w+ \w+)//;
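$words{$_}++ for $body =~ m/(\w+ \w+ \w+)/g;

Posted 2010-08-19 05:35:44
While the task described might be fun to code by hand, wouldn't it be better to use an existing CPAN module that handles n-grams? It looks like Text::Ngrams (as opposed to Text::Ngram) can handle word-based n-gram analysis.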
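For comparison, here is a minimal core-Perl sketch of word-based n-gram counting done by hand. It does not show the Text::Ngrams API; the subroutine name `ngram_counts` and its parameters are hypothetical and exist only for this illustration. It tokenises with the same `\w+` pattern as the question and slides a window over the word list.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::Ngrams): count every run of
# 1 .. $max_n consecutive words in $text.
sub ngram_counts {
    my ($text, $max_n) = @_;
    my @tokens = $text =~ /(\w+)/g;    # same tokenisation as the question
    my %count;
    for my $n (1 .. $max_n) {
        # Slide a window of $n words across the token list.
        # The range is empty when there are fewer than $n tokens.
        for my $start (0 .. @tokens - $n) {
            $count{ join ' ', @tokens[ $start .. $start + $n - 1 ] }++;
        }
    }
    return \%count;
}

my $words = ngram_counts('the quick brown fox jumps over the lazy dog', 3);
print "$_ => $words->{$_}\n" for sort keys %$words;
```

Unlike the regex patterns in the question, this joins neighbouring words even when they are separated by punctuation or several spaces, which may or may not be what you want.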
Posted 2010-08-19 05:43:03
You can do something funky with lookaheads.
If I do this:
$words{$_}++ for $body =~ m/(?=(\w+ \w+))\w+/g;

This says: look ahead for two words (and capture them), but consume only one.
I get:
%words: {
'brown fox' => 1,
'fox jumps' => 1,
'jumps over' => 1,
'lazy dog' => 1,
'over the' => 1,
'quick brown' => 1,
'the lazy' => 1,
'the quick' => 1
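}

It seems I can generalize this by adding a variable for the count:
my $n = 4;  # number of extra words after the first, so each keyword has $n + 1 words
$words{$_}++ for $body =~ m/(?=(\w+(?: \w+){$n}))\w+/g;

Posted 2010-08-19 05:28:19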
I would use a look-ahead to collect everything but the first word. That way, the match position automatically advances correctly:
my $body = 'the quick brown fox jumps over the lazy dog';
my %words;
++$words{$1} while $body =~ m/(\w+)/g;
++$words{"$1 $2"} while $body =~ m/(\w+) \s+ (?= (\w+) )/gx;
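++$words{"$1 $2 $3"} while $body =~ m/(\w+) \s+ (?= (\w+) \s+ (\w+) )/gx;

If you want to use a single space instead of \s+ (don't forget to remove the /x modifier if you do), this can be simplified a little, because you can then collect any number of words in $2 instead of using one group per word.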
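The single-space simplification could look roughly like the sketch below. It is one possible reading of the suggestion, not the answerer's own code: the loop and the `$extra` variable are assumptions added for illustration. The match consumes the first word and one space, while the look-ahead captures all remaining words in `$2` without consuming them.

```perl
use strict;
use warnings;

my $body = 'the quick brown fox jumps over the lazy dog';
my %words;

# Single-word keywords, as before.
++$words{$1} while $body =~ m/(\w+)/g;

# Keywords of $n words for $n = 2 and 3. Because the look-ahead does not
# consume anything, the next match starts exactly one word later.
for my $n (2 .. 3) {
    my $extra = $n - 2;    # words after the first one inside the look-ahead
    ++$words{"$1 $2"} while $body =~ m/(\w+) (?=(\w+(?: \w+){$extra}))/g;
}

print "$_ => $words{$_}\n" for sort keys %words;
```

Raising the upper bound of the loop adds longer keywords without writing a new pattern for each length.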
https://stackoverflow.com/questions/3516628