I am analyzing the source code of many websites, a large network with thousands of pages. I want to search each page and count how many times a keyword occurs.
To fetch the pages I use curl and pipe the output into grep, which did not work as expected, so I want to use the -c option. Could the page fetching be done entirely with Perl?
For example:
cat RawJSpiderOutput.txt | grep parsed | awk -F " " '{print $2}' | xargs -I replaceStr curl replaceStr?myPara=en | perl -lne '$c++ while /myKeywordToSearchFor/g; END { print $c }'
Explanation: the text file above contains both usable and unusable URLs. "grep parsed" keeps the usable ones. With awk I select the second column, which holds just the usable URL. So far so good. Now the actual question: I fetch each page's source with curl (also appending some parameters) and pipe the whole source of each page into Perl to count the occurrences of "myKeywordToSearchFor". I would prefer to do this entirely in Perl, if that is possible.
Thanks!
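An aside on the grep attempt above: "grep -c" counts matching lines, not total occurrences, so it undercounts whenever the keyword appears more than once on a line. A minimal shell sketch of the difference, with a made-up keyword ("key") and input:

```shell
# grep -c counts matching LINES: prints 2 here, even though "key" occurs 3 times
printf 'key key\nnothing\nkey\n' | grep -c key

# grep -o emits each match on its own line, so wc -l counts every occurrence: 3
printf 'key key\nnothing\nkey\n' | grep -o key | wc -l
```

This is why a per-occurrence counter (like the perl -lne '$c++ while /.../g' in the pipeline) and grep -c can disagree.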
Posted on 2011-12-06 23:12:39
This uses Perl only (untested):
use strict;
use warnings;
use File::Fetch;

my $count = 0;

open my $SPIDER, '<', 'RawJSpiderOutput.txt' or die $!;
while (<$SPIDER>) {
    chomp;
    if (/parsed/) {
        my $url = (split)[1];   # second whitespace-separated column
        $url .= '?myPara=en';

        # Fetch the page to a local file; fetch() returns the path on success.
        my $ff      = File::Fetch->new(uri => $url);
        my $fetched = $ff->fetch or die $ff->error;

        open my $FETCHED, '<', $fetched or die $!;
        while (<$FETCHED>) {
            $count++ if /myKeyword/;
        }
        close $FETCHED;
        unlink $fetched;
    }
}
print "$count\n";

Posted on 2011-12-06 22:56:10
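One caveat with the loop above: "$count++ if /myKeyword/" counts matching lines, whereas the question's one-liner ("$c++ while /.../g") counts every occurrence. A minimal sketch of the difference, using made-up sample text:

```perl
use strict;
use warnings;

my $text = "key key\nnothing\nkey\n";   # hypothetical page source
my ($matching_lines, $occurrences) = (0, 0);
for my $line (split /\n/, $text) {
    $matching_lines++ if $line =~ /key/;     # counts lines containing the keyword
    $occurrences++ while $line =~ /key/g;    # counts every single match
}
print "$matching_lines $occurrences\n";      # prints "2 3"
```

If the goal is "number of times the keyword appears", the while-/g form is the one to use.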
Or try something like this,
perl -e 'while(<>){my @words = split " ";for my $word(@words){++$c if $word =~ /myKeyword/}}print "$c\n"'
that is,
while (<>)                     # as long as we're getting input (into "$_")
{ my @words = split ' ';       # split $_ (implicit) on whitespace, so we examine each word
  for my $word (@words)        # (and don't miss two keywords on one line)
  { if ($word =~ /myKeyword/)  # whenever it's found,
    { ++$c } } }               # increment the counter (auto-vivified)
print "$c\n"                   # and after end of file is reached, print the counter
Or, spelled out in a strict-friendly style:
use strict;
my $count = 0;
while (my $line = <STDIN>)     # except that <> is actually more magical than this
{ my @words = split ' ' => $line;
  for my $word (@words)
  { if ($word =~ /myKeyword/)
    { ++$count; } } }
print "$count\n";

https://stackoverflow.com/questions/8401109
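On the comment that <> is "more magical" than <STDIN>: the diamond operator reads each file named in @ARGV in turn, and falls back to STDIN only when @ARGV is empty. A small self-contained sketch (the temp file and its contents are invented for illustration):

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a throwaway input file so the example needs no external data.
my ($FH, $tmp) = tempfile();
print $FH "myKeyword here\nplain line\nmyKeyword again myKeyword\n";
close $FH;

# Loading @ARGV makes <> read from that file instead of STDIN.
local @ARGV = ($tmp);
my $count = 0;
while (<>) {
    my @words = split ' ';
    $count++ for grep { /myKeyword/ } @words;   # per-word check, as in the answer
}
unlink $tmp;
print "$count\n";   # prints "3"
```

With several filenames in @ARGV, the same loop would run through all of them in sequence, which is handy for batches of spider output files.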