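I am trying to use PHPCrawl with Symfony2. I first installed the PHPCrawl library with Composer, then created a "DependencyInjection" folder in my bundle and put a class "MyCrawler" there that extends PHPCrawler. I configured it as a service. Now, when I start the crawling process, Symfony gives me the error mentioned in the title:
Attempted to call method "getURIContent" on class "PHPCrawlerUtils"
I don't know why, since the class exists and so does the method.
Here is my controller action: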
/**
 * Crawls the given site
 *
 * @Route("/crawl", name="blog_crawl")
 * @Template()
 */
public function crawlAction($url = 'http://urlexample.net')
{
    // Instead of instantiating the MyCrawler class, I call it as a service (config.yml)
    $crawl = $this->get('my_crawler');
    $crawl->setURL($url);
    // Checks the document's content-type and only accepts pages of type text/html
    $crawl->addContentTypeReceiveRule("#text/html#");
    // Filters the URLs found in the page; here we keep HTML pages only
    $crawl->addURLFilterRule("#(jpg|gif|png|pdf|jpeg|svg|css|js)$# i");
    $crawl->enableCookieHandling(TRUE);
    // Sets a limit to the number of bytes the crawler should receive altogether during crawling-process.
    $crawl->setTrafficLimit(0);
    // Sets a limit to the total number of requests the crawler should execute.
    $crawl->setRequestLimit(20);
    // Sets the content-size-limit for content the crawler should receive from documents.
    $crawl->setContentSizeLimit(0);
    // Sets the timeout in seconds for waiting for data on an established server-connection.
    $crawl->setStreamTimeout(20);
    // Sets the timeout in seconds for connection tries to hosting webservers.
    $crawl->setConnectionTimeout(20);
    $crawl->obeyRobotsTxt(TRUE);
    $crawl->setUserAgentString("Mozilla/5.0 (Windows NT 6.3; rv:36.0) Gecko/20100101 Firefox/36.0");
    $crawl->go();
    // At the end, after the process is finished, we print a short
    // report (see method getProcessReport() for more information)
    $report = $crawl->getProcessReport();
    echo "Summary:".'<br/>';
    echo "Links followed: ".$report->links_followed.'<br/>';
    echo "Documents received: ".$report->files_received.'<br/>';
    echo "Bytes received: ".$report->bytes_received." bytes".'<br/>';
    echo "Process runtime: ".$report->process_runtime." sec".'<br/>';
    echo "Abort reason: ".$report->abort_reason.'<br/>';
    return array(
        'varstuff' => 'something'
    );
}
Here is my service class MyCrawler, in the DependencyInjection folder:
<?php
namespace AppBundle\DependencyInjection;
use PHPCrawler;
use PHPCrawlerDocumentInfo;
/**
 * Description of MyCrawler
 *
 * @author Norman
 */
class MyCrawler extends PHPCrawler{
    /**
     * Retrieves information about a URL
     *
     * @param PHPCrawlerDocumentInfo $pageInfo
     */
    public function handleDocumentInfo(PHPCrawlerDocumentInfo $pageInfo)
    {
        $page_url = $pageInfo->url;
        $source = $pageInfo->source;
        $status = $pageInfo->http_status_code;
        // If the page is "OK" (no error code) and not empty, print the URL
        if($status == 200 && $source!=''){
            echo $page_url.'<br/>';
            flush();
        }
    }
}
I have also looked for help on the PHPCrawl forum on SourceForge, but without success so far... I should add that I am using PHPCrawl 0.83 from here:
https://github.com/mmerian/phpcrawl/
Here is the class where the problem seems to occur:
<?php
/**
 * Class for parsing robots.txt-files.
 *
 * @package phpcrawl
 * @internal
 */
class PHPCrawlerRobotsTxtParser
{
    public function __construct()
    {
        // Init PageRequest-class
        if (!class_exists("PHPCrawlerHTTPRequest")) include_once($classpath."/PHPCrawlerHTTPRequest.class.php");
        $this->PageRequest = new PHPCrawlerHTTPRequest();
    }
    /**
     * Parses a robots.txt-file and returns regular-expression-rules corresponding to the containing "disallow"-rules
     * that are adressed to the given user-agent.
     *
     * @param PHPCrawlerURLDescriptor $BaseUrl The root-URL all rules from the robots-txt-file should relate to
     * @param string $user_agent_string The useragent all rules from the robots-txt-file should relate to
     * @param string $robots_txt_uri Optional. The location of the robots.txt-file as URI.
     * If not set, the default robots.txt-file for the given BaseUrl gets parsed.
     *
     * @return array Numeric array containing regular-expressions for each "disallow"-rule defined in the robots.txt-file
     * that's adressed to the given user-agent.
     */
    public function parseRobotsTxt(PHPCrawlerURLDescriptor $BaseUrl, $user_agent_string, $robots_txt_uri = null)
    {
        PHPCrawlerBenchmark::start("processing_robotstxt");
        // If robots_txt_uri not given, use the default one for the given BaseUrl
        if ($robots_txt_uri === null)
            $robots_txt_uri = self::getRobotsTxtURL($BaseUrl->url_rebuild);
        // Get robots.txt-content
        $robots_txt_content = PHPCrawlerUtils::getURIContent($robots_txt_uri, $user_agent_string);
        $non_follow_reg_exps = array();
        // If content was found
        if ($robots_txt_content != null)
        {
            // Get all lines in the robots.txt-content that are adressed to our user-agent.
            $applying_lines = $this->getUserAgentLines($robots_txt_content, $user_agent_string);
            // Get valid reg-expressions for the given disallow-pathes.
            $non_follow_reg_exps = $this->buildRegExpressions($applying_lines, PHPCrawlerUtils::getRootUrl($BaseUrl->url_rebuild));
        }
        PHPCrawlerBenchmark::stop("processing_robots.txt");
        return $non_follow_reg_exps;
    }
Answered 2015-03-18 05:59:20
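OK, I think I have solved my own problem. What happens here is that, when installed in Symfony2, the mmerian PHPCrawler package autoloads every class in the libs directory. There are two classes named PHPCrawlerUtils: the first is in its own folder, and the second lacks the getURIContent method. Once autoloading is done, the second one takes precedence. In the main class PHPCrawler, the constructor loads each of the correct classes it needs only "if the class does not already exist", which is why the correct class was never loaded. In the end, I included the PHPCrawlerUtils class unconditionally, without checking whether it already exists.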
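For anyone who wants to confirm the same diagnosis, here is a minimal sketch that reports which file a class was actually loaded from and whether that copy defines a given method. It only uses PHP's built-in reflection; `describeLoadedClass()` is a hypothetical helper name of mine, not part of PHPCrawl or Symfony.

```php
<?php
// Diagnostic sketch (describeLoadedClass is a hypothetical helper, not a PHPCrawl API).
// Reports where a class was loaded from and whether it defines a method.
function describeLoadedClass($className, $methodName)
{
    // class_exists() triggers the autoloader by default, so this resolves
    // the class the same way a static call to it would.
    if (!class_exists($className)) {
        return array('loaded' => false, 'file' => null, 'has_method' => false);
    }

    $reflection = new ReflectionClass($className);

    return array(
        'loaded'     => true,
        // Path of the file that declared the class (false for built-in classes)
        'file'       => $reflection->getFileName(),
        'has_method' => $reflection->hasMethod($methodName),
    );
}
```

Calling `var_dump(describeLoadedClass('PHPCrawlerUtils', 'getURIContent'));` in the controller before `$crawl->go()` should show the path of the copy the autoloader picked. If `has_method` is false, the wrong file won, and the correct one has to be included before anything triggers autoloading for that class name.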
https://stackoverflow.com/questions/29103301