
PHPCrawl - Attempted to call method "getURIContent" on class "PHPCrawlerUtils"

Stack Overflow user
Asked 2015-03-17 23:35:30
Answers: 1 · Views: 422 · Followers: 0 · Votes: 0

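I am trying to use PHPCrawl in Symfony2. I first installed the PHPCrawl library with composer, then created a "DependencyInjection" folder in my bundle and put a class "MyCrawler" in it that extends PHPCrawler. I configured it as a service. Now, when I start the crawling process, Symfony gives me the error mentioned in the title:

Attempted to call method "getURIContent" on class "PHPCrawlerUtils"

I don't understand why, since the class exists and so does the method.

Here is my controller action: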

/**
 * Crawls the target site.
 *
 * @Route("/crawl", name="blog_crawl")
 * @Template()
 */
public function crawlAction($url = 'http://urlexample.net')
{
    // Instead of instantiating MyCrawler directly, fetch it as a service (config.yml)
    $crawl = $this->get('my_crawler');

    $crawl->setURL($url);

    // Check the document's Content-Type header; only receive pages of type text/html
    $crawl->addContentTypeReceiveRule("#text/html#"); 

    // Filter the URLs found in each page: skip images, PDFs, CSS and JS so only HTML pages are kept
    $crawl->addURLFilterRule("#(jpg|gif|png|pdf|jpeg|svg|css|js)$# i"); 

    $crawl->enableCookieHandling(TRUE);

    // Sets a limit to the number of bytes the crawler should receive altogether during the crawling process.
    $crawl->setTrafficLimit(0);

    // Sets a limit to the total number of requests the crawler should execute.
    $crawl->setRequestLimit(20);

    // Sets the content-size-limit for content the crawler should receive from documents.
    $crawl->setContentSizeLimit(0);

    // Sets the timeout in seconds for waiting for data on an established server-connection.
    $crawl->setStreamTimeout(20);

    // Sets the timeout in seconds for connection tries to hosting webservers.
    $crawl->setConnectionTimeout(20);

    $crawl->obeyRobotsTxt(TRUE);
    $crawl->setUserAgentString("Mozilla/5.0 (Windows NT 6.3; rv:36.0) Gecko/20100101 Firefox/36.0");

    $crawl->go();

    // At the end, after the process is finished, we print a short 
    // report (see method getProcessReport() for more information) 
    $report = $crawl->getProcessReport(); 

    echo "Summary:".'<br/>'; 
    echo "Links followed: ".$report->links_followed.'<br/>'; 
    echo "Documents received: ".$report->files_received.'<br/>'; 
    echo "Bytes received: ".$report->bytes_received." bytes".'<br/>'; 
    echo "Process runtime: ".$report->process_runtime." sec".'<br/>';
    echo "Abort reason: ".$report->abort_reason.'<br/>';


    return array(
        'varstuff' => 'something'
    );
}

Here is my service class MyCrawler, in the DependencyInjection folder:

<?php

namespace AppBundle\DependencyInjection;

use PHPCrawler;
use PHPCrawlerDocumentInfo;

/**
 * Description of MyCrawler
 *
 * @author Norman
 */
class MyCrawler extends PHPCrawler{

    /**
     * Handles the information retrieved for a URL
     * 
     * @param PHPCrawlerDocumentInfo $pageInfo
     */
    public function handleDocumentInfo(PHPCrawlerDocumentInfo $pageInfo)
    {                
        $page_url = $pageInfo->url;        
        $source = $pageInfo->source;
        $status = $pageInfo->http_status_code;

        // If the page is OK (status 200, no error code) and not empty, print its URL
        if($status == 200 && $source!=''){
            echo $page_url.'<br/>';

            flush();            
        }
    }    
}

I have also searched for help on the sourceforge PHPCrawl forum, but without success so far. I should add that I am using PHPCrawl 0.83, from here:

https://github.com/mmerian/phpcrawl/

Here is the class where the problem seems to occur:

<?php
/**
 * Class for parsing robots.txt-files.
 *
 * @package phpcrawl
 * @internal
 */  
class PHPCrawlerRobotsTxtParser
{ 
  public function __construct()
  {
    // Init PageRequest-class
    if (!class_exists("PHPCrawlerHTTPRequest"))    include_once($classpath."/PHPCrawlerHTTPRequest.class.php");
    $this->PageRequest = new PHPCrawlerHTTPRequest();

  }

  /**
   * Parses a robots.txt-file and returns regular-expression-rules corresponding to the containing "disallow"-rules
   * that are adressed to the given user-agent.
   *
   * @param PHPCrawlerURLDescriptor $BaseUrl           The root-URL all rules from the robots-txt-file should relate to
   * @param string                  $user_agent_string The useragent all rules from the robots-txt-file should relate to
   * @param string                  $robots_txt_uri    Optional. The location of the robots.txt-file as URI.
   *                                                   If not set, the default robots.txt-file for the given BaseUrl gets parsed.
   *
   * @return array Numeric array containing regular-expressions for each "disallow"-rule defined in the robots.txt-file
   *               that's adressed to the given user-agent.
   */
  public function parseRobotsTxt(PHPCrawlerURLDescriptor $BaseUrl,   $user_agent_string, $robots_txt_uri = null)
  {
    PHPCrawlerBenchmark::start("processing_robotstxt");

    // If robots_txt_uri not given, use the default one for the given BaseUrl
    if ($robots_txt_uri === null)
      $robots_txt_uri = self::getRobotsTxtURL($BaseUrl->url_rebuild);

    // Get robots.txt-content
    $robots_txt_content = PHPCrawlerUtils::getURIContent($robots_txt_uri, $user_agent_string);

    $non_follow_reg_exps = array();

    // If content was found
    if ($robots_txt_content != null)
    {
      // Get all lines in the robots.txt-content that are adressed to our user-agent.
      $applying_lines = $this->getUserAgentLines($robots_txt_content, $user_agent_string);

      // Get valid reg-expressions for the given disallow-pathes.
      $non_follow_reg_exps = $this->buildRegExpressions($applying_lines, PHPCrawlerUtils::getRootUrl($BaseUrl->url_rebuild));
    }

    PHPCrawlerBenchmark::stop("processing_robots.txt");

    return $non_follow_reg_exps;
}

1 Answer

Stack Overflow user

Answered 2015-03-18 05:59:20

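OK, I think I have solved my own problem. When the mmerian PHPCrawler package is installed in Symfony2, every class in its libs directory gets autoloaded. The package contains two classes named PHPCrawlerUtils: the first is in its own folder, and the second lacks the getURIContent method. Once autoloading has finished, the second one takes priority. The constructor of the main PHPCrawler class loads each class it needs only "if the class does not exist yet", so it finds a PHPCrawlerUtils already defined and never loads the correct one. In the end, I included the correct PHPCrawlerUtils class unconditionally, without any check on whether it already exists.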
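The mechanism described above can be reproduced in isolation. The following is only a minimal sketch under stated assumptions: `DemoUtils` is a hypothetical stand-in for PHPCrawlerUtils, and the two temporary files stand in for the two copies shipped in the package; none of these names or paths come from PHPCrawl itself. It shows how a `class_exists` guard silently skips the correct file once another class with the same name is already defined, and how `ReflectionClass::getFileName()` reveals which file actually won.

```php
<?php
// Hypothetical demo: DemoUtils plays the role of PHPCrawlerUtils.
$dir = sys_get_temp_dir() . '/dup_class_demo_' . uniqid();
mkdir($dir);

// "Stale" copy: has no getURIContent() method.
file_put_contents(
    $dir . '/stale.php',
    '<?php class DemoUtils { public static function getRootUrl($u) { return $u; } }'
);

// "Correct" copy: provides getURIContent().
file_put_contents(
    $dir . '/correct.php',
    '<?php class DemoUtils { public static function getURIContent($u) { return ""; } }'
);

// Simulate the autoloader having picked up the stale copy first.
require_once $dir . '/stale.php';

// Library-style guard: skipped, because a class named DemoUtils already exists.
if (!class_exists('DemoUtils', false)) {
    include_once $dir . '/correct.php';
}

// Diagnose: is the method there, and which file defined the class?
$hasMethod = method_exists('DemoUtils', 'getURIContent');
$definedIn = (new ReflectionClass('DemoUtils'))->getFileName();

echo 'getURIContent available: ' . var_export($hasMethod, true) . PHP_EOL;
echo 'DemoUtils was loaded from: ' . $definedIn . PHP_EOL;
```

Running the same two diagnostic lines against `PHPCrawlerUtils` in the real project should show which copy was loaded. The fix described in the answer then amounts to requiring the correct file explicitly before any other copy can be defined, since PHP cannot redeclare a class once it exists.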

Votes: 0

Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/29103301
