PHPCrawl - Attempted to call method "getURIContent" on class "PHPCrawlerUtils"

Stack Overflow user

Asked 2015-03-17 23:35:30

Answers: 1, Views: 422, Followers: 0, Votes: 0

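I am trying to use PHPCrawl in Symfony2. I first installed the PHPCrawl library with Composer, then created a "DependencyInjection" folder in my bundle and put a class "MyCrawler" in it that extends PHPCrawler. I configured it as a service. Now, when I start the crawling process, Symfony gives me the error mentioned in the title:

Attempted to call method "getURIContent" on class "PHPCrawlerUtils"

I don't understand why, because the class exists and so does the method.

Here is my controller action: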

/**
 * Crawls the target site.
 *
 * @Route("/crawl", name="blog_crawl")
 * @Template()
 */
public function crawlAction($url = 'http://urlexample.net')
{               
    // Instead of instantiating MyCrawler directly, fetch it as a service (config.yml)
    $crawl = $this->get('my_crawler');

    $crawl->setURL($url);

    // Check the document's content-type header; only receive pages of type text/html
    $crawl->addContentTypeReceiveRule("#text/html#"); 

    // Filter the URLs found in the page; images, PDFs, CSS and JS are skipped so only HTML pages are kept
    $crawl->addURLFilterRule("#(jpg|gif|png|pdf|jpeg|svg|css|js)$# i"); 

    $crawl->enableCookieHandling(TRUE);

    // Sets a limit to the number of bytes the crawler should receive altogether during the crawling process.
    $crawl->setTrafficLimit(0);

    // Sets a limit to the total number of requests the crawler should execute.
    $crawl->setRequestLimit(20);

    // Sets the content-size-limit for content the crawler should receive from documents.
    $crawl->setContentSizeLimit(0);

    // Sets the timeout in seconds for waiting for data on an established server-connection.
    $crawl->setStreamTimeout(20);

    // Sets the timeout in seconds for connection tries to hosting webservers.
    $crawl->setConnectionTimeout(20);

    $crawl->obeyRobotsTxt(TRUE);
    $crawl->setUserAgentString("Mozilla/5.0 (Windows NT 6.3; rv:36.0) Gecko/20100101 Firefox/36.0");

    $crawl->go();

    // At the end, after the process is finished, we print a short 
    // report (see method getProcessReport() for more information) 
    $report = $crawl->getProcessReport(); 

    echo "Summary:".'<br/>'; 
    echo "Links followed: ".$report->links_followed.'<br/>'; 
    echo "Documents received: ".$report->files_received.'<br/>'; 
    echo "Bytes received: ".$report->bytes_received." bytes".'<br/>'; 
    echo "Process runtime: ".$report->process_runtime." sec".'<br/>';
    echo "Abort reason: ".$report->abort_reason.'<br/>';


    return array(
        'varstuff' => 'something'
    );
}

Here is my service class MyCrawler, in the DependencyInjection folder:

<?php

namespace AppBundle\DependencyInjection;

use PHPCrawler;
use PHPCrawlerDocumentInfo;

/**
 * Description of MyCrawler
 *
 * @author Norman
 */
class MyCrawler extends PHPCrawler{

    /**
     * Retrieves information about a URL
     * 
     * @param PHPCrawlerDocumentInfo $pageInfo
     */
    public function handleDocumentInfo(PHPCrawlerDocumentInfo $pageInfo)
    {                
        $page_url = $pageInfo->url;        
        $source = $pageInfo->source;
        $status = $pageInfo->http_status_code;

        // If the page is "OK" (no error code) and not empty, print the URL
        if($status == 200 && $source!=''){
            echo $page_url.'<br/>';

            flush();            
        }
    }    
}

I also searched the PHPCrawl forum on SourceForge for help, but without success so far. I should add that I am using PHPCrawl 0.83 from here:

https://github.com/mmerian/phpcrawl/

Here is the class where the problem seems to occur:

<?php
/**
 * Class for parsing robots.txt-files.
 *
 * @package phpcrawl
 * @internal
 */  
class PHPCrawlerRobotsTxtParser
{ 
  public function __construct()
  {
    // Init PageRequest-class
    if (!class_exists("PHPCrawlerHTTPRequest"))    include_once($classpath."/PHPCrawlerHTTPRequest.class.php");
    $this->PageRequest = new PHPCrawlerHTTPRequest();

  }

  /**
   * Parses a robots.txt-file and returns regular-expression-rules corresponding to the containing "disallow"-rules
   * that are adressed to the given user-agent.
   *
   * @param PHPCrawlerURLDescriptor $BaseUrl           The root-URL all rules from the robots-txt-file should relate to
   * @param string                  $user_agent_string The useragent all rules from the robots-txt-file should relate to
   * @param string                  $robots_txt_uri    Optional. The location of the robots.txt-file as URI.
   *                                                   If not set, the default robots.txt-file for the given BaseUrl gets parsed.
   *
   * @return array Numeric array containing regular-expressions for each "disallow"-rule defined in the robots.txt-file
   *               that's adressed to the given user-agent.
   */
  public function parseRobotsTxt(PHPCrawlerURLDescriptor $BaseUrl,   $user_agent_string, $robots_txt_uri = null)
  {
    PHPCrawlerBenchmark::start("processing_robotstxt");

    // If robots_txt_uri not given, use the default one for the given BaseUrl
    if ($robots_txt_uri === null)
      $robots_txt_uri = self::getRobotsTxtURL($BaseUrl->url_rebuild);

    // Get robots.txt-content
    $robots_txt_content = PHPCrawlerUtils::getURIContent($robots_txt_uri, $user_agent_string);

    $non_follow_reg_exps = array();

    // If content was found
    if ($robots_txt_content != null)
    {
      // Get all lines in the robots.txt-content that are adressed to our user-agent.
      $applying_lines = $this->getUserAgentLines($robots_txt_content, $user_agent_string);

      // Get valid reg-expressions for the given disallow-pathes.
      $non_follow_reg_exps = $this->buildRegExpressions($applying_lines, PHPCrawlerUtils::getRootUrl($BaseUrl->url_rebuild));
    }

    PHPCrawlerBenchmark::stop("processing_robots.txt");

    return $non_follow_reg_exps;
}

symfony

phpcrawl

1 Answer

Stack Overflow user

Answered 2015-03-18 05:59:20

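OK, I think I solved my own problem. What happens is that, when installed in Symfony2, the mmerian PHPCrawler package autoloads every class in its libs directory. There are two classes named PHPCrawlerUtils: the first is in its own folder, and the second lacks the getURIContent method. Once autoloading has finished, the second one takes precedence. In the main PHPCrawler class, the constructor loads each class it needs only "if the class doesn't already exist", which is why the correct class was never loaded. In the end, I included the PHPCrawlerUtils class unconditionally, regardless of whether it already exists.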
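The mechanism described in this answer can be reproduced in isolation. The following is a minimal sketch, not PHPCrawl code: the class names `DemoUtilsGuarded` and `DemoUtilsForced`, the temporary files, and the simulated classmap autoloader are all hypothetical stand-ins. It assumes the autoloader resolves the class name to the copy that lacks the method, which is what appears to have happened here. It relies on the fact that `class_exists()` triggers autoloading by default, so a guarded `include_once` never runs once any same-named class can be autoloaded.

```php
<?php
// Hypothetical stand-ins: nothing below is part of PHPCrawl itself.
$dir = sys_get_temp_dir() . '/phpcrawl_demo_' . uniqid();
mkdir($dir);

// Write two files declaring the same class name: a stale copy without
// getURIContent() and a full copy that has it.
function writeCopies(string $dir, string $class): void
{
    file_put_contents("$dir/$class.stale.php", "<?php class $class {}");
    file_put_contents(
        "$dir/$class.full.php",
        "<?php class $class { public static function getURIContent(\$uri) { return 'content of ' . \$uri; } }"
    );
}

writeCopies($dir, 'DemoUtilsGuarded');
writeCopies($dir, 'DemoUtilsForced');

// Simulated classmap autoloader that resolves every name to the stale copy.
spl_autoload_register(function ($class) use ($dir) {
    $file = "$dir/$class.stale.php";
    if (is_file($file)) {
        require_once $file;
    }
});

// Case 1: guarded include. class_exists() invokes the autoloader, which
// loads the stale copy, so the include of the full copy is skipped.
if (!class_exists('DemoUtilsGuarded')) {
    include_once "$dir/DemoUtilsGuarded.full.php";
}
$guardedHasMethod = method_exists('DemoUtilsGuarded', 'getURIContent');

// ReflectionClass::getFileName() shows which file actually defined the class.
$guardedLoadedFrom = basename((new ReflectionClass('DemoUtilsGuarded'))->getFileName());

// Case 2: unconditional include, done before anything references the class,
// so the autoloader is never consulted for this name.
require_once "$dir/DemoUtilsForced.full.php";
$forcedHasMethod = method_exists('DemoUtilsForced', 'getURIContent');

var_dump($guardedHasMethod);  // bool(false)
var_dump($guardedLoadedFrom); // string(25) "DemoUtilsGuarded.stale.php"
var_dump($forcedHasMethod);   // bool(true)

// Clean up the temporary files.
array_map('unlink', glob("$dir/*"));
rmdir($dir);
```

In a real project, the `ReflectionClass` check on `PHPCrawlerUtils` is a quick way to confirm which of the two files was loaded before deciding where to place the unconditional include.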

Votes: 0

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/29103301
