Q: Getting as many search engine results as possible for a survey

Stack Overflow user

Asked on 2013-05-27 18:27:16

2 answers, 552 views, 0 followers, 1 vote

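For a survey, I am trying to scrape Google search results from my PHP page. I fetch 6 results, then click the Next button to get the next page of records. But after 9 pages, e.g. 64 results, it gives the following error: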

stdClass Object
(
[responseData] => 
[responseDetails] => out of range start
[responseStatus] => 400
)

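I just want as much data as possible. I don't mind whether it comes from Google or any other search engine, but I need a large result set to get accurate survey results. Does anyone know how I can do that?

Is it possible to fetch the results via cron? Is there any other way?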

php

search-engine

2 Answers

Stack Overflow user

Accepted answer

Posted on 2013-05-30 16:18:23

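Caveat

Google tries to prevent scraping, so servers get blocked and requests are dropped when it suspects scraping. This is therefore only suitable if you occasionally need a handful of Google results. See google-scraper.squabbel.com for a proxy-based scraper and more information on Google's blocking mechanism. Scraping also violates Google's policy, so it is not advisable.

The Google API does not allow more than 64 results, so if you need more, you have to scrape the site yourself. Since this was a fun project, I created a class that does it for you.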
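The 64-result cap shows up in the API response itself, as in the error from the question. Here is a minimal sketch of detecting it, assuming the response has been decoded with `json_decode` into an object carrying the `responseStatus` and `responseDetails` fields shown in the question. The helper name `isOutOfRange` and the sample JSON are illustrative and not part of any Google library.

```php
<?php
// Hypothetical helper: returns true when a decoded API response reports
// that the requested start offset is past the allowed range.
function isOutOfRange(?stdClass $response): bool
{
    if ($response === null) {
        return false; // nothing was decoded, so it is not this particular error
    }
    $status  = $response->responseStatus ?? null;
    $details = $response->responseDetails ?? '';

    return $status === 400 && stripos((string) $details, 'out of range') !== false;
}

// Sample payload mirroring the error in the question (assumed JSON shape).
$json = '{"responseData": null, "responseDetails": "out of range start", "responseStatus": 400}';
$response = json_decode($json);

if (isOutOfRange($response)) {
    echo "No more results available from the API\n";
}
```

A paging loop could call such a check after every request and stop cleanly instead of dumping the error object.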

The class requires the free PHP Simple HTML DOM Parser, so you need to download that code as well.

It outputs an array like the following:

array(100) {
  [0]=>
  array(3) {
    ["title"]=>
    string(67) "Online Tests - Online aptitude tests for interview, competitive ..."
    ["href"]=>
    string(36) "http://www.indiabix.com/online-test/"
    ["description"]=>
    string(168) "Online aptitude tests for competitive examination, entrance examination and 
    campus interview. Take various tests and find out how much you score before 
    you ... "
  }
  [1]=>
  array(3) {
    ["title"]=>
    string(37) "Test your English - Cambridge English"
    ["href"]=>
    string(50) "http://www.cambridgeenglish.org/test-your-english/"
    ["description"]=>
    string(179) "Test Your English. This is a quick, free online test. It will tell you which Cambridge 
    English exam may be best for you. Click 'Begin Test' and answer each of the ... "
  }

  //removed for better visibility

}

Usage:

<?php
require_once('GoogleScraper.php');

//start the scraper for google.com (english results)
$gs = new GoogleScraper();

//start the scraper for google.nl (dutch results)
//$gs = new GoogleScraper('https://www.google.nl');

//set your search query
$gs->SearchQuery('online exams');

//start loading the pages. You can enter any integer above 0
$gs->LoadPages(10);

//dump the results; it's just an array, so you can also do other things with it
echo '<pre>';
var_dump($gs->GetResults());
echo '</pre>';
?>

And then GoogleScraper.php:

<?php
require_once('simple_html_dom.php');
class GoogleScraper
{
  private $_results;
  private $_baseUrl;
  private $_searchQuery;
  private $_resultsPerPage;

  /**
   *  constructor
   *  I use the constructor to set all the defaults to keep it all in one place
   */
  final public function __construct($baseUrl='')
  {
    $this->_results = array();
    $this->_resultsPerPage = 100;

    if (empty($baseUrl)) {
      $this->_baseUrl = 'https://www.google.com';
    } else {
      $this->_baseUrl = $baseUrl;
    }
  }

  /**
   *  cleanup
   */
  final public function __destruct()
  {
    unset($this->_results);
    unset($this->_baseUrl);
    unset($this->_searchQuery);
  }

  /**
   *  Set the query
   */
  final public function SearchQuery($searchQuery)
  {
    if (!(is_string($searchQuery) || is_numeric($searchQuery)))
    {
      throw new Exception('Invalid query type');      
    }

    $this->_searchQuery = $searchQuery;
  }

  /**
   *  Set the number of results per page
   */
  final public function ResultsPerPage($resultsPerPage)
  {
    if (!is_int($resultsPerPage) || $resultsPerPage<10 || $resultsPerPage>100)
    {
      throw new Exception('Results per page must be value between 10 and 100');      
    }

    $this->_resultsPerPage = $resultsPerPage;
  }

  /**
   *  Get the result
   */
  final public function GetResults()
  {
    return $this->_results;
  }


  /**
   *  Scrape the search results
   */
  final public function LoadPages($pages=1)
  {
    if (!is_int($pages) || $pages<1)
    {
      throw new Exception('Invalid number of pages');      
    }
    if (empty($this->_searchQuery))
    {
      throw new Exception('Missing search query');      
    }

    $url = $this->_baseUrl . '/search?num='.$this->_resultsPerPage.'&q=' . urlencode($this->_searchQuery);
    $currentPage = 1;
    while($pages--) {
      if ($content = $this->LoadUrl($url)) {
        /*
        Load content in to simple html dom
        */    
        $html = new simple_html_dom();
        $html->load($content);

        /*
        Find and handle search results
        */  
        $items = $html->find('div#ires li');
        foreach($items as $item) {
          /*
          Only normal search results have this container. Special results like found images or news dont have it.
          */  
          $check = $item->find('div.s');
          if (count($check)!=1) {
            continue;
          }

          $head = $item->find('h3.r a', 0);
          $result['title'] = $head->plaintext;

          /*
          If we dont have a title, there is no point in continuing
          */  
          if (empty($result['title'])) {
            continue;
          }

          $result['href'] = $head->href;

          /*
          Check if we can parse the URL for the actual url
          */  
          if (!empty($result['href'])) {
            $qs = explode('?', $result['href']);
            if (!empty($qs[1])) {
              parse_str($qs[1], $querystring);
              if (!empty($querystring['q'])) {
                $result['href'] = $querystring['q'];
              }
            }
          }

          /*
          Try to find the description
          */  
          $info = $item->find('span.st', 0);
          $result['description'] = $info ? $info->plaintext : '';

          /*
          Add the results to the total
          */
          $this->_results[] = $result;
        }

        /*
        Find next page
        */              
        $url = $this->_baseUrl . '/search?num='.$this->_resultsPerPage.'&q=' . urlencode($this->_searchQuery) . '&start=' . ($currentPage*$this->_resultsPerPage);
      } else {
        throw new Exception('Failed to load page');
      }

      $currentPage++;
    }
  }


  /**
   *  Load the url
   */
  private function LoadUrl($url)
  {
    if (!is_string($url))
    {
      throw new Exception('Invalid url');      
    }

    $options['http'] = array(
      'user_agent' => "GoogleScraper",
      'timeout' => 0.5
    );
    $context = stream_context_create($options);

    $content = file_get_contents($url, false, $context);
    if (!empty($http_response_header))
    {
      return (substr_count($http_response_header[0], ' 200 OK')>0) ? $content : false;
    }

    return false;    
  }

}
?>
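The href handling inside `LoadPages()` can be tried in isolation. The sketch below mirrors that logic: if the link carries a `q` query parameter, that value is taken as the real target, otherwise the link is returned unchanged. The function name `extractTargetUrl` and the sample links are illustrative assumptions, and the actual link format on a live results page may differ.

```php
<?php
// Hypothetical helper mirroring the href handling in LoadPages():
// if the link carries a "q" query parameter, return that value,
// otherwise return the link unchanged.
function extractTargetUrl(string $href): string
{
    $parts = explode('?', $href, 2);
    if (empty($parts[1])) {
        return $href; // no query string to inspect
    }

    parse_str($parts[1], $params);

    if (!empty($params['q']) && is_string($params['q'])) {
        return $params['q'];
    }

    return $href;
}

echo extractTargetUrl('/url?q=http://www.indiabix.com/online-test/&sa=U'), "\n";
echo extractTargetUrl('http://www.cambridgeenglish.org/test-your-english/'), "\n";
```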

Check this PHP Fiddle to see it in action. Because that server is probably used for this quite often, Google may respond with 503 errors.
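`LoadUrl()` only distinguishes "200 OK" from everything else. To react to a 503 specifically, the status code can be read from the first entry of `$http_response_header`. This is a rough sketch; the helper name `statusCodeFromLine` is made up for illustration, and treating 503 as a signal to back off is an assumption about how one might want to respond.

```php
<?php
// Hypothetical helper: pull the numeric status code out of an HTTP
// status line such as the first entry of $http_response_header.
function statusCodeFromLine(string $statusLine): ?int
{
    if (preg_match('#^HTTP/\d+(?:\.\d+)?\s+(\d{3})#', $statusLine, $m)) {
        return (int) $m[1];
    }

    return null; // not a recognisable status line
}

foreach (['HTTP/1.1 200 OK', 'HTTP/1.1 503 Service Unavailable'] as $line) {
    $code = statusCodeFromLine($line);
    echo $line, ' => ', ($code === 503 ? 'blocked or overloaded, back off' : 'code ' . $code), "\n";
}
```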

2 votes

Stack Overflow user

Posted on 2013-05-31 19:49:21

You should add a sleep(1) between calls as a cooldown, otherwise you may get banned. Have you considered getting a Google API key through the official route?
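One way to apply that cooldown is to wrap the fetching in a small loop that pauses between consecutive calls. The sketch below is not tied to the class above: the function name `fetchAllWithCooldown`, its parameters, and the stand-in fetcher are all hypothetical. The pause is injectable so it can be swapped out in tests, and it defaults to PHP's built-in `sleep()`.

```php
<?php
// Hypothetical helper: call $fetch for every URL and pause between
// consecutive calls. $pause is injectable so the delay can be replaced
// in tests; by default it is PHP's built-in sleep().
function fetchAllWithCooldown(array $urls, callable $fetch, int $seconds = 1, ?callable $pause = null): array
{
    $pause = $pause ?? 'sleep';
    $results = [];
    $first = true;

    foreach ($urls as $url) {
        if (!$first) {
            $pause($seconds); // cool down before every call except the first
        }
        $first = false;
        $results[$url] = $fetch($url);
    }

    return $results;
}

// Demo with a stand-in fetcher (no network access) and a zero-second pause.
$pages = fetchAllWithCooldown(
    ['page-1', 'page-2'],
    function (string $url): string { return "content of $url"; },
    0
);
print_r($pages);
```

In the scraper itself, the same effect could be had with a `sleep(1)` at the end of each iteration of the `while` loop in `LoadPages()`.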

0 votes

Original page content provided by Stack Overflow; translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/16771142
