I wrote a very simple PHP crawler, but I have a problem with a memory leak. The code is:
<?php
require_once 'db.php';

$homepage = 'https://example.com';
$query = "SELECT * FROM `crawled_urls`";
$response = @mysqli_query($dbc, $query);
$already_crawled = [];
$crawling = [];

while ($row = mysqli_fetch_array($response)) {
    $already_crawled[] = $row['crawled_url'];
    $crawling[] = $row['crawled_url'];
}
function follow_links($url) {
    global $already_crawled;
    global $crawling;
    global $dbc;

    $doc = new DOMDocument();
    $doc->loadHTML(file_get_contents($url));
    $linklist = $doc->getElementsByTagName('a');

    foreach ($linklist as $link) {
        $l = $link->getAttribute("href");
        $full_link = 'https://example.com'.$l;
        if (!in_array($full_link, $already_crawled)) {
            // TODO: Fetch data from the crawled url and store it in the DB. Check if it was already stored.
            $query = 'INSERT INTO `crawled_urls`(`id`, `crawled_url`) VALUES (NULL,\'' . $full_link . '\')';
            $stmt = mysqli_prepare($dbc, $query);
            mysqli_stmt_execute($stmt);
            echo $full_link.PHP_EOL;
        }
    }

    array_shift($crawling);
    foreach ($crawling as $link) {
        follow_links($link);
    }
}
follow_links($homepage);

Can you help me and share a way to avoid this huge memory leak? When I start the process everything is fine, but memory usage climbs steadily until it reaches 100%.
Posted on 2018-11-21 11:20:12
You need to unset $doc when you no longer need it:
function follow_links($url) {
    global $already_crawled;
    global $crawling;
    global $dbc;

    $doc = new DOMDocument();
    $doc->loadHTML(file_get_contents($url));
    $linklist = $doc->getElementsByTagName('a');
    unset($doc);

    foreach ($linklist as $link) {
        $l = $link->getAttribute("href");
        $full_link = 'https://example.com'.$l;
        if (!in_array($full_link, $already_crawled)) {
            // TODO: Fetch data from the crawled url and store it in the DB. Check if it was already stored.
            $query = 'INSERT INTO `crawled_urls`(`id`, `crawled_url`) VALUES (NULL,\'' . $full_link . '\')';
            $stmt = mysqli_prepare($dbc, $query);
            mysqli_stmt_execute($stmt);
            echo $full_link.PHP_EOL;
        }
    }

    array_shift($crawling);
    foreach ($crawling as $link) {
        follow_links($link);
    }
}
follow_links($homepage);

Explanation: you are using recursion, which means you are building up a call stack of functions. If that stack is 20 calls deep, the resources held by all 20 invocations stay allocated at the same time; the deeper the recursion goes, the more memory you use. $doc is the main offender, but you may also want to review how your other variables are used and make sure nothing stays allocated unnecessarily when the function calls itself again.
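Following the explanation above, here is a minimal sketch of the same traversal without recursion, so that each page's resources can be released before the next page is processed. The fetch_links() helper is a hypothetical stand-in for the DOMDocument parsing, and an in-memory site map replaces real HTTP fetches so the sketch is self-contained:

```php
// Sketch: iterative breadth-first crawl using a worklist instead of recursion.
// fetch_links() is a hypothetical stand-in for fetching and parsing a page;
// here it reads from an in-memory site map so the example needs no network.
function fetch_links(string $url, array $site): array {
    return $site[$url] ?? [];
}

function crawl(string $start, array $site): array {
    $already_crawled = [];       // keyed by URL for O(1) membership checks
    $queue = [$start];           // the worklist of URLs still to visit

    while ($queue) {
        $url = array_shift($queue);         // take the next URL off the worklist
        if (isset($already_crawled[$url])) {
            continue;                        // skip pages we have already seen
        }
        $already_crawled[$url] = true;

        foreach (fetch_links($url, $site) as $link) {
            if (!isset($already_crawled[$link])) {
                $queue[] = $link;            // enqueue instead of recursing
            }
        }
        // Any DOMDocument created inside this iteration would go out of scope
        // here, so at most one parsed page is held in memory at a time.
    }
    return array_keys($already_crawled);
}
```

As a side effect, the keyed isset() lookup is also O(1), whereas the original in_array() scan gets slower as the list of crawled URLs grows.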
Posted on 2018-11-21 11:18:23
Try to unset the $doc variable before calling the function again:
function follow_links($url) {
    global $already_crawled;
    global $crawling;
    global $dbc;

    $doc = new DOMDocument();
    $doc->loadHTML(file_get_contents($url));
    $linklist = $doc->getElementsByTagName('a');

    foreach ($linklist as $link) {
        $l = $link->getAttribute("href");
        $full_link = 'https://example.com'.$l;
        if (!in_array($full_link, $already_crawled)) {
            // TODO: Fetch data from the crawled url and store it in the DB. Check if it was already stored.
            $query = 'INSERT INTO `crawled_urls`(`id`, `crawled_url`) VALUES (NULL,\'' . $full_link . '\')';
            $stmt = mysqli_prepare($dbc, $query);
            mysqli_stmt_execute($stmt);
            echo $full_link.PHP_EOL;
        }
    }

    array_shift($crawling);
    unset($doc);
    foreach ($crawling as $link) {
        follow_links($link);
    }
}

Posted on 2018-11-21 11:20:36
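As a rough, self-contained illustration of why the unset() matters: PHP frees a value as soon as its last reference is gone, so every recursive stack frame that still holds $doc keeps a whole parsed page alive. The sketch below uses a 1 MB string as a stand-in for a parsed DOMDocument (nothing is actually fetched) and watches the allocator with memory_get_usage():

```php
// Sketch: observing that unset() returns memory to PHP's allocator.
// The big string is a stand-in for the DOMDocument of a parsed page.
$before = memory_get_usage();

$doc = str_repeat('x', 1 << 20);    // allocate roughly 1 MB
$during = memory_get_usage();

unset($doc);                        // drop the only reference
$after = memory_get_usage();

echo "held: "  . ($during - $before) . " bytes" . PHP_EOL;
echo "freed: " . ($during - $after)  . " bytes" . PHP_EOL;
```

In the recursive crawler, the same effect repeats once per stack frame: without the unset(), N levels of recursion hold N parsed pages at once.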
The main problem with your code is the use of recursion. Because of it, you keep the old pages in memory even though you no longer need them.

Try to remove the recursion. That should be relatively easy, since you already use a list to store your links. Personally, though, I would use a single worklist and represent each URL as an object.

A few other things: respect robots.txt, limit your crawl rate, and make sure you don't crawl the same page more than once. And if you want to use this code for anything beyond education, I recommend using a library; that will be easier than building a crawler from scratch.
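For the "don't crawl the same page more than once" point, trivial variants of a URL need to collapse to one key before the dedupe check. A naive sketch (the normalisation rules here are illustrative, not a full RFC 3986 resolver):

```php
// Sketch: collapse trivial URL variants so a dedupe check actually works.
function normalize_url(string $base, string $href): string {
    if (preg_match('#^https?://#i', $href)) {
        $url = $href;                              // already absolute
    } elseif (strpos($href, '/') === 0) {
        $url = rtrim($base, '/') . $href;          // root-relative link
    } else {
        $url = rtrim($base, '/') . '/' . $href;    // path-relative link
    }
    $url = preg_replace('/#.*$/', '', $url);       // drop the fragment
    return rtrim($url, '/');                       // drop a trailing slash
}
```

Note that the original code's plain concatenation ('https://example.com'.$l) would also mangle hrefs that are already absolute, so a resolver like this guards against duplicate rows as well as broken URLs.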
https://stackoverflow.com/questions/53410681