I wrote a very simple PHP crawler, but I have a problem with a memory leak. The code is:
<?php
require_once 'db.php';

$homepage = 'https://example.com';
$query = "SELECT * FROM `crawled_urls`";
$response = @mysqli_query($dbc, $query);
$already_crawled = [];
$crawling = [];

while ($row = mysqli_fetch_array($response)) {
    $already_crawled[] = $row['crawled_url'];
    $crawling[] = $row['crawled_url'];
}
function follow_links($url) {
    global $already_crawled;
    global $crawling;
    global $dbc;

    $doc = new DOMDocument();
    $doc->loadHTML(file_get_contents($url));
    $linklist = $doc->getElementsByTagName('a');

    foreach ($linklist as $link) {
        $l = $link->getAttribute("href");
        $full_link = 'https://example.com'.$l;
        if (!in_array($full_link, $already_crawled)) {
            // TODO: Fetch data from the crawled url and store it in the DB. Check if it was already stored.
            $query = 'INSERT INTO `crawled_urls`(`id`, `crawled_url`) VALUES (NULL,\'' . $full_link . '\')';
            $stmt = mysqli_prepare($dbc, $query);
            mysqli_stmt_execute($stmt);
            echo $full_link.PHP_EOL;
        }
    }

    array_shift($crawling);
    foreach ($crawling as $link) {
        follow_links($link);
    }
}
follow_links($homepage);

Can you help me and share a way to avoid this huge memory leak? When I start the process everything is fine, but memory usage climbs steadily until it reaches 100%.
Posted on 2018-11-21 11:20:12
You need to unset $doc when you no longer need it:
function follow_links($url) {
    global $already_crawled;
    global $crawling;
    global $dbc;

    $doc = new DOMDocument();
    $doc->loadHTML(file_get_contents($url));
    $linklist = $doc->getElementsByTagName('a');
    unset($doc);

    foreach ($linklist as $link) {
        $l = $link->getAttribute("href");
        $full_link = 'https://example.com'.$l;
        if (!in_array($full_link, $already_crawled)) {
            // TODO: Fetch data from the crawled url and store it in the DB. Check if it was already stored.
            $query = 'INSERT INTO `crawled_urls`(`id`, `crawled_url`) VALUES (NULL,\'' . $full_link . '\')';
            $stmt = mysqli_prepare($dbc, $query);
            mysqli_stmt_execute($stmt);
            echo $full_link.PHP_EOL;
        }
    }

    array_shift($crawling);
    foreach ($crawling as $link) {
        follow_links($link);
    }
}
follow_links($homepage);

Explanation: you are using recursion, which means you are building up a call stack of functions. If that stack is 20 calls deep, the resources held by all 20 invocations stay allocated at the same time; the deeper the recursion goes, the more memory you use. $doc is the main offender, but you may also want to review how your other variables are used and make sure nothing stays allocated unnecessarily when the function calls itself again.
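Following the explanation above, here is a minimal sketch of the same traversal without recursion, so that each page's resources can be released before the next page is processed. The fetch_links() helper is a hypothetical stand-in for the DOMDocument parsing, and an in-memory site map replaces real HTTP fetches so the sketch is self-contained:

```php
// Sketch: iterative breadth-first crawl using a worklist instead of recursion.
// fetch_links() is a hypothetical stand-in for fetching and parsing a page;
// here it reads from an in-memory site map so the example needs no network.
function fetch_links(string $url, array $site): array {
    return $site[$url] ?? [];
}

function crawl(string $start, array $site): array {
    $already_crawled = [];       // keyed by URL for O(1) membership checks
    $queue = [$start];           // the worklist of URLs still to visit

    while ($queue) {
        $url = array_shift($queue);         // take the next URL off the worklist
        if (isset($already_crawled[$url])) {
            continue;                        // skip pages we have already seen
        }
        $already_crawled[$url] = true;

        foreach (fetch_links($url, $site) as $link) {
            if (!isset($already_crawled[$link])) {
                $queue[] = $link;            // enqueue instead of recursing
            }
        }
        // Any DOMDocument created inside this iteration would go out of scope
        // here, so at most one parsed page is held in memory at a time.
    }
    return array_keys($already_crawled);
}
```

As a side effect, the keyed isset() lookup is also O(1), whereas the original in_array() scan gets slower as the list of crawled URLs grows.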
Posted on 2018-11-21 11:18:23
Try to unset the $doc variable before calling the function again:
function follow_links($url) {
    global $already_crawled;
    global $crawling;
    global $dbc;

    $doc = new DOMDocument();
    $doc->loadHTML(file_get_contents($url));
    $linklist = $doc->getElementsByTagName('a');

    foreach ($linklist as $link) {
        $l = $link->getAttribute("href");
        $full_link = 'https://example.com'.$l;
        if (!in_array($full_link, $already_crawled)) {
            // TODO: Fetch data from the crawled url and store it in the DB. Check if it was already stored.
            $query = 'INSERT INTO `crawled_urls`(`id`, `crawled_url`) VALUES (NULL,\'' . $full_link . '\')';
            $stmt = mysqli_prepare($dbc, $query);
            mysqli_stmt_execute($stmt);
            echo $full_link.PHP_EOL;
        }
    }

    array_shift($crawling);
    unset($doc);
    foreach ($crawling as $link) {
        follow_links($link);
    }
}

Posted on 2018-11-21 11:20:36
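As a rough, self-contained illustration of why the unset() matters: PHP frees a value as soon as its last reference is gone, so every recursive stack frame that still holds $doc keeps a whole parsed page alive. The sketch below uses a 1 MB string as a stand-in for a parsed DOMDocument (nothing is actually fetched) and watches the allocator with memory_get_usage():

```php
// Sketch: observing that unset() returns memory to PHP's allocator.
// The big string is a stand-in for the DOMDocument of a parsed page.
$before = memory_get_usage();

$doc = str_repeat('x', 1 << 20);    // allocate roughly 1 MB
$during = memory_get_usage();

unset($doc);                        // drop the only reference
$after = memory_get_usage();

echo "held: "  . ($during - $before) . " bytes" . PHP_EOL;
echo "freed: " . ($during - $after)  . " bytes" . PHP_EOL;
```

In the recursive crawler, the same effect repeats once per stack frame: without the unset(), N levels of recursion hold N parsed pages at once.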
The main problem with your code is the use of recursion. Because of it, you keep the old pages in memory even though you no longer need them.

Try to remove the recursion. That should be relatively easy, since you already use a list to store your links. Personally, though, I would use a single worklist and represent each URL as an object.

A few other things: respect robots.txt, limit your crawl rate, and make sure you don't crawl the same page more than once. And if you want to use this code for anything beyond education, I recommend using a library; that will be easier than building a crawler from scratch.
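For the "don't crawl the same page more than once" point, trivial variants of a URL need to collapse to one key before the dedupe check. A naive sketch (the normalisation rules here are illustrative, not a full RFC 3986 resolver):

```php
// Sketch: collapse trivial URL variants so a dedupe check actually works.
function normalize_url(string $base, string $href): string {
    if (preg_match('#^https?://#i', $href)) {
        $url = $href;                              // already absolute
    } elseif (strpos($href, '/') === 0) {
        $url = rtrim($base, '/') . $href;          // root-relative link
    } else {
        $url = rtrim($base, '/') . '/' . $href;    // path-relative link
    }
    $url = preg_replace('/#.*$/', '', $url);       // drop the fragment
    return rtrim($url, '/');                       // drop a trailing slash
}
```

Note that the original code's plain concatenation ('https://example.com'.$l) would also mangle hrefs that are already absolute, so a resolver like this guards against duplicate rows as well as broken URLs.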
https://stackoverflow.com/questions/53410681