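I wrote this script to extract external links from a website (hrefs without http).
For parsing I use DOMDocument, since it is recommended over regular expressions, but I don't know whether the script is well written.
Here is the script: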
<?php
// It may take a while to spider a website ...
set_time_limit(10000);

// Include the phpcrawl-mainclass
include_once('../PHPCrawl_083/PHPCrawl_083/libs/PHPCrawler.class.php');
//include ('2.php');

// Extend the class and override the handleDocumentInfo()-method
class MyCrawler extends PHPCrawler
{
    function handleDocumentInfo(PHPCrawlerDocumentInfo $DocInfo) {
        if (PHP_SAPI == "cli") $lb = "\n";
        else {
            $lb = "<br />";
            $home_url = parse_url($DocInfo->url ,PHP_URL_HOST );
            $dom = new DOMDocument();
            $dom->loadHTML($DocInfo->url);
            // all links in document
            $dom->strictErrorChecking = FALSE;
            // Get all the links
            $links = $dom->getElementsByTagName("a");
            foreach($links as $link) {
                $href = $link->getAttribute("href");
                if (strpos( $home_url['host'], $href) == -1) {
                    echo $link ;
                }
            }
        }
    }
}

$crawler = new MyCrawler();
$crawler->setURL("http://tunisie-web.org");
$crawler->addURLFilterRule("#\.(jpg|gif|png|pdf|jpeg|css|js)$#i");
$crawler->setWorkingDirectory("C:/Users/mayss/Documents/travailcrawl/");
$crawler->go();
?>

Posted on 2021-08-12 11:42:19
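You actually already have it in your code. This is the part that gets the list of links from the HTML and reads the href attribute of each DOMDocument node: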
/** @var DOMNodeList $links */
$links = $dom->getElementsByTagName('a');
if ($links->length) {
    foreach($links as $link) {
        /** @var DOMElement $link */
        $href = $link->getAttribute('href');
        echo $href."<br>";
    }
} else {
    echo "No link found in HTML.";
}

https://stackoverflow.com/questions/31814135
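The loop above prints every href it finds. To keep only external links, one option is to compare the host of each href with the host of the crawled page. The following is a minimal sketch under a few assumptions, not a definitive implementation: `extractExternalLinks` is a hypothetical helper name (it is not part of PHPCrawl), the function expects the page's HTML source rather than its URL, and a link counts as external only if it carries a host that differs from the page's host. It relies on two standard behaviours: `parse_url()` with `PHP_URL_HOST` returns a string (or null when there is no host), and a "not found" result from functions such as `strpos()` is `false`, never `-1`.

```php
<?php
// Hypothetical helper (not part of PHPCrawl): returns the hrefs found in $html
// whose host differs from the host of $pageUrl.
function extractExternalLinks(string $html, string $pageUrl): array
{
    // With PHP_URL_HOST, parse_url() returns a string, or null/false on failure.
    $homeHost = parse_url($pageUrl, PHP_URL_HOST);
    if (!is_string($homeHost) || $html === '') {
        return [];
    }

    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html); // expects markup, not a URL
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $external = [];
    foreach ($dom->getElementsByTagName('a') as $link) {
        /** @var DOMElement $link */
        $href = trim($link->getAttribute('href'));
        $host = parse_url($href, PHP_URL_HOST);
        // Relative links ("/about", "#top") have no host, so they are treated as internal.
        if (is_string($host) && strcasecmp($host, $homeHost) !== 0) {
            $external[] = $href;
        }
    }

    // Drop duplicates and reindex the array.
    return array_values(array_unique($external));
}

// Usage with made-up sample markup:
$html = '<a href="/about">About</a>'
      . '<a href="http://tunisie-web.org/contact">Contact</a>'
      . '<a href="https://example.com/page">Elsewhere</a>';

// Only https://example.com/page has a host other than tunisie-web.org.
print_r(extractExternalLinks($html, 'http://tunisie-web.org/'));
```

Inside `handleDocumentInfo()` this would be called with the downloaded page body and `$DocInfo->url`; which property of `PHPCrawlerDocumentInfo` holds the body should be checked against the PHPCrawl documentation. Note that this sketch compares hosts literally, so `www.tunisie-web.org` and `tunisie-web.org` would count as different hosts.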