Q: Parsing a website using DOMDocument

Stack Overflow user

Asked 2015-08-05 00:05:52

Answers: 1 | Views: 57 | Followers: 0 | Votes: 0

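I wrote this script to extract the external links (hrefs without http) from a website.

For the parsing I use DOMDocument, since it is recommended over regular expressions, but I don't know whether the script is well written.

Here is the script: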

<?php

// It may take a while to spider a website ...
set_time_limit(10000);

// Include the PHPCrawl main class
include_once('../PHPCrawl_083/PHPCrawl_083/libs/PHPCrawler.class.php');
//include ('2.php');

// Extend the class and override the handleDocumentInfo() method
class MyCrawler extends PHPCrawler
{
    function handleDocumentInfo(PHPCrawlerDocumentInfo $DocInfo)
    {
        // Line break that suits the current output (console or browser)
        $lb = (PHP_SAPI == "cli") ? "\n" : "<br />";

        // Host of the page being crawled; with PHP_URL_HOST,
        // parse_url() returns a string, not an array
        $home_host = parse_url($DocInfo->url, PHP_URL_HOST);

        // Collect warnings about malformed markup instead of printing them
        libxml_use_internal_errors(true);

        // loadHTML() expects the page markup, not the URL
        $dom = new DOMDocument();
        if ($DocInfo->source == '' || !$dom->loadHTML($DocInfo->source)) {
            return;
        }

        // Get all the links
        $links = $dom->getElementsByTagName("a");
        foreach ($links as $link) {
            $href = $link->getAttribute("href");
            $href_host = parse_url($href, PHP_URL_HOST);

            // A link is external when it names a host other than ours;
            // relative links have no host and are skipped
            if ($href_host && strcasecmp($href_host, $home_host) != 0) {
                echo $href . $lb;
            }
        }
    }
}

$crawler = new MyCrawler();
$crawler->setURL("http://tunisie-web.org");
$crawler->addURLFilterRule("#\.(jpg|gif|png|pdf|jpeg|css|js)$#i");
$crawler->setWorkingDirectory("C:/Users/mayss/Documents/travailcrawl/");
$crawler->go();

?>

Tags: php, regex, domdocument

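1 Answer

Stack Overflow user

Answered 2021-08-12 11:42:19

You actually already have it in your code. This is the part that gets the list of links from the HTML and reads the href attribute of each one with DOMDocument: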

/** @var DOMNodeList $links */
$links = $dom->getElementsByTagName('a');
if ($links->length) {
    foreach ($links as $link) {
        /** @var DOMElement $link */
        $href = $link->getAttribute('href');
        echo $href . "<br>";
    }
} else {
    echo "No link found in HTML.";
}
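
To keep only the external links, one option is to compare the host of each href with the host of the page itself. The following is a minimal sketch, not the asker's or the library's method: the function name `externalLinks` and the sample HTML are made up for illustration, and it assumes that "external" means "names a different host". Note that `loadHTML()` is given the page markup, not a URL.

```php
<?php

/**
 * Return the href values of all <a> elements in $html that point to a
 * host other than the host of $pageUrl. (Hypothetical helper.)
 */
function externalLinks(string $html, string $pageUrl): array
{
    if ($html === '') {
        return [];
    }

    $homeHost = (string) parse_url($pageUrl, PHP_URL_HOST);

    // Collect parse warnings internally instead of printing them;
    // real-world HTML is rarely well-formed.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);   // expects markup, not a URL
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $external = [];
    foreach ($dom->getElementsByTagName('a') as $link) {
        $href = $link->getAttribute('href');
        $host = parse_url($href, PHP_URL_HOST);

        // Relative links have no host and therefore count as internal.
        if (is_string($host) && strcasecmp($host, $homeHost) !== 0) {
            $external[] = $href;
        }
    }

    return $external;
}

// Sample markup (made up for this example)
$html = '<html><body>'
      . '<a href="/about.html">About</a>'
      . '<a href="http://tunisie-web.org/contact">Contact</a>'
      . '<a href="https://example.com/page">Elsewhere</a>'
      . '</body></html>';

print_r(externalLinks($html, 'http://tunisie-web.org/'));
```

Comparing hosts with `parse_url()` avoids the pitfall of testing `strpos()` against `-1`: `strpos()` returns `false` when nothing is found, so such a comparison never matches. Subdomains such as `www.` are treated as different hosts here; whether that is desirable depends on the site.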

Votes: 0

Original page content provided by Stack Overflow.

Original link: https://stackoverflow.com/questions/31814135
