
Web scraping without knowledge of page structure

Stack Overflow user

Asked 2014-05-28 21:13:59

Answers: 2, Views: 3.5K, Followers: 0, Votes: 8

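I'm trying to teach myself a concept by writing a script. Basically, I'm trying to write a Python script that, given a few keywords, will crawl web pages until it finds the data I need. For example, say I want to find a list of snakes that live in the US. I might run my script with the keywords list,venomous,snakes,US, and I want to be able to trust, with at least 80% certainty, that it will return a list of snakes in the US.

I already know how to implement the web spider part. I just want to learn how to determine a page's relevance without knowing its structure. I have researched web scraping techniques, but they all seem to assume knowledge of the page's HTML tag structure. Is there some algorithm that would let me extract data from a page and determine its relevance?

Any pointers would be greatly appreciated. I'm using Python with urllib and BeautifulSoup.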

python

web-scraping

beautifulsoup

web-crawler

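2 Answers

Stack Overflow user

Answered 2014-05-28 21:36:57

Using a crawler like scrapy (just to handle the concurrent downloads), you can write a simple spider like the one below, and Wikipedia is probably a good starting point. This script is a complete example that uses scrapy, nltk, and whoosh. It never stops on its own, and it indexes every page it follows with whoosh so you can search them later. It's a small Google: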

# Written for Python 2 (print statements, the urlparse module).
__author__ = "Farsheed Ashouri"
import os
import sys
import re
## Spider libraries
from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from main.items import MainItem
from scrapy.http import Request
from urlparse import urljoin
## indexer libraries
from whoosh.index import create_in, open_dir
from whoosh.fields import *
## html to text conversion module
import nltk

def open_writer():
    if not os.path.isdir("indexdir"):
        os.mkdir("indexdir")
        schema = Schema(title=TEXT(stored=True), content=TEXT(stored=True))
        ix = create_in("indexdir", schema)
    else:
        ix = open_dir("indexdir")
    return ix.writer()

class Main(BaseSpider):
    name        = "main"
    allowed_domains = ["en.wikipedia.org"]
    start_urls  = ["http://en.wikipedia.org/wiki/Snakes"]
    
    def parse(self, response):
        writer = open_writer()  ## for indexing
        sel = Selector(response)
        email_validation = re.compile(r'^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,4})$')
        #general_link_validation = re.compile(r'')
        # Links already queued from this page (a set, rebuilt on every call to parse)
        crawledLinks    = set()
        titles = sel.xpath('//div[@id="content"]//h1[@id="firstHeading"]//span/text()').extract()
        contents = sel.xpath('//body/div[@id="content"]').extract()
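        # Skip pages that lack either a title or a content block;
        # otherwise title/content would be undefined further down.
        if not titles or not contents:
            return
        title = titles[0]
        content = contents[0]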
        links   = sel.xpath('//a/@href').extract()

        
        for link in links:
            # If it is a proper link and is not checked yet, yield it to the Spider
            url = urljoin(response.url, link)
            #print url
            ## our url must not have any ":" character in it. link /wiki/talk:company
            if not url in crawledLinks and re.match(r'http://en.wikipedia.org/wiki/[^:]+$', url):
                crawledLinks.add(url)
                #print url, depth
                yield Request(url, self.parse)
        item = MainItem()
        item["title"] = title
        print '*'*80
        print 'crawled: %s | it has %s links.' % (title, len(links))
        #print content
        print '*'*80
        item["links"] = list(crawledLinks)
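        # Index only the text of the page. Note: nltk.clean_html exists in NLTK 2.x only;
        # NLTK 3 removed it and points to BeautifulSoup's get_text() instead.
        writer.add_document(title=title, content=nltk.clean_html(content))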
        #print crawledLinks
        writer.commit()
        yield item
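
The link filter in the loop above is easy to try in isolation. The following is a minimal Python 3 sketch, using only the standard library, of what that `urljoin` plus `re.match` step keeps and drops. The helper `article_links` and the pattern `ARTICLE_ANY_SCHEME` are hypothetical names added for illustration and are not part of the answer. One thing the sketch makes visible: the original pattern only accepts `http://` URLs, so if the site redirects to https (as Wikipedia does today), every joined link is rejected unless the pattern is relaxed.

```python
import re
from urllib.parse import urljoin

# Same pattern as the spider: an article URL with no ":" after /wiki/.
ARTICLE = re.compile(r'http://en.wikipedia.org/wiki/[^:]+$')
# Hypothetical variant (an assumption, not in the answer) that also accepts https.
ARTICLE_ANY_SCHEME = re.compile(r'https?://en\.wikipedia\.org/wiki/[^:]+$')


def article_links(base_url, hrefs, pattern=ARTICLE):
    """Resolve hrefs against base_url and keep each matching URL once, in order."""
    seen = set()
    kept = []
    for href in hrefs:
        url = urljoin(base_url, href)
        if url not in seen and pattern.match(url):
            seen.add(url)
            kept.append(url)
    return kept


hrefs = ["/wiki/Cobra", "/wiki/Talk:Snake", "/wiki/Cobra", "http://example.com/"]

# Page fetched over http: the article link survives, the Talk: page,
# the duplicate, and the external link are dropped.
print(article_links("http://en.wikipedia.org/wiki/Snake", hrefs))
# Page fetched over https: the original pattern rejects everything.
print(article_links("https://en.wikipedia.org/wiki/Snake", hrefs))
# The relaxed pattern keeps the article link again.
print(article_links("https://en.wikipedia.org/wiki/Snake", hrefs, ARTICLE_ANY_SCHEME))
```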

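Votes: 5

Stack Overflow user

Answered 2014-05-28 21:32:16

You're basically asking "how do I write a search engine." This is... not trivial.

The right way to do this is to use the search API from Google (or Bing, or Yahoo!, or ...) and show the top results. But if you're just working on a personal project to teach yourself some concepts (not sure exactly which ones those would be), here are a few suggestions: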

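- Search the text content of the appropriate tags (<p>, <div>, and so on) for relevant keywords (duh).
- Use relevant keywords to check for the presence of tags that might contain what you're looking for. For example, if you're looking for a list of things, a page containing <ul> or <ol> or even <table> might be a good candidate.
- Build a synonym dictionary and search each page for synonyms of your keywords too. Limiting yourself to "US" might mean an artificially low ranking for a page that only says "America".
- Keep a list of words that are not in your set of keywords, and give a higher rank to pages that contain the most of them. These pages are (arguably) more likely to contain the answer you're looking for.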
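
As a rough illustration of the first three suggestions, here is a minimal Python 3 sketch that scores a page without knowing its layout. It uses only the standard library's `html.parser` so it runs anywhere; BeautifulSoup could do the same text extraction. The names `SYNONYMS`, `PageStats`, and `relevance` are hypothetical, the synonym table is a toy, and the score is just the share of keywords matched, not a calibrated probability.

```python
import re
from html.parser import HTMLParser

# Hypothetical synonym table; a real one would be far larger.
SYNONYMS = {
    "us": {"us", "usa", "america", "american"},
    "venomous": {"venomous", "poisonous"},
    "snakes": {"snakes", "snake", "serpents"},
}
LIST_TAGS = {"ul", "ol", "table"}


class PageStats(HTMLParser):
    """Collect visible words and count list-like tags, whatever the layout."""

    def __init__(self):
        super().__init__()
        self.words = []
        self.list_tags = 0
        self._skip = 0  # depth inside <script>/<style>, whose text is not visible

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag in LIST_TAGS:
            self.list_tags += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.words.extend(re.findall(r"[a-z]+", data.lower()))


def relevance(html, keywords):
    """Return the fraction of keywords matched by the page, between 0 and 1.

    A keyword matches if it or one of its synonyms appears in the visible
    text. The keyword "list" also matches when list-like markup is present.
    """
    if not keywords:
        return 0.0
    stats = PageStats()
    stats.feed(html)
    vocabulary = set(stats.words)
    hits = 0
    for keyword in keywords:
        key = keyword.lower()
        if key == "list" and stats.list_tags > 0:
            hits += 1
        elif SYNONYMS.get(key, {key}) & vocabulary:
            hits += 1
    return hits / len(keywords)


keywords = ["list", "venomous", "snakes", "US"]
page = """<html><body><h1>Poisonous snakes of America</h1>
<ul><li>Copperhead</li><li>Timber rattlesnake</li></ul>
<script>var snakes = "US";</script></body></html>"""

print(relevance(page, keywords))                      # 1.0
print(relevance("<p>Snake care tips</p>", keywords))  # 0.25
```

A crawler could stop at the first page whose score clears a chosen threshold such as 0.8. Lowercasing is a deliberate simplification: it lets "US" match the ordinary word "us", so a real scorer would need something smarter.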

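Good luck (you'll need it)!

Votes: 2

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link: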

https://stackoverflow.com/questions/23921986
