Are there any web crawlers suited for parsing many unstructured websites (news, articles) and extracting the main content block from them without previously defined rules?
I mean, when I parse a news feed, I want to extract the main content block from each article to do some NLP work. I have a lot of websites, and it would take forever to look into each of their DOM models and write rules for every one of them.
I was trying to use Scrapy to get all of the text placed in the body, without tags and scripts, but it includes a lot of irrelevant stuff, such as menu items, ad blocks, etc.:
site_body = selector.xpath('//body').extract_first()
Doing NLP over that kind of content will not be very precise.
So, are there any other tools or approaches for doing such tasks?
Posted on 2016-03-17 17:04:31
I tried to solve this with pattern matching: you annotate the source of an example web page itself and use that as the pattern to match against, so you do not need to write special extraction rules.
For example, if you look at the source of this page, you will see:
<td class="postcell">
<div>
<div class="post-text" itemprop="text">
<p>Are there any web-crawlers adapted for parsing many unstructured websites (news, articles) and extracting a main block of content from them without previously defined rules?</p>
Then remove the text, add {.} to mark that position as the relevant one, and you get:
<td class="postcell">
<div>
<div class="post-text" itemprop="text">
{.}
(Usually you would also need the closing tags, but for a single element they are unnecessary.)
Then pass that pattern to Xidel (Stack Overflow seems to block the default user agent, so it needs to be changed):
xidel 'http://stackoverflow.com/questions/36066030/web-crawler-for-unstructured-data' --user-agent "Mozilla/5.0 (compatible; Xidel)" -e '<td class="postcell"><div><div class="post-text" itemprop="text">{.}'
and it outputs your text:
Are there any web-crawlers adapted for parsing many unstructured websites (news, articles) and extracting a main block of content from them without previously defined rules?
I mean when I'm parsing a news feed, I want to extract the main content block from each article to do some NLP stuff. I have a lot of websites and it will take forever to look into their DOM model and write rules for each of them.
I was trying to use Scrapy and get all text without tags and scripts, placed in a body, but it include a lot of un-relevant stuff, like menu items, ad blocks, etc.
site_body = selector.xpath('//body').extract_first()
But doing NLP over such kind of content will not be very precise.
So is there any other tools or approaches for doing such tasks?
Posted on 2016-03-18 16:06:44
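If you would rather stay in Python, the same template position can be expressed as an XPath. Here is a minimal sketch with lxml (which Scrapy already depends on), run against an inline snippet modeled on the post-text markup above rather than the live page:

```python
from lxml import html

# Inline snippet modeled on the post-text markup shown above
snippet = '''
<div class="post-text" itemprop="text">
  <p>Are there any web-crawlers adapted for parsing many
  unstructured websites without previously defined rules?</p>
</div>
'''

doc = html.fromstring(snippet)
# XPath equivalent of the {.} position in the Xidel template
node = doc.xpath('//div[@class="post-text"]')[0]
# Collapse whitespace in the extracted text
print(' '.join(node.text_content().split()))
```

This prints only the question text, without any surrounding page chrome; on a real crawl you would build `doc` from the downloaded response body instead of an inline string.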
You can use BeautifulSoup in your parse() callback together with get_text():
from bs4 import BeautifulSoup, Comment
soup = BeautifulSoup(response.body, 'html.parser')
yield {'body': soup.get_text()}
You can also manually remove things you don't want (tags like <h1> or <b> may be useful signals if you decide to keep some markup):
# Remove invisible tags
for i in soup.findAll(lambda tag: tag.name in ['script', 'link', 'meta']):
    i.extract()
You can do something similar to whitelist only a few tags.
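A whitelist version of the same idea can be sketched like this; the tag set and the sample HTML are only illustrative, not a recommendation:

```python
from bs4 import BeautifulSoup

# Example whitelist; adjust to whatever signals matter for your corpus
KEEP = {'body', 'h1', 'p', 'b'}

html = "<body><h1>Title</h1><script>track()</script><p>Main text.</p><nav>menu</nav></body>"
soup = BeautifulSoup(html, 'html.parser')

# Drop every tag (with its contents) that is not in the whitelist
for tag in soup.find_all(True):
    if tag.name not in KEEP:
        tag.extract()

print(soup.get_text(' ', strip=True))  # Title Main text.
```

Note that extract() removes a tag together with its children, so whitelisting this way also discards any wanted text nested inside an unwanted container; use unwrap() instead if you only want to strip the markup but keep the text.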
https://stackoverflow.com/questions/36066030