Q: Getting the ten most recent codepad.org posts with Easy Html Parser (EHP) in Python

Stack Overflow user

Asked on 2016-03-29 22:02:39

1 answer, 157 views, 0 followers, 0 votes

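I found a Python HTML parser that builds a DOM-like structure from HTML source; it looks easy to use and is very fast. I am trying to write a scraper for codepad.org that retrieves the last 10 posts from http://codepad.org/recent.

The EHP library is at https://github.com/iogf/ehp. I have the code below, and it works.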

import requests
from ehp import Html

def catch_refs(data):
    html = Html()
    dom = html.feed(data)

    return [ind.attr['href']
                for ind in dom.find('a')
                    if 'view' in ind.text()]

def retrieve_source(refs, dir):
    """
    Get the source code of the posts then save in a dir.
    """
    pass


if __name__ == '__main__':
    req  = requests.get('http://codepad.org/recent')
    refs = catch_refs(req.text)
    retrieve_source(refs, '/tmp/')
    print(refs)

It outputs:

[u'http://codepad.org/aQGNiQ6t', 
 u'http://codepad.org/HMrG1q7t', 
 u'http://codepad.org/zGBMaKoZ', ...]

That is what I expected, but I cannot figure out how to download the source code of these posts.

python

html

parsing

scraper

1 Answer

Stack Overflow user

Answered on 2016-03-29 22:09:58

Actually, your retrieve_source(refs, dir) does nothing: its body is just pass.

So you do not get any results.

Update, based on your comment:

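import os

import requests
from ehp import Html


def get_code_snippet(page):
    dom = Html().feed(page)
    # Collect every <div class="highlight"> on the page.
    elements = [el for el in dom.find('div')
                if el.attr['class'] == 'highlight']
    # The second matching div is taken as the code snippet.
    return elements[1].text()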

def retrieve_source(refs, dir):
    for i, ref in enumerate(refs):
        with open(os.path.join(dir, str(i) + '.html'), 'w') as r:
            r.write(get_code_snippet(requests.get(ref).content))
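
A slightly more defensive variant of retrieve_source is sketched below, assuming Python 3. The names save_snippets, fetch, and extract are hypothetical and are not part of EHP or requests: fetch stands in for something like `lambda url: requests.get(url).text`, and extract stands in for get_code_snippet above. Files are named after the last path segment of each URL (the paste ID) rather than a running index, and a post that fails to download or parse is skipped instead of aborting the whole run.

```python
import os
from urllib.parse import urlparse


def save_snippets(refs, out_dir, fetch, extract):
    """Download each ref, extract its snippet, and save it under out_dir.

    fetch(url) -> page text and extract(page) -> snippet text are supplied
    by the caller (hypothetical hooks, e.g. requests and get_code_snippet).
    Returns the list of file paths that were written.
    """
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for ref in refs:
        # 'http://codepad.org/aQGNiQ6t' -> 'aQGNiQ6t'
        paste_id = urlparse(ref).path.strip('/').split('/')[-1]
        if not paste_id:
            continue  # skip URLs with no usable path segment
        try:
            snippet = extract(fetch(ref))
        except Exception as exc:  # keep going if one post fails
            print('skipping %s: %s' % (ref, exc))
            continue
        path = os.path.join(out_dir, paste_id + '.txt')
        with open(path, 'w', encoding='utf-8') as out:
            out.write(snippet)
        written.append(path)
    return written
```

With the code from the question, this could be called as `save_snippets(refs, '/tmp/codepad', lambda u: requests.get(u).text, get_code_snippet)`. Passing the download and parsing steps in as functions also makes the file-writing logic easy to try out without network access.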

0 votes

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud Xiaowei's IT-domain engine.

Original link:

https://stackoverflow.com/questions/36286640
