I am using the python goose extractor, and it fails on every article from mashable.com and usatoday.com. Can anyone suggest a solution to this problem?
For the usatoday.com article:
from goose import Goose

g = Goose()
article = g.extract(url='http://www.usatoday.com/story/tech/columnist/talkingtech/2014/01/25/namm-2014---ik-multimedias-rings-to-make-music/4863193/')
assert article.cleaned_text == ''  # passes: no text was extracted
For the mashable article:
g = Goose()
article = g.extract(url='http://mashable.com/2014/01/26/square-cofounder-jim-mckelvey/')
assert article.cleaned_text == ''  # passes: no text was extracted
For the politicalwire article:
g = Goose()
article = g.extract(url='http://politicalwire.com/archives/2014/01/27/some_republicans_go_off_script_in_sotu_response.html')
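assert article.cleaned_text == ''  # passes: no text was extracted
I think these are all very important sites for text extraction. Can anyone give me a suggestion? Thanks.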
Posted on 2014-06-08 00:39:10
The latest version of Goose from here is able to extract text from usatoday.com and mashable.com.
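After upgrading, one way to confirm this is to loop over the URLs from the question and report which ones still come back empty. The following is only a minimal sketch: it assumes the `Goose().extract(url=...)` call and `cleaned_text` attribute shown in the question, and `goose_text`, `find_empty_extractions`, and `URLS` are hypothetical names introduced here for illustration.

```python
def goose_text(url):
    """Extract article text with Goose.

    The import is done lazily so the checker below can be exercised
    with a stand-in extractor when goose is not installed.
    """
    from goose import Goose  # same import the question's snippets rely on
    return Goose().extract(url=url).cleaned_text


def find_empty_extractions(urls, extract=goose_text):
    """Return the URLs for which `extract` yields no text or raises."""
    failed = []
    for url in urls:
        try:
            text = extract(url)
        except Exception:
            # Assumption: network or parsing errors are treated as failures.
            text = ''
        if not (text or '').strip():
            failed.append(url)
    return failed


# The three URLs from the question.
URLS = [
    'http://www.usatoday.com/story/tech/columnist/talkingtech/2014/01/25/namm-2014---ik-multimedias-rings-to-make-music/4863193/',
    'http://mashable.com/2014/01/26/square-cofounder-jim-mckelvey/',
    'http://politicalwire.com/archives/2014/01/27/some_republicans_go_off_script_in_sotu_response.html',
]

# Usage (requires goose and network access):
# for url in find_empty_extractions(URLS):
#     print('still empty: ' + url)
```

An empty result list would mean every URL produced some text. The answer only mentions usatoday.com and mashable.com, so the politicalwire URL may still show up as a failure.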
https://stackoverflow.com/questions/21397893