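When I run the Python code
import newspaper
print(len(newspaper.build('http://cnn.com', memoize_articles=False).articles))
exit()
in Python 3, I get the output 897 (i.e., newspaper3k found 897 articles for the domain http://cnn.com), but when I run
import newspaper
print(len(newspaper.build('http://www.cnn.com', memoize_articles=False).articles))
exit()
(i.e., with the extra www.; nothing else changed) I get only 895. The numbers stay consistent when I switch back and forth between the two URLs. Does the www. in the URL really matter? If so, why are the article counts for the two URLs so similar when using the newspaper3k library? If not, why aren't the article counts exactly the same?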
Answered on 2020-09-14 05:45:54
As shown below, several URLs in the www-less resource appear in two variants:
with www
without www
import newspaper
# Collect every article URL found for the www-less domain
artcls = newspaper.build('https://cnn.com', memoize_articles=False).articles
# Drop 'www.' so that both variants of a URL compare equal
urls = [a.url.replace('www.', '') for a in artcls]
duplicated = set()
for u in urls:
    if urls.count(u) > 1:
        duplicated.add(u)
for d in duplicated:
    print(d)
Result:
https://cnn.com/business/media
https://cnn.com/travel/news
https://cnn.com/travel/article/hong-kong-cbd-cafe-found-wellness-intl-hnk/index.html
https://cnn.com/travel/article/rent-fire-lookout-towers-covid-19/index.html
https://stackoverflow.com/questions/63875180
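The duplicate check above can also be sketched without a network call, which makes it easier to test. This is only a rough illustration: the helper names strip_www and find_www_duplicates and the sample list are hypothetical (the last sample URL is made up), and the sample stands in for [a.url for a in artcls]. Unlike str.replace, this version removes www. only from the host part of the URL, so a path that happens to contain "www." is left alone.

```python
from collections import Counter
from urllib.parse import urlsplit, urlunsplit


def strip_www(url):
    """Return url with a leading 'www.' removed from the host only."""
    parts = urlsplit(url)
    host = parts.netloc
    if host.lower().startswith("www."):
        host = host[4:]
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))


def find_www_duplicates(urls):
    """Return, sorted, the normalized URLs that occur more than once in urls."""
    counts = Counter(strip_www(u) for u in urls)
    return sorted(u for u, n in counts.items() if n > 1)


# Hypothetical sample standing in for [a.url for a in artcls]
sample_urls = [
    "https://www.cnn.com/business/media",
    "https://cnn.com/business/media",
    "https://cnn.com/travel/news",
    "https://www.cnn.com/travel/news",
    "https://cnn.com/world",
]

for url in find_www_duplicates(sample_urls):
    print(url)
```

Counting with Counter takes a single pass over the list, whereas calling urls.count(u) inside a loop rescans the whole list for every URL. For a few hundred articles either approach is fast enough.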