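I have never had the chance to learn web scraping. What can I add to the code below to get the headlines from a given time period? It would be great if I could get only financial news!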
import bs4
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen
news_url="https://news.google.com/news/rss"
Client=urlopen(news_url)
xml_page=Client.read()
Client.close()
soup_page=soup(xml_page,"xml")
news_list=soup_page.findAll("item")
# Print news title, url and publish date
for news in news_list:
    print(news.title.text)
    print(news.link.text)
    print(news.pubDate.text)
    print("-"*60)

Posted on 2019-05-08 09:17:41
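Here is my attempt at a solution. Check the get_headlines(start_date, end_date) function I included at the end.
I convert the pubDate string scraped from the XML into a datetime object and compare it with other datetime objects that I specify, which yields boolean values. Those booleans tell us whether an article falls within our range, so we can select only those articles.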
import bs4
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen
from datetime import datetime
news_url="https://news.google.com/news/rss"
Client=urlopen(news_url)
xml_page=Client.read()
Client.close()
soup_page=soup(xml_page,"xml")
news_list=soup_page.findAll("item")
# Print news title, url and publish date
for news in news_list:
    print(news.title.text)
    print(news.link.text)
    print(news.pubDate.text)
    print("-"*60)

    # Sample comparison: parse pubDate and check it against a date range.
    # %b matches abbreviated month names ("May", "Jun"); %H:%M:%S does not depend on the locale.
    pub_date = datetime.strptime(news.pubDate.text, "%a, %d %b %Y %H:%M:%S %Z")
    print("Date Object: ", pub_date)
    sample_end_date = "Wed, 08 May 2019 18:17:04 GMT"
    # True when the article's datetime is earlier than the end date
    print(datetime.strptime(sample_end_date, "%a, %d %b %Y %H:%M:%S %Z") > pub_date)
    sample_start_date = "Wed, 08 May 2019 00:00:00 GMT"
    # True when the article's datetime is later than the start date
    print(datetime.strptime(sample_start_date, "%a, %d %b %Y %H:%M:%S %Z") < pub_date)
    # If both values are True, the article falls within the range we specified. If not, it falls outside the range.
def get_headlines(start_date=None, end_date=None):
    # Prompt when the function is called; input() used as a default argument
    # would run only once, at the moment the function is defined.
    if start_date is None:
        start_date = input("Enter start date. \nFollow this format exactly for date input Wed, 08 May 2019 18:17:04 GMT: \n")
    if end_date is None:
        end_date = input("Enter end date. \n")
    start_date_object = datetime.strptime(start_date, "%a, %d %b %Y %H:%M:%S %Z")
    end_date_object = datetime.strptime(end_date, "%a, %d %b %Y %H:%M:%S %Z")
    news_url="https://news.google.com/news/rss"
    Client=urlopen(news_url)
    xml_page=Client.read()
    Client.close()
    soup_page=soup(xml_page,"xml")
    news_list=soup_page.findAll("item")
    # Print title, url and publish date of every article inside the range
    print(f"All articles from {start_date_object} to {end_date_object}: \n")
    for news in news_list:
        pub_date = datetime.strptime(news.pubDate.text, "%a, %d %b %Y %H:%M:%S %Z")
        if start_date_object < pub_date < end_date_object:
            print(news.title.text)
            print(news.link.text)
            print(news.pubDate.text)
            print("-"*60)

get_headlines()

Here is sample output for the range from Wednesday midnight GMT to Wednesday 18:00 GMT:
Enter start date.
Follow this format exactly for date input Wed, 08 May 2019 18:17:04 GMT:
Wed, 08 May 2019 00:00:00 GMT
Enter end date.
Wed, 08 May 2019 18:17:04 GMT
All articles from 2019-05-08 00:00:00 to 2019-05-08 18:17:04:
Trump briefed on Colorado shooting, White House says - Fox News
https://www.foxnews.com/us/trump-briefed-on-colorado-shooting-white-house-says-politicians-offer-condolences
Wed, 08 May 2019 08:08:22 GMT
Iranian leader announces partial withdrawal from nuclear deal - CNN
https://www.cnn.com/2019/05/08/middleeast/iran-nuclear-deal-intl/index.html
Wed, 08 May 2019 09:40:00 GMT
China backtracked on nearly all aspects of US trade deal: sources - CNBC
https://www.cnbc.com/2019/05/08/china-backtracked-on-nearly-all-aspects-of-us-trade-deal-sources.html
Wed, 08 May 2019 03:12:00 GMT
Barr's top aide has seen the Russia investigation as few in Trump's world have - POLITICO
https://www.politico.com/story/2019/05/08/brian-rabbitt-william-barr-1309751
Wed, 08 May 2019 09:03:00 GMT
White House instructs former counsel not to comply with congressional subpoena - ABC News
https://abcnews.go.com/Politics/white-house-instruct-counsel-comply-congressional-subpoena/story?id=62873987
Wed, 08 May 2019 00:10:00 GMT
Trump holds rally in Florida Panhandle as disaster relief funding stalls - NPR
https://www.npr.org/2019/05/08/720803270/as-hurricane-relief-stalls-in-d-c-trump-to-rally-base-in-florida-panhandle
Wed, 08 May 2019 09:01:00 GMT
Colorado STEM school student Brendan Bialy helped disarm gunman - NBC News
https://www.nbcnews.com/news/us-news/colorado-stem-school-student-brendan-bialy-helped-disarm-gunman-n1003181
Wed, 08 May 2019 09:52:00 GMT
Pompeo makes surprise visit to Iraq amid rising Iran tensions - Aljazeera.com
https://www.aljazeera.com/news/2019/05/pompeo-surprise-iraq-visit-rising-iran-tensions-190508034718722.html
Wed, 08 May 2019 04:18:00 GMT
In South Africa's election, Ramaphosa faces the verdict of disillusioned voters - The New York Times
https://www.nytimes.com/2019/05/08/world/africa/south-africa-election.html
Wed, 08 May 2019 07:39:49 GMT
Lahore blast: At least six killed in explosion near Sufi shrine - CNN
https://www.cnn.com/2019/05/08/asia/lahore-blast-intl/index.html
Wed, 08 May 2019 06:15:00 GMT
Uber drivers to protest worldwide over the company's $90 billion IPO - CNBC
https://www.cnbc.com/2019/05/08/uber-drivers-strike-over-low-wages-benefits-ahead-of-ipo.html
Wed, 08 May 2019 08:51:13 GMT
Billionaire Charlie Munger compares Bitcoin investors to Judas Iscariot - Ethereum World News
https://ethereumworldnews.com/billionaire-charlie-munger-compares-bitcoin-investors-to-judas-iscariot/
Wed, 08 May 2019 00:21:12 GMT
Android TV will benefit once Assistant is linked to live TV guide data - Engadget
https://www.engadget.com/2019/05/08/google-assistant-epg-android-tv-play-store/
Wed, 08 May 2019 08:35:40 GMT
Kim Kardashian prison reform: Kim Kardashian West has helped free 17 people from prison in the last 90 days - CBS News
https://www.cbsnews.com/news/kim-kardashian-west-has-helped-free-17-people-from-prison-in-the-last-90-days/
Wed, 08 May 2019 05:07:00 GMT
Vampire Weekend perform with Haim on "Fallon": Watch - Pitchfork
https://pitchfork.com/news/vampire-weekend-perform-with-haim-on-fallon-watch/
Wed, 08 May 2019 04:40:00 GMT
George Clooney reveals Harry and Meghan's royal baby shares his birthday - Daily Mail
https://www.dailymail.co.uk/tvshowbiz/article-7004777/George-Clooney-reveals-Prince-Harry-Meghan-Markles-newborn-shares-birthday.html
Wed, 08 May 2019 06:59:43 GMT
Oakland A's pitcher Mike Fiers throws second career no-hitter, beats Reds - Fox News
https://www.foxnews.com/sports/athletics-fiers-pitching-no-hitter-beats-reds
Wed, 08 May 2019 06:36:18 GMT
Joe Namath hasn't had a drink since his embarrassing moment on live TV - NBC Sports
http://profootballtalk.nbcsports.com/2019/05/07/joe-namath-hasnt-had-a-drink-since-his-embarrassing-moment-on-live-tv/
Wed, 08 May 2019 01:28:00 GMT
Mariners fall to .500 with another bullpen collapse in 5-4 loss in the Bronx - Seattle Times
https://www.seattletimes.com/sports/mariners/mariners-fall-to-500-with-another-bullpen-collapse-that-leads-to-5-4-loss-in-bronx/
Wed, 08 May 2019 03:20:00 GMT
Move over, silicon switches: There's a new way to compute - Phys.org
https://phys.org/news/2019-05-silicon.html
Wed, 08 May 2019 07:19:31 GMT
NASA asteroid: Space agency reveals bold asteroid defence plan - "Ideal target" - Express.co.uk
https://www.express.co.uk/news/science/1123704/NASA-asteroid-double-redirection-test-NASA-DART-asteroid-Didymos
Wed, 08 May 2019 07:43:00 GMT
RFK Jr. is our brother and uncle. Sadly, he's wrong about vaccines. - POLITICO
https://www.politico.com/magazine/story/2019/05/08/robert-kennedy-jr-measles-vaccines-226798
Wed, 08 May 2019 09:05:00 GMT
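As a possible variation on the date comparison used in this answer, the pubDate strings can also be parsed with the standard library's email.utils.parsedate_to_datetime, which reads RFC 822 style dates and returns timezone-aware datetimes for values ending in GMT. This is only a sketch: the helper name in_range and the sample date list are made up for illustration, and treating an offset-less date as UTC is an assumption.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def in_range(pub_date_text, start, end):
    """Return True when an RFC 822 date string lies strictly between start and end."""
    published = parsedate_to_datetime(pub_date_text)
    # Some inputs (e.g. a "-0000" offset) give a naive datetime; treat those as UTC (assumption).
    if published.tzinfo is None:
        published = published.replace(tzinfo=timezone.utc)
    return start < published < end


# Same window as the sample run: Wednesday midnight GMT to 18:17:04 GMT.
start = datetime(2019, 5, 8, 0, 0, 0, tzinfo=timezone.utc)
end = datetime(2019, 5, 8, 18, 17, 4, tzinfo=timezone.utc)

# Illustrative inputs only, not taken from a live feed.
sample_dates = [
    "Wed, 08 May 2019 09:40:00 GMT",  # inside the window
    "Tue, 07 May 2019 23:59:59 GMT",  # before the window
    "Wed, 08 May 2019 18:17:04 GMT",  # equal to the end, excluded by the strict comparison
]
results = [in_range(text, start, end) for text in sample_dates]
print(results)  # [True, False, False]
```

Inside the loop over news_list, the same check would be written as in_range(news.pubDate.text, start, end). Because both sides of the comparison carry a timezone, feeds that report an offset other than GMT would still compare correctly.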
Posted on 2019-05-08 09:41:40
Try feedparser
import feedparser
news_url=r'https://news.google.com/news/rss'
fp = feedparser.parse(news_url)
## number of entries
len(fp['entries'])
Output:
38
Title of the article at index 0:
print(fp['entries'][0]['title'])
Output:
School Shooting in Colorado Leaves 1 Student Dead and 7 Injured - The New York Times
Print all the information for the entry at index 0:
fp['entries'][0]
Output:
{'title': 'School Shooting in Colorado Leaves 1 Student Dead and 7 Injured - The New York Times',
'title_detail': {'type': 'text/plain',
'language': None,
'base': 'https://news.google.com/rss?hl=en-US&gl=US&ceid=US:en',
'value': 'School Shooting in Colorado Leaves 1 Student Dead and 7 Injured - The New York Times'},
'links': [{'rel': 'alternate',
'type': 'text/html',
'href': 'https://www.nytimes.com/2019/05/07/us/colorado-school-shooting.html'}],
'link': 'https://www.nytimes.com/2019/05/07/us/colorado-school-shooting.html',
'id': '52780288859641',
'guidislink': False,
'published': 'Wed, 08 May 2019 00:56:15 GMT',
'published_parsed': time.struct_time(tm_year=2019, tm_mon=5, tm_mday=8, tm_hour=0, tm_min=56, tm_sec=15, tm_wday=2, tm_yday=128, tm_isdst=0),
'summary': '<ol><li><a href="https://www.nytimes.com/2019/05/07/us/colorado-school-shooting.html" target="_blank">School Shooting in Colorado Leaves 1 Student Dead and 7 Injured</a> <font color="#6f6f6f">The New York Times</font></li><li><a href="https://www.foxnews.com/us/injuries-reported-unstable-situation-shots-fired-at-colorado-school-sheriff-says" target="_blank">Colorado school shooting leaves at least 1 dead, 7 injured, 2 in custody, sheriff\'s office says</a> <font color="#6f6f6f">Fox News</font></li><li><a href="https://www.cnn.com/2019/05/07/us/colorado-denver-area-school-shooting/index.html" target="_blank">Eight injured in school shooting in suburban Denver, 2 suspects are in custody</a> <font color="#6f6f6f">CNN</font></li><li><a href="https://kdvr.com/2019/05/07/president-trump-briefed-on-highlands-ranch-school-shooting/" target="_blank">President Trump briefed on Highlands Ranch school shooting</a> <font color="#6f6f6f">FOX 31 Denver</font></li><li><a href="https://www.oregonlive.com/nation/2019/05/sheriff-school-shooting-near-denver-injures-at-least-7.html" target="_blank">Sheriff: School shooting near Denver injures at least 7</a> <font color="#6f6f6f">OregonLive</font></li><li><strong><a href="https://news.google.com/stories/CAAqcQgKImtDQklTU2pvSmMzUnZjbmt0TXpZd1NqMEtFUWo1dV9ueWpZQU1FVWE5TGp2Z2NDNFJFaWhUYUc5MGN5Qm1hWEpsWkNCaGRDQnpZMmh2YjJ3Z2FXNGdTR2xuYUd4aGJtUnpJRkpoYm1Ob0tBQVAB?oc=5" target="_blank">View full coverage on Google News</a></strong></li></ol>',
'summary_detail': {'type': 'text/html',
'language': None,
'base': 'https://news.google.com/rss?hl=en-US&gl=US&ceid=US:en',
'value': '<ol><li><a href="https://www.nytimes.com/2019/05/07/us/colorado-school-shooting.html" target="_blank">School Shooting in Colorado Leaves 1 Student Dead and 7 Injured</a> <font color="#6f6f6f">The New York Times</font></li><li><a href="https://www.foxnews.com/us/injuries-reported-unstable-situation-shots-fired-at-colorado-school-sheriff-says" target="_blank">Colorado school shooting leaves at least 1 dead, 7 injured, 2 in custody, sheriff\'s office says</a> <font color="#6f6f6f">Fox News</font></li><li><a href="https://www.cnn.com/2019/05/07/us/colorado-denver-area-school-shooting/index.html" target="_blank">Eight injured in school shooting in suburban Denver, 2 suspects are in custody</a> <font color="#6f6f6f">CNN</font></li><li><a href="https://kdvr.com/2019/05/07/president-trump-briefed-on-highlands-ranch-school-shooting/" target="_blank">President Trump briefed on Highlands Ranch school shooting</a> <font color="#6f6f6f">FOX 31 Denver</font></li><li><a href="https://www.oregonlive.com/nation/2019/05/sheriff-school-shooting-near-denver-injures-at-least-7.html" target="_blank">Sheriff: School shooting near Denver injures at least 7</a> <font color="#6f6f6f">OregonLive</font></li><li><strong><a href="https://news.google.com/stories/CAAqcQgKImtDQklTU2pvSmMzUnZjbmt0TXpZd1NqMEtFUWo1dV9ueWpZQU1FVWE5TGp2Z2NDNFJFaWhUYUc5MGN5Qm1hWEpsWkNCaGRDQnpZMmh2YjJ3Z2FXNGdTR2xuYUd4aGJtUnpJRkpoYm1Ob0tBQVAB?oc=5" target="_blank">View full coverage on Google News</a></strong></li></ol>'},
'source': {'href': 'https://www.nytimes.com', 'title': 'The New York Times'}}
https://stackoverflow.com/questions/56031974
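To connect the feedparser entries shown above back to the original question, here is a minimal sketch that filters entries by their published_parsed value and, optionally, by keywords in the title. The function names (entry_datetime, filter_entries, fetch_and_filter), the keyword list, and the second sample entry are hypothetical. It assumes published_parsed is a UTC time.struct_time, which is my understanding of feedparser's date handling, and that every entry has that key. The keyword match is a crude stand-in for "financial news only", not a Google News feature.

```python
import calendar
import time
from datetime import datetime, timezone


def entry_datetime(entry):
    """Convert an entry's published_parsed struct_time (assumed UTC) to an aware datetime."""
    return datetime.fromtimestamp(calendar.timegm(entry["published_parsed"]), tz=timezone.utc)


def filter_entries(entries, start, end, keywords=()):
    """Keep entries published within [start, end]; if keywords are given, the title must contain one."""
    selected = []
    for entry in entries:
        if not start <= entry_datetime(entry) <= end:
            continue
        title = entry["title"].lower()
        if keywords and not any(word.lower() in title for word in keywords):
            continue
        selected.append(entry)
    return selected


def fetch_and_filter(url, start, end, keywords=()):
    """Download a feed with feedparser (third-party package) and filter its entries."""
    import feedparser  # imported lazily so the helpers above run without it
    return filter_entries(feedparser.parse(url)["entries"], start, end, keywords)


# Stand-in entries shaped like the output above, so the filter can be tried offline.
sample_entries = [
    {"title": "School Shooting in Colorado Leaves 1 Student Dead and 7 Injured - The New York Times",
     "published_parsed": time.strptime("2019-05-08 00:56:15", "%Y-%m-%d %H:%M:%S")},
    {"title": "Hypothetical headline about the stock market - Example Source",
     "published_parsed": time.strptime("2019-05-08 12:00:00", "%Y-%m-%d %H:%M:%S")},
]

start = datetime(2019, 5, 8, 0, 0, 0, tzinfo=timezone.utc)
end = datetime(2019, 5, 8, 18, 17, 4, tzinfo=timezone.utc)

in_window = filter_entries(sample_entries, start, end)
financial = filter_entries(sample_entries, start, end, keywords=("stock", "market", "trade"))
print(len(in_window), len(financial))  # 2 1
```

With a live feed, the call would look like fetch_and_filter(news_url, start, end, keywords=("stock", "market")). Keep in mind that the feed only contains the entries it currently serves (38 in the run above), so a date range can only narrow that set, not reach further back in time.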