I'm trying to build a news scraper with BS4. I can fetch the HTML from the website (CNN). Here is my code:
from bs4 import BeautifulSoup
import requests
url = "https://www.cnn.com/"
topic = input("What kind of news are you looking for")
result = requests.get(url)
doc = BeautifulSoup(result.text, "html.parser")
prices = doc.find_all(text = f"{topic}")
parent = prices[0].parent
print(parent)
But it gives me this error:
xxx@xxx-xxx xxx % python3 news_scraper.py
What kind of news are you looking for?Coronavirus
Traceback (most recent call last):
  File "/xxx/xxx/xxx/xxx/news_scraper.py", line 10, in <module>
    parent = prices[0].parent
IndexError: list index out of range
I don't know what is causing this. Thanks!
Posted on 2022-09-01 01:33:53
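If the string topic is not found on the page, prices will be an empty list, so prices[0] raises the IndexError. To fix this, check whether prices is empty before indexing into it, like this: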
from bs4 import BeautifulSoup
import requests
url = "https://www.cnn.com/"
topic = input("What kind of news are you looking for")
result = requests.get(url)
doc = BeautifulSoup(result.text, "html.parser")
# find_all returns an empty list when no text node matches the topic
prices = doc.find_all(text=topic)
if len(prices) != 0:
    # Safe to index: at least one match was found
    parent = prices[0].parent
    print(parent)
else:
    print("No news of that topic was found.")
Posted on 2022-09-01 01:48:01
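I think the problem is that most of the CNN page is dynamic, and Beautiful Soup cannot read dynamically generated content. The sections at the bottom of the page are native to that URL, and the code works fine on those. Dynamic pages need something like Selenium.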
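For the sections that are present in the static HTML, a lookup can still come back empty for another reason: given a plain string, find_all(text=...) only matches text nodes that equal that string exactly, so a headline that merely contains the topic is skipped. Below is a minimal sketch of a case-insensitive substring search. The HTML snippet and the helper name find_topic are made up for illustration (this is not real CNN markup), and string= is the newer name for the text= argument (Beautiful Soup 4.4.0 and later).

```python
import re
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page; not real CNN markup.
SAMPLE_HTML = """
<html><body>
  <h3><a href="/a">Coronavirus cases rise again</a></h3>
  <h3><a href="/b">Markets close higher</a></h3>
  <p>Coronavirus</p>
</body></html>
"""

def find_topic(html, topic):
    """Return the parent tags of all text nodes that contain `topic`."""
    doc = BeautifulSoup(html, "html.parser")
    # A compiled regex is searched inside each text node, so partial
    # matches count; re.escape keeps user input from acting as a pattern.
    pattern = re.compile(re.escape(topic), re.IGNORECASE)
    matches = doc.find_all(string=pattern)
    return [m.parent for m in matches]

if __name__ == "__main__":
    for tag in find_topic(SAMPLE_HTML, "coronavirus"):
        print(tag)
```

With the sample above, an exact-string search for "Coronavirus" would only find the <p> element, while the regex version also picks up the headline link.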
Posted on 2022-09-01 04:39:39
doc = BeautifulSoup(result.text, "html.parser")
print(doc)
Add the print and look at the output: there is no DOM content, only JS and CSS.
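Reading a full page dump by eye can be tedious, so here is a rough sketch of the same check done in code: drop the <script> and <style> tags and see what text is left. The helper name visible_text and the sample HTML are assumptions for illustration only. If the remaining text is empty or does not contain the topic, the content is most likely rendered by JavaScript and requests alone will not see it.

```python
from bs4 import BeautifulSoup

def visible_text(html):
    """Return the page text left after dropping <script> and <style> tags."""
    doc = BeautifulSoup(html, "html.parser")
    # Calling a BeautifulSoup object is shorthand for find_all().
    for tag in doc(["script", "style"]):
        tag.decompose()  # remove the tag and its contents from the tree
    return doc.get_text(" ", strip=True)

# Hypothetical page whose body is filled in by JavaScript at load time.
JS_ONLY_HTML = """
<html><head><style>body { margin: 0; }</style></head>
<body><div id="root"></div><script>render("Coronavirus");</script></body></html>
"""

if __name__ == "__main__":
    text = visible_text(JS_ONLY_HTML)
    print(repr(text))
    print("Coronavirus" in text)
```

In the real scraper, the same function could be called on result.text before searching for the topic.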
https://stackoverflow.com/questions/73563318