I want to extract the username, post title, post time, and message content from a Dell community forum thread for a specific date, and store them in an Excel file.
I want to extract the post title: "I am getting time sync errror and the last synced time shown as a day in 2015".
And the comment details (username, post time, message) for 10-25-2022 only, without any other comments.
I am new to this. So far I have only managed to extract the information (without usernames), with no date filter.
import csv  # needed for the csv.writer used below
import requests
from bs4 import BeautifulSoup
url = "https://www.dell.com/community/Inspiron-Desktops/I-am-getting-time-sync-errror-and-the-last-synced-time-shown-as/m-p/8290678#M36017"
result = requests.get(url)
doc = BeautifulSoup(result.text, "html.parser")
###### time ######
time = doc.find_all('span', attrs={'class':'local-time'})
print(time)
##################
##### date #######
date = doc.find_all('span', attrs={'class':'local-date'})
print(date)
#################
#### message ######
article_text = ''
article = doc.find_all("div", {"class":"lia-message-body-content"})
for element in article:
    article_text += '\n' + ''.join(element.find_all(text=True))
print(article_text)
##################
all_data = []
for t, d, m in zip(time, date, article):
    all_data.append([t.text, d.get_text(strip=True), m.get_text(strip=True, separator='\n')])
with open('data.csv', 'w', newline='', encoding="utf-8") as csvfile:
    writer = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for row in all_data:
        writer.writerow(row)

Posted on 2022-11-02 17:34:45
It seems to me that your selectors are off, and that you are searching for them in the general scope (the whole HTML body). My approach would be to narrow things down to the comments-section `div` "component" first and then search within it. Here is how to achieve that:
import requests
from bs4 import BeautifulSoup
url = "https://www.dell.com/community/Inspiron-Desktops/I-am-getting-time-sync-errror-and-the-last-synced-time-shown-as/m-p/8290678#M36017"
result = requests.get(url)
soup = BeautifulSoup(result.text, "html.parser")
date = '10-25-2022'
comments = []
comments_section = soup.find('div', {'class':'lia-component-message-list-detail-with-inline-editors'})
comments_body = comments_section.find_all('div', {'class':'lia-linear-display-message-view'})
for comment in comments_body:
    if date in comment.find('span', {'class':'local-date'}).text:
        comments.append({
            'name': comment.find('a', {'class':'lia-user-name-link'}).text,
            'date': comment.find('span', {'class':'local-date'}).text,
            'comment': comment.find('div', {'class':'lia-message-body-content'}).text,
        })
data = {
    "title": soup.find('div', {'class':'lia-message-subject'}).text,
    "comments": comments
}
print(data)

This script produces a JSON object (stringified) that looks like this:
{
  "title":"\n\n\n\n\n\t\t\t\t\t\t\tI am getting time sync errror and the last synced time shown as a day in 2015\n\t\t\t\t\t\t\n\n\n\n",
  "comments":[
    {
      "name":"jraju",
      "date":"10-25-2022",
      "comment":"This pc is desktop inspiron 3910 model . The dell supplied only this week."
    },
    {
      "name":"Mary G",
      "date":"10-25-2022",
      "comment":"Try rebooting the computer and connecting to the internet again to see if that clears it up.\\xa0\nDon't forget to run Windows Update to get all the necessary updates on a new computer.\\xa0\n\\xa0"
    },
    {
      "name":"RoHe",
      "date":"10-25-2022",
      "comment":"You might want to read Fix: Time synchronization failed on Windows 11.\nTotally ignore the part about downloading the software tool, and scroll down that same page to the part: How to manually sync time on a Windows 11 PC.\nNOTE: In step #6, if time.windows.com doesn't work, pick a different server from the drop-down menu on that screen.\n\nRon\\xa0\\xa0 Forum Member since 2004\\xa0\\xa0 I'm not a Dell employee"
    }
  ]
}

As an engineer at WebScrapingAPI, I can also recommend our tool, which prevents detection and makes your scraper more reliable in the long run.
The only thing you need to change for it to work is the URL you request. In this case, the target website becomes a parameter of the API endpoint; everything else stays the same. The url variable then becomes:
url = 'https://api.webscrapingapi.com/v1?api_key=<YOUR_API_KEY>&url=https://www.dell.com/community/Inspiron-Desktops/I-am-getting-time-sync-errror-and-the-last-synced-time-shown-as/m-p/8290678#M36017'

Posted on 2022-11-02 17:45:51
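The "narrow the scope first" idea from the answer above can be sketched without any network access, using only the standard library's `xml.etree.ElementTree` on a simplified, well-formed snippet. The class names mirror the Dell forum markup, but the markup itself is invented here for illustration; the point is that a global search also picks up unrelated elements, while searching inside the located container does not.

```python
# Minimal sketch: global search vs. search scoped to a container element.
# The snippet below is invented; only the class names mirror the real forum.
import xml.etree.ElementTree as ET

snippet = """
<body>
  <div class="sidebar"><span class="local-date">01-01-2020</span></div>
  <div class="lia-component-message-list-detail-with-inline-editors">
    <div class="lia-linear-display-message-view">
      <a class="lia-user-name-link">jraju</a>
      <span class="local-date">10-25-2022</span>
    </div>
  </div>
</body>
"""

root = ET.fromstring(snippet)

# Global search: picks up the unrelated sidebar date as well.
all_dates = [s.text for s in root.iter('span') if s.get('class') == 'local-date']

# Scoped search: locate the comments container first, then search inside it.
section = next(d for d in root.iter('div')
               if d.get('class') == 'lia-component-message-list-detail-with-inline-editors')
scoped_dates = [s.text for s in section.iter('span') if s.get('class') == 'local-date']

print(all_dates)     # both dates, including the sidebar one
print(scoped_dates)  # only the comment date
```

BeautifulSoup works the same way: any `Tag` returned by `find` supports `find`/`find_all` on its own subtree, which is exactly what `comments_section.find_all(...)` relies on above.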
You can get the usernames by targeting the lia-component-message-view-widget-author-username class:
username = doc.find_all('span', attrs={'class':'lia-component-message-view-widget-author-username'})
Then include it in all_data and filter with an if:
all_data = []
for u, t, d, m in zip(username, time, date, article):
    if d.get_text(strip=True)[1:] == '10-25-2022':
        all_data.append([
            u.get_text(strip=True),
            t.text,
            d.get_text(strip=True),
            m.get_text(strip=True, separator='\n')
        ])

By the way, a list comprehension is slightly faster than appending in a loop, and it lets you define and populate all_data in a single statement:
all_data = [[
    u.get_text(strip=True),
    t.text,
    d.get_text(strip=True),
    m.get_text(strip=True, separator='\n')
] for u, t, d, m in zip(username, time, date, article)
  if d.get_text(strip=True)[1:] == '10-25-2022'
]

https://stackoverflow.com/questions/74292169
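The append-loop and list-comprehension versions above build identical rows; the speed claim can be checked with the standard library's `timeit` on toy data (the names and the fabricated rows below are mine, not from the thread; the exact gap is small and varies by workload).

```python
# Toy check: the append loop and the list comprehension produce the same
# filtered rows; timeit gives a rough sense of the speed difference.
import timeit

# Fabricated (username, date, message) tuples standing in for scraped tags.
rows = [(f"user{i}", f"10-{i % 28 + 1:02d}-2022", f"msg {i}") for i in range(1000)]

def with_append():
    out = []
    for u, d, m in rows:
        if d == '10-25-2022':
            out.append([u, d, m])
    return out

def with_comprehension():
    return [[u, d, m] for u, d, m in rows if d == '10-25-2022']

print(timeit.timeit(with_append, number=1000))
print(timeit.timeit(with_comprehension, number=1000))
```

The comprehension avoids the repeated attribute lookup and method call for `out.append`, which is where its small edge comes from.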