I want to extract the username, post title, post time, and message content from a Dell community forum thread for a specific date, and store them in an Excel file.
I want to extract the post title: "I am getting time sync errror and the last synced time shown as a day in 2015".
And the comment details (username, post time, message) for 10-25-2022 only, without any other comments.
I am new to this. So far I have only managed to extract the information (without usernames), with no date filter.
import csv  # needed for the csv.writer used below
import requests
from bs4 import BeautifulSoup
url = "https://www.dell.com/community/Inspiron-Desktops/I-am-getting-time-sync-errror-and-the-last-synced-time-shown-as/m-p/8290678#M36017"
result = requests.get(url)
doc = BeautifulSoup(result.text, "html.parser")
###### time ######
time = doc.find_all('span', attrs={'class':'local-time'})
print(time)
##################
##### date #######
date = doc.find_all('span', attrs={'class':'local-date'})
print(date)
#################
#### message ######
article_text = ''
article = doc.find_all("div", {"class":"lia-message-body-content"})
for element in article:
    article_text += '\n' + ''.join(element.find_all(text=True))
print(article_text)
##################
all_data = []
for t, d, m in zip(time, date, article):
    all_data.append([t.text, d.get_text(strip=True), m.get_text(strip=True, separator='\n')])
with open('data.csv', 'w', newline='', encoding="utf-8") as csvfile:
    writer = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for row in all_data:
        writer.writerow(row)

Posted on 2022-11-02 17:34:45
It seems to me that your selectors are off, and that you are searching for them in the general scope (the whole HTML body). My approach would be to narrow things down to the comments-section `div` "component" first and then search within it. Here is how to achieve that:
import requests
from bs4 import BeautifulSoup
url = "https://www.dell.com/community/Inspiron-Desktops/I-am-getting-time-sync-errror-and-the-last-synced-time-shown-as/m-p/8290678#M36017"
result = requests.get(url)
soup = BeautifulSoup(result.text, "html.parser")
date = '10-25-2022'
comments = []
comments_section = soup.find('div', {'class':'lia-component-message-list-detail-with-inline-editors'})
comments_body = comments_section.find_all('div', {'class':'lia-linear-display-message-view'})
for comment in comments_body:
    if date in comment.find('span', {'class':'local-date'}).text:
        comments.append({
            'name': comment.find('a', {'class':'lia-user-name-link'}).text,
            'date': comment.find('span', {'class':'local-date'}).text,
            'comment': comment.find('div', {'class':'lia-message-body-content'}).text,
        })
data = {
    "title": soup.find('div', {'class':'lia-message-subject'}).text,
    "comments": comments
}
print(data)

This script produces a JSON object (stringified) that looks like this:
{
  "title":"\n\n\n\n\n\t\t\t\t\t\t\tI am getting time sync errror and the last synced time shown as a day in 2015\n\t\t\t\t\t\t\n\n\n\n",
  "comments":[
    {
      "name":"jraju",
      "date":"10-25-2022",
      "comment":"This pc is desktop inspiron 3910 model . The dell supplied only this week."
    },
    {
      "name":"Mary G",
      "date":"10-25-2022",
      "comment":"Try rebooting the computer and connecting to the internet again to see if that clears it up.\\xa0\nDon't forget to run Windows Update to get all the necessary updates on a new computer.\\xa0\n\\xa0"
    },
    {
      "name":"RoHe",
      "date":"10-25-2022",
      "comment":"You might want to read Fix: Time synchronization failed on Windows 11.\nTotally ignore the part about downloading the software tool, and scroll down that same page to the part: How to manually sync time on a Windows 11 PC.\nNOTE: In step #6, if time.windows.com doesn't work, pick a different server from the drop-down menu on that screen.\n\nRon\\xa0\\xa0 Forum Member since 2004\\xa0\\xa0 I'm not a Dell employee"
    }
  ]
}

As an engineer at WebScrapingAPI, I can also recommend our tool, which prevents detection and makes your scraper more reliable in the long run.
The only thing you need to change for it to work is the URL you request. In this case, the target website becomes a parameter of the API endpoint; everything else stays the same. The url variable then becomes:
url = 'https://api.webscrapingapi.com/v1?api_key=<YOUR_API_KEY>&url=https://www.dell.com/community/Inspiron-Desktops/I-am-getting-time-sync-errror-and-the-last-synced-time-shown-as/m-p/8290678#M36017'

Posted on 2022-11-02 17:45:51
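The "narrow the scope first" idea from the answer above can be sketched without any network access, using only the standard library's `xml.etree.ElementTree` on a simplified, well-formed snippet. The class names mirror the Dell forum markup, but the markup itself is invented here for illustration; the point is that a global search also picks up unrelated elements, while searching inside the located container does not.

```python
# Minimal sketch: global search vs. search scoped to a container element.
# The snippet below is invented; only the class names mirror the real forum.
import xml.etree.ElementTree as ET

snippet = """
<body>
  <div class="sidebar"><span class="local-date">01-01-2020</span></div>
  <div class="lia-component-message-list-detail-with-inline-editors">
    <div class="lia-linear-display-message-view">
      <a class="lia-user-name-link">jraju</a>
      <span class="local-date">10-25-2022</span>
    </div>
  </div>
</body>
"""

root = ET.fromstring(snippet)

# Global search: picks up the unrelated sidebar date as well.
all_dates = [s.text for s in root.iter('span') if s.get('class') == 'local-date']

# Scoped search: locate the comments container first, then search inside it.
section = next(d for d in root.iter('div')
               if d.get('class') == 'lia-component-message-list-detail-with-inline-editors')
scoped_dates = [s.text for s in section.iter('span') if s.get('class') == 'local-date']

print(all_dates)     # both dates, including the sidebar one
print(scoped_dates)  # only the comment date
```

BeautifulSoup works the same way: any `Tag` returned by `find` supports `find`/`find_all` on its own subtree, which is exactly what `comments_section.find_all(...)` relies on above.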
You can get the usernames by targeting the lia-component-message-view-widget-author-username class:
username = doc.find_all('span', attrs={'class':'lia-component-message-view-widget-author-username'})
Then include it in all_data and filter with an if:
all_data = []
for u, t, d, m in zip(username, time, date, article):
    if d.get_text(strip=True)[1:] == '10-25-2022':
        all_data.append([
            u.get_text(strip=True),
            t.text,
            d.get_text(strip=True),
            m.get_text(strip=True, separator='\n')
        ])

By the way, a list comprehension is slightly faster than appending in a loop, and it lets you define and populate all_data in a single statement:
all_data = [[
    u.get_text(strip=True),
    t.text,
    d.get_text(strip=True),
    m.get_text(strip=True, separator='\n')
] for u, t, d, m in zip(username, time, date, article)
  if d.get_text(strip=True)[1:] == '10-25-2022'
]

https://stackoverflow.com/questions/74292169
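The append-loop and list-comprehension versions above build identical rows; the speed claim can be checked with the standard library's `timeit` on toy data (the names and the fabricated rows below are mine, not from the thread; the exact gap is small and varies by workload).

```python
# Toy check: the append loop and the list comprehension produce the same
# filtered rows; timeit gives a rough sense of the speed difference.
import timeit

# Fabricated (username, date, message) tuples standing in for scraped tags.
rows = [(f"user{i}", f"10-{i % 28 + 1:02d}-2022", f"msg {i}") for i in range(1000)]

def with_append():
    out = []
    for u, d, m in rows:
        if d == '10-25-2022':
            out.append([u, d, m])
    return out

def with_comprehension():
    return [[u, d, m] for u, d, m in rows if d == '10-25-2022']

print(timeit.timeit(with_append, number=1000))
print(timeit.timeit(with_comprehension, number=1000))
```

The comprehension avoids the repeated attribute lookup and method call for `out.append`, which is where its small edge comes from.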