
I cannot scrape data from the following HTML in Python

Stack Overflow user
Asked 2019-12-16 19:59:43
Answers: 1 | Views: 53 | Followers: 0 | Votes: 0

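I am trying to fetch data from user reviews on MouthShut.com. When I inspect a review in DevTools, the review text I need is inside the following tag, a div with the class "more reviewdata":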

<div class="more reviewdata">                                            Ipohone 11 Pro X : Looks alike a minion having Three Eyes. yes its Seems as An Alien, But Technically Iphone is Copying features and Function of Androids and Having Custom Os Phones.Triple Camera is Great! for Wide Angle Photography.But The looks of Iphone 11 pro X isn't Good.If ...<a style="cursor:pointer" onclick="bindreviewcontent('2958778',this,false,'I found this review of Apple iPhone 11 Pro Max 512GB pretty useful',925993570,'.png','I found this review of Apple iPhone 11 Pro Max 512GB pretty useful %23WriteShareWin','https://www.mouthshut.com/review/Apple-iPhone-11-Pro-Max-512GB-review-omnstsstqun','Apple iPhone 11 Pro Max 512GB',' 1/5','omnstsstqun');">Read More</a></div>

I only want to extract the text of the review. Can anyone help with how to do that, given that there is no unique delimiter to split on?

Here is the code I have so far:

from requests import get
from bs4 import BeautifulSoup

base_url = 'https://www.mouthshut.com/mobile-phones/Apple-iPhone-11-Pro-Max-reviews-925993567'
response = get(base_url)

# Peek at the start of the raw HTML
print(response.text[:100])

html_soup = BeautifulSoup(response.text, 'html.parser')
print(type(html_soup))

# All <div> elements whose class attribute is exactly "more reviewdata"
reviews = html_soup.find_all('div', class_='more reviewdata')

print(type(reviews))
print(len(reviews))

# Index 2 is the third matching review on the page
first_review = reviews[2]
print(first_review.div)

1 Answer

Stack Overflow user

Accepted answer

Answered 2019-12-16 20:42:41

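To scrape all reviews from the page, you can use this example. Some of the longer reviews are truncated on the page, so their full text is fetched separately with a POST request: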

import re
import requests
from textwrap import wrap
from bs4 import BeautifulSoup

base_url = 'https://www.mouthshut.com/mobile-phones/Apple-iPhone-11-Pro-Max-reviews-925993567'

# Form fields sent with each POST; 'reviewid' is filled in per review below
data = {
    'type': 'review',
    'reviewid': -1,
    'corp': 'false',
    'catname': ''
}

# Endpoint queried for the full text of a truncated review
more_url = 'https://www.mouthshut.com/review/CorporateResponse.ashx'

output = []
with requests.session() as s:
    soup = BeautifulSoup(s.get(base_url).text, 'html.parser')
    for review in soup.select('.reviewdata'):
        # Truncated reviews carry a "Read More" link whose onclick holds the review id
        a = review.select_one('a[onclick^="bindreviewcontent"]')
        if a:
            data['reviewid'] = re.findall(r"bindreviewcontent\('(\d+)", a['onclick'])[0]
            comment = BeautifulSoup(s.post(more_url, data=data).text, 'html.parser')
            # Remove the response's first <div> and <ul> before reading the text
            comment.div.extract()
            comment.ul.extract()
            output.append(comment.get_text(separator=' ', strip=True))
        else:
            # No "Read More" link: the whole review is already on the page
            review.div.extract()
            output.append(review.get_text(separator=' ', strip=True))

# Print each review, wrapped at textwrap's default width of 70 columns
for i, review in enumerate(output, 1):
    print('--- Review no.{} ---'.format(i))
    print(*wrap(review), sep='\n')
    print()

Prints:

--- Review no.1 ---
As you all know Apple products are too expensive this one is damn one
but who needs to sell his kidney to buy its look is not that much ease
than expected. For me it's 2 star phone

--- Review no.2 ---
Very disappointing product.nothing has changed in operating system,
only camera look has changed which is very odd looking.Device weight
is not light and dont fit in one hand.

--- Review no.3 ---
Ipohone 11 Pro X : Looks alike a minion having Three Eyes. yes its
Seems as An Alien, But Technically Iphone is Copying features and
Function of Androids and Having Custom Os Phones. Triple Camera is
Great! for Wide Angle Photography. But The looks of Iphone 11 pro X
isn't Good. If You Have 3 Kidneys, Then You Can Waste one of them to

... and so on.
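
The branch taken in the loop depends on whether a preview carries a "Read More" link, and the review id for the POST request comes from that link's onclick attribute. As a rough offline illustration of those two steps (no network access, and standard library only, so html.parser stands in for BeautifulSoup here), the sketch below separates the preview text from the link text and pulls out the id with the same regular expression. SNIPPET is a shortened copy of the markup from the question with the onclick argument list cut down, and PreviewParser and parse_preview are hypothetical names used only for this example.

```python
import re
from html.parser import HTMLParser

# Shortened copy of the preview markup from the question (assumption:
# the onclick argument list is trimmed to its first three arguments).
SNIPPET = (
    '<div class="more reviewdata"> Triple Camera is Great! '
    'for Wide Angle Photography. If ...'
    '<a style="cursor:pointer" '
    'onclick="bindreviewcontent(\'2958778\',this,false);">Read More</a></div>'
)


class PreviewParser(HTMLParser):
    """Collect text outside <a> tags and remember each <a> onclick value."""

    def __init__(self):
        super().__init__()
        self.link_depth = 0    # greater than 0 while inside an <a> element
        self.text_parts = []   # text chunks found outside links
        self.onclicks = []     # raw onclick attribute values

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.link_depth += 1
            onclick = dict(attrs).get('onclick')
            if onclick:
                self.onclicks.append(onclick)

    def handle_endtag(self, tag):
        if tag == 'a' and self.link_depth:
            self.link_depth -= 1

    def handle_data(self, data):
        # Skip link text such as "Read More"
        if not self.link_depth:
            self.text_parts.append(data)


def parse_preview(html):
    """Return (preview text, review id or None) for one preview block."""
    parser = PreviewParser()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace left over from the page layout
    text = ' '.join(' '.join(parser.text_parts).split())
    review_id = None
    for onclick in parser.onclicks:
        match = re.search(r"bindreviewcontent\('(\d+)", onclick)
        if match:
            review_id = match.group(1)
            break
    return text, review_id


preview_text, review_id = parse_preview(SNIPPET)
print(preview_text)
print(review_id)
```

A review_id of None would correspond to the else branch in the answer's loop, where the preview is already the complete review and no POST request is needed.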
Votes: 1
Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud Xiaowei's IT-domain engine.
Original link:

https://stackoverflow.com/questions/59356213
