
Remove HTML from text but keep <br> tags in Python

Stack Overflow user
Asked 2020-06-28 13:07:16
Answers: 4 · Views: 529 · Followers: 0 · Votes: 1

I'm using Python to scrape data from a website. I need to remove all the HTML and the \n\t characters, but keep all of the text and the "br" tags.

response.xpath('//div[@class="comment-text-inner"]').extract()

Output:

['<div class="comment-text-inner">\n\t\t\t\t<b>Certified, Low Mileage, Twin Panel Moonroof, SE Convenience Package, Rear Parking  Aid Sensors, Black Roof Side Rails, Sync 3, Power 10-Way 
Driver Seat, SE Leather Plus Package, Voice-Activated Touch-Screen <br>
Navigation!</b><br> <br>    
Whether you\'re getting out of the city for a weekend camping trip or just driving to the grocery store, the 2017 
Ford Escape has you covered. This  2017 Ford Escape is for sale today. <br> <br>
For 2017, the Escape has under gone a small refresh, updating the exterior with a more angular tailgate, LED tail lights, an aluminum hood and a new fascia that makes it look similar to the other Ford crossovers.  
Both programs offer you an exclusive Comprehensive Warranty over and above any remaining factory warranty. For specific details on either program see your sales representative today!<br> <br><br>AMVIC Licensed Dealer<br> Come by and check out our fleet of 40+ used cars and trucks and 70+ new cars and trucks for sale in Calgary.  o~o\t\t\t</div>']

Using response.xpath('//div[@class="comment-text-inner"]/text()').extract() returns the text with the \n\t characters and without the "br" tags, so I need a way to remove the \n\t and keep the br tags.


4 Answers

Stack Overflow user

Answered 2020-06-28 14:22:38

Here is a piece of code that does what you want:

children = response.xpath('//div[@class="comment-text-inner"]/node()')
res = ""
for c in children:
    # an empty name() result marks a text node
    name = c.xpath("name()")
    if len(name) == 0 or c.get() == "<br>":
        # keep text nodes and <br> elements; every other element is skipped
        text = c.get()
        text = text.replace("\n", "").replace("\t", "")
        res = res + text
        print(text)  # not strictly needed

Here I'm just printing the text. Of course, you could also store it in a database or do something else with it.

(The URL I used is https://www.marlboroughford.com/vehicle-details/used-2017-ford-escape-se---certified---low-mileage-calgary-ab-id-36312139)
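The loop above only looks at the direct children of the div, so text nested inside another element (such as the <b> block in the sample output) is skipped. As a rough sketch of an alternative, and not part of the original answer, the same idea can be written with the standard library's html.parser instead of Scrapy selectors. The class name BrKeepingExtractor, the function strip_html_keep_br, and the shortened sample string are made up for illustration.

```python
from html.parser import HTMLParser


class BrKeepingExtractor(HTMLParser):
    """Collect text at any nesting depth and re-emit <br>; drop all other tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names, so <BR> and <br/> also land here.
        if tag == "br":
            self.parts.append("<br>")

    def handle_data(self, data):
        # Remove the \n and \t characters the question asks to get rid of.
        self.parts.append(data.replace("\n", "").replace("\t", ""))


def strip_html_keep_br(html):
    parser = BrKeepingExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


# Shortened, hypothetical stand-in for the div from the question.
sample = (
    '<div class="comment-text-inner">\n\t<b>Certified, Low Mileage<br>\n'
    'Navigation!</b><br> <br>\nAMVIC Licensed Dealer\t</div>'
)
print(strip_html_keep_br(sample))
```

In a spider, the string passed in would be something like response.xpath('//div[@class="comment-text-inner"]').get(), assuming the div exists on the page.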

Votes: 0

Stack Overflow user

Answered 2020-06-28 16:07:22

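1. Save the HTML.

2. In the HTML, replace the <br> tags with a marker that is 100% certain not to appear in the text and that will not be treated as a tag by the ../text() XPath selector or the ::text CSS selector (which is what happened when you tried the text selector earlier). For example, use __br__.

3. Call the text selector on the modified HTML.

4. In the resulting text, replace __br__ back with <br>.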

from scrapy import Selector

# ... (rest of the spider elided in the original answer)

def parse(self, response):
    # ...
    # .get() returns one string; .extract() returns a list, which has no .replace()
    html = response.xpath('//div[@class="comment-text-inner"]').get() or ""
    # replace '<br>' with the placeholder __br__
    html = html.replace("<br>", "__br__")
    # create a selector from the modified HTML
    sel = Selector(text=html)
    text = sel.css("*::text").getall()

    # convert the list of text nodes to a single string
    text = "".join(text)
    # remove \n and \t
    text = text.replace("\n", "").replace("\t", "")

    # put the <br> tags back into the result
    text = text.replace("__br__", "<br>")
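Step 2 depends on the placeholder never occurring in the page text. A fixed string such as __br__ is usually fine, but if that cannot be assumed, one possible approach is to generate a token and check it against the HTML first. This is only a sketch: make_placeholder and the sample string are hypothetical, and the text extraction step is left out.

```python
import uuid


def make_placeholder(html):
    """Return a token that does not occur anywhere in `html`."""
    while True:
        token = "__br_{}__".format(uuid.uuid4().hex)
        if token not in html:
            return token


# Hypothetical input that already contains the literal string "__br__".
html = "Line one<br>a literal __br__ in the text<br>end"

placeholder = make_placeholder(html)
protected = html.replace("<br>", placeholder)
# ... extract the text from `protected` here (steps 3 of the list above) ...
restored = protected.replace(placeholder, "<br>")
print(restored)
```

With a fixed __br__ placeholder, the literal __br__ in this input would be turned into a <br> tag on the way back; the generated token avoids that.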
Votes: 0

Stack Overflow user

Answered 2020-06-28 19:13:22

You can use a regex to remove all tags. But before that, you have to replace the <br> tags with something else so that they are kept as text.

import re

# .extract() returns a list of matching HTML strings; join them into one string
text = response.xpath('//div[@class="comment-text-inner"]').extract()
text = ' '.join(text) if text else ''
text = re.sub(r'<br>', '__br__', text)  # swap <br> for a placeholder so it survives tag removal
text = re.sub(r'<.*?>', '', text)  # remove all remaining tags
text = text.replace('__br__', '<br>').strip()  # put the <br> tags back into the text
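The pattern r'<br>' only matches that exact spelling. If the page might also contain <br/>, <br /> or <BR>, a slightly broader variant could look like the sketch below. The function name strip_tags_keep_br and the sample string are illustrative, and the \n and \t removal is added here because the question asks for it. Regexes are not a full HTML parser, so this assumes simple markup without ">" inside attribute values or comments.

```python
import re

# Matches <br>, <br/>, <br /> and upper-case variants.
BR_RE = re.compile(r"<br\s*/?>", re.IGNORECASE)
# Matches any remaining tag, assuming no ">" inside attribute values.
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags_keep_br(html, placeholder="__br__"):
    text = BR_RE.sub(placeholder, html)   # protect the line breaks
    text = TAG_RE.sub("", text)           # drop every other tag
    text = text.replace("\n", "").replace("\t", "")
    return text.replace(placeholder, "<br>").strip()


# Hypothetical sample mixing several <br> spellings.
sample = "<div>\n\t<b>Sync 3</b><BR/>Power seat<br />\n</div>"
print(strip_tags_keep_br(sample))
```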
Votes: 0
Original page content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/62618468
