I only want to output the text of the outer li tags.
from bs4 import BeautifulSoup
html = BeautifulSoup("""
<ul>
<li><a href="#">B2B Marketing</a>
<ul>
<li><a href="offerings/b2bmarketing/outboundai.php"> Campagin </a></li>
<li><b>Inbound AI </b>Enrich inbound leads</a></li>
</ul>
</li>
<li>Marketing Data Analysis
<ul>
<li><a href="offerings/marketingdataanalysis/event360ai.php"><b>Event 360 AI </b></a></li>
</ul>
</li>
<li class="drop-down"><a href="#">Enrichment API</a>
</li>
</ul>
""")
print([i.text.strip() for i in html.findAll('li')])
This outputs the entire text of the HTML content, nested li items included:
['B2B Marketing\n\n Campagin \nInbound AI Enrich inbound leads', 'Campagin', 'Inbound AI Enrich inbound leads', 'Marketing Data Analysis\n \nEvent 360 AI', 'Event 360 AI', 'Enrichment API\n\nAPI Technographics, Firmographics, Intent data', 'API Technographics, Firmographics, Intent data']
But the output should be:
[
'B2B Marketing : Campagin, Enrich inbound leads',
'Marketing Data Analysis : Event 360 AI',
'Enrichment API'
]
Please help me solve this problem.
Posted on 2020-02-01 09:50:58
How about this?
from simplified_scrapy.simplified_doc import SimplifiedDoc
html = '''<ul>
<li><a href="#">B2B Marketing</a>
<ul>
<li><a href="offerings/b2bmarketing/outboundai.php"> Campagin </a></li>
<li><b>Inbound AI </b>Enrich inbound leads</a></li>
</ul>
</li>
<li>Marketing Data Analysis
<ul>
<li><a href="offerings/marketingdataanalysis/event360ai.php"><b>Event 360 AI </b></a></li>
</ul>
</li>
<li class="drop-down"><a href="#">Enrichment API</a>
</li>
</ul>
'''
doc = SimplifiedDoc(html)
lis = doc.ul.lis
out = []
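for li in lis:
    # If a <b> label is followed by other text, drop the label and keep the text
    if li.b and li.b.nextText():
        li.removeElement('b')
    # Outer label: the li's own leading text, or its link text if there is none
    name = li.firstText() if li.firstText() else li.a.text
    tmp = ''
    # Collect the text of the nested li items, comma-separated
    for l in li.lis:
        tmp += l.text+','
    if tmp:
        out.append(name+':'+tmp[0:-1])  # strip the trailing comma
    else:
        out.append(name)
print(out)
Result: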
['B2B Marketing:Campagin,Enrich inbound leads', 'Marketing Data Analysis:Event 360 AI', 'Enrichment API']
https://stackoverflow.com/questions/60010088
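For comparison, the same result can probably be reached with BeautifulSoup alone, without switching libraries. The sketch below is not part of the original answer. It assumes the outer list is the first ul in the document, that each outer li contains at most one nested ul, and that a b tag inside a nested item is only a label to drop when other text follows it. The helper name outer_li_summaries is made up for illustration, and the stray closing a tag from the question's HTML is left out of the sample here.

```python
from bs4 import BeautifulSoup

html = """
<ul>
<li><a href="#">B2B Marketing</a>
<ul>
<li><a href="offerings/b2bmarketing/outboundai.php"> Campagin </a></li>
<li><b>Inbound AI </b>Enrich inbound leads</li>
</ul>
</li>
<li>Marketing Data Analysis
<ul>
<li><a href="offerings/marketingdataanalysis/event360ai.php"><b>Event 360 AI </b></a></li>
</ul>
</li>
<li class="drop-down"><a href="#">Enrichment API</a>
</li>
</ul>
"""


def outer_li_summaries(markup):
    """Return one 'label : child, child' string per top-level li (hypothetical helper)."""
    soup = BeautifulSoup(markup, "html.parser")
    results = []
    # recursive=False limits the search to direct children, so nested
    # li items are not returned a second time as top-level entries.
    for li in soup.find("ul").find_all("li", recursive=False):
        nested = li.find("ul")
        children = []
        if nested is not None:
            # Detach the nested list so its text does not leak into the label.
            nested.extract()
            for child in nested.find_all("li", recursive=False):
                b = child.find("b")
                # Assumption: drop the <b> label only when other text follows it.
                if b is not None and child.get_text(strip=True) != b.get_text(strip=True):
                    b.decompose()
                children.append(child.get_text(strip=True))
        label = li.get_text(strip=True)
        results.append(f"{label} : {', '.join(children)}" if children else label)
    return results


print(outer_li_summaries(html))
# ['B2B Marketing : Campagin, Enrich inbound leads', 'Marketing Data Analysis : Event 360 AI', 'Enrichment API']
```

The key difference from the code in the question is recursive=False: findAll('li') on its own walks the whole tree, which is why every nested item showed up both inside its parent's text and as a separate entry.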