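I have collected all the heading tags in the data below using heads=str(soup.find_all(re.compile('^h[1-6]$'))). Now I want to search for text that sits inside those heading tags. Part of the source code is given below.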
import bs4
import re
data = '''
<html>
<body>
<div class="mob-icon"> <span></span></div>
<nav id="nav">
<ul class="" id="menu-home-welcome-banner">
<li class="menu-item menu-item-type-custom menu-item-object-custom current-menu-parent menu-item-has-children menu-item-1778" id="menu-item-1778"> <a class="submeny-top" href="http://www.uvionicstech.com" ontouchstart="">Home</a> </li>
<!--<li id="menu-item-1785" class="menu-item menu-item-type-custom menu-item-object-custom menu-item-1785"><a href="#about" class="scroll-to-link" ontouchstart="">About</a></li>-->
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-1786" id="menu-item-1786"><a class="scroll-to-link" href="#data-analytics" ontouchstart="">PRODUCTS & SOLUTIONS</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-1787" id="menu-item-1787"><a class="scroll-to-link" href="#artificial-intelligence" ontouchstart="">Artificial Intelligence</a></li>
<!-- <li id="menu-item-1788" class="menu-item menu-item-type-custom menu-item-object-custom menu-item-1788"><a href="#iot" class="scroll-to-link" ontouchstart="">IOT</a></li> -->
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-1788" id="menu-item-1788"><a class="scroll-to-link" href="#services" ontouchstart="">All in One Place</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-1789" id="menu-item-1789"><a class="scroll-to-link" href="#eco-system" ontouchstart="">PARTNERS</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-1791" id="menu-item-1791"><a class="scroll-to-link" href="#contact" ontouchstart="">Contact</a></li>
<h3 class="h3 text-center">PARTNERS</h3>
<h3 class="vc_custom_heading titel-left wow" data-wow-delay="0.3s">
<span class="titel-line"></span>Artificial Intelligence </h3>
<h3 class="vc_custom_heading titel-left wow " data-wow-delay="0.3s"><span class="titel-line">
</span>Everything for your Business, <small>all in one place</small>
</h3>
</ul>
</nav>
</div>
</body>
</html>
'''
searched_word = 'Artificial Intelligence'
soup = bs4.BeautifulSoup(data, 'html.parser')
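results = soup.body.find_all(string=re.compile('.*{0}.*'.format(searched_word)), recursive=True)
Output:
results
['Artificial Intelligence',
'Artificial Intelligence ']
The first match here comes from a list item, and the second "Artificial Intelligence" comes from a heading tag. I only want the matches that are inside heading tags. How can I get only the occurrences that have a heading tag? Is there a way to match the next few characters that follow the words Artificial Intelligence, so that the search picks up Artificial Intelligence </h3> and ignores the list item?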
Posted on 2018-11-29 20:13:48
Since you only want the heading tags, could we grab just those tags and then search within them?
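searched_word = 'Artificial Intelligence'
soup = bs4.BeautifulSoup(data, 'html.parser')
head_tags = soup.find_all('h3')
results = []  # start empty so the check below works when nothing matches
for ele in head_tags:
    if searched_word in ele.text:
        # keep every matching heading instead of overwriting the previous one
        results.append(ele.text.replace('\n', ''))
if results:
    print(results)
else:
    print('No matches found')
This gives the output: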
In [184]: results
Out[184]: ['Artificial Intelligence ']
Posted on 2018-11-29 23:15:58
If the heading has no child tags, as in
<h3 class="vc_custom_heading">Artificial Intelligence</h3>
you can combine your regular expressions:
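results = soup.body.find_all(re.compile('^h[1-6]$'),
                             string=re.compile(searched_word))
But the headings in your question contain child tags, so I would either write a loop like the one in the first answer or create a custom function to pass to find_all():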
def head_contain_word(tag):
    # match any h1-h6 tag whose combined text contains the search term
    return re.match(r'^h[1-6]$', tag.name) \
        and searched_word in tag.text
searched_word = 'Artificial Intelligence'
soup = bs4.BeautifulSoup(data, 'html.parser')
results = soup.body.find_all(head_contain_word)
Result:
[<h3 class="vc_custom_heading titel-left wow" data-wow-delay="0.3s">
\n<span class="titel-line"></span>Artificial Intelligence </h3>]
https://stackoverflow.com/questions/53537936
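Another possible approach, sketched here as an assumption rather than something from either answer, is to keep the string search from the question and then discard every match that has no h1-h6 ancestor. The helper name strings_in_headings and the shortened html sample are hypothetical; find_all(string=...) and find_parent() are standard BeautifulSoup calls.

```python
import re

import bs4

# Shortened, hypothetical stand-in for the page in the question.
html = '''
<ul>
<li><a href="#artificial-intelligence">Artificial Intelligence</a></li>
</ul>
<h3 class="vc_custom_heading">
<span class="titel-line"></span>Artificial Intelligence </h3>
<h2>Everything for your Business, <small>all in one place</small></h2>
'''

HEADING = re.compile(r'^h[1-6]$')


def strings_in_headings(soup, word):
    """Return the matching text nodes that have an h1-h6 ancestor."""
    # re.escape keeps characters such as '&' or '.' in the word literal.
    matches = soup.find_all(string=re.compile(re.escape(word)))
    # find_parent() walks up the tree and returns None if no ancestor matches.
    return [s for s in matches if s.find_parent(HEADING) is not None]


soup = bs4.BeautifulSoup(html, 'html.parser')
found = strings_in_headings(soup, 'Artificial Intelligence')
print([str(s) for s in found])
```

On this sample the list item's text is dropped and only the string inside the h3 remains. Because the check walks up through all ancestors, text nested in a child tag of a heading (such as the small tag above) is still kept.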