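I'm looking for a way to exclude image links, i.e. links that don't contain any anchor text. The code below does the job of compiling the data I want, but it also picks up unwanted URLs from some thumbnail/image links on the page.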
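for url in list_urls:
    browser.get(url)
    soup = BeautifulSoup(browser.page_source, "html.parser")
    # every <a> on the page is collected, including image-only links
    for line in soup.find_all('a'):
        href = line.get('href')
        links_with_text.append([url, href])

The images on the scraped pages all share the same format (and they all sit under the same div class, "related-content"):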
<a href="https://XXXX/" ><picture class="crp_thumb crp_featured" title="XXXX">
<source type="image/webp" srcset="https://XXXX.jpg.webp"/>
<img width="150" height="150" src="https://XXXX.jpg" alt="XXXX"/>
</picture>

Posted on 2020-01-14 07:22:09
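Here are a few examples you can use:

- Selecting <a> tags that contain text
- Selecting <a> tags that don't contain <img> tags
- Selecting <a> tags that contain text and don't contain <img> tags

txt = '''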
<a href="https://XXXX/">
<picture class="crp_thumb crp_featured" title="XXXX">
<source type="image/webp" srcset="https://XXXX.jpg.webp"/>
<img width="150" height="150" src="https://XXXX.jpg" alt="XXXX"/>
</picture>
</a>
<a href="https://XXX">OK LINK</a>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(txt, 'html.parser')
# select <a> tags that contain text
for a in soup.find_all(lambda t: t.name == 'a' and t.get_text(strip=True) != ''):
    print(a)

# select <a> tags that don't contain <img> tags
for a in soup.select('a:not(:has(img))'):
    print(a)

# select <a> tags that contain text and don't contain <img> tags
for a in soup.find_all(lambda t: t.name == 'a' and t.get_text(strip=True) != '' and not t.find('img')):
    print(a)

Posted on 2020-01-14 09:49:13
A solution using SimplifiedDoc.
from simplified_scrapy.simplified_doc import SimplifiedDoc
html='''<a href="https://XXXX/" ><picture class="crp_thumb crp_featured" title="XXXX">
<source type="image/webp" srcset="https://XXXX.jpg.webp"/>
<img width="150" height="150" src="https://XXXX.jpg" alt="XXXX"/>
</picture></a>'''
doc = SimplifiedDoc(html)
# collect every <a>, <img> and <source> element
lstA = doc.getElementsByTag('a')
lstImg = doc.getElementsByTag('img')
lstSource = doc.getElementsByTag('source')
print([a.href for a in lstA])
print([img.src for img in lstImg])
print([source.srcset for source in lstSource])
# keep only the <a> elements that do not contain a <picture> tag
lstA = doc.getElementsByTag('a').notContains('<picture')
print([a.href for a in lstA])

Result:
['https://XXXX/']
['https://XXXX.jpg']
['https://XXXX.jpg.webp']
[]

https://stackoverflow.com/questions/59725518
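For reference, the "has text and no <img>" filter from the BeautifulSoup answer could be folded into the question's scraping loop roughly as sketched below. The helper names `is_text_link` and `collect_text_links`, the sample markup, and the `https://example.com/page` URL are assumptions made for illustration; they do not appear in the original thread.

```python
from bs4 import BeautifulSoup


def is_text_link(tag):
    """True for <a> tags that have visible text and no <img> inside."""
    return (
        tag.name == "a"
        and tag.get_text(strip=True) != ""
        and tag.find("img") is None
    )


def collect_text_links(page_url, html):
    """Return [page_url, href] pairs for text links only (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    return [[page_url, a.get("href")] for a in soup.find_all(is_text_link)]


# Sample markup modelled on the question: one thumbnail link, one text link.
sample = '''
<div class="related-content">
<a href="https://XXXX/"><picture class="crp_thumb crp_featured" title="XXXX">
<source type="image/webp" srcset="https://XXXX.jpg.webp"/>
<img width="150" height="150" src="https://XXXX.jpg" alt="XXXX"/>
</picture></a>
</div>
<a href="https://XXX">OK LINK</a>
'''

print(collect_text_links("https://example.com/page", sample))
```

Inside the question's loop, the inner `for line in soup.find_all('a')` block could then presumably be replaced by `links_with_text.extend(collect_text_links(url, browser.page_source))`, with `browser`, `list_urls` and `links_with_text` defined as in the question.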