Q: Excluding image links in BeautifulSoup

Stack Overflow user

Asked 2020-01-14 07:11:25

2 answers, 406 views, 0 followers, 0 votes

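I'm looking for a way to exclude image links, i.e. links that don't contain any anchor text. The code below collects the data I want, but it also picks up unwanted URLs from some thumbnail/image links on the page.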

for url in list_urls:
    browser.get(url)
    soup = BeautifulSoup(browser.page_source,"html.parser")
    for line in soup.find_all('a'):
        href = line.get('href')
        links_with_text.append([url, href])

The images on the scraped pages all have the same format (and they all sit under the same div class, "related-content"):

<a href="https://XXXX/"    ><picture class="crp_thumb crp_featured" title="XXXX">
<source type="image/webp" srcset="https://XXXX.jpg.webp"/>
<img width="150" height="150" src="https://XXXX.jpg" alt="XXXX"/>
</picture></a>

python

beautifulsoup

2 Answers

Stack Overflow user

Answered 2020-01-14 07:22:09

Here are a few examples you can use:

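Select <a> tags that contain text (skipping anchors with no text)
Select <a> tags that don't contain an <img> tag
Select <a> tags that contain text and don't contain an <img> tag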

txt = '''
<a href="https://XXXX/">
<picture class="crp_thumb crp_featured" title="XXXX">
<source type="image/webp" srcset="https://XXXX.jpg.webp"/>
<img width="150" height="150" src="https://XXXX.jpg" alt="XXXX"/>
</picture>
</a>

<a href="https://XXX">OK LINK</a>
'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(txt, 'html.parser')

# select <a> tags that contain text (skips anchors with no text)
for a in soup.find_all(lambda t: t.name == 'a' and t.get_text(strip=True) != ''):
    print(a)

# select <a> tags that don't contain <img> tags
for a in soup.select('a:not(:has(img))'):
    print(a)

# select <a> tags that contain text and don't contain <img> tags
for a in soup.find_all(lambda t: t.name == 'a' and t.get_text(strip=True) != '' and not t.find('img')):
    print(a)
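
To tie this back to the loop in the question, here is a minimal sketch that wraps the third filter in a helper and collects `[url, href]` pairs. The function name `extract_text_links` and the `https://example.com/page` URL are made up for illustration; in the question's code the HTML would come from `browser.page_source`.

```python
from bs4 import BeautifulSoup


def extract_text_links(page_html, page_url):
    """Return [page_url, href] pairs for <a> tags that have text and no <img>."""
    soup = BeautifulSoup(page_html, "html.parser")
    pairs = []
    for a in soup.find_all("a", href=True):
        has_text = a.get_text(strip=True) != ""
        has_image = a.find("img") is not None
        if has_text and not has_image:
            pairs.append([page_url, a["href"]])
    return pairs


# Sample markup modelled on the thumbnail link from the question.
sample = '''
<a href="https://XXXX/"><picture class="crp_thumb crp_featured" title="XXXX">
<img width="150" height="150" src="https://XXXX.jpg" alt="XXXX"/>
</picture></a>
<a href="https://XXX">OK LINK</a>
'''

# Hypothetical page URL; in the question this would be each entry of list_urls.
links_with_text = extract_text_links(sample, "https://example.com/page")
print(links_with_text)
```

Inside the original loop, this would presumably be called as `links_with_text.extend(extract_text_links(browser.page_source, url))`.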

0 votes

Stack Overflow user

Answered 2020-01-14 09:49:13

A solution using SimplifiedDoc.

from simplified_scrapy.simplified_doc import SimplifiedDoc
html='''<a href="https://XXXX/"    ><picture class="crp_thumb crp_featured" title="XXXX">
<source type="image/webp" srcset="https://XXXX.jpg.webp"/>
<img width="150" height="150" src="https://XXXX.jpg" alt="XXXX"/>
</picture></a>'''
doc = SimplifiedDoc(html)
lstA = doc.getElementsByTag('a')
lstImg = doc.getElementsByTag('img')
lstSource = doc.getElementsByTag('source')
print ([a.href for a in lstA])
print ([img.src for img in lstImg])
print ([source.srcset for source in lstSource])
lstA = doc.getElementsByTag('a').notContains('<picture')
print ([a.href for a in lstA])

Result:

['https://XXXX/']
['https://XXXX.jpg']
['https://XXXX.jpg.webp']
[]
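
Staying with BeautifulSoup, another option is to use the detail from the question that every thumbnail sits under a div with class "related-content", and skip any link inside that container. This is only a sketch: the sample markup, the "content" class, and the `https://YYYY/` link are invented for illustration, and it assumes every link in that block is unwanted, including text links.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment: one ordinary link, plus a "related-content"
# block that holds a thumbnail link and a text link.
html = '''
<div class="content"><a href="https://XXX">OK LINK</a></div>
<div class="related-content">
<a href="https://XXXX/"><img src="https://XXXX.jpg" alt="XXXX"/></a>
<a href="https://YYYY/">Related post title</a>
</div>
'''

soup = BeautifulSoup(html, "html.parser")

# Keep only anchors that have no <div class="related-content"> ancestor.
hrefs = [
    a["href"]
    for a in soup.find_all("a", href=True)
    if a.find_parent("div", class_="related-content") is None
]
print(hrefs)
```

Unlike the text/`<img>` filters, this drops everything in the related-content block, so it fits only if none of those links are wanted.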

0 votes

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/59725518
