Q: Check whether a string has a .pdf extension

Stack Overflow user

Asked 2020-01-17 13:10:48

Answers: 2 · Views: 293 · Followers: 0 · Votes: 0

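I'm new to scraping and have two problems. First, I need to scrape a specific section of a website that contains anchor tags. I only need the anchors that link to PDFs, together with their titles, but unfortunately the section also contains anchors with normal links. That is my first problem.

The second problem is that the output contains unwanted line breaks. Both problems come from the same code, shown below.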

website.html

<div>
<a href="www.url.com/somethin.pdf">pdf
link</a>

<a href="www.url.com/somethin.pdf">pdf
link</a>

<a href="www.url.com/somethin">normal
link</a>
</div>

scrappy.py

import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.privacy.gov.ph/advisories/')
soup = BeautifulSoup(page.content,'html.parser')

section = soup.find("section", {"class": "news_content"})
for link in section.find_all("a"):
   pdf =  link['href'].replace("..", "")
   title =  link.text.strip()
   print("title: " + title + "\t")
   print("pdf_link: " + pdf + "\t")
   print('\n')

If you run this code, you will see that the titles scraped from the HTML contain unwanted line breaks.

python

web-scraping

beautifulsoup

2 Answers

Stack Overflow user

Accepted answer

Posted 2020-01-17 13:25:02

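You can use a regex to select only the hrefs that end with the pdf extension. I'm not sure what you mean by unwanted line breaks. I can only assume you mean that there are two blank lines between each printed record. If that assumption is correct, it is because every print call already ends its output with a newline. So print('\n') prints a newline and then print adds another one. If you only want a single blank line, remove the last print call and change \t to \n.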

import requests
from bs4 import BeautifulSoup
import re

page = requests.get('https://www.privacy.gov.ph/advisories/')
soup = BeautifulSoup(page.content,'html.parser')

section = soup.find("section", {"class": "news_content"})
links = section.find_all(href=re.compile(r"\.pdf$")) # <---- SEE HERE

for link in links:
   pdf =  link['href'].replace("..", "")
   title =  link.text.strip().replace('\n','')
   print("title: " + title)
   print("pdf_link: " + pdf + "\n")

Output:

title: Updated Templates on Security Incident and Personal Data Breach Reportorial Requirements 
pdf_link: https://www.privacy.gov.ph/wp-content/files/attachments/nwsltr/Final_Advisory18-02_6.26.18.pdf        

title: Guidelines on Privacy Impact Assessments   
pdf_link: https://www.privacy.gov.ph/wp-content/files/attachments/nwsltr/NPC_AdvisoryNo.2017-03.pdf     

title: Access to Personal Data Sheets of Government Personnel 
pdf_link: https://www.privacy.gov.ph/wp-content/files/attachments/nwsltr/NPC_Advisory_No.2017-02.pdf
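
The extra blank lines described in this answer can be reproduced without any network access. The following is a minimal sketch, not part of the original answer: the helper `capture` and the sample strings are made up for illustration. It only relies on the fact that `print` appends a newline by default.

```python
import io
from contextlib import redirect_stdout


def capture(func):
    """Run func and return everything it printed to stdout."""
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        func()
    return buffer.getvalue()


def original_style():
    # Each print() already ends its output with "\n",
    # so print('\n') emits two newlines in a row.
    print("title: A" + "\t")
    print("pdf_link: a.pdf" + "\t")
    print('\n')


def single_blank_line():
    # Appending "\n" to the last line leaves exactly one blank line.
    print("title: A")
    print("pdf_link: a.pdf" + "\n")


original_output = capture(original_style)
fixed_output = capture(single_blank_line)

print(repr(original_output))
print(repr(fixed_output))
```

Comparing the two `repr` strings shows three consecutive newlines at the end of the first version and two at the end of the second.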

Votes: 0

Stack Overflow user

Posted 2020-01-17 13:29:19

In your case, some of the titles contain \n in their text, so you should try the following:

title =  link.text.strip().replace('\n', '')
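
One caveat worth noting, as an aside that is not part of the original answer: removing `\n` outright joins the words on either side of the break, which matters for markup like the `pdf` / `link` anchors in the question. A possible alternative, sketched here with a hypothetical helper named `normalize_title`, is to collapse every run of whitespace into a single space.

```python
def normalize_title(text):
    """Collapse every run of whitespace (including newlines) into one space."""
    # str.split() with no arguments splits on any whitespace and drops
    # leading/trailing whitespace, so no separate strip() is needed.
    return " ".join(text.split())


raw_title = "pdf\nlink"  # anchor text as it appears in website.html

joined = raw_title.strip().replace("\n", "")  # words run together
normalized = normalize_title(raw_title)       # words stay separated

print(joined)
print(normalized)
```

Which behaviour is preferable depends on whether the line breaks in the page fall between words or inside them.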

So the final code, with the .pdf filter added, looks like this:

section = soup.find("section", {"class": "news_content"})
for link in section.find_all("a"):
   pdf =  link['href'].replace("..", "")
   if not pdf.endswith('.pdf'):
       continue
   title =  link.text.strip().replace('\n', '')
   print("title: " + title + "\t")
   print("pdf_link: " + pdf + "\t")
   print('\n')
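
Both `endswith('.pdf')` and the `\.pdf$` regex test the whole href string, so they would miss links such as `file.PDF` or `file.pdf?download=1`. If such links can occur (an assumption; the page in the question may not contain any), a slightly more tolerant check can look only at the path component of the URL. The function name `has_pdf_extension` is hypothetical; `urlsplit` and `posixpath.splitext` come from the standard library.

```python
from posixpath import splitext
from urllib.parse import urlsplit


def has_pdf_extension(href):
    """Return True if the path part of href ends in .pdf (case-insensitive)."""
    # urlsplit separates the path from any query string or fragment.
    path = urlsplit(href).path
    # splitext returns (root, extension); the extension includes the dot.
    return splitext(path)[1].lower() == ".pdf"


hrefs = [
    "www.url.com/somethin.pdf",
    "www.url.com/somethin",
    "https://example.com/file.PDF?download=1",
]
pdf_hrefs = [href for href in hrefs if has_pdf_extension(href)]
print(pdf_hrefs)
```

Inside the loop above, this would replace the `if not pdf.endswith('.pdf')` test with `if not has_pdf_extension(pdf)`.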

Votes: 1

The original page content is provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/59788022
