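I want to print the "Civil Penalty" section of Environmental Protection Agency settlement pages, such as https://www.epa.gov/enforcement/chevron-settlement-information-sheet or https://www.epa.gov/enforcement/ngl-crude-logistics-llc-clean-air-act-settlement.
From the following HTML source: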
<h2 id="civil">Civil Penalty</h2>
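<p>Chevron U.S.A. will pay a $2.95 million civil penalty, of which $2,492,750 will be paid to the United States and $457,250 to the State of Mississippi.</p>
I want to get the paragraph text that begins "Chevron U.S.A. will pay a $2.95 million civil penalty".
The same structure is used on all of the settlement information sheets: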
<h2 id="civil">Civil Penalty</h2>
<p>NGL will pay a civil penalty of $25 million. The penalty is based, in part, on the company’s limited ability to pay a larger penalty.</p>
I found a similar question, "get an element before a string with Beautiful Soup", but it is not quite the same as my problem.
Here is my code skeleton:
import requests
from bs4 import BeautifulSoup
import sys

for i in ['chevron-settlement-information-sheet', 'ngl-crude-logistics-llc-clean-air-act-settlement', 'derive-systems-clean-air-act-settlement']:
    page = requests.get("https://www.epa.gov/enforcement/"+i)
    soup = BeautifulSoup(page.content, 'html.parser')
    data = []
    # This collects the <h2 id="civil"> headings themselves, not the paragraph after them
    for result in soup.find_all('h2', id='civil'):
        data.append(result)
    print(data)

How do I print the <p> that comes directly after the <h2 id="civil"> heading?
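Posted on 2018-10-29 17:09:43
One reason you may not be getting the results you are looking for is that you are appending /history to the URL, which leads to a 404 error page. If you remove that part and then use findNext('p') to get the next paragraph element on the page after the <h2 id="civil"> element, you get the expected result: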
import requests
from bs4 import BeautifulSoup

for url in ['chevron-settlement-information-sheet', 'ngl-crude-logistics-llc-clean-air-act-settlement', 'derive-systems-clean-air-act-settlement']:
    page = requests.get("https://www.epa.gov/enforcement/" + url)
    soup = BeautifulSoup(page.content, 'html.parser')
    # Locate the heading, then take the first <p> that follows it in the document
    result = soup.find('h2', {'id': 'civil'}).findNext('p')
    print(result.text)

This prints:
Chevron U.S.A. will pay a $2.95 million civil penalty, of which $2,492,750 will be paid to the United States and $457,250 to the State of Mississippi.
NGL will pay a civil penalty of $25 million. The penalty is based, in part, on the company’s limited ability to pay a larger penalty.
Derive will pay a civil penalty of $300,000, as the company has limited financial ability to pay a higher penalty.

Posted on 2018-10-29 17:06:38
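You can try the adjacent sibling selector, +.

p = soup.select('#civil + p')
print(p[0].getText())

This selects only the p element that is the immediately following sibling of the #civil element.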
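
The two approaches in this thread are not quite interchangeable, and both raise an exception if the heading is missing. The following is a minimal offline sketch, not code from either answer: it runs against an inline copy of the markup from the question instead of the live EPA pages, and the helper names penalty_via_selector and penalty_via_find_next, as well as the SEPARATED sample with its extra <div>, are hypothetical and only there for illustration. It uses find_next, the current spelling of findNext, and select_one, which returns a single element or None.

```python
from bs4 import BeautifulSoup

# Markup copied from the question: the <p> directly follows the heading.
ADJACENT = """
<h2 id="civil">Civil Penalty</h2>
<p>Chevron U.S.A. will pay a $2.95 million civil penalty, of which $2,492,750 will be paid to the United States and $457,250 to the State of Mississippi.</p>
"""

# Hypothetical variant (an assumption, not taken from any EPA page):
# another element sits between the heading and the paragraph.
SEPARATED = """
<h2 id="civil">Civil Penalty</h2>
<div>Hypothetical intervening element</div>
<p>Chevron U.S.A. will pay a $2.95 million civil penalty, of which $2,492,750 will be paid to the United States and $457,250 to the State of Mississippi.</p>
"""


def penalty_via_selector(html):
    """Text of the <p> that is the immediate next sibling of #civil, or None."""
    soup = BeautifulSoup(html, "html.parser")
    node = soup.select_one("#civil + p")
    return node.get_text(strip=True) if node else None


def penalty_via_find_next(html):
    """Text of the first <p> after <h2 id="civil"> in document order, or None."""
    soup = BeautifulSoup(html, "html.parser")
    heading = soup.find("h2", id="civil")
    if heading is None:
        # Page has no Civil Penalty heading (for example, an error page)
        return None
    node = heading.find_next("p")
    return node.get_text(strip=True) if node else None


if __name__ == "__main__":
    print(penalty_via_selector(ADJACENT))
    print(penalty_via_find_next(ADJACENT))
    print(penalty_via_selector(SEPARATED))   # None: the <div> breaks adjacency
    print(penalty_via_find_next(SEPARATED))
```

On the markup shown in the question both helpers return the same paragraph. They differ only when something sits between the heading and the paragraph: the + selector then matches nothing, while find_next keeps scanning forward, which also means it can pick up a paragraph from a later section if the Civil Penalty section has none of its own.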
https://stackoverflow.com/questions/53050160