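Hi, I'm trying to navigate between sibling tags. Below is part of the page source I want to scrape. If you look closely, there are 3 ul tags. The first ul tag has class="listGroup". I'm trying to extract the text of the second ul tag, using the fact that it comes right after a ul tag with class "listGroup". How can I do this?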
<ul class="listGroup" id="ul_e6d09fbd-19fe-49ac-9b47-bd857c0d411b"><li class="acces-listitems"><a href="https://order.store.mayoclinic.com/books/gnweb43?utm_source=MC-DotOrg-PS&utm_medium=Link&utm_campaign=FamilyHealth-Book&utm_content=FHB">Book: Mayo Clinic Family Health Book, 5th Edition</a></li><li class="acces-hide-listitems"><a href="https://order.store.mayoclinic.com/hl/hldiged?utm_source=MC-DotOrg-PS&utm_medium=Link&utm_campaign=HealthLetter-Digital&utm_content=HLDE">Newsletter: Mayo Clinic Health Letter — Digital Edition</a></li></ul>
<ul>
<li>Osteoporosis</li>
<li>Kidney stones</li>
<li>Excessive urination</li>
<li>Abdominal pain</li>
<li>Tiring easily or weakness</li>
<li>Depression or forgetfulness</li>
<li>Bone and joint pain</li>
<li>Frequent complaints of illness with no apparent cause</li>
<li>Nausea, vomiting or loss of appetite</li>
</ul>
<ul>
<li>A noncancerous growth (adenoma) on a gland is the most common cause.</li>
<li>Enlargement (hyperplasia) of two or more parathyroid glands accounts for most other cases.</li>
<li>A cancerous tumor is a very rare cause of primary hyperparathyroidism.</li>
</ul>
Here is the short script I have so far. Please help.
import requests
import pandas
from bs4 import BeautifulSoup

for link in ['/diseases-conditions/hyperparathyroidism/symptoms-causes/syc-20356194']:
    page = requests.get(f"https://www.mayoclinic.org{link}")
    soup = BeautifulSoup(page.content, "html.parser")
    # Print every <ul> on the page
    for each in soup.find_all("ul"):
        print(each)

Answered on 2020-06-26 18:25:47
You can use the CSS selector ul.listGroup + ul li. It selects every <li> inside the <ul> that immediately follows a <ul> with class "listGroup":
from bs4 import BeautifulSoup

txt = '''<ul class="listGroup" id="ul_e6d09fbd-19fe-49ac-9b47-bd857c0d411b"><li class="acces-listitems"><a href="https://order.store.mayoclinic.com/books/gnweb43?utm_source=MC-DotOrg-PS&utm_medium=Link&utm_campaign=FamilyHealth-Book&utm_content=FHB">Book: Mayo Clinic Family Health Book, 5th Edition</a></li><li class="acces-hide-listitems"><a href="https://order.store.mayoclinic.com/hl/hldiged?utm_source=MC-DotOrg-PS&utm_medium=Link&utm_campaign=HealthLetter-Digital&utm_content=HLDE">Newsletter: Mayo Clinic Health Letter — Digital Edition</a></li></ul>
<ul>
<li>Osteoporosis</li>
<li>Kidney stones</li>
<li>Excessive urination</li>
<li>Abdominal pain</li>
<li>Tiring easily or weakness</li>
<li>Depression or forgetfulness</li>
<li>Bone and joint pain</li>
<li>Frequent complaints of illness with no apparent cause</li>
<li>Nausea, vomiting or loss of appetite</li>
</ul>
<ul>
<li>A noncancerous growth (adenoma) on a gland is the most common cause.</li>
<li>Enlargement (hyperplasia) of two or more parathyroid glands accounts for most other cases.</li>
<li>A cancerous tumor is a very rare cause of primary hyperparathyroidism.</li>
</ul>'''
soup = BeautifulSoup(txt, 'html.parser')
# "+" matches only the <ul> directly after ul.listGroup
for li in soup.select('ul.listGroup + ul li'):
    print(li.text)

Prints:
Osteoporosis
Kidney stones
Excessive urination
Abdominal pain
Tiring easily or weakness
Depression or forgetfulness
Bone and joint pain
Frequent complaints of illness with no apparent cause
Nausea, vomiting or loss of appetite

Answered on 2020-06-26 18:28:13
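This looks like a natural use case for CSS selectors, namely:
ul.listGroup + ul li selects all li tags inside the ul tag that immediately follows each ul tag with class listGroup. By contrast, replacing + with ~ selects the li tags inside every later sibling ul (2 of them in this example) that follows a ul with class listGroup, whether or not it is directly adjacent.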
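To make the difference between the two combinators concrete, here is a minimal sketch on a shortened stand-in for the markup in the question. The html string and the names adjacent and following are made up for illustration; only select and the two selectors come from the answers.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical stand-in for the three <ul> tags in the question.
html = """
<ul class="listGroup"><li>Book</li></ul>
<ul><li>Osteoporosis</li><li>Kidney stones</li></ul>
<ul><li>Adenoma</li></ul>
"""
soup = BeautifulSoup(html, "html.parser")

# "+" (adjacent sibling): only the <ul> directly after ul.listGroup.
adjacent = [li.get_text() for li in soup.select("ul.listGroup + ul li")]

# "~" (general sibling): every later sibling <ul> of ul.listGroup.
following = [li.get_text() for li in soup.select("ul.listGroup ~ ul li")]

print(adjacent)
print(following)
```

With this input, adjacent holds only the items of the second list, while following also picks up the item from the third list.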
To apply this in your script, replace find_all with select and pass it the CSS selector:
import requests
import pandas
from bs4 import BeautifulSoup

for link in ['/diseases-conditions/hyperparathyroidism/symptoms-causes/syc-20356194']:
    page = requests.get(f"https://www.mayoclinic.org{link}")
    soup = BeautifulSoup(page.content, "html.parser")
    # Only the <li> tags of the <ul> directly after ul.listGroup
    for each in soup.select("ul.listGroup + ul li"):
        print(each.text)

Answered on 2020-06-26 23:59:45
Perhaps you should consider using a regular expression to capture the text.
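As a rough sketch of what that could look like, assuming the markup is as regular as in the question (no nested lists, class as the first attribute, a plain <ul> with no attributes right after it). The html string and the pattern are illustrative only, and a pattern like this will stop matching if the page layout changes.

```python
import re

# Shortened, hypothetical stand-in for the markup in the question.
html = (
    '<ul class="listGroup" id="x"><li>Book</li></ul>\n'
    "<ul>\n<li>Osteoporosis</li>\n<li>Kidney stones</li>\n</ul>\n"
    "<ul>\n<li>Adenoma</li>\n</ul>"
)

# Capture the body of the plain <ul> that directly follows the listGroup <ul>.
block = re.search(
    r'<ul class="listGroup"[^>]*>.*?</ul>\s*<ul>(.*?)</ul>',
    html,
    re.DOTALL,
)

# Pull the text of each <li> out of the captured body.
items = re.findall(r"<li>(.*?)</li>", block.group(1), re.DOTALL) if block else []
print(items)
```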
https://stackoverflow.com/questions/62591628