I am trying to extract data from the following URL: https://www.medicineindia.org/pharmacology-for-generic/3/diphtheria-toxoid-pertusis-vaccine-tetanus-toxoid, and I need to append the data in the following form:
[
['id', 'heading', 'data_under_heading_as_one_string','heading','data_under_heading_as_one_string',....],
['id', 'heading', 'data_under_heading_as_one_string','heading','data_under_heading_as_one_string',....]
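]
There are multiple items on the page, and I need to get each item's information as a separate list, as shown above. Each item's name is given in an h2 tag. The related information is provided under 20 headings (dt tags), and the corresponding content is given in dd tags.
Here is my approach: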
final_data = []
for g in range(5):
    url = df['url_column'][g]
    page_source = req.get(url)
    soup = bs4.BeautifulSoup(page_source.text,"html5lib")
    heading = soup.find_all('h2')
    headings = []
    for head in heading:
        headings.append(head.text)
    for i in range(len(headings)-1):
        text = soup.find(text=headings[i])
        row = []
        row.insert(0,df['id'][g])
        for d in range(40):
            for x in text.findNext(['dt','dd']):
                row.append(x) # <--- here's the problem
                text = x
        final_data.append(row)
    print(g, end = ' ')
My problem is that the content under one of the headings (a numbered list of strings) gets broken up into several strings instead of staying as one string. Because of this, when I try to build a DataFrame by appending all the row lists, unnecessary columns are created that contain br/ tags and similar fragments.
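I tried changing x (marked with a comment in the code as the problem spot), which is a NavigableString, by converting it to a string and replacing the unnecessary br/, numbering, periods, etc.:
s = str(x) # here's the problem
row.append(s.replace('<dd>|</dd>|<br/>|\d+\.',''))
Any help would be greatly appreciated!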
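Posted on 2020-07-30 02:28:49
I hope I have understood your question correctly. This script collects all the <h2>, <dt> and <dd> tags into structured lists, one list per <h2>: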
import requests
from bs4 import BeautifulSoup

url = 'https://www.medicineindia.org/pharmacology-for-generic/3/diphtheria-toxoid-pertusis-vaccine-tetanus-toxoid'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

all_data = []
# Walk the h2/dt/dd tags in document order.
for tag in soup.select('h2, dt, dd'):
    if tag.name == 'h2':
        # Each <h2> starts a new item list.
        all_data.append([tag.get_text()])
    elif tag.name in ('dt', 'dd'):
        # Join text split by <br/> into one space-separated string.
        all_data[-1].append(tag.get_text(strip=True, separator=' '))

from pprint import pprint
pprint(all_data, width=150)
Prints:
[['Diphtheria toxoid + Pertusis vaccine + Tetanus toxoid',
  'About Diphtheria toxoid + Pertusis vaccine + Tetanus toxoid',
  'N/A',
  'Mechanism of Action of Diphtheria toxoid + Pertusis vaccine + Tetanus toxoid',
  'N/A',
  'Pharmacokinets of Diphtheria toxoid + Pertusis vaccine + Tetanus toxoid',
  'N/A',
  'Onset of Action for Diphtheria toxoid + Pertusis vaccine + Tetanus toxoid',
  'N/A',
  'Duration of Action for Diphtheria toxoid + Pertusis vaccine + Tetanus toxoid',
  'N/A',
  'Half Life of Diphtheria toxoid + Pertusis vaccine + Tetanus toxoid',
  'N/A',
  'Side Effects of Diphtheria toxoid + Pertusis vaccine + Tetanus toxoid',
  'N/A',
  'Contra-indications of Diphtheria toxoid + Pertusis vaccine + Tetanus toxoid',
  'N/A',
  'Special Precautions while taking Diphtheria toxoid + Pertusis vaccine + Tetanus toxoid',
  'N/A',
  'Pregnancy Related Information',
  'N/A',
  'Old Age Related Information',
  'N/A',
  'Breast Feeding Related Information',
...and so on.
https://stackoverflow.com/questions/63159617
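To get closer to the row layout asked for in the question (an id followed by heading/data pairs), the same loop can prepend the id to each item list. The following is only a sketch run against a small inline HTML snippet rather than the live page: SAMPLE_HTML, rows_from_html and row_id are hypothetical names introduced for illustration, and the real page's markup is assumed to follow the same h2/dt/dd pattern. The item name from the h2 tag is kept as the second element.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for one page: each <h2> item name is followed by
# <dt>/<dd> pairs, and one <dd> contains a numbered list split by <br/>.
SAMPLE_HTML = """
<h2>Item A</h2>
<dl>
  <dt>Side Effects of Item A</dt>
  <dd>1. Fever<br/>2. Rash<br/>3. Swelling</dd>
  <dt>Half Life of Item A</dt>
  <dd>N/A</dd>
</dl>
<h2>Item B</h2>
<dl>
  <dt>About Item B</dt>
  <dd>N/A</dd>
</dl>
"""


def rows_from_html(html, row_id):
    """Return one [row_id, item_name, heading, data, ...] list per <h2>."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tag in soup.select("h2, dt, dd"):
        if tag.name == "h2":
            # A new item starts: begin a new row with the id and item name.
            rows.append([row_id, tag.get_text(strip=True)])
        elif rows:
            # get_text joins the pieces around <br/> into a single string.
            rows[-1].append(tag.get_text(strip=True, separator=" "))
    return rows


rows = rows_from_html(SAMPLE_HTML, row_id=3)
for row in rows:
    print(row)
```

In the question's loop, row_id would presumably be df['id'][g] and html the text of the fetched page, so that each call's rows can be added to final_data with extend.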
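A side note on the replace attempt in the question: str.replace searches for its first argument as literal text, so a string such as '<dd>|</dd>|<br/>|\d+\.' is not interpreted as a set of alternatives and nothing is removed. If a regex-based cleanup is still wanted, re.sub accepts that pattern. The sketch below uses a made-up raw string to show the difference; get_text as used above is likely the simpler route.

```python
import re

# Hypothetical example of str(x) for a <dd> holding a numbered list.
raw = "<dd>1. Fever<br/>2. Rash<br/>3. Swelling</dd>"
pattern = r"<dd>|</dd>|<br/>|\d+\."

# str.replace looks for the pattern text literally; it does not occur
# in raw, so the string comes back unchanged.
unchanged = raw.replace(pattern, "")

# re.sub treats the text as a regular expression with alternatives.
# Replace each match with a space, then collapse repeated whitespace.
cleaned = " ".join(re.sub(pattern, " ", raw).split())

print(unchanged == raw)
print(cleaned)
```

Note that the \d+\. alternative would also strip the leading part of a decimal number such as 2.5, so the pattern may need tightening for real data.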