Q: Remove characters from a NavigableString and append to a list in Python

Stack Overflow user

Asked on 2020-07-30 01:51:25

1 answer, 85 views, 0 followers, 1 vote

I am trying to extract data from the following URL: https://www.medicineindia.org/pharmacology-for-generic/3/diphtheria-toxoid-pertusis-vaccine-tetanus-toxoid, and I need to append the data in the following form:

[
 ['id', 'heading', 'data_under_heading_as_one_string','heading','data_under_heading_as_one_string',....],
 ['id', 'heading', 'data_under_heading_as_one_string','heading','data_under_heading_as_one_string',....]
]

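The page contains multiple items, and I need the information for each item as a separate list, as shown above. Each item's name is given in an h2 tag. The related information sits under 20 headings (dt tags), and the corresponding content is given in dd tags.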

Here is my approach:

final_data = []
for g in range(5):
    url = df['url_column'][g] 
    page_source = req.get(url)
    soup = bs4.BeautifulSoup(page_source.text,"html5lib")
    heading =  soup.find_all('h2')
    headings = []
    for head in heading:
        headings.append(head.text)
    for i in range(len(headings)-1):
        text = soup.find(text=headings[i])
        row = []
        row.insert(0,df['id'][g])
        for d in range(40):
            for x in text.findNext(['dt','dd']):
                row.append(x) # <--- here's the problem
            text = x
        final_data.append(row)
    print(g, end = ' ')

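My problem is that the content under one of the headings (a numbered list of strings) gets split into several strings instead of one. Because of this, when I build a DataFrame by appending all the row lists, unnecessary columns are created that contain br/ tags and the like.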

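I tried converting x (marked with a comment in the code above as the problem spot), which is a NavigableString, to a plain string and replacing the unnecessary br/ tags, numbering, periods, and so on: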

s = str(x) # here's the problem
row.append(s.replace('<dd>|</dd>|<br/>|\d+\.',''))
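
One likely reason the snippet above has no effect: `str.replace` matches its first argument as literal text, so the `|` alternation and `\d+\.` are never interpreted as a regular expression. A minimal sketch of the regex version with `re.sub` follows. The sample string `raw` is made up for illustration and is only a guess at what `str(x)` might look like for a numbered `<dd>`.

```python
import re

# Hypothetical sample of what str(x) might look like for a numbered <dd>.
raw = "<dd>1. Fever<br/>2. Swelling at the injection site<br/>3. Irritability</dd>"

# str.replace searches for the whole pattern as literal text, so nothing changes.
unchanged = raw.replace(r"<dd>|</dd>|<br/>|\d+\.", "")

# re.sub interprets the pattern as a regular expression.
no_tags = re.sub(r"</?dd>|<br\s*/?>|\d+\.", " ", raw)
cleaned = " ".join(no_tags.split())  # collapse runs of whitespace

print(unchanged == raw)
print(cleaned)
```

Note that `\d+\.` would also eat the leading part of a decimal such as `0.5`, so the pattern may need tightening if the real data contains such values.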

Any help would be greatly appreciated!

beautifulsoup

python

1 Answer

Stack Overflow user

Accepted answer

Posted on 2020-07-30 02:28:49

I hope I understood your question correctly. This script collects all the <h2>, <dt>, and <dd> tags into a structured list:

import requests
from bs4 import BeautifulSoup


url = 'https://www.medicineindia.org/pharmacology-for-generic/3/diphtheria-toxoid-pertusis-vaccine-tetanus-toxoid'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

all_data = []
for tag in soup.select('h2, dt, dd'):
    if tag.name == 'h2':
        all_data.append([tag.get_text()])
    elif tag.name in ('dt', 'dd'):
        all_data[-1].append(tag.get_text(strip=True, separator=' '))

from pprint import pprint
pprint(all_data, width=150)

Prints:

[['Diphtheria toxoid + Pertusis vaccine + Tetanus toxoid',
  'About Diphtheria toxoid + Pertusis vaccine + Tetanus toxoid',
  'N/A',
  'Mechanism of Action of Diphtheria toxoid + Pertusis vaccine + Tetanus toxoid',
  'N/A',
  'Pharmacokinets of Diphtheria toxoid + Pertusis vaccine + Tetanus toxoid',
  'N/A',
  'Onset of Action for Diphtheria toxoid + Pertusis vaccine + Tetanus toxoid',
  'N/A',
  'Duration of Action for Diphtheria toxoid + Pertusis vaccine + Tetanus toxoid',
  'N/A',
  'Half Life of Diphtheria toxoid + Pertusis vaccine + Tetanus toxoid',
  'N/A',
  'Side Effects of Diphtheria toxoid + Pertusis vaccine + Tetanus toxoid',
  'N/A',
  'Contra-indications of Diphtheria toxoid + Pertusis vaccine + Tetanus toxoid',
  'N/A',
  'Special Precautions while taking Diphtheria toxoid + Pertusis vaccine + Tetanus toxoid',
  'N/A',
  'Pregnancy Related Information',
  'N/A',
  'Old Age Related Information',
  'N/A',
  'Breast Feeding Related Information',

...and so on.
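
To get closer to the `['id', 'heading', 'data', ...]` shape asked for in the question, the same loop can prepend an id to each row. The following is only a sketch that runs offline: the HTML is made up to mimic the h2 / dt / dd layout, and `rows_for_page` and `row_id` are hypothetical names, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Made-up HTML that mimics the h2 / dt / dd layout described in the question.
html = """
<h2>Drug A</h2>
<dl>
  <dt>Side Effects of Drug A</dt>
  <dd>1. Fever<br/>2. Swelling<br/>3. Irritability</dd>
  <dt>Half Life of Drug A</dt>
  <dd>N/A</dd>
</dl>
<h2>Drug B</h2>
<dl>
  <dt>Side Effects of Drug B</dt>
  <dd>N/A</dd>
</dl>
"""


def rows_for_page(html_text, row_id):
    """Return one [row_id, name, heading, data, ...] list per <h2>."""
    soup = BeautifulSoup(html_text, "html.parser")
    rows = []
    for tag in soup.select("h2, dt, dd"):
        # separator=" " joins the pieces split by <br/> into a single string.
        text = tag.get_text(strip=True, separator=" ")
        if tag.name == "h2":
            rows.append([row_id, text])
        elif rows:  # skip any dt/dd that appears before the first h2
            rows[-1].append(text)
    return rows


rows = rows_for_page(html, row_id=3)
for row in rows:
    print(row)
```

Because `get_text` is called on the whole `<dd>` tag rather than on its individual child strings, the numbered entries end up in one list element instead of several.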

Votes: 2

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/63159617