I used to work with Spark, which manages to build nice tables automatically. Now I am using Python and Beautiful Soup to ingest drug data. Here is my code; I want to build a table containing all the drugs and their related information.
I tried using "split", since the information is joined with "-",
but did not get anything readable. Please find below the code + a sample of the result + the structure of the ideal DataFrame.
1 - The code:
import requests
from bs4 import BeautifulSoup

def get_details(url):
    print('details:', url)
    # get subpage
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "lxml")
    # get data on subpage
    dts = soup.findAll('dt')
    dds = soup.findAll('dd')
    # display details
    for dt, dd in zip(dts, dds):
        print(dt.text)
        print(dd.text)
        print('---')
    print('---------------------------')

def drug_data():
    url = 'https://www.drugbank.ca/drugs/'
    while url:
        print(url)
        r = requests.get(url)
        soup = BeautifulSoup(r.text, "lxml")
        # get links to subpages
        links = soup.select('strong a')
        for link in links:
            # execute function to get subpage
            get_details('https://www.drugbank.ca' + link['href'])
        # next page url
        url = soup.findAll('a', {'class': 'page-link', 'rel': 'next'})
        print(url)
        if url:
            url = 'https://www.drugbank.ca' + url[0].get('href')
        else:
            break

2 - The output looks like this:
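The dt/dd pairing at the heart of get_details can be checked offline before adding any DataFrame logic. A minimal sketch, assuming BeautifulSoup is installed; the HTML snippet and its values are made-up stand-ins for one DrugBank subpage:

```python
from bs4 import BeautifulSoup

# Static HTML standing in for one drug subpage (illustrative values only)
html = """
<dl>
  <dt>Name</dt><dd>5-Methyltetrahydrofolic acid</dd>
  <dt>Accession Number</dt><dd>DB04789</dd>
  <dt>Type</dt><dd>Small Molecule</dd>
</dl>
"""

soup = BeautifulSoup(html, "html.parser")  # stdlib parser, no lxml needed here
dts = soup.find_all("dt")
dds = soup.find_all("dd")

# zip() pairs each <dt> label with the <dd> value that follows it
record = {dt.text: dd.text for dt, dd in zip(dts, dds)}
print(record)
```

Collecting one such dict per subpage is the natural intermediate step toward a table.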
Name
5-Methyltetrahydrofolic acid
Accession Number
DB04789
Type
Small Molecule
Groups
Approved, Nutraceutical
Description
5-Methyltetrahydrofolic acid is the methylated derivative of tetrahydrofolate. It is generated from 5,10-methylenetetrahydrofolate by methylenetetrahydrofolate reductase and is used to recycle homocysteine back to methionine by 5-methyltetrahydrofolate-homocysteine methyltransferase (also known as methionine synthase).
The Pandas DataFrame should look like this:

What do you suggest? Best,
Posted on 2018-01-09 05:22:01
If you want the simplest extension of your existing code that just picks a few columns for the DataFrame, you can use the following as a starting point:
import requests
from bs4 import BeautifulSoup
import pandas as pd

headings = ['Name', 'Accession Number', 'Type', 'Groups', 'Description']  # add more as needed, first hit on each is taken
df = pd.DataFrame([], columns=headings)

def get_details(url):
    global df
    global headings
    print('details:', url)
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "lxml")
    dts = soup.findAll('dt')
    dds = soup.findAll('dd')
    data = {}
    for dt, dd in zip(dts, dds):
        if (dt.text in headings) and (dt.text not in data):
            data[dt.text] = dd.text
    df = df.append(data, ignore_index=True)

def drug_data():
    url = 'https://www.drugbank.ca/drugs/'
    while url:
        print(url)
        r = requests.get(url)
        soup = BeautifulSoup(r.text, "lxml")
        # get links to subpages
        links = soup.select('strong a')
        for link in links:
            # execute function to get subpage
            get_details('https://www.drugbank.ca' + link['href'])
        # next page url
        url = soup.findAll('a', {'class': 'page-link', 'rel': 'next'})
        print(url)
        if url:
            url = 'https://www.drugbank.ca' + url[0].get('href')
        else:
            break

drug_data()
print(df[['Name', 'Accession Number']])

(Note that one problem with the data at that URL is that you get multiple hits for some labels, e.g. "Description".) Anyway, hope this helps.
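One caveat: `DataFrame.append` was removed in pandas 2.0. A version-proof variant of the same idea is to collect one dict per drug and build the frame once at the end. A sketch with made-up rows (the real dicts would come from get_details):

```python
import pandas as pd

headings = ['Name', 'Accession Number', 'Type', 'Groups', 'Description']

# One dict per scraped subpage (illustrative stand-in values, not real DrugBank data)
rows = [
    {'Name': 'Drug A', 'Accession Number': 'DB00001', 'Type': 'Small Molecule'},
    {'Name': 'Drug B', 'Accession Number': 'DB00002', 'Type': 'Biotech'},
]

# Build the frame in one call; headings missing from a dict become NaN
df = pd.DataFrame(rows, columns=headings)
print(df[['Name', 'Accession Number']])
```

Building the frame once is also much faster than appending row by row, since each append copies the whole frame.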
Posted on 2018-01-09 04:39:31
To reach your goal, a few things need to happen. First, you have to convert the parsed HTML into a Pandas DataFrame; for that you can use the Pandas read_html function. Then you have to merge the resulting DataFrames into a single one. To style the output you can use Pandas' display-related options; since that depends a bit on your system, I just added three possible options at the top of the script. For further configuration, feel free to consult Pandas options.
### Load the pandas library
import pandas as pd

### Set Pandas settings
# Set max number of displayed columns
pd.set_option('display.max_columns', 100)
# Set max column width
pd.set_option('display.max_colwidth', 1000)
# Prevent pandas linebreak
pd.set_option('display.expand_frame_repr', False)

import requests
from bs4 import BeautifulSoup

def get_details(url):
    print('details:', url)
    # get subpage
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "lxml")
    ### Convert the soup object to string
    # then you can parse it directly with pandas;
    # this returns a list of dfs, therefore fetch the first list item
    # and return the df to merge it later with the other results
    return pd.read_html(str(soup))[0]

def drug_data():
    url = 'https://www.drugbank.ca/drugs/'
    ### Helper list to store the single drugs' Pandas DataFrames
    listOfDrugDfs = []
    while url:
        print(url)
        r = requests.get(url)
        soup = BeautifulSoup(r.text, "lxml")
        # get links to subpages
        links = soup.select('strong a')
        for link in links:
            # execute function to get subpage
            ### Extended to append results of function to listOfDrugDfs
            listOfDrugDfs.append(get_details('https://www.drugbank.ca' + link['href']))
        # next page url
        url = soup.findAll('a', {'class': 'page-link', 'rel': 'next'})
        print(url)
        if url:
            url = 'https://www.drugbank.ca' + url[0].get('href')
        else:
            break
    ### Merge single results dfs to one big results dataframe
    resultDf = pd.concat(listOfDrugDfs)
    ### Print results
    print(resultDf)
    ### Or return
    #return resultDf

https://stackoverflow.com/questions/48152899
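The merge step at the end of drug_data can be exercised without any network access. A minimal sketch, assuming each subpage yielded one small DataFrame (the frames below are made-up stand-ins for listOfDrugDfs):

```python
import pandas as pd

# Stand-ins for the per-drug frames collected in listOfDrugDfs
df1 = pd.DataFrame({'Name': ['Drug A'], 'Type': ['Small Molecule']})
df2 = pd.DataFrame({'Name': ['Drug B'], 'Type': ['Biotech']})

# ignore_index=True renumbers rows 0..n-1 instead of repeating index 0
resultDf = pd.concat([df1, df2], ignore_index=True)
print(resultDf)
```

Without ignore_index, every per-drug frame contributes its own index 0, which makes positional lookups on the merged frame ambiguous.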