I am trying to parse data stored as separate tables on the following website:
https://teams.technion.ac.il/residency-placements/
I would like to parse the data in the tables and store it as dataframes. I am trying to use Beautiful Soup. I have identified the id tags to parse, and found there are ten separate id tags (e.g. tab-1, tab-2, ..., tab-10). I want to write a function that walks the HTML and stores the text under each tab as a separate pandas dataframe. I am a beginner, so I don't really know what I am doing. Thanks!
Posted on 2019-12-19 05:28:27
I have found lxml very useful for walking HTML to parse/collect/store data. Look into looping with xpath; you can easily grab all the text and store it using:
from lxml import etree
root = etree.HTML(html)
textOut = root.xpath('//*[text()]')  # grab all text, OR
textOut = root.xpath('//*[text()[contains(., "stuff")]]')  # grab just the text you want
Given the site you mentioned, and since you are scraping tables, you can use xpath to direct the search to just the tables, like this:
textOut = root.xpath('//table[text()]')  # grab just the text from tables
Posted on 2019-12-19 09:23:53
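To get from the lxml approach above to the dataframes the asker wants, here is a minimal sketch. It uses an inline HTML snippet as a stand-in for the live page (the tab/table structure is an assumption), collecting each table's rows into its own pandas DataFrame:

```python
from lxml import etree
import pandas as pd

# Assumed stand-in for the page's tab tables; on the real site you would
# parse requests.get(url).text instead
html = """
<div id="tab-1"><table>
  <tr><th>Hospital/Location</th><th>Specialty</th></tr>
  <tr><td>Temple Univ Hosp-PA</td><td>Internal Medicine</td></tr>
</table></div>
<div id="tab-2"><table>
  <tr><th>Hospital/Location</th><th>Specialty</th></tr>
  <tr><td>Mayo Clinic-WI</td><td>Family Medicine</td></tr>
</table></div>
"""

root = etree.HTML(html)
frames = []
for table in root.xpath('//table'):
    # Each row becomes a list of cell texts (th and td alike)
    rows = [[cell.text for cell in tr.xpath('./th|./td')]
            for tr in table.xpath('.//tr')]
    # First row is the header, the rest is data
    frames.append(pd.DataFrame(rows[1:], columns=rows[0]))

print(frames[0])
```

One DataFrame per table ends up in `frames`, which matches the "ten tabs, ten dataframes" goal from the question.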
You can use this to scrape a table, and use whichtable to select which one, since there are several:
import csv
import requests
import pandas as pd
from bs4 import BeautifulSoup

def getContent(link, filename, whichtable=0):
    result1 = requests.get(link)
    src1 = result1.content
    soup = BeautifulSoup(src1, 'lxml')
    table = soup.find_all('table')[whichtable]
    with open(filename, 'w', newline='') as f:
        writer = csv.writer(f)
        for tr in table('tr'):
            row = [t.get_text(strip=True) for t in tr(['td', 'th'])]
            writer.writerow(row)

getContent('https://teams.technion.ac.il/residency-placements/', 'what.csv', whichtable=0)
df2 = pd.read_csv('what.csv')
Output:
Hospital/Location Specialty
0 Maimonides Med Ctr-NYMaimonides Med Ctr-NY Medicine-PreliminaryAnesthesiology
1 Jacobi Med Ctr/Einstein-NY Pediatrics
2 Jacobi Med Ctr/Einstein-NY Pediatrics
3 Temple Univ Hosp-PA Internal Medicine
4 Case Western/MetroHealth Med Ctr-OH Pediatrics
5 Sinai Hospital of Baltimore-MD Pediatrics
6 Baystate Med Ctr-MA Medicine
7 U Rochester/Strong Memorial-NY Anesthesiology
8 Johns Hopkins Hosp-MD Surgery-Preliminary
9 Westchester Medical Ctr-NY Psychiatry
10 Danbury Hospital-CT Medicine
11 Icahn SOM at Mount Sinai-NY Pediatrics
12 Emory Univ SOM-GAEmory Univ SOM-GA TransitionalDermatology
13 NYMC-Metropolitan Hosp Ctr-NYKingsbrook Jewish... Medicine-PreliminaryPhys Medicine & Rehab
14 Advocate Health Care-IL Pediatrics/ALGH
15 Mayo Clinic- WI Family Medicine
Posted on 2019-12-19 11:22:19
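Since each tab holds a plain HTML table, pandas can also parse them all in one call with read_html, skipping the CSV round-trip above. A sketch using an inline snippet (the live page's markup is assumed to look like this; on the real site you would pass requests.get(url).text):

```python
import io
import pandas as pd

# Assumed stand-in for the page's HTML
html = """
<table><tr><th>Hospital/Location</th><th>Specialty</th></tr>
<tr><td>Temple Univ Hosp-PA</td><td>Internal Medicine</td></tr></table>
<table><tr><th>Hospital/Location</th><th>Specialty</th></tr>
<tr><td>Mayo Clinic-WI</td><td>Family Medicine</td></tr></table>
"""

# read_html returns one DataFrame per <table> element it finds
dfs = pd.read_html(io.StringIO(html))
print(len(dfs))  # one DataFrame per table
```

Note that read_html needs an HTML parser such as lxml or html5lib installed.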
You can also use this to extract data from the tables. The rest you will need to handle yourself. First, install the libs with pip:
pip install requests
pip install simplified_scrapy
import requests
from simplified_scrapy.simplified_doc import SimplifiedDoc
html=requests.get('https://teams.technion.ac.il/residency-placements/').text
doc = SimplifiedDoc(html)
tables = doc.getElementsByClass('entry',start='listitems') # Use start to filter previous div
data = []
for table in tables:
    t = []  # data of one table
    rows = table.trs  # get all rows
    for row in rows:
        tds = row.children  # get all td and th
        t.append([td.text for td in tds])
    data.append(t)
print(data)
Source: https://stackoverflow.com/questions/59400075
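To finish the asker's goal from the nested data list this answer builds, each table's rows can be turned into its own DataFrame. A minimal sketch, assuming the first row of each table is the header:

```python
import pandas as pd

# Assumed shape of the scraped result: data[i] is one table,
# given as a list of rows with the header row first
data = [
    [['Hospital/Location', 'Specialty'],
     ['Temple Univ Hosp-PA', 'Internal Medicine'],
     ['Mayo Clinic-WI', 'Family Medicine']],
]

# One DataFrame per table: header row becomes the columns
frames = [pd.DataFrame(rows[1:], columns=rows[0]) for rows in data]
print(frames[0].shape)  # (2, 2)
```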