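I'm new to Python and BeautifulSoup, and I've been at this for hours...
First, I want to extract all the table data from the link below for tables with "general election" in the title:
https://en.wikipedia.org/wiki/Carlow%E2%80%93Kilkenny_(D%C3%A1il_constituency)
I do have another dataframe with the name of each table (e.g. "1961 general election", "1965 general election"), but I'm hoping to confirm whether a table is one I need just by searching it for "general election".
Then I want to get all the names that are in bold (which indicates they won), and finally I want another list of the "Count 1" (sometimes "Count") column in its original order, which I want to compare against the "bold" list. I haven't even looked at that part yet, because I haven't gotten past the first hurdle.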
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Carlow%E2%80%93Kilkenny_(D%C3%A1il_constituency)"
res = requests.get(url)
soup = BeautifulSoup(res.content, 'lxml')
my_tables = soup.find_all("table", {"class": "wikitable"})
for table in my_tables:
    rows = table.find_all('tr', text="general election")
    print(rows)

Any help on this would be greatly appreciated...
Posted on 2020-12-06 00:26:07
This page takes some finessing, but it can be done:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

req = requests.get('https://en.wikipedia.org/wiki/Carlow%E2%80%93Kilkenny_(D%C3%A1il_constituency)')
soup = bs(req.text, 'lxml')

# first - select all the tables on the page
tables = soup.select('table.wikitable')
for table in tables:
    ttr = table.select('tbody tr')
    # next, filter out any table that doesn't involve general elections
    if "general election" in ttr[0].text:
        # clean up the rows
        s_ttr = ttr[1].text.replace('\n', 'xxx').strip()
        # find and clean up column headings
        columns = [col.strip() for col in s_ttr.split('xxx') if len(col.strip()) > 0]
        rows = []  # initialize a list to house the table rows
        for c in ttr[2:]:
            # from here, start processing each row and loading it into the list
            row = [a.text.strip() if len(a.text.strip()) > 0 else 'NA' for a in c.select('td')]
            # drop the empty leading cell, if there is one
            if row and row[0] == "NA":
                row = row[1:]
            if len(row) > 0:
                rows.append(row)
        # load the whole thing into a dataframe
        df = pd.DataFrame(rows, columns=columns)
        print(df)

The output should be all of the general election tables on the page.
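As for the other parts of the question (the bold names and the "Count 1" column), here is a rough sketch of one way to approach them. It runs against a small made-up HTML sample rather than the live page, so `SAMPLE_HTML`, the candidate names, and the helper `summarize_general_elections` are all hypothetical, and the real Wikipedia markup has extra cells that will probably need different indexes. It also filters tables with a substring test on the title row: a plain string passed to `text=` in `find_all` is compared against the tag's whole string rather than searched for inside it, which is likely why the original loop printed empty lists.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the page. The real Wikipedia markup
# has extra cells (party colours, links), so indexes may need adjusting.
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th colspan="4">1961 general election: Carlow-Kilkenny</th></tr>
  <tr><th>Party</th><th>Candidate</th><th>FPv%</th><th>Count 1</th></tr>
  <tr><td>Party A</td><td><b>Alice Example</b></td><td>30.0</td><td>3,000</td></tr>
  <tr><td>Party B</td><td>Bob Example</td><td>25.0</td><td>2,500</td></tr>
  <tr><td>Party C</td><td><b>Carol Example</b></td><td>20.0</td><td>2,000</td></tr>
</table>
<table class="wikitable">
  <tr><th colspan="2">1960 by-election: Carlow-Kilkenny</th></tr>
  <tr><th>Candidate</th><th>Count</th></tr>
  <tr><td><b>Dan Example</b></td><td>9,000</td></tr>
</table>
"""


def summarize_general_elections(html, keyword="general election"):
    """Return one dict per matching table: title, bold names, first-count column."""
    soup = BeautifulSoup(html, "html.parser")
    summaries = []
    for table in soup.select("table.wikitable"):
        rows = table.find_all("tr")
        # Substring test on the title row, instead of an exact-match text= filter.
        if len(rows) < 2 or keyword not in rows[0].get_text():
            continue
        headers = [th.get_text(strip=True) for th in rows[1].find_all("th")]
        # Accept either "Count 1" or plain "Count" as the first-count column.
        count_idx = next(
            (i for i, h in enumerate(headers) if h.startswith("Count")), None
        )
        winners, first_counts = [], []
        for tr in rows[2:]:
            cells = tr.find_all("td")
            if len(cells) != len(headers):
                continue  # skip summary or malformed rows
            # Bold text marks a winner; rows stay in page order.
            winners.extend(b.get_text(strip=True) for b in tr.find_all("b"))
            if count_idx is not None:
                first_counts.append(cells[count_idx].get_text(strip=True))
        summaries.append(
            {
                "title": rows[0].get_text(strip=True),
                "winners": winners,
                "first_counts": first_counts,
            }
        )
    return summaries


for summary in summarize_general_elections(SAMPLE_HTML):
    print(summary)
```

To try it on the real page, pass `req.text` from the code above in place of `SAMPLE_HTML` and check the cell counts against what the page actually contains.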
https://stackoverflow.com/questions/65155732