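I am working with the following link: https://www.bu.edu/phpbin/course-search/section/?t=casma124
I indexed the data to focus on Fall 2020. On the page you can see several numbers showing how many "Open Seats" there are. If you inspect the elements holding those numbers, you can see that they sit in a smaller cell underneath the main one. My Python code prints the following: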
Section Open Seats Instructor Type Location Schedule \
0 A1 NaN Enrique Jariwala LEC SCI B23 TR 11:00 am-12:15 pm
1 A1 NaN Enrique Jariwala NaN ROOM M 8:00 pm-9:45 pm
2 B1 NaN Enrique Jariwala LEC SCI B23 TR 5:00 pm-6:15 pm
3 B1 NaN Enrique Jariwala NaN ROOM M 8:00 pm-9:45 pm
4 D1 NaN Enrique Jariwala DIS PSY B39 W 11:15 am-12:05 pm
5 D2 NaN Enrique Jariwala DIS PSY B39 W 12:20 pm-1:10 pm
6 D3 NaN Enrique Jariwala DIS PSY B39 W 1:25 pm-2:15 pm
7 D4 NaN Enrique Jariwala DIS PSY B39 W 2:30 pm-3:20 pm
8 D5 NaN Enrique Jariwala DIS CAS 218 R 12:30 pm-1:20 pm
9 D6 NaN Enrique Jariwala DIS CGS 421 R 2:00 pm-2:50 pm
10 D7 NaN Enrique Jariwala DIS PRB 146 R 3:35 pm-4:25 pm
11 D8 NaN Enrique Jariwala DIS PRB 150 R 6:30 pm-7:20 pm
12 DX NaN Enrique Jariwala DIS NaN ARR 0: am
13 L1 NaN Enrique Jariwala LAB SCI 134 M 11:15 am-2:00 pm
14 L2 NaN Enrique Jariwala LAB SCI 134 T 6:30 pm-9:15 pm
15 L3 NaN Enrique Jariwala LAB SCI 134 W 8:00 am-10:45 am
16 L4 NaN Enrique Jariwala LAB SCI 134 W 11:15 am-2:00 pm
17 L5 NaN Enrique Jariwala LAB SCI 134 W 2:30 pm-5:15 pm
18 L6 NaN Enrique Jariwala LAB SCI 134 W 6:30 pm-9:15 pm
19 L7 NaN Enrique Jariwala LAB SCI 134 R 12:30 pm-3:15 pm
20 L8 NaN Enrique Jariwala LAB SCI 134 R 6:30 pm-9:15 pm
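21 LX NaN Enrique Jariwala LAB NaN ARR 0: am
As you can see, every Open Seats value shows up as NaN. Is there a function I can use to access the numbers? I want the number instead of NaN. Here is my code for context.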
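import pandas as pd

def init_dataframe():
    # wanted_class_url and course_input are not shown here
    # pd.read_html returns a list with one DataFrame per <table> on the page
    html_dataframe = pd.read_html(wanted_class_url(course_input))
    dataframe_concatenate = pd.concat(html_dataframe)
    dataframe_semester = html_dataframe[-1]
    dataframe_locate_class = dataframe_semester.loc[:, ]
    return dataframe_locate_class

Thank you for your help!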
Posted on 2020-04-20 08:43:32
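Interesting problem you have here: the reason your DataFrames show NaN instead of numbers is that the page is actually empty right after only the HTML has loaded. The values are filled in only once the script view-section.js has run (in your local browser). So, to get the same data from your script, you will have to fetch the same data the website does. Sketch:
Retrieve the open seats for each "section". Luckily, the endpoint openseats.php accepts an array of course codes like this: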
https://www.bu.edu/phpbin/summer/rpc/openseats.php?sections[]=2020SPRGCASMA124%20B7
(Apparently it returns the open seats for all courses, no matter which code you ask for. So, for now, a single query is enough.)
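A minimal sketch of that query, using only the Python standard library. The helper names (build_url, parse_open_seats, fetch_open_seats) are made up for illustration, and the sketch assumes the endpoint still answers as it did when this was written, which may no longer be the case.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

ENDPOINT = "https://www.bu.edu/phpbin/summer/rpc/openseats.php"


def build_url(section_codes):
    """Build the query URL, passing each code as one sections[] array entry."""
    query = "&".join("sections[]=" + quote(code) for code in section_codes)
    return ENDPOINT + "?" + query


def parse_open_seats(body):
    """Map each section code in the JSON body to its open-seat count as an int."""
    results = json.loads(body)["results"]
    return {code: int(count) for code, count in results.items()}


def fetch_open_seats(section_codes):
    """Send the HTTP request (needs network access) and parse the reply."""
    with urlopen(build_url(section_codes)) as response:
        return parse_open_seats(response.read().decode("utf-8"))
```

Calling fetch_open_seats(["2020SPRGCASMA124 B7"]) would send the same request as the URL above. Keeping the URL building and the parsing in separate functions means both can be checked without touching the network.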
The result is the following JSON object:
{"time_secs":0.20295810699463,"results":{"2020SPRGCASMA124 A1":"133","2020SPRGCASMA124 A2":"133","2020SPRGCASMA124 A3":"134","2020SPRGCASMA124 B1":"60","2020SPRGCASMA124 B2":"60","2020SPRGCASMA124 B3":"60","2020SPRGCASMA124 B4":"40","2020SPRGCASMA124 B5":"60","2020SPRGCASMA124 B6":"60","2020SPRGCASMA124 B7":"60","2020SPRGCASMA193 A1":"100","2020SPRGCASMA213 A1":"112","2020SPRGCASMA213 B1":"23","2020SPRGCASMA213 B2":"23","2020SPRGCASMA213 B3":"22","2020SPRGCASMA213 B4":"22","2020SPRGCASMA213 B5":"22","2020SPRGCASMA213 C1":"37","2020SPRGCASMA213 C2":"37","2020SPRGCASMA213 C3":"38"}}
Convert it to a DataFrame, and then all that is left is to .join(..) the two DataFrames. But wait: your original table does not contain the cryptic course codes. Unfortunately, they only appear in the data-section="..." attribute of certain table cells.
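The conversion step could look roughly like this. It is a sketch that works on an abbreviated copy of the JSON response; the function names open_seats_frame and only_course and the column name "Open Seats" are my own choices, not anything the endpoint prescribes.

```python
import json

import pandas as pd

# Abbreviated copy of the JSON response, kept inline so no network is needed.
SAMPLE_RESPONSE = (
    '{"time_secs":0.20295810699463,"results":{'
    '"2020SPRGCASMA124 A1":"133","2020SPRGCASMA124 B4":"40",'
    '"2020SPRGCASMA213 C3":"38"}}'
)


def open_seats_frame(payload):
    """Turn the "results" mapping into a DataFrame indexed by section code."""
    # The counts arrive as strings, so cast them to integers.
    seats = pd.Series(payload["results"], name="Open Seats").astype(int)
    seats.index.name = "section_code"
    return seats.to_frame()


def only_course(frame, course_prefix):
    """Keep only the rows whose section code belongs to one course."""
    mask = frame.index.str.startswith(course_prefix + " ")
    return frame.loc[mask]


frame = open_seats_frame(json.loads(SAMPLE_RESPONSE))
print(only_course(frame, "2020SPRGCASMA124"))
```

Since the endpoint seems to return several courses at once, filtering on the code prefix keeps just the sections of the course you asked about.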
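Very unfortunately, the best way to get at that information right now is to do the HTML parsing yourself. Entry point: from bs4 import BeautifulSoup (+ the many questions about it that exist here on SO).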
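To give an idea of the parsing and the join, here is a rough sketch. The sample markup is hypothetical: I have not reproduced the real page, only the fact that some table cells carry a data-section attribute. The helpers section_codes and attach_open_seats are illustrative names, and the sketch assumes exactly one data-section cell per table row, in row order.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page (an assumption).
SAMPLE_HTML = """
<table>
  <tr><td>A1</td><td data-section="2020SPRGCASMA124 A1"></td></tr>
  <tr><td>B1</td><td data-section="2020SPRGCASMA124 B1"></td></tr>
</table>
"""


def section_codes(html):
    """Return the data-section value of every cell that has one, in page order."""
    soup = BeautifulSoup(html, "html.parser")
    cells = soup.find_all("td", attrs={"data-section": True})
    return [cell["data-section"] for cell in cells]


def attach_open_seats(table, codes, results):
    """Label each row with its section code, then join the seat counts on it."""
    # Assumes len(codes) equals the number of rows in the table.
    labelled = table.drop(columns="Open Seats").assign(section_code=codes)
    seats = pd.Series(results, name="Open Seats").astype(int)
    return labelled.join(seats, on="section_code")


# Stand-ins for the table from pd.read_html and the "results" part of the JSON.
table = pd.DataFrame({"Section": ["A1", "B1"], "Open Seats": [float("nan")] * 2})
results = {"2020SPRGCASMA124 A1": "133", "2020SPRGCASMA124 B1": "60"}

merged = attach_open_seats(table, section_codes(SAMPLE_HTML), results)
print(merged)
```

In your real table some sections take up two rows (A1 and B1 each have a second meeting row), so the one-code-per-row assumption will probably need adjusting before the codes line up with the rows.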
I hope this gets you started.
https://stackoverflow.com/questions/61314017