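I want to get the following data into pandas:

Here is what I tried. I attempted to select the content by class, but I get everything back instead of the individual pieces I want. I am new to bs4.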
html_doc = """
<div class="schoolinfo" data-attr-lat="33.7527" data-attr-lon="-84.3867" id="1396">
<div class="schoolheader">
<h3 class="schoolname">
Georgia State University
</h3>
</div>
<div class="schooldetails">
<div class="schoollocation">
<div class="citystate">
Atlanta, Georgia
</div>
</div>
<div class="programs">
<div class="schoolprogram">
<h4>
<a href="http://cs.gsu.edu/graduate/doctor-philosophy/ph-d-bioinformatics-concentration-degree-requirements/" target="_blank">
Ph.D. in Computer Science - Bioinformatics Concentration
</a>
</h4>
<div class="cost-curric">
<a class="btn btn-sm btn-default detailbutton" href="http://cs.gsu.edu/graduate/doctor-philosophy/ph-d-admission-requirements/" target="_blank">
HOW TO APPLY
</a>
<a class="btn btn-sm btn-default detailbutton" href="https://catalog.gsu.edu/graduate20152016/computer-science/" target="_blank">
CURRICULUM
</a>
<a class="btn btn-sm btn-default detailbutton" href="http://sfs.gsu.edu/tuition-fees/what-it-costs/tuition-and-fees/" target="_blank">
COST
</a>
</div>
<div class="programdetails">
<div class="dept">
<strong>
OFFERED BY:
</strong>
Department of Computer Science
</div>
<div class="dept">
<strong>
DELIVERY:
</strong>
Campus
</div>
<div class="dept">
<strong>
LENGTH:
</strong>
48 Credits
</div>
<div class="dept">
<strong>
PRE-REQUISITE TECHNICAL COURSEWORK:
</strong>
technical bachelor's degree
</div>
</div>
</div>
</div>
</div>
</div>
"""
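from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())
for i in soup.find_all(attrs={'class': ["schoolname", "citystate", "schoolprogram", "dept"]}):
    print(i)

This does not give me just the tags I need. If I pass a single class instead of a list, it returns all of the HTML content without any filtering. Below is the output of find_all with multiple "class" values: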
<h3 class="schoolname">
Georgia State University
</h3>
<div class="citystate">
Atlanta, Georgia
</div>
<div class="schoolprogram">
<h4>
<a href="http://cs.gsu.edu/graduate/doctor-philosophy/ph-d-bioinformatics-concentration-degree-requirements/" target="_blank">
Ph.D. in Computer Science - Bioinformatics Concentration
</a>
</h4>
<div class="cost-curric">
<a class="btn btn-sm btn-default detailbutton" href="http://cs.gsu.edu/graduate/doctor-philosophy/ph-d-admission-requirements/" target="_blank">
HOW TO APPLY
</a>
<a class="btn btn-sm btn-default detailbutton" href="https://catalog.gsu.edu/graduate20152016/computer-science/" target="_blank">
CURRICULUM
</a>
<a class="btn btn-sm btn-default detailbutton" href="http://sfs.gsu.edu/tuition-fees/what-it-costs/tuition-and-fees/" target="_blank">
COST
</a>
</div>
<div class="programdetails">
<div class="dept">
<strong>
OFFERED BY:
</strong>
Department of Computer Science
</div>
<div class="dept">
<strong>
DELIVERY:
</strong>
Campus
</div>
<div class="dept">
<strong>
LENGTH:
</strong>
48 Credits
</div>
<div class="dept">
<strong>
PRE-REQUISITE TECHNICAL COURSEWORK:
</strong>
technical bachelor's degree
</div>
</div>
</div>
<div class="dept">
<strong>
OFFERED BY:
</strong>
Department of Computer Science
</div>
<div class="dept">
<strong>
DELIVERY:
</strong>
Campus
</div>
<div class="dept">
<strong>
LENGTH:
</strong>
48 Credits
</div>
<div class="dept">
<strong>
PRE-REQUISITE TECHNICAL COURSEWORK:
</strong>
technical bachelor's degree
</div>

Code for the page with multiple schools:
import requests

pathP = "http://www.mastersindatascience.org/schools/doctorate/#on-campus"  # page listing multiple schools
response = requests.get(pathP)
response.text[:100]  # Access the HTML with the text property
soup = BeautifulSoup(response.text, "lxml")  # the "lxml" parser requires the lxml package

Answered 2017-08-14 02:15:02
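I would not use .find_all with a list of attributes here. For the pieces of text you want, it is better to store each one specifically, in the order it appears, rather than storing the full contents of every matching tag. So assign each of them to its own variable: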
citystate = soup.find('div',{'class':'citystate'}).text.strip()
dept = soup.find('div',{'class':'dept'}).text.strip()
dept = dept[dept.index(':')+1:].strip()
link = soup.find('div',{'class':'schoolprogram'}).a['href']
schoolname = soup.find('h3',{'class':'schoolname'}).text.strip()
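schoolprogram = soup.find('div',{'class':'schoolprogram'}).a.text.strip()

About the line dept = dept[dept.index(':')+1:].strip(): it makes dept hold the value you actually want, rather than a string that starts with "OFFERED BY:". Also, .strip() is called throughout this code to get rid of the many \n characters.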
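The slicing step can also be written with str.partition, which may be a little more forgiving: dept.index(':') raises ValueError when the text has no colon, while partition simply returns an empty separator. The sketch below is only an illustration; split_label is a hypothetical helper name, and raw is a hand-written sample of what .text would return for one of the "dept" divs in the question.

```python
def split_label(text):
    """Split 'LABEL: value' text into (label, value), trimming whitespace."""
    label, sep, value = text.partition(':')
    if not sep:
        # No colon found: treat the whole text as the value.
        return None, text.strip()
    return label.strip(), value.strip()


# Hand-written sample of the .text of one "dept" div (an assumption, not scraped live).
raw = "\n\nOFFERED BY:\n\nDepartment of Computer Science\n"

label, value = split_label(raw)
print(label, '->', value)
```

Returning the label as well as the value makes it possible to tell "OFFERED BY", "DELIVERY" and "LENGTH" apart when all of the "dept" divs are processed, instead of only the first one.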
Now you can create your DataFrame with pandas:
import pandas as pd

df = pd.DataFrame(data = [[citystate, dept, link, schoolname, schoolprogram]],
                  columns = ['citystate', 'dept', 'link', 'schoolname', 'schoolprogram'])

>>> print(df.to_string())
          citystate                            dept                                               link                schoolname                                      schoolprogram
0  Atlanta, Georgia  Department of Computer Science  http://cs.gsu.edu/graduate/doctor-philosophy/p...  Georgia State University  Ph.D. in Computer Science - Bioinformatics Con...

If you are dealing with many of these, you only need to replace every .find with .findAll and put the text in a list comprehension. For schoolprogram, we would have:
# .a keeps only the program title, matching the single-record version above
schoolprogram = [x.a.text.strip() for x in soup.findAll('div',{'class':'schoolprogram'})]

https://stackoverflow.com/questions/45666408
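One caveat with building a separate list per field is that the lists can fall out of step when a school lists more than one program or lacks a field. A possible alternative, sketched below under stated assumptions, is to loop over each "schoolinfo" container and collect one row per program. The HTML here is a trimmed-down sample: the first record reuses values from the question, while "Example University" and its example.com links are invented purely for illustration, and the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Trimmed sample markup; the second school is hypothetical.
html = """
<div class="schoolinfo">
<h3 class="schoolname">Georgia State University</h3>
<div class="citystate">Atlanta, Georgia</div>
<div class="schoolprogram">
<h4><a href="http://cs.gsu.edu/graduate/doctor-philosophy/ph-d-bioinformatics-concentration-degree-requirements/">
Ph.D. in Computer Science - Bioinformatics Concentration</a></h4>
<div class="dept"><strong>OFFERED BY:</strong> Department of Computer Science</div>
</div>
</div>
<div class="schoolinfo">
<h3 class="schoolname">Example University</h3>
<div class="citystate">Sample City, State</div>
<div class="schoolprogram"><h4><a href="http://example.com/b">Program B</a></h4></div>
<div class="schoolprogram"><h4><a href="http://example.com/c">Program C</a></h4></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
for school in soup.find_all("div", class_="schoolinfo"):
    # Fields shared by every program of this school.
    name = school.find("h3", class_="schoolname").get_text(strip=True)
    city = school.find("div", class_="citystate").get_text(strip=True)
    for program in school.find_all("div", class_="schoolprogram"):
        anchor = program.find("a")
        dept_tag = program.find("div", class_="dept")  # may be missing
        dept = None
        if dept_tag is not None:
            # Keep only the text after "OFFERED BY:".
            dept = dept_tag.get_text(" ", strip=True).partition(":")[2].strip()
        rows.append({
            "citystate": city,
            "dept": dept,
            "link": anchor["href"],
            "schoolname": name,
            "schoolprogram": anchor.get_text(strip=True),
        })

for row in rows:
    print(row)
```

A list of dictionaries like rows can be passed directly to pd.DataFrame(rows), so every program ends up on its own row with the school name and city repeated.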