
Extracting multiple classes with Beautiful Soup for pandas data

Stack Overflow user
Asked 2017-08-14 01:38:31
1 answer, 1.2K views, 0 followers, 1 vote

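I want to get the following data into pandas:

Below is what I tried: I passed a list of classes, but that returned all of the content rather than the individual pieces I want. I am new to bs4.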

from bs4 import BeautifulSoup

html_doc = """
<div class="schoolinfo" data-attr-lat="33.7527" data-attr-lon="-84.3867" id="1396">
      <div class="schoolheader">
       <h3 class="schoolname">
        Georgia State University
       </h3>
      </div>
      <div class="schooldetails">
       <div class="schoollocation">
        <div class="citystate">
         Atlanta, Georgia
        </div>
       </div>
       <div class="programs">
        <div class="schoolprogram">
         <h4>
          <a href="http://cs.gsu.edu/graduate/doctor-philosophy/ph-d-bioinformatics-concentration-degree-requirements/" target="_blank">
           Ph.D. in Computer Science - Bioinformatics Concentration
          </a>
         </h4>
         <div class="cost-curric">
          <a class="btn btn-sm btn-default detailbutton" href="http://cs.gsu.edu/graduate/doctor-philosophy/ph-d-admission-requirements/" target="_blank">
           HOW TO APPLY
          </a>
          <a class="btn btn-sm btn-default detailbutton" href="https://catalog.gsu.edu/graduate20152016/computer-science/" target="_blank">
           CURRICULUM
          </a>
          <a class="btn btn-sm btn-default detailbutton" href="http://sfs.gsu.edu/tuition-fees/what-it-costs/tuition-and-fees/" target="_blank">
           COST
          </a>
         </div>
         <div class="programdetails">
          <div class="dept">
           <strong>
            OFFERED BY:
           </strong>
           Department of Computer Science
          </div>
          <div class="dept">
           <strong>
            DELIVERY:
           </strong>
           Campus
          </div>
          <div class="dept">
           <strong>
            LENGTH:
           </strong>
           48 Credits
          </div>
          <div class="dept">
           <strong>
            PRE-REQUISITE TECHNICAL COURSEWORK:
           </strong>
           technical bachelor's degree
          </div>
         </div>
        </div>
       </div>
      </div>
     </div>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify())

for i in soup.find_all(attrs={'class': ["schoolname", "citystate", "schoolprogram","dept"]}):
    print(i)

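This does not give me the tags I need: it returns all of the HTML content without any filtering, just as if I had passed a single class instead of a list of tags. Below is the output of find_all with multiple "class" values: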

<h3 class="schoolname">
            Georgia State University
           </h3>
<div class="citystate">
             Atlanta, Georgia
            </div>
<div class="schoolprogram">
<h4>
<a href="http://cs.gsu.edu/graduate/doctor-philosophy/ph-d-bioinformatics-concentration-degree-requirements/" target="_blank">
               Ph.D. in Computer Science - Bioinformatics Concentration
              </a>
</h4>
<div class="cost-curric">
<a class="btn btn-sm btn-default detailbutton" href="http://cs.gsu.edu/graduate/doctor-philosophy/ph-d-admission-requirements/" target="_blank">
               HOW TO APPLY
              </a>
<a class="btn btn-sm btn-default detailbutton" href="https://catalog.gsu.edu/graduate20152016/computer-science/" target="_blank">
               CURRICULUM
              </a>
<a class="btn btn-sm btn-default detailbutton" href="http://sfs.gsu.edu/tuition-fees/what-it-costs/tuition-and-fees/" target="_blank">
               COST
              </a>
</div>
<div class="programdetails">
<div class="dept">
<strong>
                OFFERED BY:
               </strong>
               Department of Computer Science
              </div>
<div class="dept">
<strong>
                DELIVERY:
               </strong>
               Campus
              </div>
<div class="dept">
<strong>
                LENGTH:
               </strong>
               48 Credits
              </div>
<div class="dept">
<strong>
                PRE-REQUISITE TECHNICAL COURSEWORK:
               </strong>
               technical bachelor's degree
              </div>
</div>
</div>
<div class="dept">
<strong>
                OFFERED BY:
               </strong>
               Department of Computer Science
              </div>
<div class="dept">
<strong>
                DELIVERY:
               </strong>
               Campus
              </div>
<div class="dept">
<strong>
                LENGTH:
               </strong>
               48 Credits
              </div>
<div class="dept">
<strong>
                PRE-REQUISITE TECHNICAL COURSEWORK:
               </strong>
               technical bachelor's degree
              </div>
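
The four `dept` blocks appear twice in this output because a list passed as the class value matches any tag carrying at least one of the listed classes. The `schoolprogram` div is printed with its nested `dept` children, and those children then match again on their own. The following is a minimal sketch of that behaviour, assuming a trimmed, hypothetical stand-in for `html_doc` (the `sample_html` string and the `example.com` link are not from the question):

```python
from bs4 import BeautifulSoup

# Trimmed, hypothetical stand-in for the question's html_doc.
sample_html = """
<div class="schoolinfo">
  <h3 class="schoolname"> Georgia State University </h3>
  <div class="citystate"> Atlanta, Georgia </div>
  <div class="schoolprogram">
    <h4><a href="http://example.com/program"> Ph.D. in Computer Science </a></h4>
    <div class="dept"><strong> OFFERED BY: </strong> Department of Computer Science </div>
  </div>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
wanted = ["schoolname", "citystate", "schoolprogram", "dept"]

# A list value matches tags that have ANY of the listed classes.
matches = soup.find_all(attrs={"class": wanted})
matched_classes = [tag["class"][0] for tag in matches]

# Matches that sit inside another match are the ones that show up twice
# when whole tags are printed.
nested = [tag for tag in matches if tag.find_parent(attrs={"class": wanted})]

# Print the matched class and the stripped text instead of the whole tag.
for tag in matches:
    print(tag["class"][0], "->", tag.get_text(" ", strip=True))
```

Printing the stripped text makes each match easier to read, but the text of `schoolprogram` still contains the text of its `dept` children, so a flat list of matches remains awkward to turn into columns.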

Code for the multiple case (the full page):

import requests
from bs4 import BeautifulSoup

pathP = "http://www.mastersindatascience.org/schools/doctorate/#on-campus" #text for multiple 

response = requests.get(pathP)
response.text[:100] # Access the HTML with the text property
soup = BeautifulSoup(response.text, "lxml")  # the "lxml" parser needs the lxml package installed

1 Answer

Stack Overflow user

Accepted answer

Answered 2017-08-14 02:15:02

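I would not use .find_all with a list of attributes here, because for the pieces of text you want to access, it is better to store each one specifically than to store all of them in order of appearance. So, assign each of them to its own variable: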

citystate = soup.find('div',{'class':'citystate'}).text.strip()
dept = soup.find('div',{'class':'dept'}).text.strip()
dept = dept[dept.index(':')+1:].strip()
link = soup.find('div',{'class':'schoolprogram'}).a['href']
schoolname = soup.find('h3',{'class':'schoolname'}).text.strip()
schoolprogram = soup.find('div',{'class':'schoolprogram'}).a.text.strip()

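As for the line dept = dept[dept.index(':')+1:].strip(), it makes dept hold the value you actually want instead of a string that starts with "OFFERED BY:". Also, .strip() is called throughout this code to get rid of the many \n characters.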
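
Note that `.find` returns only the first `dept` div, so `dept` above holds just the "OFFERED BY:" value. If the other details are wanted too, the same strip-and-split idea can be applied to every `dept` div. This is a minimal sketch, assuming a shortened, hypothetical copy of the `programdetails` markup and using `str.partition` in place of the `index` slice; the `details` name is an illustration, not part of the answer:

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical copy of the programdetails block from the question.
sample_html = """
<div class="programdetails">
  <div class="dept"><strong> OFFERED BY: </strong> Department of Computer Science </div>
  <div class="dept"><strong> DELIVERY: </strong> Campus </div>
  <div class="dept"><strong> LENGTH: </strong> 48 Credits </div>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

details = {}
for div in soup.find_all("div", {"class": "dept"}):
    text = div.text.strip()
    # Split once at the first colon: label on the left, value on the right.
    label, _, value = text.partition(":")
    details[label.strip()] = value.strip()

print(details)
```

Each label could then become its own DataFrame column, at the cost of assuming every `dept` div contains a colon.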

Now you can create your DataFrame with pandas:

import pandas as pd

df = pd.DataFrame(data = [[citystate, dept, link, schoolname, schoolprogram]],
                  columns = ['citystate', 'dept', 'link', 'schoolname', 'schoolprogram'])
>>> print(df.to_string())
          citystate                            dept                                               link                schoolname                                      schoolprogram
0  Atlanta, Georgia  Department of Computer Science  http://cs.gsu.edu/graduate/doctor-philosophy/p...  Georgia State University  Ph.D. in Computer Science - Bioinformatics Con...

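If you are dealing with many of these, you can simply replace every .find with .findAll (the older alias of .find_all) and collect the text in a list comprehension. For schoolprogram we would have: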

# .a keeps only the program title, matching the single-school version above
schoolprogram = [x.a.text.strip() for x in soup.findAll('div',{'class':'schoolprogram'})]
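
Separate lists per field only line up if every school has exactly one of each element. A possible alternative, sketched here under the assumption that each school sits in its own `schoolinfo` div as in the question's HTML, is to loop over those containers and build one row per school. The second school, the program names, and the `example.com` links in `sample_html` are made up for illustration:

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical two-school page in the same shape as the question's html_doc.
sample_html = """
<div class="schoolinfo">
  <h3 class="schoolname"> Georgia State University </h3>
  <div class="citystate"> Atlanta, Georgia </div>
  <div class="schoolprogram"><h4><a href="http://example.com/a"> Program A </a></h4></div>
</div>
<div class="schoolinfo">
  <h3 class="schoolname"> Example University </h3>
  <div class="citystate"> Example City, State </div>
  <div class="schoolprogram"><h4><a href="http://example.com/b"> Program B </a></h4></div>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

rows = []
for school in soup.find_all("div", {"class": "schoolinfo"}):
    # Search inside this school's container only, so fields stay together.
    program = school.find("div", {"class": "schoolprogram"})
    rows.append({
        "schoolname": school.find("h3", {"class": "schoolname"}).text.strip(),
        "citystate": school.find("div", {"class": "citystate"}).text.strip(),
        "schoolprogram": program.a.text.strip(),
        "link": program.a["href"],
    })

df = pd.DataFrame(rows)
print(df.to_string())
```

This sketch takes only the first program of each school; a school with several `schoolprogram` divs would need an inner loop over `school.find_all(...)`.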
5 votes
Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/45666408
