Extracting data from a web page using BS4 in Python

Stack Overflow user

Asked 2014-06-07 11:51:03

1 answer, 159 views, 0 followers, 0 votes

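I am trying to extract data from this site: http://www.afl.com.au/fixture

What I want is a dictionary with each date as a key and a list of the "preview" links for that date as the value, like this:

dict = {"Saturday, June 07": ["preview url-1", "preview url-2", "preview url-3", "preview url-4"]}

Please help me get there. This is the code I have used so far: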

def extractData():
    # Relies on module-level names: table_for_players (the parsed fixture
    # table), ldateList (a list) and gDict (a dict).
    global gDict
    lDateInfoMatchCase = False

    # First pass: collect the text of every date header row.
    for row in table_for_players.findAll("tr"):
        for lDateRowIndex in row.findAll("th", {"colspan": "4"}):
            ldateList.append(lDateRowIndex.text)

    print(ldateList)
    for index in ldateList:
        lPreviewLinkList = []
        for row in table_for_players.findAll("tr"):
            # A date header row switches matching on or off for the rows below it.
            for lDateRowIndex in row.findAll("th", {"colspan": "4"}):
                if lDateRowIndex.text == index:
                    lDateInfoMatchCase = True
                else:
                    lDateInfoMatchCase = False

            # Collect preview links only from rows under the matching date.
            if lDateInfoMatchCase:
                for lInfoRowIndex in row.findAll("td", {"class": "info"}):
                    for link in lInfoRowIndex.findAll("a", {"class": "preview"}):
                        lPreviewLinkList.append("http://www.afl.com.au/" + link.get('href'))
        print(lPreviewLinkList)
        gDict[index] = lPreviewLinkList

My main goal is to use the data in this structure to get the names of all players playing for the home and away teams.

python

beautifulsoup

1 Answer

Stack Overflow user

Accepted answer

Answered 2014-06-07 12:06:55

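I would lean on CSS selectors more. Select the first table, then all the rows in its tbody for ease of processing; the rows are "grouped" by the tr th header rows. From each header row you can walk the following sibling rows that do not contain a th header and scan those for preview links: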

previews = {}

table = soup.select('table.fixture')[0]
for group_header in table.select('tbody tr th'):
    date = group_header.string
    for next_sibling in group_header.parent.find_next_siblings('tr'):
        if next_sibling.th:
            # found a next group, end scan
            break
        for preview in next_sibling.select('a.preview'):
            previews.setdefault(date, []).append(
                "http://www.afl.com.au" + preview.get('href'))

This produces a dictionary of lists; for the version of the page current at the time of writing, it produces:

{u'Monday, June 09': ['http://www.afl.com.au/match-centre/2014/12/melb-v-coll'],
 u'Sunday, June 08': ['http://www.afl.com.au/match-centre/2014/12/gcfc-v-syd',
                      'http://www.afl.com.au/match-centre/2014/12/fre-v-adel',
                      'http://www.afl.com.au/match-centre/2014/12/nmfc-v-rich']}
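
Since the live page may have changed since then, here is a self-contained sketch of the same sibling scan that runs against inline markup instead. The `SAMPLE_HTML` string is hypothetical and only imitates the layout described above (a `table.fixture` whose `tbody` mixes date header rows with match rows); `extract_previews` and `BASE_URL` are illustrative names, not part of the original answer. It also swaps the string concatenation for `urllib.parse.urljoin`, which avoids a doubled slash when the base URL ends in `/`, and uses Python 3 with the built-in `html.parser`.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "http://www.afl.com.au"  # assumed site root for relative links

# Hypothetical markup that imitates the fixture table's layout.
SAMPLE_HTML = """
<table class="fixture">
  <tbody>
    <tr><th colspan="4">Sunday, June 08</th></tr>
    <tr><td class="info"><a class="preview" href="/match-centre/2014/12/gcfc-v-syd">Preview</a></td></tr>
    <tr><td class="info"><a class="preview" href="/match-centre/2014/12/fre-v-adel">Preview</a></td></tr>
    <tr><th colspan="4">Monday, June 09</th></tr>
    <tr><td class="info"><a class="preview" href="/match-centre/2014/12/melb-v-coll">Preview</a></td></tr>
  </tbody>
</table>
"""


def extract_previews(html, base_url=BASE_URL):
    """Map each date header to the list of preview URLs listed under it."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.select("table.fixture")[0]
    previews = {}
    for group_header in table.select("tbody tr th"):
        date = group_header.get_text(strip=True)
        # Walk the rows after this header until the next header row.
        for row in group_header.parent.find_next_siblings("tr"):
            if row.th:
                break  # reached the next date group
            for link in row.select("a.preview"):
                previews.setdefault(date, []).append(
                    urljoin(base_url, link.get("href")))
    return previews


previews = extract_previews(SAMPLE_HTML)
for date, links in previews.items():
    print(date, links)
```

To run it against the real site you would pass the downloaded page source to `extract_previews` instead of `SAMPLE_HTML`, provided the page still uses the same class names.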

Votes: 0

Original page content provided by Stack Overflow.

Original link: https://stackoverflow.com/questions/24096849
