I am trying to scrape the contents of the table attached to the different participants on a web page. To aid understanding, the information I want is marked in the image. At the moment my script only gives me the names of the participants; I also want the information attached to each participant.
https://www.bet365.com.au/#/AC/B151/C1/D50/E2/F163/
Since the content is dynamic, I have to use one of the public API endpoints that can be found with the browser dev tools.
https://filebin.net/ybwver7vt5mp1dju shows how the information is displayed on that page. The struck-through entries are what I want to grab.
https://pastebin.com/QsA3Pprr shows what the API response looks like.
What I have tried:
import re
import requests

url = 'https://www.bet365.com.au/SportsBook.API/web?'
params = {
    'lid': '30',
    'zid': '0',
    'pd': '#AC#B151#C1#D50#E2#F163#',
    'cid': '13',
    'ctid': '13'
}
r = requests.get(url, params=params, headers={'User-Agent': 'Mozilla/5.0'})
games = re.finditer(r'NA=(.*?);', r.text)
for game in games:
    if 'v' not in game.group():
        continue
    print(game.group(1))

The output I get looks like this (partial):
FunPlus Phoenix v Bilibili Gaming
Top Esports v Royal Never Give Up
Moops v Brute
eSuba v eXtatus
CS:GO - V4 Future Sports Festival
PACT v Capri Sun

The output I would like is something like this (partial):
26:42 FunPlus Phoenix v Bilibili Gaming 1-1 - - 21
09:00 Top Esports v Royal Never Give Up - 2.00 1.72 49
12:00 Moops v Brute - 2.10 1.66 17

How can I scrape the contents of the table attached to the different participants?
PS: the information visible here may not be identical, since it is refreshed every few minutes. I would like to do this with requests, along the lines of what I have already tried.
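For reference, the raw response is a flat string of `key=value;` fields. This sketch shows the kind of pairing I am after; the sample string is made up, shaped after the `NA=` and `OD=` fields visible in the pastebin excerpt:

```python
import re

# Hypothetical sample: fields end in ';', records are separated by '|'
sample = ('NA=FunPlus Phoenix v Bilibili Gaming;OD=8/11;|'
          'NA=Top Esports v Royal Never Give Up;OD=1/1;|')

# Pair each game name with the fractional odds that follow it
pairs = re.findall(r'NA=([^;]+);OD=(\d+/\d+);', sample)
print(pairs)
# [('FunPlus Phoenix v Bilibili Gaming', '8/11'),
#  ('Top Esports v Royal Never Give Up', '1/1')]
```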
Posted on 2019-09-03 16:12:41
I made the code for your first question related to this site. While the other two answers use Selenium, that is unnecessary, because the site's API endpoint can be used to find the games. This approach should be faster than Selenium. I was again able to parse the other information with regular expressions. However, on the live site I could not find anything like the '1-1' from your expected output. Hope this helps. The times may be off; I am not sure about them.
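The odds in the response are fractional (e.g. '8/11'), while the site displays decimals, so the code converts them by adding the stake and truncating to two decimals. The same conversion in isolation, using `fractions.Fraction` (a minimal sketch):

```python
from fractions import Fraction

def frac_to_decimal(odds):
    """Convert fractional odds like '8/11' to decimal odds
    (stake included), truncated to two decimals as on the site."""
    return int((Fraction(odds) + 1) * 100) / 100

print(frac_to_decimal('8/11'))  # 1.72
print(frac_to_decimal('1/1'))   # 2.0
```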
Code
import re
from datetime import datetime, timedelta

import pandas as pd
import requests

url = 'https://www.bet365.com.au/SportsBook.API/web?'
params = {
    'lid': '30',
    'zid': '0',
    'pd': '#AC#B151#C1#D50#E2#F163#',
    'cid': '13',
    'ctid': '13'
}
r = requests.get(url, params=params, headers={'User-Agent': 'Mozilla/5.0'})

# Game names look like "Team A v Team B"
games = re.finditer(r'NA=([\w\s\-._]+? v [\w\s\-._]+?);', r.text)
col_games = [game.group(1) for game in games]

def parse_prices(column):
    """Collect the fractional odds under the NA=<column> header and
    convert them to decimal odds, truncated to two decimals."""
    col = []
    for text in re.finditer(r'NA=%s;.*?((?:OD=\d+/\d+;(?:.*?))+?)NA=' % column, r.text):
        for segment in text.group(1).split('|'):
            price = re.search(r'OD=(\d+/\d+);', segment)
            if price:
                num, den = map(int, price.group(1).split('/'))
                col.append(int((num / den + 1) * 100) / 100)
    return col

col_1 = parse_prices('1')
col_2 = parse_prices('2')

# BC= holds a packed timestamp (YYYYMMDDHHMMSS); drop the seconds
times = re.finditer(r'BC=(\d+);', r.text)
col_times = []
for time in times:
    datetime_time = datetime.strptime(time.group(1)[:-2], '%Y%m%d%H%M')
    datetime_time = datetime_time + timedelta(hours=-1)
    col_times.append(datetime_time.time())

df = pd.DataFrame({'Time': col_times, 'Games': col_games, '1': col_1, '2': col_2})
print(df)

Output
Time Games 1 2
0 19:00:00 DETONA v Falkol 1.25 3.75
1 19:00:00 paiN Gaming v Keyd 1.53 2.37
2 19:00:00 W7M v Bulldozer 1.22 4.00
3 03:00:00 VP Game v Team WE Academy 2.62 1.44
4 05:00:00 Invictus Gaming Young v Top Esports Challenger 1.22 4.00
5 07:00:00 Vici Gaming Potential v FunPlus Phoenix Blaze 1.36 3.00
6 09:00:00 Edward Gaming Youth v Bilibili Gaming Junior 2.00 1.72
7 09:00:00 Gama Dream v LinGan e-Sports 1.80 1.90
8 03:00:00 Royal Club v Suning Gaming-S 1.66 2.10
9 05:00:00 Joy Dream v Oh My Dream 2.37 1.53
10 07:00:00 LNG Academy v Bilibili Gaming Junior 3.25 1.33
11 07:00:00 TS Gaming v Victorious Gaming 1.72 2.00
12 09:00:00 D7G Esports Club v Legend Esport Gaming 3.75 1.25
13 09:00:00 Dominus Esports.Y v Rogue Warriors Shark 2.50 1.50
14 05:00:00 Team WE Academy v Vici Gaming Potential 3.25 1.33
15 07:00:00 87 v Gama Dream 2.00 1.72
16 07:00:00 Invictus Gaming Young v LNG Academy 1.16 4.50
17 09:00:00 FunPlus Phoenix Blaze v VP Game 1.50 2.50
18 09:00:00 Scorpio Game v Young Miracles 3.40 1.30
19 09:00:00 Top Esports v Bilibili Gaming 1.53 2.37
20 08:00:00 FunPlus Phoenix v Royal Never Give Up 1.57 2.25
21 09:30:00 Maru v Solar 1.40 2.75
22 10:15:00 Stats v Rogue 1.57 2.25
23 04:00:00 Classic v RagnaroK 1.22 4.00
24 04:45:00 Dear v Zest 2.62 1.44
25 08:00:00 SANDBOX Gaming v KINGZONE DragonX 1.66 2.10
26 13:00:00 ENCE v Renegades 1.25 3.75
27 16:30:00 Team Vitality v AVANGAR 1.22 4.00
28 13:00:00 NRG v Natus Vincere 1.66 2.10
29 16:30:00 Astralis v Team Liquid 2.00 1.72
30 23:00:00 Vancouver Titans v Seoul Dynasty 1.33 3.25
31 02:00:00 Hangzhou Spark v Los Angeles Gladiators 1.72 2.00
32 08:00:00 MAD Team v G-Rex 1.53 2.37
33 08:00:00 Flash Wolves v Hong Kong Attitude 3.25 1.33
34 19:00:00 Clutch Gaming v FlyQuest 1.25 3.75
35 16:00:00 Flamengo v INTZ 1.16 4.50
36 16:00:00 Fnatic v Schalke 04 1.20 4.33
37 16:00:00 Origen v Splyce 3.50 1.28
38 09:00:00 GAM Esports v Team Flash 1.25 3.75

Posted on 2019-08-31 15:39:14
You can use selenium:
from selenium import webdriver
from bs4 import BeautifulSoup as soup

d = webdriver.Chrome('/path/to/chromedriver')
d.get('https://www.bet365.com.au/#/AC/B151/C1/D50/E2/F163/')

def scrape_block(b):
    p = {'date': b.find('div', {'class': 'gll-MarketColumnHeader sl-MarketHeaderLabel sl-MarketHeaderLabel_Date '}).text}
    # Pre-match blocks use one participant class, in-play blocks another
    c1 = b.find_all('div', {'class': 'sl-CouponParticipantWithBookCloses sl-CouponParticipantWithBookCloses_NoAdditionalMarkets sl-CouponParticipantIPPGBase '})
    c2 = b.find_all('div', {'class': 'sl-CouponParticipantWithBookCloses sl-CouponParticipantWithBookCloses_NoAdditionalMarkets sl-CouponParticipantIPPGBase sl-CouponParticipantWithBookCloses_ClockPaddingLeft '})
    if c1:
        pl = [[i.find('div', {'class': 'sl-CouponParticipantWithBookCloses_BookCloses '}).text,
               i.find('div', {'class': 'sl-CouponParticipantWithBookCloses_Name '}).text] for i in c1]
    else:
        pl = [[i.find('div', {'class': 'pi-CouponParticipantClockInPlay '}).text,
               i.find('div', {'class': 'sl-CouponParticipantWithBookCloses_Name '}).text,
               i.find('div', {'class': 'pi-ScoreVariantDefault '}).text] for i in c2]
    odds1, odds2 = [[i.text for i in c.find_all('div', {'class': 'gll-ParticipantOddsOnlyDarker gll-Participant_General gll-ParticipantOddsOnly '})]
                    for c in b.find_all('div', {'class': 'sl-MarketCouponValuesExplicit2 gll-Market_General gll-Market_PWidth-15-4 '})]
    return {**p, 'data': [{'player': a, 1: b, 2: c}
                          for a, b, c in zip(pl, [None] if not odds1 else odds1, [None] if not odds2 else odds2)]}

new_d = list(map(scrape_block, soup(d.page_source, 'html.parser').find_all('div', {'class': 'gll-MarketGroupContainer gll-MarketGroupContainer_HasLabels '})))
final_result = list(filter(lambda x: bool(x['data']), new_d))

Output:
[{'date': 'Sat 31 Aug', 'data': [{'player': ['22:42', 'Royal Youth v SuperMassive', '1-2'], 1: None, 2: None}]}, {'date': 'Sun 01 Sep', 'data': [{'player': ['11:56', 'G2 Esports v Fnatic', '0-0'], 1: None, 2: None}]}, {'date': 'Sun 01 Sep', 'data': [{'player': ['01:20', 'Hjarnan (G2) v h$hjukken'], 1: '1.10', 2: '1.10'}]}, {'date': 'Sun 01 Sep', 'data': [{'player': ['02:00', 'Thijs v Kolento'], 1: '1.83', 2: '1.83'}, {'player': ['03:00', 'Orange v Hunterace'], 1: '2.25', 2: '1.57'}, {'player': ['04:00', 'Gallon v StrifeCro'], 1: '2.00', 2: '1.72'}, {'player': ['04:00', 'Rdu v SilverName'], 1: '2.00', 2: '1.72'}, {'player': ['05:00', 'Monsanto v PNC'], 1: '1.61', 2: '2.20'}, {'player': ['06:00', 'bloodyface v Amnesiac'], 1: '1.80', 2: '1.90'}, {'player': ['07:00', 'Eddie v Purple'], 1: '1.80', 2: '1.90'}, {'player': ['08:00', 'muzzy v Firebat'], 1: '1.72', 2: '2.00'}, {'player': ['09:00', 'ETC v Nalguidan'], 1: '2.10', 2: '1.66'}]}, {'date': 'Sun 01 Sep', 'data': [{'player': ['12:00', 'Mindfreak v ORDER'], 1: '1.53', 2: '2.37'}]}, {'date': 'Sun 01 Sep', 'data': [{'player': ['15:00', 'LinGan e-Sports v Bilibili Gaming Junior'], 1: '1.66', 2: '2.10'}, {'player': ['17:00', 'Scorpio Game v Suning Gaming-S'], 1: '3.00', 2: '1.36'}, {'player': ['17:00', 'Victorious Gaming v FunPlus Phoenix Blaze'], 1: '3.00', 2: '1.36'}, {'player': ['19:00', '87 v Top Esports Challenger'], 1: '1.66', 2: '2.10'}, {'player': ['19:00', 'Rogue Warriors Shark v Legend Esport Gaming'], 1: '2.62', 2: '1.44'}]}]

Posted on 2019-09-01 17:05:10
If you want to use the JS API, you need to figure out how to decode the site's output and how the JS renders what we can see on the actual page. I don't think that is an easy task. That is why I suggest loading the site in a browser tab with Selenium and then feeding the final HTML to BeautifulSoup, which reduces the complexity of working out what to extract from the site.
Below is an example of how to scrape the tournaments, dates and matches using Chrome in headless mode.
PS: the cookie part is not strictly necessary, but it helps the page we are trying to scrape load automatically.
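Stripped of Selenium, the cookie persistence amounts to pickling the driver's cookie list between runs. A minimal sketch (the cookie dict and file path are made up; the shape mimics what `get_cookies()` returns):

```python
import os
import pickle
import tempfile

# A made-up cookie, shaped like selenium's get_cookies() output
cookies = [{'name': 'session', 'value': 'abc', 'expiry': 1600000000}]
path = os.path.join(tempfile.mkdtemp(), 'cookies.pkl')

# dump_cookies: persist the list between runs
with open(path, 'wb') as f:
    pickle.dump(cookies, f)

# setup_cookies: reload, dropping 'expiry' (Chrome can reject a
# cookie re-added via add_cookie when that key is present)
with open(path, 'rb') as f:
    restored = pickle.load(f)
for cookie in restored:
    cookie.pop('expiry', None)
print(restored)  # [{'name': 'session', 'value': 'abc'}]
```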
First you need to install: pip install webdriver-manager, then:
import pickle
import time
from collections import defaultdict
from pprint import pprint

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup as bs

CHROME_OPTIONS = Options()
CHROME_OPTIONS.add_argument("--headless")

class Bet365:
    DRIVER = webdriver.Chrome(ChromeDriverManager().install(), options=CHROME_OPTIONS)
    DUMMY_URL = 'https://www.bet365.com'
    URL = 'https://www.bet365.com/#/AC/B1/C1/D13/E37628398/F2/:/AC/B1/C1/D13/E42294995/F2/:/AC/B1/C1/D13/E42535433/F2/'
    COOKIES_FILE = 'cookies.pkl'

    def __init__(self):
        self.DRIVER.get(self.DUMMY_URL)
        # Comment the next line if the cookies file is not set
        self.setup_cookies()
        self.DRIVER.get(self.URL)
        # self.DRIVER.maximize_window()
        # Wait for JS to populate the page
        time.sleep(15)
        self.source = self.DRIVER.page_source
        # Store new cookies for the next run
        self.dump_cookies()

    def dump_cookies(self):
        """Store cookies"""
        pickle.dump(self.DRIVER.get_cookies(), open(self.COOKIES_FILE, "wb"))

    def setup_cookies(self):
        """Add cookies"""
        cookies = pickle.load(open(self.COOKIES_FILE, "rb"))
        for cookie in cookies:
            if 'expiry' in cookie:
                del cookie['expiry']
            self.DRIVER.add_cookie(cookie)

    def get_source(self):
        """Get page HTML source"""
        return bs(self.source, "html.parser")

    def is_last_child(self, event):
        """Extract an event's data and flag the last child of its block"""
        out = {}
        out['last_child'] = 'sl-MarketCouponAdvancedBase_LastChild' in event['class']
        event_date = event.find('div', {'class': 'sl-CouponParticipantWithBookCloses_BookCloses'})
        out['date'] = event_date.get_text() if event_date else 'None'
        teams = event.findAll('div', {'class': 'sl-CouponParticipantWithBookCloses_Name'})
        if len(teams) > 1:
            out['teams'] = ' v '.join(k.text for k in teams)
        elif len(teams) == 1:
            out['teams'] = teams[0].text
        else:
            out['teams'] = 'None'
        return out

    def get_events(self, data):
        """Return all events"""
        dates, teams = [], []
        for event in data.findAll('div', {'class': 'sl-MarketCouponFixtureLabelBase gll-Market_General gll-Market_HasLabels'}):
            dates = [elm.text for elm in event.find_all('div', {'class': lambda x: all(k in x for k in 'gll-MarketColumnHeader sl-MarketHeaderLabel sl-MarketHeaderLabel_Date'.split())})]
            teams_events = event.findAll("div", {'class': lambda x: x and x.startswith("sl-CouponParticipantWithBookCloses sl-CouponParticipantIPPGBase")})
            teams = [self.is_last_child(elm) for elm in teams_events]
        if len(dates) == 1:
            if teams:
                teams[-1]['last_child'] = True
        return dates, teams

    def pretty_print_events(self, dates, teams):
        """Group events by date"""
        def groupby_last_child(data):
            out, tmp = [], []
            for elm in data:
                tmp.append(elm)
                if elm['last_child']:
                    out.append(tmp)
                    tmp = []
            return out

        out = defaultdict(list)
        for date, groupped in zip(dates, groupby_last_child(teams)):
            # use += instead of append in order to have a flattened list
            # instead of a list of lists
            out[date] += groupped
        return dict(out)

    def scrape_events(self):
        """Yield every league with its events"""
        for block in self.get_source().findAll('div', {'class': 'gll-MarketGroup cm-CouponMarketGroup cm-CouponMarketGroup_Open'}):
            ligue_name = block.find('span', {'class': 'cm-CouponMarketGroupButton_Text'}).get_text()
            dates, teams = self.get_events(block)
            out = self.pretty_print_events(dates, teams)
            yield ligue_name, out

    def to_dict(self):
        """Scrape events and return a dict"""
        return dict(self.scrape_events())

if __name__ == '__main__':
    instance = Bet365()
    out = instance.to_dict()
    pprint(out)

Output:
{'England League 2 - Full Time Result': {'Sat 07 Sep': [{'date': '15:00',
'last_child': False,
'teams': 'Bradford v '
'Northampton'},
{'date': '15:00',
'last_child': False,
'teams': 'Cambridge '
'Utd v '
'Forest '
'Green'},
{'date': '15:00',
'last_child': False,
'teams': 'Carlisle v '
'Exeter'},
{'date': '15:00',
'last_child': False,
'teams': 'Cheltenham '
'v '
'Stevenage'},
{'date': '15:00',
'last_child': False,
'teams': 'Colchester '
'v Walsall'},
{'date': '15:00',
'last_child': False,
'teams': 'Grimsby v '
'Crewe'},
{'date': '15:00',
'last_child': False,
'teams': 'Leyton '
'Orient v '
'Swindon'},
{'date': '15:00',
'last_child': False,
'teams': 'Macclesfield '
'v Crawley '
'Town'},
{'date': '15:00',
'last_child': False,
'teams': 'Mansfield v '
'Scunthorpe'},
{'date': '15:00',
'last_child': False,
'teams': 'Morecambe v '
'Salford '
'City'},
{'date': '15:00',
'last_child': False,
'teams': 'Newport '
'County v '
'Port Vale'},
{'date': '15:00',
'last_child': True,
'teams': 'Plymouth v '
'Oldham'}]},...

https://stackoverflow.com/questions/57737881
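The grouping logic in `pretty_print_events` can be illustrated in isolation; a sketch with made-up events, two under the first date and one under the second:

```python
def groupby_last_child(data):
    """Close a group whenever an event is flagged as the
    last child of its date block."""
    out, tmp = [], []
    for elm in data:
        tmp.append(elm)
        if elm['last_child']:
            out.append(tmp)
            tmp = []
    return out

events = [{'teams': 'A v B', 'last_child': False},
          {'teams': 'C v D', 'last_child': True},
          {'teams': 'E v F', 'last_child': True}]
print([len(group) for group in groupby_last_child(events)])  # [2, 1]
```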