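Querying a sports-data DataFrame by winner/loser name to get a summary statistics table for each player?

Stack Overflow user
Asked 2020-04-08 05:56:15
Answers: 3 · Views: 321 · Followers: 0 · Votes: 2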

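I have a CSV dataset of tennis match results in winner/loser format, with a structure similar to this one: https://www.kaggle.com/jordangoblet/atp-tour-20002016

I want to create another table that shows, for each player, the number of wins and losses, days since the last match, average serve percentage, and so on.

A friend suggested trying something like this:

Code language: python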
games = [
    ["Elliot", 7, "Maya", 10],
    ["Elliot", 10, "Brendan", 8],
    ["Jamie", 7, "Omid", 10],
    ["Elliot", 6, "Jamie", 10],
    ["Brendan", 7, "Jamie", 10],
]

players = {}
K = 32.0
game_number = 1

for game in games:
    p1name, p1score, p2name, p2score = game
    if p1name not in players:
        players[p1name] = {
            'wins': 0,
            'losses': 0,
            'score': 1000.0,
        }
    if p2name not in players:
        players[p2name] = {
            'wins': 0,
            'losses': 0,
            'score': 1000.0,
        }

More code...

...

Code language: python
print('Rank         Player         Elo       Wins    Losses')
rank = 1
for player, p in sorted(players.items(), key=lambda item: (item[1]['score'], item[0]), reverse=True):
    print('%4s %20s %7.1f %5s %7s' % (rank, player, p['score'], p['wins'], p['losses']))
    rank += 1 

I tried to adapt this code to my needs by turning my df into a list of lists:

Code language: python
lol = df.values.tolist()
players = {}
for game in lol:
    game_id, tournament_name, tournament_id, tournament_date, round_match, winner, winner_id, loser, loser_id, winner_elo, 
    loser_elo, winner_delta, loser_delta, winner_set_1, loser_set_1, winner_set_2, loser_set_2, winner_set_3, loser_set_3,
    winner_set_4, loser_set_4, winner_set_5, loser_set_5, winner_sets, loser_sets, winner_serve_points_won, 
    loser_serve_points_won, winner_serve_hold, loser_serve_hold, winner_points_won, loser_points_won, winner_true_serve_pct, 
    loser_true_serve_pct = game

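where all of the names above are the column names in the original df.

But I get this error: NameError: name 'game_id' is not defined

The original code works fine, but I don't understand why `= game` doesn't work in my case.

Am I missing something? Is there a better way to approach this? My dataset has about 60k rows and 33 columns.

Edit: got the solution. The `variables = game` assignment has to be on a single line. However, if there is a better way to achieve what I describe in the comments, you're welcome to comment.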
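To spell out why the single-line fix works: without a line continuation, Python reads the first line of the target list as a complete expression statement (a tuple of names), so it tries to evaluate `game_id` and raises the NameError. Wrapping the targets in parentheses lets the unpacking span several lines. Here is a minimal sketch, assuming a shortened made-up row with only four of the 33 columns:

```python
# Toy rows standing in for df.values.tolist(); the values are made up.
lol = [
    [1, "Open A", "Elliot", "Maya"],
    [2, "Open B", "Jamie", "Elliot"],
]

wins = {}
for game in lol:
    # The parentheses make the multi-line target list a single statement.
    (game_id, tournament_name,
     winner, loser) = game
    wins[winner] = wins.get(winner, 0) + 1

print(wins)
```

With this many columns, iterating with `df.itertuples(index=False)` and reading fields by name (for example `row.winner`) may be easier to maintain than one long unpacking.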

Edit: column information

game/tournament/winner/loser_id = unique id's

tournament_name = matches are played in daily tournaments

tournament_date = date of tournament/match

round_match = round of tournament the match takes place

winner/loser= winner/loser name

winner/loser_elo = current elo for the day

winner/loser_delta = delta change after match

winner/loser_set_1,2,3,4,5 = score for winner/loser per game, match ends when a player reaches 3 games, 11 points needed to win a game or 2 point differential (extra points if game is tied at 10-10)

winner/loser_sets= number of sets

winner/loser_points_won = total points per player in the match

winner/loser_true_serve_pct = % of points player wins when he is serving

winner_serve_points_won/loser_serve_points_won is obsolete

winner_serve_hold/loser_serve_hold will be replaced by winner/loser_true_serve_pct

3 Answers

Stack Overflow user

Answered 2020-04-09 07:48:08

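You need to query the dataframe for rows where the player's name occurs in either the 'Winner' or the 'Loser' column, which gives you `matches`, a dataframe of all matches involving that player. This is harder than it looks at first glance: we can't just use a simple `df.groupby()`. Sorry for underestimating your question.

The solution below:

  • uses df[['Winner','Loser']].isin([player]).any(axis=1)
    • Note that this parameterizes the list of columns where the name can occur rather than hardcoding them. It is a more elegant idiom than `(df['Winner'] == player) | (df['Loser'] == player)`
    • I also looked into `df.query("...")`, which takes an arbitrary SQL-style query string, but its syntax is annoying
  • There are 1387 unique players across the two columns: 879 unique players in df['Winner'] and 1383 in df['Loser']. We combine the two into one set, player_names. You could use lists, but a set makes it easier to drop duplicates: set(df['Winner'].unique()) | set(df['Loser'].unique()), where '|' is the set-union operator.
  • For efficiency in handling the player names, we use a categorical rather than strings. Note how the code builds one unified category that merges the category values of both the 'Winner' and 'Loser' columns (see code).
  • We use defaultdict(dict), which is elegant because you can now assign fields directly with players[player]['WhateverField'] regardless of whether players[player] has been defined yet; if it hasn't, it is created automatically.
  • Examples of the aggregate fields you want to create are shown.
    • Note that you can use pandas' vectorized mean(), sum(), etc. directly for simple things
    • To create other, more arbitrary aggregates, you can use matches.agg(...) with a custom/lambda function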
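The row filter and the two kinds of aggregate mentioned in the list can be tried on a small frame first. This is only a sketch: the four rows are made up, the column names follow the Kaggle file, and the "rank spread" aggregate is a hypothetical example of a custom function, not something from the dataset.

```python
import pandas as pd

# Made-up matches using the Kaggle column names Winner, Loser, WRank, LRank.
df = pd.DataFrame({
    "Winner": ["A", "B", "A", "C"],
    "Loser":  ["B", "C", "C", "A"],
    "WRank":  [1.0, 2.0, 1.0, 3.0],
    "LRank":  [2.0, 3.0, 3.0, 1.0],
})

player = "A"
name_cols = ["Winner", "Loser"]  # columns where a player name can occur

# True for rows where the player appears in any of the name columns.
mask = df[name_cols].isin([player]).any(axis=1)
matches = df[mask]

# Built-in vectorized aggregates cover the simple cases.
wins = int(matches["Winner"].eq(player).sum())
losses = int(matches["Loser"].eq(player).sum())

# A custom aggregate: DataFrame.agg calls the lambda once per selected column.
rank_spread = matches[["WRank", "LRank"]].agg(lambda col: col.max() - col.min())

print(len(matches), wins, losses)
print(rank_spread)
```

Selecting only the numeric columns before calling `agg` keeps the lambda from being applied to the name columns, where `max() - min()` is not defined for strings.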

Code:

Code language: python
import pandas as pd

# https://www.kaggle.com/jordangoblet/atp-tour-20002016
df = pd.read_csv('../input/Data.csv', encoding='latin1', usecols=range(12+1), na_values='NR',
    parse_dates=['Date'], dayfirst=True)
# make sure that the 'NR' in WRank/LRank column is recognized as NaN, so they read in as float columns
# Columns 0-12 (basic data), 13-25 (results by set), 26-53 (betting odds: float)

# Create table of number of wins, losses, days since last match, average serve percentage etc. per player.

pd.options.display.max_rows = 100
pd.options.display.precision = 3    

player_names = set(df['Winner'].unique()) | set(df['Loser'].unique())
player_names = sorted(player_names) # 1.0.x BUG: order seems totally unstable(!)
player_dtype = pd.CategoricalDtype(categories = player_names) # merge across multiple columns

# WRONG: categories only taken from each individual column, not merged
# df[['WinnerCat1','LoserCat1']] = df[['Winner','Loser']].astype('category')
# RIGHT:
df[['WinnerCat','LoserCat']] = df[['Winner','Loser']].astype(player_dtype)

from collections import defaultdict
players = defaultdict(dict)

for _, player in enumerate(player_names):  # enumerate supplies the counter `_` used in the printout below
    # query df where player occurs in either column, can't use `df[['Winner','Loser']].groupby(player)`
    matches = df[ df[['Winner','Loser']].isin([player]).any(axis=1) ]

    print(f'\n[MATCHES FOR {_} {player}]')
    print(matches.iloc[:, 0:13])

    players[player]['Wins']   = matches['Winner'].eq(player).sum()
    players[player]['Losses'] = matches['Loser'].eq(player).sum()
    players[player]['Avg']    = matches['Winner'].eq(player).mean().round(4)
    # You can do any arbitrary calculation involving both players' columns, e.g....
    players[player]['RankDiff'] = (matches['WRank'] - matches['LRank']).mean()

    #if _ >= 2: break # for debugging
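As an alternative to looping over every player (this is not the approach used in the answer above), the table can be reshaped so that each match contributes one row per player, after which an ordinary `groupby` does apply. The sketch below uses three made-up matches with the Kaggle column names `Date`, `Winner` and `Loser`. Measuring "days since last match" against the latest date in the data is an assumption, and the asker's own file would need its own column names (`tournament_date`, `winner`, `loser`).

```python
import pandas as pd

# Made-up matches; Date, Winner and Loser follow the Kaggle column names.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2016-01-01", "2016-01-03", "2016-01-07"]),
    "Winner": ["A", "B", "A"],
    "Loser":  ["B", "C", "C"],
})

# One row per (match, player): winner rows stacked on top of loser rows.
won = df[["Date", "Winner"]].rename(columns={"Winner": "Player"}).assign(Won=True)
lost = df[["Date", "Loser"]].rename(columns={"Loser": "Player"}).assign(Won=False)
long = pd.concat([won, lost], ignore_index=True)

# Named aggregation: one output column per (input column, function) pair.
summary = long.groupby("Player").agg(
    Wins=("Won", "sum"),
    Matches=("Won", "size"),
    LastMatch=("Date", "max"),
)
summary["Losses"] = summary["Matches"] - summary["Wins"]

# Assumption: "today" is the latest date that appears in the data.
summary["DaysSinceLast"] = (df["Date"].max() - summary["LastMatch"]).dt.days

print(summary)
```

Per-player averages such as serve percentage could be handled the same way, by carrying the winner-side and loser-side stat columns into the two halves under a common name before concatenating.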
Votes: 1

Stack Overflow user

Answered 2020-04-08 06:13:02

For example, if you have a CSV file, you can load it like this:

Code language: python
import pandas as pd
dataframe = pd.read_csv('file.csv')

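Now select the columns you are interested in (I'll use the Kaggle dataset as a reference).

Note that I won't write out everything you asked for, because I don't fully understand your question (there are no dates in the Kaggle dataset, and what is the average data for each player?)

Code language: python
dataframe = dataframe[['Winner', 'Loser']]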

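Now we use the groupby function and the count function to get the number of wins and losses for each player (so that each name from the "Winner"/"Loser" columns appears only once).

Code language: python
# Count rows per name in each column, then align the two counts by player.
wins = dataframe.groupby('Winner')['Winner'].count().rename('Wins')
losses = dataframe.groupby('Loser')['Loser'].count().rename('Losses')
summary = pd.concat([wins, losses], axis=1).fillna(0).astype(int)
Votes: 0

Stack Overflow user

Answered 2020-04-08 06:55:29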

I ran the code below without any errors, so I don't think the problem is with `game`.

Code language: python
import pandas as pd

df = pd.read_csv(r'Ten.csv')


lol = df.values.tolist()


players = {}
for game in lol:
    Player,Tournament,Wins,Losses = game

    if Player not in players:
        players[Player] = {
            'Tournament': 'foo',
            'wins': 1,
            'losses': 2,
        }

print(players)

However, if Player is spelled differently, as below (lowercase "p"):

Code language: python
player,Tournament,Wins,Losses = game

or (an "s" at the end of Player):

Code language: python
        players[Players] = {
            'Tournament': 'foo',
            'wins': 1,
            'losses': 2,
        }

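the error NameError: name 'Player' is not defined will pop up. Are you 100% sure game_id isn't misspelled?

Votes: 0
Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.
Original link:

https://stackoverflow.com/questions/61090014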
