Q: Querying a sports-data DataFrame by winner/loser name to get a table of summary statistics per player?

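Stack Overflow user

Asked 2020-04-08 05:56:15

3 answers · 321 views · 0 followers · score 2

I have a CSV dataset of tennis match results in winner/loser format, structured similarly to this one: https://www.kaggle.com/jordangoblet/atp-tour-20002016

I want to create another table that shows, for each player, the number of wins, losses, days since the last match, average serve percentage, and so on.

A friend suggested trying something like this: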

games = [
    ["Elliot", 7, "Maya", 10],
    ["Elliot", 10, "Brendan", 8],
    ["Jamie", 7, "Omid", 10],
    ["Elliot", 6, "Jamie", 10],
    ["Brendan", 7, "Jamie", 10],
]

players = {}
K = 32.0
game_number = 1

for game in games:
    p1name, p1score, p2name, p2score = game
    if p1name not in players:
        players[p1name] = {
            'wins': 0,
            'losses': 0,
            'score': 1000.0,
        }
    if p2name not in players:
        players[p2name] = {
            'wins': 0,
            'losses': 0,
            'score': 1000.0,
        }

more code...

...

print('Rank         Player         Elo       Wins    Losses')
rank = 1
for player, p in sorted(players.items(), key=lambda item: (item[1]['score'], item[0]), reverse=True):
    print('%4s %20s %7.1f %5s %7s' % (rank, player, p['score'], p['wins'], p['losses']))
    rank += 1

I tried to adapt this code to my needs by turning my df into a list of lists:

lol = df.values.tolist()
players = {}
for game in lol:
    game_id, tournament_name, tournament_id, tournament_date, round_match, winner, winner_id, loser, loser_id, winner_elo, 
    loser_elo, winner_delta, loser_delta, winner_set_1, loser_set_1, winner_set_2, loser_set_2, winner_set_3, loser_set_3,
    winner_set_4, loser_set_4, winner_set_5, loser_set_5, winner_sets, loser_sets, winner_serve_points_won, 
    loser_serve_points_won, winner_serve_hold, loser_serve_hold, winner_points_won, loser_points_won, winner_true_serve_pct, 
    loser_true_serve_pct = game

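where all of the names above are column names in the original df.

But I get this error: NameError: name 'game_id' is not defined

The original code runs fine, but I don't understand why `= game` doesn't work in my case.

Am I missing something? Is there a better way to approach this? My dataset has about 60k instances and 33 columns.

Edit: found the solution, the `variables = game` assignment has to be on a single line. However, if there is a better way to achieve what I described in the comments, you are welcome to comment.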
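A minimal sketch of the fix mentioned in the edit above, using a hypothetical three-column row in place of the 33-column rows. Without parentheses, a line such as `game_id, winner,` is a complete statement on its own (a tuple of names that do not exist yet), which is what raises the NameError. Wrapping the targets in parentheses lets the unpacking span several lines:

```python
# Hypothetical three-column row standing in for one 33-column row of the question.
game = [101, "Elliot", "Maya"]

# Parentheses turn the three lines into one assignment statement.
(game_id,
 winner,
 loser) = game

print(game_id, winner, loser)
```

The same layout should work for the full list of 33 names, as long as the number of targets matches the number of columns in each row.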

Edit: column info

game/tournament/winner/loser_id = unique id's

tournament_name = matches are played in daily tournaments

tournament_date = date of tournament/match

round_match = round of tournament the match takes place

winner/loser= winner/loser name

winner/loser_elo = current elo for the day

winner/loser_delta = delta change after match

winner/loser_set_1,2,3,4,5 = score for winner/loser per game, match ends when a player reaches 3 games, 11 points needed to win a game or 2 point differential (extra points if game is tied at 10-10)

winner/loser_sets= number of sets

winner/loser_points_won = total points per player in the match

winner/loser_true_serve_pct = % of points player wins when he is serving

winner_serve_points_won/loser_serve_points_won is obsolete

winner_serve_hold/loser_serve_hold will be replaced by winner/loser_true_serve_pct

python

pandas

list

dictionary

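3 Answers

Stack Overflow user

Answered 2020-04-09 07:48:08

You need to query the dataframe for the player name occurring in either the 'Winner' or the 'Loser' column to get matches, the dataframe of all matches involving that player. This is harder than it first appears: we can't just use a simple df.groupby(). Apologies for underestimating your question.

The solution below uses df[['Winner','Loser']].isin([player]).any(axis=1)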

- Note that this parameterizes the list of columns where the name can occur rather than hardcoding them. It is a more elegant idiom than `(df['Winner'] == player) | (df['Loser'] == player)`
- I also looked into `df.query("...")`, which takes an arbitrary SQL-style query string, but its syntax is annoying
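As a small illustration of the idiom in the list above, here is a sketch on a hypothetical five-match frame (the names and rows are made up). It also shows that the explicit comparison needs the element-wise `|` operator; Python's `or` cannot combine two boolean Series.

```python
import pandas as pd

# Hypothetical toy data using the Kaggle column names 'Winner' and 'Loser'.
df = pd.DataFrame({
    "Winner": ["Maya", "Elliot", "Omid", "Jamie", "Jamie"],
    "Loser": ["Elliot", "Brendan", "Jamie", "Elliot", "Brendan"],
})
player = "Elliot"
name_cols = ["Winner", "Loser"]  # the columns in which a name can occur

# True for every row where the player appears in any of the name columns.
mask_isin = df[name_cols].isin([player]).any(axis=1)

# Equivalent hardcoded form; note `|`, not `or`.
mask_or = (df["Winner"] == player) | (df["Loser"] == player)
assert mask_isin.equals(mask_or)

matches = df[mask_isin]
print(matches)
```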

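There are 1387 unique players across the two columns: 879 unique players in df['Winner'] and 1383 in df['Loser']. We combine the two into a set, player_names. You could use a list, but a set makes removing duplicates easier: set(df['Winner'].unique()) | set(df['Loser'].unique()), where '|' is the set-union operator.

For efficiency in handling player names, we use a categorical dtype instead of strings. Note how the code has one unified category that merges the category values of both the 'Winner' and 'Loser' columns (see code).

We use defaultdict(dict) for players, which is elegant because you can now assign a field such as players[player]['WhateverField'] directly, regardless of whether players[player] already exists; if it does not, it is created automatically. The code shows examples of the aggregate fields you wanted to create.
- Note that you can use pandas' vectorized mean(), sum(), etc. directly for simple things
- To create other, more arbitrary aggregations, you can use matches.agg(...) with a custom/lambda function.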
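The two kinds of aggregation in the list above could look roughly like this. The sketch assumes a hypothetical three-row matches frame; instead of a lambda passed to agg, the custom statistic is written as a plain function of the whole frame, which is a substitution on my part because the statistic needs both columns at once. The name upset_share is made up for illustration.

```python
import pandas as pd

# Hypothetical matches for one player; WRank/LRank follow the Kaggle column names.
matches = pd.DataFrame({
    "WRank": [3.0, 10.0, 7.0],
    "LRank": [8.0, 4.0, 20.0],
})

# Built-in reductions, one per column, selected by name.
basic = matches.agg({"WRank": "mean", "LRank": "min"})

# Custom statistic (hypothetical): share of matches in which the winner
# had the numerically larger, i.e. worse, rank.
def upset_share(m):
    return (m["WRank"] > m["LRank"]).mean()

upsets = upset_share(matches)
print(basic)
print(upsets)
```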

Code:

import pandas as pd

# https://www.kaggle.com/jordangoblet/atp-tour-20002016
df = pd.read_csv('../input/Data.csv', encoding='latin1', usecols=range(12+1), na_values='NR',
    parse_dates=['Date'], dayfirst=True)
# make sure that the 'NR' in WRank/LRank column is recognized as NaN, so they read in as float columns
# Columns 0-12 (basic data), 13-25 (results by set), 26-53 (betting odds: float)

# Create table of number of wins, losses, days since last match, average serve percentage etc. per player.

pd.options.display.max_rows = 100
pd.options.display.precision = 3    

player_names = set(df['Winner'].unique()) | set(df['Loser'].unique())
player_names = sorted(player_names) # 1.0.x BUG: order seems totally unstable(!)
player_dtype = pd.CategoricalDtype(categories = player_names) # merge across multiple columns

# WRONG: categories only taken from each individual column, not merged
# df[['WinnerCat1','LoserCat1']] = df[['Winner','Loser']].astype('category')
# RIGHT:
df[['WinnerCat','LoserCat']] = df[['Winner','Loser']].astype(player_dtype)

from collections import defaultdict
players = defaultdict(dict)

for _, player in enumerate(player_names):  # enumerate supplies the index used in the print below
    # query df where player occurs in either column, can't use `df[['Winner','Loser']].groupby(player)`
    matches = df[ df[['Winner','Loser']].isin([player]).any(axis=1) ]

    print(f'\n[MATCHES FOR {_} {player}]')
    print(matches.iloc[:, 0:13])

    players[player]['Wins']   = matches['Winner'].eq(player).sum()
    players[player]['Losses'] = matches['Loser'].eq(player).sum()
    players[player]['Avg']    = matches['Winner'].eq(player).mean().round(4)
    # You can do any arbitrary calculation involving both players' columns, e.g....
    players[player]['RankDiff'] = (matches['WRank'] - matches['LRank']).mean()

    #if _ >= 2: break # for debugging
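As a possible alternative to looping over every player, the frame can be reshaped so that each match contributes one row per participant, after which a single groupby produces the whole table. This is only a sketch on hypothetical data: the rows, the as_of reference date, and the output column names are assumptions, and it is not the method used in the code above.

```python
import pandas as pd

# Hypothetical toy data; 'Date', 'Winner', 'Loser' follow the Kaggle column names.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-01", "2020-01-03", "2020-01-05", "2020-01-08"]),
    "Winner": ["Maya", "Elliot", "Jamie", "Jamie"],
    "Loser": ["Elliot", "Brendan", "Elliot", "Brendan"],
})

# One row per (match, player): stack winners and losers into one 'Player' column.
wins = df[["Date", "Winner"]].rename(columns={"Winner": "Player"}).assign(Won=1)
losses = df[["Date", "Loser"]].rename(columns={"Loser": "Player"}).assign(Won=0)
long_df = pd.concat([wins, losses], ignore_index=True)

# Named aggregation: output column = (input column, reduction).
summary = long_df.groupby("Player").agg(
    Wins=("Won", "sum"),
    Matches=("Won", "count"),
    LastMatch=("Date", "max"),
)
summary["Losses"] = summary["Matches"] - summary["Wins"]

as_of = pd.Timestamp("2020-01-10")  # hypothetical reference date
summary["DaysSinceLast"] = (as_of - summary["LastMatch"]).dt.days
print(summary)
```

Per-player statistics that live in paired winner_/loser_ columns, such as a serve percentage, could be stacked the same way before grouping.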

Score 1

Stack Overflow user

Answered 2020-04-08 06:13:02

For example, if you have a CSV file, you can load it like this:

import pandas as pd
dataframe = pd.read_csv('file.csv')

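Now select the columns you are interested in (I will use the Kaggle dataset as a reference).

Note that I won't write out everything you asked for, because I don't fully understand your question (there are no dates in the Kaggle dataset, and what is the average statistic per player?)

dataframe = dataframe[['Winner', 'Loser']]

Now we use the groupby function with a count to get the number of wins and losses for each player (each player then appears only once in the grouped 'Winner'/'Loser' result).

# groupby takes the column name to group on; size() counts the rows in each group
wins = dataframe.groupby('Winner').size()
losses = dataframe.groupby('Loser').size()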
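The per-player win and loss counts described above can be combined into one table. The following is a sketch on hypothetical data, using value_counts rather than groupby, with players who never won or never lost filled in as 0:

```python
import pandas as pd

# Hypothetical toy data using the Kaggle column names 'Winner' and 'Loser'.
dataframe = pd.DataFrame({
    "Winner": ["Maya", "Elliot", "Jamie", "Jamie"],
    "Loser": ["Elliot", "Brendan", "Elliot", "Brendan"],
})

wins = dataframe["Winner"].value_counts()    # matches won per player
losses = dataframe["Loser"].value_counts()   # matches lost per player

# Align the two counts on player name; a missing count means zero.
table = pd.concat([wins, losses], axis=1, keys=["Wins", "Losses"]).fillna(0).astype(int)
print(table)
```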

Score 0

Stack Overflow user

Answered 2020-04-08 06:55:29

I ran the code below without any errors, so I don't think the problem is the `= game` assignment.

import pandas as pd

df = pd.read_csv(r'Ten.csv')


lol = df.values.tolist()


players = {}
for game in lol:
    Player,Tournament,Wins,Losses = game

    if Player not in players:
        players[Player] = {
            'Tournament': 'foo',
            'wins': 1,
            'losses': 2,
        }

print(players)

However, if the name Player is spelled differently, as shown below (lowercase "p"):

player,Tournament,Wins,Losses = game

Or: (an "s" at the end of Player)

        players[Players] = {
            'Tournament': 'foo',
            'wins': 1,
            'losses': 2,
        }

then an error like NameError: name 'Player' is not defined will pop up. Are you 100% sure game_id isn't misspelled?

Score 0

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/61090014
