假设我有一个包含这些数据的表(一个dataframe):
| user | food |
|:--------:|:-------------:|
| 'A' | 'meat' |
| 'A' | 'carrot' |
| 'A' | 'candy' |
| 'B' | 'meat' |
| 'B' | 'carrot' |
| 'C' | 'meat' |
| 'C' | 'carrot' |代码:
df = pd.DataFrame({
"user":["A", "A", "A", "B", "B", "C", "C"],
"food":['meat', 'carrot', 'candy', 'meat', 'carrot', 'meat', 'carrot']
})我想要建立的是一个表格,每一对食物告诉我有多少用户拥有它们:
| food 1 | food 2 | num users |
|:----------:|:-------------:|:----------:|
| 'meat' | 'carrot' | 3 |
| 'meat' | 'candy' | 1 |
| 'carrot' | 'candy' | 1 | 有办法这样做吗?
发布于 2017-11-20 13:45:58
您可以首先使用get_dummies:
df = pd.get_dummies(df.set_index('user'), prefix='', prefix_sep='').max(level=0)
print (df)
candy carrot meat
user
A 1 1 1
B 0 1 1
C 0 1 1然后用list comprehension计数
from itertools import combinations
L = [(x[0], x[1],(df[list(x)] == 1).all(1).sum()) for x in list(combinations(df.columns, 2))]
print (L)
[('candy', 'carrot', 1), ('candy', 'meat', 1), ('carrot', 'meat', 3)]
df = pd.DataFrame(L, columns=['food 1','food 2','num users'])
print (df)
food 1 food 2 num users
0 candy carrot 1
1 candy meat 1
2 carrot meat 3发布于 2017-11-20 13:46:47
你可以试试这个:
food_pairs = [("meat", "carrot"), ("meat", "candy")]
food_to_users = {food: set(df.user[df.food == food].unique()) for food in df.food.unique()}
out = pd.DataFrame(
((*pair, len(set.intersection(*(food_to_users[food] for food in pair)))) for pair in food_pairs),
columns=["food1", "food2", "num users"]
)平均运行时间超过1000个试验是0.00256s。
可伸缩性测试代码:
import itertools
import math
import pandas as pd
from random import shuffle
from timeit import time
SIZE_OF_PAIRS = 2
NUM_FOODS = 50
NUM_USERS = 1000
NUM_RECORDS = 100000
foods = (list(range(NUM_FOODS)) * (math.ceil(NUM_RECORDS/NUM_FOODS)))[:NUM_RECORDS]
users = (list(range(NUM_USERS)) * (math.ceil(NUM_RECORDS/NUM_USERS)))[:NUM_RECORDS]
shuffle(foods)
shuffle(users)
df = pd.DataFrame({"user": users, "food": foods})
food_pairs = pd.Series([*itertools.combinations(df.food.unique(), SIZE_OF_PAIRS)])
start = time.time()
food_to_users = {food: set(df.user[df.food == food].unique()) for food in df.food.unique()}
out = pd.DataFrame(
((*pair, len(set.intersection(*(food_to_users[food] for food in pair)))) for pair in food_pairs),
columns=[*["food" + str(i) for i in range(SIZE_OF_PAIRS)], "num users"]
)
print("Time taken: {}s".format(time.time() - start))https://stackoverflow.com/questions/47392457
复制相似问题