我有一个数据集df,其中包括关于老客户的信息和可能是新的或旧的客户列表。我想给新客户分配一个忠诚度值,即那些还没有在dataset df中评估的客户。请看下面我想要达到的目标的一个例子。假设我有一个潜在的新客户列表(例如,customers=['cust1', 'cust76']),他们的忠诚度信息可能在df数据集中缺失(这里称为existing_cust_df)。
existing_cust_df=pd.read(path.csv)数据集只有两个字段:一个是客户的ID,另一个是忠诚度值(称为value)。
Customer value
cust2 13
cust3 14
cust6 35
cust7 21
cust1 24我想确定新客户的忠诚度(在这个例子中,只有cust76,因为cust1已经在数据集中了,我们有关于它的价值的信息--即。24)基于包含在existing_cust_df数据集中的现有客户的价值。
为了确定新客户的价值,我使用了一个不同的数据集,名为extra_df,其中包括新客户和可能用于确定其忠诚度价值的潜在客户列表。extra_df数据集如下所示:
Customer Related_customer
cust76 cust32
cust76 cust2
cust76 cust1然后,可以使用extra_df查看可能包含在existing_cust_df中的相关客户,以便从那里获取他们的值,并通过平均这些值分配给新客户。在我提供的示例中,cust76在existing_cust_df中缺失,所以我查看extra_df,在与它相关的三个客户中,只有两个(cust2和cust1)在existing_cust_df中,因此它们的值。通过平均它们的值(13和24),我希望将这个值(18.5)分配给新客户(cust76),并将其ID (cust76)和计算值(18.5)附加到existing_cust_df中。所以在最后我会:
Customer value
cust2 13
cust3 14
cust6 35
cust7 21
cust1 24
cust76 18.5我希望上面的步骤是有意义的,如果没有,请告诉我。请告诉我如何运行此检查,查看列表中的客户进入existing_cust_df,然后在extra_df中(如果不包括)计算其平均值并最终更新existing_cust_df中的值?我知道要在现有的dataframe中追加一行,我应该这样做:
new_row = {'Customer': cust76, 'value': 18.5}
#append row to the dataframe
existing_cust_df = existing_cust_df.append(new_row, ignore_index=False)但在这种情况下,困难在于查看不在existing_cust_df中的列表中的客户,然后,在extra_df中,为了包括他们和他们的值,一旦计算出平均值(否则,分配一个空值也会更好)。
任何帮助都会很好。
发布于 2022-03-25 14:34:44
让我们首先构建我们的数据框架:
csvfile = StringIO(
"""Customer\tvalue
cust2\t13
cust3\t14
cust6\t35
cust7\t21
cust1\t24""")
existing_cust_df = pd.read_csv(csvfile, sep = '\t', engine='python')
csvfile = StringIO("""
Customer\tRelated_customer
cust76\tcust32
cust76\tcust2
cust76\tcust1""")
extra_df = pd.read_csv(csvfile, sep = '\t', engine='python')我们希望为extra_df中的每个输入循环,让我们通过:
dg = extra_df.groupby('Customer')然后,我们可以在extra_cust_df:True中为每个客户构建一个掩码,如果related_customer在existing_cust_df中。
dg.Related_customer.apply(lambda x : existing_cust_df.Customer.isin(x.to_numpy()))
Out[123]:
Customer
cust76 0 True
1 False
2 False
3 False
4 True
Name: Related_customer, dtype: bool然后,我们可以在existing_cust_df.value上应用这个掩码,然后将其平均化:
dg.Related_customer.apply(lambda x : existing_cust_df.query("Customer.isin(@x.to_numpy())"))
Out[125]:
Customer value
Customer
cust76 0 cust2 13
4 cust1 24
dg.Related_customer.apply(lambda x : existing_cust_df.query("Customer.isin(@x.to_numpy())").value.mean())
Out[126]:
Customer
cust76 18.5
Name: Related_customer, dtype: float64最后,我们可以将其附加到您现有的df中。
发布于 2022-03-25 14:51:13
这里有一个方法来做你的问题所要求的:
import pandas as pd
import numpy as np
# Objective:
# for each customer in customers not already in existing_cust_df,
# add a record to existing_cust_df with value equal to
# the average of the values from existing_cust_df
# of Related_customer records in extra_df
# where extra_df.Customer == customer and Related_customer is in existing_cust_df.
existing_cust_df = pd.DataFrame({'Customer':['cust2','cust3','cust6','cust7','cust1'], 'value':[13,14,35,21,24]})
print("existing_cust_df:")
print(existing_cust_df)
extra_df = pd.DataFrame({'Customer':['cust76','cust76','cust76', 'cust77', 'cust77'], 'Related_customer':['cust32','cust2','cust1', 'cust2', 'cust6']})
print("extra_df:")
print(extra_df)
customers=['cust1', 'cust76', 'cust77']
new_cust_df = pd.DataFrame({'Customer': [c for c in customers if c not in set(existing_cust_df['Customer'])]})
print("new_cust_df:")
print(new_cust_df)
# create a record for each new_cust containing list of related_cust names
# with matching customer in extra_df where related_cust is a customer in existing_cust_df
existing_related_customer_values = \
pd.merge( \
pd.merge(extra_df, new_cust_df, on= 'Customer'), \
existing_cust_df, left_on='Related_customer', right_on='Customer', suffixes=(None, '_y')) \
[['Customer', 'Related_customer', 'value']]
print("existing_related_customer_values\n (JOIN from new_cust_df and extra_df on Customer [KEEP],\n then JOIN on extra_df.Related_customer and existing_cust_df.Customer [KEEP value])")
print(existing_related_customer_values)
# find the mean of values for existing Related_customer records for new customers
new_cust_df['value'] = new_cust_df.apply(lambda x: np.mean( \
existing_related_customer_values[existing_related_customer_values['Customer'] == x['Customer']]['value']), axis=1)
print("new_cust_df (UPDATED with value column):")
print(new_cust_df)
existing_cust_df = pd.concat([existing_cust_df, new_cust_df], ignore_index=True)
print("existing_cust_df (UPDATED with new customer rows):")
print(existing_cust_df)输出:
existing_cust_df:
Customer value
0 cust2 13
1 cust3 14
2 cust6 35
3 cust7 21
4 cust1 24
extra_df:
Customer Related_customer
0 cust76 cust32
1 cust76 cust2
2 cust76 cust1
3 cust77 cust2
4 cust77 cust6
new_cust_df:
Customer
0 cust76
1 cust77
existing_related_customer_values
(JOIN from new_cust_df and extra_df on Customer [KEEP],
then JOIN on extra_df.Related_customer and existing_cust_df.Customer [KEEP value])
Customer Related_customer value
0 cust76 cust2 13
1 cust77 cust2 13
2 cust76 cust1 24
3 cust77 cust6 35
new_cust_df (UPDATED with value column):
Customer value
0 cust76 18.5
1 cust77 24.0
existing_cust_df (UPDATED with new customer rows):
Customer value
0 cust2 13.0
1 cust3 14.0
2 cust6 35.0
3 cust7 21.0
4 cust1 24.0
5 cust76 18.5
6 cust77 24.0https://stackoverflow.com/questions/71617891
复制相似问题