I have a sample df:
id email1 email2 output
1 abc@gmail.com 123@gmail.com random_email1@gmail.com
1 xyz@gmail.com 234@gmail.com random_email1@gmail.com
1 NaN NaN random_email1@gmail.com
2 a123@gmail.com NaN random_email2@gmail.com
2 b123@gmail.com NaN random_email2@gmail.com
2 NaN lol@gmail.com random_email2@gmail.com
3 NaN NaN random_email3@gmail.com
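4 NaN lolz@gmail.com random_email3@gmail.com
My main goal is to overwrite the output column based on several conditions. If email1 has more than one unique email for an id, overwrite output with email1_broken on every row of that id. The same applies to email2, but if both columns have more than one unique email, email2 takes precedence, so output = email2_broken. Finally, if an id has only a single unique email, keep the existing output value.
Attempt: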
df['output'] = np.where(df.groupby('id')['email1'].nunique() > 1, 'email1_broken',df['output'])
df['output'] = np.where(df.groupby('id')['email2'].nunique() > 1, 'email2_broken',df['output'])
Desired df:
id email1 email2 output
1 abc@gmail.com 123@gmail.com email2_broken
1 xyz@gmail.com 234@gmail.com email2_broken
1 NaN NaN email2_broken
2 a123@gmail.com NaN email1_broken
2 b123@gmail.com NaN email1_broken
2 NaN lol@gmail.com email1_broken
3 NaN NaN random_email3@gmail.com
4 NaN lolz@gmail.com random_email3@gmail.com
Sample data:
import pandas as pd
import numpy as np
cols = ['id','email1','email2', 'output']
data = [
[1 , 'abc@gmail.com' , '123@gmail.com' , 'random_email1@gmail.com'],
[1 , 'xyz@gmail.com' , '234@gmail.com' , 'random_email1@gmail.com'],
[1 , np.nan , np.nan , 'random_email1@gmail.com'],
[2 , 'a123@gmail.com', np.nan , 'random_email2@gmail.com'],
[2 , 'b123@gmail.com', np.nan , 'random_email2@gmail.com'],
[2 , np.nan , 'lol@gmail.com' , 'random_email2@gmail.com'],
[3 , np.nan , np.nan , 'random_email3@gmail.com'],
[4 , np.nan , 'lolz@gmail.com' , 'random_email3@gmail.com']]
df = pd.DataFrame(data, columns=cols)
Posted on 2022-09-01 18:17:22
You were very close:
df['output'] = np.where(df.groupby('id')['email1'].transform('nunique') > 1, 'email1_broken',df['output'])
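df['output'] = np.where(df.groupby('id')['email2'].transform('nunique') > 1, 'email2_broken',df['output'])
Use transform so the per-id count is returned for every row, which gives a boolean array with the same shape as df.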
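To make the shape point concrete, here is a minimal sketch that compares the plain aggregation with the transform version. It uses only the id and email1 columns from the sample data; the names per_group and per_row are chosen for illustration and are not from the original post. The attempt in the question passes the shorter, per-id result to np.where next to the full-length output column, which is presumably why it fails with a shape mismatch.

```python
import numpy as np
import pandas as pd

# id and email1 columns from the sample data in the question
df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2, 3, 4],
    "email1": ["abc@gmail.com", "xyz@gmail.com", np.nan,
               "a123@gmail.com", "b123@gmail.com", np.nan, np.nan, np.nan],
})

# Plain aggregation: one count per id, indexed by id (NaN is not counted).
per_group = df.groupby("id")["email1"].nunique()

# transform: the same counts, repeated on every row of the original frame.
per_row = df.groupby("id")["email1"].transform("nunique")

print(len(per_group), len(per_row), len(df))
print((per_row > 1).tolist())
```

Only the transform result lines up row by row with df, so only it can be used directly as the condition in np.where.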
Posted on 2022-09-01 18:16:44
You can use np.select (the equivalent of numpy.where when several conditions are involved) together with transform('nunique').
g = df.groupby('id')
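# Conditions are checked in order, so email2 takes precedence over email1.
df['output'] = np.select(
    [g['email2'].transform('nunique').gt(1),
     g['email1'].transform('nunique').gt(1)],
    ['email2_broken', 'email1_broken'],
    df['output'])  # default: keep the existing output value
print(df)
Output: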
id email1 email2 output
0 1 abc@gmail.com 123@gmail.com email2_broken
1 1 xyz@gmail.com 234@gmail.com email2_broken
2 1 NaN NaN email2_broken
3 2 a123@gmail.com NaN email1_broken
4 2 b123@gmail.com NaN email1_broken
5 2 NaN lol@gmail.com email1_broken
6 3 NaN NaN random_email3@gmail.com
7 4 NaN lolz@gmail.com random_email3@gmail.com
https://stackoverflow.com/questions/73573628
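On the precedence rule in the np.select answer: np.select uses the choice belonging to the first condition that is True, so listing the email2 condition first is what gives it priority. The sketch below illustrates this with small hand-written masks. The arrays email2_multi, email1_multi and current are hypothetical stand-ins, not values from the thread. It also checks that two sequential np.where calls, with email2 applied last, produce the same result.

```python
import numpy as np

# Hypothetical masks for four rows; rows 0 and 1 satisfy both conditions.
email2_multi = np.array([True, True, False, False])
email1_multi = np.array([True, True, True, False])
current = np.array(["keep_a", "keep_b", "keep_c", "keep_d"])

# np.select takes the choice of the first True condition per row,
# so putting the email2 mask first gives it precedence.
result = np.select(
    [email2_multi, email1_multi],
    ["email2_broken", "email1_broken"],
    default=current,
)

# Same outcome with sequential np.where calls: the later call
# (email2) overwrites whatever the earlier one (email1) wrote.
step = np.where(email1_multi, "email1_broken", current)
step = np.where(email2_multi, "email2_broken", step)

print(result.tolist())
print(step.tolist())
```

Either form works for two conditions; np.select tends to stay more readable as the number of conditions grows.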