我试图想出一种动态的方法来检查字符串的存在,并报告几个不同的结果: gone_client、回旋镖、new_client。
将熊猫作为pd导入numpy作为npdata存储在字典details ={ 'address_id':111,111,111,111,111,111,222,222,222,222,222,222,333,333,333,333,333,333,444,444,444,444,444,444,555,555,555,555,555,555,777,777,777,'my_company':'Comcast','Verizon','Other','Other','Comcast','Comcast',‘频谱’,'Verizon',‘频谱’,“频谱”、“威瑞森”、“频谱”、“威瑞森”、“其他”、“威瑞森”、“康卡斯特”、“康卡斯特”、“我的约会”:“2022-01-24”、“2022-02-21”、“2022-03-28”、“2022-04-25”、“2022-05-23”、“2022-06-27”,“2022-01-24”、“2022-02-21”、“2022-03-28”、“2022-04-25”、“2022-05-23”、“2022-06-27”、“2022-01-24”、“2022-02-21”、“2022-03-28”、“2022-04-25”、“2022-05-23”、“2022-06-27”、“2022-01-24”、“2022-02-21”,“2022-03-28”、“2022-04-25”、“2022-05-23”、“2022-06-27”、“2022-01-24”、“2022-02-21”、“2022-03-28”、“2022-04-25”、“2022-05-23”、“2022-06-27”、“2022-01-24”、“2022-02-21”、“2022-03-28”}=pd.DataFrame(详细信息) df
我不能完全正确地理解逻辑,但我认为这是一些关于lines...but的东西
def f(ser):
if 'Verizon' not in ser.unique():
return False
if 'Verizon' in ser.unique():
return True
return False
df['gone_client'] = df.groupby('address_id')['my_company'].transform(f)
df.head()也许是这样的..。
# substring to be searched
sub ='Verizon'
# creating and passing series to new column
df["Indexes"]= df["my_company"].str.find(sub)
df不过,我不知道如何处理这些日期,只有通过检查日期,您才能知道客户端是否已经离开、翻回、新的,或者可能是一些不适合于水桶的奇怪模式,比如“其他”。预期的结果将是这样(最后一种情况是一种又一次离开的回旋镖)。我不会期望有很多这样的场景,但是肯定会有一些非常独特的情况,所以“其他”对于这些记录来说是可以的。
address_id my_company my_date status
0 111 Comcast 1/24/2022 Left
1 111 Verizon 2/21/2022 Left
2 111 Other 3/28/2022 Left
3 111 Other 4/25/2022 Left
4 111 Comcast 5/23/2022 Left
5 111 Comcast 6/27/2022 Left
6 222 Spectrum 1/24/2022 Never Client
7 222 Spectrum 2/21/2022 Never Client
8 222 Spectrum 3/28/2022 Never Client
9 222 Spectrum 4/25/2022 Never Client
10 222 Spectrum 5/23/2022 Never Client
11 222 Spectrum 6/27/2022 Never Client
12 333 Verizon 1/24/2022 Never Left
13 333 Verizon 2/21/2022 Never Left
14 333 Verizon 3/28/2022 Never Left
15 333 Verizon 4/25/2022 Never Left
16 333 Verizon 5/23/2022 Never Left
17 333 Verizon 6/27/2022 Never Left
18 444 Spectrum 1/24/2022 Left
19 444 Spectrum 2/21/2022 Left
20 444 Spectrum 3/28/2022 Left
21 444 Spectrum 4/25/2022 Left
22 444 Verizon 5/23/2022 Left
23 444 Spectrum 6/27/2022 Left
24 555 Verizon 1/24/2022 Boomerang
25 555 Spectrum 2/21/2022 Boomerang
26 555 Spectrum 3/28/2022 Boomerang
27 555 Spectrum 4/25/2022 Boomerang
28 555 Verizon 5/23/2022 Boomerang
29 555 Verizon 6/27/2022 Boomerang
30 777 Comcast 1/24/2022 New Client
31 777 Comcast 2/21/2022 New Client
32 777 Verizon 1/24/2022 New Client
33 888 Verizon 2/21/2022 Other
34 888 Comcast 3/28/2022 Other
35 888 Verizon 4/25/2022 Other
36 888 Comcast 5/23/2022 Other
37 888 Comcast 6/27/2022 Other发布于 2022-09-06 21:16:18
这里有一个包含一些详细逻辑的解决方案,您可以随意使用。这听起来不像是你对你最后的逻辑很确定,但希望这能给你足够的时间去玩。
此解决方案基于address_id对数据进行分组。然后,对于每个单独的组,我们可以检查哪个my_company特性。我们可以利用这一点,再加上一个timedelta,来说明地址是否在Verizon,而不是Verizon,永远不属于Verizon,现在是Verizon,在过去的30天里又回来了等等。
这个答案不是由Verizon赞助的。其他手机提供商也存在。
import pandas as pd
import datetime
# data stored in dictionary
details = {
'address_id': [111,111,111,111,111,111,222,222,222,222,222,222,333,333,333,333,333,333,444,444,444,444,444,444,555,555,555,555,555,555,777,777,777],
'my_company':['Comcast','Verizon','Other','Other','Comcast','Comcast','Spectrum','Spectrum','Spectrum','Spectrum','Spectrum','Spectrum','Verizon','Verizon','Verizon','Verizon','Verizon','Verizon','Spectrum','Spectrum','Spectrum','Spectrum','Verizon','Spectrum','Spectrum','Spectrum','Spectrum','Spectrum','Verizon','Other','Verizon','Comcast','Comcast'],
'my_date':['2022-01-24','2022-02-21','2022-03-28','2022-04-25','2022-05-23','2022-06-27','2022-01-24','2022-02-21','2022-03-28','2022-04-25','2022-05-23','2022-06-27','2022-01-24','2022-02-21','2022-03-28','2022-04-25','2022-05-23','2022-06-27','2022-01-24','2022-02-21','2022-03-28','2022-04-25','2022-05-23','2022-06-27','2022-01-24','2022-02-21','2022-03-28','2022-04-25','2022-05-23','2022-06-27','2022-01-24','2022-02-21','2022-03-28']
}
df = pd.DataFrame(details)
df['my_date'] = pd.to_datetime(df['my_date'])
address_groups = df.groupby(['address_id'])
frame_list = []
current_date = datetime.datetime.now()
for group, frame in address_groups:
# Create a list and set of each company used by a given address-id:
company_list = frame['my_company'].values.tolist()
company_set = set(company_list)
# Exclusively Verizon
if ('Verizon' in company_set) and (len(company_set) == 1):
frame['status'] = 'Verizon Diehard'
# Never Verizon
if ('Verizon' not in company_set):
frame['status'] = 'Verizon Never'
# Verizon at some point but not currently
if ('Verizon' in company_set) and (company_list[-1] != 'Verizon'):
v_frame = frame[frame['my_company'] == 'Verizon']
last_verizon_date = v_frame['my_date'].iloc[-1]
last_verizon_date = datetime.datetime.strptime(last_verizon_date, '%Y-%m-%d')
if (current_date - last_verizon_date) < pd.Timedelta("30 days"):
frame['status'] = 'Not curretly Verizon, but was in last 30 days'
else:
frame['status'] = 'Not curretly Verizon, but was so more than 30 days ago'
# Verizon currently but was a boomerang
if (company_list[-1] == 'Verizon') and (len(company_set) >= 2):
non_v_frame = frame[frame['my_company'] != 'Verizon']
last_non_v_date = non_v_frame['my_date'].iloc[-1]
last_non_v_date = datetime.datetime.strptime(last_non_v_date, '%Y-%m-%d')
if (current_date - last_non_v_date) < pd.Timedelta("30 days"):
frame['status'] = 'Boomerang back to Verizon in last 30 days'
else:
frame['status'] = 'Boomerang back more than 30 days ago'
frame_list.append(frame)
final_df = pd.concat(frame_list)https://stackoverflow.com/questions/73623406
复制相似问题