I have a pandas DataFrame of text data. I built the text by grouping and aggregating, as shown below, and then computed a word count.
df = df.groupby('id') \
    .agg({'chat': ', '.join}) \
    .reset_index()

It looks like this. chat is the combined text for each id; created_at holds the chat timestamps, converted to a string.
|id|chat |word count|created_at |
|23|hi,hey!,hi|3 |2018-11-09 02:11:24,2018-11-09 02:11:43,2018-11-09 03:13:22|
|24|look there|2 |2017-11-03 18:05:34,2017-11-06 18:03:22 |
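|25|thank you!|2 |2017-11-07 09:18:01,2017-11-18 11:09:37 |

I want a chat_duration column that gives the difference between the first date and the last date as an integer. If the chat ended on the same day, the value should be 1. The expected new column is: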
|chat_duration|
|1 |
|3 |
|11 |

Before grouping, the data looks like this (it can be copied to the clipboard):
,id,chat,created_at
0,23,"hi",2018-11-09 02:11:24
1,23,"hey!",2018-11-09 02:11:43
2,23,"hi",2018-11-09 03:13:22

Posted on 2020-05-15 04:52:11
Here is how I would do the whole process.
Start with this data:
id,chat,created_at
23,"hi i'm at school",2018-11-09 02:11:24
23,"hey! how are you",2018-11-09 02:11:43
23,"hi mom",2018-11-09 03:13:22
24,"leaving home",2018-11-09 02:11:24
24,"not today",2018-11-09 02:11:43
24,"i'll be back",2018-11-10 03:13:22
25,"yesterday i had",2018-11-09 02:11:24
25,"it's to hot",2018-11-09 02:11:43
25,"see you later",2018-11-12 03:13:22
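Where no clipboard is available (a headless server, say), the same rows can be loaded from an in-memory string instead. This is a minimal sketch: the variable name `raw` is an assumption made for illustration, and `parse_dates` performs the datetime conversion at load time.

```python
import io

import pandas as pd

# The sample rows from this answer, held in a string (hypothetical name: raw).
raw = """id,chat,created_at
23,"hi i'm at school",2018-11-09 02:11:24
23,"hey! how are you",2018-11-09 02:11:43
23,"hi mom",2018-11-09 03:13:22
24,"leaving home",2018-11-09 02:11:24
24,"not today",2018-11-09 02:11:43
24,"i'll be back",2018-11-10 03:13:22
25,"yesterday i had",2018-11-09 02:11:24
25,"it's to hot",2018-11-09 02:11:43
25,"see you later",2018-11-12 03:13:22
"""

# read_csv accepts any file-like object; parse_dates turns created_at into datetimes.
df = pd.read_csv(io.StringIO(raw), parse_dates=["created_at"])
print(df.dtypes)
```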
import pandas as pd

# create the dataframe with this data on the clipboard
df = pd.read_clipboard(sep=',')

Set created_at to datetime:
df.created_at = pd.to_datetime(df.created_at)

Create word_count:
df['word_count'] = df.chat.str.split(' ').map(len)

Use groupby with agg to get all chat as one string, created_at as a list, and word_count as a sum:
df = df.groupby('id').agg({'chat': ','.join, 'created_at': list, 'word_count': 'sum'}).reset_index()

Calculate chat_duration:
df['chat_duration'] = df['created_at'].apply(lambda x: (max(x) - min(x)).days)

Convert created_at to the desired string format (if this step is skipped, created_at stays a list of datetimes):
df['created_at'] = df['created_at'].apply(lambda x: ','.join([y.strftime("%m/%d/%Y %H:%M:%S") for y in x]))

Final df:
| | id | chat | created_at | word_count | chat_duration |
|---:|-----:|:------------------------------------------|:------------------------------------------------------------|-------------:|----------------:|
| 0 | 23 | hi i'm at school,hey! how are you,hi mom | 11/09/2018 02:11:24,11/09/2018 02:11:43,11/09/2018 03:13:22 | 10 | 0 |
| 1 | 24 | leaving home,not today,i'll be back | 11/09/2018 02:11:24,11/09/2018 02:11:43,11/10/2018 03:13:22 | 7 | 1 |
| 2 | 25 | yesterday i had,it's to hot,see you later | 11/09/2018 02:11:24,11/09/2018 02:11:43,11/12/2018 03:13:22 | 9 | 3 |

Posted on 2020-05-15 04:25:43
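After a few tries, I got it.
First, convert the string to a list:
from datetime import datetime

# str.split already returns a list of timestamp strings for each row
df['created_at'] = df['created_at'].str.split(',')
Then parse the largest and smallest date items and subtract them:
# the format must include the time of day to match the stored strings; .date() keeps whole calendar days
fmt = '%Y-%m-%d %H:%M:%S'
df['created_at'] = df['created_at'].apply(lambda s: (
    datetime.strptime(max(s), fmt).date() - datetime.strptime(min(s), fmt).date()).days)

Posted on 2020-05-15 04:39:09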
Create a DataFrame with split, then subtract the first column from the last one after converting both to datetimes:
df1 = df['created_at'].str.split(',', expand=True).ffill(axis=1)
df['created_at'] = (pd.to_datetime(df1.iloc[:, -1]) - pd.to_datetime(df1.iloc[:, 0])).dt.days

https://stackoverflow.com/questions/61811034
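The split-and-subtract approach can be tried end to end on the grouped rows from the question. The sketch below rests on two assumptions that are not part of the answer as posted: timestamps are normalized to midnight, so the subtraction counts calendar days rather than elapsed 24-hour periods, and same-day chats are clamped to 1, as the question asks. The result goes into a new chat_duration column instead of overwriting created_at.

```python
import pandas as pd

# Grouped sample rows copied from the question's table.
df = pd.DataFrame({
    "id": [23, 24, 25],
    "created_at": [
        "2018-11-09 02:11:24,2018-11-09 02:11:43,2018-11-09 03:13:22",
        "2017-11-03 18:05:34,2017-11-06 18:03:22",
        "2017-11-07 09:18:01,2017-11-18 11:09:37",
    ],
})

# One column per timestamp; ffill copies the last real value into the padding,
# so the final column always holds each row's last timestamp.
parts = df["created_at"].str.split(",", expand=True).ffill(axis=1)

# Assumption: normalize() drops the time of day so whole calendar days are counted.
first = pd.to_datetime(parts.iloc[:, 0]).dt.normalize()
last = pd.to_datetime(parts.iloc[:, -1]).dt.normalize()

# Assumption: a chat that starts and ends on the same day counts as 1.
df["chat_duration"] = (last - first).dt.days.clip(lower=1)

print(df[["id", "chat_duration"]])
```

With these sample rows the column comes out as 1, 3, and 11, matching the expected output in the question. Without the normalization, id 24 would give 2, because its last message falls just short of three full 24-hour periods after the first.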