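I want to split strings into equal-length pieces, without any delimiter, and expand the DataFrame accordingly.
Here is the test DataFrame I am using:
import pandas as pd

sample1 = pd.DataFrame({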
'TST': {1: 1535840000000, 2: 1535840000000},
'RCV': {1: 1535840000000, 2: 1535850000000},
'TCU': {1: 358272000000000, 2: 358272000000000},
'SPD': {1: '0', 2: '00000000000000710000007D007C00E2'}
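})

As you can see, the SPD column contains strings of varying length, with no delimiter.
I would like to split the SPD column every 4 characters into new rows, and expand the other columns of the DataFrame to match: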
TST RCV TCU SPD
0 1535840000000 1535840000000 358272000000000 0000
1 1535840000000 1535840000000 358272000000000 0000
2 1535840000000 1535840000000 358272000000000 0000
3 1535840000000 1535840000000 358272000000000 0071
4 1535840000000 1535840000000 358272000000000 0000
5 1535840000000 1535840000000 358272000000000 007D
6 1535840000000 1535840000000 358272000000000 007C
7 1535840000000 1535840000000 358272000000000 00E2

I first tried to build a Series with the code below:
import re

pd.concat([pd.Series(re.findall('....', row['SPD'])) for _, row in sample1.iterrows()]).reset_index()

This gives:
index 0
0 0 0000
1 1 0000
2 2 0000
3 3 0071
4 4 0000
5 5 007D
6 6 007C
7 7 00E2

But I can't expand this back onto sample1.

Posted on 2019-06-04 22:57:05
You can use str.findall to split each string into 4-character chunks, then flatten the resulting list column with the unnesting helper from the linked solution.
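The unnesting helper comes from the linked answer and is not reproduced here. As a rough stand-in, assuming pandas 0.25 or later (where DataFrame.explode exists), something like the following should behave similarly for a single list column. The function body and the small demo frame are my own sketch, not the linked author's code:

```python
import pandas as pd

def unnesting(df, explode_cols):
    """Hypothetical stand-in for the linked helper (single list column only)."""
    if len(explode_cols) != 1:
        # Assumption: this sketch only covers the one-column case used in this answer.
        raise NotImplementedError("this sketch handles exactly one column")
    # explode() emits one row per list element and repeats the other columns;
    # the original index labels are kept, so they repeat as well.
    return df.explode(explode_cols[0])

# Tiny illustrative frame (made-up ids, SPD already split into lists).
demo = pd.DataFrame({'id': [1, 2], 'SPD': [['0000'], ['0071', '007D']]})
flat = unnesting(demo, ['SPD'])
```

Note that explode keeps the columns in their original order and leaves the index labels repeated; call reset_index(drop=True) afterwards if a fresh 0..n-1 index is wanted.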
sample1['SPD'] = sample1.SPD.str.ljust(4, '0').str.findall(r'.{4}?')
unnesting(sample1, ['SPD'])
SPD TST RCV TCU
1 0000 1535840000000 1535840000000 358272000000000
2 0000 1535840000000 1535850000000 358272000000000
2 0000 1535840000000 1535850000000 358272000000000
2 0000 1535840000000 1535850000000 358272000000000
2 0071 1535840000000 1535850000000 358272000000000
2 0000 1535840000000 1535850000000 358272000000000
2 007D 1535840000000 1535850000000 358272000000000
2 007C 1535840000000 1535850000000 358272000000000
2 00E2 1535840000000 1535850000000 358272000000000

Posted on 2019-06-04 22:55:38
You can use str.findall, then repeat each row according to the number of 4-character chunks found in SPD.
from itertools import chain
df = sample1.copy()  # work on a copy, because pop() removes SPD in place
spd4 = df.pop('SPD').str.findall(r'.{4}')
(pd.DataFrame(df.values.repeat(spd4.str.len(), axis=0), columns=df.columns)
.assign(SPD=list(chain.from_iterable(spd4))))
TST RCV TCU SPD
0 1535840000000 1535850000000 358272000000000 0000
1 1535840000000 1535850000000 358272000000000 0000
2 1535840000000 1535850000000 358272000000000 0000
3 1535840000000 1535850000000 358272000000000 0071
4 1535840000000 1535850000000 358272000000000 0000
5 1535840000000 1535850000000 358272000000000 007D
6 1535840000000 1535850000000 358272000000000 007C
7 1535840000000 1535850000000 358272000000000 00E2

Posted on 2019-06-04 23:46:07
Use Series.str.extractall, then join the result back to the original df.
sample1.filter(regex='^(?!SPD)').join(
sample1.SPD.str.extractall('(?P<SPD>.{4})').reset_index(level=1, drop=True)
)
# TST RCV TCU SPD
#1 1535840000000 1535840000000 358272000000000 NaN
#2 1535840000000 1535850000000 358272000000000 0000
#2 1535840000000 1535850000000 358272000000000 0000
#2 1535840000000 1535850000000 358272000000000 0000
#2 1535840000000 1535850000000 358272000000000 0071
#2 1535840000000 1535850000000 358272000000000 0000
#2 1535840000000 1535850000000 358272000000000 007D
#2 1535840000000 1535850000000 358272000000000 007C
#2 1535840000000 1535850000000 358272000000000 00E2

Use an inner join (..., how='inner') if you want to exclude rows whose SPD has fewer than 4 characters.
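The inner-join variant mentioned above could be sketched as follows. The snippet rebuilds sample1 so that it runs on its own, and it uses drop(columns='SPD') in place of the filter call purely for readability; that substitution is mine, not the answerer's:

```python
import pandas as pd

sample1 = pd.DataFrame({
    'TST': {1: 1535840000000, 2: 1535840000000},
    'RCV': {1: 1535840000000, 2: 1535850000000},
    'TCU': {1: 358272000000000, 2: 358272000000000},
    'SPD': {1: '0', 2: '00000000000000710000007D007C00E2'},
})

# One row per 4-character match; dropping the 'match' level leaves the
# original row label as the index, so it can be joined back on.
chunks = (sample1['SPD']
          .str.extractall(r'(?P<SPD>.{4})')
          .reset_index(level=1, drop=True))

# how='inner' keeps only the rows that produced at least one chunk,
# so row 1 (SPD == '0') is left out instead of appearing with NaN.
result = sample1.drop(columns='SPD').join(chunks, how='inner')
```

With the default left join, row 1 stays in the result with NaN in SPD, as in the output shown earlier in this answer.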
https://stackoverflow.com/questions/56446220