I am trying to unpack a list column from a DataFrame. My current data looks like this:
CreationDate
2013-12-22 15:25:02 <ubuntu><mac-osx><syslinux>
2009-12-14 14:29:32 <ubuntu><mod-rewrite><laconica><apache-2.2>
2013-12-22 15:42:00 <ubuntu><nat><squid><mikrotik>
Name: Tags, dtype: object
Then I clean up the tag strings in the Tags column:
def tag_cleaner(s):
    s0 = "".join(s.split("<")).split(">")
    return [i for i in s0 if i != ""]
df["Tags"] = df["Tags"].apply(tag_cleaner)
df["NumTags"] = df["Tags"].apply(len)
The result is:
CreationDate
2013-12-22 15:25:02 [ubuntu, mac-osx, syslinux] 3
2009-12-14 14:29:32 [ubuntu, mod-rewrite, laconica, apache-2.2] 4
2013-12-22 15:42:00 [ubuntu, nat, squid, mikrotik] 4
Now I create a new column for each tag position:
tag_df = pd.DataFrame(index=df.index, data=df["Tags"])
max_cols = tag_df["Tags"].map(len).max()
for col in range(max_cols):
    tag_df[col] = pd.Series(index=tag_df.index)
This gives me:
CreationDate
2013-12-22 15:25:02 [ubuntu, mac-osx, syslinux] NaN NaN NaN NaN NaN
2009-12-14 14:29:32 [ubuntu, mod-rewrite, laconica, apache-2.2] NaN NaN NaN NaN NaN
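2013-12-22 15:42:00 [ubuntu, nat, squid, mikrotik] NaN NaN NaN NaN NaN
For each tag in the Tags column, I want to insert the tag into the column that matches its position in the list. The final result should look like this: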
CreationDate
2013-12-22 15:25:02 [ubuntu, mac-osx, syslinux] ubuntu mac-osx syslinux NaN NaN
2009-12-14 14:29:32 [ubuntu, mod-rewrite, laconica, apache-2.2] ubuntu mod-rewrite laconica apache-2.2 NaN
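2013-12-22 15:42:00 [ubuntu, nat, squid, mikrotik] ubuntu nat squid mikrotik NaN
I have tried pd.DataFrame.insert() and various ways of creating new DataFrames and merging them together, but I can't seem to find the right combination. How can I flatten each object in the Tags column into the corresponding columns of the same row?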
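To make the setup above reproducible, here is a minimal sketch that rebuilds the sample data and the cleaned Tags lists. The name `s` and the way the Series is constructed are assumptions for illustration only; the post does not show how the data was loaded.

```python
import pandas as pd

# Sample data copied from the question. `s` is a hypothetical name,
# and building the Series by hand is an assumption about the input.
s = pd.Series(
    [
        "<ubuntu><mac-osx><syslinux>",
        "<ubuntu><mod-rewrite><laconica><apache-2.2>",
        "<ubuntu><nat><squid><mikrotik>",
    ],
    index=pd.to_datetime(
        ["2013-12-22 15:25:02", "2009-12-14 14:29:32", "2013-12-22 15:42:00"]
    ),
    name="Tags",
)
s.index.name = "CreationDate"


def tag_cleaner(text):
    """Split '<a><b>' into ['a', 'b'], dropping empty pieces."""
    pieces = "".join(text.split("<")).split(">")
    return [p for p in pieces if p != ""]


df = s.to_frame()                          # one-column DataFrame named "Tags"
df["Tags"] = df["Tags"].apply(tag_cleaner)  # strings -> lists of tags
df["NumTags"] = df["Tags"].apply(len)       # number of tags per row
print(df)
```

Running this should leave `df` in the same state as the second listing in the question, with one list of tags and one tag count per row.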
Posted on 2016-12-27 08:51:27
In this case I would use the .str.extractall() method:
In [57]: df
Out[57]:
CreationDate Tags
0 2013-12-22 15:25:02 <ubuntu><mac-osx><syslinux>
1 2009-12-14 14:29:32 <ubuntu><mod-rewrite><laconica><apache-2.2>
2 2013-12-22 15:42:00 <ubuntu><nat><squid><mikrotik>
In [58]: x = df.pop('Tags').str.extractall(r'\<(.*?)\>').unstack()
In [59]: x.columns = x.columns.droplevel(0)
In [60]: df.join(x)
Out[60]:
CreationDate 0 1 2 3
0 2013-12-22 15:25:02 ubuntu mac-osx syslinux None
1 2009-12-14 14:29:32 ubuntu mod-rewrite laconica apache-2.2
2 2013-12-22 15:42:00 ubuntu nat squid mikrotik
UPDATE: assuming the data is a Series rather than a DataFrame:
In [14]: s
Out[14]:
CreationDate
2013-12-22 15:25:02 <ubuntu><mac-osx><syslinux>
2009-12-14 14:29:32 <ubuntu><mod-rewrite><laconica><apache-2.2>
2013-12-22 15:42:00 <ubuntu><nat><squid><mikrotik>
Name: Tags, dtype: object
In [15]: type(s)
Out[15]: pandas.core.series.Series
In [16]: x = s.str.extractall(r'\<(.*?)\>').unstack().rename_axis(None)
In [17]: x.columns = x.columns.droplevel(0)
In [18]: x
Out[18]:
match 0 1 2 3
2009-12-14 14:29:32 ubuntu mod-rewrite laconica apache-2.2
2013-12-22 15:25:02 ubuntu mac-osx syslinux None
2013-12-22 15:42:00 ubuntu nat squid mikrotik
Posted on 2016-12-27 05:37:07
A partial solution that converts the tags to lists and gets their lengths.
df.Tags = df.Tags.str.strip('<>')
df.Tags = df.Tags.str.split('><')
df['NumTags'] = df.Tags.apply(len)
Working solution
Uncomment the commented lines, copy them to the clipboard, then comment them out again. Then run the code.
import pandas as pd
# CreationDate
# 2013-12-22 15:25:02 <ubuntu><mac-osx><syslinux>
# 2009-12-14 14:29:32 <ubuntu><mod-rewrite><laconica><apache-2.2>
# 2013-12-22 15:42:00 <ubuntu><nat><squid><mikrotik>
df = pd.read_clipboard()
df2 = df.copy()
df2.CreationDate = df2.CreationDate.str.strip('<>')
df2.CreationDate = df2.CreationDate.str.split('><')
df2['Length'] = df2.CreationDate.apply(lambda x: len(x))
for a in range(df2.Length.max()):
    df2[a] = df2.CreationDate.apply(lambda x: x[a] if a < len(x) else 'NaN')
df2
Output:
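As a point of comparison with the loop above, the same strip-and-split idea can be sketched without the clipboard and without an explicit loop by passing expand=True to .str.split(). This is an alternative sketch, not the answer's own code: the names `tags` and `wide` are hypothetical, and short rows end up with real missing values rather than the string 'NaN'.

```python
import pandas as pd

# Sample tag strings from the question; `tags` is a hypothetical name.
tags = pd.Series(
    [
        "<ubuntu><mac-osx><syslinux>",
        "<ubuntu><mod-rewrite><laconica><apache-2.2>",
        "<ubuntu><nat><squid><mikrotik>",
    ],
    name="Tags",
)

# Strip the outer angle brackets, then split on the '><' separators.
# expand=True returns a DataFrame with one column per tag position;
# rows with fewer tags are padded with missing values.
wide = tags.str.strip("<>").str.split("><", expand=True)

# Count the tags per row. The right-hand side is evaluated before the
# new column is added, so only the tag columns are counted.
wide["Length"] = wide.notna().sum(axis=1)
print(wide)
```

If the 'NaN' placeholder strings from the loop version are wanted instead of missing values, `wide.fillna('NaN')` could be applied to the tag columns before counting.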

https://stackoverflow.com/questions/41339388