I have the following data:
NAME ETHNICITY_RECAT TOTAL_LENGTH 3LETTER_SUBSTRINGS
joseph fr 14 jos, ose, sep, eph
ann en 16 ann
anne ir 14 ann, nne
tom en 18 tom
tommy fr 16 tom, omm, mmy
ann ir 19 ann
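... more rows
The 3LETTER_SUBSTRINGS value is a string that holds all three-letter substrings of the NAME variable. I want to aggregate it into a single list per group, with every comma-separated item from every row appended as its own list item. Something like this: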
ETHNICITY_RECAT TOTAL_LENGTH 3LETTER_SUBSTRINGS
min max mean <lambda>
fr 2 26 13.22 [jos, ose, sep, eph, tom, omm, mmy, ...]
en 3 24 11.92 [ann, tom, ...]
ir 4 23 12.03 [ann, nne, ann, ...]
I got this "done" with the following code:
aggregations = {
    'TOTAL_LENGTH': [min, max, 'mean'],
    '3LETTER_SUBSTRINGS': lambda x: list(x),
}
self.df_agg = self.df.groupby('ETHNICITY_RECAT', as_index=False).agg(aggregations)
The problem is that a whole string such as "ann, anne" is treated as a single item in the final list, instead of each piece being its own item, e.g. "ann", "anne".
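I want to see which substrings occur most often, but when I run the following code I get the frequency of the whole strings (not of the individual 3-letter substrings):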
from collections import Counter
x = self.df_agg_eth[self.df_agg_eth['ETHNICITY_RECAT']=='en']['3LETTER_SUBSTRINGS']['<lambda>']
x_list = x[0]
c = Counter(x_list)
I get:
[('jos, ose, sep, eph', 19), ('ann, nee', 5), ...]
instead of what I want:
[('jos', 19), ('ose', 19), ('sep', 23), ('eph', 19), ('ann', 15), ('nee', 5), ...]
I tried:
'3LETTER_SUBSTRINGS': lambda x: list(i) for i in x.split(', '),
but it reports invalid syntax.
Answered on 2019-11-07 17:23:19
The first thing to do is convert the strings to lists; after that it is just a groupby with agg:
df['3LETTER_SUBSTRINGS'] = df['3LETTER_SUBSTRINGS'].str.split(', ')
df.groupby('ETHNICITY_RECAT').agg({'TOTAL_LENGTH': ['min', 'max', 'mean'],
                                   '3LETTER_SUBSTRINGS': 'sum'})
Output:
                TOTAL_LENGTH            3LETTER_SUBSTRINGS
                min  max  mean          sum
ETHNICITY_RECAT
en              16   18   17.0          [ann, tom]
fr              14   16   15.0          [jos, ose, sep, eph, tom, omm, mmy]
ir              14   19   16.5          [ann, nne, ann]
Answered on 2019-11-07 17:14:24
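I think most of your code is fine; you have just misread the error: it has nothing to do with string conversion. Each cell of the 3LETTER_SUBSTRING column holds a list/tuple. When you use the lambda x: list(x) function, you create a list of tuples. So there is nothing to gain from something like split(","), converting to a string and then back to a list.
Instead, you only need to unnest the lists when building the new list. Here is a small reproducible example: (note that I focus on your tuple/aggregation problem, since I am sure you will quickly work out the rest of the code)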
import pandas as pd
# Create some data
names = [("joseph","fr"),("ann","en"),("anne","ir"),("tom","en"),("tommy","fr"),("ann","fr")]
df = pd.DataFrame(names, columns=["NAMES","ethnicity"])
df["3LETTER_SUBSTRING"] = df["NAMES"].apply(lambda name: [name[i:i+3] for i in range(len(name) - 2)])
print(df)
# Aggregate the 3LETTER per ethnicity, and unnest the result in a new table for each ethnicity:
df.groupby('ethnicity').agg({
    "3LETTER_SUBSTRING": lambda x: [z for y in x for z in y]
})
Using the Counter as you specified, I get:
dfg = df.groupby('ethnicity', as_index=False).agg({
    "3LETTER_SUBSTRING": lambda x: [z for y in x for z in y]
})
from collections import Counter
print(Counter(dfg[dfg["ethnicity"] == "en"]["3LETTER_SUBSTRING"][0]))
# Counter({'ann': 1, 'tom': 1})
To get it as a list of tuples, just use the built-in dictionary methods (such as dict.items()).
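As a minimal sketch of that last step, here is how a Counter turns into a list of (substring, count) tuples. The `substrings` list is a hypothetical stand-in for one already flattened cell of the aggregated column; it is not taken from your real data.

```python
from collections import Counter

# Hypothetical flattened list, standing in for one cell of the aggregated column
substrings = ["jos", "ose", "sep", "eph", "tom", "omm", "mmy", "ann", "tom"]

c = Counter(substrings)

# items() yields (substring, count) tuples in first-seen order
pairs = list(c.items())

# most_common(n) yields the n most frequent tuples, highest count first
top = c.most_common(1)
print(top)  # [('tom', 2)]
```

If you only care about the most frequent substrings, most_common() is probably more convenient than sorting the result of items() yourself.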
UPDATE: if the substrings are stored as pre-formatted strings, as in your data, it looks like this:
import pandas as pd
# Create some data
names = [("joseph","fr","jos, ose, sep, eph"),("ann","en","ann"),("anne","ir","ann, nne"),("tom","en","tom"),("tommy","fr","tom, omm, mmy"),("ann","fr","ann")]
df = pd.DataFrame(names, columns=["NAMES","ethnicity","3LETTER_SUBSTRING"])
def transform_3_letter_to_table(x):
    """
    Update this function with regard to your data format
    """
    return x.split(", ")
df["3LETTER_SUBSTRING"] = df["3LETTER_SUBSTRING"].apply(transform_3_letter_to_table)
print(df)
# Applying aggregation
dfg = df.groupby('ethnicity', as_index=False).agg({
    "3LETTER_SUBSTRING": lambda x: [z for y in x for z in y]
})
print(dfg)
# test on some data
from collections import Counter
c = Counter(dfg[dfg["ethnicity"] == "en"]["3LETTER_SUBSTRING"][0])
print(c)
print(list(c.items()))
https://stackoverflow.com/questions/58753277