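I downloaded a .csv file from a SQL database in which the values of one column are lists of strings.
However, some of the values have no quotation marks around them: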
user_id B value
0 a1 3 {no_quotations} #no quotation mark around this word
1 a2 4 {"with quotations", no_quotations} #one with quotations, one without
2 a3 1 {"with quotations", "with quotations 2"} #UPDATE quotations have spaces in them.
3 a4 1 {no_quotations, no_quotations2} #2 without quotations marks
4 a6 1 {"with quotations"} #UPDATE quotations have spaces in them
The words are written as follows:
underscores in no_quotations
spaces between the words in the quoted values
I would like to convert the column values into dummy columns. Expected output:

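The values in the column may change in later iterations, so I don't want the code to depend on the exact strings of the values.
I can do this for the values that have quotation marks, but not for all of them: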
import pandas as pd
import ast
df = pd.read_csv("/content/SQLDB_quotations_in_literal_example.csv")
df_ = df.iloc[[2,4]]['value'].apply(lambda x: list(ast.literal_eval(x)))
pd.merge(df, pd.get_dummies(df_.apply(pd.Series).stack()).groupby(level=0).sum(), left_index=True, right_index=True, how='outer' )
OUT:
user_id B value with quotations \
0 a1 3 {no_quotations} NaN
1 a2 4 {"with quotations", no_quotations} NaN
2 a3 1 {"with quotations", "with quotations 2"} 1.0
3 a4 1 {no_quotations, no_quotations_2} NaN
4 a6 1 {"with quotations"} 1.0
with quotations 2
0 NaN
1 NaN
2 1.0
3 NaN
4 0.0
Other important information: the quoted strings ("with quotations") contain spaces; the no_quotations strings contain no spaces.
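The attempt above is limited to rows 2 and 4 because ast.literal_eval only accepts Python literals: a bare word such as no_quotations is parsed as a name, not a string, and is rejected. A minimal sketch of that behaviour, using the five values from the question (the helper try_parse is a hypothetical name introduced only for this illustration):

```python
import ast

# The five 'value' strings from the question.
values = [
    '{no_quotations}',
    '{"with quotations", no_quotations}',
    '{"with quotations", "with quotations 2"}',
    '{no_quotations, no_quotations2}',
    '{"with quotations"}',
]

def try_parse(text):
    """Hypothetical helper: return the parsed set as a sorted list,
    or None if ast.literal_eval rejects the text."""
    try:
        return sorted(ast.literal_eval(text))
    except (ValueError, SyntaxError):
        # Bare names such as no_quotations are not literals.
        return None

parsed = [try_parse(v) for v in values]
for text, result in zip(values, parsed):
    print(text, "->", result)
```

Only the rows whose elements are all quoted come back as lists; every row containing an unquoted word yields None here.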
Posted on 2022-09-08 08:07:10
One approach uses str.extractall and pivot_table:
out = df.join(df['value']
    .str.extractall(r'(\w+)')[0]
    .droplevel(1).reset_index(name='col').assign(value=1)
    .pivot_table(index='index', columns='col', values='value', fill_value=0)
)
An alternative with str.get_dummies:
out = df.join(df['value']
    .str.extractall(r'(\w+)')[0]
    .groupby(level=0).agg('|'.join).str.get_dummies()
)
Output:
user_id B value no_quotations no_quotations2 with_quotations with_quotations2
0 a1 3 {no_quotations} 1 0 0 0
1 a2 4 {"with_quotations", no_quotations} 1 0 1 0
2 a3 1 {"with_quotations", "with_quotations2"} 0 0 1 1
3 a4 1 {no_quotations, no_quotations2} 1 1 0 0
4 a6 1 {"with_quotations"} 0 0 1 0
Variant: allowing spaces in the words:
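Use the following regex in extractall: r'([^",{}]*[^",{}\s])'.
Output (an unquoted value that follows a comma keeps its leading space, so the first two columns are " no_quotations" and " no_quotations2"):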
user_id B value no_quotations no_quotations2 no_quotations with quotations with quotations 2
0 a1 3 {no_quotations} 0 0 1 0 0
1 a2 4 {"with quotations", no_quotations} 1 0 0 1 0
2 a3 1 {"with quotations", "with quotations 2"} 0 0 0 1 1
3 a4 1 {no_quotations, no_quotations2} 0 1 1 0 0
4 a6 1 {"with quotations"} 0 0 0 1 0
https://stackoverflow.com/questions/73645570
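With the space-tolerant regex, a token that follows ", " is captured together with its leading space, which is why the variant output has separate columns for the same word. The sketch below is one possible adjustment, not part of the original answer: it strips each extracted token before building the dummies. The DataFrame is rebuilt inline from the question's sample rows so the snippet runs on its own, and it assumes no token contains the "|" character, since that is the separator str.get_dummies splits on.

```python
import pandas as pd

# Sample rows from the question, rebuilt inline (assumption: this matches the CSV).
df = pd.DataFrame({
    "user_id": ["a1", "a2", "a3", "a4", "a6"],
    "B": [3, 4, 1, 1, 1],
    "value": [
        '{no_quotations}',
        '{"with quotations", no_quotations}',
        '{"with quotations", "with quotations 2"}',
        '{no_quotations, no_quotations2}',
        '{"with quotations"}',
    ],
})

# Extract every run of characters that is not a quote, comma, or brace
# and that ends in a non-space character, then strip leading whitespace.
tokens = (
    df["value"]
    .str.extractall(r'([^",{}]*[^",{}\s])')[0]
    .str.strip()
)

# Join the tokens of each original row with '|' and expand them into 0/1 columns.
dummies = tokens.groupby(level=0).agg("|".join).str.get_dummies()

out = df.join(dummies)
print(out)
```

After stripping, " no_quotations" and "no_quotations" fall into a single column, so the result has four dummy columns instead of five.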