。
样本数据是:(data.csv)
user_id country option_values option_labels
abc456 Germany 256gb SSD
abc123 Brazil i5 intel
xyz456 France 128gb SSD
xyz123 Turkey i7 intel
abc123 Brazil 2gb nvidia
abc456 Germany 32gb RAM
xyz123 Turkey 4gb nvidia
xyz456 France 16gb RAM样本输出如下:
user_id country option_values option_labels intel nvidia SSD RAM
abc456 Germany 256gb SSD 0 0 256gb 0
abc123 Brazil i5 intel i5 0 0 0
xyz456 France 256gb SSD 0 0 128gb 0
xyz123 Turkey i7 intel i7 0 0 0
abc123 Brazil 2gb nvidia 0 2gb 0 0
abc456 Germany 32gb RAM 0 0 0 32gb
xyz123 Turkey 4gb nvidia 0 4gb 0 0
xyz456 France 16gb RAM 0 0 0 16gb我用下面的示例代码完成了这个过程,
import pandas as pd
import numpy as np
data_sample = pd.read_csv("data.csv")
feature_list = data_sample["option_label"].unique().tolist()
user_list = data_sample["user_id"].unique().tolist()
country_list = data_sample["country"].unique().tolist()
opt_val_list = data_sample["opt_val"].unique().tolist()
def filterd_id(check_id):
single_id_data= data_sample[data_sample['user_id'] == check_id]
return single_id_data
def finding_features(single_id_data):
user_features = single_id_data["option_labels"].unique().tolist()
return user_features
def check_feature(feature_list, user_features):
feature_prs_not = []
for i in feature_list:
if(i in user_features):
result = opt_val_list
else:
result = 0
feature_prs_not.append(result)
return feature_prs_not
user_id = []
country = []
for i in user_list:
check_id = i
user_id.append(i)
single_id_data = filterd_id(check_id)
c = single_id_data["country"].unique().tolist()
country.append(c)
user_features = finding_features(single_id_data)
feature_prst_not = check_feature(feature_list,user_features)
df = pd.DataFrame([feature_prst_not], columns = feature_list)
df_feature = df_feature.append(df)
df_user_id = pd.DataFrame(user_id, columns=['all_user_id'])
df_country = pd.DataFrame(country, columns=['country_name'])跑步要花更多的时间(如)。8-9小时),我的原始数据,近100 k ids。我仍然处于Python的学习阶段,我现在试图优化以减少脚本的运行时间。
发布于 2020-06-01 10:02:33
如果你想要它更快,你需要矢量化。我相信这段代码产生的输出与您的输出相同。
import numpy as np
for val in df['option_labels'].unique():
df[val] = np.where(df['option_labels']==val, df['option_values'], 0)我就是这样复制你的数据的
from io import StringIO
df = pd.read_csv(StringIO('''
"user_id","country","option_values","option_labels"
"abc456","Germany","256gb","SSD"
"abc123","Brazil","i5","intel"
"xyz456","France","128gb","SSD"
"xyz123","Turkey","i7","intel"
"abc123","Brazil","2gb","nvidia"
"abc456","Germany","32gb","RAM"
"xyz123","Turkey","4gb","nvidia"
"xyz456","France","16gb","RAM"'''))https://stackoverflow.com/questions/62129137
复制相似问题