Basically, my data file looks like this:
id | refers
----------------
1 | [2,3]
2 | [1,3]
3 | []

I want to add another column that shows how many times each id is referenced by other ids. For example:
id | refers | referred_count
----------------------------------
1 | [2,3] | 1
2 | [1,3] | 1
3 | [] | 2

My current code is as follows:
citations_dict = {}
for index, row in data_ref.iterrows():
    if len(row['reference_list']) > 0:
        for reference in row['reference_list']:
            if reference not in citations_dict:
                citations_dict[reference] = {}
                d = data_ref.loc[data_ref['id'] == reference]
                citations_dict[reference]['venue'] = d['venue']
                citations_dict[reference]['reference'] = d['reference']
                citations_dict[reference]['citation'] = 1
            else:
                citations_dict[reference]['citation'] += 1

The problem is that this code takes a very long time to run. I'd like to know how to make it faster, maybe with pandas?
Posted on 2018-09-23 13:13:47
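Count how many times each ID appears in the refers column, store the counts in a dictionary-like object, and look each id up in it when creating the new column.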
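import pandas as pd
from collections import Counter

df = pd.DataFrame({'id': [1, 2, 3], 'refers': [[2, 3], [1, 3], []]})
# Flatten all reference lists and count how often each id occurs
counter = Counter(item for sublist in df['refers'] for item in sublist)
# A Counter returns 0 for ids that are never referenced (a plain dict would raise KeyError)
df['refer_counts'] = df['id'].apply(lambda x: counter[x])

Output: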
id refers refer_counts
0 1 [2, 3] 1
1 2 [1, 3] 1
2 3 [] 2

I think this is exactly what you need!
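
As a possible alternative to the Counter approach above, here is a minimal sketch that stays inside pandas. It assumes pandas 0.25 or later, where `Series.explode` is available. The fourth row (id 4, which nothing refers to) is a hypothetical addition to the sample data, included only to show how unreferenced ids end up with 0.

```python
import pandas as pd

# Sample data from the question, plus a hypothetical id 4 that nobody refers to
df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'refers': [[2, 3], [1, 3], [], []]})

# explode() yields one row per list element; empty lists become NaN,
# which value_counts() skips by default
counts = df['refers'].explode().value_counts().to_dict()

# Look up each id; ids missing from the dict map to NaN, so fill with 0
df['referred_count'] = df['id'].map(counts).fillna(0).astype(int)
print(df)
```

Because the lookup goes through `map` on the id column, the row order and index of df stay unchanged.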
Posted on 2018-09-23 13:16:24
Data:
df = pd.DataFrame({'id': [1,2,3], 'refers': [[1,2,3], [1,3], []]})
id refers referred_count
0 1 [1, 2, 3] 1
1 2 [1, 3] 1
2 3 [] 2

Create a dictionary of reference counts:
refer_count = df.refers.apply(pd.Series).stack()\
    .reset_index(drop=True)\
    .astype(int)\
    .value_counts()\
    .to_dict()

Then subtract each id's references to itself from its refer_count:
# .get(..., 0) avoids a KeyError for ids that are never referenced
df['referred_count'] = df.apply(lambda x: refer_count.get(x['id'], 0) - x['refers'].count(x['id']), axis=1)

Output:
id refers referred_count
0 1 [1, 2, 3] 1
1 2 [1, 3] 1
2 3 [] 2

Posted on 2018-09-23 13:19:06
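First, create a helper Series using numpy.hstack and Series.value_counts.
Its index holds the values of the 'id' column, and its values are the referred_count.
Then set the index of df to id with set_index so that the helper Series aligns on it when assigned, and finally call reset_index to restore the DataFrame to its original shape.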
import numpy as np

s = pd.Series(np.hstack(df['refers'])).value_counts()
df.set_index('id').assign(referred_count=s).reset_index()

Output:
id refers referred_count
0 1 [2, 3] 1
1 2 [1, 3] 1
2 3 [] 2

https://stackoverflow.com/questions/52466282
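
One caveat with the helper-Series approach: an id that is never referenced has no entry in the helper Series, so index alignment leaves NaN in its row. The sketch below shows one way to handle that, not a definitive fix. It swaps numpy.hstack for `itertools.chain` so the ids stay integers, and uses `reindex` with `fill_value=0`. The fourth row (id 4) is a hypothetical addition to the sample data.

```python
from itertools import chain

import pandas as pd

# Sample data plus a hypothetical id 4 that is never referenced
df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'refers': [[2, 3], [1, 3], [], []]})

# Flatten the lists in pure Python, then count occurrences of each id
s = pd.Series(list(chain.from_iterable(df['refers']))).value_counts()

# Reorder the counts to follow df['id']; ids with no count get 0
s = s.reindex(df['id'], fill_value=0)

# s is now in row order, so its values can be assigned positionally
result = df.assign(referred_count=s.to_numpy())
print(result)
```

Since the counts are reindexed to follow the id column, there is no need to call set_index and reset_index on the DataFrame itself.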