Q: Loop over pandas columns to get WMD similarity

Stack Overflow user

Asked on 2020-11-30 16:57:09

Answers: 1, Views: 76, Followers: 0, Votes: 0

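I have two DataFrames, each with two columns. I want to use WMD (Word Mover's Distance) to find, for every entity in column source_label, the closest match among the entities in column target_label. In the end I would like a single DataFrame containing all 4 columns, aligned per entity.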

df1

,source_label,source_uri
'neuronal ceroid lipofuscinosis 8',"http://purl.obolibrary.org/obo/DOID_0110723"
'autosomal dominant distal hereditary motor neuronopathy',"http://purl.obolibrary.org/obo/DOID_0111198"

df2

,target_label,target_uri
'neuronal ceroid ',"http://purl.obolibrary.org/obo/DOID_0110748"
'autosomal dominanthereditary',"http://purl.obolibrary.org/obo/DOID_0111110"

Expected result

,source_label, target_label, source_uri, target_uri, wmd score
'neuronal ceroid lipofuscinosis 8', 'neuronal ceroid ', "http://purl.obolibrary.org/obo/DOID_0110723", "http://purl.obolibrary.org/obo/DOID_0110748", 0.98
'autosomal dominant distal hereditary motor neuronopathy', 'autosomal dominanthereditary', "http://purl.obolibrary.org/obo/DOID_0111198", "http://purl.obolibrary.org/obo/DOID_0111110", 0.65

The DataFrames are so large that I am looking for a faster way to iterate over the two label columns. So far I have tried the following:

list_distances = []
temp = []

def preprocess(sentence):
    return [w for w in sentence.lower().split()]

entity = df1['source_label']
target = df2['target_label']

for i in tqdm(entity):
    for j in target:
        wmd_distance = model.wmdistance(preprocess(i), preprocess(j))
        temp.append(wmd_distance)
    list_distances.append(min(temp))
# print("list_distances", list_distances)
WMD_Dataframe = pd.DataFrame({'source_label': pd.Series(entity),
                              'target_label': pd.Series(target),
                              'source_uri': df1['source_uri'],
                              'target_uri': df2['target_uri'],
                              'wmd_Score': pd.Series(list_distances)}).sort_values(by=['wmd_Score'])
WMD_Dataframe = WMD_Dataframe.reset_index()

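First, this code does not work properly: the other two columns are taken directly from the DataFrames, so the relationship between each entity and its URI is lost. Second, how can I make it faster, given that there are millions of entities? Thanks in advance.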

Tags: numpy, gensim, word2vec, python, pandas

1 Answer

Stack Overflow user

Accepted answer

Posted on 2020-11-30 23:52:17

A quick solution:

import numpy as np
from tqdm import tqdm

closest_neighbour_index_df2 = []


def preprocess(sentence):
    # Lowercase and split on whitespace
    return sentence.lower().split()


# entity, target and model are defined as in the question
for i in tqdm(entity):
    temp = []  # reset for every source label
    for j in target:
        wmd_distance = model.wmdistance(preprocess(i), preprocess(j))
        temp.append(wmd_distance)
    # maybe assert to make sure it's always right
    # use argmin to get the position of the closest target rather than the distance itself
    closest_neighbour_index_df2.append(np.argmin(np.array(temp)))

# Add the positions from df2 to df1
df1['closest_neighbour'] = closest_neighbour_index_df2
# add information to the respective row from df2 using the closest_neighbour column
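
The last comment leaves the lookup into df2 open. One way to finish that step is sketched below, under some assumptions: the function name match_closest and the distance_fn parameter are made up for illustration, the column names are the lowercase ones from the expected result, and distance_fn stands in for model.wmdistance from the question. It also tokenizes each label only once. It is still a nested loop over all pairs, so it does not remove the quadratic cost for millions of entities.

```python
import numpy as np
import pandas as pd


def match_closest(df1, df2, distance_fn):
    """For each row of df1, attach the df2 row with the smallest distance.

    distance_fn(tokens_a, tokens_b) -> float is a hypothetical stand-in
    for model.wmdistance; any callable with that shape works.
    """
    # Tokenize every label once instead of once per pair.
    source_tokens = [s.lower().split() for s in df1["source_label"]]
    target_tokens = [t.lower().split() for t in df2["target_label"]]

    best_pos, best_dist = [], []
    for src in source_tokens:
        distances = np.array([distance_fn(src, tgt) for tgt in target_tokens])
        pos = int(np.argmin(distances))  # position in df2, not an index label
        best_pos.append(pos)
        best_dist.append(float(distances[pos]))

    # Select the matched df2 rows by position so label and URI stay together.
    matched = df2.iloc[best_pos].reset_index(drop=True)
    result = pd.concat([df1.reset_index(drop=True), matched], axis=1)
    result["wmd_score"] = best_dist
    return result[
        ["source_label", "target_label", "source_uri", "target_uri", "wmd_score"]
    ]
```

With the question's setup this would be called as match_closest(df1, df2, model.wmdistance), assuming model is a gensim word-vector model as in the question. Because the df2 rows are picked with iloc, each target_label keeps its own target_uri, which is the part the original attempt got wrong.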

Votes: 1

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/65070534
