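I am trying to count how many of the words (strings) in one pandas DataFrame column appear, in any order, in another column.
I tried the following, which is close, but it does not count the occurrences (it only tells me whether any of the words was found, in any order).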
words = '|'.join(df['Cluster Name'].unique())
df['frequency'] = df['Keyword'].str.contains(words).astype(int)

Minimal reproducible example:
import pandas as pd

data = {'Keyword': ['Nike', 'Nike Socks', 'Nike Stripy Socks', 'Socks Nike', 'Adidas Socks'],
        'Cluster': ['Nike Socks', 'Nike Socks', 'Nike Socks', 'Nike Socks', 'Nike Socks']}
# Create DataFrame
df = pd.DataFrame(data)

Expected output:
             Keyword     Cluster  Frequency
0               Nike  Nike Socks          1
1         Nike Socks  Nike Socks          2
2  Nike Stripy Socks  Nike Socks          2
3         Socks Nike  Nike Socks          2
4       Adidas Socks  Nike Socks          1

Answered 2022-01-08 19:33:05
You can write a custom function that takes a row as input, then run it row by row over the DataFrame using apply with the argument axis=1.
def count_keywords(row):
    # Count the words in 'Keyword' that occur somewhere in 'Cluster'
    freq = 0
    for word in row['Keyword'].split(" "):
        if word in row['Cluster']:  # substring test against the Cluster string
            freq += 1
    return freq

df['Frequency'] = df.apply(count_keywords, axis=1)

Output:
>>> df
             Keyword     Cluster  Frequency
0               Nike  Nike Socks          1
1         Nike Socks  Nike Socks          2
2  Nike Stripy Socks  Nike Socks          2
3         Socks Nike  Nike Socks          2
4       Adidas Socks  Nike Socks          1

Answered 2022-01-08 19:46:09
My answer is similar to @Derek's, but it also works when the words in the Cluster column are not separated only by spaces.
from re import escape, findall
import pandas as pd

def count_corresponding(row):
    keywords = row.Keyword.split(' ')
    # escape() keeps regex metacharacters in a keyword from being read as a pattern
    count = sum(len(findall(escape(keyword), row.Cluster)) for keyword in keywords)
    return count

data = {'Keyword': ['Nike', 'Nike Socks', 'Nike Stripy Socks', 'Socks Nike', 'Adidas Socks'],
        'Cluster': ['Nike Socks', 'Nike Socks', 'Nike Socks', 'Nike Socks', 'Nike Socks']}
df = pd.DataFrame(data)
df['Frequency'] = df.apply(count_corresponding, axis=1)

Answered 2022-01-08 20:42:14
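We can explode the keywords into one row per word, check whether each word appears in Cluster, and then sum the results back per original row.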
x = df.assign(Keyword=df.Keyword.str.split(' ')).explode('Keyword')
df['freq'] = x.apply(lambda y: y['Keyword'] in y['Cluster'], axis=1).groupby(level=0).sum()
df
             Keyword     Cluster  freq
0               Nike  Nike Socks     1
1         Nike Socks  Nike Socks     2
2  Nike Stripy Socks  Nike Socks     2
3         Socks Nike  Nike Socks     2
4       Adidas Socks  Nike Socks     1

https://stackoverflow.com/questions/70635675
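
A note on the answers above: both the `in` test on a string and `re.findall` match substrings, so a keyword such as "Sock" would be counted as a hit against "Nike Socks". If whole-word matches are what you need, one possible variation is to split Cluster into a set of words and test membership in that set. This is only a sketch, not something from the original answers: the function name `count_whole_words` and the extra "Sock" row are made up for illustration, and it assumes the words in both columns are separated by whitespace.

```python
import pandas as pd

# Same data as in the question, plus one hypothetical row ("Sock")
# added only to show the difference from substring matching.
data = {
    "Keyword": ["Nike", "Nike Socks", "Nike Stripy Socks",
                "Socks Nike", "Adidas Socks", "Sock"],
    "Cluster": ["Nike Socks"] * 6,
}
df = pd.DataFrame(data)


def count_whole_words(row):
    # Hypothetical helper: count Keyword words that equal a whole word in Cluster.
    cluster_words = set(row["Cluster"].split())
    return sum(word in cluster_words for word in row["Keyword"].split())


df["Frequency"] = df.apply(count_whole_words, axis=1)
print(df)
```

With this version the "Sock" row gets a count of 0, whereas the substring-based approaches would give it 1. The other rows come out the same as in the expected output.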