I have a pandas DataFrame that consists of two columns:

ID  text_data
1   companies are mainly working on two technologies that is ai and health care. Company need to improve on health care.
2   Current trend are mainly depends on block chain and IOT where IOT is highly used.
3   ... and so on.

Now I have another list, named Techlist = ["block chain", "health care", "ai", "IOT"].

I need to match the list Techlist against the text_data column of the DataFrame, so I used the following code:

df['tech_match'] = df['text_data'].apply(
    lambda x: [reduce(op.add, re.findall(act, x)) for act in Techlist
               if re.findall(act, x) != []])

But what I get is something different:
ID  text_data                                   tech_match
1   companies are mainly working on two         [ai, healthcarehealthcare]
    technologies that is ai and health care.
    Company need to improve on health care.
2   Current trend are mainly depends on         [block chain, IOTIOT]
    block chain and IOT where IOT is
    highly used.
3   ... and so on.

The list and the text data are matched correctly, but the matched words are duplicated in the tech_match column.
What I need is:

ID  text_data                                   tech_match
1   companies are mainly working on two         [health care, ai]
    technologies that is ai and health care.
    Company need to improve on health care.
2   Current trend are mainly depends on         [block chain, IOT]
    block chain and IOT where IOT is
    highly used.
3   ... and so on.

How can I remove these duplicated words in the tech_match column?
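For reference, the duplication can be reproduced outside pandas: re.findall returns one element per occurrence of the pattern, and reduce(op.add, ...) then concatenates those occurrences into a single string. A minimal sketch, using one sample sentence from the question's data:

```python
import re
import operator as op
from functools import reduce

text = ("companies are mainly working on two technologies that is ai and "
        "health care. Company need to improve on health care.")

# "health care" occurs twice, so findall returns two elements...
print(re.findall("health care", text))   # ['health care', 'health care']

# ...and reduce(op.add, ...) glues them into one string -- the duplicate:
print(reduce(op.add, re.findall("health care", text)))

# Keeping the search term itself, instead of the concatenated matches,
# already avoids the duplication:
Techlist = ["block chain", "health care", "ai", "IOT"]
print([t for t in Techlist if re.search(t, text)])  # ['health care', 'ai']
```

Note that without word boundaries a short term like "ai" also matches inside words such as "mainly"; the answers handle that with \b or token splitting.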
Posted on 2017-11-05 09:31:47
Use str.split, then call set.intersection:
s = set(["blockchain", "healthcare", "ai", "IOT"])
df['matches'] = df.text_data.str.split(r'[^\w]+')\
.apply(lambda x: list(s.intersection(x)))
df
text_data matches
0 companies are mainly working on two technologi... [healthcare, ai]
1 Current trend are mainly depends on blockchain... [IOT, blockchain]

Thanks to Bharath for the setup data.
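The snippet above assumes df already exists; a self-contained version, built from the question's two sample sentences with the single-word spellings ("blockchain", "healthcare") used in the setup data, might look like this:

```python
import pandas as pd

# Two rows built from the question's sample sentences.
df = pd.DataFrame({'text_data': [
    "companies are mainly working on two technologies that is ai and "
    "healthcare. Company need to improve on healthcare.",
    "Current trend are mainly depends on blockchain and IOT where IOT "
    "is highly used.",
]})

s = {"blockchain", "healthcare", "ai", "IOT"}

# Split each row into word tokens, then intersect with the tech set;
# a set intersection is de-duplicated by construction.  sorted() only
# makes the output order deterministic.
df['matches'] = (df.text_data.str.split(r'[^\w]+')
                   .apply(lambda x: sorted(s.intersection(x))))
print(df['matches'].tolist())  # [['ai', 'healthcare'], ['IOT', 'blockchain']]
```

Because this splits the text into single tokens, it cannot match multi-word terms such as "block chain" written with a space; the regex-pattern answer below covers that case.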
Posted on 2017-11-05 09:22:02
Use str.findall together with word boundaries (\b). Thanks to Anton vBR for the simpler pattern:
pat = '|'.join(r"\b{}\b".format(x) for x in Techlist)
print (pat)
\bblockchain\b|\bhealthcare\b|\bai\b|\bIOT\b

Create the new column with:
df['tech_match'] = df['text_data'].str.findall(pat).apply(lambda x: list(set(x)))
print (df)
text_data tech_match
0 companies are mainly working on two technologi... [healthcare, ai]
1 Current trend are mainly depends on blockchain... [blockchain, IOT]

You can also return the count of each word with a Counter, thanks again to Anton vBR for the suggestion:
from collections import Counter
df['tech_match'] = df['text_data'].str.findall(pat).apply(lambda x: Counter(x))
print(df)
text_data tech_match
0 companies are mainly working on two technologi... {'ai': 1, 'healthcare': 2}
1 Current trend are mainly depends on blockchain... {'blockchain': 1, 'IOT': 2}

And you can join the counts back to the original frame:
data = (df['text_data'].str.findall(pat).apply(lambda x: Counter(x))).tolist()
df = df.join(pd.DataFrame(data)).fillna(0) # join dfs
df['Total'] = df[Techlist].sum(axis=1) # create Total column
   text_data          IOT   ai  blockchain  healthcare  Total
0  companies are ...  0.0  2.0         0.0         2.0    4.0
1  Current trend ...  2.0  0.0         1.0         0.0    3.0

Timings:
text_data = "companies are mainly working on two technologies that is ai and healthcare. Company need to improve on healthcare. Current trend are mainly depends on blockchain and IOT where IOT is highly used.".split()
np.random.seed(75)
#20000 random rows with all words from text_data
N = 20000
df = pd.DataFrame({'text_data': [np.random.choice(text_data, size=np.random.randint(3,10)) for x in range(N)]})
df['text_data'] = df['text_data'].str.join(' ')
Techlist=["blockchain","healthcare","ai","IOT"]
s = set(["blockchain", "healthcare", "ai", "IOT"])
#cᴏʟᴅsᴘᴇᴇᴅ's solution
In [401]: %timeit df['matches'] = df.text_data.str.split(r'[^\w]+').apply(lambda x: list(s.intersection(x)))
10 loops, best of 3: 165 ms per loop
#jezrael's solution
In [402]: %timeit df['tech_match'] = df['text_data'].str.findall('|'.join([r"\b{word}\b".format(word=word) for word in Techlist])).apply(lambda x: list(set(x)))
10 loops, best of 3: 74.7 ms per loop
#Bharath's solution
In [403]: %timeit df['new'] = df['text_data'].apply(lambda x : list(set([i for i in nltk.word_tokenize(x) if i in Techlist])))
1 loop, best of 3: 3.73 s per loop

https://stackoverflow.com/questions/47119965
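For completeness, the findall / Counter / join steps from the second answer can be assembled into one self-contained sketch. It uses the question's two sample sentences with the single-word spellings, so the printed counts differ from the answer's output, which came from the randomized setup data:

```python
import pandas as pd
from collections import Counter

df = pd.DataFrame({'text_data': [
    "companies are mainly working on two technologies that is ai and "
    "healthcare. Company need to improve on healthcare.",
    "Current trend are mainly depends on blockchain and IOT where IOT "
    "is highly used.",
]})

Techlist = ["blockchain", "healthcare", "ai", "IOT"]
pat = '|'.join(r"\b{}\b".format(x) for x in Techlist)

# One Counter per row: {matched term: number of occurrences}.
counts = df['text_data'].str.findall(pat).apply(Counter).tolist()

# Expand the Counters into one column per term; terms absent from a
# row come out as NaN, hence the fillna(0).
df = df.join(pd.DataFrame(counts)).fillna(0)
df['Total'] = df[Techlist].sum(axis=1)
print(df[Techlist + ['Total']])
```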