
Keyword matching gives duplicated words in a pandas column?

Stack Overflow user
Asked on 2017-11-05 09:17:07
2 answers · 962 views · 0 followers · Score 2

I have a pandas DataFrame consisting of two columns:

ID           text_data                               

1         companies are mainly working on two 
          technologies that is ai and health care.
          Company need to improve on health care.

2         Current trend are mainly depends on block chain
          and IOT where IOT is
          highly used.

3         ............
.         ...........
.         ...........
.         so on.

Now I have another list, Techlist = ["block chain", "health care", "ai", "IOT"].

I need to match the Techlist entries against the text_data column of the DataFrame, so I used the following code:

import re, operator as op
from functools import reduce

df['tech_match'] = df['text_data'].apply(lambda x: [reduce(op.add, re.findall(act, x)) for act in Techlist if re.findall(act, x) != []])

But what I get is something different:

ID         text_data                                           tech_match
1     companies are mainly working on two          [ai,healthcarehealthcare]             
      technologies that is ai and health care.
      Company need to improve on health care.

2     current trend are mainly                     [block chain,IOTIOT]
      depends on block chain and 
      IOT where IOT is highly used.

3    .................
.    ................             
.    ...............
.    so on.

The list words match the text data correctly, but the matched words are duplicated in the tech_match column.

What I need is:

ID            text_data                             tech_match
1     companies are mainly working on two           [health care,ai]
      technologies that is ai and health care.
      Company need to improve on health care.

2     Current trend are mainly depends on          [block chain,IOT]
      blockchain and IOT where IOT is
      highly used. 

3     ..................
.     ..................
.     .................
.     so on.

How do I remove these duplicate words from the tech_match column?


2 Answers

Stack Overflow user

Accepted answer

Answered on 2017-11-05 09:31:47

Use str.split, then call set.intersection:

s = set(["blockchain", "healthcare", "ai", "IOT"])

df['matches'] = df.text_data.str.split(r'[^\w]+')\
                   .apply(lambda x: list(s.intersection(x)))
df

                                           text_data            matches
0  companies are mainly working on two technologi...   [healthcare, ai]
1  Current trend are mainly depends on blockchain...  [IOT, blockchain]

Thanks to Bharath for the setup data.
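One caveat worth noting: because str.split tokenizes on non-word characters, a token set can never contain the multi-word phrases from the original Techlist ("block chain", "health care"); the set above sidesteps this by using space-free terms. A hedged sketch that also handles multi-word terms, using a word-boundary search per term rather than this answer's split/intersection (the sample rows mirror the question's data):

```python
import re
import pandas as pd

df = pd.DataFrame({'text_data': [
    "companies are mainly working on two technologies that is ai and health care. "
    "Company need to improve on health care.",
    "Current trend are mainly depends on block chain and IOT where IOT is highly used."
]})
Techlist = ["block chain", "health care", "ai", "IOT"]

# A word-boundary regex per term also matches multi-word phrases,
# which a token-set intersection cannot see; \b keeps "ai" from
# matching inside words like "mainly"
df['matches'] = df['text_data'].apply(
    lambda text: [t for t in Techlist
                  if re.search(r'\b{}\b'.format(re.escape(t)), text)])
```

The per-term scan is O(len(Techlist)) regex searches per row, so it trades some speed for phrase support.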

Score 3

Stack Overflow user

Answered on 2017-11-05 09:22:02

Use str.findall with word boundaries (\b) to match whole words. Thanks to Anton vBR for the simpler pattern:

pat = '|'.join(r"\b{}\b".format(x) for x in Techlist)
print (pat)
\bblockchain\b|\bhealthcare\b|\bai\b|\bIOT\b 
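The pattern above interpolates each term verbatim. If a term could ever contain a regex metacharacter (a hypothetical "node.js", say, where "." matches any character), wrapping each term in re.escape keeps the pattern literal — a small sketch, not part of the original answer:

```python
import re

# Hypothetical term list including a regex metacharacter ('.')
Techlist = ["block chain", "health care", "ai", "node.js"]

# re.escape keeps metacharacters like '.' literal inside the alternation
pat = '|'.join(r"\b{}\b".format(re.escape(x)) for x in Techlist)

matches = re.findall(pat, "we use node.js and ai daily")
```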

Create the new column with:

df['tech_match'] = df['text_data'].str.findall(pat).apply(lambda x: list(set(x)))

print (df)
                                           text_data         tech_match
0  companies are mainly working on two technologi...   [healthcare, ai]
1  Current trend are mainly depends on blockchain...  [blockchain, IOT]
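A side note: set() drops the duplicates but its iteration order is arbitrary, so the list order in tech_match can vary between runs. If first-appearance order matters, dict.fromkeys is a drop-in, order-preserving alternative — a sketch using the same pattern and sample data:

```python
import pandas as pd

df = pd.DataFrame({'text_data': [
    "companies are mainly working on two technologies that is ai and healthcare. "
    "Company need to improve on healthcare.",
    "Current trend are mainly depends on blockchain and IOT where IOT is highly used."
]})
pat = r'\bblockchain\b|\bhealthcare\b|\bai\b|\bIOT\b'

# dict.fromkeys removes duplicates while keeping first-appearance
# order, unlike set(), whose iteration order is arbitrary
df['tech_match'] = df['text_data'].str.findall(pat).apply(
    lambda x: list(dict.fromkeys(x)))
```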

You can also return the count of each word with a Counter; thanks again to Anton vBR for the suggestion:

from collections import Counter

df['tech_match'] = df['text_data'].str.findall(pat).apply(lambda x: Counter(x))

print(df)

    text_data                                           tech_match
0   companies are mainly working on two technologi...   {'ai': 1, 'healthcare': 2}
1   Current trend are mainly depends on blockchain...   {'blockchain': 1, 'IOT': 2}

You can also join the count series back to the original frame:

data = (df['text_data'].str.findall(pat).apply(lambda x: Counter(x))).tolist()
df = df.join(pd.DataFrame(data)).fillna(0) # join dfs
df['Total'] =df[Techlist].sum(axis=1) # create Total column

   text_data          IOT   ai  blockchain  healthcare  Total 
0  companies are ...  0.0  2.0         0.0        2.0    4.0
1  Current trend ...  2.0  0.0         1.0        0.0    3.0 
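The join step above, as a self-contained sketch (sample rows shortened for brevity; column names as in the answer):

```python
from collections import Counter
import pandas as pd

df = pd.DataFrame({'text_data': [
    "ai and healthcare. improve on healthcare.",
    "blockchain and IOT where IOT is highly used."
]})
Techlist = ["blockchain", "healthcare", "ai", "IOT"]
pat = '|'.join(r"\b{}\b".format(x) for x in Techlist)

# one Counter per row -> DataFrame of per-term counts -> join back;
# fillna(0) fills terms absent from a row, then sum across term columns
counts = df['text_data'].str.findall(pat).apply(Counter).tolist()
df = df.join(pd.DataFrame(counts)).fillna(0)
df['Total'] = df[Techlist].sum(axis=1)
```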

Timings

text_data = "companies are mainly working on two technologies that is ai and healthcare. Company need to improve on healthcare. Current trend are mainly depends on blockchain and IOT where IOT is highly used.".split()

np.random.seed(75)
#20000 random rows with all words from text_data
N = 20000
df = pd.DataFrame({'text_data': [np.random.choice(text_data, size=np.random.randint(3,10)) for x in range(N)]})
df['text_data'] = df['text_data'].str.join(' ')


Techlist=["blockchain","healthcare","ai","IOT"]
s = set(["blockchain", "healthcare", "ai", "IOT"])

#cᴏʟᴅsᴘᴇᴇᴅ's solution
In [401]: %timeit df['matches'] = df.text_data.str.split(r'[^\w]+').apply(lambda x: list(s.intersection(x)))
10 loops, best of 3: 165 ms per loop

#jezrael's solution
In [402]: %timeit df['tech_match'] = df['text_data'].str.findall('|'.join([r"\b{word}\b".format(word=word) for word in Techlist])).apply(lambda x: list(set(x)))
10 loops, best of 3: 74.7 ms per loop

#Bharath's solution
In [403]: %timeit df['new'] = df['text_data'].apply(lambda x :  list(set([i for i in nltk.word_tokenize(x) if i in Techlist])))
1 loop, best of 3: 3.73 s per loop
Score 3
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/47119965