I have a pandas DataFrame that consists of two columns:

ID  text_data
1   companies are mainly working on two technologies that is ai and health care. Company need to improve on health care.
2   Current trend are mainly depends on block chain and IOT where IOT is highly used.
3   ... and so on.

Now I have another list, named Techlist = ["block chain", "health care", "ai", "IOT"].

I need to match the list Techlist against the text_data column of the DataFrame, so I used the following code:

df['tech_match'] = df['text_data'].apply(
    lambda x: [reduce(op.add, re.findall(act, x)) for act in Techlist
               if re.findall(act, x) != []])

But what I get is something different:
ID  text_data                                   tech_match
1   companies are mainly working on two         [ai, healthcarehealthcare]
    technologies that is ai and health care.
    Company need to improve on health care.
2   Current trend are mainly depends on         [block chain, IOTIOT]
    block chain and IOT where IOT is
    highly used.
3   ... and so on.

The list and the text data are matched correctly, but the matched words are duplicated in the tech_match column.
What I need is:

ID  text_data                                   tech_match
1   companies are mainly working on two         [health care, ai]
    technologies that is ai and health care.
    Company need to improve on health care.
2   Current trend are mainly depends on         [block chain, IOT]
    block chain and IOT where IOT is
    highly used.
3   ... and so on.

How can I remove these duplicated words in the tech_match column?
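For reference, the duplication can be reproduced outside pandas: re.findall returns one element per occurrence of the pattern, and reduce(op.add, ...) then concatenates those occurrences into a single string. A minimal sketch, using one sample sentence from the question's data:

```python
import re
import operator as op
from functools import reduce

text = ("companies are mainly working on two technologies that is ai and "
        "health care. Company need to improve on health care.")

# "health care" occurs twice, so findall returns two elements...
print(re.findall("health care", text))   # ['health care', 'health care']

# ...and reduce(op.add, ...) glues them into one string -- the duplicate:
print(reduce(op.add, re.findall("health care", text)))

# Keeping the search term itself, instead of the concatenated matches,
# already avoids the duplication:
Techlist = ["block chain", "health care", "ai", "IOT"]
print([t for t in Techlist if re.search(t, text)])  # ['health care', 'ai']
```

Note that without word boundaries a short term like "ai" also matches inside words such as "mainly"; the answers handle that with \b or token splitting.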
Posted on 2017-11-05 09:31:47
Use str.split, then call set.intersection:
s = set(["blockchain", "healthcare", "ai", "IOT"])
df['matches'] = df.text_data.str.split(r'[^\w]+')\
.apply(lambda x: list(s.intersection(x)))
df
text_data matches
0 companies are mainly working on two technologi... [healthcare, ai]
1 Current trend are mainly depends on blockchain... [IOT, blockchain]

Thanks to Bharath for the setup data.
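The snippet above assumes df already exists; a self-contained version, built from the question's two sample sentences with the single-word spellings ("blockchain", "healthcare") used in the setup data, might look like this:

```python
import pandas as pd

# Two rows built from the question's sample sentences.
df = pd.DataFrame({'text_data': [
    "companies are mainly working on two technologies that is ai and "
    "healthcare. Company need to improve on healthcare.",
    "Current trend are mainly depends on blockchain and IOT where IOT "
    "is highly used.",
]})

s = {"blockchain", "healthcare", "ai", "IOT"}

# Split each row into word tokens, then intersect with the tech set;
# a set intersection is de-duplicated by construction.  sorted() only
# makes the output order deterministic.
df['matches'] = (df.text_data.str.split(r'[^\w]+')
                   .apply(lambda x: sorted(s.intersection(x))))
print(df['matches'].tolist())  # [['ai', 'healthcare'], ['IOT', 'blockchain']]
```

Because this splits the text into single tokens, it cannot match multi-word terms such as "block chain" written with a space; the regex-pattern answer below covers that case.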
Posted on 2017-11-05 09:22:02
Use str.findall together with word boundaries (\b). Thanks to Anton vBR for the simpler pattern:
pat = '|'.join(r"\b{}\b".format(x) for x in Techlist)
print (pat)
\bblockchain\b|\bhealthcare\b|\bai\b|\bIOT\b

Create the new column with:
df['tech_match'] = df['text_data'].str.findall(pat).apply(lambda x: list(set(x)))
print (df)
text_data tech_match
0 companies are mainly working on two technologi... [healthcare, ai]
1 Current trend are mainly depends on blockchain... [blockchain, IOT]

You can also return the count of each word with a Counter, thanks again to Anton vBR for the suggestion:
from collections import Counter
df['tech_match'] = df['text_data'].str.findall(pat).apply(lambda x: Counter(x))
print(df)
text_data tech_match
0 companies are mainly working on two technologi... {'ai': 1, 'healthcare': 2}
1 Current trend are mainly depends on blockchain... {'blockchain': 1, 'IOT': 2}

And you can join the counts back to the original frame:
data = (df['text_data'].str.findall(pat).apply(lambda x: Counter(x))).tolist()
df = df.join(pd.DataFrame(data)).fillna(0) # join dfs
df['Total'] = df[Techlist].sum(axis=1) # create Total column
   text_data          IOT   ai  blockchain  healthcare  Total
0  companies are ...  0.0  2.0         0.0         2.0    4.0
1  Current trend ...  2.0  0.0         1.0         0.0    3.0

Timings:
text_data = "companies are mainly working on two technologies that is ai and healthcare. Company need to improve on healthcare. Current trend are mainly depends on blockchain and IOT where IOT is highly used.".split()
np.random.seed(75)
#20000 random rows with all words from text_data
N = 20000
df = pd.DataFrame({'text_data': [np.random.choice(text_data, size=np.random.randint(3,10)) for x in range(N)]})
df['text_data'] = df['text_data'].str.join(' ')
Techlist=["blockchain","healthcare","ai","IOT"]
s = set(["blockchain", "healthcare", "ai", "IOT"])
#cᴏʟᴅsᴘᴇᴇᴅ's solution
In [401]: %timeit df['matches'] = df.text_data.str.split(r'[^\w]+').apply(lambda x: list(s.intersection(x)))
10 loops, best of 3: 165 ms per loop
#jezrael's solution
In [402]: %timeit df['tech_match'] = df['text_data'].str.findall('|'.join([r"\b{word}\b".format(word=word) for word in Techlist])).apply(lambda x: list(set(x)))
10 loops, best of 3: 74.7 ms per loop
#Bharath's solution
In [403]: %timeit df['new'] = df['text_data'].apply(lambda x : list(set([i for i in nltk.word_tokenize(x) if i in Techlist])))
1 loop, best of 3: 3.73 s per loop

https://stackoverflow.com/questions/47119965
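For completeness, the findall / Counter / join steps from the second answer can be assembled into one self-contained sketch. It uses the question's two sample sentences with the single-word spellings, so the printed counts differ from the answer's output, which came from the randomized setup data:

```python
import pandas as pd
from collections import Counter

df = pd.DataFrame({'text_data': [
    "companies are mainly working on two technologies that is ai and "
    "healthcare. Company need to improve on healthcare.",
    "Current trend are mainly depends on blockchain and IOT where IOT "
    "is highly used.",
]})

Techlist = ["blockchain", "healthcare", "ai", "IOT"]
pat = '|'.join(r"\b{}\b".format(x) for x in Techlist)

# One Counter per row: {matched term: number of occurrences}.
counts = df['text_data'].str.findall(pat).apply(Counter).tolist()

# Expand the Counters into one column per term; terms absent from a
# row come out as NaN, hence the fillna(0).
df = df.join(pd.DataFrame(counts)).fillna(0)
df['Total'] = df[Techlist].sum(axis=1)
print(df[Techlist + ['Total']])
```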