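I have a dataframe with a column named "Text_Tweet", where each row contains one tweet.
How can I replace each tweet with a string containing only the lemmas of the words in that tweet?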
Posted on 2021-11-30 06:26:02
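The nltk package has a built-in utility, WordNetLemmatizer, that makes lemmatizing words straightforward. Lemmatize each word of a tweet, then join the lemmas back into a single string: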
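from nltk.stem import WordNetLemmatizer
import pandas as pd

# WordNetLemmatizer needs the WordNet corpus; run nltk.download('wordnet') once if it is missing.
your_dataframe = pd.DataFrame({
    'Text_Tweet': ['rocks corpora', 'corpora rocks']
})
lemmatizer = WordNetLemmatizer()
# Lemmatize each whitespace-separated word, then join the lemmas back into one string.
your_dataframe['Processed_Tweet'] = your_dataframe['Text_Tweet'].apply(
    lambda item: ' '.join(lemmatizer.lemmatize(word) for word in item.split())
)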
your_dataframe

Output:
      Text_Tweet Processed_Tweet
0  rocks corpora     rock corpus
1  corpora rocks     corpus rock

Posted on 2021-11-30 08:43:06
Try this:
import nltk
import pandas as pd

w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()

def lemmatize_text(text):
    # Returns a list of lemmas; wrap it in ' '.join(...) to get a single string instead.
    return [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text)]

# Example dataset
df = pd.DataFrame(['I am a boy',
                   'He likes these books',
                   'There were four columns'], columns=['Text_Tweet'])
df['lemm'] = df.Text_Tweet.apply(lemmatize_text)
https://stackoverflow.com/questions/70164699
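Note that the second answer stores a list of lemmas in each row, while the question asks for a string. Below is a minimal sketch of a helper that joins the lemmas back together. The names lemmatize_tweet, toy_lemmas and toy_lemmatize are hypothetical and do not come from either answer. The toy dictionary only stands in for a real lemmatizer so the sketch runs without nltk or the WordNet data.

```python
from typing import Callable


def lemmatize_tweet(text: str, lemmatize: Callable[[str], str]) -> str:
    """Split on whitespace, lemmatize each token, and join the lemmas into one string."""
    return " ".join(lemmatize(token) for token in text.split())


# Hypothetical stand-in for a real lemmatizer, used only to keep the sketch self-contained.
toy_lemmas = {"rocks": "rock", "corpora": "corpus"}


def toy_lemmatize(word: str) -> str:
    # Unknown words are returned unchanged.
    return toy_lemmas.get(word, word)


result = lemmatize_tweet("rocks corpora", toy_lemmatize)
print(result)  # rock corpus
```

With nltk installed, the same helper should work with the real lemmatizer, for example df['Text_Tweet'].apply(lambda t: lemmatize_tweet(t, lemmatizer.lemmatize)), where lemmatizer is the WordNetLemmatizer instance from the answers above.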