Right now I only get one row. How can I get all the words? I have a column of text and a problem with the stemmer: it returns only one row instead of all the words.
My goal is to clean the data and print all the words separated by commas.
Input: word1, word2, word3, word4, word5 in each row of the df Tag column
Output should be one long list of all the values: word1, word2, word3, word4, word5, word6, word7.
from nltk.corpus import stopwords
import re
import numpy as np
from nltk.stem import PorterStemmer
import pandas as pd
import spacy
import pytextrank
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
def Clean_stop_words(data):
    #print(stopwords.words('english'))
    stop_words=stopwords.words('english')
    new_data=""
    for word in data:
        np.char.lower(word)
        if word not in stop_words:
            new_data = data + " , " + word
            print(new_data)
    symbols = "!\"#$%&()*+-./:;<=>?@[\]^_`{|}~\n"
    for i in symbols:
        new_data = np.char.replace(new_text, i, ' ')
        #print(data)
    stemmer=PorterStemmer()
    new_data=stemmer.stem(word)
    #print(new_data)
Clean_stop_words(df["Tag"])
#print(data)
Thanks in advance.
Posted on 2021-09-23 16:59:29
Note -
I decided to clean the special characters with a regex; feel free to change that approach if you prefer.
Also, take a look at pandas' apply function, which takes each row and runs the Clean_stop_words function on it.
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import numpy as np
import pandas as pd
import re
l = ["'word1,wording,w#ord,he##llo,sleeping,don't"]
df = pd.DataFrame(l, columns=['Tag'])
def Clean_stop_words(data):
    stemmer = PorterStemmer()
    stop_words = stopwords.words('english')
    new_data = ""
    data_split = data.split(',')
    for word in data_split:
        word = word.lower()  # assign the lowercased value; np.char.lower's result was being discarded
        word = re.sub('[^A-Za-z0-9]+', '', word)
        if word not in stop_words:
            word = stemmer.stem(word)  # assign the stemmed result, otherwise it is lost
            new_data = new_data + " , " + word
    return new_data
df['Tag'] = df['Tag'].apply(Clean_stop_words)
print(df['Tag'])

https://stackoverflow.com/questions/69301927
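If the goal is really one long flat list of all the cleaned words across every row (as the question describes), rather than one comma-joined string per row, a small variant of the same idea can return a list per row and flatten with pandas' explode. This is a sketch, not the answer's exact code: the sample frame is hypothetical, and a tiny hard-coded stop-word set stands in for NLTK's stopwords.words('english') (the PorterStemmer call from the answer would slot into the same loop).

```python
import re

import pandas as pd

# Tiny stand-in stop-word set; in the real pipeline this would be
# NLTK's stopwords.words('english'), as in the answer above.
STOP_WORDS = {"don", "t", "the", "a"}

def clean_row(text):
    """Split one Tag cell on commas, strip non-alphanumerics, drop stop words."""
    words = []
    for word in text.split(","):
        word = re.sub("[^A-Za-z0-9]+", "", word.lower())
        if word and word not in STOP_WORDS:
            words.append(word)
    return words

# Hypothetical frame mirroring the Tag column from the answer
df = pd.DataFrame({"Tag": ["word1,wording,w#ord", "he##llo,sleeping,don't"]})

# apply() builds a list of words per row; explode() flattens those lists
# into one long Series, so tolist() yields every word across all rows.
all_words = df["Tag"].apply(clean_row).explode().tolist()
print(all_words)
# → ['word1', 'wording', 'word', 'hello', 'sleeping', 'dont']
```

Returning a list per row keeps the per-row structure available (you can still join with ", " for display), while explode gives the single long list the question asks for.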