I am trying to generate bigrams with CountVectorizer and attach them back to the dataframe. However, it only gives me unigrams as output. I want to create bigrams only when specific keywords are present, and I pass those keywords via the vocabulary parameter.
What I want to achieve is to eliminate all other words in the text corpus and keep only the n-grams built from the words listed in the vocabulary.
Input data
Id Name
1 Industrial Floor chenidsd 34
2 Industrial Floor room 345
3 Central District 46
4 Central Industrial District Bay
5 Chinese District Bay
6 Bay Chinese xrty
7 Industrial Floor chenidsd 34
8 Industrial Floor room 345
9 Central District 46
10 Central Industrial District Bay
11 Chinese District Bay
12 Bay Chinese dffefef
13 Industrial Floor chenidsd 34
14 Industrial Floor room 345
15 Central District 46
16 Central Industrial District Bay
17 Chinese District Bay
18 Bay Chinese grty

NLTK
import string
import nltk
import pandas as pd

new_stop_words = nltk.corpus.stopwords.words('english')
Nata['Clean_Name'] = Nata['Name'].apply(lambda x: ' '.join([item.lower() for item in x.split()]))
Nata['Clean_Name'] = Nata['Clean_Name'].apply(lambda x: "".join([item for item in x if not item.isdigit()]))
Nata['Clean_Name'] = Nata['Clean_Name'].apply(lambda x: "".join([item for item in x if item not in string.punctuation]))
Nata['Clean_Name'] = Nata['Clean_Name'].apply(lambda x: ' '.join([item for item in x.split() if item not in new_stop_words]))

Vocabulary definition
english_corpus = ['bay', 'central', 'chinese', 'district', 'floor', 'industrial', 'room']

Bigram generator
cv = CountVectorizer(max_features=200, analyzer='word', vocabulary=english_corpus, ngram_range=(2, 2))
cv_addr = cv.fit_transform(Nata.pop('Clean_Name'))
for i, col in enumerate(cv.get_feature_names()):
    Nata[col] = pd.SparseSeries(cv_addr[:, i].toarray().ravel(), fill_value=0)

However, it only gives me unigrams as output. How can I solve this?
Output
In[26]:Nata.columns.tolist()
Out[26]:
['Id',
'Name',
'bay',
'central',
'chinese',
'district',
'floor',
'industrial',
'room']

Posted on 2017-12-13 09:17:39
TL;DR
from io import StringIO
from string import punctuation
import pandas as pd
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
text = """Industrial Floor
Industrial Floor room
Central District
Central Industrial District Bay
Chinese District Bay
Bay Chinese
Industrial Floor
Industrial Floor room
Central District"""
stoplist = stopwords.words('english') + list(punctuation)
df = pd.read_csv(StringIO(text), sep='\t', names=['Text'])
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1,2), stop_words=stoplist)
vectorizer.fit_transform(df['Text'])
vectorizer.get_feature_names()

See Basic NLP with NLTK for how CountVectorizer automatically lowercases, tokenizes, and removes stopwords. (In scikit-learn >= 1.0, use get_feature_names_out() instead of get_feature_names().)
Output
['bay',
'bay chinese',
'central',
'central district',
'central industrial',
'chinese',
'chinese district',
'district',
'district bay',
'floor',
'floor room',
'industrial',
'industrial district',
'industrial floor',
'room']

If the n-gram generation happens in a preprocessing step instead, simply override the analyzer parameter.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from io import StringIO
from string import punctuation
from nltk import ngrams
from nltk import word_tokenize
from nltk.corpus import stopwords
stoplist = stopwords.words('english') + list(punctuation)
def preprocess(text):
    return [' '.join(ng) for ng in ngrams(word_tokenize(text.lower()), 2)
            if not any(word for word in ng if word in stoplist or word.isdigit())]
text = """Industrial Floor
Industrial Floor room
Central District
Central Industrial District Bay
Chinese District Bay
Bay Chinese
Industrial Floor
Industrial Floor room
Central District"""
df = pd.read_csv(StringIO(text), sep='\t', names=['Text'])
vectorizer = CountVectorizer(analyzer=preprocess)
vectorizer.fit_transform(df['Text'])
vectorizer.get_feature_names()

Output
['bay chinese',
'central district',
'central industrial',
'chinese district',
'district bay',
'floor room',
'industrial district',
'industrial floor']

You have misunderstood what the vocabulary parameter of CountVectorizer means.
From the docs:
vocabulary : Mapping or iterable, optional. Either a Mapping (e.g., a dict) where keys are terms and values are indices in the feature matrix, or an iterable over terms. If not given, a vocabulary is determined from the input documents. Indices in the mapping should not be repeated and should not have any gap between 0 and the largest index.
This means that only the entries of the vocabulary will ever appear among your feature names. If you want bigrams in your feature set, then the bigrams themselves need to be in the vocabulary.
It does not generate n-grams and then check whether each n-gram contains only words from the vocabulary.
In the code below, you can see that if you add bigrams to the vocabulary, they do appear in feature_names():
from io import StringIO
from string import punctuation
import pandas as pd
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
text = """Industrial Floor
Industrial Floor room
Central District
Central Industrial District Bay
Chinese District Bay
Bay Chinese
Industrial Floor
Industrial Floor room
Central District"""
english_corpus = ['bay chinese', 'central district', 'chinese', 'district', 'floor', 'industrial', 'room']
df = pd.read_csv(StringIO(text), sep='\t', names=['Text'])
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 2), vocabulary=english_corpus)
vectorizer.fit_transform(df['Text'])
vectorizer.get_feature_names()

Output
['bay chinese',
'central district',
'chinese',
'district',
'floor',
'industrial',
'room']

So, how do you get bigrams among your feature names based on a list of single words (unigrams)?
One possible solution: you have to write your own analyzer that generates the n-grams and checks that each generated n-gram consists only of words you want to keep.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from io import StringIO
from string import punctuation
from nltk import ngrams
from nltk import word_tokenize
from nltk.corpus import stopwords
stoplist = stopwords.words('english') + list(punctuation)
def preprocess(text):
    return [' '.join(ng) for ng in ngrams(word_tokenize(text.lower()), 2)
            if not any(word for word in ng if word in stoplist or word.isdigit())]
text = """Industrial Floor
Industrial Floor room
Central District
Central Industrial District Bay
Chinese District Bay
Bay Chinese
Industrial Floor
Industrial Floor room
Central District"""
df = pd.read_csv(StringIO(text), sep='\t', names=['Text'])
vectorizer = CountVectorizer(analyzer=preprocess)
vectorizer.fit_transform(df['Text'])
vectorizer.get_feature_names()

https://stackoverflow.com/questions/47788389