Bigrams not generated when using the vocabulary parameter in CountVectorizer

Stack Overflow user
Asked on 2017-12-13 08:18:19
1 answer · 742 views · 0 following · 0 votes

I am trying to generate bigrams with CountVectorizer and append them back to the dataframe. However, it only gives me unigrams as output. I want to create bigrams only when specific keywords are present, and I pass those keywords via the vocabulary parameter.

What I want to achieve is to eliminate all other words from the text corpus and keep only the n-grams listed in the vocabulary.

Input data

Id  Name
1   Industrial  Floor chenidsd 34
2   Industrial  Floor room   345
3   Central District    46
4   Central Industrial District  Bay
5   Chinese District Bay
6   Bay Chinese xrty
7   Industrial  Floor chenidsd 34
8   Industrial  Floor room   345
9   Central District    46
10  Central Industrial District  Bay
11  Chinese District Bay
12  Bay Chinese dffefef
13  Industrial  Floor chenidsd 34
14  Industrial  Floor room   345
15  Central District    46
16  Central Industrial District  Bay
17  Chinese District Bay
18  Bay Chinese grty

NLTK

import string

import nltk
import pandas as pd

words = nltk.corpus.stopwords.words('english')
Nata['Clean_Name'] = Nata['Name'].apply(lambda x: ' '.join([item.lower() for item in x.split()]))
# The next two lines iterate character by character, stripping digits and punctuation
Nata['Clean_Name'] = Nata['Clean_Name'].apply(lambda x: "".join([item.lower() for item in x if not item.isdigit()]))
Nata['Clean_Name'] = Nata['Clean_Name'].apply(lambda x: "".join([item.lower() for item in x if item not in string.punctuation]))
# Note: `new_stop_words` is referenced here but never defined above
Nata['Clean_Name'] = Nata['Clean_Name'].apply(lambda x: ' '.join([item.lower() for item in x.split() if item not in new_stop_words]))

Vocabulary definition

english_corpus = ['bay', 'central', 'chinese', 'district', 'floor', 'industrial', 'room']

Bigram generator

cv = CountVectorizer(max_features=200, analyzer='word', vocabulary=english_corpus, ngram_range=(2, 2))
cv_addr = cv.fit_transform(Nata.pop('Clean_Name'))
for i, col in enumerate(cv.get_feature_names()):
    Nata[col] = pd.SparseSeries(cv_addr[:, i].toarray().ravel(), fill_value=0)

However, it only gives me unigrams as output. How can I solve this?

Output

In[26]:Nata.columns.tolist()
Out[26]:

['Id',
 'Name',
 'bay',
 'central',
 'chinese',
 'district',
 'floor',
 'industrial',
 'room']

1 Answer

Stack Overflow user

Answered on 2017-12-13 09:17:39

TL;DR

from io import StringIO
from string import punctuation

import pandas as pd
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

text = """Industrial  Floor
Industrial  Floor room
Central District
Central Industrial District  Bay
Chinese District Bay
Bay Chinese
Industrial  Floor
Industrial  Floor room
Central District"""

stoplist = stopwords.words('english') + list(punctuation)

df = pd.read_csv(StringIO(text), sep='\t', names=['Text'])
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1,2), stop_words=stoplist)
vectorizer.fit_transform(df['Text'])

vectorizer.get_feature_names()

See Basic NLP with NLTK for how it automatically lowercases, tokenizes, and removes stopwords.

Output

['bay',
 'bay chinese',
 'central',
 'central district',
 'central industrial',
 'chinese',
 'chinese district',
 'district',
 'district bay',
 'floor',
 'floor room',
 'industrial',
 'industrial district',
 'industrial floor',
 'room']

If you want the n-gram generation to happen in a preprocessing step, simply override the analyzer parameter.

from io import StringIO
from string import punctuation

import pandas as pd
from nltk import ngrams, word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

stoplist = stopwords.words('english') + list(punctuation)

def preprocess(text):
    # Keep only bigrams whose words are neither stopwords nor digits.
    return [' '.join(ng) for ng in ngrams(word_tokenize(text.lower()), 2)
            if not any(word in stoplist or word.isdigit() for word in ng)]

text = """Industrial  Floor
Industrial  Floor room
Central District
Central Industrial District  Bay
Chinese District Bay
Bay Chinese
Industrial  Floor
Industrial  Floor room
Central District"""

df = pd.read_csv(StringIO(text), sep='\t', names=['Text'])


vectorizer = CountVectorizer(analyzer=preprocess)
vectorizer.fit_transform(df['Text'])    
vectorizer.get_feature_names()

Output

['bay',
 'bay chinese',
 'central',
 'central district',
 'central industrial',
 'chinese',
 'chinese district',
 'district',
 'district bay',
 'floor',
 'floor room',
 'industrial',
 'industrial district',
 'industrial floor',
 'room']

You have misunderstood what the vocabulary parameter in CountVectorizer means.

From the docs:

vocabulary : Mapping or iterable, optional. Either a Mapping (e.g., a dict) where keys are terms and values are indices in the feature matrix, or an iterable over terms. If not given, a vocabulary is determined from the input documents. Indices in the mapping should not be repeated and should not have any gap between 0 and the largest index.

This means that only what is in the vocabulary will be considered as feature names. If you need bigrams in your feature set, you need bigrams in your vocabulary.

It does not generate the n-grams and then check whether the n-grams contain only words from your vocabulary.

In code, you can see that if you add bigrams to the vocabulary, they will appear in feature_names():

from io import StringIO
from string import punctuation

import pandas as pd
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

text = """Industrial  Floor
Industrial  Floor room
Central District
Central Industrial District  Bay
Chinese District Bay
Bay Chinese
Industrial  Floor
Industrial  Floor room
Central District"""

english_corpus=['bay chinese','central district','chinese','district', 'floor','industrial','room']

df = pd.read_csv(StringIO(text), sep='\t', names=['Text'])
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1,2),vocabulary=english_corpus)
vectorizer.fit_transform(df['Text'])

vectorizer.get_feature_names()

Output

['bay chinese',
 'central district',
 'chinese',
 'district',
 'floor',
 'industrial',
 'room']

So, how do you get bigrams in your feature names based on a list of single words (unigrams)?

One possible solution: you have to write your own analyzer that generates the n-grams and checks whether each generated n-gram is in the list of words you want to keep.

from io import StringIO
from string import punctuation

import pandas as pd
from nltk import ngrams, word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

stoplist = stopwords.words('english') + list(punctuation)

def preprocess(text):
    # Keep only bigrams whose words are neither stopwords nor digits.
    return [' '.join(ng) for ng in ngrams(word_tokenize(text.lower()), 2)
            if not any(word in stoplist or word.isdigit() for word in ng)]

text = """Industrial  Floor
Industrial  Floor room
Central District
Central Industrial District  Bay
Chinese District Bay
Bay Chinese
Industrial  Floor
Industrial  Floor room
Central District"""

df = pd.read_csv(StringIO(text), sep='\t', names=['Text'])


vectorizer = CountVectorizer(analyzer=preprocess)
vectorizer.fit_transform(df['Text'])    
vectorizer.get_feature_names()
Votes: 2
Original content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/47788389
