I have a pipeline that works roughly like the example below. The problem is that (presumably because tokenization happens first?) multi-word stop words, i.e. phrases, are not removed.
Full example:
import re
import nltk
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS as ESW, CountVectorizer
# Make sure we have the corpora used by nltk's lemmatizer
try:
    nltk.data.find('corpora/wordnet')
except LookupError:
    nltk.download('wordnet')
# "Naive" token similar to that used by sklearn
TOKEN = re.compile(r'\b\w{2,}\b')
# Tokenize, then lemmatize these tokens
# Modified from:
# http://scikit-learn.org/stable/modules/feature_extraction.html#customizing-the-vectorizer-classes
class LemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()

    def __call__(self, doc):
        return (self.wnl.lemmatize(t) for t in TOKEN.findall(doc))
# Add 1 more phrase to sklearn's stop word list
sw = ESW.union(frozenset(['sinclair broadcast group']))
vect = CountVectorizer(stop_words=sw, ngram_range=(1, 4),
                       tokenizer=LemmaTokenizer())
# These docs are nonsense babble
docs = ["""And you ask Why You Are Sinclair Broadcast Group is Asking It""",
"""Why are you asking what Sinclair Broadcast Group and you"""]
tf = vect.fit_transform(docs)

To reiterate: the single-word stop words are correctly removed, but the phrase remains:
vect.get_feature_names()
# ['ask',
# 'ask sinclair',
# 'ask sinclair broadcast',
# 'ask sinclair broadcast group',
# 'asking',
# 'asking sinclair',
# 'asking sinclair broadcast',
# 'asking sinclair broadcast group',
# 'broadcast',
# 'broadcast group',
# 'broadcast group asking',
# 'group',
# 'group asking',
# 'sinclair',
# 'sinclair broadcast',
# 'sinclair broadcast group',
# 'sinclair broadcast group asking']

How can I correct this?
Posted on 2018-02-27 18:19:04
From the CountVectorizer documentation:

stop_words : string {'english'}, list, or None (default)
    If 'english', a built-in stop word list for English is used. If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens. Only applies if analyzer == 'word'. If None, no stop words will be used. max_df can be set to a value in the range [0.7, 1.0) to automatically detect and filter stop words based on intra corpus document frequency of terms.

And further down, for the token_pattern parameter:

token_pattern : string
    Regular expression denoting what constitutes a "token", only used if analyzer == 'word'. The default regexp selects tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).
So your stop phrase would only be removed if the token produced by the analyzer were equal to 'sinclair broadcast group'. But the default analyzer is 'word', which means stop word detection applies only to individual words, since tokens are defined by the default token_pattern as described above.

Tokens are not n-grams (rather, n-grams are composed of tokens), and stop word removal appears to happen at the token level, before the n-grams are constructed.
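To see this mechanism directly, here is a minimal sketch, assuming the vect from the question is in scope (build_analyzer returns the composed preprocess/tokenize/stop-word/n-gram callable):

# Stop words are filtered token-by-token, and n-grams are built only
# afterwards, so the phrase survives even though the full string
# 'sinclair broadcast group' is in sw.
analyze = vect.build_analyzer()
print(analyze("Sinclair Broadcast Group"))
# should print something like:
# ['sinclair', 'broadcast', 'group', 'sinclair broadcast',
#  'broadcast group', 'sinclair broadcast group']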
As a quick check, you can change your custom stop word to just 'sinclair' for the experiment; it is then removed correctly, because it is seen as an isolated word.
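A minimal sketch of that experiment, reusing docs, ESW, LemmaTokenizer and CountVectorizer from the question (sw_single and vect_check are illustrative names):

# With 'sinclair' as a single-word stop word, it matches an individual
# token and is dropped before any n-grams are built.
sw_single = ESW.union(frozenset(['sinclair']))
vect_check = CountVectorizer(stop_words=sw_single, ngram_range=(1, 4),
                             tokenizer=LemmaTokenizer())
vect_check.fit_transform(docs)
# No feature contains 'sinclair' any more:
print(vect_check.get_feature_names())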
In other words, you need to pass your own callable as the analyzer so that the analyzer logic is applied to the n-grams as well; you then have to check them manually. The default behavior assumes that stop word detection cannot be applied to n-grams, only to single words.
Here is an example of a custom analyzer function for your case. This is based on this answer. Note that I have not tested it, so there may be bugs.
def trigram_match(i, trigram, words):
    """Check whether words[i] is part of the given trigram."""
    if i < len(words) - 2 and words[i:i + 3] == trigram:
        return True
    if (i > 0 and i < len(words) - 1) and words[i - 1:i + 2] == trigram:
        return True
    if i > 1 and words[i - 2:i + 1] == trigram:
        return True
    return False

def custom_analyzer(text):
    # 'broadcast', not 'broadcasting', so that it matches the phrase in the docs
    bad_trigram = ['sinclair', 'broadcast', 'group']
    words = [w.lower() for w in re.findall(r'\w{2,}', text)]
    for i, w in enumerate(words):
        if w in sw or trigram_match(i, bad_trigram, words):
            continue
        yield w
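Equally untested, but presumably it would be wired up along these lines (vect_custom and tf_custom are illustrative names; because the analyzer yields plain words, CountVectorizer's own stop_words and ngram_range settings no longer apply):

vect_custom = CountVectorizer(analyzer=custom_analyzer)
tf_custom = vect_custom.fit_transform(docs)
# 'sinclair', 'broadcast' and 'group' should be gone wherever they
# occurred as the consecutive trigram:
print(vect_custom.get_feature_names())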
Posted on 2018-02-27 18:10:26

Here is a custom analyzer that worked for me. It is a bit hacky, but it effectively does all of the text processing in one step, and it is reasonably fast:
from functools import partial
from itertools import islice
import re
import nltk
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS
def window(seq, n=3):
    # Slide a window of width n over seq, yielding tuples of n items
    it = iter(seq)
    result = tuple(islice(it, n))
    if len(result) == n:
        yield result
    for elem in it:
        result = result[1:] + (elem,)
        yield result
class LemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()

    def __call__(self, doc, stop_words):
        # Lowercase, keep tokens of 3+ word characters, drop stop words,
        # then lemmatize; a tuple so it can be iterated more than once
        return tuple(self.wnl.lemmatize(i.lower()) for i in
                     re.findall(r'\b\w{3,}\b', doc)
                     if i.lower() not in stop_words)
def analyzer(doc, stop_words=None, stop_phr=None, ngram_range=(1, 4)):
    if not stop_words:
        stop_words = set()
    if not stop_phr:
        stop_phr = set()
    start, stop = ngram_range
    lt = LemmaTokenizer()
    words = lt(doc, stop_words=stop_words)
    # Build the n-grams of every width in ngram_range and filter the
    # stop phrases here; n == 1 already yields the single words, so
    # there is no need to yield the unigrams a second time
    for n in range(start, stop + 1):
        for ngram in window(words, n=n):
            res = ' '.join(ngram)
            if res not in stop_phr:
                yield res
analyzer_ = partial(analyzer, stop_words=ENGLISH_STOP_WORDS,
                    stop_phr={'sinclair broadcast group'})
vect = CountVectorizer(analyzer=analyzer_)
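As a quick sanity check, reusing the docs list from the question, the stop phrase should no longer appear among the features:

tf = vect.fit_transform(docs)
# The phrase is filtered inside the analyzer, so it never becomes a feature:
assert 'sinclair broadcast group' not in vect.get_feature_names()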
https://stackoverflow.com/questions/49014129