I am trying to create a term-document matrix with NLTK and pandas. I wrote the following function:
def fnDTM_Corpus(xCorpus):
    '''Create a term-document matrix from an NLTK corpus.'''
    import pandas as pd
    fd_list = []
    # one frequency distribution (word counts) per file in the corpus
    for x in range(0, len(xCorpus.fileids())):
        fd_list.append(nltk.FreqDist(xCorpus.words(xCorpus.fileids()[x])))
    DTM = pd.DataFrame(fd_list, index=xCorpus.fileids())
    DTM.fillna(0, inplace=True)
    return DTM.T

To run it:
import nltk
from nltk.corpus import PlaintextCorpusReader
corpus_root = 'C:/Data/'
newcorpus = PlaintextCorpusReader(corpus_root, '.*')
x = fnDTM_Corpus(newcorpus)

It works fine for a few small files in the corpus, but when I try to run it on a corpus of about 4,000 files (each roughly 2 kB), I get a MemoryError.
Am I missing something?

I am using 32-bit Python (on Windows 7, 64-bit OS, 8 GB RAM). Do I really need 64-bit for a corpus this size?
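(As a rough sanity check on the numbers: the function above builds a dense float64 DataFrame, so memory grows with documents × vocabulary. The vocabulary size below is an assumption for illustration, not a figure from the question.)

# Back-of-envelope estimate of the dense term-document matrix size.
# 4,000 documents is from the question; 30,000 unique terms is assumed.
n_docs, n_terms = 4000, 30000
dense_bytes = n_docs * n_terms * 8   # one float64 cell per entry
print(dense_bytes / 2**30)           # ~0.9 GiB before pandas overhead,
                                     # already tight for a 32-bit process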
Posted on 2013-04-10 17:39:39
Thanks to Radim and larsmans. My goal was to have a DTM like the one you get in R. I decided to use scikit-learn, partly inspired by this blog entry. This is the code I came up with.

I am posting it here in the hope that others will find it useful.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

def fn_tdm_df(docs, xColNames=None, **kwargs):
    '''Create a term-document matrix as a pandas DataFrame.
    With **kwargs you can pass arguments of CountVectorizer;
    if xColNames is given, the DataFrame gets column names.'''
    # initialize the vectorizer
    vectorizer = CountVectorizer(**kwargs)
    x1 = vectorizer.fit_transform(docs)
    # create DataFrame (terms as rows, documents as columns)
    df = pd.DataFrame(x1.toarray().transpose(),
                      index=vectorizer.get_feature_names())
    if xColNames is not None:
        df.columns = xColNames
    return df
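A quick sanity check of fn_tdm_df on an inline list of documents (the sample sentences here are made up for illustration):

docs = ['the cat sat', 'the cat ran', 'a dog barked']
print(fn_tdm_df(docs, xColNames=['d1', 'd2', 'd3']))

To use it on the texts in a directory: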
DIR = 'C:/Data/'

def fn_CorpusFromDIR(xDIR):
    '''Create a corpus from a directory.
    Input: a directory path
    Output: a dictionary with
        the names of the files ['ColNames']
        the texts of the corpus ['docs']'''
    import os
    Res = dict(docs=[open(os.path.join(xDIR, f)).read() for f in os.listdir(xDIR)],
               # list comprehension instead of map(): returns a list on Python 3 too
               ColNames=['P_' + f[0:6] for f in os.listdir(xDIR)])
    return Res

To create the DataFrame:
corpus = fn_CorpusFromDIR(DIR)   # call once so docs and ColNames come from the same listing
d1 = fn_tdm_df(docs=corpus['docs'],
               xColNames=corpus['ColNames'],
               stop_words=None, charset_error='replace')
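A follow-up on memory: fn_tdm_df calls toarray(), which densifies the sparse matrix that CountVectorizer produces, so it can run into the same MemoryError as the original function on a large corpus. A minimal sketch that keeps the matrix sparse, assuming pandas 0.25+ (for the sparse accessor) and scikit-learn 1.0+ (where get_feature_names_out replaced get_feature_names, and charset_error became decode_error):

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

def fn_tdm_sparse(docs, **kwargs):
    '''Term-document matrix as a sparse DataFrame (terms x documents).'''
    vectorizer = CountVectorizer(**kwargs)
    x1 = vectorizer.fit_transform(docs)   # scipy.sparse matrix, shape: docs x terms
    return pd.DataFrame.sparse.from_spmatrix(
        x1.T, index=vectorizer.get_feature_names_out())

Posted on 2015-02-25 18:43:07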
I know the OP wanted to create a TDM in NLTK, but the textmining package (pip install textmining) makes it dead simple:
import textmining

# Create some very short sample documents
doc1 = 'John and Bob are brothers.'
doc2 = 'John went to the store. The store was closed.'
doc3 = 'Bob went to the store too.'

# Initialize class to create term-document matrix
tdm = textmining.TermDocumentMatrix()
# Add the documents
tdm.add_doc(doc1)
tdm.add_doc(doc2)
tdm.add_doc(doc3)
# Write matrix file -- cutoff=1 means a word must appear in at least
# one document to be retained
tdm.write_csv('matrix.csv', cutoff=1)
# Instead of writing the matrix, access its rows directly
for row in tdm.rows(cutoff=1):
    print(row)

Output:
['and', 'the', 'brothers', 'to', 'are', 'closed', 'bob', 'john', 'was', 'went', 'store', 'too']
[1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0]
[0, 2, 0, 1, 0, 1, 0, 1, 1, 1, 2, 0]
[0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1]
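If you want that matrix as a pandas DataFrame instead of a CSV, one small sketch (assuming the row layout shown above, where the first row is the vocabulary):

import pandas as pd

rows = list(tdm.rows(cutoff=1))
df_tdm = pd.DataFrame(rows[1:], columns=rows[0])   # first row holds the terms

Alternatively, you can use pandas and scikit-learn [source]: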
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
docs = ['why hello there', 'omg hello pony', 'she went there? omg']
vec = CountVectorizer()
X = vec.fit_transform(docs)
df = pd.DataFrame(X.toarray(), columns=vec.get_feature_names())
print(df)

Output:
   hello  omg  pony  she  there  went  why
0      1    0     0    0      1     0    1
1      1    1     1    0      0     0    0
2      0    1     0    1      1     1    0

Posted on 2017-12-13 04:43:51
Another approach, using tokens and a DataFrame:
import nltk
import pandas as pd
# nltk.download()  # uncomment to fetch the tokenizer data first
from urllib import request

url = "http://www.gutenberg.org/files/2554/2554-0.txt"
response = request.urlopen(url)
raw = response.read().decode('utf8')
type(raw)
tokens = nltk.word_tokenize(raw)
type(tokens)
tokens[1:10]
['Project',
'Gutenberg',
'EBook',
'of',
'Crime',
'and',
'Punishment',
',',
'by']
tokens2 = pd.DataFrame(tokens)
tokens2.columns = ['Words']
tokens2.head()
       Words
0        The
1    Project
2  Gutenberg
3      EBook
4         of
tokens2.Words.value_counts().head()
,      16178
.       9589
the     7436
and     6284
to      5278
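To extend this token/DataFrame approach from one big text to an actual term-document matrix over several documents, one sketch using pd.crosstab (the two sample documents are made up here; nltk.word_tokenize requires the 'punkt' tokenizer data, via nltk.download('punkt')):

import nltk
import pandas as pd

docs = {'doc1': 'John and Bob are brothers.',
        'doc2': 'John went to the store.'}
# one (document, token) pair per token occurrence
pairs = [(name, tok) for name, text in docs.items()
         for tok in nltk.word_tokenize(text.lower())]
long_df = pd.DataFrame(pairs, columns=['doc', 'term'])
tdm = pd.crosstab(long_df['term'], long_df['doc'])   # terms x documents
print(tdm)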
https://stackoverflow.com/questions/15899861