Q: Stemming Indonesian text with Sastrawi

Stack Overflow user

Asked 2016-09-02 06:01:42

Answers 1 · Views 5.9K · Followers 0 · Votes 2

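I have a CSV dataset whose values look like this: [enter image description here].

I want to preprocess the data. The data is text, so I am doing text mining, but I am confused about stemming. I tried stemming the data, but the result is a word count across all of the news articles combined. I got the reference code from a friend and want to modify it so that the result is a word count for each news article, not one count over all of them. Please help me modify the code.

Here is the code: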

import os
import pandas as pd
from pandas import DataFrame, read_csv

data = r'D:/SKRIPSI/sample_200_data.csv'
df = pd.read_csv(data)

print "DF", type (df['content']), "\n", df['content']
isiberita = df['content'].tolist()
print "DF list isiberita ", isiberita, type(isiberita)
df.head()

---------------------------------------------------------

import nltk
import string
import os
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
from nltk.corpus import stopwords
from collections import Counter


path = 'D:/SKRIPSI/sample_200_data.csv'
token_dict = {}

factory = StemmerFactory()
stemmer = factory.create_stemmer()

content_stemmed = map(lambda x: stemmer.stem(x), isiberita)
content_no_punc = map(lambda x: x.lower().translate(None, string.punctuation), content_stemmed)
content_final = []


for news in content_no_punc: 
	word_token = nltk.word_tokenize(news) # get word token for every news (split news into each separate words)
	word_token = [word for word in word_token if not word in nltk.corpus.stopwords.words('indonesian') and not word[0].isdigit()] # remove indonesian stop words and number
	content_final.append(" ".join(word_token))

counter = Counter() # counter initiate
[counter.update(news.split()) for news in content_final] # we split every news to get counter of each words
print(counter.most_common(100))

The result of this code is:

[('indonesia', 202), ('rp', 179), ('jakarta', 160), ('usaha', 149), ('investasi', 136), ('laku', 124), ('ekonomi', 100), ('negara', 86), ('harga', 86), ('industri', 84), ('izin', 84), ('menteri', 83), ('listrik', 79), ('juta', 76), ('pasar', 73), ('tani', 71), ('uang', 71), ('koperasi', 71), ('target', 66), ('perintah', 66), ('saham', 65), ('miliar', 64), ('kerja', 63), ('sektor', 62), ('investor', 61), ('bangun', 60), ('produk', 60), ('pajak', 60), ('capai', 60), ('layan', 58), ('bank', 57), ('produksi', 57), ('modal', 57), ('turun', 57), ('china', 56), ('milik', 55), ('tingkat', 54), ('us', 54), ('triliun', 53), ('tumbuh', 53), ('bkpm', 53), ('impor', 52), ('kembang', 51), ('pt', 49), ('jalan', 49), ('dana', 48), ('bandara', 48), ('negeri', 46), ('rencana', 45), ('nilai', 45), ('temu', 44), ('salah', 42), ('proyek', 41), ('masuk', 41), ('desember', 40), ('langsung', 40), ('hasil', 39), ('butuh', 39), ('rupa', 38), ('biaya', 37), ('kapal', 37), ('rusia', 37), ('franky', 37), ('hadap', 36), ('kredit', 35), ('utama', 35), ('carrefour', 35), ('bijak', 35), ('ikan', 35), ('tanam', 35), ('atur', 34), ('persero', 34), ('kait', 34), ('jam', 34), ('masyarakat', 32), ('gas', 32), ('pakai', 32), ('dagang', 31), ('kondisi', 31), ('transmart', 31), ('lihat', 31), ('bisnis', 31), ('nggak', 31), ('kawasan', 30), ('dorong', 30), ('tutup', 30), ('banding', 30), ('batas', 30), ('terima', 30), ('cepat', 30), ('jasa', 30), ('ton', 29), ('the', 29), ('pln', 29), ('ekspor', 29), ('barel', 29), ('as', 29), ('rumah', 29), ('orang', 28), ('pondok', 28)]

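I hope someone can help me modify the code so that I get the count of words in each news item (content) rather than the count of all words across all news items. Thank you.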

python

stemming

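1 Answer

Stack Overflow user

Answered 2016-10-22 08:56:01

If I understand this correctly, your problem is not directly related to PySastrawi.

The problem is the use of counter.update() while processing the news data: it keeps adding to the same Counter, so you end up with word counts accumulated over all news items. If you want to count the words in each news item separately, each item needs its own Counter instance, as shown below (this prints the 100 most common words of each news item):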

for news in content_final:
    counter = Counter(news.split())  # a fresh Counter per news item, so counts do not accumulate
    print(counter.most_common(100))
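
If the per-item counts are needed later rather than only printed, one option is to collect one Counter per news item in a list. This is a minimal sketch, assuming content_final is a list of already preprocessed strings as in the question; the sample strings and the name per_news_counts are made up for illustration.

```python
from collections import Counter

# Hypothetical stand-in for the preprocessed news items from the question.
content_final = ["foo", "foo foo bar", "foo baz baz"]

# One independent Counter per news item, in the same order as content_final.
per_news_counts = [Counter(news.split()) for news in content_final]

for index, counts in enumerate(per_news_counts):
    # Print the item's position and its most common words.
    print(index, counts.most_common(100))
```

Because the list keeps the same order as content_final, per_news_counts[i] can be matched back to row i of the original data.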

A complete demonstration of the per-item loop:

>>> from collections import Counter
>>> content_final = ['foo','foo foo bar','foo baz baz']
>>> for news in content_final:
...     counter = Counter(news.split())
...     print(counter.most_common(1))
...
[('foo', 1)]
[('foo', 2)]
[('baz', 2)]

See it live: https://eval.in/664688

Votes 2
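
A side note on Python versions: the question's code is written for Python 2. The print statements and the call x.lower().translate(None, string.punctuation) will not run on Python 3. Below is a rough Python 3 sketch of the punctuation and stop-word step combined with per-item counting. The STOPWORDS set, the clean helper, and the sample articles are hypothetical stand-ins; the question uses NLTK's 'indonesian' stop-word list and the Sastrawi stemmer, both left out here so the sketch stays self-contained.

```python
import string
from collections import Counter

# Hypothetical, tiny stop-word set; the question uses NLTK's 'indonesian' list.
STOPWORDS = {"yang", "di", "dan"}

# Python 3 translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean(text):
    """Lowercase, strip punctuation, drop stop words and tokens that start with a digit."""
    tokens = text.lower().translate(PUNCT_TABLE).split()
    return [t for t in tokens if t not in STOPWORDS and not t[0].isdigit()]


# Made-up sample articles, only for illustration.
articles = [
    "Harga listrik naik, dan harga gas turun.",
    "Investasi di Jakarta: 200 proyek!",
]

for article in articles:
    # A new Counter per article keeps the counts separate.
    print(Counter(clean(article)).most_common(3))
```

Stemming, if wanted, would be applied to each article before clean, as in the question.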

Original page content provided by Stack Overflow.

Original link: https://stackoverflow.com/questions/39285300
