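The High Performance Task View states that tm can use snow for parallel text mining (High-Performance and Parallel Computing with R). However, I have not found any example showing how to do this, although I did find some discussion of parallel computing with tm (R/Finance 2012). Can anyone show how tm interfaces with a cluster created by snow?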
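Edit: please see BenBarnes's comment below. Specifically:
According to ?tm_startCluster, that function looks for an MPI cluster (rather than a SOCK cluster) and lets tm use that cluster. Perhaps this would be an alternative to hadoop, since snow can set up an MPI cluster as long as a few prerequisites are in place.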
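To make the SOCK-versus-MPI distinction concrete, here is a minimal sketch of a snow-style socket cluster applying a text-cleaning step to documents in parallel. It uses the `parallel` package that ships with R (which incorporates snow's cluster functions) rather than tm itself; the toy `docs` vector and the `clean_doc` helper are hypothetical and only stand in for a real corpus and a real transformation.

```r
library(parallel)  # ships with R; provides snow-style clusters

# Hypothetical toy documents standing in for a real corpus
docs <- c("Snow makes SOCK clusters.",
          "MPI clusters need Rmpi.",
          "Text mining in PARALLEL!")

# Hypothetical cleaning step: lower-case the text and strip punctuation
clean_doc <- function(x) gsub("[[:punct:]]", "", tolower(x))

# Two local worker processes connected over sockets (the default PSOCK type)
cl <- makeCluster(2)

# parLapply splits docs across the workers and applies clean_doc to each element
cleaned <- unlist(parLapply(cl, docs, clean_doc))

stopCluster(cl)
print(cleaned)
```

An MPI cluster would be created the same way with a different cluster type, but that additionally requires the snow and Rmpi packages and a working MPI installation, so it is not shown here. Newer tm releases also appear to expose `tm_parLapply_engine()` for registering a cluster like `cl`; check the help pages of your installed tm version before relying on it.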
Answered on 2012-06-19 00:17:06
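LMGTFY, using "r-project parallel" as the search strategy; this was the third hit:
Distributed Text Mining with tm
Copied directly from the slides:
1. Distributed storage: the data set is copied to a DFS ('DistributedCorpus'); only meta information about the corpus stays in memory.
2. Parallel computation: computational operations (Map) run on all elements in parallel, following the MapReduce paradigm, via tm_map() and TermDocumentMatrix(). Processed documents (revisions) can be retrieved on demand.
Implemented in a "plugin" package for tm: tm.plugin.dc.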
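The Map-then-combine idea from the slides can be sketched in plain R. This is not how tm.plugin.dc is implemented; it is only an illustration that assumes a hypothetical toy corpus, a hypothetical `count_terms` helper, and a local socket cluster from the `parallel` package in place of a DFS.

```r
library(parallel)

# Hypothetical toy corpus
docs <- c("oil prices rise", "bank merger talks", "oil bank deal", "prices fall")

# Split the corpus into two chunks, loosely mirroring chunked storage
chunks <- split(docs, rep(1:2, length.out = length(docs)))

# Map step: count the terms within a single chunk
count_terms <- function(chunk) table(unlist(strsplit(chunk, " ", fixed = TRUE)))

cl <- makeCluster(2)
partial <- parLapply(cl, chunks, count_terms)  # one chunk per worker
stopCluster(cl)

# Reduce step: flatten the per-chunk counts and sum them by term
all_counts <- unlist(lapply(unname(partial),
                            function(tab) setNames(as.vector(tab), names(tab))))
totals <- vapply(split(all_counts, names(all_counts)), sum, integer(1))
print(totals)
```

Only the small per-chunk count tables travel back to the master process, which is the same reason the distributed corpus in the transcript below is so much smaller in memory than the full data set.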
#Distributed Text Mining in R
> library("tm.plugin.dc")
> dc <- DistributedCorpus(DirSource("Data/reuters"),
+                         list(reader = readReut21578XML))
> dc <- as.DistributedCorpus(Reuters21578)
> summary(dc)
#A corpus with 21578 text documents
#The metadata consists of 2 tag-value pairs and a data frame
#Available tags are:
#create_date creator
#Available variables in the data frame are:
#MetaID
#--- Distributed Corpus ---
#Available revisions:
#20100417144823
#Active revision: 20100417144823
#DistributedCorpus: Storage
#- Description: Local Disk Storage
#- Base directory on storage: /tmp/RtmpuxX3W7/file5bd062c2
#- Current chunk size [bytes]: 10485760
> dc <- tm_map(dc, stemDocument)
> print(object.size(Reuters21578), units = "Mb")
#109.5 Mb
> dc
#A corpus with 21578 text documents
> dc_storage(dc)
#DistributedCorpus: Storage
#- Description: Local Disk Storage
#- Base directory on storage: /tmp/RtmpuxX3W7/file5bd062c2
#- Current chunk size [bytes]: 10485760
> dc[[3]]
#----------
#Texas Commerce Bancshares Inc's Texas
#Commerce Bank-Houston said it filed an application with the
#Comptroller of the Currency in an effort to create the largest
#banking network in Harris County.
#The bank said the network would link 31 banks having
#13.5 billion dlrs in assets and 7.5 billion dlrs in deposits.
#Reuter
#---------
> print(object.size(dc), units = "Mb")
#0.6 Mb
A further search with the terms tm, snow, parLapply turned up a link with this code:
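library(snow)
# Start four local worker processes connected over sockets
cl <- makeCluster(4, type="SOCK")
par(ask=TRUE)  # pause before each new plot
# Each task just sleeps; mat is passed only to add data-transfer cost
bigsleep <- function(sleeptime, mat) Sys.sleep(sleeptime)
bigmatrix <- matrix(0, 2000, 2000)
sleeptime <- rep(1, 100)
# clusterApply sends bigmatrix along with each of the 100 tasks
# (note: tm here is a timing object, not the tm package)
tm <- snow.time(clusterApply(cl, sleeptime, bigsleep, bigmatrix))
plot(tm)
cat(sprintf("Elapsed time for clusterApply: %f\n", tm$elapsed))
# parLapply groups the tasks into one chunk per worker, so bigmatrix is sent once per worker
tm <- snow.time(parLapply(cl, sleeptime, bigsleep, bigmatrix))
plot(tm)
cat(sprintf("Elapsed time for parLapply: %f\n", tm$elapsed))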
stopCluster(cl)
https://stackoverflow.com/questions/11092621