
Scoring a very large dataset

Stack Overflow user
Asked 2015-04-25 13:39:33
Answers: 3, Views: 578, Followers: 0, Votes: 3

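Using R/Python, I built a reasonably good machine-learning classifier on a 1-2% sample of the data, and I am fairly happy with its accuracy metrics (precision, recall, and F-score).

Now I want to score a large database of 70 million rows/instances that resides in a Hadoop/Hive environment, using the classifier, which is written in R.

Information about the dataset:

70 million rows x 40 variables (columns): about 18 of the variables are categorical and the remaining 22 are numeric (including integers).

How should I go about this? Any suggestions?

What I have thought of doing:

a) Chunk the data out of the Hadoop system into CSV files in increments of 1 million rows and feed them to R

b) Some kind of batch processing.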
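
Option (a) can be sketched in a few lines. The following is a minimal Python illustration of chunked scoring over a CSV export, not the asker's actual pipeline: the column names `id`, `x1`, `x2` and the `score_row` function are hypothetical stand-ins for the real schema and the real classifier.

```python
import csv
import io
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def iter_chunks(rows: Iterable, chunk_size: int) -> Iterator[List]:
    """Yield successive lists of at most chunk_size items."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def score_row(row: dict) -> float:
    # Hypothetical placeholder for the real classifier:
    # the mean of two assumed numeric columns.
    return (float(row["x1"]) + float(row["x2"])) / 2.0


def score_csv(src, dst, chunk_size: int = 1_000_000,
              scorer: Callable[[dict], float] = score_row) -> int:
    """Read rows from src, score them chunk by chunk, write (id, score) to dst."""
    reader = csv.DictReader(src)
    writer = csv.writer(dst)
    writer.writerow(["id", "score"])
    total = 0
    for chunk in iter_chunks(reader, chunk_size):
        # Only one chunk is held in memory at a time.
        writer.writerows((row["id"], scorer(row)) for row in chunk)
        total += len(chunk)
    return total


if __name__ == "__main__":
    sample = io.StringIO("id,x1,x2\n1,1,3\n2,4,4\n3,0,1\n")
    out = io.StringIO()
    print(score_csv(sample, out, chunk_size=2), "rows scored")
    print(out.getvalue())
```

Because each chunk is independent, the same loop can be run over separate export files in parallel, which is essentially what option (b) amounts to.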

This is not a real-time system, so it does not need to run every day, but I would still like the scoring to finish in about 2-3 hours.
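
To put the 2-3 hour target in concrete terms, here is a small back-of-the-envelope calculation in Python. It uses only the figures from the question (70 million rows, 1 million rows per chunk, 2 to 3 hours); the function names are illustrative.

```python
def required_rows_per_second(total_rows: int, hours: float) -> float:
    """Sustained throughput needed to finish total_rows within the given hours."""
    return total_rows / (hours * 3600)


def chunk_count(total_rows: int, chunk_size: int) -> int:
    """Number of chunks needed, rounding up for a partial last chunk."""
    return -(-total_rows // chunk_size)


def seconds_per_chunk(total_rows: int, chunk_size: int, hours: float) -> float:
    """Time budget per chunk if chunks are processed one after another."""
    return hours * 3600 / chunk_count(total_rows, chunk_size)


if __name__ == "__main__":
    rows, chunk = 70_000_000, 1_000_000
    for hours in (2, 3):
        print(hours, "h:",
              round(required_rows_per_second(rows, hours)), "rows/s,",
              round(seconds_per_chunk(rows, chunk, hours), 1), "s per chunk")
```

Processed sequentially, the 70 chunks leave a budget of roughly 100 to 150 seconds each; running chunks in parallel relaxes that budget proportionally.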


3 Answers

Stack Overflow user

Answered 2015-04-25 14:14:38

If you can install the R runtime on all data nodes, you can run a simple map-only Hadoop streaming job that calls your R code.

You could also look at SparkR.
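
A streaming mapper is just a program that reads records from standard input and writes results to standard output. The answer suggests calling R; the sketch below shows the same contract in Python for illustration. The tab delimiter, the "first field is the key" layout, and `score_fields` are assumptions, not details from the question (Hive exports often use a different delimiter).

```python
import sys
from typing import Iterable, Iterator, List, TextIO


def score_fields(fields: List[str]) -> float:
    # Hypothetical placeholder for the real classifier:
    # the mean of the numeric fields in the record.
    values = [float(f) for f in fields]
    return sum(values) / len(values)


def map_lines(lines: Iterable[str], delimiter: str = "\t") -> Iterator[str]:
    """Turn each input record 'key<TAB>f1<TAB>f2...' into 'key<TAB>score'."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        fields = line.split(delimiter)
        key, features = fields[0], fields[1:]
        yield f"{key}{delimiter}{score_fields(features)}"


def main(stdin: TextIO = sys.stdin, stdout: TextIO = sys.stdout) -> None:
    """Entry point for use as a streaming mapper: stdin in, stdout out."""
    for out_line in map_lines(stdin):
        stdout.write(out_line + "\n")

# To use this file as the mapper script, end it with a call to main().
```

With no reduce phase (as far as I know, setting `mapreduce.job.reduces` to 0 achieves this in Hadoop 2), each mapper scores its own input split independently and writes its output straight back to HDFS.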

Votes: 1

Stack Overflow user

Answered 2015-04-25 14:25:58

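I infer that you want to run your R code (the classifier) on the full dataset rather than on the sample.

So what we are looking for is a way to execute R code on a large-scale distributed system.

It also has to integrate tightly with the Hadoop components.

RHadoop would therefore fit your problem statement.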

http://www.rdatamining.com/big-data/r-hadoop-setup-guide

Votes: 1

Stack Overflow user

Answered 2015-04-25 20:17:15

Scoring 80 million records in about 8.5 seconds

The code below was run on an off-lease Dell T7400 workstation with 64 GB of RAM, dual quad-core 3 GHz Xeons, and two RAID 0 SSD arrays on separate channels, which I purchased for $600. I also use the free SPDE engine to partition the dataset.

For small datasets like your 80 million rows, you might want to consider SAS or WPS. The code below scores 80 million 40-char records in under 9 seconds.

In-memory R combined with SAS/WPS makes a great combination. Many SAS users consider datasets smaller than 1 TB to be small.

I ran 8 parallel processes on SAS 9.4 64-bit under Windows Pro 64-bit. Elapsed time was about 8.5 seconds.

%let pgm=utl_score_spde;

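* Assign the SPDE library first so the cleanup step below can reference it;
libname spde spde 'd:/tmp'
  datapath=("f:/wrk/spde_f" "e:/wrk/spde_e" "g:/wrk/spde_g")
  partsize=4g;

* Remove any copy of the test table left over from a previous run;
proc datasets library=spde;
  delete littledata_spde;
run;quit;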

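* Build 80 million test rows with 20 random numerics (n1-n20) and 20 four-byte character fields (c1-c20);
data spde.littledata_spde (compress=char drop=idx);
  retain primary_key;
  array num[20] n1-n20;
  array chr[20] $4 c1-c20;
  do primary_key=1 to 80000000;
    do idx=31 to 50;
      num[idx-30]=uniform(-1);
      * byte(idx) is the character with code idx, so c18 (idx=48) holds '0000' once truncated to $4;
      chr[idx-30]=repeat(byte(idx),40);
    end;
    output;
  end;
run;quit;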



%let _s=%sysfunc(compbl(C:\Progra~1\SASHome\SASFoundation\9.4\sas.exe -sysin c:\nul -nosplash -sasautos c:\oto -autoexec c:\oto\Tut_Oto.sas));

* score it;


data _null_;file "c:\oto\utl_scoreit.sas" lrecl=512;input;put _infile_;putlog _infile_;
cards4;
%macro utl_scoreit(beg=1,end=10000000);

  libname spde spde 'd:/tmp'
  datapath=("f:/wrk/spde_f" "e:/wrk/spde_e" "g:/wrk/spde_g")
  partsize=4g;

  libname out "G:/wrk";

  data keyscore;

     set spde.littledata_spde(firstobs=&beg obs=&end
        keep=
           primary_key
           n1
           n12
           n3
           n14
           n5
           n16
           n7
           n18
           n9
           n10
           c18
           c19
           c12);
    score= (.1*n1   +
            .1*n12  +
            .1*n3   +
            .1*n14  +
            .1*n5   +
            .1*n16  +
            .1*n7   +
            .1*n18  +
            .1*n9   +
            .1*n10  +
             (c18='0000')  +
             (c19='0000')  +
             (c12='0000'))/3  ;
    keep primary_key score;
  run;

%mend utl_scoreit;
;;;;
run;quit;

%utl_scoreit;


%let tym=%sysfunc(time());
systask kill sys101 sys102 sys103 sys104  sys105 sys106 sys107 sys108;
systask command "&_s -termstmt %nrstr(%utl_scoreit(beg=1,end=10000000);) -log G:\wrk\sys101.log" taskname=sys101;
systask command "&_s -termstmt %nrstr(%utl_scoreit(beg=10000001,end=20000000);) -log G:\wrk\sys102.log" taskname=sys102 ;
systask command "&_s -termstmt %nrstr(%utl_scoreit(beg=20000001,end=30000000);) -log G:\wrk\sys103.log" taskname=sys103 ;
systask command "&_s -termstmt %nrstr(%utl_scoreit(beg=30000001,end=40000000);) -log G:\wrk\sys104.log" taskname=sys104 ;
systask command "&_s -termstmt %nrstr(%utl_scoreit(beg=40000001,end=50000000);) -log G:\wrk\sys105.log" taskname=sys105 ;
systask command "&_s -termstmt %nrstr(%utl_scoreit(beg=50000001,end=60000000);) -log G:\wrk\sys106.log" taskname=sys106 ;
systask command "&_s -termstmt %nrstr(%utl_scoreit(beg=60000001,end=70000000);) -log G:\wrk\sys107.log" taskname=sys107 ;
systask command "&_s -termstmt %nrstr(%utl_scoreit(beg=70000001,end=80000000);) -log G:\wrk\sys108.log" taskname=sys108 ;
waitfor _all_ sys101 sys102 sys103 sys104  sys105 sys106 sys107 sys108;
systask list;
%put %sysevalf( %sysfunc(time()) - &tym);

8.56500005719863

NOTE: AUTOEXEC processing completed.

NOTE: Libref SPDE was successfully assigned as follows: 
      Engine:        SPDE 
      Physical Name: d:\tmp\
NOTE: Libref OUT was successfully assigned as follows: 
      Engine:        V9 
      Physical Name: G:\wrk

NOTE: There were 10000000 observations read from the data set SPDE.LITTLEDATA_SPDE.
NOTE: The data set WORK.KEYSCORE has 10000000 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           7.05 seconds
      cpu time            6.98 seconds



NOTE: SAS Institute Inc., SAS Campus Drive, Cary, NC USA 27513-2414
NOTE: The SAS System used:
      real time           8.34 seconds
      cpu time            7.36 seconds
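
For readers who do not use SAS, the two ideas in the program above (splitting the row range into eight equal slices, and the scoring expression itself) can be sketched in Python. This is an illustrative translation under stated assumptions, not the answerer's code: records are assumed to be plain dicts keyed by the same variable names, and missing values are not handled.

```python
from typing import Dict, List, Tuple, Union

# Variables kept by the SAS scoring step.
NUMERIC_VARS = ("n1", "n12", "n3", "n14", "n5", "n16", "n7", "n18", "n9", "n10")
FLAG_VARS = ("c18", "c19", "c12")


def partition_ranges(total: int, parts: int) -> List[Tuple[int, int]]:
    """Split rows 1..total into contiguous 1-based inclusive (beg, end) ranges,
    matching the firstobs=/obs= convention used in the macro calls."""
    base, extra = divmod(total, parts)
    ranges = []
    beg = 1
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue
        end = beg + size - 1
        ranges.append((beg, end))
        beg = end + 1
    return ranges


def score_record(rec: Dict[str, Union[float, str]]) -> float:
    """Same expression as the SAS data step: 0.1 times each numeric variable,
    plus 1 for each character field equal to '0000', all divided by 3."""
    numeric_part = sum(0.1 * rec[name] for name in NUMERIC_VARS)
    flag_part = sum(1 for name in FLAG_VARS if rec[name] == "0000")
    return (numeric_part + flag_part) / 3


if __name__ == "__main__":
    # Each (beg, end) pair would be handed to a separate worker process.
    for beg, end in partition_ranges(80_000_000, 8):
        print(beg, end)
```

Each range is independent of the others, which is why the eight SAS sessions can run concurrently without coordination.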
Votes: 0
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/29861458
