Q: rmr2 mapreduce: CSV subset of columns

Stack Overflow user

Asked 2014-12-16 20:12:45

Answers 1 | Views 725 | Followers 0 | Score 1

I have a large CSV file with 42 variables and 200,000 records. I want to process it with mapreduce (local backend), but I always get the following error:

Error: cannot allocate vector of size 15.6 Gb
In addition: Warning messages:
1: closing unused connection 3 (C:\Users\LSZL~1\AppData\Local\Temp\RtmpgJ2FXm\filea302f8a7363) 
2: In paste(rep(l, length(lvs)), rep(lvs, each = length(l)), sep = sep) :
  Reached total allocation of 8051Mb: see help(memory.size)
3: In paste(rep(l, length(lvs)), rep(lvs, each = length(l)), sep = sep) :
  Reached total allocation of 8051Mb: see help(memory.size)
4: In paste(rep(l, length(lvs)), rep(lvs, each = length(l)), sep = sep) :
  Reached total allocation of 8051Mb: see help(memory.size)
5: In paste(rep(l, length(lvs)), rep(lvs, each = length(l)), sep = sep) :
  Reached total allocation of 8051Mb: see help(memory.size)

My code:

inputformat <- make.input.format("csv", sep = ",", col.names=column_names)

a <- mapreduce(input="X:/BigData/working_dir/census-income.data", 
               input.format=inputformat,

               map = function(k, v){
                 key = v
                 return(keyval(key, v[1,1]))
               },

               reduce = function(k, v){
                 key = k[1, 1]
                 val = sum(k[, 2])
                 return(keyval(key, val))
               }               
)()

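Is it possible to avoid feeding the unnecessary columns (and their data) to mapreduce, and select only the columns whose data is actually needed?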

mapreduce

csv

hadoop

1 Answer

Stack Overflow user

Accepted answer

Answered 2014-12-16 23:49:19

I finally figured it out.

I don't know whether it is efficient, but it works.

library(rmr2)
rmr.options(backend = "local")  # local backend, as in the question

column_names <- c("age","class_of_worker", "industry_code", "occupation_code", "education", 
                  "wage_per_hour", "enrolled_in_edu_inst_last_wk", "marital_status", "major_industry_code", 
                  "major_occupation_code", "race", "hispanic_origin", "sex", "member_of_a_labor_union", 
                  "reason_for_unemployment","full_or_part_time_employment_stat", "capital_gains", "capital_losses", 
                  "divdends_from_stocks", "tax_filer_status", "region_of_previous_residence", 
                  "state_of_previous_residence", "detailed_household_and_family_stat", 
                  "detailed_household_summary_in_household", "instance_weight", "migration_code-change_in_msa",
                  "migration_code-change_in_reg","migration_code-move_within_reg","live_in_this_house_1_year_ago", 
                  "migration_prev_res_in_sunbelt",  "num_persons_worked_for_employer", "total_person_earnings", 
                  "country_of_birth_father", "country_of_birth_mother", "country_of_birth_self", "citizenship", 
                  "own_business_or_self_employed", "fill_inc_questionnaire_for_veteran's_admin", 
                  "veterans_benefits", "weeks_worked_in_year", "year", "CLASS")

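# Columns to keep; every other column is dropped in the map step.
important_columns = c("age", "education", "wage_per_hour", "weeks_worked_in_year")

# Describe the CSV input: comma-separated, using the column names defined above.
input_file_format = 
  make.input.format(
    "csv", 
    sep = ",", 
    col.names = column_names)

# Map-only job: each chunk v arrives as a data frame and is cut down to the
# important columns. No reduce step is used.
input_subset = 
  mapreduce(
    input = "X:/BigData/working_dir/census-income.data", 
    input.format = input_file_format,
    map = 
      function(k, v) 
        subset(v, select = important_columns))

# Pull the result back into memory and keep only the values (the data frame).
input_dataframe = from.dfs(input_subset)
input_dataframe = values(input_dataframe)
input_dataframe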
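
To see what the map function above does to each chunk without running a job, the following sketch applies the same `subset` call to a small in-memory data frame in base R. The `chunk` data frame and its values are made up for illustration, and `select_columns` is a hypothetical name for the map function; only `important_columns` comes from the answer.

```r
# Columns to keep, as in the answer.
important_columns <- c("age", "education", "wage_per_hour", "weeks_worked_in_year")

# Hypothetical stand-in for one chunk `v` handed to the map function.
# The values are invented; only the column names follow the answer.
chunk <- data.frame(
  age = c(34L, 51L, 27L),
  class_of_worker = c("Private", "Not in universe", "Private"),
  education = c("High school graduate", "Bachelors degree", "Some college"),
  wage_per_hour = c(0L, 1200L, 500L),
  weeks_worked_in_year = c(52L, 0L, 40L),
  stringsAsFactors = FALSE
)

# Same body as the map function in the answer: the key `k` is ignored and
# the chunk is reduced to the important columns.
select_columns <- function(k, v) subset(v, select = important_columns)

reduced <- select_columns(NULL, chunk)
print(names(reduced))
```

The same selection can also be written as `v[, important_columns, drop = FALSE]`, which avoids the non-standard evaluation that `subset` performs.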

Data: http://kdd.ics.uci.edu/databases/census-income/
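
As for the error in the question: the `paste(rep(l, length(lvs)), rep(lvs, each = length(l)), sep = sep)` call in the warnings matches a line in the body of base R's `interaction()`, which builds one level for every combination of its inputs' levels. A plausible reading (an assumption, not confirmed in this thread) is that using the whole 42-column data frame `v` as the key caused the keys to be grouped this way, so the number of levels grew multiplicatively. The sketch below, with made-up factors, shows the effect on a small scale.

```r
# Three small factors standing in for three CSV columns (invented data).
education <- factor(c("HS", "BSc", "MSc", "HS"))
sex       <- factor(c("F", "M", "F", "M"))
race      <- factor(c("A", "B", "C", "A"))

# With the default drop = FALSE, interaction() creates one level for every
# combination of input levels, whether or not it occurs in the data.
crossed <- interaction(education, sex, race)

n_levels <- nlevels(crossed)
expected <- nlevels(education) * nlevels(sex) * nlevels(race)

print(c(n_levels = n_levels, expected = expected))
```

Only four rows exist here, yet the crossed factor carries the full product of the level counts. With 42 columns the product becomes far larger than the data itself, which is why keying on a few columns (or on none, as in the map-only job above) keeps memory use down.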

Score: 0

Original page content provided by Stack Overflow.

Original link: https://stackoverflow.com/questions/27513093
