
Hadoop distributed cache

Stack Overflow user
Asked 2013-12-20 06:56:21
2 answers · 2.6K views · 0 followers · 0 votes

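I am trying to use the Hadoop distributed cache so that a single mapper can work with two input sources.

I first built a prototype that joins two input files through the distributed cache, and that worked.

However, when I write a program that contains several MapReduce jobs, the distributed cache API does not work. In that program, the output of the previous job is used as one of the two input files of the next job, but nothing is emitted from the distributed cache file.

Here is my job driver.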

public int run(String[] args) throws Exception {
    Path InputPath = new Path(args[0]);
    Path Inter = new Path("Inters") ;//new Path(args[1]);
    Path OutputPath = new Path(args[1]);        

  JobConf conf = new JobConf(getConf(), Temp.class);
    FileSystem fs = FileSystem.get(getConf());
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(FirstMap.class);
    //conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    conf.setMapOutputKeyClass(Text.class);
    conf.setMapOutputValueClass(IntWritable.class);
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    //conf.setNumReduceTasks(0);


    //20131220 - to deal with paths as variables



    //fs.delete(Inter);

    //DistributedCache.addCacheFile(new Path(args[2]).toUri(), conf);
    FileInputFormat.setInputPaths(conf, InputPath);
    FileOutputFormat.setOutputPath(conf, Inter);
    conf.set("threshold", args[2]);
    JobClient.runJob(conf);


    // start job 2

    JobConf conf2 = new JobConf(getConf(), Temp.class);
    conf2.setJobName("shit");

    conf2.setMapOutputKeyClass(Text.class);
    conf2.setMapOutputValueClass(IntWritable.class);

    conf2.setOutputKeyClass(Text.class);
    conf2.setOutputValueClass(IntWritable.class);

    conf2.setMapperClass(Map.class);
    //conf.setCombinerClass(Reduce.class);
    conf2.setReducerClass(Reduce.class);
    conf2.setNumReduceTasks(0);
    conf2.setInputFormat(TextInputFormat.class);
    conf2.setOutputFormat(TextOutputFormat.class);


    //DistributedCache.addFileToClassPath(Inter, conf2);
    //DistributedCache.addCacheFile(Inter.toUri(), conf2);
    String InterToStroing = Inter.toString();
    Path Inters = new Path(InterToStroing);

    DistributedCache.addCacheFile(new Path(args[3]).toUri(), conf2);
    FileInputFormat.setInputPaths(conf2, InputPath);
    FileOutputFormat.setOutputPath(conf2, OutputPath);

    conf2.set("threshold", "0");
    JobClient.runJob(conf2);

    return 0;
}

In addition, here is the map function that handles the distributed cache.

public static class Map extends MapReduceBase implements
        Mapper<LongWritable, Text, Text, IntWritable> {

    static enum Counters {
        INPUT_WORDS
    }

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    private boolean caseSensitive = true;
    private Set<String> patternsToSkip = new HashSet<String>();

    private long numRecords = 0;
    private String inputFile;
    private Iterator<String> Iterator;

    private Path[] localFiles;
    public void configure (JobConf job) {
        try {
            localFiles = DistributedCache.getLocalCacheFiles(job);
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
        for (Path patternsFile : localFiles) {
            parseSkipFile(patternsFile);
        }
    }
    private void parseSkipFile(Path patternsFile) {
        try {
            BufferedReader fis = new BufferedReader(new FileReader(
                    patternsFile.toString()));
            String pattern = null;
            while ((pattern = fis.readLine()) != null) {
                //String [] StrArr = pattern.split(" ");
                System.err.println("Pattern : " + pattern );
                patternsToSkip.add(pattern);
            }
        } catch (IOException ioe) {
            System.err
                    .println("Caught exception while parsing the cached file '"
                            + patternsFile
                            + "' : "
                            + StringUtils.stringifyException(ioe));
        }
    }

    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        //output.collect(value, one);


        ArrayList<String> temp = new ArrayList<String>();

        String line = value.toString();

        Iterator = patternsToSkip.iterator();


        while (Iterator.hasNext()) {
            output.collect(new Text(Iterator.next()+"+"+value.toString()),one);
        }
        /*while (Iterator.hasNext()) {
            output.collect(new Text(Iterator.next().toString()), one);
        }*/
        //output.collect(value, one);


    }
}

Has anyone dealt with this problem?


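2 Answers

Stack Overflow user

Accepted answer

Posted 2014-03-20 21:20:45

Here is something I wrote in a university computer lab while practicing Hadoop. It uses multiple inputs and chained jobs to perform a reduce-side join.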

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// TextIntPair, TextLongIntPair, TextLongPair, StockInputFormat and the
// mapper/reducer classes are the answerer's own types and are not shown here.
public class StockJoinJob extends Configured {

    // Partition on the text part of the composite key only, so that all
    // records for the same text end up in the same reducer.
    public static class KeyPartitioner extends Partitioner<TextIntPair, TextLongIntPair> {
        @Override
        public int getPartition(TextIntPair key, TextLongIntPair value, int numPartitions) {
            return (key.getText().hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

    // First job: join the two inputs on the reduce side.
    public static int runJob(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf);
        job.setJarByClass(StockJoinJob.class);

        Path nasdaqPath = new Path(args[0]);
        Path listPath = new Path(args[1]);
        Path outputPath = new Path(args[2] + "-first");

        // Each input path gets its own input format and mapper.
        MultipleInputs.addInputPath(job, listPath, TextInputFormat.class, CompanyMapper.class);
        MultipleInputs.addInputPath(job, nasdaqPath, StockInputFormat.class, StockMapper.class);
        FileOutputFormat.setOutputPath(job, outputPath);

        job.setPartitionerClass(KeyPartitioner.class);
        job.setGroupingComparatorClass(TextIntPair.FirstComparator.class);

        job.setMapOutputKeyClass(TextIntPair.class);
        job.setMapOutputValueClass(TextLongIntPair.class);
        job.setReducerClass(JoinReducer.class);

        job.setOutputKeyClass(TextIntPair.class);
        job.setOutputValueClass(TextLongPair.class);

        return job.waitForCompletion(true) ? 0 : 1;
    }

    // Second job: reads the output directory of the first job.
    public static int runJob2(String[] args) throws Exception {
        // need first comparator like previous job
        Configuration conf = new Configuration();
        Job job = new Job(conf);

        job.setJarByClass(StockJoinJob.class);
        job.setReducerClass(TotalReducer.class);
        job.setMapperClass(TotalMapper.class);
        Path firstPath = new Path(args[2] + "-first");
        Path outputPath = new Path(args[2] + "-second");

        // reducer output
        job.setOutputKeyClass(TextIntPair.class);
        job.setOutputValueClass(TextLongPair.class);

        // mapper output
        job.setMapOutputKeyClass(TextIntPair.class);
        job.setMapOutputValueClass(TextIntPair.class);

        // etc
        FileInputFormat.setInputPaths(job, firstPath);
        FileOutputFormat.setOutputPath(job, outputPath);
        // Remove stale output so the job does not fail on an existing directory.
        outputPath.getFileSystem(conf).delete(outputPath, true);

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int firstCode = runJob(args);
        if (firstCode != 0) {
            // Do not start the second job if the first one failed.
            System.exit(firstCode);
        }
        System.exit(runJob2(args));
    }
}
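
For readers who have not seen a reduce-side join before, the idea behind the first job can be sketched in plain Java without any Hadoop classes. Everything in this sketch is hypothetical: the `Tagged` type, the `"company"` and `"trade"` tags, and the sample fields are illustrative and are not taken from the answer's `TextIntPair` or mapper classes. The answer's job presumably uses its composite key and grouping comparator to control the order in which records reach the reducer; the sketch replaces that with a simple scan over each group.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java illustration of a reduce-side join; no Hadoop classes are used. */
final class ReduceSideJoinSketch {

    /** A value tagged with the input it came from (hypothetical type). */
    static final class Tagged {
        final String source;
        final String payload;

        Tagged(String source, String payload) {
            this.source = source;
            this.payload = payload;
        }
    }

    /** "Map + shuffle": tag each record with its source and group by join key. */
    static Map<String, List<Tagged>> shuffle(Map<String, String> companies,
                                             List<String[]> trades) {
        Map<String, List<Tagged>> groups = new TreeMap<>();
        companies.forEach((symbol, name) ->
                groups.computeIfAbsent(symbol, k -> new ArrayList<>())
                      .add(new Tagged("company", name)));
        for (String[] trade : trades) { // trade = {symbol, price}
            groups.computeIfAbsent(trade[0], k -> new ArrayList<>())
                  .add(new Tagged("trade", trade[1]));
        }
        return groups;
    }

    /** "Reduce": within one key, pair the company record with every trade. */
    static List<String> join(Map<String, List<Tagged>> groups) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<Tagged>> e : groups.entrySet()) {
            String name = null;
            for (Tagged t : e.getValue()) {
                if (t.source.equals("company")) {
                    name = t.payload;
                }
            }
            if (name == null) {
                continue; // inner join: skip keys that have no company record
            }
            for (Tagged t : e.getValue()) {
                if (t.source.equals("trade")) {
                    out.add(e.getKey() + "\t" + name + "\t" + t.payload);
                }
            }
        }
        return out;
    }
}
```

In a real job the two mappers would do the tagging and the framework would do the grouping, so only the body of `join` corresponds to reducer code.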
Votes: 1

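Stack Overflow user

Posted 2013-12-20 09:54:20

I'm not sure exactly what the problem is (maybe you should rephrase it), but I suggest you read the Yahoo tutorial on chaining jobs. I see two options here:

  • If you run exactly the same map and do not care about execution order (in other words, the two jobs could run in parallel), I suggest creating a single job with two input paths. You can do that with: FileInputFormat.setInputPaths(conf, new Path(args[0])); FileInputFormat.addInputPath(conf, new Path(args[1]));
  • Otherwise, I think you need to add the two separate job drivers to a new "chain" driver and then add a dependency (for example, the second job depends on the first, so it should run once the first has finished). You can then declare the distributed cache in the second job's driver. Hope this helps.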
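
One detail worth checking when the second job's cache file comes from the first job: a MapReduce output path is a directory containing `part-*` files (plus marker files such as `_SUCCESS`), not a single file. The sketch below shows the selection step on a local directory using only `java.nio`; `PartFileLister` and `partFiles` are hypothetical names. In a Hadoop driver the analogous step would list the intermediate directory with `FileSystem.listStatus` and pass each part file's URI to `DistributedCache.addCacheFile`. That part is not shown, and whether it resolves the asker's problem is an assumption.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Local-filesystem stand-in for picking the data files out of a job output directory. */
final class PartFileLister {

    /** Returns the regular files named part-* inside outputDir, sorted by path. */
    static List<Path> partFiles(Path outputDir) throws IOException {
        List<Path> parts = new ArrayList<>();
        // The glob skips marker files such as _SUCCESS.
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(outputDir, "part-*")) {
            for (Path p : stream) {
                if (Files.isRegularFile(p)) {
                    parts.add(p);
                }
            }
        }
        Collections.sort(parts);
        return parts;
    }
}
```

Each returned path would then be registered as its own cache file in the second job's driver, before that job is submitted.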
Votes: 0
Original page content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/20698001
