Q: Hadoop distributed cache

Stack Overflow user

Asked 2013-12-20 06:56:21

2 answers · 2.6K views · 0 followers · 0 votes

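I'm trying to use the Hadoop distributed cache so that a single mapper can work with two input sources.

I first built a prototype that joins two input files using the distributed cache, and that worked.

However, when I write a program that contains several MapReduce jobs, where the output of the previous job is used as one of the two input files of the next job, the distributed cache API does not work: nothing from the distributed cache file is emitted.

Here is my job driver.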

public int run(String[] args) throws Exception {
    Path InputPath = new Path(args[0]);
    Path Inter = new Path("Inters") ;//new Path(args[1]);
    Path OutputPath = new Path(args[1]);        

  JobConf conf = new JobConf(getConf(), Temp.class);
    FileSystem fs = FileSystem.get(getConf());
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(FirstMap.class);
    //conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    conf.setMapOutputKeyClass(Text.class);
    conf.setMapOutputValueClass(IntWritable.class);
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    //conf.setNumReduceTasks(0);


    //20131220 - to deal with paths as variables



    //fs.delete(Inter);

    //DistributedCache.addCacheFile(new Path(args[2]).toUri(), conf);
    FileInputFormat.setInputPaths(conf, InputPath);
    FileOutputFormat.setOutputPath(conf, Inter);
    conf.set("threshold", args[2]);
    JobClient.runJob(conf);


    // start job 2

    JobConf conf2 = new JobConf(getConf(), Temp.class);
    conf2.setJobName("shit");

    conf2.setMapOutputKeyClass(Text.class);
    conf2.setMapOutputValueClass(IntWritable.class);

    conf2.setOutputKeyClass(Text.class);
    conf2.setOutputValueClass(IntWritable.class);

    conf2.setMapperClass(Map.class);
    //conf.setCombinerClass(Reduce.class);
    conf2.setReducerClass(Reduce.class);
    conf2.setNumReduceTasks(0);
    conf2.setInputFormat(TextInputFormat.class);
    conf2.setOutputFormat(TextOutputFormat.class);


    //DistributedCache.addFileToClassPath(Inter, conf2);
    //DistributedCache.addCacheFile(Inter.toUri(), conf2);
    String InterToStroing = Inter.toString();
    Path Inters = new Path(InterToStroing);

    DistributedCache.addCacheFile(new Path(args[3]).toUri(), conf2);
    FileInputFormat.setInputPaths(conf2, InputPath);
    FileOutputFormat.setOutputPath(conf2, OutputPath);

    conf2.set("threshold", "0");
    JobClient.runJob(conf2);

    return 0;
}

Here is also the map function that handles the distributed cache.

public static class Map extends MapReduceBase implements
        Mapper<LongWritable, Text, Text, IntWritable> {

    static enum Counters {
        INPUT_WORDS
    }

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    private boolean caseSensitive = true;
    private Set<String> patternsToSkip = new HashSet<String>();

    private long numRecords = 0;
    private String inputFile;
    private Iterator<String> Iterator;

    private Path[] localFiles;
    public void configure (JobConf job) {
        try {
            localFiles = DistributedCache.getLocalCacheFiles(job);
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
        for (Path patternsFile : localFiles) {
            parseSkipFile(patternsFile);
        }
    }
    private void parseSkipFile(Path patternsFile) {
        try {
            BufferedReader fis = new BufferedReader(new FileReader(
                    patternsFile.toString()));
            String pattern = null;
            while ((pattern = fis.readLine()) != null) {
                //String [] StrArr = pattern.split(" ");
                System.err.println("Pattern : " + pattern );
                patternsToSkip.add(pattern);
            }
        } catch (IOException ioe) {
            System.err
                    .println("Caught exception while parsing the cached file '"
                            + patternsFile
                            + "' : "
                            + StringUtils.stringifyException(ioe));
        }
    }

    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        //output.collect(value, one);


        ArrayList<String> temp = new ArrayList<String>();

        String line = value.toString();

        Iterator = patternsToSkip.iterator();


        while (Iterator.hasNext()) {
            output.collect(new Text(Iterator.next()+"+"+value.toString()),one);
        }
        /*while (Iterator.hasNext()) {
            output.collect(new Text(Iterator.next().toString()), one);
        }*/
        //output.collect(value, one);


    }
}
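One property of this mapper is worth spelling out: map() emits one record per entry in patternsToSkip, so if that set is empty (for example because no cache file was localized, or parseSkipFile hit an exception), the job produces no output at all. The following is a minimal plain-Java model of that loop, with no Hadoop dependency; the class name CrossEmitModel and the sample values are hypothetical and only illustrate the behavior.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the mapper above; not part of the original program.
class CrossEmitModel {
    // Mirrors the while-loop in map(): one output key per cached pattern,
    // built as "pattern+line".
    static List<String> emit(Set<String> patternsToSkip, String line) {
        List<String> out = new ArrayList<>();
        for (String pattern : patternsToSkip) {
            out.add(pattern + "+" + line);
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> loaded = new LinkedHashSet<>();
        loaded.add("a");
        loaded.add("b");
        // Cache file was read: one record per pattern.
        System.out.println(emit(loaded, "x"));
        // Cache file missing or unreadable: the set is empty, nothing is emitted.
        System.out.println(emit(new LinkedHashSet<>(), "x"));
    }
}
```

If the real job shows the second behavior, checking the task's stderr for the "Pattern : " lines printed by parseSkipFile would show whether the cached file was read at all.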

Has anyone dealt with this problem?

distributed

java

hadoop

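2 Answers

Stack Overflow user

Accepted answer

Posted 2014-03-20 21:20:45

Here is something I wrote while practicing Hadoop in a university computer lab. It uses multiple inputs and chained jobs to do a reduce-side join.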

public class StockJoinJob extends Configured  {

public static class KeyPartitioner extends Partitioner<TextIntPair, TextLongIntPair> {
@Override
public int getPartition(TextIntPair key, TextLongIntPair value, int numPartitions) {
  return (key.getText().hashCode() & Integer.MAX_VALUE) % numPartitions;
}
}  

public static int runJob(String[] args) throws Exception {
      Configuration conf = new Configuration();
      Job job = new Job(conf);
  job.setJarByClass(StockJoinJob.class);

  Path nasdaqPath = new Path(args[0]);
  Path listPath = new Path(args[1]);
  Path outputPath = new Path(args[2]+"-first");

  MultipleInputs.addInputPath(job, listPath, TextInputFormat.class, CompanyMapper.class);
  MultipleInputs.addInputPath(job, nasdaqPath,
  StockInputFormat.class, StockMapper.class);
  FileOutputFormat.setOutputPath(job, outputPath);

  job.setPartitionerClass(KeyPartitioner.class);
  job.setGroupingComparatorClass(TextIntPair.FirstComparator.class);

  job.setMapOutputKeyClass(TextIntPair.class);
  job.setMapOutputValueClass(TextLongIntPair.class);
  job.setReducerClass(JoinReducer.class);

  job.setOutputKeyClass(TextIntPair.class);
  job.setOutputValueClass(TextLongPair.class);

  return job.waitForCompletion(true) ? 0 : 1;
    }

    public static int runJob2(String[] args) throws Exception {
  //need first comparator like previous job
  Configuration conf = new Configuration();
      Job job = new Job(conf);

  job.setJarByClass(StockJoinJob.class);
  job.setReducerClass(TotalReducer.class);
      job.setMapperClass(TotalMapper.class);
  Path firstPath = new Path(args[2]+"-first");
  Path outputPath = new Path(args[2]+"-second");

  //reducer output//
  job.setOutputKeyClass(TextIntPair.class);
  job.setOutputValueClass(TextLongPair.class);

  //mapper output//
  job.setMapOutputKeyClass(TextIntPair.class);
  job.setMapOutputValueClass(TextIntPair.class);      

  //etc            
  FileInputFormat.setInputPaths(job, firstPath);
  FileOutputFormat.setOutputPath(job, outputPath);
  outputPath.getFileSystem(conf).delete(outputPath, true);

  return job.waitForCompletion(true) ? 0 : 1;
    }



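public static void main(String[] args) throws Exception {
  // Run the second job only if the first one succeeded,
  // and always exit with the code of the last job that ran.
  int exitCode = runJob(args);
  if (exitCode == 0) {
    exitCode = runJob2(args);
  }
  System.exit(exitCode);
}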
 }
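The custom classes used above (TextIntPair, JoinReducer and so on) are not shown, so the following is only a plain-Java sketch of what a reduce-side join does in general, not the author's implementation. It assumes each mapper tags its records with the source they came from, the shuffle groups records by join key, and the reducer pairs up the two sides. All names and sample values are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of a reduce-side join; no Hadoop dependency.
class ReduceSideJoinModel {
    static final int COMPANY = 0; // record came from the company list
    static final int STOCK = 1;   // record came from the stock data

    // What each mapper would emit: join key, source tag, payload.
    static final class Tagged {
        final String key;
        final int tag;
        final String value;

        Tagged(String key, int tag, String value) {
            this.key = key;
            this.tag = tag;
            this.value = value;
        }
    }

    static List<String> join(List<Tagged> mapOutput) {
        // "Shuffle": group all records by join key, keys in sorted order.
        Map<String, List<Tagged>> groups = new TreeMap<>();
        for (Tagged t : mapOutput) {
            groups.computeIfAbsent(t.key, k -> new ArrayList<>()).add(t);
        }
        // "Reduce": within one key, pair every company record with every stock record.
        List<String> joined = new ArrayList<>();
        for (Map.Entry<String, List<Tagged>> group : groups.entrySet()) {
            List<String> companies = new ArrayList<>();
            List<String> stocks = new ArrayList<>();
            for (Tagged t : group.getValue()) {
                if (t.tag == COMPANY) {
                    companies.add(t.value);
                } else {
                    stocks.add(t.value);
                }
            }
            for (String c : companies) {
                for (String s : stocks) {
                    joined.add(group.getKey() + "\t" + c + "\t" + s);
                }
            }
        }
        return joined;
    }
}
```

Keys that appear on only one side produce no output here, which makes this an inner join. In a real job the partitioner and grouping comparator on the first field of the composite key play the role of the grouping step.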

1 vote

Stack Overflow user

Posted 2013-12-20 09:54:20

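I'm not sure exactly what the problem is (maybe you should rephrase the question), but I suggest reading the Yahoo tutorial on chaining jobs. I see two options here:

1. If you run exactly the same map and don't care about execution order (in other words, the two jobs could run in parallel), I'd suggest creating a single job with two input paths. You can do that with: FileInputFormat.setInputPaths(conf, new Path(args[0])); FileInputFormat.addInputPath(conf, new Path(args[1]));
2. I think you need to put the two separate job drivers into a new "chain" driver and then add a dependency (for example, the second job depends on the first, so it should run once the first has finished). You can then declare the distributed cache in the second job's driver. Hope this helps.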
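When the cached data is the first job's output, keep in mind that a MapReduce output path is a directory holding part-* files (plus markers such as _SUCCESS), not a single file. One plausible approach, offered here as an assumption rather than a confirmed fix, is to list the part files after the first job finishes and register each one with the second job. The sketch below models only the listing step with java.nio on a local directory; the class name PartFileLister is hypothetical.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper; uses the local file system as a stand-in for HDFS.
class PartFileLister {
    // Returns the part-* files in a job output directory, sorted by path,
    // skipping marker files such as _SUCCESS.
    static List<Path> partFiles(Path outputDir) throws IOException {
        List<Path> parts = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(outputDir, "part-*")) {
            for (Path p : stream) {
                parts.add(p);
            }
        }
        Collections.sort(parts);
        return parts;
    }
}
```

In a real driver the equivalent would be to iterate over FileSystem.listStatus on the intermediate path and call DistributedCache.addCacheFile(status.getPath().toUri(), conf2) for each part file, before submitting the second job.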

0 votes

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/20698001
