Q: Does MapReduce TotalOrderPartitioning write output to only one file?

Stack Overflow user

Asked 2016-10-18 07:27:51

1 answer, 93 views, 0 followers, 1 vote

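I am running a MapReduce job that reads its input and sorts it using multiple reducers. With the number of reducers set to 5, I do get sorted output. However, all of the output is written to a single file, and the other 4 files are empty. I am using InputSampler and TotalOrderPartitioner for the global sort.

My driver looks like this: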

    int numReduceTasks = 5;
    Configuration conf = new Configuration();
    Job job = new Job(conf, "DictionarySorter");
    job.setJarByClass(SampleEMR.class);
    job.setMapperClass(SortMapper.class);
    job.setReducerClass(SortReducer.class);
    job.setPartitionerClass(TotalOrderPartitioner.class);
    job.setNumReduceTasks(numReduceTasks);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);

    FileInputFormat.setInputPaths(job, input);
    FileOutputFormat.setOutputPath(job, new Path(output
            + ".dictionary.sorted." + getCurrentDateTime()));

    // Partition file that TotalOrderPartitioner reads its split points from
    Path inputDir = new Path("/others/partitions");
    Path partitionFile = new Path(inputDir, "partitioning");
    TotalOrderPartitioner.setPartitionFile(job.getConfiguration(),
            partitionFile);

    // RandomSampler(freq, numSamples, maxSplitsSampled)
    double pcnt = 1.0;
    int numSamples = numReduceTasks;
    int maxSplits = numReduceTasks - 1;
    if (0 >= maxSplits)
        maxSplits = Integer.MAX_VALUE;

    InputSampler.Sampler<LongWritable, Text> sampler = new InputSampler.RandomSampler<LongWritable, Text>(pcnt,
            numSamples, maxSplits);
    InputSampler.writePartitionFile(job, sampler);
    job.waitForCompletion(true);

hadoop

totalorderpartitioner

1 Answer

Stack Overflow user

Answered 2016-10-18 14:59:49

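Your RandomSampler parameters look suspicious to me:

The first parameter, freq, is a probability, not a percentage. With pcnt = 1, you are sampling 100% of the records.
The second parameter, numSamples, should be larger. It needs to be large enough to represent the distribution of the whole data set.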

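Suppose you have the following keys: 4, 7, 8, 9, 4, 1, 2, 5, 6, 3, 2, 4, 7, 8, 1, 1, 8, 9, 9, 9, 9, 9

Take freq = 0.3 and numSamples = 10. For simplicity, assume that 0.3 means one key out of every 3 is sampled. This collects the following sample: 4, 9, 2, 3, 7, 1, 8, 9. Sorted, that is 1, 2, 3, 4, 7, 8, 9, 9. The sample has 8 elements, so all of them are kept, because that does not exceed the maximum number of samples, numSamples = 10. From this sample, the reducer boundaries would be something like 2, 4, 8, 9. This means that a pair with key "1" ends up in Reducer #1, a pair with key "2" ends up in Reducer #2, a pair with key "5" ends up in Reducer #3, and so on. That would be a good distribution.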
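The mapping from boundaries to reducers described above can be sketched in plain Java. This is only an illustration, not Hadoop code: the class name TotalOrderSketch and its methods are made up for this example, the keys are plain ints rather than Writables, and the split points 2, 4, 8, 9 are taken directly from the example above instead of being computed by a sampler. It assumes that a key equal to a split point goes to the reducer on the right of that split point, which is consistent with key "2" landing in Reducer #2.

```java
import java.util.Arrays;

class TotalOrderSketch {
    // Returns the 0-based reducer index for a key, given sorted split points.
    // Keys below the first split point go to reducer 0; a key equal to
    // splitPoints[i] goes to reducer i + 1.
    static int partitionFor(int[] splitPoints, int key) {
        int pos = Arrays.binarySearch(splitPoints, key);
        // Found at index i -> i + 1; not found -> the insertion point.
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }

    // Counts how many keys land on each reducer.
    static int[] load(int[] splitPoints, int[] keys) {
        int[] counts = new int[splitPoints.length + 1];
        for (int key : keys) {
            counts[partitionFor(splitPoints, key)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] keys = {4, 7, 8, 9, 4, 1, 2, 5, 6, 3, 2, 4, 7, 8, 1, 1, 8, 9, 9, 9, 9, 9};
        int[] splitPoints = {2, 4, 8, 9}; // boundaries from the example above
        // Prints [3, 3, 7, 3, 6]: every reducer receives some of the 22 keys.
        System.out.println(Arrays.toString(load(splitPoints, keys)));
    }
}
```

With four distinct split points, all five reducers get work, which is what a representative sample is supposed to achieve.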

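Now let's run your values on the same example keys. Your freq = 1, so every key is taken into the sample, and the sample starts out identical to the initial key set. However, you set the maximum number of samples to numSamples = 4, which means only 4 elements are kept in the sample. Your final sample might be 9, 9, 9, 9. In that case all of your boundaries are the same, so all pairs go to Reducer #5.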
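To see how freq and numSamples interact, here is a simplified in-memory simulation. It is loosely modeled on my understanding of how a capped random sampler such as InputSampler.RandomSampler behaves (accept each record with probability freq; once the cap is reached, overwrite a random slot and lower freq). The class SamplerSketch, the seed parameter, and the int keys are assumptions made for this sketch; the real sampler reads records from input splits, so treat the details as approximate.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

class SamplerSketch {
    // Simplified stand-in for a capped random sampler (not Hadoop code).
    static List<Integer> sample(int[] keys, double freq, int numSamples, long seed) {
        Random r = new Random(seed);
        List<Integer> samples = new ArrayList<>();
        for (int key : keys) {
            if (r.nextDouble() <= freq) {
                if (samples.size() < numSamples) {
                    samples.add(key);
                } else {
                    // Cap reached: overwrite a random slot, then lower freq
                    // so that later records are accepted less often.
                    samples.set(r.nextInt(numSamples), key);
                    freq *= (numSamples - 1) / (double) numSamples;
                }
            }
        }
        return samples;
    }

    public static void main(String[] args) {
        int[] keys = {4, 7, 8, 9, 4, 1, 2, 5, 6, 3, 2, 4, 7, 8, 1, 1, 8, 9, 9, 9, 9, 9};
        // freq = 1.0 with a cap of 4: every key is a candidate, only 4 survive.
        System.out.println(sample(keys, 1.0, 4, 42L));
        // freq = 0.3 with a cap of 10: fewer candidates, more room to keep them.
        System.out.println(sample(keys, 0.3, 10, 42L));
    }
}
```

Whatever the random choices are, the first call can never return more than 4 keys, so all split points have to be derived from those 4 values.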

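In my example it looks like we were very unlucky to end up with the same last 4 keys. But if your original data set is already sorted, this is quite likely to happen when you combine a high frequency with a small number of samples (and the distribution of the boundaries will certainly be bad).

This blog post has a lot of details on sampling and TotalOrderPartitioning.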

Votes: 0

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/40097216
