
Reading a HAR file from the DistributedCache in MapReduce

Stack Overflow user
Asked 2013-03-04 20:51:10
Answers: 1 · Views: 747 · Followers: 0 · Votes: 1

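I have written an Oozie workflow that creates a HAR archive and then runs an MR job that needs to read data from that archive.

1. The archive is created.
2. When the job runs, the mapper does see the archive in the distributed cache.
3. ??? How do I read this archive? What is the API for reading data from it line by line? (My HAR is a batch of several newline-delimited text files.)

Note: when I work with an ordinary file (not a HAR archive) stored in the DistributedCache, everything works fine. The problem only appears when I try to read data from the HAR.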

Here is a code snippet:

    InputStream inputStream;
    String cachedDatafileName = System.getProperty(DIST_CACHE_FILE_NAME);
    LOG.info(String.format("Looking for[%s]=[%s] in DistributedCache",DIST_CACHE_FILE_NAME, cachedDatafileName));

    URI[] uris = DistributedCache.getCacheArchives(getContext().getConfiguration());
    URI uriToCachedDatafile = null;
    for(URI uri : uris){
        if(uri.toString().endsWith(cachedDatafileName)){
            uriToCachedDatafile = uri;
            break;
        }
    }
    if(uriToCachedDatafile == null){
        throw new RuntimeConfigurationException(String.format("Looking for[%s]=[%s] in DistributedCache failed. There is no such file",
                DIST_CACHE_FILE_NAME, cachedDatafileName));
    }

    Path pathToFile = new Path(uriToCachedDatafile);
    LOG.info(String.format("[%s] has been found. Uri is: [%s]. The path is:[%s]",cachedDatafileName, uriToCachedDatafile, pathToFile));

    FileSystem fileSystem =  pathToFile.getFileSystem(getContext().getConfiguration());
    HarFileSystem harFileSystem = new HarFileSystem(fileSystem);
    inputStream = harFileSystem.open(pathToFile); //NULL POINTER EXCEPTION IS HERE!
    return inputStream;

1 Answer

Stack Overflow user

Accepted answer

Answered 2013-03-13 14:40:50

protected InputStream getInputStreamToDistCacheFile() throws IOException{
        InputStream inputStream;
        String cachedDatafileName = System.getProperty(DIST_CACHE_FILE_NAME);
        LOG.info(String.format("Looking for[%s]=[%s] in DistributedCache",DIST_CACHE_FILE_NAME, cachedDatafileName));

        URI[] uris = DistributedCache.getCacheArchives(getContext().getConfiguration());
        URI uriToCachedDatafile = null;
        for(URI uri : uris){
            if(uri.toString().endsWith(cachedDatafileName)){
                uriToCachedDatafile = uri;
                break;
            }
        }
        if(uriToCachedDatafile == null){
            throw new RuntimeConfigurationException(String.format("Looking for[%s]=[%s] in DistributedCache failed. There is no such file",
                    DIST_CACHE_FILE_NAME, cachedDatafileName));
        }

        //Path pathToFile = new Path(uriToCachedDatafile +"/stf/db_bts_stf.txt");
        Path pathToFile = new Path("har:///"+"home/ssa/devel/megalabs/kyc-solution/kyc-mrjob/target/test-classes/GSMCellSubscriberHomeIntersectionJobDescriptionClusterMRTest/in/gsm_cell_location_stf.har" +"/stf/db_bts_stf.txt");
        //Path pathToFile = new Path(("har://home/ssa/devel/megalabs/kyc-solution/kyc-mrjob/target/test-classes/GSMCellSubscriberHomeIntersectionJobDescriptionClusterMRTest/in/gsm_cell_location_stf.har"));

        LOG.info(String.format("[%s] has been found. Uri is: [%s]. The path is:[%s]",cachedDatafileName, uriToCachedDatafile, pathToFile));
        FileSystem harFileSystem = pathToFile.getFileSystem(context.getConfiguration());
        FSDataInputStream fin = harFileSystem.open(pathToFile);
        LOG.info("fin: " + fin);
//        FileSystem fileSystem =  pathToFile.getFileSystem(getContext().getConfiguration());
//        HarFileSystem harFileSystem = new HarFileSystem(fileSystem);
//        harFileSystem.exists(new Path("har://home/ssa/devel/mycompany/my-solution/my-mrjob/target/test-classes/HomeJobDescriptionClusterMRTest/in/locations.har"));
//        LOG.info("harFileSystem.exists(pathToFile):"+ harFileSystem.exists(pathToFile));
//        harFileSystem.initialize(uriToCachedDatafile, context.getConfiguration());



        FileStatus[] statuses = harFileSystem.listStatus(new Path("har:///"+"home/ssa/devel/mycompany/my-solution/my-mrjob/target/test-classes/HomeJobDescriptionClusterMRTest/in/locations.har"));
        for(FileStatus fileStatus : statuses){
            LOG.info("fileStatus isDir"+fileStatus.isDirectory() +" len:" + fileStatus.getLen());
        }

//        String tmpPathToFile = "har:///"+pathToFile.toString(); //+"/stf/db_bts_stf.txt";
//        Path tmpPath = new Path(tmpPathToFile);
//        LOG.info("KILL ME PATH TO FILE IN ARCHIVE: " +tmpPath);
//        inputStream = harFileSystem.open(tmpPath);
//        return inputStream;
        return fin;
    }

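As you can see, it's awful. You have to manually read the index file stored in the archive and reconstruct the paths from the index file's metadata. If you know the exact names of the files stored in the archive (as in my example), you can construct the path by hand.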
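
As a rough illustration of building the path by hand: the Hadoop Archives guide describes HAR URIs as `har://scheme-hostname:port/archivepath/fileinarchive`, or `har:///archivepath/fileinarchive` when the default filesystem is used. The sketch below derives such a string from an archive URI of the kind returned by `DistributedCache.getCacheArchives`. The class name `HarUris`, the method `toHarPath`, and the host `namenode:8020` are hypothetical. The sketch deliberately rejects URIs that have a scheme but no host (for example `file:///...`), and it does not check that the resulting path actually opens on a given cluster.

```java
import java.net.URI;

final class HarUris {

    /**
     * Builds a har:// path string for one entry inside an archive.
     * archive: URI of the .har archive on the underlying filesystem.
     * entry:   path of the file inside the archive, e.g. "stf/db_bts_stf.txt".
     */
    static String toHarPath(URI archive, String entry) {
        String archivePath = archive.getPath(); // excludes any "#link" fragment
        if (archivePath == null || !archivePath.endsWith(".har")) {
            throw new IllegalArgumentException("Not a .har archive: " + archive);
        }

        String authority;
        if (archive.getScheme() == null) {
            // No scheme: fall back to the har:/// form (default filesystem).
            authority = "";
        } else if (archive.getHost() == null) {
            // Assumption of this sketch: scheme without host is not handled.
            throw new IllegalArgumentException("No host in archive URI: " + archive);
        } else {
            // har://<underlying scheme>-<host>[:<port>]
            authority = archive.getScheme() + "-" + archive.getHost()
                    + (archive.getPort() == -1 ? "" : ":" + archive.getPort());
        }

        String relative = entry.startsWith("/") ? entry.substring(1) : entry;
        return "har://" + authority + archivePath + "/" + relative;
    }

    public static void main(String[] args) {
        // Hypothetical archive locations, for illustration only.
        System.out.println(toHarPath(
                URI.create("hdfs://namenode:8020/user/ssa/locations.har"),
                "stf/db_bts_stf.txt"));
        System.out.println(toHarPath(
                URI.create("/home/ssa/in/locations.har"),
                "stf/db_bts_stf.txt"));
    }
}
```

The returned string can be passed to `new Path(...)` and opened through `path.getFileSystem(conf).open(path)`, in the same way as the hard-coded path in the code above.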

This is not convenient. I expected something like Zip -> ZipEntry, where you can iterate over the entries of an archive without knowing its structure.
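
On the "line by line" part of the question: `FSDataInputStream` is an ordinary `java.io.InputStream`, so once a file inside the archive has been opened, the standard `BufferedReader` can do the line splitting. The following is a minimal sketch under that assumption. The names `HarLines` and `forEachLine` are hypothetical, UTF-8 encoding is assumed, and the `ByteArrayInputStream` in `main` only stands in for the stream returned by `harFileSystem.open(pathToFile)`.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;

final class HarLines {

    /**
     * Feeds every line of the stream to the handler and returns the line count.
     * The stream is closed when reading finishes or fails.
     */
    static long forEachLine(InputStream in, Consumer<String> handler) {
        long count = 0;
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                handler.accept(line);
                count++;
            }
        } catch (IOException e) {
            // Rethrown unchecked so callers can use lambdas without try/catch.
            throw new UncheckedIOException(e);
        }
        return count;
    }

    public static void main(String[] args) {
        // Stand-in for the stream opened from the HAR archive.
        InputStream in = new ByteArrayInputStream(
                "first line\nsecond line\n".getBytes(StandardCharsets.UTF_8));
        long lines = forEachLine(in, System.out::println);
        System.out.println("lines read: " + lines);
    }
}
```

Passing a handler instead of collecting a list keeps memory use flat when the archived text files are large.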

Votes: 0
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/15202026
