Efficient import of semi-structured text

Stack Overflow user
Asked 2018-06-08 12:21:29
1 answer, 78 views, 0 followers, 1 vote

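I have multiple text files saved from a Tekscan pressure mapping system. I am trying to find the most efficient way to import multiple comma-delimited matrices into one 3-D matrix of type uint8. I have developed a solution that makes repeated calls to the MATLAB function dlmread. Unfortunately, it takes roughly 1.5 minutes to import the data. I have included the code below.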

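This code calls two other functions I have written, metaextract and framecount, which I have not included because they are not really relevant to the question at hand.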

Below are two links to examples of the files I am working with.

The first is a shorter file with 90 samples

The second is a longer file with 3458 samples

Any help would be greatly appreciated.

function pressureData = tekscanimport
% Import TekScan data from .asf file to 3-D matrix of type uint8.

[id,path] = uigetfile('*.asf'); %User input for .asf file
if isequal(id,0) %uigetfile returns 0 for both outputs on cancel
    error('You must select a file to continue')
end

path = fullfile(path,id); %Join folder and file name into the full path

% function calls
pressureData.metaData = metaextract(path);
nLines = linecount(path); %Find number of lines in file
nFrames = framecount(path,nLines);%Find number of frames

rowStart = 25; %Default starting row to read from tekscan .asf file
rowEnd = rowStart + 41; %Frames are 42 rows long
colStart = 0;%Default starting col to read from tekscan .asf file
colEnd = 47;%Frames are 48 columns wide
pressureData.frames = zeros([42,48,nFrames],'uint8');%Preallocate for speed

f = waitbar(0,'1','Name','loading Data...',...
    'CreateCancelBtn','setappdata(gcbf,''canceling'',1)');
setappdata(f,'canceling',0);

for i = 1:nFrames %Loop through file skipping frame metadata
    if getappdata(f,'canceling')
        break
    end
    waitbar(i/nFrames,f,sprintf('Loaded %.2f%%', i/nFrames*100));

    %Make repeated calls to dlmread
    pressureData.frames(:,:,i) = dlmread(path,',',[rowStart,colStart,rowEnd,colEnd]);
    rowStart = rowStart + 44;
    rowEnd = rowStart + 41;
end
delete(f)
end

1 Answer

Stack Overflow user

Accepted answer

Posted 2018-06-09 02:36:14

I had a go at it. This code opens your large file in 3.6 seconds on my PC. The trick is to use sscanf instead of the str2double and str2num functions.

clear all;tic
fid = fopen('tekscanlarge.txt','rt');
%read the header, stop at frame
header='';
l = fgetl(fid);
while length(l)>5&&~strcmp(l(1:5),'Frame')
    header=[header,l,sprintf('\n')];
    l = fgetl(fid);
    if length(l)<5,l(end+1:5)=' ';end
end
%all data at once
dat = fread(fid,inf,'*char');
fclose(fid);
%allocate space
res = zeros([48,42,3458],'uint8');
%get all line endings
LE = [0,regexp(dat','\n')];
i=1;
for ct = 2:length(LE)-1 %go line by line
    L = dat(LE(ct-1)+1:LE(ct)-1);
    if isempty(L),continue;end
    if all(L(1:5)==['Frame']')
        fr = sscanf(L(7:end),'%u');
        i=1;
        continue;
    end
    % sscanf can only handle a row char vector with space separation.
    res(:,i,fr) = uint8(sscanf(strrep(L',',',' '),'%u'));
    i=i+1;
end
toc

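Does anyone know a faster way to do the conversion than sscanf? Most of the run time (2.17 seconds) is spent in that function. For a 13.1 MB dataset, I find that very slow compared to the speed of memory.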

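I found a way to do it in 0.2 seconds, which might be useful to others as well. This MEX file scans a list of char values for numbers and reports them back. Save it as mexscan.c and run mex mexscan.c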

#include "mex.h" 
/* The computational routine */
void calc(unsigned char *in, unsigned char *out, long Sout, long Sin)
{
    long ct = 0;
    int newnumber=0; 
    for (int i=0;i<Sin;i+=2){
        if (in[i]>=48 && in[i]<=57) { //it is a number
            out[ct]=out[ct]*10+in[i]-48;
            newnumber=1;
        } else { //it is not a number
            if (newnumber==1){
                ct++;
                if (ct>=Sout){return;} /* out holds Sout elements; stop before writing past the end */
            }
            newnumber=0;
        }
    }    
}

/* The gateway function */
void mexFunction( int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    unsigned char *in;             /* input vector */
    long Sout;                     /* input size of output vector */
    long Sin;                      /* size of input vector */
    unsigned char *out;            /* output vector*/

    /* check for proper number of arguments */
    if(nrhs!=2) {
        mexErrMsgIdAndTxt("MyToolbox:arrayProduct:nrhs","Two inputs required.");
    }
    if(nlhs!=1) {
        mexErrMsgIdAndTxt("MyToolbox:arrayProduct:nlhs","One output required.");
    }
    /* make sure the first input argument is type char */
    if(!mxIsClass(prhs[0], "char"))  {
        mexErrMsgIdAndTxt("MyToolbox:arrayProduct:notDouble","Input matrix must be type char.");
    }
    /* make sure the second input argument is type uint32 */
    if(!mxIsClass(prhs[1], "uint32"))  {
        mexErrMsgIdAndTxt("MyToolbox:arrayProduct:notUint32","Second input must be type uint32.");
    }

    /* input length in bytes: MATLAB chars are 16-bit, so a column of M chars is 2*M bytes */
    Sin = mxGetM(prhs[0])*2;
    /* create a pointer to the raw data in the input matrix */
    in = (unsigned char *) mxGetData(prhs[0]);
    Sout = mxGetScalar(prhs[1]);

    /* create the output matrix */
    plhs[0] = mxCreateNumericMatrix(1,Sout,mxUINT8_CLASS,0);

    /* get a pointer to the raw data in the output matrix */
    out = (unsigned char *) mxGetData(plhs[0]);

    /* call the computational routine */
    calc(in,out,Sout,Sin);
}
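
To try the scanning idea outside MATLAB, here is a minimal standalone sketch in plain C. It is not the MEX file above: the name scan_digits and its signature are made up for illustration, it reads ordinary single-byte text rather than MATLAB's 16-bit chars, and it zeroes each output slot itself instead of relying on a zero-initialized output array. As in the MEX routine, every run of digits becomes one uint8 value, and values above 255 wrap.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: scan `len` bytes of text and store each run of
 * ASCII digits as one unsigned 8-bit value in `out` (capacity `cap`).
 * Returns the number of values stored. */
size_t scan_digits(const char *text, size_t len, unsigned char *out, size_t cap)
{
    size_t count = 0;   /* values completed so far */
    int in_number = 0;  /* 1 while inside a run of digits */

    assert(text != NULL && out != NULL);

    for (size_t i = 0; i < len && count < cap; i++) {
        unsigned char c = (unsigned char)text[i];
        if (c >= '0' && c <= '9') {
            if (!in_number) {       /* first digit of a new number */
                out[count] = 0;
                in_number = 1;
            }
            out[count] = (unsigned char)(out[count] * 10 + (c - '0'));
        } else if (in_number) {     /* a non-digit ends the current number */
            count++;
            in_number = 0;
        }
    }
    if (in_number && count < cap) {
        count++;                    /* number that runs to the end of the buffer */
    }
    return count;
}
```

Note that the loop condition checks count against cap before every write, so the output buffer is never indexed past its last element.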

With the MEX function, the following script runs in 0.2 seconds and returns the same result as the previous script.

clear all;tic
fid = fopen('tekscanlarge.txt','rt');
%read the header, stop at frame
header='';
l = fgetl(fid);
while length(l)>5&&~strcmp(l(1:5),'Frame')
    header=[header,l,sprintf('\n')];
    l = fgetl(fid);
    if length(l)<5,l(end+1:5)=' ';end
end
%all data at once
dat = fread(fid,inf,'*char');
fclose(fid);
S=[48,42,3458];
d = mexscan(dat,uint32(prod(S)+3458));
d(1:prod(S(1:2))+1:end)=[];%remove frame numbers
d = reshape(d,S);
toc
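
The line d(1:prod(S(1:2))+1:end)=[] works because the scanner also picks up the number on each "Frame N" line, so the flat output consists of records of one frame number followed by 48*42 samples. The same step could be sketched in C as follows. The function name drop_frame_numbers is hypothetical, and the sketch assumes every frame is complete.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: `vals` holds `n_frames` records, each made of one
 * frame number followed by `frame_len` samples. Compacts the samples to the
 * front of the array in place and returns how many samples were kept. */
size_t drop_frame_numbers(unsigned char *vals, size_t n_frames, size_t frame_len)
{
    size_t dst = 0;  /* next write position */

    assert(vals != NULL);

    for (size_t f = 0; f < n_frames; f++) {
        size_t src = f * (frame_len + 1) + 1;  /* first sample after the frame number */
        for (size_t k = 0; k < frame_len; k++) {
            vals[dst++] = vals[src + k];       /* dst never overtakes src + k */
        }
    }
    return dst;
}
```

Because the frame numbers are discarded, it does not matter that frame numbers above 255 wrap when stored as uint8.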
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/50753331
