I have >100 csv files, each containing >1000 measurements, like this:
MR44825_radiomics_MCA.csv
Case-1_Image: MR44825_head.nii.gz
Case-1_diagnostics_Configuration_EnabledImageTypes: {'Original': {}, 'LoG': {'sigma': [2.0, 4.0, 6.0]}, 'Wavelet': {}}
Case-1_diagnostics_Image-original_Mean: -917.2822725885565

MR47987_radiomics_MCA.csv
Case-1_Image: MR47987_head.nii.gz
Case-1_diagnostics_Configuration_EnabledImageTypes: {'Original': {}, 'LoG': {'sigma': [2.0, 4.0, 6.0]}, 'Wavelet': {}}
Case-1_diagnostics_Image-original_Mean: -442.31589128260026标签总是一些不同长度的字符串,测量的区别总是第一个:。每个度量包含相同的标签。度量本身可能包含,,但相关值则由{}封装。
Now I want to merge these files, preferably with bash. The output csv should be structured like this:
Case-1_Image,Case-1_diagnostics_Configuration_EnabledImageTypes,Case-1_diagnostics_Image-original_Mean
MR44825_head.nii.gz,{'Original': {}, 'LoG': {'sigma': [2.0, 4.0, 6.0]}, 'Wavelet': {}},-917.2822725885565
MR47987_head.nii.gz,{'Original': {}, 'LoG': {'sigma': [2.0, 4.0, 6.0]}, 'Wavelet': {}},-442.31589128260026

Posted on 2021-01-01 17:04:30
Assumptions:

- ": " (colon + space) acts as the delimiter between a label and its data
- the number and spelling of the labels are not known in advance (i.e., we need to dynamically parse, store and print the labels)

One awk idea (NOTE: it is a bit long since the labels have to be handled dynamically):
awk '
BEGIN { split("",hdr)                     # declare hdr as an array
        split("",data)                    # declare data as an array
        ndx=1                             # init array index
      }

function print_row(   i,pfx) {            # function to print a row; i/pfx are locals
        pfx=""                            # first column has an empty prefix
        if ( length(hdr) > 0 ) {          # print the header row?
            for ( i=1; i<=length(hdr); i++ ) {   # indexed loop keeps column order
                printf "%s%s", pfx, hdr[i]
                pfx=","                   # 2nd-nth columns get a "," prefix
            }
            printf "\n"
            split("",hdr)                 # clear hdr[] so we do not print it again
        }
        pfx=""                            # reset prefix for the data row
        if ( length(data) > 0 ) {         # print a data row?
            for ( i=1; i<=length(data); i++ ) {
                printf "%s%s", pfx, data[i]
                pfx=","                   # 2nd-nth columns get a "," prefix
            }
            printf "\n"
            split("",data)                # clear data[] for the next file
            ndx=1                         # reset the array index for the next file
        }
}

FNR==1 { print_row() }                    # new file: flush the contents of the previous file

      { if ( FNR==NR )                    # first file: populate the hdr[] array
            hdr[ndx]=gensub(/:$/,"","g",$1)   # strip trailing ":" from field #1 (gensub needs GNU awk)
        $1=""                             # clear field #1
        data[ndx]=gensub(/^ /,"","g",$0)  # strip the leading " "; store in data[]
        ndx++                             # increment array index
        next
      }

END { print_row() }                       # flush the last set of data[] to stdout
' MR*MCA.csv

When run against the 2 example data files, this generates:
Case-1_Image,Case-1_diagnostics_Configuration_EnabledImageTypes,Case-1_diagnostics_Image-original_Mean
MR44825_head.nii.gz,{'Original': {}, 'LoG': {'sigma': [2.0, 4.0, 6.0]}, 'Wavelet': {}},-917.2822725885565
MR47987_head.nii.gz,{'Original': {}, 'LoG': {'sigma': [2.0, 4.0, 6.0]}, 'Wavelet': {}},-442.31589128260026

Posted on 2021-01-01 16:36:35
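Since every file is assumed to contain the same labels in the same order, the same merge can also be sketched with plain POSIX tools (cut, sed, paste) instead of awk. This is a minimal sketch under those assumptions, using the same `MR*MCA.csv` glob; like the awk version it produces pseudo-CSV, i.e. values containing commas are not quoted:

```shell
set -- MR*MCA.csv                            # the input files, first one supplies the header
cut -d: -f1 "$1" | paste -sd, -              # header row: everything before the first ":"
for f in "$@"; do
    sed 's/^[^:]*: //' "$f" | paste -sd, -   # data row: strip "label: " from each line
done
```

Redirect the whole thing to a file (e.g. `{ ...; } > merged.csv`) to get the merged output in one place.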
CSV does not really make sense for the data you have shown, and the pseudo-CSV you say you want as output makes little sense either and would be hard to process further. It would probably make more sense to convert each input file to JSON, which allows processing with standard tools.
awk -F ': ' '
    FNR==1 { name=$2 }                        # line 1: remember the image file name
    FNR==2 { j = substr($0, length($1)+3)     # line 2: keep everything after the first ": "
             gsub(/\047/, "\042", j) }        # turn single quotes into double quotes
    FNR==3 { sub(/^{/, "{\042name\042: \042" name "\042,", j)   # prepend the name
             sub(/}$/, ",\042mean\042: " $2 "}", j)             # append the mean
             print j }
' *.csv >output.jsonl

The output should look something like:
{"name":"MR20584_head.nii.gz","Original": {}, "LoG": {"sigma": [2.0, 4.0, 6.0]}, "Wavelet": {},"mean": -917.2822725885565}
{"name":"MR30211_head.nii.gz","Original": {}, "LoG": {"sigma": [2.0, 4.0, 6.0]}, "Wavelet": {},"mean":-1024.287275914652}

This format is JSON Lines: each line is valid JSON, but the file as a whole is not a single JSON document.
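For instance, the JSON Lines file can then be queried line by line with standard tools. A small sketch with jq (assuming jq is installed and `output.jsonl` was produced as above) that extracts just the name and mean as properly quoted CSV:

```shell
# @csv quotes strings and leaves numbers bare, one CSV row per input line
jq -r '[.name, .mean] | @csv' output.jsonl
```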
Of course, it would be even better if you could fix the tool that produces this awkward format in the first place.
https://stackoverflow.com/questions/65530819