Splitting a text file with awk

Stack Overflow user
Asked 2018-02-27 06:20:57
Answers: 3 | Views: 560 | Followers: 0 | Votes: 0

A sample of the text file looks like this:

ID   Z4WTH3_9ACTN            Unreviewed;       182 AA.
AC   Z4WTH3; A0SD0SDF;
AC   Z12SDFG3; ADFFGDF;
DT   11-JUN-2014, integrated into UniProtKB/TrEMBL.
SQ   SEQUENCE   182 AA;  20675 MW;  B85D18AC3B1F0E75 CRC64;
     MNFLEYNKDE KLHFNYKKSC GLWLIVVALI IFAATVIGGK QIINMSVFSF GYVAAFLSIN
//
ID   Z4WXU8_9ACTN            Unreviewed;       203 AA.
AC   Z4WXU8;
AC   QWERDFV1;
DT   11-JUN-2014, integrated into UniProtKB/TrEMBL.
SQ   SEQUENCE   203 AA;  23224 MW;  35F1AE4342F6B3AC CRC64;
     MDCKSIRSEV LWQVVRLREK LMNFLEYNKD EKLCFNYKKS CGLWLIVVAL IIFAATVIGG
//
ID   Z9JHX1_9GAMM            Unreviewed;       132 AA.
AC   Z9JHX1;
SQ   SEQUENCE   132 AA;  13880 MW;  0E09988C0F3ED155 CRC64;
     MKISVDTNVL ARAVLQDDAN QGRSASTLLK DASLIAVSLP CLCELVWILS RGAKLSKEDV
//

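The actual file is large (about 100 GB). Each record contains exactly one "ID" line, always begins with that "ID" line, and ends with "//".

"AC" lines can occur more than once per record. The first element of the first "AC" line must be used as the file name.

The file needs to be split into multiple files at each "//". Each output file should be named after the text on the line that starts with AC (that is, the first element of the first "AC" line).

So the output files would look like this: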

Z4WTH3.txt

ID   Z4WTH3_9ACTN            Unreviewed;       182 AA.
AC   Z4WTH3; A0SD0SDF;
AC   Z12SDFG3; ADFFGDF;
DT   11-JUN-2014, integrated into UniProtKB/TrEMBL.
SQ   SEQUENCE   182 AA;  20675 MW;  B85D18AC3B1F0E75 CRC64;
     MNFLEYNKDE KLHFNYKKSC GLWLIVVALI IFAATVIGGK QIINMSVFSF GYVAAFLSIN
//

Z4WXU8.txt

ID   Z4WXU8_9ACTN            Unreviewed;       203 AA.
AC   Z4WXU8;
AC   QWERDFV1;
DT   11-JUN-2014, integrated into UniProtKB/TrEMBL.
SQ   SEQUENCE   203 AA;  23224 MW;  35F1AE4342F6B3AC CRC64;
     MDCKSIRSEV LWQVVRLREK LMNFLEYNKD EKLCFNYKKS CGLWLIVVAL IIFAATVIGG
//

Z9JHX1.txt

ID   Z9JHX1_9GAMM            Unreviewed;       132 AA.
AC   Z9JHX1;
SQ   SEQUENCE   132 AA;  13880 MW;  0E09988C0F3ED155 CRC64;
     MKISVDTNVL ARAVLQDDAN QGRSASTLLK DASLIAVSLP CLCELVWILS RGAKLSKEDV
//

3 answers

Stack Overflow user

Accepted answer

Posted on 2018-02-27 06:25:28

The following awk may help you with this.

awk '/^ID/{close(filename);val=$2;sub(/_.*/,"",val);filename=val".txt"} {print > filename}'  Input_file
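
To see how the one-liner above derives the file name, the sub(/_.*/,"",val) step can be tried on a single ID line. This is only a minimal sketch; the shell variable name is a hypothetical name used for illustration.

```shell
# Feed one ID line to awk and print the file name the one-liner would use.
name=$(printf '%s\n' 'ID   Z4WTH3_9ACTN            Unreviewed;       182 AA.' |
    awk '/^ID/ { val = $2; sub(/_.*/, "", val); print val ".txt" }')
echo "$name"   # Z4WTH3.txt
```

The result equals the first AC entry only when the ID prefix and the accession agree, as they do in the sample records.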

Solution 2: since the OP wants the file name to come from the AC line, the following solution is added as well.

awk '/^ID/{close(filename);filename="";first=$0 ORS;next} /^AC/ && filename==""{val=$2;sub(";","",val);filename=val".txt";print first $0 > filename;next} {print > filename}'  Input_file

Or, if Input_file does not have an ID tag in every section, the close function can be called in the AC rule instead, as follows:

awk '/^ID/{first=$0 ORS;next} /^AC/ && !seen{close(filename);val=$2;sub(";","",val);filename=val".txt";print first $0 > filename;first="";seen=1;next} {print > filename} /^\/\//{seen=0}'  Input_file

Explanation: an explanation of the solution is added here as well:

awk '
/^ID/{                       ##If the line starts with ID, do the following:
  first=$0 ORS;              ##Save the current line plus ORS (output record separator) in the variable first.
  next}                      ##next is an awk keyword which skips the remaining rules for this line.
/^AC/ && !seen{              ##If the line starts with AC and it is the first AC line of this record, do the following:
  close(filename);           ##Close the previously written file so that we do NOT hit the "too many open files" issue.
  val=$2;                    ##Store the 2nd field of the current line in the variable val.
  sub(";","",val);           ##Use the awk sub function to remove the semicolon from val.
  filename=val".txt";        ##Build the output file name from val plus .txt.
  print first $0 > filename; ##Write the saved ID line and the current line to the output file.
  first="";                  ##Clear the saved ID line so a section without an ID line does not reuse it.
  seen=1;                    ##Remember that this record already has its file name.
  next                       ##Skip the remaining rules for this line.
}
{
  print > filename           ##Write every line that did NOT match the rules above (later AC lines included) to the output file.
}
/^\/\//{                     ##The record terminator // has just been written:
  seen=0                     ##allow the next AC line to open a new output file.
}
'  Input_file                ##Name of the input file.
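
To try the AC-based split safely, the sketch below runs it on two shortened sample records in a scratch directory. The scratch directory, the trimmed records, and the seen flag (which keeps a second AC line in the same record from opening another file) are choices of this sketch, not part of the original post.

```shell
# Work in a throwaway directory so no existing file is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two shortened records; the first one has two AC lines.
cat > Input_file <<'EOF'
ID   Z4WTH3_9ACTN            Unreviewed;       182 AA.
AC   Z4WTH3; A0SD0SDF;
AC   Z12SDFG3; ADFFGDF;
SQ   SEQUENCE   182 AA;  20675 MW;  B85D18AC3B1F0E75 CRC64;
//
ID   Z9JHX1_9GAMM            Unreviewed;       132 AA.
AC   Z9JHX1;
SQ   SEQUENCE   132 AA;  13880 MW;  0E09988C0F3ED155 CRC64;
//
EOF

awk '
/^ID/ { first = $0 ORS; next }        # hold the ID line until the name is known
/^AC/ && !seen {                      # only the first AC line of a record
    if (filename != "") close(filename)
    val = $2
    sub(";", "", val)                 # "Z4WTH3;" becomes "Z4WTH3"
    filename = val ".txt"
    print first $0 > filename         # ID line, then this AC line
    first = ""
    seen = 1
    next
}
{ print > filename }                  # every other line, later AC lines included
/^\/\// { seen = 0 }                  # record ended: the next AC line names a new file
' Input_file

created=$(ls *.txt | tr '\n' ' ')
echo "$created"   # Z4WTH3.txt Z9JHX1.txt
```

No Z12SDFG3.txt appears, because the second AC line of the first record is written into Z4WTH3.txt like any other line.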
Votes: 2

Stack Overflow user

Posted on 2018-02-27 07:26:58

Another approach uses RS to separate the records (GNU awk, because of the multi-character RS):

$ gawk '
BEGIN {
    RS=ORS="\n//\n"          # record separators
}
{
    for(i=1;i<=NF;i++)       # go thru each field in record
        if($i=="AC") {       # once AC found
            f=$(i+1) "TXT"   # next one is the filename
            sub(/;/,".",f)   # replace ; with .
            print > f        # print to file (only the first AC names the file)
            close(f)         # close to avoid problem with too many open files
                             # overwrites files when files with same name
            break            # stop at the first AC so later AC lines create no extra files
        }
}' file
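
Because a repeated file name silently overwrites the earlier file, it may be worth checking for repeated accessions before splitting. The following is a rough sketch in portable awk; the three-record input, including its deliberately repeated record, and the names dupes, count, and seen are made up for illustration.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Three shortened records; the first and third share the accession Z4WTH3.
cat > file <<'EOF'
ID   Z4WTH3_9ACTN            Unreviewed;       182 AA.
AC   Z4WTH3; A0SD0SDF;
//
ID   Z9JHX1_9GAMM            Unreviewed;       132 AA.
AC   Z9JHX1;
//
ID   Z4WTH3_9ACTN            Unreviewed;       182 AA.
AC   Z4WTH3;
//
EOF

# Print each first-AC accession the second time it is seen.
dupes=$(awk '
$1 == "AC" && !seen {
    id = $2
    sub(/;$/, "", id)                 # drop the trailing semicolon
    if (++count[id] == 2) print id
    seen = 1
}
$0 == "//" { seen = 0 }               # next record: look at its first AC line again
' file)

echo "duplicates: ${dupes:-none}"   # duplicates: Z4WTH3
```

The count array keeps every accession in memory, so on a very large input this check may itself need a lot of RAM.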

The files:

$ ls -l Z*
-rw-r--r-- 1 james james 254 Feb 27 09:23 Z4WTH3.TXT
-rw-r--r-- 1 james james 254 Feb 27 09:23 Z4WXU8.TXT
-rw-r--r-- 1 james james 202 Feb 27 09:23 Z9JHX1.TXT

Inside a file:

$ cat Z9JHX1.TXT
ID   Z9JHX1_9GAMM            Unreviewed;       132 AA.
AC   Z9JHX1;
SQ   SEQUENCE   132 AA;  13880 MW;  0E09988C0F3ED155 CRC64;
     MKISVDTNVL ARAVLQDDAN QGRSASTLLK DASLIAVSLP CLCELVWILS RGAKLSKEDV
//
Votes: 1

Stack Overflow user

Posted on 2018-02-27 13:09:54

With GNU awk for multi-char RS and RT:

awk -v RS='\n//\n' -v ORS= -F'[[:space:];]+' '{print $0 RT > ($7".txt")}' file
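
The $7 in the command above depends on how the field separator splits the start of each record. A small sketch of that split on the first two lines of the first sample record follows; it uses paragraph mode (RS='') and writes the separator class out as [ \t\n;]+ so that it also runs on awks without POSIX character classes, both of which are assumptions of this sketch.

```shell
# Read the two lines as one record and number its first seven fields.
fields=$(printf '%s\n' \
    'ID   Z4WTH3_9ACTN            Unreviewed;       182 AA.' \
    'AC   Z4WTH3; A0SD0SDF;' |
    awk -v RS='' -F'[ \t\n;]+' '{ for (i = 1; i <= 7; i++) print i ": " $i }')
echo "$fields"
```

The last line printed is "7: Z4WTH3": the ID line contributes five fields, "AC" is the sixth, and the accession is the seventh. This holds only while every ID line has the same five-field layout.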

With any awk:

awk -F'[[:space:];]+' '
    $1 == "AC" && out == "" { out = $2".txt" }   # only the first AC line names the file
    { rec = rec $0 ORS }
    $0 == "//" {
        printf "%s", rec > out
        close(out)
        rec = out = ""
    }
' file
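
After a split like this, one possible sanity check is to reassemble the pieces in input order and compare them with the original. The sketch below does that on two shortened records; names.list, the scratch directory, and the simplified separator [ \t;]+ are hypothetical choices made for this example.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

cat > file <<'EOF'
ID   Z4WTH3_9ACTN            Unreviewed;       182 AA.
AC   Z4WTH3; A0SD0SDF;
AC   Z12SDFG3; ADFFGDF;
//
ID   Z9JHX1_9GAMM            Unreviewed;       132 AA.
AC   Z9JHX1;
//
EOF

# Buffer each record, flush it at "//", and log the output names in order.
awk -F'[ \t;]+' '
    $1 == "AC" && out == "" { out = $2 ".txt" }
    { rec = rec $0 ORS }
    $0 == "//" {
        printf "%s", rec > out
        close(out)
        print out > "names.list"
        rec = out = ""
    }
' file

# Concatenate the pieces in input order and compare byte for byte.
if xargs cat < names.list | cmp -s - file; then
    status="round trip OK"
else
    status="round trip FAILED"
fi
echo "$status"   # round trip OK
```

A failed comparison would point to overwritten files (repeated accessions) or to a final record that lacks its closing "//".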
Votes: 1

Original page content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/49002341
