grep/sed/awk: parse a text file into a file with genes as rows and descriptions as columns

Stack Overflow user
Asked 2022-05-26 21:03:25
2 answers, 67 views, 0 followers, 0 votes

I want to use grep/awk/sed to parse a text file that contains descriptions of various genes.

Download the file:

wget https://downloads.wormbase.org/releases/current-production-release/species/c_elegans/PRJNA13758/annotation/c_elegans.PRJNA13758.WS283.functional_descriptions.txt.gz
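
The download is gzip-compressed, while the commands further down read the uncompressed `.txt` name, so it has to be unpacked (or streamed) first. A minimal sketch, using a tiny stand-in file so that it runs offline; `demo_descriptions.txt` is a made-up name, not the real download:

```shell
# Stand-in for the downloaded archive: a tiny gzip file with a made-up name.
printf 'WBGene00000001\taap-1\tY110A7A.10\n=\n' > demo_descriptions.txt
gzip -f demo_descriptions.txt                 # leaves demo_descriptions.txt.gz

# Option 1: stream the compressed file straight into a tool without unpacking it.
gene_lines=$(gzip -dc demo_descriptions.txt.gz | grep -c '^WBGene')
echo "gene lines: $gene_lines"

# Option 2: unpack in place; this replaces the .gz with the plain .txt file.
gunzip -f demo_descriptions.txt.gz
```

Streaming with `gzip -dc` avoids keeping a second, uncompressed copy on disk; unpacking is simpler if the file will be read several times.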

Sample text:

WBGene00000001  aap-1   Y110A7A.10
Concise description: aap-1 encodes the C. elegans ortholog of the phosphoinositide 
3-kinase (PI3K) p50/p55 adaptor/regulatory subunit; AAP-1 negatively regulates lifespan 
and dauer development, and likely functions as the sole adaptor subunit for the 
AGE-1/p110 PI3K catalytic subunit to which it binds in vitro; although AAP-1 potentiates 
insulin-like signaling, it is not absolutely required for insulin-like signaling 
under most conditions. 
Automated description: Enables protein kinase binding activity. Involved in dauer 
larval development; determination of adult lifespan; and insulin receptor signaling 
pathway. Part of phosphatidylinositol 3-kinase complex. Expressed in intestine and 
neurons. Human ortholog(s) of this gene implicated in several diseases, including 
Alzheimer's disease; SHORT syndrome; carcinoma (multiple); and immunodeficiency 
36. Is an ortholog of human PIK3R3 (phosphoinositide-3-kinase regulatory subunit 
3). 
Gene class description: phosphoinositide kinase AdAPter subunit 
=
WBGene00000002  aat-1   F27C8.1
Concise description: aat-1 encodes an amino acid transporter catalytic subunit; 
when co-expressed in Xenopus oocytes with the ATG-2 glycoprotein subunit, AAT-1 
is able to facilitate amino acid uptake and exchange, showing a relatively high 
affinity for small and some large neutral amino acids; in addition, AAT-1 is able 
to covalently associate with ATG-2 or ATG-1 to form heterodimers in the Xenopus 
expression system; when co-expressed with ATG-2, AAT-1 localizes to the cell surface 
of oocytes, but when expressed alone or with ATG-1, AAT-1 localizes intracellularly. 
Automated description: Contributes to L-amino acid transmembrane transporter activity. 
Involved in amino acid transmembrane transport. Located in plasma membrane. Part 
of amino acid transport complex. Expressed in egg-laying apparatus; head motor neurons; 
and tail. Human ortholog(s) of this gene implicated in cystinuria and lysinuric 
protein intolerance. Is an ortholog of human SLC7A8 (solute carrier family 7 member 
8). 
Gene class description: Amino Acid Transporter 
=
WBGene00000003  aat-2   F07C3.7
Concise description: aat-2 encodes a predicted amino acid transporter catalytic 
subunit; when co-expressed in Xenopus oocytes with a glycoprotein subunit, however, 
AAT-2 is not able to induce amino acid uptake. 
Automated description: Predicted to enable L-amino acid transmembrane transporter 
activity. Predicted to be involved in L-alpha-amino acid transmembrane transport 
and L-amino acid transport. Predicted to be located in membrane. Predicted to be 
integral component of membrane. Human ortholog(s) of this gene implicated in cystinuria. 
Is an ortholog of human SLC7A8 (solute carrier family 7 member 8). 
Gene class description: Amino Acid Transporter 

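For each gene, the text file contains a gene name line (e.g. WBGene00000004 aat-3 F52H2.2a) followed by Concise description:, Automated description:, and Gene class description: sections; records are separated by an equals sign "=".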
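
Because every record follows the same layout, counting each kind of line is a cheap sanity check before reshaping anything: if the four counts differ, some gene is missing a section and line-by-line extraction into parallel files would drift out of step. A rough sketch on the last record of the sample; `sample.txt` is a made-up scratch file name:

```shell
# Last record of the sample above, written to a scratch file (made-up name).
printf '%s\n' \
  'WBGene00000003  aat-2   F07C3.7' \
  'Concise description: aat-2 encodes a predicted amino acid transporter catalytic ' \
  'subunit; when co-expressed in Xenopus oocytes with a glycoprotein subunit, however, ' \
  'AAT-2 is not able to induce amino acid uptake. ' \
  'Automated description: Predicted to enable L-amino acid transmembrane transporter ' \
  'activity. Predicted to be involved in L-alpha-amino acid transmembrane transport ' \
  'and L-amino acid transport. Predicted to be located in membrane. Predicted to be ' \
  'integral component of membrane. Human ortholog(s) of this gene implicated in cystinuria. ' \
  'Is an ortholog of human SLC7A8 (solute carrier family 7 member 8). ' \
  'Gene class description: Amino Acid Transporter ' \
  > sample.txt

# Count each kind of line; in a well-formed file the four numbers are equal.
counts=$(awk '
/^WBGene/                  { genes++ }
/^Concise description:/    { concise++ }
/^Automated description:/  { automated++ }
/^Gene class description:/ { gclass++ }
END { printf "genes=%d concise=%d automated=%d class=%d\n", genes, concise, automated, gclass }
' sample.txt)
echo "$counts"
```

On the full file the same awk program can be pointed at the real input instead of `sample.txt`.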

I have been trying to parse this txt file, so I figured I would start by extracting each column and each row (gene). Here is my code:

#genes
grep "WBGene" c_elegans.PRJNA13758.WS283.functional_descriptions.txt > WB283_WBgenes.txt

#gene class description:
awk '/Gene class description:/' c_elegans.PRJNA13758.WS283.functional_descriptions.txt > WB283_geneclass.txt

#concise description
awk '
/Concise description:/  { flag=1; pfx="" }
/Automated description/ { flag=0; print "" }
flag                    { printf "%s%s",pfx,$0; pfx=" " }   # assuming appended lines are separated by a single space
' c_elegans.PRJNA13758.WS283.functional_descriptions.txt > WB283_concise.txt

#automated description
awk '
/Automated description:/  { flag=1; pfx="" }
/Gene class description:/ { flag=0; print "" }
flag                    { printf "%s%s",pfx,$0; pfx=" " }   # assuming appended lines are separated by a single space
' c_elegans.PRJNA13758.WS283.functional_descriptions.txt > WB283_automated.txt

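My question: is there a way to combine my code, or is there new code that would solve this better?

I want to extract each gene name, Concise description:, Automated description:, and Gene class description: into separate columns, with each row representing one gene.

In other words, I want to create a txt file with one gene per row and one column per selected description.

Desired output:
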
WBGene00000001  aap-1   Y110A7A.10      phosphoinositide kinase AdAPter subunit         aap-1 encodes the C. elegans ortholog of the phosphoinositide 3-kinase (PI3K) p50/p55 adaptor/regulatory subunit; AAP-1 negatively regulates lifespan  and dauer development, and likely functions as the sole adaptor subunit for the  AGE-1/p110 PI3K catalytic subunit to which it binds in vitro; although AAP-1 potentiates  insulin-like signaling, it is not absolutely required for insulin-like signaling  under most conditions.      Enables protein kinase binding activity. Involved in dauer larval development; determination of adult lifespan; and insulin receptor signaling  pathway. Part of phosphatidylinositol 3-kinase complex. Expressed in intestine and  neurons. Human ortholog(s) of this gene implicated in several diseases, including  Alzheimer's disease; SHORT syndrome; carcinoma (multiple); and immunodeficiency  36. Is an ortholog of human PIK3R3 (phosphoinositide-3-kinase regulatory subunit  3).
WBGene00000002  aat-1   F27C8.1 Amino Acid Transporter  aat-1 encodes an amino acid transporter catalytic subunit; when co-expressed in Xenopus oocytes with the ATG-2 glycoprotein subunit, AAT-1  is able to facilitate amino acid uptake and exchange, showing a relatively high  affinity for small and some large neutral amino acids; in addition, AAT-1 is able  to covalently associate with ATG-2 or ATG-1 to form heterodimers in the Xenopus  expression system; when co-expressed with ATG-2, AAT-1 localizes to the cell surface  of oocytes, but when expressed alone or with ATG-1, AAT-1 localizes intracellularly.     Contributes to L-amino acid transmembrane transporter activity. Involved in amino acid transmembrane transport. Located in plasma membrane. Part  of amino acid transport complex. Expressed in egg-laying apparatus; head motor neurons;  and tail. Human ortholog(s) of this gene implicated in cystinuria and lysinuric  protein intolerance. Is an ortholog of human SLC7A8 (solute carrier family 7 member  8).
WBGene00000003  aat-2   F07C3.7 Amino Acid Transporter  aat-2 encodes a predicted amino acid transporter catalytic subunit; when co-expressed in Xenopus oocytes with a glycoprotein subunit, however,  AAT-2 is not able to induce amino acid uptake.  Predicted to enable L-amino acid transmembrane transporter activity. Predicted to be involved in L-alpha-amino acid transmembrane transport  and L-amino acid transport. Predicted to be located in membrane. Predicted to be  integral component of membrane. Human ortholog(s) of this gene implicated in cystinuria.  Is an ortholog of human SLC7A8 (solute carrier family 7 member 8).

2 Answers

Stack Overflow user

Accepted answer

Posted 2022-05-26 21:54:33

Assuming the output should be tab-delimited, here is one awk idea:

awk '
BEGIN { OFS="\t" }
# Print the buffered record, if any, as one tab-separated row.
function print_output()    { if (baseID) print baseID,gene_name,trans_name,gene_desc,concise_desc,auto_desc; baseID="" }

# Gene line: ID, gene name, sequence name.
$1 ~ /WBGene/              { baseID=$1; gene_name=$2; trans_name=$3 }
# Label lines: keep the text after ": " and remember which block any continuation lines belong to.
/^Gene class description:/ { gene_desc    =substr($0, index($0,": ")+2) ; in_block="" }
/^Concise description:/    { concise_desc =substr($0, index($0,": ")+2) ; in_block="concise"; pfx=""; next }
/^Automated description:/  { auto_desc    =substr($0, index($0,": ")+2) ; in_block="auto"   ; pfx=""; next }

# Continuation lines of a multi-line description are appended to the current block.
in_block                   { if (in_block == "concise")
                                concise_desc = concise_desc pfx $0
                             else
                                auto_desc = auto_desc pfx $0
                             pfx=" "
                           }
$1 == "="                  { print_output() }    # record separator

END                        { print_output() }    # flush the final record
' input.file

For the sample provided, this generates:

WBGene00000001  aap-1   Y110A7A.10      phosphoinositide kinase AdAPter subunit         aap-1 encodes the C. elegans ortholog of the phosphoinositide 3-kinase (PI3K) p50/p55 adaptor/regulatory subunit; AAP-1 negatively regulates lifespan  and dauer development, and likely functions as the sole adaptor subunit for the  AGE-1/p110 PI3K catalytic subunit to which it binds in vitro; although AAP-1 potentiates  insulin-like signaling, it is not absolutely required for insulin-like signaling  under most conditions.      Enables protein kinase binding activity. Involved in dauer larval development; determination of adult lifespan; and insulin receptor signaling  pathway. Part of phosphatidylinositol 3-kinase complex. Expressed in intestine and  neurons. Human ortholog(s) of this gene implicated in several diseases, including  Alzheimer's disease; SHORT syndrome; carcinoma (multiple); and immunodeficiency  36. Is an ortholog of human PIK3R3 (phosphoinositide-3-kinase regulatory subunit  3).
WBGene00000002  aat-1   F27C8.1 Amino Acid Transporter  aat-1 encodes an amino acid transporter catalytic subunit; when co-expressed in Xenopus oocytes with the ATG-2 glycoprotein subunit, AAT-1  is able to facilitate amino acid uptake and exchange, showing a relatively high  affinity for small and some large neutral amino acids; in addition, AAT-1 is able  to covalently associate with ATG-2 or ATG-1 to form heterodimers in the Xenopus  expression system; when co-expressed with ATG-2, AAT-1 localizes to the cell surface  of oocytes, but when expressed alone or with ATG-1, AAT-1 localizes intracellularly.     Contributes to L-amino acid transmembrane transporter activity. Involved in amino acid transmembrane transport. Located in plasma membrane. Part  of amino acid transport complex. Expressed in egg-laying apparatus; head motor neurons;  and tail. Human ortholog(s) of this gene implicated in cystinuria and lysinuric  protein intolerance. Is an ortholog of human SLC7A8 (solute carrier family 7 member  8).
WBGene00000003  aat-2   F07C3.7 Amino Acid Transporter  aat-2 encodes a predicted amino acid transporter catalytic subunit; when co-expressed in Xenopus oocytes with a glycoprotein subunit, however,  AAT-2 is not able to induce amino acid uptake.  Predicted to enable L-amino acid transmembrane transporter activity. Predicted to be involved in L-alpha-amino acid transmembrane transport  and L-amino acid transport. Predicted to be located in membrane. Predicted to be  integral component of membrane. Human ortholog(s) of this gene implicated in cystinuria.  Is an ortholog of human SLC7A8 (solute carrier family 7 member 8).
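
The output above still carries the spaces that end the input lines: doubled spaces where lines were joined, and a trailing space at the end of some fields. If that matters downstream, a small post-filter can normalise each field. This is a sketch under the tab-delimited assumption, not part of the original answer; `row.tsv` and `clean.tsv` are made-up file names:

```shell
# One row shaped like the output above (third sample gene), built field by field
# so the stray spaces are visible. File names are made up for this sketch.
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
  'WBGene00000003' 'aat-2' 'F07C3.7' \
  'Amino Acid Transporter ' \
  'aat-2 encodes a predicted amino acid transporter catalytic subunit; when co-expressed in Xenopus oocytes with a glycoprotein subunit, however,  AAT-2 is not able to induce amino acid uptake. ' \
  'Predicted to enable L-amino acid transmembrane transporter activity. Predicted to be involved in L-alpha-amino acid transmembrane transport  and L-amino acid transport. Predicted to be located in membrane. Predicted to be  integral component of membrane. Human ortholog(s) of this gene implicated in cystinuria.  Is an ortholog of human SLC7A8 (solute carrier family 7 member 8). ' \
  > row.tsv

# Normalise spacing inside every tab-separated field.
awk '
BEGIN { FS = OFS = "\t" }
{
    for (i = 1; i <= NF; i++) {
        gsub(/  +/, " ", $i)      # collapse runs of spaces left by joined lines
        sub(/^ /, "", $i)         # drop a leading space, if any
        sub(/ $/, "", $i)         # drop a trailing space, if any
    }
    print
}
' row.tsv > clean.tsv

# Sanity check: count rows that do not have exactly six fields.
bad_rows=$(awk -F '\t' 'NF != 6 { n++ } END { print n + 0 }' clean.tsv)
echo "rows with a field count other than 6: $bad_rows"
```

The same filter can be appended to the awk command above with a pipe instead of reading `row.tsv`. The field-count check is also a quick way to spot a description that happens to contain a literal tab.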
Votes: 1

Stack Overflow user

Posted 2022-05-26 21:41:48

I am not sure I understand your question, but to get the result shown in your data frame picture, I suggest the following:

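awk '
BEGIN                      { COLSEP = "\t" }
# Print the buffered record (if any) as one tab-separated row, then reset the buffers.
function emit()            { if (id != "") print id COLSEP gcd COLSEP cd COLSEP ad
                             id = gcd = cd = ad = ""; flag = 0 }
/^WBGene/                  { emit(); id = $1 COLSEP $2 COLSEP $3; next }
$1 == "="                  { emit(); next }                  # record separator
/^Gene class description:/ { flag = 1; sub(/^[^:]*: */, "") }  # strip the label, keep the text
/^Automated description:/  { flag = 2; sub(/^[^:]*: */, "") }
/^Concise description:/    { flag = 3; sub(/^[^:]*: */, "") }
                           { sub(/[ \t]+$/, "") }            # drop trailing whitespace before joining
flag == 1                  { gcd = gcd (gcd == "" ? "" : " ") $0 }
flag == 2                  { ad  = ad  (ad  == "" ? "" : " ") $0 }
flag == 3                  { cd  = cd  (cd  == "" ? "" : " ") $0 }
END                        { emit() }                        # flush the final record
' c_elegans.PRJNA13758.WS283.functional_descriptions.txt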
Votes: 0
Original page content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/72397981
