I am trying to convert a text file like this (FASTA format):
>seq1
AAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAA
ATGATGATGGAATGAGGAT
TTAGGAGGGAGGAAAATTC
>seq2
CCCTCCGGGAAAAAAGAGG
TTGCAATGCGCGTATTTAT
TTTTTTTTTTTTTTTTTTT
AAAAAAAAAAAAAGGCTGT
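AAAAAAAAAAAAAAAGGGG
The goal is to move each newline 5 positions downstream (so that sequence lines are 24 characters wide instead of 19), except on lines that start with >. The expected output is: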
>seq1
AAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAATGATGATGGAATGA
GGATTTAGGAGGGAGGAAAATTC
>seq2
CCCTCCGGGAAAAAAGAGGTTGCA
ATGCGCGTATTTATTTTTTTTTTT
TTTTTTTTTAAAAAAAAAAAAAGG
CTGTAAAAAAAAAAAAAAAGGGG
I would like to use AWK, but I don't know how to proceed. I was thinking of something like this:
awk '{for(i=1;i<=NR;i++){ if($1 ~ /^>/){¿?¿?¿?}}}'
Do you know how I could solve this?
Posted on 2022-04-05 19:09:50
Assumptions:
One idea using awk:
awk -v width=24 '                                  # pass the width in as awk variable "width"
function print_sequence() {
    if (sequence)                                  # if sequence is not blank
        while (sequence) {                         # while sequence is not blank
            print substr(sequence,1,width)         # print the first "width" characters
            sequence=substr(sequence,width+1)      # remove the first "width" characters
        }
}
/^>/ { print_sequence()                            # flush previous set of data to stdout
       print                                       # print current input line
       next                                        # process next input line
     }
     { sequence=sequence $1 }                      # append data to our "sequence" variable
END  { print_sequence() }                          # flush last set of data to stdout
' fasta.in > fasta.out
This produces:
$ cat fasta.out
>seq1
AAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAATGATGATGGAATGA
GGATTTAGGAGGGAGGAAAATTC
>seq2
CCCTCCGGGAAAAAAGAGGTTGCA
ATGCGCGTATTTATTTTTTTTTTT
TTTTTTTTTAAAAAAAAAAAAAGG
CTGTAAAAAAAAAAAAAAAGGGG
Posted on 2022-04-05 19:04:32
I would do it the following way. Let the content of file.txt be
>seq1
AAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAA
ATGATGATGGAATGAGGAT
TTAGGAGGGAGGAAAATTC
>seq2
CCCTCCGGGAAAAAAGAGG
TTGCAATGCGCGTATTTAT
TTTTTTTTTTTTTTTTTTT
AAAAAAAAAAAAAGGCTGT
AAAAAAAAAAAAAAAGGGG
then
awk 'BEGIN{width=24}/>/&&x{print x;x=""}/>/{print;next}{x = x $0}length(x)>=width{print substr(x,1,width);x=substr(x,width+1)}END{print x}' file.txt
gives the output
>seq1
AAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAATGATGATGGAATGA
GGATTTAGGAGGGAGGAAAATTC
>seq2
CCCTCCGGGAAAAAAGAGGTTGCA
ATGCGCGTATTTATTTTTTTTTTT
TTTTTTTTTAAAAAAAAAAAAAGG
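CTGTAAAAAAAAAAAAAAAGGGG
Explanation: I set width to 24, the desired number of characters per line. If > is found and something is stored in x, print it and set x to the empty string. If a > line is encountered, print it and go to the next line. For every remaining line, append the current line's content to x; if the length of x is greater than or equal to width, print the first width characters of x and then remove them from x. After all lines have been processed, print x.
Disclaimer: this solution assumes that the ratio between the current width and the desired width is less than 0.5.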
(GNU Awk 5.0.1)
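The one-liner prints at most one chunk per input line, which is why it depends on how the input line width relates to the target width. A minimal sketch of a variant that swaps the single length check for a while loop is shown here, so it should not need that assumption. The function name rewrap_fasta and its two positional arguments (file, then width, defaulting to 24) are hypothetical and not part of the original answer.

```shell
# Sketch only: "rewrap_fasta" is a hypothetical helper, not from the answers above.
# Usage: rewrap_fasta FILE [WIDTH]   (WIDTH defaults to 24)
rewrap_fasta() {
    awk -v width="${2:-24}" '
        /^>/ {                              # header line
            if (x != "") print x            # flush the remainder of the previous record
            x = ""
            print                           # headers are printed unchanged
            next
        }
        {
            x = x $0                        # accumulate sequence text
            while (length(x) >= width) {    # emit every complete chunk, not just one
                print substr(x, 1, width)
                x = substr(x, width + 1)
            }
        }
        END { if (x != "") print x }        # flush the final partial line, if any
    ' "$1"
}
```

Called as rewrap_fasta file.txt 24, this should give the same output as above for the sample file. It also avoids printing an empty line at the end when the last sequence is an exact multiple of the width.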
Posted on 2022-04-05 20:00:27
You can also try another approach that uses awk's field and record separators:
awk -v width=24 '
    BEGIN {
        FS="\n"       # Set the field separator to newline
        RS=">"        # Set the record separator to ">"
        ORS=OFS=""    # Set the output record and field separators to an empty string
    }
    NR>1 {            # With ">" as the record separator the first record is empty, so skip it
        header=$1     # With "\n" as the field separator, $1 contains the header; save it in a variable
        $1=OFS        # Assign an empty string to $1 so the record is rebuilt and the body becomes $0,
                      # with all newlines removed, since OFS == ""
        gsub(".{" width "}", "&" FS)    # Insert a newline (FS) after every "width" characters
        sub(FS "$", "")                 # Drop the extra newline added when the body length is an exact multiple of "width"
        print RS header FS $0 FS        # Print a ">", the header, a newline, the body and a newline
    }
' fasta_in > fasta_out
https://stackoverflow.com/questions/71755614