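RNA can be broken into three-nucleotide sequences called codons, which are then translated into a polypeptide like so:

RNA: "AUGUUUUCU" => translates to
Codons: "AUG", "UUU", "UCU" => which become a polypeptide with the following sequence =>
Protein: "Methionine", "Phenylalanine", "Serine"

There are 64 codons, which correspond to 20 amino acids; however, not all of the codon sequences and resulting amino acids matter for this exercise. If it works for one codon, the program should work for all of them. Feel free to expand the list in the test suite to include them all.

There are also three terminating codons (also known as "STOP" codons); if any of these codons is encountered (by the ribosome), all translation ends and the protein is terminated. All subsequent codons are ignored, like this:

RNA: "AUGUUUUCUUAAAUG" =>
Codons: "AUG", "UUU", "UCU", "UAA", "AUG" =>
Protein: "Methionine", "Phenylalanine", "Serine"

Note that the stop codon "UAA" terminates the translation and the final methionine is not translated into the protein sequence. Learn more about protein translation on Wikipedia.

That is the task. I originally did this in Python 3.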
def proteins(strand):
    sub_len = 3
    # Split the strand into codons (chunks of three characters)
    split_str = [strand[i:i+sub_len] for i in range(0, len(strand), sub_len)]
    protein = []
    for x in split_str:
        if x == "UAA" or x == "UAG" or x == "UGA":
            # Stop codon: translation ends here
            break
        elif x == "AUG":
            protein.append("Methionine")
        elif x == "UUU" or x == "UUC":
            protein.append("Phenylalanine")
        elif x == "UUA" or x == "UUG":
            protein.append("Leucine")
        elif x == "UCU" or x == "UCC" or x == "UCA" or x == "UCG":
            protein.append("Serine")
        elif x == "UAU" or x == "UAC":
            protein.append("Tyrosine")
        elif x == "UGU" or x == "UGC":
            protein.append("Cysteine")
        elif x == "UGG":
            protein.append("Tryptophan")
    return protein

This time I did it in C#.
// This file was auto-generated based on version 1.1.1 of the canonical data.
using System;
using System.Collections.Generic;
using System.Linq;

public static class ProteinTranslation
{
    public static string[] Proteins(string strand)
    {
        // Create a list to house codons
        List<string> protein = new List<string>();
        // Convert string (RNA aka strand) to an array so we can iterate in chunks of 3 (codons)
        IEnumerable<string> output = RnaToCodons(strand);
        // Add codons to list and return results
        return Codons(protein, output);
    }

    private static IEnumerable<string> RnaToCodons(string strand, int k = 0) => strand.ToLookup(c => Math.Floor(k++ / (double)3)).Select(e => new String(e.ToArray()));

    private static string[] Codons(List<string> protein, IEnumerable<string> output)
    {
        foreach (var item in output)
        {
            switch (item)
            {
                case "UAA": case "UAG": case "UGA": return protein.ToArray();
                case "UCU": case "UCC": case "UCA": case "UCG": protein.Add("Serine"); break;
                case "UUU": case "UUC": protein.Add("Phenylalanine"); break;
                case "UUA": case "UUG": protein.Add("Leucine"); break;
                case "UAU": case "UAC": protein.Add("Tyrosine"); break;
                case "UGU": case "UGC": protein.Add("Cysteine"); break;
                case "UGG": protein.Add("Tryptophan"); break;
                case "AUG": protein.Add("Methionine"); break;
            }
        }
        return protein.ToArray();
    }
}

I am required to pass these tests.
// This file was auto-generated based on version 1.1.1 of the canonical data.
using Xunit;

public class ProteinTranslationTests
{
    [Fact]
    public void Methionine_rna_sequence() => Assert.Equal(new[] { "Methionine" }, ProteinTranslation.Proteins("AUG"));

    [Fact]
    public void Phenylalanine_rna_sequence_1() => Assert.Equal(new[] { "Phenylalanine" }, ProteinTranslation.Proteins("UUU"));

    [Fact]
    public void Phenylalanine_rna_sequence_2() => Assert.Equal(new[] { "Phenylalanine" }, ProteinTranslation.Proteins("UUC"));

    [Fact]
    public void Leucine_rna_sequence_1() => Assert.Equal(new[] { "Leucine" }, ProteinTranslation.Proteins("UUA"));

    [Fact]
    public void Leucine_rna_sequence_2() => Assert.Equal(new[] { "Leucine" }, ProteinTranslation.Proteins("UUG"));

    [Fact]
    public void Serine_rna_sequence_1() => Assert.Equal(new[] { "Serine" }, ProteinTranslation.Proteins("UCU"));

    [Fact]
    public void Serine_rna_sequence_2() => Assert.Equal(new[] { "Serine" }, ProteinTranslation.Proteins("UCC"));

    [Fact]
    public void Serine_rna_sequence_3() => Assert.Equal(new[] { "Serine" }, ProteinTranslation.Proteins("UCA"));

    [Fact]
    public void Serine_rna_sequence_4() => Assert.Equal(new[] { "Serine" }, ProteinTranslation.Proteins("UCG"));

    [Fact]
    public void Tyrosine_rna_sequence_1() => Assert.Equal(new[] { "Tyrosine" }, ProteinTranslation.Proteins("UAU"));

    [Fact]
    public void Tyrosine_rna_sequence_2() => Assert.Equal(new[] { "Tyrosine" }, ProteinTranslation.Proteins("UAC"));

    [Fact]
    public void Cysteine_rna_sequence_1() => Assert.Equal(new[] { "Cysteine" }, ProteinTranslation.Proteins("UGU"));

    [Fact]
    public void Cysteine_rna_sequence_2() => Assert.Equal(new[] { "Cysteine" }, ProteinTranslation.Proteins("UGC"));

    [Fact]
    public void Tryptophan_rna_sequence() => Assert.Equal(new[] { "Tryptophan" }, ProteinTranslation.Proteins("UGG"));

    [Fact]
    public void Stop_codon_rna_sequence_1() => Assert.Empty(ProteinTranslation.Proteins("UAA"));

    [Fact]
    public void Stop_codon_rna_sequence_2() => Assert.Empty(ProteinTranslation.Proteins("UAG"));

    [Fact]
    public void Stop_codon_rna_sequence_3() => Assert.Empty(ProteinTranslation.Proteins("UGA"));

    [Fact]
    public void Translate_rna_strand_into_correct_protein_list() => Assert.Equal(new[] { "Methionine", "Phenylalanine", "Tryptophan" }, ProteinTranslation.Proteins("AUGUUUUGG"));

    [Fact]
    public void Translation_stops_if_stop_codon_at_beginning_of_sequence() => Assert.Empty(ProteinTranslation.Proteins("UAGUGG"));

    [Fact]
    public void Translation_stops_if_stop_codon_at_end_of_two_codon_sequence() => Assert.Equal(new[] { "Tryptophan" }, ProteinTranslation.Proteins("UGGUAG"));

    [Fact]
    public void Translation_stops_if_stop_codon_at_end_of_three_codon_sequence() => Assert.Equal(new[] { "Methionine", "Phenylalanine" }, ProteinTranslation.Proteins("AUGUUUUAA"));

    [Fact]
    public void Translation_stops_if_stop_codon_in_middle_of_three_codon_sequence() => Assert.Equal(new[] { "Tryptophan" }, ProteinTranslation.Proteins("UGGUAGUGG"));

    [Fact]
    public void Translation_stops_if_stop_codon_in_middle_of_six_codon_sequence() => Assert.Equal(new[] { "Tryptophan", "Cysteine", "Tyrosine" }, ProteinTranslation.Proteins("UGGUGUUAUUAAUGGUUU"));
}

I had trouble deciding how to iterate over a string in substrings of length N. split_str and RnaToCodons are code I borrowed from StackOverflow. I'm not sure, but I feel there is a better way to do this. In C#, I wanted to decouple my code, unlike the Python version I made. I wanted to make sure I pass through the given string only once, codon by codon. I'm not sure a switch statement is the best way, but in my opinion it is easy to read.

It would be interesting to see whether this can be made faster or more concise while staying readable, and to learn whether there is a better way to iterate over a string by its substrings.
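
For reference, the "iterate over a string in substrings of length N" step on its own can be sketched as a small Python generator. The name chunks is hypothetical (it is not a standard-library function), and this is only one possible approach; it yields each slice lazily instead of building the whole list up front the way split_str does.

```python
from typing import Iterator


def chunks(text: str, size: int) -> Iterator[str]:
    """Yield consecutive substrings of `text`, each `size` characters long.

    The last chunk is shorter when len(text) is not a multiple of `size`.
    """
    if size <= 0:
        raise ValueError("size must be greater than zero")
    for start in range(0, len(text), size):
        # Slicing past the end of a string is safe in Python.
        yield text[start:start + size]
```

Calling list(chunks("AUGUUUUCU", 3)) gives the three codons "AUG", "UUU" and "UCU"; a caller that breaks out of the loop at a stop codon never slices the rest of the strand.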
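Posted on 2020-05-20 12:06:58

Your implementation doesn't seem to care if the RNA sequence contains invalid characters, as in "UXGUGUUAUUA". Is that intentional? I would expect an exception, or at least some reporting in a log.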
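
To illustrate the kind of check I mean, here is a minimal Python sketch that validates a strand before translating it. The function name validate_strand is hypothetical, and the rule it enforces (only the letters A, C, G, U, in whole codons) is my assumption about what "valid" should mean, not something the exercise states.

```python
import re

# Assumption: a valid strand consists only of A, C, G and U,
# and its length is a multiple of three (whole codons only).
_VALID_STRAND = re.compile(r"(?:[ACGU]{3})*")


def validate_strand(strand: str) -> None:
    """Raise ValueError for foreign characters or a trailing partial codon."""
    if not _VALID_STRAND.fullmatch(strand):
        raise ValueError(f"Invalid RNA sequence: {strand!r}")
```

With this rule, "UXGUGUUAUUA" is rejected twice over: it contains an X and its length is 11. Note that the check only looks at the alphabet, so a well-formed codon that the translation table does not know still passes.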
An alternative to the switch statement is often a dictionary, especially if the cases are likely to change or should be localized, because a dictionary can be loaded at runtime from a file or a database:

static readonly IDictionary<string, string> rnaProteinMap = new Dictionary<string, string>
{
    // A null value marks a stop codon
    { "UAA", null },
    { "UAG", null },
    { "UGA", null },
    { "UCU", "Serine" },
    { "UCC", "Serine" },
    { "UCA", "Serine" },
    { "UCG", "Serine" },
    { "UUU", "Phenylalanine" },
    { "UUC", "Phenylalanine" },
    { "UUA", "Leucine" },
    { "UUG", "Leucine" },
    { "UAU", "Tyrosine" },
    { "UAC", "Tyrosine" },
    { "UGU", "Cysteine" },
    { "UGC", "Cysteine" },
    { "UGG", "Tryptophan" },
    { "AUG", "Methionine" },
};

Here, several RNA entries map to the same protein, but I don't think that is a problem in this context.
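
If the repeated protein names bother you, one option is to store the table the other way round and invert it once at startup. The Python sketch below is only an illustration of that idea: the names AMINO_ACID_CODONS, build_codon_map and load_codon_map, and the "STOP" marker key, are all hypothetical, and the JSON layout is an assumed file format rather than anything the exercise prescribes.

```python
import json

# Amino acid -> codons. "STOP" is a hypothetical marker for the stop codons.
AMINO_ACID_CODONS = {
    "Methionine": ["AUG"],
    "Phenylalanine": ["UUU", "UUC"],
    "Leucine": ["UUA", "UUG"],
    "Serine": ["UCU", "UCC", "UCA", "UCG"],
    "Tyrosine": ["UAU", "UAC"],
    "Cysteine": ["UGU", "UGC"],
    "Tryptophan": ["UGG"],
    "STOP": ["UAA", "UAG", "UGA"],
}


def build_codon_map(table):
    """Invert {amino_acid: [codons]} into {codon: amino_acid, or None for stop}."""
    codon_map = {}
    for amino_acid, codons in table.items():
        for codon in codons:
            if codon in codon_map:
                raise ValueError(f"Codon {codon} is listed twice")
            codon_map[codon] = None if amino_acid == "STOP" else amino_acid
    return codon_map


def load_codon_map(json_text):
    """Build the map from JSON text, e.g. the contents of a configuration file."""
    return build_codon_map(json.loads(json_text))


CODON_MAP = build_codon_map(AMINO_ACID_CODONS)
```

The inverted map has the same shape as rnaProteinMap above (one entry per codon, with an empty value for stop codons), so the lookup code does not need to know which layout the data was stored in.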
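private static string[] Codons(List<string> protein, IEnumerable<string> output)

I don't understand why protein is passed in as an argument instead of simply being created inside Codons().

Below, I have refactored the code, using the same parts arranged in a different way: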
private static IEnumerable<string> RnaToCodons(string strand, int k = 0) => strand.ToLookup(c => Math.Floor(k++ / (double)3)).Select(e => new String(e.ToArray()));

// Returns false on a stop codon; protein stays null for an unknown codon.
private static bool TryGetProtein(string rna, out string protein)
{
    protein = null;
    switch (rna)
    {
        case "UAA": case "UAG": case "UGA":
            return false;
        case "UCU": case "UCC": case "UCA": case "UCG":
            protein = "Serine";
            break;
        case "UUU": case "UUC":
            protein = "Phenylalanine";
            break;
        case "UUA": case "UUG":
            protein = "Leucine";
            break;
        case "UAU": case "UAC":
            protein = "Tyrosine";
            break;
        case "UGU": case "UGC":
            protein = "Cysteine";
            break;
        case "UGG":
            protein = "Tryptophan";
            break;
        case "AUG":
            protein = "Methionine";
            break;
        default:
            // TODO log an invalid RNA
            return true;
            // OR throw new ArgumentException($"Invalid RNA sequence: {rna}", nameof(rna));
    }
    return true;
}

public static string[] Proteins(string strand)
{
    List<string> proteins = new List<string>();
    foreach (var rna in RnaToCodons(strand))
    {
        if (!TryGetProtein(rna, out string protein))
            break;
        if (protein != null)
            proteins.Add(protein);
    }
    return proteins.ToArray();
}

In TryGetProtein, I return true after reporting an invalid RNA sequence to the log, so that the process can continue instead of being terminated by an exception. You should consider what to do in that situation.
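
The two options (log and continue, or throw) can also be exposed as a choice for the caller. Here is a rough Python sketch of that design; the strict flag, the logger name and the table names are all hypothetical, and whether skipping an unknown codon is acceptable at all is exactly the decision left to you.

```python
import logging

logger = logging.getLogger("protein_translation")  # hypothetical logger name

STOP_CODONS = {"UAA", "UAG", "UGA"}
CODON_TO_PROTEIN = {
    "AUG": "Methionine",
    "UUU": "Phenylalanine", "UUC": "Phenylalanine",
    "UUA": "Leucine", "UUG": "Leucine",
    "UCU": "Serine", "UCC": "Serine", "UCA": "Serine", "UCG": "Serine",
    "UAU": "Tyrosine", "UAC": "Tyrosine",
    "UGU": "Cysteine", "UGC": "Cysteine",
    "UGG": "Tryptophan",
}


def proteins(strand, strict=False):
    """Translate a strand; unknown codons are logged, or raise when strict=True."""
    result = []
    for start in range(0, len(strand), 3):
        codon = strand[start:start + 3]
        if codon in STOP_CODONS:
            break  # stop codon: everything after it is ignored
        protein = CODON_TO_PROTEIN.get(codon)
        if protein is None:
            if strict:
                raise ValueError(f"Invalid RNA sequence: {codon}")
            logger.warning("Skipping invalid codon %r at index %d", codon, start)
            continue
        result.append(protein)
    return result
```

In the default mode, proteins("AUGUXGUUU") logs a warning for "UXG" and still returns Methionine and Phenylalanine; with strict=True the same call raises ValueError.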
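As for performance, RnaToCodons() looks like the bottleneck. You should try testing it with a huge RNA string.

Here is another solution that handles everything in a single iteration: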
IEnumerable<string> Slice(string data, int size)
{
    if (size <= 0) throw new ArgumentOutOfRangeException(nameof(size), "Must be greater than zero");
    char[] slice = new char[size];
    for (int i = 0; i <= data.Length; i++)
    {
        // A full chunk has been collected: emit it
        if (i > 0 && i % size == 0)
        {
            yield return new string(slice);
        }
        if (i == data.Length)
            yield break;
        slice[i % size] = data[i];
    }
}

IEnumerable<string> Proteins(string strand)
{
    foreach (string rna in Slice(strand, 3))
    {
        if (rnaProteinMap.TryGetValue(rna, out string protein))
        {
            // A null value marks a stop codon
            if (protein == null) yield break;
            yield return protein;
        }
        else
        {
            // throw, report an error or just let it pass, as you do?
        }
    }
}

As shown, it uses the dictionary rnaProteinMap from above.
Posted on 2020-05-19 21:27:55

If you want a single pass, you can do something like this:
public static string[] Proteins(string strand)
{
    return GetProteins(strand).ToArray();
}

private static IEnumerable<string> GetProteins(string strand)
{
    if (string.IsNullOrEmpty(strand)) { throw new ArgumentNullException(nameof(strand)); }
    for (var i = 0; i < strand.Length; i += 3)
    {
        var codon = strand.Substring(i, Math.Min(3, strand.Length - i));
        // Stop codons and unknown codons both end the translation
        if (!TryParseCodon(codon, out string protein)) { break; }
        yield return protein;
    }
}

private static string GetProteinName(string codon)
{
    switch (codon)
    {
        case "UCU":
        case "UCC":
        case "UCA":
        case "UCG":
            return "Serine";
        case "UUU":
        case "UUC":
            return "Phenylalanine";
        case "UUA":
        case "UUG":
            return "Leucine";
        case "UAU":
        case "UAC":
            return "Tyrosine";
        case "UGU":
        case "UGC":
            return "Cysteine";
        case "UGG":
            return "Tryptophan";
        case "AUG":
            return "Methionine";
        default:
            return null;
    }
}

private static bool TryParseCodon(string codon, out string protein)
{
    protein = GetProteinName(codon);
    return protein != null;
}

https://codereview.stackexchange.com/questions/242588