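This is exercise 3.1.34 from the book Computer Science: An Interdisciplinary Approach by Sedgewick and Wayne:
The Shannon entropy measures the information content of an input string and plays a cornerstone role in information theory and data compression. Given a string of n characters, let $f(c)$ be the frequency of occurrence of character c. The quantity $p(c) = f(c)/n$ is an estimate of the probability that c would be in the string if it were a random string, and the entropy is defined to be the sum of the quantity $-p(c)\log_2 p(c)$ over all characters that appear in the string. The entropy is said to measure the information content of a string: if each character appears the same number of times, the entropy is at its minimum value among strings of a given length. Write a program that takes the name of a file as a command-line argument and prints the entropy of the text in that file. Run your program on a web page that you read regularly, a recent paper that you wrote, and the E. coli genome found on the website.
Here is my program: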
public class ShannonEntropy
{
    public static String removeUnnecessaryChars()
    {
        String text = "";
        while (!StdIn.isEmpty())
        {
            String word = StdIn.readString();
            int wordLength = word.length();
            String newWord = "";
            for (int i = 0; i < wordLength; i++)
            {
                if (word.charAt(i) != '.' &&
                    word.charAt(i) != '!' &&
                    word.charAt(i) != '?' &&
                    word.charAt(i) != ',' &&
                    word.charAt(i) != '"' &&
                    word.charAt(i) != ':' &&
                    word.charAt(i) != ';' &&
                    word.charAt(i) != '(' &&
                    word.charAt(i) != ')')
                {
                    newWord += word.charAt(i);
                }
            }
            text += newWord;
        }
        return text.toLowerCase();
    }

    // this method (below) is written specifically for texts without
    // unnecessary characters (e.g. E. coli genome)
    public static String convertTextToString()
    {
        String text = "";
        while (!StdIn.isEmpty())
        {
            String word = StdIn.readString();
            text = word;
        }
        return text;
    }

    public static int[] findFrequencies(String text)
    {
        int textLength = text.length();
        /*
        char[] ALPHABET = {'a','b','c','d','e','f','g','h','i','j','k','l',
                           'm','n','o','p','q','r','s','t','u','v','w','x',
                           'y','z'};
        */
        char[] ALPHABET = {'a','c','g','t'}; // specifically used for genes and genomes
        int[] frequencies = new int[ALPHABET.length];
        for (int i = 0; i < textLength; i++)
        {
            for (int j = 0; j < ALPHABET.length; j++)
            {
                if (text.charAt(i) == ALPHABET[j])
                {
                    frequencies[j]++;
                    break; // to speed up the computation
                }
            }
        }
        return frequencies;
    }

    public static double[] findProbabilities(String text, int[] frequencies)
    {
        int textLength = text.length();
        int n = frequencies.length;
        double[] probabilities = new double[n];
        for (int i = 0; i < n; i++)
        {
            probabilities[i] = (double) frequencies[i]/textLength;
        }
        return probabilities;
    }

    public static double log2(double x)
    {
        return (Math.log(x)/Math.log(2));
    }

    public static double calculateEntropy(double[] probabilities)
    {
        double shannonEntropy = 0;
        int n = probabilities.length;
        for (int i = 0; i < n; i++)
        {
            if (probabilities[i] != 0)
            {
                shannonEntropy += probabilities[i]*log2(probabilities[i]);
            }
        }
        return -1*shannonEntropy;
    }

    public static void main(String[] args)
    {
        //final long time1 = System.currentTimeMillis();
        //String text = removeUnnecessaryChars();
        String text = convertTextToString();
        //final long time2 = System.currentTimeMillis();
        //System.out.println("Time to remove unnecessary characters: " + (time2-time1) + " ms");
        int[] frequencies = findFrequencies(text);
        //final long time3 = System.currentTimeMillis();
        //System.out.println("Time to calculate character frequencies: " + (time3-time2) + " ms");
        double[] probabilities = findProbabilities(text, frequencies);
        System.out.println("Shannon entropy of the E. coli genome: " + calculateEntropy(probabilities));
        String randomGene = "";
        for (int i = 0; i < 1000000; i++)
        {
            double r = Math.random();
            if (r < 0.25) randomGene += "a";
            else if (r < 0.50) randomGene += "c";
            else if (r < 0.75) randomGene += "g";
            else if (r < 1.00) randomGene += "t";
        }
        int[] rFrequencies = findFrequencies(randomGene);
        double[] rProbabilities = findProbabilities(randomGene, rFrequencies);
        System.out.println("Shannon entropy of the random genome: " + calculateEntropy(rProbabilities));
    }
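}
StdIn is a simple API written by the authors of the book. Here is one instance of my program:
Input: E. coli genome
Output:
Shannon entropy of the E. coli genome: 1.9998212455541713 (which agrees exactly with the answer from an online Shannon entropy calculator)
Shannon entropy of the random genome: 1.9999979438235416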
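A quick sanity check on these two values, assuming only the definition from the exercise: if each of the four letters a, c, g and t occurs with probability exactly 1/4, the sum has four equal terms and the entropy works out to exactly 2. Values just below 2 are therefore consistent with the four letters occurring almost equally often.

$$
H \;=\; -\sum_{c} p(c)\,\log_2 p(c)
  \;=\; -4 \cdot \tfrac{1}{4} \cdot \log_2 \tfrac{1}{4}
  \;=\; \log_2 4
  \;=\; 2
$$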
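Is there any way that I can improve my program (especially its performance, and in particular the method removeUnnecessaryChars)?
Thanks for your attention.
Posted on 2020-09-07 03:26:15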
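In Java, we usually put the opening curly brace on the same line, not on a new line.
Since you are specifically interested in removeUnnecessaryChars...
Using a Set<Character> to hold the set of characters to ignore would be cleaner than enumerating them in the method.
Anything that does not need to be visible outside the class should be private. Keep scopes as narrow as possible.
It would be better if the method did not depend on StdIn, but I assume that is an artifact of the assignment.
convertTextToString and removeUnnecessaryChars behave differently on the same input when it contains no unnecessary characters. I expect there is a bug in convertTextToString.
In the stream version, it might be possible to make the Set a Set<Integer>, keep the rest of that declaration, and skip the mapToObj step, but it's past my bedtime.
If I were to rewrite it, it might look like this (untested!):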
// Imports needed by the snippets below.
import java.util.Set;
import java.util.stream.Collectors;

private static final Set<Character> CHARACTERS_TO_IGNORE = Set.of('.', '!', '?', ',', '"', ':', ';', '(', ')');

public static String removeUnnecessaryChars() {
    String text = "";
    while (!StdIn.isEmpty()) {
        for (char c : StdIn.readString().toCharArray()) {
            if (!CHARACTERS_TO_IGNORE.contains(c)) {
                text += c;
            }
        }
    }
    return text;
}

// Alternative version of the same method using streams.
public static String removeUnnecessaryChars() {
    String text = "";
    while (!StdIn.isEmpty()) {
        text += StdIn.readString()
                .chars()
                .mapToObj(i -> (char) i)
                .filter(c -> !CHARACTERS_TO_IGNORE.contains(c))
                .map(String::valueOf) // joining() needs CharSequence elements
                .collect(Collectors.joining());
    }
    return text;
}
Posted on 2020-09-06 12:43:01
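The idea behind the code is very good. You have divided the task nicely into the methods it needs. There are still a few improvements you could make.
For example, this line is a bit off. It looks like it is meant as a negation; multiplying by -1 is just a funny way to write one.
return -1*shannonEntropy;
In this line, you could derive the alphabet, that is, the distinct characters, from the text itself.
char[] ALPHABET = {'a','c','g','t'};
You are doing a lot of looping over the text, the alphabet, the frequencies, the probabilities, and so on. Is there a way to get it done with minimal looping?
Your first loop does not need the inner loop over the alphabet. Just go through the text and increment the count for the current character; you do not even need to specify an alphabet. Something like this: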
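// Note: this snippet uses C# syntax (Dictionary, text.Length).
Dictionary<char, int> frequencies = new Dictionary<char, int>();
for (int i = 0; i < text.Length; i++)
{
    if (!frequencies.ContainsKey(text[i]))
    {
        frequencies.Add(text[i], 0);
    }
    frequencies[text[i]]++;
}
Second, there is no need for separate loops to compute the probabilities and the per-character entropies. Both computations can be done in the same loop while keeping a running total.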
double totalEntropy = 0;
foreach (KeyValuePair<char, int> frequency in frequencies)
{
    double probability = ...;
    double entropy = ...;
    totalEntropy += entropy;
}
That keeps the looping to a minimum.
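The two suggestions above could be combined in Java roughly as follows. This is a minimal sketch rather than the answerer's code: the class name EntropySketch and the method names countFrequencies and entropy are hypothetical, and returning 0 for an empty text is an assumption made here to avoid dividing by zero.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical names throughout; a sketch of the single-pass approach.
class EntropySketch {

    // One pass over the text. No predefined alphabet is needed:
    // the keys of the map end up being the distinct characters.
    public static Map<Character, Integer> countFrequencies(String text) {
        Map<Character, Integer> frequencies = new HashMap<>();
        for (int i = 0; i < text.length(); i++) {
            // Insert 1 for a new character, otherwise add 1 to its count.
            frequencies.merge(text.charAt(i), 1, Integer::sum);
        }
        return frequencies;
    }

    // Probability and entropy term are computed in the same loop,
    // keeping a running total.
    public static double entropy(String text) {
        if (text.isEmpty()) {
            return 0.0; // assumption: treat an empty text as entropy 0
        }
        double totalEntropy = 0.0;
        for (int count : countFrequencies(text).values()) {
            double probability = (double) count / text.length();
            // Subtracting each term replaces the final multiplication by -1.
            totalEntropy -= probability * (Math.log(probability) / Math.log(2));
        }
        return totalEntropy;
    }

    public static void main(String[] args) {
        System.out.println(entropy("acgt")); // four equally frequent letters
        System.out.println(entropy("aabb")); // two equally frequent letters
    }
}
```

Because the map only ever contains characters that actually occur, every count is at least 1 and no check for a zero probability is needed. The same code also works for the genome and for ordinary text without switching alphabets.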
https://codereview.stackexchange.com/questions/248998