Here is the most basic way I know of to count the transitions in a Markov chain and use them to populate a transition matrix:
def increment_counts_in_matrix_from_chain(markov_chain, transition_counts_matrix):
    # Walk the chain and count each (previous state, current state) pair.
    for i in xrange(1, len(markov_chain)):
        old_state = markov_chain[i - 1]
        new_state = markov_chain[i]
        transition_counts_matrix[old_state, new_state] += 1

I tried to speed this up in three different ways:
1) Using a sparse-matrix one-liner, based on this Matlab code:

transition_matrix = full(sparse(markov_chain(1:end-1), markov_chain(2:end), 1))

which in NumPy/SciPy looks like this:
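from scipy.sparse import coo_matrix

def get_sparse_counts_matrix(markov_chain, number_of_states):
    # One entry of 1 per transition, at (old_state, new_state).
    return coo_matrix(([1]*(len(markov_chain) - 1), (markov_chain[0:-1], markov_chain[1:])), shape=(number_of_states, number_of_states))

I also tried a couple of Python tweaks, such as using zip():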
for old_state, new_state in zip(markov_chain[0:-1], markov_chain[1:]):
    transition_counts_matrix[old_state, new_state] += 1

and a queue:
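from Queue import Queue  # Python 2; the module is named `queue` in Python 3

old_and_new_states_holder = Queue(maxsize=2)
old_and_new_states_holder.put(markov_chain[0])
for new_state in markov_chain[1:]:
    old_and_new_states_holder.put(new_state)
    old_state = old_and_new_states_holder.get()
    transition_counts_matrix[old_state, new_state] += 1

But none of these three approaches sped things up. In fact, every solution except the zip() one was at least 10 times slower than my original.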
Are there any other solutions worth looking into?
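For concreteness, here is a minimal sketch of what is being counted, written for Python 3 with plain nested lists instead of a NumPy array. The names `count_transitions`, `toy_chain`, and `toy_counts` are hypothetical and used only for illustration.

```python
def count_transitions(chain, number_of_states):
    """Return a nested-list matrix where entry [i][j] counts i -> j steps."""
    counts = [[0] * number_of_states for _ in range(number_of_states)]
    # Pair every state with its successor and tally the pair.
    for old_state, new_state in zip(chain[:-1], chain[1:]):
        counts[old_state][new_state] += 1
    return counts


toy_chain = [0, 1, 1, 2, 0, 1]  # hypothetical chain over 3 states
toy_counts = count_transitions(toy_chain, 3)
for row in toy_counts:
    print(row)
```

A chain of length n always contributes n - 1 transitions, so the entries of the matrix sum to `len(toy_chain) - 1`.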
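Modified solution for building a transition matrix from many chains

The best answer to the question above was DSM's. However, for anyone who wants to populate a transition matrix from a list of millions of Markov chains, the fastest approach is this: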
import itertools
import numpy

def fast_increment_transition_counts_from_chain(markov_chain, transition_counts_matrix):
    # Map each (old_state, new_state) pair to one flat index, then tally with bincount.
    flat_coords = numpy.ravel_multi_index((markov_chain[:-1], markov_chain[1:]), transition_counts_matrix.shape)
    transition_counts_matrix.flat += numpy.bincount(flat_coords, minlength=transition_counts_matrix.size)

def get_fake_transitions(markov_chains):
    # Joining the chains end to end creates one spurious transition at every
    # boundary (last state of one chain -> first state of the next).
    fake_transitions = []
    for i in xrange(1,len(markov_chains)):
        old_chain = markov_chains[i - 1]
        new_chain = markov_chains[i]
        end_of_old = old_chain[-1]
        beginning_of_new = new_chain[0]
        fake_transitions.append((end_of_old, beginning_of_new))
    return fake_transitions

def decrement_fake_transitions(fake_transitions, counts_matrix):
    # Remove the spurious boundary transitions again.
    for old_state, new_state in fake_transitions:
        counts_matrix[old_state, new_state] -= 1

def fast_get_transition_counts_matrix(markov_chains, number_of_states):
    """50% faster than original, but must store 2 additional slice copies of all markov chains in memory at once.
    You might need to break up the chains into manageable chunks that don't exceed your memory.
    """
    transition_counts_matrix = numpy.zeros([number_of_states, number_of_states])
    fake_transitions = get_fake_transitions(markov_chains)
    markov_chains = list(itertools.chain(*markov_chains))
    fast_increment_transition_counts_from_chain(markov_chains, transition_counts_matrix)
    decrement_fake_transitions(fake_transitions, transition_counts_matrix)
    return transition_counts_matrix

Posted on 2012-11-04 15:55:24
How about something like this, taking advantage of np.bincount? It is not super robust, but it is functional. Thanks to @ Weckesser for the setup.
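As a small illustration of the idea, the sketch below shows how `np.ravel_multi_index` turns each (row, column) pair into one flat index and how `np.bincount` then tallies those indices. It assumes Python 3 and a 3-state toy chain; the variable names are hypothetical and not taken from the answer.

```python
import numpy as np

chain = np.array([0, 1, 1, 2, 0, 1])  # hypothetical toy chain
n_states = 3
shape = (n_states, n_states)

# Each (row, col) pair becomes the single index row * n_states + col.
flat = np.ravel_multi_index((chain[:-1], chain[1:]), shape)

# bincount tallies how often each flat index occurs; minlength pads the
# result so it can be reshaped into the full n_states x n_states matrix.
counts = np.bincount(flat, minlength=n_states * n_states).reshape(shape)

print(flat)
print(counts)
```

Because the tally happens in one vectorized call, repeated transitions are handled without a Python-level loop.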
import numpy as np
from collections import Counter

def increment_counts_in_matrix_from_chain(markov_chain, transition_counts_matrix):
    for i in xrange(1, len(markov_chain)):
        old_state = markov_chain[i - 1]
        new_state = markov_chain[i]
        transition_counts_matrix[old_state, new_state] += 1

def using_counter(chain, counts_matrix):
    counts = Counter(zip(chain[:-1], chain[1:]))
    from_, to = zip(*counts.keys())
    counts_matrix[from_, to] = counts.values()

def using_bincount(chain, counts_matrix):
    # Writes the counts into counts_matrix in place.
    flat_coords = np.ravel_multi_index((chain[:-1], chain[1:]), counts_matrix.shape)
    counts_matrix.flat = np.bincount(flat_coords, minlength=counts_matrix.size)

def using_bincount_reshape(chain, counts_matrix):
    # Returns a new array; counts_matrix is only used for its shape and size.
    flat_coords = np.ravel_multi_index((chain[:-1], chain[1:]), counts_matrix.shape)
    return np.bincount(flat_coords, minlength=counts_matrix.size).reshape(counts_matrix.shape)

This gives:
In [373]: t = np.random.randint(0,50, 500)
In [374]: m1 = np.zeros((50,50))
In [375]: m2 = m1.copy()
In [376]: m3 = m1.copy()
In [377]: timeit increment_counts_in_matrix_from_chain(t, m1)
100 loops, best of 3: 2.79 ms per loop
In [378]: timeit using_counter(t, m2)
1000 loops, best of 3: 924 us per loop
In [379]: timeit using_bincount(t, m3)
10000 loops, best of 3: 57.1 us per loop

Edit
Avoiding flat (at the cost of not working in place) can save some time for small matrices:
In [80]: timeit using_bincount_reshape(t, m3)
10000 loops, best of 3: 22.3 us per loop

Posted on 2012-11-04 18:41:51
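Since I had been meaning to try it out anyway, I applied Numba to your problem. In code, this only involves adding a decorator (although here I made direct calls so that I could test the jit variants that numba provides):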
import numpy as np
import numba

def increment_counts_in_matrix_from_chain(markov_chain, transition_counts_matrix):
    for i in xrange(1, len(markov_chain)):
        old_state = markov_chain[i - 1]
        new_state = markov_chain[i]
        transition_counts_matrix[old_state, new_state] += 1

# autojit infers the types at call time; jit is given them explicitly.
autojit_func = numba.autojit()(increment_counts_in_matrix_from_chain)
jit_func = numba.jit(argtypes=[numba.int64[:,::1],numba.double[:,::1]])(increment_counts_in_matrix_from_chain)

t = np.random.randint(0,50, 500)
m1 = np.zeros((50,50))
m2 = np.zeros((50,50))
m3 = np.zeros((50,50))

And then the timings:
In [10]: %timeit increment_counts_in_matrix_from_chain(t,m1)
100 loops, best of 3: 2.38 ms per loop
In [11]: %timeit autojit_func(t,m2)
10000 loops, best of 3: 67.5 us per loop
In [12]: %timeit jit_func(t,m3)
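100000 loops, best of 3: 4.93 us per loop

The autojit method does some guessing based on the runtime inputs, while the jit function has its types specified. You have to be a little careful: at this early stage, numba does not report an error if you pass jit an input of the wrong type. It will simply produce an incorrect answer.

That said, getting roughly 35x and 485x speed-ups without any code changes, just by adding a call to numba (which can also be applied as a decorator), is pretty impressive in my book. You could probably get similar results with cython, but that would need a bit more boilerplate and a setup.py file.

I also like this solution because the code stays readable and you can write it the way you originally thought about implementing the algorithm.

Posted on 2012-11-04 14:51:06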
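Here is a faster method. The idea is to count the number of occurrences of each transition and use the counts in a vectorized update of the matrix. (I assume the same transition can occur multiple times in markov_chain.) The Counter class from the collections library is used to count how many times each transition occurs.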
from collections import Counter

def update_matrix(chain, counts_matrix):
    # Count each distinct (from, to) pair once, then add all counts in one indexed update.
    counts = Counter(zip(chain[:-1], chain[1:]))
    from_, to = zip(*counts.keys())
    counts_matrix[from_, to] += counts.values()

An example timing, in ipython:
In [64]: t = np.random.randint(0,50, 500)
In [65]: m1 = zeros((50,50))
In [66]: m2 = zeros((50,50))
In [67]: %timeit increment_counts_in_matrix_from_chain(t, m1)
1000 loops, best of 3: 895 us per loop
In [68]: %timeit update_matrix(t, m2)
1000 loops, best of 3: 504 us per loop

It is faster, but not by orders of magnitude. For a real speed-up, you might consider implementing this in Cython.
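The function above was written for Python 2, where `counts.values()` returns a list. On Python 3 it returns a view object, so the update would probably need an explicit conversion first. The following is a minimal sketch under that assumption; `update_matrix_py3` is a hypothetical name, and the toy chain is made up for illustration.

```python
from collections import Counter

import numpy as np


def update_matrix_py3(chain, counts_matrix):
    """Add the transition counts of `chain` to `counts_matrix` in place."""
    counts = Counter(zip(chain[:-1], chain[1:]))
    if not counts:
        return  # a chain with fewer than two states has no transitions
    from_, to = zip(*counts.keys())
    # Counter keys are unique, so no index pair repeats and the fancy-indexed
    # += adds every count exactly once.
    counts_matrix[list(from_), list(to)] += list(counts.values())


toy_chain = [0, 1, 1, 2, 0, 1]  # hypothetical chain over 3 states
toy_matrix = np.zeros((3, 3))
update_matrix_py3(toy_chain, toy_matrix)
print(toy_matrix)
```

Calling the function again with another chain accumulates into the same matrix, which matches the `+=` behaviour of the original.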
https://stackoverflow.com/questions/13219041