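<p>There is not much more you can do with your <code>cython</code> function, as it is already well optimised. However, you can still get a moderate speed-up by avoiding the calls to <code>numpy</code> altogether.</p>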
<pre><code>import numpy as np
cimport numpy as np
cimport cython
from libc.stdlib cimport malloc, free
from libc.math cimport pow

cdef inline void sum_axis(double *v, double *M, int n):
    # v[i] = sum of column i of the row-major n x n matrix M
    cdef:
        int i, j
    for i in range(n):
        v[i] = 0  # malloc does not zero the buffer
        for j in range(n):
            v[i] += M[j*n+i]

@cython.boundscheck(False)
@cython.wraparound(False)
def permfunc_modified(np.ndarray [double, ndim =2, mode='c'] M):
    cdef:
        int n = M.shape[0], j=0, s=1, i
        int *f = <int*>malloc(n*sizeof(int))
        double *d = <double*>malloc(n*sizeof(double))
        double *v = <double*>malloc(n*sizeof(double))
        double p = 1, prod

    sum_axis(v,&M[0,0],n)

    for i in range(n):
        p *= v[i]
        f[i] = i
        d[i] = 1

    # Gray-code walk: each step flips the sign of one row
    while (j < n-1):
        for i in range(n):
            v[i] -= 2.*d[j]*M[j, i]
        d[j] = -d[j]
        s = -s
        prod = 1
        for i in range(n):
            prod *= v[i]
        p += s*prod
        f[0] = 0
        f[j] = f[j+1]
        f[j+1] = j+1
        j = f[0]

    free(d)
    free(f)
    free(v)
    return p/pow(2.,(n-1))
</code></pre>
<p>Here are the necessary checks and timings:</p>
^{pr2}$
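<p>As a rough illustration of what such a check can look like, here is a minimal pure-Python sketch. It is not the benchmark from this answer: <code>perm_bruteforce</code> and <code>perm_glynn</code> are hypothetical helper names, the matrix is assumed to be a non-empty square list of lists, and <code>math.prod</code> requires Python 3.8 or later. <code>perm_glynn</code> transcribes the same Gray-code loop as <code>permfunc_modified</code>, so it can be compared against the definition of the permanent on small inputs.</p>

```python
from itertools import permutations
from math import prod


def perm_bruteforce(M):
    """Permanent straight from the definition: sum over all permutations."""
    n = len(M)
    return sum(
        prod(M[i][sigma[i]] for i in range(n))
        for sigma in permutations(range(n))
    )


def perm_glynn(M):
    """Pure-Python transcription of the loop in permfunc_modified (n >= 1)."""
    n = len(M)
    v = [sum(M[j][i] for j in range(n)) for i in range(n)]  # column sums
    d = [1.0] * n          # current sign of each row
    f = list(range(n))     # focus pointers of the loopless Gray code
    p = prod(v)
    s = 1
    j = 0
    while j < n - 1:
        # flip the sign of row j and update the column sums in place
        for i in range(n):
            v[i] -= 2.0 * d[j] * M[j][i]
        d[j] = -d[j]
        s = -s
        p += s * prod(v)
        # advance to the next row to flip
        f[0] = 0
        f[j] = f[j + 1]
        f[j + 1] = j + 1
        j = f[0]
    return p / 2 ** (n - 1)


if __name__ == "__main__":
    M = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
    print(perm_bruteforce(M), perm_glynn(M))
```

<p>The brute-force version is only practical for small <code>n</code>, but that is enough to confirm that a rewritten loop still returns the same value before timing it.</p>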
<p><strong>Edit</strong>
Let us do some basic <code>SSE</code> vectorisation by unrolling the inner <code>prod</code> loop, that is, replace the product loop in the code above with the following</p>
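<pre><code># define t1, t2 and t3 earlier as doubles
t1,t2,t3=1.,1.,1.
# two independent accumulators over pairs of elements
for i in range(0,n-1,2):
    t1 *= v[i]
    t2 *= v[i+1]
# define k earlier as int
# remainder loop: picks up the last element when n is odd
for k in range(i+2,n):
    t3 *= v[k]

p += s*(t1*t2*t3)
</code></pre>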
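<p>To see that the unrolled form computes the same product for both even and odd lengths, here is a small pure-Python sketch. The function name <code>prod_unrolled</code> is hypothetical, and initialising <code>i = -2</code> is an assumption added here so that the remainder loop also works when the paired loop never runs; in the Cython code this case does not arise, because the outer <code>while</code> loop only executes for <code>n &gt;= 2</code>.</p>

```python
from math import prod


def prod_unrolled(v):
    """Product of v using two independent accumulators plus a remainder loop."""
    n = len(v)
    t1 = t2 = t3 = 1.0
    i = -2  # assumption: lets the remainder loop start at 0 for n < 2
    for i in range(0, n - 1, 2):
        t1 *= v[i]      # even-indexed elements
        t2 *= v[i + 1]  # odd-indexed elements
    for k in range(i + 2, n):
        t3 *= v[k]      # leftover element when n is odd
    return t1 * t2 * t3


if __name__ == "__main__":
    for values in ([2.0, 3.0, 4.0, 5.0], [1.0, 2.0, 3.0, 4.0, 5.0]):
        print(prod_unrolled(values), prod(values))
```

<p>The point of the split is that <code>t1</code> and <code>t2</code> do not depend on each other, so the compiler is free to compute them side by side. In pure Python this brings no speed-up; the sketch only checks the index arithmetic.</p>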
<p>And now the timing</p>
<pre><code>In [8]: %timeit permfunc_modified_vec(M) # vectorised
1 loop, best of 3: 14.0 s per loop
</code></pre>
<p>So that is a 2x speed-up over the original optimised cython code, which is not bad.</p>