Optimizing a Reed-Solomon encoder (polynomial division)

3 answers

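User

#1 · edited 2024-09-28 05:20:17

Alternatively, if you know C, I would suggest rewriting this Python function in plain C and calling it (for example with CFFI). At least you then know that the inner loops of your function run at top performance, without needing to know PyPy or Cython tricks.

See: http://cffi.readthedocs.org/en/latest/overview.html#performance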

User

#2 · edited 2024-09-28 05:20:17

Building on DavidW's answer, here is the implementation I am currently using. It is about 20% faster thanks to nogil and parallel computation:

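cimport cython  # needed for the @cython.* decorators below
from cython.parallel import parallel, prange

# uint8_t, gf_log_c and gf_exp_c are assumed to be defined at module level,
# in the same way as uint8_t, gf_log and gf_exp in answer #3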

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.initializedcheck(False)
cdef rsenc_cython(msg_in_r, nsym, gen_t) :
    '''Reed-Solomon encoding using polynomial division, better explained at http://research.swtch.com/field'''

    cdef uint8_t[::1] msg_in = bytearray(msg_in_r) # have to copy, unfortunately - can't make a memory view from a read only object
    #cdef int[::1] gen = array.array('i',gen_t) # convert list to array
    cdef uint8_t[::1] gen = gen_t

    cdef uint8_t[::1] msg_out = bytearray(msg_in) + bytearray(len(gen)-1)
    cdef int i, j
    cdef uint8_t[::1] lgen = bytearray(gen.shape[0])
    for j in xrange(gen.shape[0]):
        lgen[j] = gf_log_c[gen[j]]

    cdef uint8_t coef,lcoef
    with nogil:
        for i in xrange(msg_in.shape[0]):
            coef = msg_out[i]
            if coef != 0: # coef 0 is normally undefined so we manage it manually here (and it also serves as an optimization btw)
                lcoef = gf_log_c[coef] # precaching

                for j in prange(1, gen.shape[0]): # optimization: can skip g0 because the first coefficient of the generator is always 1! (that's why we start at position 1)
                    msg_out[i + j] ^= gf_exp_c[lcoef + lgen[j]] # equivalent (in Galois Field 2^8) to msg_out[i+j] -= msg_out[i] * gen[j]

    # Recopy the original message bytes
    msg_out[:msg_in.shape[0]] = msg_in
    return msg_out

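I would still like it to be faster (in the real implementation, data is encoded at about 6.4 MB/s with n=255, where n is the size of the message plus the error-correction symbols).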
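When tuning a version like the one above, it can help to have a slow but easy-to-read reference to compare outputs against. The following is a minimal pure-Python sketch, not the author's code. It assumes the field polynomial 0x11b and the generator element 3 (inferred from the first gf_exp values quoted in answer #3), and it assumes the generator polynomial has roots 3^0 ... 3^(nsym-1). The names gf_mul, rs_generator_poly, rsenc_reference and poly_eval are hypothetical helpers.

```python
PRIM = 0x11b   # assumed field polynomial (matches gf_exp = 1, 3, 5, 15, ... in answer #3)
ALPHA = 3      # assumed generator element of GF(2^8)

def gf_mul(a, b):
    """Multiply two GF(2^8) elements bit by bit, without lookup tables."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:      # reduce as soon as the degree reaches 8
            a ^= PRIM
        b >>= 1
    return result

def rs_generator_poly(nsym):
    """Assumed generator: product of (x - ALPHA^i), highest degree first."""
    gen = [1]
    root = 1
    for _ in range(nsym):
        new = gen + [0]                      # gen * x
        for k, c in enumerate(gen):
            new[k + 1] ^= gf_mul(c, root)    # + gen * root (minus is XOR here)
        gen = new
        root = gf_mul(root, ALPHA)
    return gen

def rsenc_reference(msg, nsym):
    """Same division loop as the Cython version, with a plain gf_mul call."""
    gen = rs_generator_poly(nsym)
    out = bytearray(msg) + bytearray(nsym)
    for i in range(len(msg)):
        coef = out[i]
        if coef != 0:
            for j in range(1, len(gen)):     # gen[0] is always 1, so skip it
                out[i + j] ^= gf_mul(gen[j], coef)
    out[:len(msg)] = msg                     # restore the message bytes
    return out

def poly_eval(poly, x):
    """Evaluate a polynomial (highest degree first) at x using Horner's rule."""
    y = 0
    for c in poly:
        y = gf_mul(y, x) ^ c
    return y

codeword = rsenc_reference(b"hello world", 10)
# A valid codeword is a multiple of the generator, so it vanishes at every root.
root = 1
for _ in range(10):
    assert poly_eval(codeword, root) == 0
    root = gf_mul(root, ALPHA)
```

If the real encoder uses a different field polynomial or a different set of generator roots, the two constants and rs_generator_poly would have to be changed to match before comparing outputs.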

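The main lead I found towards faster implementations is the LUT (lookup table) approach, which precomputes the multiplication and addition arrays. However, in my Python and Cython implementations, the LUT approach is slower than computing the XOR and addition operations directly.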
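For readers unfamiliar with the LUT idea, here is a rough Python sketch of a full multiplication table and of one step of the division loop that uses it. It is only an illustration under assumptions: the field polynomial 0x11b is assumed, and gf_mul_bitwise, MUL and divide_step_lut are hypothetical names. As noted above, whether this beats the log/exp formulation depends on the implementation.

```python
PRIM = 0x11b  # assumed field polynomial for GF(2^8)

def gf_mul_bitwise(a, b):
    """Slow bit-by-bit multiplication, used only to fill the table."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= PRIM
        b >>= 1
    return result

# MUL[a][b] == a * b in GF(2^8): 256 rows of 256 bytes, 64 KiB in total.
MUL = [bytes(gf_mul_bitwise(a, b) for b in range(256)) for a in range(256)]

def divide_step_lut(out, i, gen):
    """One outer-loop step of the encoder, reading products from the table."""
    row = MUL[out[i]]                 # every product coef * x for this coefficient
    for j in range(1, len(gen)):      # gen[0] is 1, so position i is left alone
        out[i + j] ^= row[gen[j]]

buf = bytearray([2, 0, 0])
divide_step_lut(buf, 0, [1, 3, 5])
print(list(buf))
```

Fetching the row once per outer iteration replaces the log lookup, the addition and the exp lookup with a single indexed read per inner iteration, at the cost of a much larger table.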

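There are other ways to implement a faster RS encoder, but I have neither the skills nor the time to try them. I will leave them here as references for other interested readers:

[The quoted list of references is missing from this copy.]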

However, I think the best approach would be to use an efficient polynomial modular reduction instead of polynomial division:

"Modular Reduction in GF(2^n) without Pre-computational Phase". Knežević, M., et al. Arithmetic of Finite Fields. Springer Berlin Heidelberg, 2008. 77-87.
"On computation of polynomial modular reduction". Wu, Huapeng. Technical report, Univ. of Waterloo, The Centre for Applied Cryptographic Research, 2000.
"A fast software implementation for arithmetic operations in GF(2^n)". De Win, E., Bosselaers, A., Vandenberghe, S., De Gersem, P., & Vandewalle, J. In Advances in Cryptology - Asiacrypt'96. Springer Berlin Heidelberg, 1996. 65-76.
Barrett reduction

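/EDIT: In fact, "On computation of polynomial modular reduction" uses the same approach I took with the variants rsenc_alt1() and rsenc_alt2() (the main idea being that we precompute the pairs of coefficients we will need, then reduce them all at once). Unfortunately, it is not faster. It is actually slower, because the precomputation cannot be done once and for all: it depends on the message input.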

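/EDIT: I found a library with a lot of really interesting optimizations, many of which are not found in any academic paper (papers the author says he has read, by the way), and which is probably the fastest software implementation of Reed-Solomon: see the wirehair project for more details (the second link from the original post is missing here). Notably, the author also made a Cauchy-Reed-Solomon codec called longhair using similar optimization tricks.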

/FINAL EDIT: the final reference is the following:

Plank, James S., Kevin M. Greenan, and Ethan L. Miller. "Screaming fast Galois field arithmetic using intel SIMD instructions." FAST. 2013.

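An implementation in pure Go, authored by Klaus Post, is available. This is the fastest implementation I have read about, both single-threaded and parallel (it supports both). It claims over 1 GB/s single-threaded and over 4 GB/s with 8 threads. However, it relies on optimized SIMD instructions and on various low-level optimizations of matrix operations (because here the RS codec is matrix-oriented, rather than using the polynomial approach from my question).

So, if you are an interested reader looking for the fastest Reed-Solomon codec available, this is the one.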

User

#3 · edited 2024-09-28 05:20:17

On my machine, the following is over 3 times faster than PyPy (0.04 s vs 0.15 s). Using Cython:

ctypedef unsigned char uint8_t # does not work with Microsoft's C Compiler: from libc.stdint cimport uint8_t
cimport cpython.array as array

cdef uint8_t[::1] gf_exp = bytearray([1, 3, 5, 15, 17, 51, 85, 255, 26, 46, 114, 150, 161, 248, 19,
   # ... lots of numbers omitted for space reasons
   ])

cdef uint8_t[::1] gf_log = bytearray([0, 0, 25, 1, 50, 2, 26, 198, 75, 199, 27, 104,
    # ... more numbers omitted for space reasons
    ])

import cython

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.initializedcheck(False)
def rsenc(msg_in_r, nsym, gen_t):
    '''Reed-Solomon encoding using polynomial division, better explained at http://research.swtch.com/field'''

    cdef uint8_t[::1] msg_in = bytearray(msg_in_r) # have to copy, unfortunately - can't make a memory view from a read only object
    cdef int[::1] gen = array.array('i',gen_t) # convert list to array

    cdef uint8_t[::1] msg_out = bytearray(msg_in) + bytearray(len(gen)-1)
    cdef int j
    cdef uint8_t[::1] lgen = bytearray(gen.shape[0])
    for j in xrange(gen.shape[0]):
        lgen[j] = gf_log[gen[j]]

    cdef uint8_t coef,lcoef

    cdef int i
    for i in xrange(msg_in.shape[0]):
        coef = msg_out[i]
        if coef != 0: # coef 0 is normally undefined so we manage it manually here (and it also serves as an optimization btw)
            lcoef = gf_log[coef] # precaching

            for j in xrange(1, gen.shape[0]): # optimization: can skip g0 because the first coefficient of the generator is always 1! (that's why we start at position 1)
                msg_out[i + j] ^= gf_exp[lcoef + lgen[j]] # equivalent (in Galois Field 2^8) to msg_out[i+j] -= msg_out[i] * gen[j]

    # Recopy the original message bytes
    msg_out[:msg_in.shape[0]] = msg_in
    return msg_out

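It is just the fastest version with static typing added (plus checking the HTML output of cython -a until the loops are no longer highlighted in yellow).

A few brief notes: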

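Cython prefers x.shape[0] to len(x).
Declaring the memoryviews as [::1] guarantees that they are contiguous in memory, which helps.
initializedcheck(False) is necessary to avoid a lot of existence checks on the globally defined gf_exp and gf_log. (You might find that you can speed up your basic Python/PyPy code by creating a local variable reference to these and using that instead.)
I had to copy a couple of the input arguments. Cython cannot create a memoryview from a read-only object (in this case msg_in, a string), although I could probably have treated it as a C string (char*) instead. Also, gen (a list) needs to be in something with fast element access.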

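Other than that, everything is fairly straightforward. (I have not tried any variations, having already got it faster.) I am really quite impressed by how well PyPy does.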
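The gf_exp and gf_log tables are abbreviated in the listing above. As a hedged sketch, tables starting with the same values can be generated as follows. The field polynomial 0x11b and the generator element 3 are assumptions inferred from the first entries shown (1, 3, 5, 15, ...), and build_tables is a hypothetical helper. The exp table is doubled in length because the encoder indexes gf_exp[lcoef + lgen[j]] without a modulo, so the index can reach 254 + 254 = 508.

```python
PRIM = 0x11b    # assumption: inferred from the table values quoted in the answer
GENERATOR = 3   # assumption: gf_exp starts 1, 3, 5, 15, ...

def build_tables():
    """Return (exp, log) tables for GF(2^8) as bytearrays."""
    exp = bytearray(512)   # doubled so exp[log a + log b] needs no modulo 255
    log = bytearray(256)   # log[0] stays 0 as a placeholder: log(0) is undefined
    x = 1
    for i in range(255):
        exp[i] = x
        log[x] = i
        # multiply x by 3, i.e. (x * 2) XOR x, reducing when the degree reaches 8
        doubled = x << 1
        if doubled & 0x100:
            doubled ^= PRIM
        x ^= doubled
    for i in range(255, 512):
        exp[i] = exp[i - 255]   # the powers repeat with period 255
    return exp, log

gf_exp, gf_log = build_tables()
print(list(gf_exp[:15]))
print(list(gf_log[:12]))
```

In the Cython module, the resulting bytearrays could be assigned directly to the uint8_t[::1] memoryviews instead of pasting in the literal lists.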
