优化reedsolomon编码器(多项式除法)

2024-09-28 05:20:17 发布

您现在位置:Python中文网/ 问答频道 /正文

我试图优化一个Reed-Solomon编码器,它实际上只是一个对Galois字段2^8的多项式除法运算(这意味着值在255以上)。这段代码实际上非常类似于这里可以找到的Go:http://research.swtch.com/field

这里使用的多项式除法是synthetic division(也称为Horner方法)。在

我什么都试过了:numpy,pypy,cython。我得到的最佳性能是将pypy与这个简单的嵌套循环结合使用:

def rsenc(msg_in, nsym, gen):
    '''Reed-Solomon encoding using polynomial division, better explained at http://research.swtch.com/field'''
    msg_out = bytearray(msg_in) + bytearray(len(gen)-1)
    lgen = bytearray([gf_log[gen[j]] for j in xrange(len(gen))])

    for i in xrange(len(msg_in)):
        coef = msg_out[i]
        # coef = gf_mul(msg_out[i], gf_inverse(gen[0]))  // for general polynomial division (when polynomials are non-monic), we need to compute: coef = msg_out[i] / gen[0]
        if coef != 0: # coef 0 is normally undefined so we manage it manually here (and it also serves as an optimization btw)
            lcoef = gf_log[coef] # precaching

            for j in xrange(1, len(gen)): # optimization: can skip g0 because the first coefficient of the generator is always 1! (that's why we start at position 1)
                msg_out[i + j] ^= gf_exp[lcoef + lgen[j]] # equivalent (in Galois Field 2^8) to msg_out[i+j] += msg_out[i] * gen[j]

    # Recopy the original message bytes
    msg_out[:len(msg_in)] = msg_in
    return msg_out

Python优化向导能引导我找到一些关于如何获得加速的线索吗?我的目标是得到至少3倍的加速,但更多的将是可怕的。任何方法或工具都可以接受,只要它是跨平台的(至少在Linux和Windows上有效)。在

下面是一个小的测试脚本,其中包含了我尝试过的一些其他方法(cython尝试不包括在内,因为它比原生python慢!)公司名称:

^{pr2}$

这是我机器上的结果:

With PyPy:
rsenc : total time elapsed 0.108183 seconds.
rsenc_alt1 : output is incorrect!
rsenc_alt1 : total time elapsed 0.164084 seconds.
rsenc_alt2 : output is incorrect!
rsenc_alt2 : total time elapsed 0.557697 seconds.

Without PyPy:
rsenc : total time elapsed 3.518857 seconds.
rsenc_alt1 : output is incorrect!
rsenc_alt1 : total time elapsed 5.630897 seconds.
rsenc_alt2 : output is incorrect!
rsenc_alt2 : total time elapsed 6.100434 seconds.
rsenc_numpy : output is incorrect!
rsenc_numpy : total time elapsed 1.631373 seconds

(注意:备选方案应该是正确的,有些索引一定有点偏离,但由于它们速度较慢,所以我没有尝试修复它们)

/UPDATE和赏金的目标:我发现了一个非常有趣的优化技巧,它承诺可以大大加快计算速度:to precompute the multiplication table。我用新函数rsenc_precomp()更新了上面的代码。但是,在我的实现中一点好处也没有,它甚至慢了一点:

rsenc : total time elapsed 0.107170 seconds.
rsenc_precomp : total time elapsed 0.108788 seconds.

为什么数组查找比加法或xor之类的操作花费更多?为什么它在ZFEC中工作而在Python中不起作用?

我将把奖金归于那些能向我展示如何使乘法/加法查找表优化工作(比xor和加法操作更快)的人,或者谁能通过引用或分析向我解释为什么这个优化在这里不能工作(使用Python/PyPy/Cython/Numpy等)。。我都试过了)。在


Tags: inoutputlentimeismsgouttotal
3条回答

或者,如果你知道C,我建议用纯C重写这个Python函数并调用它(比如用CFFI)。至少你知道你在函数的内部循环中达到了最高性能,而不需要知道PyPy或Cython技巧。在

参见:http://cffi.readthedocs.org/en/latest/overview.html#performance

基于DavidW的答案,我现在使用的实现是这样的,通过使用nogil和并行计算,它大约快了20%:

from cython.parallel import parallel, prange

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.initializedcheck(False)
cdef rsenc_cython(msg_in_r, nsym, gen_t) :
    '''Reed-Solomon encoding using polynomial division, better explained at http://research.swtch.com/field'''

    cdef uint8_t[::1] msg_in = bytearray(msg_in_r) # have to copy, unfortunately - can't make a memory view from a read only object
    #cdef int[::1] gen = array.array('i',gen_t) # convert list to array
    cdef uint8_t[::1] gen = gen_t

    cdef uint8_t[::1] msg_out = bytearray(msg_in) + bytearray(len(gen)-1)
    cdef int i, j
    cdef uint8_t[::1] lgen = bytearray(gen.shape[0])
    for j in xrange(gen.shape[0]):
        lgen[j] = gf_log_c[gen[j]]

    cdef uint8_t coef,lcoef
    with nogil:
        for i in xrange(msg_in.shape[0]):
            coef = msg_out[i]
            if coef != 0: # coef 0 is normally undefined so we manage it manually here (and it also serves as an optimization btw)
                lcoef = gf_log_c[coef] # precaching

                for j in prange(1, gen.shape[0]): # optimization: can skip g0 because the first coefficient of the generator is always 1! (that's why we start at position 1)
                    msg_out[i + j] ^= gf_exp_c[lcoef + lgen[j]] # equivalent (in Galois Field 2^8) to msg_out[i+j] -= msg_out[i] * gen[j]

    # Recopy the original message bytes
    msg_out[:msg_in.shape[0]] = msg_in
    return msg_out

我仍然希望它更快(在实际的实现中,数据以大约6.4mb/s的速度编码,n=255,n是消息+码字的大小)。在

我发现更快实现的主要原因是使用LUT(查找表)方法,通过预先计算乘法和加法数组。然而,在我的Python和Cython实现中,LUT方法比计算XOR和加法操作慢。在

有其他的方法来实现一个更快的RS编码器,但我没有能力,也没有时间去尝试它们。我将把它们留给其他感兴趣的读者参考:

^{bq}$

然而,我认为最好的方法是使用有效的多项式模降阶,而不是多项式除法:

  • "Modular Reduction in GF (2 n) without Pre-computational Phase". Kneževic, M., et al. Arithmetic of Finite Fields. Springer Berlin Heidelberg, 2008. 77-87.
  • "On computation of polynomial modular reduction". Wu, Huapeng. Technical report, Univ. of Waterloo, The Centre for applied cryptographic research, 2000.
  • "A fast software implementation for arithmetic operations in GF (2n)". De Win, E., Bosselaers, A., Vandenberghe, S., De Gersem, P., & Vandewalle, J. (1996, January). In Advances in Cryptology—Asiacrypt'96 (pp. 65-76). Springer Berlin Heidelberg. link
  • Barnett reduction

/EDIT:事实上,“关于多项式模约化的计算”使用的方法与我对变量rsenc_alt1()和rsenc_alt2()所用的方法相同(主要思想是我们预先计算所需的系数对,然后一次将其全部减少),不幸的是,它并不快(它实际上是慢的,因为预计算不能一次性完成,因为它依赖于消息输入)。在

/EDIT:我发现了一个库,里面有很多非常有趣的优化,甚至在任何学术论文中都找不到(作者说他读过btw),这可能是Reed-Solomon最快的软件实现:关于更多细节,wirehair project和{a2}。值得注意的是,作者还使用类似的优化技巧制作了一个Cauchy-Reed-Solomon codec called longhair。在

最终版本/最终版本如下:

Plank, James S., Kevin M. Greenan, and Ethan L. Miller. "Screaming fast Galois field arithmetic using intel SIMD instructions." FAST. 2013. link

implementation, in pure Go, is available here and is authored by Klaus Post。这是我读过的最快的实现,包括单线程和并行(它同时支持两者)。它声称单线程速度超过1GB/s,8线程时超过4Gb/s。然而,它依赖于优化的SIMD指令和对矩阵运算的各种低级优化(因为这里RS编解码器是面向矩阵的,而不是我问题中的多项式方法)。在

因此,如果你是一个有兴趣的读者,并想找到最快的里德所罗门编解码器可用,这是一个。在

在我的机器上,以下是pypy的3倍(0.04s vs 0.15s)。使用Cython:

ctypedef unsigned char uint8_t # does not work with Microsoft's C Compiler: from libc.stdint cimport uint8_t
cimport cpython.array as array

cdef uint8_t[::1] gf_exp = bytearray([1, 3, 5, 15, 17, 51, 85, 255, 26, 46, 114, 150, 161, 248, 19,
   lots of numbers omitted for space reasons
   ...])

cdef uint8_t[::1] gf_log = bytearray([0, 0, 25, 1, 50, 2, 26, 198, 75, 199, 27, 104, 
    more numbers omitted for space reasons
    ...])

import cython

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.initializedcheck(False)
def rsenc(msg_in_r, nsym, gen_t):
    '''Reed-Solomon encoding using polynomial division, better explained at http://research.swtch.com/field'''

    cdef uint8_t[::1] msg_in = bytearray(msg_in_r) # have to copy, unfortunately - can't make a memory view from a read only object
    cdef int[::1] gen = array.array('i',gen_t) # convert list to array

    cdef uint8_t[::1] msg_out = bytearray(msg_in) + bytearray(len(gen)-1)
    cdef int j
    cdef uint8_t[::1] lgen = bytearray(gen.shape[0])
    for j in xrange(gen.shape[0]):
        lgen[j] = gf_log[gen[j]]

    cdef uint8_t coef,lcoef

    cdef int i
    for i in xrange(msg_in.shape[0]):
        coef = msg_out[i]
        if coef != 0: # coef 0 is normally undefined so we manage it manually here (and it also serves as an optimization btw)
            lcoef = gf_log[coef] # precaching

            for j in xrange(1, gen.shape[0]): # optimization: can skip g0 because the first coefficient of the generator is always 1! (that's why we start at position 1)
                msg_out[i + j] ^= gf_exp[lcoef + lgen[j]] # equivalent (in Galois Field 2^8) to msg_out[i+j] -= msg_out[i] * gen[j]

    # Recopy the original message bytes
    msg_out[:msg_in.shape[0]] = msg_in
    return msg_out

它只是静态类型的最快版本(从cython -a检查html,直到循环没有用黄色突出显示)。在

一些简短的说明:

  • Cython更喜欢x.shape[0]而不是len(shape)

  • 将memoryView定义为[::1]可以保证它们在内存中是连续的,这有助于

  • initializedcheck(False)对于避免对全局定义的gf_exp和{}进行大量的存在性检查是必要的。(您可能会发现,通过为基本Python/PyPy代码创建一个局部变量引用并使用该istead,可以加快基本Python/PyPy代码的速度)

  • 我不得不复制几个输入参数。Cython无法从只读对象(在本例中是msg_in,一个字符串)创建memoryview。不过,我本可以把它做成一个字符。另外,gen(一个列表)需要在具有快速元素访问的东西中。

除此之外,一切都相当直截了当。(我还没有尝试过任何变化,因为它更快了)。PyPy的表现真的让我印象深刻。在

相关问题 更多 >

    热门问题