Java matrix access and multiplication optimization for the CPU


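I am building an intrinsics-optimized matrix wrapper in Java (with the help of JNI). I would like confirmation that the design is sound, and any hints about matrix optimization. This is what I plan to implement:

A matrix can be represented as four sets of buffers/arrays: one for horizontal access, one for vertical access, one for diagonal access, and a command buffer that computes matrix elements only when they are needed. Here is an example: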

Matrix signature:

0  1  2  3
4  5  6  7
8  9  1  3
3  5  2  9

First (horizontal) set:
horSet[0]={0,1,2,3} horSet[1]={4,5,6,7} horSet[2]={8,9,1,3} horSet[3]={3,5,2,9}

Second (vertical) set:
verSet[0]={0,4,8,3} verSet[1]={1,5,9,5} verSet[2]={2,6,1,2} verSet[3]={3,7,3,9}

Third (optional), a diagonal set:
diagS={0,5,1,9} //just in case some calculation needs this

Fourth (calculation list, in a "one calculation, one data" fashion) set:
calc={0,2,1,3,2,5} --->0 means multiply by the next element
                       1 means add the next element
                       2 means divide by the next element
                       so this list means
                       ( (a[i]*2)+3 ) / 5  when only a[i] is needed.
Example for fourth set: 
A.mult(2),   A.sum(3),  A.div(5), A.mult(B)
(to list)   (to list)  (to list) (calculate *+/ just in time when A is needed )
 so only one memory access for four operations.
 loop start
 a[i] = b[i] * ( ( a[i]*2) +3 ) / 5  only for A.mult(B)
 loop end
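The calculation list above can be sketched in plain Java as follows. This is only an illustration under assumptions: the class name `LazyVector`, its methods, and the choice to store each queued step as an `{opcode, operand}` pair are hypothetical, not part of the wrapper described in the question. Scalar operations are queued and then applied in the same pass that performs the element-wise multiply, so each element is read and written once.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the "calculation list": scalar ops are queued, then applied in one pass. */
final class LazyVector {
    // Opcodes follow the question's convention: 0 = multiply, 1 = add, 2 = divide.
    static final int MUL = 0, ADD = 1, DIV = 2;

    private final double[] data;
    private final List<double[]> calc = new ArrayList<>(); // each entry is {opcode, operand}

    LazyVector(double[] data) { this.data = data.clone(); }

    // Scalar operations only append to the list; no element is touched yet.
    LazyVector mult(double s) { calc.add(new double[] {MUL, s}); return this; }
    LazyVector sum(double s)  { calc.add(new double[] {ADD, s}); return this; }
    LazyVector div(double s)  { calc.add(new double[] {DIV, s}); return this; }

    /** Applies every queued operation, in order, to one element value. */
    private double eval(double x) {
        for (double[] op : calc) {
            switch ((int) op[0]) {
                case MUL: x *= op[1]; break;
                case ADD: x += op[1]; break;
                default:  x /= op[1]; break;
            }
        }
        return x;
    }

    /** Element-wise a[i] = b[i] * eval(a[i]); the queue is flushed in the same pass. */
    LazyVector mult(double[] b) {
        for (int i = 0; i < data.length; i++) {
            data[i] = b[i] * eval(data[i]);
        }
        calc.clear();
        return this;
    }

    /** Reads one element, applying any still-pending operations on the fly. */
    double get(int i) { return eval(data[i]); }
}
```

A call chain such as `new LazyVector(a).mult(2).sum(3).div(5).mult(b)` then mirrors the `A.mult(2), A.sum(3), A.div(5), A.mult(B)` sequence from the example.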

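As shown above, when column elements need to be accessed, the second set provides contiguous access with no strided jumps. The first set does the same for horizontal (row) access.

This should make some things easier and some things harder: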
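A minimal sketch of the row, column, and diagonal sets, assuming a square matrix stored as `double[][]`. The class name `DualLayoutMatrix` is hypothetical; the field names `horSet`, `verSet`, and `diagS` are taken from the example above.

```java
/** Hypothetical sketch: one square matrix kept row-wise, column-wise, and with a diagonal copy. */
final class DualLayoutMatrix {
    double[][] horSet; // horSet[r] holds row r contiguously
    double[][] verSet; // verSet[c] holds column c contiguously
    double[] diagS;    // main diagonal

    DualLayoutMatrix(double[][] rows) {
        int n = rows.length; // assumption: the input is square (n x n)
        horSet = new double[n][];
        verSet = new double[n][n];
        diagS = new double[n];
        for (int r = 0; r < n; r++) {
            horSet[r] = rows[r].clone();
            for (int c = 0; c < n; c++) {
                verSet[c][r] = rows[r][c]; // element (r, c) lands in column buffer c
            }
            diagS[r] = rows[r][r];
        }
    }

    /** Transposes by swapping the two references; the main diagonal is unchanged. */
    void transpose() {
        double[][] tmp = horSet;
        horSet = verSet;
        verSet = tmp;
    }
}
```

Note that in Java each inner array is contiguous on its own, but separate rows of a `double[][]` are not guaranteed to be adjacent in memory; a native (JNI) implementation would more likely use one flat buffer per set.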

 Easier: 
 **Matrix transpose operation. 
 Just swapping the pointers horSet[x] and verSet[x] is enough.

 **Matrix * Matrix multiplication.
 One matrix gives one of its horizontal set and other matrix gives vertical buffer.
 Dot product of these must be highly parallelizable for intrinsics/multithreading.
 If the multiplication order is inverse, then horizontal and verticals are switched.

 **Matrix * vector multiplication.
 Same as above, just a vector can be taken as horizontal or vertical freely.
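The multiplication scheme in this list can be sketched as below, assuming the left operand supplies its horizontal set and the right operand supplies its vertical set. The class and method names are hypothetical, and the scalar `dot` loop only stands in for whatever intrinsic kernel the native side provides.

```java
/** Hypothetical sketch: C = A * B computed from A's rows and B's columns. */
final class RowColumnMultiply {
    /** Dot product of two equally long contiguous arrays. */
    static double dot(double[] x, double[] y) {
        double sum = 0.0;
        for (int k = 0; k < x.length; k++) {
            sum += x[k] * y[k];
        }
        return sum;
    }

    /** aHor[i] is row i of A, bVer[j] is column j of B; both are read contiguously. */
    static double[][] multiply(double[][] aHor, double[][] bVer) {
        int n = aHor.length;
        int m = bVer.length;
        double[][] c = new double[n][m];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < m; j++) {
                c[i][j] = dot(aHor[i], bVer[j]); // each entry is an independent task
            }
        }
        return c;
    }

    /** Matrix * vector: the vector is simply treated as a single column. */
    static double[] multiply(double[][] aHor, double[] x) {
        double[] y = new double[aHor.length];
        for (int i = 0; i < aHor.length; i++) {
            y[i] = dot(aHor[i], x);
        }
        return y;
    }
}
```

Because every `c[i][j]` depends only on one row buffer and one column buffer, the outer loops can be split across threads without any shared writes.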

 Harder:
 ** Doubling memory requirement is bad for many cases.
 ** Initializing a matrix takes longer.
 ** When a matrix is multiplied from the left, its sets need a vertical-->horizontal
 update if it is going to be multiplied from the right afterwards (and vice versa).
 (If a transposition is taken in between, this does not apply.)
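One way to keep the update cost under control is to mark the other set as stale and rebuild it only when it is actually read. The sketch below is an assumption about how that could look, not something the question specifies; `SyncedMatrix`, `storeRow`, and `column` are hypothetical names.

```java
/** Hypothetical sketch: the vertical set is rebuilt lazily after writes to the horizontal set. */
final class SyncedMatrix {
    final double[][] horSet;
    final double[][] verSet;
    private boolean verStale = false; // true when horSet was written but verSet was not

    SyncedMatrix(int n) {
        horSet = new double[n][n];
        verSet = new double[n][n];
    }

    /** Writes a result row into the horizontal set only and marks the vertical set stale. */
    void storeRow(int r, double[] row) {
        System.arraycopy(row, 0, horSet[r], 0, row.length);
        verStale = true;
    }

    /** Returns column c, rebuilding the whole vertical set first if it is out of date. */
    double[] column(int c) {
        if (verStale) {
            int n = horSet.length;
            for (int r = 0; r < n; r++) {
                for (int k = 0; k < n; k++) {
                    verSet[k][r] = horSet[r][k];
                }
            }
            verStale = false;
        }
        return verSet[c];
    }
}
```

With this arrangement a chain of multiplications that only ever writes rows never pays for the rebuild; the cost appears once, on the first column read.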


 Neutral:
 ** Same matrix can be multiplied with two other matrices to get two different
 results such as A=A*B(saved in horizontal sets)   A=C*A(saved in vertical sets)
 then A=A*A gives   A*B*C*A(in horizontal) and C*A*A*B (in vertical) without
 copying A. 

 ** If a matrix is always multiplied from the left, or always from the right, no
 access or multiplication needs an update, and all reads stay contiguous in RAM.

 ** Only using horizontals before transposing, only using verticals after, 
 should not break any rules.

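The main goal is to have matrices of size (multiple of 8, multiple of 8) and to apply AVX intrinsics with multithreading (each thread works on one set at a time).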
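A rough pure-Java sketch of the threading part, under stated assumptions: a parallel `IntStream` stands in for the wrapper's own thread management, the plain inner loop stands in for the AVX kernel that would live in native code, and the names `ParallelRows` and `multiply` are hypothetical. Each task reads exactly one horizontal set.

```java
import java.util.stream.IntStream;

/** Hypothetical sketch: matrix * vector where every row buffer is handled by one task. */
final class ParallelRows {
    static double[] multiply(double[][] horSet, double[] x) {
        // The question targets sizes that are multiples of 8 (one AVX register of floats).
        if (x.length % 8 != 0) {
            throw new IllegalArgumentException("vector length must be a multiple of 8");
        }
        double[] y = new double[horSet.length];
        IntStream.range(0, horSet.length).parallel().forEach(i -> {
            double[] row = horSet[i];
            double sum = 0.0;
            for (int k = 0; k < row.length; k++) {
                sum += row[k] * x[k]; // placeholder for the intrinsic dot-product kernel
            }
            y[i] = sum; // each task writes a distinct index, so no locking is needed
        });
        return y;
    }
}
```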

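So far I have only implemented the vector * vector dot product. I will take this further if the programming experts here can point me in the right direction.

The dot product I wrote (using intrinsics) is 6x faster than the loop-unrolled version, which is itself twice as fast as multiplying one element at a time. With multithreading enabled in the wrapper it reaches about 8x and uses nearly 20 GB/s, which is close to the limit of my DDR3 memory. I have already tried OpenCL; it is somewhat slow on the CPU but great on the GPU.

Thanks.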
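For reference, the two scalar baselines mentioned here might look like the sketch below. This is an assumption about their shape, since the question does not show them; the intrinsic version lives in native code and is not reproduced. `DotProducts`, `simple`, and `unrolled8` are hypothetical names.

```java
/** Hypothetical sketch of the scalar baselines: element-by-element and unrolled by 8. */
final class DotProducts {
    /** One multiply-add per iteration. */
    static float simple(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    /** Eight independent accumulators per iteration; assumes length is a multiple of 8. */
    static float unrolled8(float[] a, float[] b) {
        float s0 = 0f, s1 = 0f, s2 = 0f, s3 = 0f, s4 = 0f, s5 = 0f, s6 = 0f, s7 = 0f;
        for (int i = 0; i < a.length; i += 8) {
            s0 += a[i] * b[i];
            s1 += a[i + 1] * b[i + 1];
            s2 += a[i + 2] * b[i + 2];
            s3 += a[i + 3] * b[i + 3];
            s4 += a[i + 4] * b[i + 4];
            s5 += a[i + 5] * b[i + 5];
            s6 += a[i + 6] * b[i + 6];
            s7 += a[i + 7] * b[i + 7];
        }
        // Combine the partial sums at the end.
        return ((s0 + s1) + (s2 + s3)) + ((s4 + s5) + (s6 + s7));
    }
}
```

The separate accumulators remove the dependency between consecutive additions. Because floating-point addition is not associative, the two methods can differ in the last bits for general inputs.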

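Edit: How would a "block matrix" buffer perform? When large matrices are multiplied, small patches are multiplied in a special order so that the cache can be used to reduce main-memory accesses. But this would require more updates between matrix multiplications to keep the vertical, horizontal, and diagonal sets and this block buffer in sync.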
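For comparison, a generic cache-blocked (tiled) multiplication can be sketched as follows. It assumes square row-major `double[][]` inputs and a tunable block size `bs`; the name `BlockedMultiply` is hypothetical, and this is the textbook tiling technique rather than the block buffer proposed in the edit.

```java
/** Hypothetical sketch: tiled C = A * B on square row-major matrices. */
final class BlockedMultiply {
    static double[][] multiply(double[][] a, double[][] b, int bs) {
        int n = a.length;
        double[][] c = new double[n][n];
        // Walk the matrices tile by tile so that one tile of A, B and C stays in cache.
        for (int ii = 0; ii < n; ii += bs) {
            for (int kk = 0; kk < n; kk += bs) {
                for (int jj = 0; jj < n; jj += bs) {
                    int iMax = Math.min(ii + bs, n);
                    int kMax = Math.min(kk + bs, n);
                    int jMax = Math.min(jj + bs, n);
                    for (int i = ii; i < iMax; i++) {
                        for (int k = kk; k < kMax; k++) {
                            double aik = a[i][k];
                            for (int j = jj; j < jMax; j++) {
                                c[i][j] += aik * b[k][j]; // row-wise access to B and C
                            }
                        }
                    }
                }
            }
        }
        return c;
    }
}
```

With the i-k-j order inside each tile, both `b` and `c` are read along rows, so this variant does not need a separate vertical set at all; whether that outweighs the row-times-column scheme would have to be measured.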


1 answer

# Answer 1