How do I fill the NaN values in a numeric array so that SVD can be applied?

>>> df1 = pd.DataFrame(np.random.rand(6, 4), columns=['A', 'B', 'C', 'D'])
>>> df1
          A         B         C         D
0  0.763144  0.752176  0.601228  0.290276
1  0.632144  0.202513  0.111766  0.317838
2  0.494587  0.318276  0.951354  0.051253
3  0.184826  0.429469  0.280297  0.014895
4  0.236955  0.560095  0.357246  0.302688
5  0.729145  0.293810  0.525223  0.744513
>>> df2 = pd.DataFrame(np.random.rand(6, 4), columns=['A', 'B', 'C', 'E'])
>>> df2
          A         B         C         E
0  0.969758  0.650887  0.821926  0.884600
1  0.657851  0.158992  0.731678  0.841507
2  0.923716  0.524547  0.783581  0.268123
3  0.935014  0.219135  0.152794  0.433324
4  0.327104  0.581433  0.474131  0.521481
5  0.366469  0.709115  0.462106  0.416601
>>> df3 = pd.concat([df1,df2], axis=0)
>>> df3
          A         B         C         D         E
0  0.763144  0.752176  0.601228  0.290276       NaN
1  0.632144  0.202513  0.111766  0.317838       NaN
2  0.494587  0.318276  0.951354  0.051253       NaN
3  0.184826  0.429469  0.280297  0.014895       NaN
4  0.236955  0.560095  0.357246  0.302688       NaN
5  0.729145  0.293810  0.525223  0.744513       NaN
0  0.969758  0.650887  0.821926       NaN  0.884600
1  0.657851  0.158992  0.731678       NaN  0.841507
2  0.923716  0.524547  0.783581       NaN  0.268123
3  0.935014  0.219135  0.152794       NaN  0.433324
4  0.327104  0.581433  0.474131       NaN  0.521481
5  0.366469  0.709115  0.462106       NaN  0.416601
>>> U, s, V = np.linalg.svd(df3.values, full_matrices=True)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/numpy-1.11.0b3-py3.4-macosx-10.6-intel.egg/numpy/linalg/linalg.py", line 1359, in svd
    u, s, vt = gufunc(a, signature=signature, extobj=extobj)
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/numpy-1.11.0b3-py3.4-macosx-10.6-intel.egg/numpy/linalg/linalg.py", line 99, in _raise_linalgerror_svd_nonconvergence
    raise LinAlgError("SVD did not converge")
numpy.linalg.linalg.LinAlgError: SVD did not converge

1 Answer

User

#1 · Posted on 2024-05-05 12:00:43

The SVD of a matrix with missing values can be approximated with an iterative procedure:

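1. Fill in the missing values with a rough approximation (e.g. replace them with the column means)
2. Perform SVD on the filled-in matrix
3. Reconstruct the data matrix from the SVD to get a better approximation of the missing values
4. Repeat steps 2-3 until convergence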

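This is a form of the expectation-maximization (EM) algorithm, where the E step updates the estimates of the missing values from the SVD, and the M step computes the SVD on the updated data matrix (see Section 1.3 here for more details).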
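Steps 1 and 2 on their own are already enough to get an SVD out of data with gaps. A minimal sketch, using a made-up toy matrix `Y` (not the data from the question) and plain column-mean filling:

```python
import numpy as np

# Toy matrix with missing entries (values invented for illustration)
Y = np.array([[1.0, 2.0, np.nan],
              [4.0, np.nan, 6.0],
              [7.0, 8.0, 9.0]])

# Step 1: replace each NaN with the mean of the observed values in its column
col_means = np.nanmean(Y, axis=0)
Y_filled = np.where(np.isnan(Y), col_means, Y)

# Step 2: SVD of the filled-in matrix
U, s, Vt = np.linalg.svd(Y_filled, full_matrices=False)
```

The full iterative version, which refines these initial guesses, follows.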

import numpy as np
from scipy.sparse.linalg import svds
from functools import partial


def emsvd(Y, k=None, tol=1E-3, maxiter=None):
    """
    Approximate SVD on data with missing values via expectation-maximization

    Inputs:
    -----------
    Y:          (nobs, ndim) data matrix, missing values denoted by NaN/Inf
    k:          number of singular values/vectors to find (default: k=ndim)
    tol:        convergence tolerance on change in trace norm
    maxiter:    maximum number of EM steps to perform (default: no limit)

    Returns:
    -----------
    Y_hat:      (nobs, ndim) reconstructed data matrix
    mu_hat:     (1, ndim) estimated column means for reconstructed data
    U, s, Vt:   singular values and vectors (see np.linalg.svd and 
                scipy.sparse.linalg.svds for details)
    """

    if k is None:
        svdmethod = partial(np.linalg.svd, full_matrices=False)
    else:
        svdmethod = partial(svds, k=k)
    if maxiter is None:
        maxiter = np.inf

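    # initialize the missing values to their respective column means,
    # computed over the finite entries only so that Inf is treated as missing
    valid = np.isfinite(Y)
    mu_hat = np.nanmean(np.where(valid, Y, np.nan), axis=0, keepdims=True)
    Y_hat = np.where(valid, Y, mu_hat)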

    halt = False
    ii = 1
    v_prev = 0

    while not halt:

        # SVD on filled-in data
        U, s, Vt = svdmethod(Y_hat - mu_hat)

        # impute missing values
        Y_hat[~valid] = (U.dot(np.diag(s)).dot(Vt) + mu_hat)[~valid]

        # update bias parameter
        mu_hat = Y_hat.mean(axis=0, keepdims=1)

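        # test convergence using relative change in trace norm; the check is
        # skipped while v_prev is 0 (first pass) to avoid dividing by zero, and
        # abs() keeps a decrease in the norm from being mistaken for convergence
        v = s.sum()
        if ii >= maxiter or (v_prev > 0 and abs(v - v_prev) / v_prev < tol):
            halt = True
        ii += 1
        v_prev = v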

    return Y_hat, mu_hat, U, s, Vt
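One thing to watch for: with `k=None` the full-rank reconstruction reproduces the filled-in matrix (up to rounding), so the missing entries stay at their column means. The iteration only refines them when the SVD is truncated to a rank below `ndim`. Below is a condensed, self-contained sketch of that low-rank case. The helper name `low_rank_impute` and the choice `k=2` are assumptions made for illustration, and the random array only mimics the layout of `df3` from the question (columns A, B, C shared, D and E each half missing).

```python
import numpy as np


def low_rank_impute(Y, k, tol=1e-3, maxiter=100):
    """Hypothetical helper: iterative rank-k SVD imputation (condensed sketch)."""
    Y = np.asarray(Y, dtype=float)
    valid = np.isfinite(Y)
    # start from the column means of the observed entries
    mu = np.nanmean(np.where(valid, Y, np.nan), axis=0, keepdims=True)
    Y_hat = np.where(valid, Y, mu)
    v_prev = None
    for _ in range(maxiter):
        U, s, Vt = np.linalg.svd(Y_hat - mu, full_matrices=False)
        # keep only the k largest singular values/vectors
        U, s, Vt = U[:, :k], s[:k], Vt[:k]
        # overwrite the missing entries with the rank-k reconstruction
        Y_hat[~valid] = ((U * s) @ Vt + mu)[~valid]
        mu = Y_hat.mean(axis=0, keepdims=True)
        v = s.sum()
        if v_prev is not None and v_prev > 0 and abs(v - v_prev) / v_prev < tol:
            break
        v_prev = v
    return Y_hat, U, s, Vt


# Random data laid out like df3: 12 rows, columns A, B, C, D, E
rng = np.random.default_rng(0)
Y = np.full((12, 5), np.nan)
Y[:6, :4] = rng.random((6, 4))            # first block has A, B, C, D
Y[6:, [0, 1, 2, 4]] = rng.random((6, 4))  # second block has A, B, C, E

Y_hat, U, s, Vt = low_rank_impute(Y, k=2)

# The completed matrix no longer contains NaN, so a plain SVD now runs
U_full, s_full, Vt_full = np.linalg.svd(Y_hat, full_matrices=True)
```

With the DataFrame from the question, the same call would take `df3.values` as input. Observed entries are left untouched; only the positions that were NaN are replaced.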
