Calculating the cumulative distribution function (CDF) in Python

2 answers

User

Answer 1 · edited 2024-10-01 22:35:42

Assuming you know how your data is distributed (i.e. you know the pdf of your data), scipy can evaluate the cdf directly on discrete samples:

import numpy as np
import scipy.stats  # import the submodule explicitly so scipy.stats.norm is available
import matplotlib.pyplot as plt
import seaborn as sns

x = np.random.randn(10000) # generate samples from normal distribution (discrete data)
norm_cdf = scipy.stats.norm.cdf(x) # calculate the cdf - also discrete

# plot the cdf
sns.lineplot(x=x, y=norm_cdf)
plt.show()

We can even print the first few values of the cdf to show that they are discrete:

print(norm_cdf[:10])
>>> array([0.39216484, 0.09554546, 0.71268696, 0.5007396 , 0.76484329,
       0.37920836, 0.86010018, 0.9191937 , 0.46374527, 0.4576634 ])

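The same way of computing the cdf also works for multi-dimensional input; the 2D data below illustrates this. Note that scipy.stats.norm.cdf is applied elementwise, so the result is the marginal standard-normal cdf of each coordinate, not the joint cdf: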

mu = np.zeros(2) # mean vector
cov = np.array([[1,0.6],[0.6,1]]) # covariance matrix
# generate 2d normally distributed samples using 0 mean and the covariance matrix above
x = np.random.multivariate_normal(mean=mu, cov=cov, size=1000) # 1000 samples
norm_cdf = scipy.stats.norm.cdf(x)
print(norm_cdf.shape)
>>> (1000, 2)
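
If the joint cdf P(X1 <= a, X2 <= b) is what you need instead of the per-coordinate values, a minimal sketch using scipy.stats.multivariate_normal could look like the following. This is not part of the original answer; the seed and the sample size of 5 are arbitrary choices made for illustration.

```python
import numpy as np
from scipy import stats

mu = np.zeros(2)                           # mean vector
cov = np.array([[1.0, 0.6], [0.6, 1.0]])   # covariance matrix

rng = np.random.default_rng(0)             # seed is an arbitrary, illustrative choice
points = rng.multivariate_normal(mean=mu, cov=cov, size=5)

# Elementwise standard-normal cdf: one value per coordinate, shape (5, 2)
marginal_cdf = stats.norm.cdf(points)

# Joint cdf of the correlated 2D normal: one value per point, shape (5,)
joint_dist = stats.multivariate_normal(mean=mu, cov=cov)
joint_cdf = joint_dist.cdf(points)

# Joint cdf at the origin, i.e. P(X1 <= 0, X2 <= 0)
joint_at_origin = joint_dist.cdf(np.zeros(2))

print(marginal_cdf.shape, np.shape(joint_cdf))
```

The joint value at each point can never exceed the smaller of the two marginal values, which is a quick sanity check on the output.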

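In the examples above I knew beforehand that my data was normally distributed, which is why I used scipy.stats.norm(); scipy supports many other distributions. But again, you need to know in advance how your data is distributed to use these functions. If you do not, and simply compute the cdf with an arbitrary distribution, you will most likely get incorrect results.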
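
To make the last point concrete, here is a small sketch that evaluates the cdf of the same samples twice: once with the parameters that generated them, and once with a wrongly assumed standard normal. The mean of 5 and standard deviation of 2 are assumptions chosen for illustration and do not come from the original answer.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)      # arbitrary seed, for reproducibility only
true_mean, true_std = 5.0, 2.0      # illustrative parameters (assumed)
samples = rng.normal(loc=true_mean, scale=true_std, size=10_000)

# cdf under the distribution that actually generated the samples
cdf_correct = stats.norm.cdf(samples, loc=true_mean, scale=true_std)

# cdf under a wrongly assumed standard normal (loc=0, scale=1)
cdf_wrong = stats.norm.cdf(samples)

# With the right distribution, about half of the samples fall below cdf = 0.5.
frac_below_half_correct = np.mean(cdf_correct < 0.5)
# With the wrong one, almost none do, because nearly every sample is above 0.
frac_below_half_wrong = np.mean(cdf_wrong < 0.5)

print(frac_below_half_correct, frac_below_half_wrong)
```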

User

Answer 2 · edited 2024-10-01 22:35:42

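(It is possible that my interpretation of the question is wrong. If the question is how to get from a discrete PDF to a discrete CDF, then np.cumsum divided by a suitable constant will do, provided the samples are equispaced. If the array is not equispaced, then np.cumsum of the array multiplied by the distances between the points will do.)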
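
A rough sketch of both cases, using a standard normal pdf sampled on a grid as stand-in data. The grids and the helper normal_pdf are assumptions made for this illustration, and the uneven case uses a simple rectangle rule with each interval's right endpoint.

```python
import numpy as np

def normal_pdf(x):
    """Standard normal pdf, used here only as example input."""
    return np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

# Case 1: equispaced grid. cumsum scaled by the constant spacing.
x_even = np.linspace(-5.0, 5.0, 1001)
dx = x_even[1] - x_even[0]
cdf_even = np.cumsum(normal_pdf(x_even)) * dx

# Case 2: uneven grid (coarser on the left half, finer on the right half).
x_uneven = np.concatenate([np.linspace(-5.0, 0.0, 300),
                           np.linspace(0.0, 5.0, 800)[1:]])
widths = np.diff(x_uneven)                       # distance between neighbours
increments = normal_pdf(x_uneven[1:]) * widths   # pdf value times interval width
cdf_uneven = np.concatenate(([0.0], np.cumsum(increments)))

# Both should end close to 1 if the grid covers most of the probability mass.
print(cdf_even[-1], cdf_uneven[-1])
```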

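If you have a discrete array of samples and you would like to know the CDF of the sample, you can simply sort the array. If you look at the sorted result, you will see that the smallest value represents 0% and the largest value represents 100%. If you want to know the value at 50% of the distribution, just look at the array element in the middle of the sorted array.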
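
A tiny illustration of that idea with five made-up values (the numbers are arbitrary and chosen so the result is easy to check by eye):

```python
import numpy as np

data = np.array([7.0, 1.0, 5.0, 3.0, 9.0])   # made-up sample
data_sorted = np.sort(data)

# Fraction of the way through the sorted array: 0.0 for the smallest value,
# 1.0 for the largest.
p = np.arange(len(data_sorted)) / (len(data_sorted) - 1)

for value, fraction in zip(data_sorted, p):
    print(f"{value:4.1f} -> {fraction:.0%}")

# The element in the middle of the sorted array is the 50% value.
middle_value = data_sorted[len(data_sorted) // 2]
```

For an odd number of samples the middle element coincides with np.median(data); for an even number, np.median averages the two central elements instead.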

Let's look at this in more detail with a simple example:

import matplotlib.pyplot as plt
import numpy as np

# create some randomly distributed data:
data = np.random.randn(10000)

# sort the data:
data_sorted = np.sort(data)

# calculate the proportional values of samples
p = 1. * np.arange(len(data)) / (len(data) - 1)

# plot the sorted data:
fig = plt.figure()
ax1 = fig.add_subplot(121)
ax1.plot(p, data_sorted)
ax1.set_xlabel('$p$')
ax1.set_ylabel('$x$')

ax2 = fig.add_subplot(122)
ax2.plot(data_sorted, p)
ax2.set_xlabel('$x$')
ax2.set_ylabel('$p$')

plt.show()

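This gives the plot below, where the right-hand panel is the traditional cumulative distribution function. It should reflect the CDF of the process behind the points, but naturally it does so only approximately as long as the number of points is finite.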

[Figure: cumulative distribution function]
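
One way to see the effect of a finite sample is to measure how far the sorted-sample curve is from the known normal CDF for a small and a large sample. The sketch below assumes standard normal data, as in the example; the helper name max_deviation_from_normal, the seed, and the sample sizes are illustrative choices.

```python
import numpy as np
from scipy import stats

def max_deviation_from_normal(n, rng):
    """Largest gap between the sorted-sample curve and the standard normal CDF."""
    data_sorted = np.sort(rng.standard_normal(n))
    p = np.arange(n) / (n - 1)                 # same proportions as in the answer
    return np.max(np.abs(p - stats.norm.cdf(data_sorted)))

rng = np.random.default_rng(0)                 # arbitrary seed
dev_small = max_deviation_from_normal(100, rng)
dev_large = max_deviation_from_normal(100_000, rng)

print(f"n=100:    max deviation {dev_small:.4f}")
print(f"n=100000: max deviation {dev_large:.4f}")
```

The deviation typically shrinks as the number of points grows, which matches the remark that the curve only approximates the underlying CDF for a finite sample.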

This function is easy to invert; which form you need depends on your application.
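
As a hedged sketch of both directions, np.interp can be used with the sorted data and the proportions swapped. The function names empirical_cdf and empirical_quantile are hypothetical, and the five data values are made up; linear interpolation between samples is an assumption of this sketch, not something the answer prescribes.

```python
import numpy as np

data = np.array([7.0, 1.0, 5.0, 3.0, 9.0])   # made-up sample
data_sorted = np.sort(data)
p = np.arange(len(data_sorted)) / (len(data_sorted) - 1)

def empirical_cdf(x):
    """Value -> proportion, by linear interpolation between sorted samples."""
    return np.interp(x, data_sorted, p)

def empirical_quantile(q):
    """Proportion -> value: the same curve with the axes swapped."""
    return np.interp(q, p, data_sorted)

print(empirical_cdf(4.0))        # proportion assigned to the value 4.0
print(empirical_quantile(0.5))   # value at 50% of the distribution
```

With this choice of p, empirical_quantile agrees with np.quantile(data, q) under NumPy's default linear method. Repeated values in the data make data_sorted non-strictly increasing, so empirical_cdf is only well defined between distinct values.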
