<h2>Preamble</h2>
<p>As far as I know, seaborn's distplot performs a KDE estimate by default.
If you want a normalized distplot, it may be because you assume the plot's Y values should be bounded within [0;1]. If so, a Stack Overflow question has already raised the issue of <a href="https://stackoverflow.com/questions/46441481/why-does-this-kernel-density-estimation-have-values-over-1-0">kde estimators showing values above 1</a>.</p>
<p>Quoting <a href="https://stackoverflow.com/a/46448001/7237062">one answer</a>:</p>
<blockquote>
<p>a continous pdf <em>(pdf=probability density function)</em> never says the value to be less than 1, with the pdf for continous random variable, f<strong>unction p(x) is not the probability</strong>. you can refer for continuous random variables and their distrubutions</p>
</blockquote>
<p>引用<a href="https://stackoverflow.com/users/4124317/importanceofbeingernest">importanceofbeingernest</a>的第一条注释:</p>
<blockquote>
<p><strong>The integral over a pdf is 1</strong>. There is no contradiction to be seen here.</p>
</blockquote>
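<p>As far as I understand, it is the cumulative distribution function (CDF), i.e. the integral of the PDF, whose values are bounded within [0;1], not the PDF itself.</p>
<p><em>Note: all the possible continuous fitting functions are listed <a href="https://docs.scipy.org/doc/scipy-0.14.0/reference/stats.html#continuous-distributions" rel="nofollow noreferrer">on SciPy site and available in the package scipy.stats</a></em></p>
<p>Maybe also have a look at <a href="https://en.wikipedia.org/wiki/Probability_mass_function" rel="nofollow noreferrer">probability mass functions</a>?</p>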
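<p>To make the PDF versus CDF distinction concrete, here is a minimal sketch on synthetic data. The sample, its parameters and the variable names are assumptions made for illustration, not the data from the question. It fits a <code>scipy.stats.gaussian_kde</code> on a narrow sample and checks that the density peaks well above 1 while its integral stays at 1 and its CDF stays within [0;1].</p>

```python
import numpy as np
from scipy import stats

# Hypothetical sample: tightly clustered values, so the density must be tall.
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.5, scale=0.02, size=500)

kde = stats.gaussian_kde(sample)  # Scott's rule bandwidth by default

# Evaluate the estimated PDF on a grid wide enough to cover the tails.
grid = np.linspace(sample.min() - 0.2, sample.max() + 0.2, 401)
density = kde.evaluate(grid)
peak = density.max()

# Total area under the estimated PDF.
area = kde.integrate_box_1d(-np.inf, np.inf)

# CDF of the estimate: integral of the PDF from -inf up to each grid point.
cdf = np.array([kde.integrate_box_1d(-np.inf, x) for x in grid])

print('peak of the pdf :', peak)
print('area under pdf  :', area)
print('cdf range       :', cdf.min(), cdf.max())
```

<p>The peak is far above 1 because the sample is concentrated on a short interval, which is exactly the situation described in the linked question.</p>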
<hr/>
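<p>If you really want the same graph normalized, then you should gather the actual data points of the plotted function (option 1) or of the function definition (option 2), normalize them yourself and plot them again.</p>
<h2>Option 1</h2>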
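<p>As a rough illustration of what "normalize them yourself" means here, the sketch below min-max scales the Y values of a curve. The curve is a hypothetical Gaussian bump, not the question's data, and <code>min_max_normalize</code> is a helper name chosen for this example. Note that the rescaled curve is bounded within [0;1] but is no longer a density, since its area is no longer 1.</p>

```python
import numpy as np


def min_max_normalize(y):
    """Rescale y linearly so that its minimum maps to 0 and its maximum to 1."""
    y = np.asarray(y, dtype=float)
    span = np.ptp(y)  # max - min
    if span == 0:
        # Constant curve: nothing to rescale, avoid dividing by zero.
        return np.zeros_like(y)
    return (y - y.min()) / span


# Hypothetical density curve: Gaussian with mean 0.5 and standard deviation 0.05.
x = np.linspace(0.0, 1.0, 201)
y = np.exp(-0.5 * ((x - 0.5) / 0.05) ** 2) / (0.05 * np.sqrt(2 * np.pi))

y_norm = min_max_normalize(y)

# Simple Riemann sums to compare the areas before and after rescaling.
dx = x[1] - x[0]
area_before = np.sum(y) * dx
area_after = np.sum(y_norm) * dx

print('max before/after :', y.max(), y_norm.max())
print('area before/after:', area_before, area_after)
```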
<p><a href="https://i.stack.imgur.com/kHeod.png" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/kHeod.png" alt="enter image description here"/></a></p>
<pre><code>import sys

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

print('System versions : {}'.format(sys.version))
print('System versions : {}'.format(sys.version_info))
print('Numpy versqion : {}'.format(np.__version__))
print('matplotlib.pyplot version: {}'.format(matplotlib.__version__))
print('seaborn version : {}'.format(sns.__version__))


# https://stackoverflow.com/questions/29661574/normalize-numpy-array-columns-in-python
def normalize(x):
    # np.ptp(x, 0) instead of x.ptp(0): the ndarray method was removed in NumPy 2.0
    return (x - x.min(0)) / np.ptp(x, 0)


protocols = {}
types = {"data_v": "data_v.csv"}

for protname, fname in types.items():
    col_time, col_window = np.loadtxt(fname, delimiter=',').T
    trailing_window = col_window[:-1]  # "past" values at a given index
    leading_window = col_window[1:]    # "current" values at a given index
    decreasing_inds = np.where(leading_window < trailing_window)[0]
    quotient = leading_window[decreasing_inds] / trailing_window[decreasing_inds]
    quotient_times = col_time[decreasing_inds]

    protocols[protname] = {
        "col_time": col_time,
        "col_window": col_window,
        "quotient_times": quotient_times,
        "quotient": quotient,
    }

    fig, (ax1, ax2) = plt.subplots(1, 2, sharey=False, sharex=False)

    # note: distplot is deprecated since seaborn 0.11
    g = sns.distplot(quotient, hist=True, label=protname, ax=ax1, rug=True)
    ax1.set_title('basic distplot (kde=True)')

    # get the points of the kde line drawn by distplot
    line = g.get_lines()[0]
    xd = line.get_xdata()
    yd = line.get_ydata()

    # normalize the y values so they lie within [0;1]
    yd2 = normalize(yd)

    # plot them in another graph
    ax2.plot(xd, yd2)
    ax2.set_title('basic distplot (kde=True)\nwith normalized y plot values')

    plt.show()
</code></pre>
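<h2>Option 2</h2>
<p>Below, I tried to perform a kde myself and normalize the obtained estimate. I'm not a stats expert, so my use of the kde may be wrong in some respects (the result differs from seaborn's, as the screenshot shows, because seaborn does the job far better than I do. This only tries to mimic the kde fit with scipy. <em>The result is not so bad, I guess</em>).</p>
<p>Screenshot:</p>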
<p><a href="https://i.stack.imgur.com/dqhzy.png" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/dqhzy.png" alt="enter image description here"/></a></p>
<p>Code:</p>
<pre><code>import sys

import numpy as np
from scipy import stats
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

print('System versions : {}'.format(sys.version))
print('System versions : {}'.format(sys.version_info))
print('Numpy versqion : {}'.format(np.__version__))
print('matplotlib.pyplot version: {}'.format(matplotlib.__version__))
print('seaborn version : {}'.format(sns.__version__))


# taken from seaborn's source code (utils.py and distributions.py)
def seaborn_kde_support(data, bw, gridsize, cut, clip):
    if clip is None:
        clip = (-np.inf, np.inf)
    support_min = max(data.min() - bw * cut, clip[0])
    support_max = min(data.max() + bw * cut, clip[1])
    return np.linspace(support_min, support_max, gridsize)


# https://stackoverflow.com/questions/29661574/normalize-numpy-array-columns-in-python
def normalize(x):
    # np.ptp(x, 0) instead of x.ptp(0): the ndarray method was removed in NumPy 2.0
    return (x - x.min(0)) / np.ptp(x, 0)


protocols = {}
types = {"data_v": "data_v.csv"}

for protname, fname in types.items():
    col_time, col_window = np.loadtxt(fname, delimiter=',').T
    trailing_window = col_window[:-1]  # "past" values at a given index
    leading_window = col_window[1:]    # "current" values at a given index
    decreasing_inds = np.where(leading_window < trailing_window)[0]
    quotient = leading_window[decreasing_inds] / trailing_window[decreasing_inds]
    quotient_times = col_time[decreasing_inds]

    protocols[protname] = {
        "col_time": col_time,
        "col_window": col_window,
        "quotient_times": quotient_times,
        "quotient": quotient,
    }

    fig, (ax1, ax2, ax3, ax4) = plt.subplots(1, 4, sharey=False, sharex=False)

    # raw data
    diff = quotient_times
    ax1.plot(diff, quotient, ".", label=protname, color="blue")
    ax1.set_ylim(0, 1.0001)
    ax1.set_title(protname)
    ax1.set_xlabel("quotient_times")
    ax1.set_ylabel("quotient")
    ax1.legend()

    # seaborn's own kde, for comparison (distplot is deprecated since seaborn 0.11)
    sns.distplot(quotient, hist=True, label=protname, ax=ax2, rug=True)
    ax2.set_title('basic distplot (kde=True)')

    kde_estim = stats.gaussian_kde(quotient, bw_method='scott')

    # manual linearization of data
    # linearized = np.linspace(quotient.min(), quotient.max(), num=500)
    # or better: mimic seaborn's internal stuff
    bw = kde_estim.scotts_factor() * np.std(quotient)
    linearized = seaborn_kde_support(quotient, bw, 100, 3, None)

    # evaluate the estimated density on the linearized inputs
    Z = kde_estim.evaluate(linearized)

    # normalize so it is between 0;1
    Z2 = normalize(Z)
    for name, func in {'min': np.min, 'max': np.max}.items():
        print('{}: source={}, normalized={}'.format(name, func(Z), func(Z2)))

    # plot differs from seaborn's because the method applied is not exactly the same
    ax3.plot(linearized, Z, ".", label=protname, color="orange")
    ax3.set_title('Non-normalized gaussian kde values')

    # manual kde result with Y axis values normalized (between 0;1)
    ax4.plot(linearized, Z2, ".", label=protname, color="green")
    ax4.set_title('Normalized gaussian kde values')

    plt.show()
</code></pre>
<p>Output:</p>
<pre><code>System versions : 3.7.2 (default, Feb 21 2019, 17:35:59) [MSC v.1915 64 bit (AMD64)]
System versions : sys.version_info(major=3, minor=7, micro=2, releaselevel='final', serial=0)
Numpy versqion : 1.16.2
matplotlib.pyplot version: 3.0.2
seaborn version : 0.9.0
min: source=0.0021601491646143518, normalized=0.0
max: source=9.67319154426489, normalized=1.0
</code></pre>
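<p>One possible source of the small difference with seaborn's curve is the bandwidth. The sketch below, on a hypothetical uniform sample (not the question's data), compares the bandwidth computed with <code>np.std</code> and its default <code>ddof=0</code>, as done in the code above, with the kernel standard deviation stored by <code>gaussian_kde</code> in its <code>covariance</code> attribute, which as far as I can tell is based on the sample covariance (<code>ddof=1</code>). It also shows how the evaluation grid extends <code>cut * bw</code> beyond the data on each side.</p>

```python
import numpy as np
from scipy import stats

# Hypothetical sample standing in for `quotient`.
rng = np.random.default_rng(7)
data = rng.uniform(0.4, 1.0, size=60)

kde = stats.gaussian_kde(data, bw_method='scott')
factor = kde.scotts_factor()  # n ** (-1/5) for one-dimensional data

bw_population = factor * np.std(data)       # ddof=0, as in the code above
bw_sample = factor * np.std(data, ddof=1)   # ddof=1
bw_scipy = float(np.sqrt(kde.covariance[0, 0]))  # kernel std used by scipy

# Evaluation grid extending `cut` bandwidths beyond the data on each side.
cut, gridsize = 3, 100
support = np.linspace(data.min() - cut * bw_population,
                      data.max() + cut * bw_population,
                      gridsize)

print('scott factor        :', factor)
print('bw with ddof=0      :', bw_population)
print('bw with ddof=1      :', bw_sample)
print('scipy kernel std    :', bw_scipy)
print('support min and max :', support[0], support[-1])
```

<p>The two bandwidths differ only by a factor of sqrt(n / (n - 1)), so the effect is small for large samples but visible for a handful of points.</p>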
<hr/>
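<p>Contrary to what a comment suggests, plotting:</p>
<pre><code>[(x-min(quotient))/(max(quotient)-min(quotient)) for x in quotient]
</code></pre>
<p>does not change the behavior! It only changes the source data of the kernel density estimation: the x axis is rescaled and the density values are multiplied by a constant factor. The curve shape stays the same.</p>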
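<p>This can be checked numerically. The sketch below uses a hypothetical bimodal sample in place of <code>quotient</code> (an assumption, since the CSV file is not available here) and compares the <code>gaussian_kde</code> of the raw data with the one of the min-max scaled data, each evaluated on matching grids.</p>

```python
import numpy as np
from scipy import stats

# Hypothetical bimodal sample standing in for `quotient`.
rng = np.random.default_rng(42)
quotient = np.concatenate([rng.normal(0.5, 0.01, 150),
                           rng.normal(0.9, 0.02, 50)])

# Min-max scaling of the source data, as in the list comprehension above.
span = quotient.max() - quotient.min()
quotient2 = (quotient - quotient.min()) / span

# Matching evaluation grids: grid2 is grid pushed through the same scaling.
grid = np.linspace(quotient.min(), quotient.max(), 200)
grid2 = (grid - quotient.min()) / span

dens = stats.gaussian_kde(quotient).evaluate(grid)
dens2 = stats.gaussian_kde(quotient2).evaluate(grid2)

# The scaled-data density is the original one multiplied by `span`:
# same shape, different units on both axes.
same_shape = np.allclose(dens2, dens * span)
print('span       :', span)
print('same shape :', same_shape)
```

<p>Since the bandwidth is derived from the spread of the data, it shrinks by the same factor as the data, which is why only the axes change.</p>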
<p><a href="https://seaborn.pydata.org/generated/seaborn.distplot.html" rel="nofollow noreferrer">Quoting seaborn's distplot doc</a>:</p>
<blockquote>
<p>This function combines the matplotlib hist function (with automatic
calculation of a good default bin size) with the seaborn kdeplot() and
rugplot() functions. It can also fit scipy.stats distributions and
plot the estimated PDF over the data.</p>
</blockquote>
<p>And by default:</p>
<blockquote>
<p>kde : bool, optional set to True
Whether to plot a gaussian kernel density estimate.</p>
</blockquote>
<p>So it uses a kde by default. Quoting seaborn's kde doc:</p>
<blockquote>
<p>Fit and plot a univariate or bivariate kernel density estimate.</p>
</blockquote>
<p>引用<a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gaussian_kde.html" rel="nofollow noreferrer">SCiPy gaussian kde method doc</a>:</p>
<blockquote>
<p>Representation of a kernel-density estimate using Gaussian kernels.</p>
<p>Kernel density estimation is a way to estimate the probability density
function (PDF) of a random variable in a non-parametric way.
gaussian_kde works for both uni-variate and multi-variate data. It
includes automatic bandwidth determination. The estimation works best
for a unimodal distribution; bimodal or multi-modal distributions tend
to be oversmoothed.</p>
</blockquote>
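<p>The oversmoothing mentioned in that quote can be seen on a small synthetic example. The sketch below is built on assumptions: the two-mode sample and the narrow bandwidth factor of 0.05 are arbitrary choices for illustration. It compares the default Scott bandwidth with a much smaller one and looks at the density at the modes and in the valley between them.</p>

```python
import numpy as np
from scipy import stats

# Hypothetical bimodal sample: two well separated clusters.
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(0.0, 0.5, 400),
                       rng.normal(4.0, 0.5, 400)])
grid = np.linspace(-2.0, 6.0, 801)

# Default rule of thumb versus a deliberately narrow bandwidth factor.
kde_scott = stats.gaussian_kde(data, bw_method='scott')
kde_narrow = stats.gaussian_kde(data, bw_method=0.05)

dens_scott = kde_scott.evaluate(grid)
dens_narrow = kde_narrow.evaluate(grid)

# Index of the grid point closest to x = 2, halfway between the two modes.
valley = np.argmin(np.abs(grid - 2.0))

print('peak   scott/narrow:', dens_scott.max(), dens_narrow.max())
print('valley scott/narrow:', dens_scott[valley], dens_narrow[valley])
```

<p>With Scott's rule the bandwidth is derived from the overall spread, which is inflated by the distance between the two clusters, so the peaks come out lower and the valley is partly filled in.</p>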
<p>Note that I do believe your data are bimodal, as you mentioned yourself. They also look discrete. As far as I know, discrete distribution functions may not be analyzed the same way continuous ones are, and fitting may turn out to be tricky.</p>
<p>Here is an example with various laws:</p>
<pre><code>import sys

import numpy as np
from scipy.stats import uniform, powerlaw, logistic
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

print('System versions : {}'.format(sys.version))
print('System versions : {}'.format(sys.version_info))
print('Numpy versqion : {}'.format(np.__version__))
print('matplotlib.pyplot version: {}'.format(matplotlib.__version__))
print('seaborn version : {}'.format(sns.__version__))

protocols = {}
types = {"data_v": "data_v.csv"}

for protname, fname in types.items():
    col_time, col_window = np.loadtxt(fname, delimiter=',').T
    trailing_window = col_window[:-1]  # "past" values at a given index
    leading_window = col_window[1:]    # "current" values at a given index
    decreasing_inds = np.where(leading_window < trailing_window)[0]
    quotient = leading_window[decreasing_inds] / trailing_window[decreasing_inds]
    quotient_times = col_time[decreasing_inds]

    protocols[protname] = {
        "col_time": col_time,
        "col_window": col_window,
        "quotient_times": quotient_times,
        "quotient": quotient,
    }

    fig, [(ax1, ax2, ax3), (ax4, ax5, ax6)] = plt.subplots(2, 3, sharey=False, sharex=False)

    # raw data
    diff = quotient_times
    ax1.plot(diff, quotient, ".", label=protname, color="blue")
    ax1.set_ylim(0, 1.0001)
    ax1.set_title(protname)
    ax1.set_xlabel("quotient_times")
    ax1.set_ylabel("quotient")
    ax1.legend()

    # min-max normalization of the source data, as suggested in the comment
    quotient2 = [(x-min(quotient))/(max(quotient)-min(quotient)) for x in quotient]
    print(quotient2)

    # kde on the raw data, then on the min-max normalized data
    # (distplot is deprecated since seaborn 0.11)
    sns.distplot(quotient, hist=True, label=protname, ax=ax2, rug=True)
    ax2.set_title('basic distplot (kde=True)')
    sns.distplot(quotient2, hist=True, label=protname, ax=ax3, rug=True)
    ax3.set_title('distplot of min-max normalized data (kde=True)')

    # parametric fits from scipy.stats instead of a kde
    sns.distplot(quotient, hist=True, label=protname, ax=ax4, rug=True, kde=False, fit=uniform)
    ax4.set_title('uniform distplot')
    sns.distplot(quotient, hist=True, label=protname, ax=ax5, rug=True, kde=False, fit=powerlaw)
    ax5.set_title('powerlaw distplot')
    sns.distplot(quotient, hist=True, label=protname, ax=ax6, rug=True, kde=False, fit=logistic)
    ax6.set_title('logistic distplot')

    plt.show()
</code></pre>
<p>Output:</p>
<pre><code>System versions : 3.7.2 (default, Feb 21 2019, 17:35:59) [MSC v.1915 64 bit (AMD64)]
System versions : sys.version_info(major=3, minor=7, micro=2, releaselevel='final', serial=0)
Numpy versqion : 1.16.2
matplotlib.pyplot version: 3.0.2
seaborn version : 0.9.0
[1.0, 0.05230125523012544, 0.0433775382360589, 0.024590765616971128, 0.05230125523012544, 0.05230125523012544, 0.05230125523012544, 0.02836946874603772, 0.05230125523012544, 0.05230125523012544, 0.05230125523012544, 0.05230125523012544, 0.03393500048652319, 0.05230125523012544, 0.05230125523012544, 0.05230125523012544, 0.0037013196009011043, 0.0, 0.05230125523012544]
</code></pre>
<p>Screenshot:</p>
<p><a href="https://i.stack.imgur.com/ESrZy.png" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/ESrZy.png" alt="enter image description here"/></a></p>