<ol>
<li>You need to find the outliers in LAT and LONG
<ul>
<li>Plotting, as you did, is one way, but here is an automatic way</li>
</ul>
</li>
<li>First, use <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html" rel="nofollow noreferrer"><code>.info()</code></a> to see which columns are numeric and what dtype each one has. You are interested in <code>LAT</code> and <code>LONG</code>.</li>
<li><strong>Use <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html" rel="nofollow noreferrer"><code>.describe()</code></a> on the two columns you are interested in to get descriptive statistics and spot their outliers.</strong></li>
</ol>
<ul>
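<li><code>.describe()</code> takes a <code>percentiles</code> parameter, a list that defaults to
<code>[.25, .5, .75]</code>, which returns the 25th, 50th and 75th percentiles</li>
<li>...but you want to exclude the rare/outlying values, so <strong>try including (say) the 1st/99th and 5th/95th percentiles</strong>:</li>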
</ul>
<pre><code>>>> pd.options.display.float_format = '{:.2f}'.format # suppress unwanted dp's
>>> dat[['LAT','LONG']].describe(percentiles=[.01,.05,.1,.25,.5,.9,.95,.99])
# (Once you know the bounds from the output below, an alternative to the apply() in step 4 is:
#  dat[dat['LAT'].between(33.97,36.96) & dat['LONG'].between(-101.80,-95.48)] )
           LAT     LONG
count 11125.00 11125.00
mean     35.21   -96.85
std       2.69     7.58
min       0.00  -203.63
1%       33.97  -101.80  # < 1st percentile
5%       34.20   -99.76
10%      34.29   -98.25
25%      34.44   -97.63
50%      35.15   -97.37
90%      36.78   -95.95
95%      36.85   -95.74
99%      36.96   -95.48  # < 99th percentile
max      73.99    97.70
</code></pre>
<p>So the 1st-99th percentile range of the LAT and LONG values is:</p>
<pre><code> 33.97 <= LAT <= 36.96
-101.80 <= LONG <= -95.48
</code></pre>
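<p>If you would rather not copy the bounds by hand from the <code>describe()</code> output, they can be read off programmatically with <code>DataFrame.quantile</code>. The following is only a minimal sketch: the small <code>dat</code> frame is made-up stand-in data (98 plausible points plus two bad rows), not the real dataset, and the variable names are illustrative.</p>

```python
import pandas as pd

# Made-up stand-in for the real `dat`: 98 plausible points plus two bad rows.
dat = pd.DataFrame({
    "LAT":  [34.0 + 0.03 * i for i in range(98)] + [0.0, 73.99],
    "LONG": [-101.0 + 0.05 * i for i in range(98)] + [-203.63, 97.70],
})

# One row per requested quantile, one column per variable.
bounds = dat[["LAT", "LONG"]].quantile([0.01, 0.99])

# Pull out the individual lower/upper bounds (illustrative names).
lat_lo, lat_hi = bounds.loc[0.01, "LAT"], bounds.loc[0.99, "LAT"]
long_lo, long_hi = bounds.loc[0.01, "LONG"], bounds.loc[0.99, "LONG"]

print(bounds)
```

<p>Note that <code>quantile</code> interpolates linearly between neighbouring values by default, so on a small sample the bounds can fall between two observed points.</p>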
<ol start="4">
<li>So now you can exclude the rows outside that range with a one-line <code>apply(..., axis=1)</code>:</li>
</ol>
<pre><code> dat2 = dat[ dat.apply(lambda row: (33.97<=row['LAT']<= 36.96) and (-101.80<=row['LONG']<=-95.48), axis=1) ]
                API#                                Operator  Operator ID  WellType  ...                 ZONE  Unnamed: 18  Unnamed: 19  Unnamed: 20
0      3500300026.00                   PHOENIX PETROCORP INC     19499.00        2R  ...             CHEROKEE          NaN          NaN          NaN
...              ...                                     ...          ...       ...  ...                  ...          ...          ...          ...
11121  3515323507.00  SANDRIDGE EXPLORATION & PRODUCTION LLC     22281.00        2D  ...  MUSSELLEM, OKLAHOMA          NaN          NaN          NaN

[10760 rows x 21 columns]
</code></pre>
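<p>A row-wise <code>apply</code> calls the lambda once per row, which can get slow on larger frames. A column-wise boolean mask built with <code>Series.between</code> should select the same rows. Here is a small sketch comparing the two; the six-row <code>dat</code> is invented for illustration and only reuses the bounds found above.</p>

```python
import pandas as pd

# Invented six-row stand-in for the real `dat`.
dat = pd.DataFrame({
    "LAT":  [0.00, 33.97, 35.15, 36.96, 73.99, 35.00],
    "LONG": [-97.4, -101.80, -97.37, -95.48, -97.0, -203.63],
})

# Row-wise version, as in the answer.
mask_apply = dat.apply(
    lambda row: (33.97 <= row["LAT"] <= 36.96) and (-101.80 <= row["LONG"] <= -95.48),
    axis=1,
)

# Column-wise version; Series.between is inclusive on both ends by default.
mask_between = dat["LAT"].between(33.97, 36.96) & dat["LONG"].between(-101.80, -95.48)

dat2 = dat[mask_between]
print(dat2)
```

<p>Both masks keep only the rows whose LAT and LONG are simultaneously inside the bounds, including rows that sit exactly on a bound.</p>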
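<p>Note that this has gone from 11125 rows down to 10760, so we dropped 365 rows.</p>
<p>Finally, it is a good idea to check that the extremes of the filtered <code>LAT, LONG</code> are within the range you expect:</p>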
<pre><code>>>> dat2[['LAT','LONG']].describe(percentiles=[.01,.05,.1,.25,.5,.9,.95,.99])
           LAT     LONG
count 10760.00 10760.00
mean     35.33   -97.25
std       0.91     1.11
min      33.97  -101.76
1%       34.08  -101.62
5%       34.21   -99.19
10%      34.30   -98.20
25%      34.44   -97.62
50%      35.13   -97.36
90%      36.77   -95.99
95%      36.83   -95.80
99%      36.93   -95.56
max      36.96   -95.49
</code></pre>
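<p>PS: there is nothing magical about the 1st/99th percentiles. You can play with <code>describe(percentiles=[...])</code> yourself (the values must lie between 0 and 1). You could use 0.005, 0.002, 0.001 and so on; it is up to you to decide what counts as an outlier.</p>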
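<p>To see how the choice of cutoff changes the number of rows kept, one option is to wrap the filter in a small helper and try a few values. This is only a sketch: <code>trim_by_quantile</code> is a hypothetical helper name, not a pandas function, and the data is made up.</p>

```python
import pandas as pd

def trim_by_quantile(df, cols, lower, upper):
    """Hypothetical helper: keep rows where every column in `cols`
    lies within that column's [lower, upper] quantile range."""
    mask = pd.Series(True, index=df.index)
    for col in cols:
        lo, hi = df[col].quantile([lower, upper])
        mask &= df[col].between(lo, hi)
    return df[mask]

# Made-up stand-in data: 98 plausible points plus two bad rows.
dat = pd.DataFrame({
    "LAT":  [34.0 + 0.03 * i for i in range(98)] + [0.0, 73.99],
    "LONG": [-101.0 + 0.05 * i for i in range(98)] + [-203.63, 97.70],
})

# Try a few cutoffs and count the surviving rows for each.
kept = {}
for lower, upper in [(0.0, 1.0), (0.01, 0.99), (0.05, 0.95)]:
    kept[(lower, upper)] = len(trim_by_quantile(dat, ["LAT", "LONG"], lower, upper))
    print(lower, upper, kept[(lower, upper)])
```

<p>Tighter cutoffs remove more rows, so it is worth checking how many you lose before settling on a threshold.</p>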