<p>I'm not sure I understand why you are doing what you are doing, but you can get the desired output simply by using indexing, e.g.:</p>
<pre><code># assume your data is stored in <df>
# call the temporary dataframe <tmp>
tmp = df[ ['chr','start','stop','geneID'] ][(df.stop - df.start.shift(-1))>0]
</code></pre>
<p>Is this what you ultimately want to do?</p>
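<p>To make the indexing idea concrete, here is a minimal sketch of what that mask selects, run on six sample BRCA2 intervals. The names <code>next_start</code> and <code>overlaps_next</code> are only illustrative, and the sketch assumes the rows are already sorted by <code>start</code>.</p>

```python
import pandas as pd

# Six sample intervals on chr13, assumed to be sorted by start
df = pd.DataFrame({
    "chr": ["chr13"] * 6,
    "start": [32889584, 32890536, 32893194, 32893282, 32893363, 32899127],
    "stop": [32889814, 32890737, 32893307, 32893400, 32893466, 32899242],
    "geneID": ["BRCA2"] * 6,
})

# Start of the following row; the last row has no successor, so it gets NaN
next_start = df["start"].shift(-1)

# True where a row's stop lies beyond the next row's start, i.e. the two overlap.
# Comparisons against NaN are False, so the last row is never selected.
overlaps_next = (df["stop"] - next_start) > 0

# Boolean indexing keeps only the rows that overlap their successor
tmp = df.loc[overlaps_next, ["chr", "start", "stop", "geneID"]]
print(tmp)
```

<p>On this sample the mask picks out only the rows that run into the row after them, so it filters rows rather than merging them.</p>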
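<p>Update
OK, I see what you're doing now. Keep in mind that I have never worked with genomic data, so I don't know how many rows you have. A simple loop like this could be quite slow (with a few billion rows it may take a while), but it is the only solution that comes to mind. Here is the first thing I came up with (note: this is not a finished product, since you still need to decide how to handle the NaN that gets introduced and how the loop should terminate).</p>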
<pre><code>import numpy as np
import pandas as pd

df = pd.DataFrame(index=[0, 1, 2, 3, 4, 5], columns=['chr', 'start', 'stop', 'geneID'])
df['chr'] = np.array(['chr13'] * 6)
df['start'] = np.array([32889584, 32890536, 32893194, 32893282, 32893363, 32899127])
df['stop'] = np.array([32889814, 32890737, 32893307, 32893400, 32893466, 32899242])
df['geneID'] = np.array(['BRCA2'] * 6)
# calculate difference between start/stop times for adjacent rows
# this will effectively "look into the future" to see if the upcoming row has
# a start time that is greater than the current stop time
df['tdiff'] = (df.start - df.stop.shift(1)).shift(-1)
# create new dataframe: a zeroed-out copy of df that is filled in row by row
df_cut = df.copy() * 0
r = 0
while r < df.shape[0]:
    if df.tdiff.loc[r] > 0:
        # no overlap with the next row: keep the row unchanged
        df_cut.loc[r] = df.loc[r]
        r += 1
    elif df.tdiff.loc[r] < 0:
        # overlap: find the next row (from r onward) whose tdiff is positive
        later = df.loc[r:]
        next_valid = later[later.tdiff > 0].index[0]
        df_cut.loc[r, 'chr'] = df.loc[r, 'chr']
        df_cut.loc[r, 'start'] = df.loc[r, 'start']
        df_cut.loc[r, 'geneID'] = df.loc[r, 'geneID']
        # take the "stop" value from that next valid row
        df_cut.loc[r, 'stop'] = df.loc[next_valid, 'stop']
        # continue with the row after the one that supplied the stop value
        r = next_valid + 1
    else:
        # tdiff is NaN (last row) or 0: handling is still undecided,
        # so stop here rather than loop forever
        break
# eliminate empty rows
df_cut = df_cut[df_cut.start != 0]
</code></pre>
<p>After running:</p>
<pre><code>>>> df_cut
     chr     start      stop geneID  tdiff
0  chr13  32889584  32889814  BRCA2    722
1  chr13  32890536  32890737  BRCA2   2457
2  chr13  32893194  32893466  BRCA2     -0
</code></pre>
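<p>If a row-by-row loop turns out to be too slow, one possible vectorized alternative is sketched below. This is not the loop above rewritten; it swaps in a different technique (running maximum plus cumulative sum plus <code>groupby</code>). It assumes all rows belong to one chromosome and are sorted by <code>start</code>. The names <code>prev_max_stop</code>, <code>group_id</code> and <code>merged</code> are only illustrative.</p>

```python
import pandas as pd

# Same six sample intervals, assumed sorted by start on a single chromosome
df = pd.DataFrame({
    "chr": ["chr13"] * 6,
    "start": [32889584, 32890536, 32893194, 32893282, 32893363, 32899127],
    "stop": [32889814, 32890737, 32893307, 32893400, 32893466, 32899242],
    "geneID": ["BRCA2"] * 6,
})

# Largest stop seen in all earlier rows (NaN for the first row)
prev_max_stop = df["stop"].cummax().shift(1)

# A new group begins whenever a row starts after everything seen so far has ended.
# The first row compares against NaN, which is False, so it lands in group 0.
starts_new_group = df["start"] > prev_max_stop

# The running count of group starts labels each cluster of overlapping rows
group_id = starts_new_group.cumsum().rename("group")

# Collapse every cluster to a single interval
merged = (
    df.groupby(group_id)
      .agg(chr=("chr", "first"),
           start=("start", "min"),
           stop=("stop", "max"),
           geneID=("geneID", "first"))
      .reset_index(drop=True)
)
print(merged)
```

<p>Two differences from the loop are worth noting: this sketch keeps the final interval instead of stopping at the NaN, and it treats intervals that merely touch (next start equal to current stop) as one group. Whether either behaviour is what you want depends on your data.</p>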