<p>I modified the bed_prepare function so that it checks each genomic region for overlap with both the previous and the next region:</p>
<pre><code>def bed_prepare(inp_bed):
    ''' Takes pandas dataframe bed file and identifies which regions overlap '''
    # Compare each region with the one after it (rows must be sorted by start)
    inp_bed['next_start'] = inp_bed['start'].shift(periods=-1)
    inp_bed['distance_to_next'] = inp_bed['next_start'] - inp_bed['stop']
    inp_bed['next_region_overlap'] = inp_bed['next_start'] <= inp_bed['stop']
    # Compare each region with the one before it
    inp_bed['previous_stop'] = inp_bed['stop'].shift(periods=1)
    inp_bed['distance_from_previous'] = inp_bed['start'] - inp_bed['previous_stop']
    inp_bed['previous_region_overlap'] = inp_bed['previous_stop'] >= inp_bed['start']
    # Note: the new columns are added to inp_bed in place
    intermediate_bed = inp_bed
    return intermediate_bed
</code></pre>
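<p>To illustrate what the two boolean columns look like, here is a minimal sketch on a made-up table. The data and the column names 'chrom' and 'geneID' are assumptions for illustration only; the comparisons are the same shift-based ones used in bed_prepare.</p>

```python
import pandas as pd

# Hypothetical toy BED table, already sorted by start coordinate.
# The names 'chrom' and 'geneID' are assumed for this example.
bed = pd.DataFrame({
    'chrom': ['chr1', 'chr1', 'chr1', 'chr1'],
    'start': [100, 150, 180, 400],
    'stop': [200, 250, 300, 500],
    'geneID': ['geneA', 'geneA', 'geneA', 'geneB'],
})

# Same comparisons as in bed_prepare: shift(-1) looks at the next row,
# shift(1) looks at the previous row.
next_start = bed['start'].shift(periods=-1)
previous_stop = bed['stop'].shift(periods=1)

flags = pd.DataFrame({
    'next_region_overlap': next_start <= bed['stop'],
    'previous_region_overlap': previous_stop >= bed['start'],
})
print(flags)
```

<p>The first and last rows have no neighbour on one side, so the shifted value is NaN there and the comparison evaluates to False, which is the behaviour the collapsing step relies on.</p>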
<p>I then use the resulting boolean columns to decide which values to store for the writing step:</p>
<pre><code>import numpy as np
import pandas as pd

# `columns` is the list of the four original BED column names, defined earlier
# Create empty dataframe to fill with parsed values
new_bed = pd.DataFrame(data=np.zeros((0,len(columns))),columns=columns,dtype=int)

def bed_collapse(intermediate_bed, new_bed, columns=columns):
    ''' Takes a pandas dataframe bed file with overlap information and returns
    genomic regions without overlaps '''
    output_row = []
    for row in intermediate_bed.itertuples():
        output = {}
        # row[0] is the index, so with four BED columns row[7] is
        # next_region_overlap and row[10] is previous_region_overlap
        if row[7] == False and row[10] == False:
            # Row overlaps neither neighbour; insert into new dataframe unchanged.
            output_row = list(row[1:5])
        elif row[7] == True and row[10] == False:
            # Only next region overlaps; take the chromosome and start coordinate
            output_row = list(row[1:3])
        elif row[7] == True and row[10] == True:
            # Next and previous regions overlap. Skip row.
            pass
        elif row[7] == False and row[10] == True:
            # Only previous region overlaps; append stop coordinate and geneID to output_row variable
            output_row.append(row[3])
            output_row.append(row[4])
        if row[7] == False:
            # Zip columns and output_row values together to form a dict for appending
            for k, v in zip(columns, output_row):
                output[k] = v
            # DataFrame.append was removed in pandas 2.0, so use pd.concat instead
            new_bed = pd.concat([new_bed, pd.DataFrame([output])], ignore_index=True)
    output_bed = new_bed
    return output_bed
</code></pre>
<p>This solved my problem and gives the desired output specified in the question. :)</p>
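<p>For comparison, the same collapsing rule (chromosome and start from the first row of an overlapping run, stop and gene ID from the last) can probably also be expressed without an explicit loop. This is only a sketch of an alternative, not the code I used above: it swaps the row-by-row logic for a cumulative-sum run label plus groupby, and the table and the column names 'chrom' and 'geneID' are made up for illustration.</p>

```python
import pandas as pd

# Hypothetical toy BED table, already sorted by start coordinate.
bed = pd.DataFrame({
    'chrom': ['chr1', 'chr1', 'chr1', 'chr1'],
    'start': [100, 150, 180, 400],
    'stop': [200, 250, 300, 500],
    'geneID': ['geneA', 'geneA', 'geneA', 'geneB'],
})

# Same test as previous_region_overlap in bed_prepare.
previous_stop = bed['stop'].shift(periods=1)
previous_region_overlap = previous_stop >= bed['start']

# A new run begins at every row that does not overlap the row before it,
# so the cumulative sum of "no previous overlap" labels each run.
run_id = (~previous_region_overlap).cumsum()

# Take chromosome and start from the first row of each run,
# stop and gene ID from the last row, mirroring the loop above.
collapsed = (
    bed.groupby(run_id)
       .agg(chrom=('chrom', 'first'),
            start=('start', 'first'),
            stop=('stop', 'last'),
            geneID=('geneID', 'last'))
       .reset_index(drop=True)
)
print(collapsed)
```

<p>Like the loop version, this sketch compares neighbouring rows only and does not check that they are on the same chromosome, so it assumes the input is sorted and handled one chromosome at a time.</p>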