bed = WRDIR + "AGGCAGAA_mm9_AlignedSorted_final_nameSorted_uniq.bed"
sam = WRDIR + "AGGCAGAA_bcs_nameSorted.sam"
out = WRDIR + "AGGCAGAA_joined.txt"
with open(bed) as b, open(sam) as s, open(out, 'w') as o:
    for x, b_line in enumerate(b, start=1):
        b_line = b_line.strip().split("\t")
        b_id = b_line[9]
        match_line = 1
        if x < 200:  # for testing because file is huge..
            for s_line in s:
                s_line = s_line.strip().split("\t")
                s_id = s_line[0]
                if b_id == s_id:
                    output = b_id + "\tGENE:" + b_line[4] + "\tACC:" + b_line[5] + "\tCHR:" + b_id[6] + "\tBC:" + s_line[9][:12] + "\tUMI:" + s_line[9][12:] + "\n"
                    o.write(output)
                    match_line += 1
                    print "MATCH: (BED line " + str(x) + ")\t" + b_id + "\t\t(SAM line " + str(match_line) + ")\t" + s_id
                    break
                match_line += 1
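What is this script supposed to do? It should walk through two tab-delimited files that are both sorted by a unique identifier present in each file. The bed file is essentially a subset of the sam file. If the identifier matches between the two files, the script takes data from the matching lines and writes that information to a file. Once that is done, it moves to the next line of bed and continues from where it stopped in sam (basically working down both files).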
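The traversal described above is essentially a merge join over two name-sorted inputs. A minimal sketch, assuming every BED identifier also occurs in the SAM input and both appear in the same order; `join_sorted` and its parameters are hypothetical names, and the default column indices simply follow the question's code:

```python
def join_sorted(bed_lines, sam_lines, bed_col=9, sam_col=0):
    """Yield (bed_fields, sam_fields) pairs whose identifiers match.

    Both inputs are iterables of tab-separated lines sorted by identifier.
    The SAM iterator is shared across BED lines, so each search resumes
    where the previous one stopped.
    """
    sam_iter = iter(sam_lines)
    for bed_line in bed_lines:
        bed_fields = bed_line.rstrip("\n").split("\t")
        for sam_line in sam_iter:
            sam_fields = sam_line.rstrip("\n").split("\t")
            if sam_fields[sam_col] == bed_fields[bed_col]:
                yield bed_fields, sam_fields
                break  # stop searching; continue with the next BED line
```

If a BED identifier is missing from the SAM input, the inner loop consumes the rest of the SAM iterator and no later BED line can match, so this approach relies on the subset assumption holding.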
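I think this code does what it is meant to, but there are a few bugs. match_line is supposed to be the line number in the second file, but it is not. It seems to be affected by the break: when the break is hit, the variable is reinitialized to 1 (I think). The output variable is a tab-delimited line of data that I want to write to the file. I am also not sure whether this is the correct use of break.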
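On the match_line question: the counter appears to be set to 1 at the top of every BED iteration, so it restarts for each BED line regardless of the `break`. The `break` itself only leaves the inner loop, which is a reasonable way to stop at the first match. One way to keep a running SAM line number, sketched here with a hypothetical helper `find_matches` working on bare identifiers, is to wrap the SAM iterator in `enumerate` once, before the outer loop:

```python
def find_matches(bed_ids, sam_ids):
    """Return (bed_line_no, sam_line_no, identifier) for each match."""
    matches = []
    # enumerate is created once, so numbering continues after each break
    numbered_sam = enumerate(sam_ids, start=1)
    for bed_no, bed_id in enumerate(bed_ids, start=1):
        for sam_no, sam_id in numbered_sam:
            if sam_id == bed_id:
                matches.append((bed_no, sam_no, bed_id))
                break  # leave the inner loop only; the outer loop continues
    return matches


print(find_matches(["r2", "r4"], ["r1", "r2", "r3", "r4", "r5"]))
# → [(1, 2, 'r2'), (2, 4, 'r4')]
```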
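Next steps: this script handles only one set of files, for a single sample. There are 9 samples in total, each with two species-specific files. For example: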
samples = [
    "AGGCAGAA",
    "CAGAGAGG",
    "CGTACTAG",
    "CTCTCTAC",
    "GCTACGCT",
    "GGACTCCT",
    "TAAGGCGA",
    "TAGGCATG",
    "TCCTGAGC"
]
for idx, s in enumerate(samples, start=1):
    if idx == 1:  ## TESTING
        print "Processing: %s (%s/%s)" % (s, str(idx), str(len(samples)))
        for sp in ["mm9", "hg19"]:
            # if sp == "mm9": ## TESTING
            bed = WRDIR + s + "_" + sp + "_AlignedSorted_final_nameSorted.bed"
            sam = WRDIR + s + "_bcs_nameSorted.sam"
            out = WRDIR + s + "_joined.txt"
I quickly tried integrating this into the script, but received an error stating: TypeError: cannot concatenate 'str' and 'file' objects.
Thank you for your help.
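The error message indicates that a name expected to hold a string is bound to a file object. A likely cause, assuming the first snippet was pasted unchanged inside the species loop, is that `with open(sam) as s` rebinds `s`, the loop variable holding the sample name, so the next `WRDIR + s + ...` tries to concatenate a string with a file. A sketch that keeps the names distinct; `build_paths` is a hypothetical helper, and the `wrdir` value is a placeholder because WRDIR is not shown in the question:

```python
def build_paths(wrdir, sample, species):
    """Return (bed, sam, out) paths for one sample and species."""
    bed = wrdir + sample + "_" + species + "_AlignedSorted_final_nameSorted.bed"
    sam = wrdir + sample + "_bcs_nameSorted.sam"
    out = wrdir + sample + "_joined.txt"
    return bed, sam, out


samples = ["AGGCAGAA", "CAGAGAGG"]  # shortened list for illustration
wrdir = "/path/to/data/"  # placeholder, not the real WRDIR

for idx, sample in enumerate(samples, start=1):
    for species in ["mm9", "hg19"]:
        bed, sam, out = build_paths(wrdir, sample, species)
        # Open the files under names that do not clash with the loop variables:
        # with open(bed) as bed_fh, open(sam) as sam_fh, open(out, "w") as out_fh:
        #     ...
        print("%s %s -> %s" % (sample, species, out))
```

As in the question, the output name does not include the species, so opening it with mode "w" for the second species would overwrite the first unless the name is extended.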