How can I use a gff3 file to create a dictionary of gene IDs and their 5_prime_utr regions? I can't use Biopython for this task.

GFF = raw_input("Please enter gff3 file: ")
GFF = open(GFF, "r")
GFF = GFF.read()
new_dict = {}
for i in GFF:
    element = i.split()
    if (element[2] == "five_prime_UTR"):
        if element[7] in new_dict:
            new_dict[element[2]] += 1
        if element[3] in new_dict:
            new_dict[element[3]] += 1

1 gramene exon 55222 55682 . - . Parent=transcript:Zm00001d027231_T003;Name=Zm00001d027231_T003.exon1;constitutive=0;ensembl_end_phase=0;ensembl_phase=-1;exon_id=Zm00001d027231_T003.exon1;rank=1
1 gramene five_prime_UTR 55549 55682 . - . Parent=transcript:Zm00001d027231_T003
1 gramene mRNA 50887 55668 . - . ID=transcript:Zm00001d027231_T004;Parent=gene:Zm00001d027231;biotype=protein_coding;transcript_id=Zm00001d027231_T004
1 gramene three_prime_UTR 50887 51120 . - . Parent=transcript:Zm00001d027231_T004
1 gramene exon 50887 51239 . - . Parent=transcript:Zm00001d027231_T004;Name=Zm00001d027231_T004.exon9;constitutive=0;ensembl_e

1 Answer

User
#1 · Posted 2024-09-28 22:25:12

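The variable GFF holds the entire contents of the gff3 file as a single string.
Right now, your loop walks over that string one character at a time: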
>>> for i in GFF:
...     print(i)
1
g
r
a
m
e
n
e
e
x
o
n
[and so on]
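What you want is for i in GFF.splitlines(): so that the loop goes over the file contents line by line.
You can also make the code clearer by giving names to the fields you are parsing, for example: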
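To make the difference concrete, here is a small sketch that compares the two loops on an in-memory string. The two-line `text` value is made up for illustration and stands in for the result of `GFF.read()`.

```python
# Hypothetical two-line stand-in for the string returned by GFF.read()
text = "1 gramene exon 55222 55682\n1 gramene five_prime_UTR 55549 55682\n"

# Iterating over a string yields single characters
chars = [c for c in text]

# splitlines() yields one string per line, without the trailing newline
lines = text.splitlines()

print(chars[:3])
print(lines)
```

With single characters, `i.split()` never returns more than one element, which is why `element[2]` fails in the original loop.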
new_dict = {}

# https://m.ensembl.org/info/website/upload/gff3.html
gff3_fields = [
    # name of the chromosome or scaffold; chromosome names can be given with
    # or without the 'chr' prefix. Important note: the seq ID must be one used
    # within Ensembl, i.e. a standard chromosome name or an Ensembl identifier
    # such as a scaffold ID, without any additional content such as species
    # or assembly.
    'seqid',
    # name of the program that generated this feature, or the data source
    # (database or project name)
    'source',
    # type of feature. Must be a term or accession from the SOFA sequence ontology
    'type',
    # Start position of the feature, with sequence numbering starting at 1.
    'start',
    # End position of the feature, with sequence numbering starting at 1.
    'end',
    # A floating point value.
    'score',
    # defined as + (forward) or - (reverse).
    'strand',
    # One of '0', '1' or '2'. '0' indicates that the first base of the feature
    # is the first base of a codon, '1' that the second base is the first base
    # of a codon, and so on.
    'phase',
    # A semicolon-separated list of tag-value pairs, providing additional
    # information about each feature. Some of these tags are predefined,
    # e.g. ID, Name, Alias, Parent
    'attributes',
]

for line in GFF.splitlines():
    # Skip blank lines and comment/directive lines such as '##gff-version 3'
    if not line.strip() or line.startswith('#'):
        continue
    feature = dict(zip(gff3_fields, line.split()))
    if feature.get('type') == 'five_prime_UTR':
        attributes = feature['attributes']
        # 'Parent=transcript:Zm00001d027231_T003' -> 'Zm00001d027231'
        geneid = attributes.split(':')[-1].split('_')[0]
        new_dict[geneid] = feature['start']
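A gene can have several transcripts, and each transcript can have more than one five_prime_UTR line, so storing a single start value per gene overwrites earlier entries. The sketch below is one possible extension, not the only way to do it: it collects every (start, end) pair per gene and uses the mRNA lines to map transcripts to genes. The helper names `parse_attributes` and `five_prime_utrs_by_gene` are made up for this example. It assumes tab-separated columns, as GFF3 specifies, a single Parent per feature, and it falls back to the ID-splitting trick from the answer when no mRNA line is available for a transcript.

```python
from collections import defaultdict


def parse_attributes(column):
    """Turn 'ID=a;Parent=b' into {'ID': 'a', 'Parent': 'b'}."""
    items = (item.split("=", 1) for item in column.strip().split(";") if "=" in item)
    return {key: value for key, value in items}


def five_prime_utrs_by_gene(lines):
    """Map gene ID -> list of (start, end) tuples of its five_prime_UTR features."""
    transcript_to_gene = {}
    utrs_by_transcript = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # blank line or comment/directive
        fields = line.split("\t")
        if len(fields) != 9:
            continue  # not a complete feature line
        attrs = parse_attributes(fields[8])
        if fields[2] == "mRNA" and "ID" in attrs and "Parent" in attrs:
            # 'gene:Zm00001d027231' -> 'Zm00001d027231'
            transcript_to_gene[attrs["ID"]] = attrs["Parent"].split(":")[-1]
        elif fields[2] == "five_prime_UTR" and "Parent" in attrs:
            utrs_by_transcript[attrs["Parent"]].append((int(fields[3]), int(fields[4])))

    result = defaultdict(list)
    for transcript, utrs in utrs_by_transcript.items():
        # Assumption: without an mRNA line, the gene ID is the part of the
        # transcript ID before the first underscore.
        fallback = transcript.split(":")[-1].split("_")[0]
        result[transcript_to_gene.get(transcript, fallback)].extend(utrs)
    return dict(result)


# Two lines adapted from the sample in the question, with tabs between columns
sample_lines = [
    "1\tgramene\tmRNA\t50887\t55668\t.\t-\t.\tID=transcript:Zm00001d027231_T004;Parent=gene:Zm00001d027231",
    "1\tgramene\tfive_prime_UTR\t55549\t55682\t.\t-\t.\tParent=transcript:Zm00001d027231_T003",
]
utrs = five_prime_utrs_by_gene(sample_lines)
print(utrs)
```

For a real file, passing the open file object (for example `five_prime_utrs_by_gene(open(path))`) should work as well, since iterating over a file yields lines.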
