How do I generate a recursive tree-like dictionary from a flat file (Gene Ontology OBO file)?

Posted 2024-09-30 01:31:38


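I am trying to write code that parses a Gene Ontology (GO) OBO file and puts the GO term IDs (e.g. GO:0003824) into a tree-like nested dictionary. The hierarchical GO structure in the OBO file is expressed with the "is_a" identifier, which marks each parent of each GO term. A GO term may have multiple parents, and the topmost GO terms in the hierarchy have no parents.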

A small example of a GO OBO file is shown below:

[Term]
id: GO:0003674
name: molecular_function
namespace: molecular_function
alt_id: GO:0005554
def: "A molecular process that can be carried out by the action of a single macromolecular machine, usually via direct physical interactions with other molecular entities. Function in this sense denotes an action, or activity, that a gene product (or a complex) performs. These actions are described from two distinct but related perspectives: (1) biochemical activity, and (2) role as a component in a larger system/process." [GOC:pdt]
comment: Note that, in addition to forming the root of the molecular function ontology, this term is recommended for use for the annotation of gene products whose molecular function is unknown. When this term is used for annotation, it indicates that no information was available about the molecular function of the gene product annotated as of the date the annotation was made; the evidence code "no data" (ND), is used to indicate this. Despite its name, this is not a type of 'function' in the sense typically defined by upper ontologies such as Basic Formal Ontology (BFO). It is instead a BFO:process carried out by a single gene product or complex.
subset: goslim_aspergillus
subset: goslim_candida
subset: goslim_chembl
subset: goslim_generic
subset: goslim_metagenomics
subset: goslim_pir
subset: goslim_plant
subset: goslim_yeast
synonym: "molecular function" EXACT []

[Term]
id: GO:0003824
name: catalytic activity
namespace: molecular_function
def: "Catalysis of a biochemical reaction at physiological temperatures. In biologically catalyzed reactions, the reactants are known as substrates, and the catalysts are naturally occurring macromolecular substances known as enzymes. Enzymes possess specific binding sites for substrates, and are usually composed wholly or largely of protein, but RNA that has catalytic activity (ribozyme) is often also regarded as enzymatic." [GOC:vw, ISBN:0198506732]
subset: goslim_chembl
subset: goslim_flybase_ribbon
subset: goslim_metagenomics
subset: goslim_pir
subset: goslim_plant
synonym: "enzyme activity" EXACT [GOC:dph, GOC:tb]
xref: Wikipedia:Enzyme
is_a: GO:0003674 ! molecular_function

[Term]
id: GO:0005198
name: structural molecule activity
namespace: molecular_function
def: "The action of a molecule that contributes to the structural integrity of a complex or its assembly within or outside a cell." [GOC:mah, GOC:vw]
subset: goslim_agr
subset: goslim_aspergillus
subset: goslim_candida
subset: goslim_chembl
subset: goslim_flybase_ribbon
subset: goslim_generic
subset: goslim_pir
subset: goslim_plant
subset: goslim_yeast
is_a: GO:0003674 ! molecular_function

[Term]
id: GO:0005488
name: binding
namespace: molecular_function
def: "The selective, non-covalent, often stoichiometric, interaction of a molecule with one or more specific sites on another molecule." [GOC:ceb, GOC:mah, ISBN:0198506732]
comment: Note that this term is in the subset of terms that should not be used for direct, manual gene product annotation. Please choose a more specific child term, or request a new one if no suitable term is available. For ligands that bind to signal transducing receptors, consider the molecular function term 'receptor binding ; GO:0005102' and its children.
subset: gocheck_do_not_manually_annotate
subset: goslim_pir
subset: goslim_plant
synonym: "ligand" NARROW []
xref: Wikipedia:Binding_(molecular)
is_a: GO:0003674 ! molecular_function

[Term]
id: GO:0005515
name: protein binding
namespace: molecular_function
alt_id: GO:0001948
alt_id: GO:0045308
def: "Interacting selectively and non-covalently with any protein or protein complex (a complex of two or more proteins that may include other nonprotein molecules)." [GOC:go_curators]
subset: goslim_aspergillus
subset: goslim_candida
subset: goslim_chembl
subset: goslim_metagenomics
subset: goslim_pir
subset: goslim_plant
synonym: "glycoprotein binding" NARROW []
synonym: "protein amino acid binding" EXACT []
xref: reactome:R-HSA-170835 "An anchoring protein, ZFYVE9 (SARA), recruits SMAD2/3"
xref: reactome:R-HSA-170846 "TGFBR2 recruits TGFBR1"
xref: reactome:R-HSA-3645786 "TGFBR2 mutant dimers recruit TGFBR1"
xref: reactome:R-HSA-3656484 "TGFBR2 recruits TGFBR1 KD Mutants"
xref: reactome:R-HSA-3702153 "An anchoring protein, ZFYVE9 (SARA), recruits SMAD2/3 MH2 domain mutants"
xref: reactome:R-HSA-3713560 "An anchoring protein, ZFYVE9 (SARA), recruits SMAD2/3 phosphorylation motif mutants"
is_a: GO:0005488 ! binding

[Term]
id: GO:0005549
name: odorant binding
namespace: molecular_function
def: "Interacting selectively and non-covalently with an odorant, any substance capable of stimulating the sense of smell." [GOC:jl, ISBN:0721662544]
subset: goslim_pir
is_a: GO:0005488 ! binding

[Term]
id: GO:0005550
name: pheromone binding
namespace: molecular_function
def: "Interacting selectively and non-covalently with a pheromone, a substance, or characteristic mixture of substances, that is secreted and released by an organism and detected by a second organism of the same or a closely related species, in which it causes a specific reaction, such as a definite behavioral reaction or a developmental process." [GOC:ai]
is_a: GO:0005549 ! odorant binding

Below is my attempt at a recursive function (plus some supporting code) for storing the GO term IDs in a tree-like dictionary:

^{pr2}$

Clearly I have not constructed the recursive function correctly, because I see duplicated keys in the output:

{'GO:0003674': {'GO:0003824': {'GO:0003824': {}},
  'GO:0005198': {'GO:0005198': {}},
  'GO:0005488': {'GO:0005488': {'GO:0005515': {'GO:0005515': {}},
    'GO:0005549': {'GO:0005549': {'GO:0005550': {'GO:0005550': {}}}}}}}}

Any suggestions on how to fix the recursive function would be greatly appreciated! Thank you!


2 answers

For a shorter solution, you can use recursion:

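import itertools, json, re

# Read the file and drop blank lines.
with open('filename.txt') as f:
    content = list(filter(None, [i.strip('\n') for i in f]))
# Split the lines into '[Term]' markers and the stanza bodies between them.
entries = [[a, list(b)] for a, b in itertools.groupby(content, key=lambda x: x == '[Term]')]
# One dict per stanza; 'is_a' is reduced to the bare parent ID (only the last is_a line survives dict()).
terms = [(lambda x: x if 'is_a' not in x else {**x, 'is_a': re.findall(r'^GO:\d+', x['is_a'])[0]})(dict(i.split(': ', 1) for i in b)) for a, b in entries if not a]
# Stable sort that moves terms without a parent (the roots) to the front.
terms = sorted(terms, key=lambda x: 'is_a' in x)

def tree(d, _start):
  # The children of _start are the terms whose 'is_a' equals _start.
  t = [i for i in d if i.get('is_a') == _start]
  return {} if not t else {i['id']: tree(d, i['id']) for i in t}

print(json.dumps({terms[0]['id']: tree(terms, terms[0]['id'])}, indent=4))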

Output:

^{pr2}$

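This approach also works when a parent term is not defined before its children. For example, when the parent is moved three positions below its original place in the file, the same result is still produced: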

print(json.dumps({terms[0]['id']:tree(terms, terms[0]['id'])}, indent=4))

Output:

^{4}$
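
The question notes that a GO term may have several parents, whereas the `dict(...)` call above keeps only the last `is_a` line of each stanza. The following is a minimal sketch of one way to keep every parent, assuming the `is_a` links contain no cycles; `parse_parents` and `build_tree` are illustrative names, not part of the answer above.

```python
from collections import defaultdict

def parse_parents(text):
    """Map each [Term] ID to the list of parent IDs from its is_a lines."""
    parents = {}
    current = None
    in_term = False
    for line in text.splitlines():
        line = line.strip()
        if line.startswith('['):
            # A new stanza begins; only [Term] stanzas are of interest.
            in_term = (line == '[Term]')
            current = None
        elif not in_term:
            continue
        elif line.startswith('id: '):
            current = line[len('id: '):]
            parents[current] = []
        elif line.startswith('is_a: ') and current is not None:
            # 'is_a: GO:0003674 ! molecular_function' -> 'GO:0003674'
            parents[current].append(line.split()[1])
    return parents

def build_tree(parents):
    """Nest child IDs under every parent; roots are terms without is_a lines."""
    children = defaultdict(list)
    for term_id, parent_ids in parents.items():
        for parent_id in parent_ids:
            children[parent_id].append(term_id)

    def subtree(term_id):
        return {child: subtree(child) for child in children.get(term_id, [])}

    return {root: subtree(root) for root, p in parents.items() if not p}
```

With this layout a term that has two parents appears once under each of them, so the result is a tree-shaped view of what is really a directed acyclic graph.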

You wrote

if (parent_go_id in parent_list):
    go_dict[parent_go_id][go_id] = generate_go_tree([go_term], all_go_terms, True)

The correct version would be

^{pr2}$

After this change, it produces:

{
    'GO:0003674': {
        'GO:0003824': {}, 
        'GO:0005198': {}, 
        'GO:0005488': {
            'GO:0005515': {},
            'GO:0005549': {
                'GO:0005550': {}
            }
        }
    }
}

But I would suggest a completely different approach. Create a class that parses the terms and builds the dependency tree as it goes.

For convenience, I derived it from dict, so you can write term.id instead of {}:

^{4}$
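
The class definition is not reproduced here. What follows is a minimal sketch of what such a class might look like, inferred only from how it is used in this answer (`Term(text)`, `term.id`, `term.name`, `term.children`, `term.is_valid()`, `term.is_top()`, `Term.registry`, `.parent`). The method bodies, the `parent_ids` attribute and the choice to resolve links lazily through the registry are all assumptions, not the original code.

```python
class Term(dict):
    """One [Term] stanza (hypothetical sketch, not the answer's original class)."""

    registry = {}  # GO ID -> Term, shared by all instances

    def __init__(self, text):
        super().__init__()
        self.parent_ids = []
        for line in text.strip().splitlines():
            if ': ' not in line:
                continue  # skips the '[Term]' header line
            key, value = line.split(': ', 1)
            if key == 'is_a':
                # 'GO:0005488 ! binding' -> 'GO:0005488'
                self.parent_ids.append(value.split()[0])
            else:
                self.setdefault(key, value)  # keep the first value of repeated tags
        if 'id' in self:
            Term.registry[self['id']] = self

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails: term.id -> term['id']
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None

    @property
    def parent(self):
        # First parent only; resolved lazily, so file order does not matter.
        return Term.registry.get(self.parent_ids[0]) if self.parent_ids else None

    @property
    def children(self):
        return [t for t in Term.registry.values() if self['id'] in t.parent_ids]

    def is_top(self):
        return 'id' in self and not self.parent_ids

    def is_valid(self):
        # OBO marks obsolete terms with 'is_obsolete: true'.
        return self.get('is_obsolete') != 'true'
```

Looking up children through the registry on every access is slow for the full ontology; it is chosen here only because it keeps the sketch short and independent of the order of stanzas in the file.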

Now you can read the whole file in one go:

with open('tiny_go.obo', 'rt') as f:
    contents = f.read()

terms = [Term(text) for text in contents.split('\n\n')]

Recursing over the tree becomes easy. For example, a simple "print" function that outputs only non-obsolete nodes:

def print_tree(terms, indent=''):
    valid_terms = [term for term in terms if term.is_valid()]
    for term in valid_terms:
        print(indent + 'Term %s - %s' % (term.id, term.name))
        print_tree(term.children, indent + '  ')

top_terms = [term for term in terms if term.is_top()]

print_tree(top_terms)

This prints:

Term GO:0003674 - molecular_function
  Term GO:0003824 - catalytic activity
  Term GO:0005198 - structural molecule activity
  Term GO:0005488 - binding
    Term GO:0005515 - protein binding
    Term GO:0005549 - odorant binding
      Term GO:0005550 - pheromone binding

You can also do things like Term.registry['GO:0005549'].parent.name, which gives "binding".

I leave generating nested dicts of GO IDs (as in your own example) as an exercise, but you may not need them at all, since {} is already very similar to that.
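
One possible way to do that exercise, sketched under the assumption that each term exposes `id`, `children` and `is_valid()` exactly as `print_tree` uses them; `to_nested_dict` is a hypothetical name.

```python
def to_nested_dict(terms):
    """Return {term.id: {child.id: {...}}} for the given terms, skipping obsolete ones."""
    return {term.id: to_nested_dict(term.children)
            for term in terms if term.is_valid()}
```

Calling `to_nested_dict(top_terms)` should then give a nested dictionary of the same shape as the one the question asks for.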
