Python: finding subsets across a many-to-many mapping

Posted 2024-10-03 09:09:02


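I am trying to use a many-to-many mapping to find the subsets of one set that map onto a specific subset of another set.

I have many genes. Each gene is a member of one or more COGs (and vice versa).

  • gene1 is a member of COG1
  • gene1 is a member of COG1003
  • gene2 is a member of COG2
  • gene3 is a member of COG273
  • gene4 is a member of COG1
  • gene5 is a member of COG273
  • gene5 is a member of COG71
  • gene6 is a member of COG1
  • gene6 is a member of COG273

I have a short set of COGs that represents an enzyme, e.g. COG1, COG273.

I want to find every set of genes that, between them, are members of every COG in the enzyme, but without unnecessary overlap (in this case, for example, 'gene1 and gene6' would be spurious, because gene6 is already a member of both COGs).

In this example, the answers would be:

  • gene1 and gene3
  • gene1 and gene5
  • gene3 and gene4
  • gene4 and gene5
  • gene6

Although I could get all the members of each COG and build a 'product', that would contain spurious results (as above) in which the set holds more genes than necessary.

My mappings are currently held in a dictionary where the key is the gene ID and the value is a list of the COG IDs that the gene is a member of. However, I accept that this might not be the best way to store the mapping.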
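
To make the problem concrete, here is a minimal sketch of the naive 'product' approach, assuming the dictionary layout described above. The names `gene_to_cogs`, `cog_to_genes`, `naive` and `spurious` are hypothetical and chosen only for illustration, and the `gene6` check is hard-coded for this example rather than being a general test.

```python
from collections import defaultdict
from itertools import product

# Gene ID -> list of COG IDs, as described in the question.
gene_to_cogs = {
    'gene1': ['COG1', 'COG1003'],
    'gene2': ['COG2'],
    'gene3': ['COG273'],
    'gene4': ['COG1'],
    'gene5': ['COG273', 'COG71'],
    'gene6': ['COG1', 'COG273'],
}
enzyme = ['COG1', 'COG273']

# Invert the mapping: COG ID -> list of member genes.
cog_to_genes = defaultdict(list)
for gene, cogs in gene_to_cogs.items():
    for cog in cogs:
        cog_to_genes[cog].append(gene)

# Naive product: pick one gene for each COG in the enzyme.
naive = list(product(*(cog_to_genes[cog] for cog in enzyme)))

# Illustration only: gene6 covers both COGs by itself, so any combination
# that pairs gene6 with a different gene holds more genes than necessary.
spurious = [combo for combo in naive
            if 'gene6' in combo and len(set(combo)) > 1]

print(len(naive), 'combinations, of which', len(spurious), 'are spurious')
```

The inverted `cog_to_genes` index is one possible alternative to the gene-keyed dictionary; whether it is a better fit depends on how the mapping is queried elsewhere.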


3 answers

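Does this work for you? Note that since you said you have a short set of COGs, I went ahead and used nested for loops; there may be ways to optimize this.

For future reference, please post any code you already have along with your question.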

import itertools

d = {'gene1':['COG1','COG1003'], 'gene2':['COG2'], 'gene3':['COG273'], 'gene4':['COG1'], 'gene5':['COG273','COG71'], 'gene6':['COG1','COG273']}

# list of enzymes, each a set of COGs; this example contains only one enzyme
# NOTE: your data should be a list of multiple sets
COGs = [{'COG1', 'COG273'}]

# create all pairwise combinations of the genes
gene_pairs = list(itertools.combinations(d, 2))

found = set()
for pair in gene_pairs:

    join = set(d[pair[0]] + d[pair[1]]) # set of COGs for the gene pair

    for COG in COGs:

        # check whether a single gene already covers the whole enzyme
        if COG <= set(d[pair[0]]):
            found.add(pair[0])
        if COG <= set(d[pair[1]]):
            found.add(pair[1])

        # check whether the pair covers the enzyme and neither gene does so alone
        if COG <= join and pair[0] not in found and pair[1] not in found:
            found.add(pair)

for l in found:
    if isinstance(l, tuple): # a pair of genes
        print(l[0], l[1])
    else: # a single gene
        print(l)

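
The pairwise loop above only considers sets of one or two genes. As a hedged sketch of how the same idea could extend to enzymes with more COGs, the following tries gene combinations in order of increasing size and skips any combination that contains a cover already found. The function name `minimal_covers_bruteforce` is hypothetical, and this is a brute-force illustration that assumes a short enzyme, not an optimized method.

```python
from itertools import combinations

def minimal_covers_bruteforce(gene_to_cogs, enzyme):
    """Return the minimal gene sets whose COGs jointly include every enzyme COG."""
    enzyme = set(enzyme)
    # Only genes that share at least one COG with the enzyme can matter.
    candidates = sorted(g for g, cogs in gene_to_cogs.items()
                        if enzyme & set(cogs))
    covers = []
    # A minimal cover never needs more genes than the enzyme has COGs.
    for size in range(1, len(enzyme) + 1):
        for combo in combinations(candidates, size):
            genes = frozenset(combo)
            # Skip supersets of a smaller cover: they are not minimal.
            if any(cover <= genes for cover in covers):
                continue
            covered = set().union(*(gene_to_cogs[g] for g in combo))
            if enzyme <= covered:
                covers.append(genes)
    return covers

d = {'gene1': ['COG1', 'COG1003'], 'gene2': ['COG2'], 'gene3': ['COG273'],
     'gene4': ['COG1'], 'gene5': ['COG273', 'COG71'],
     'gene6': ['COG1', 'COG273']}

result = minimal_covers_bruteforce(d, {'COG1', 'COG273'})
for cover in result:
    print(sorted(cover))
```

Because smaller sets are tested first, every non-minimal cover is rejected by the superset check before its coverage is even computed.
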
from collections import Counter, OrderedDict
from itertools import product

def findGenes(seq1, seq2, llist):

    # group genes by COG, preserving the order in which the COGs first appear
    od = OrderedDict()
    for b, a in llist:
        od.setdefault(a, []).append(b)

    # keep only the gene lists of the two COGs of interest
    llv = []
    for k, v in od.items():
        if seq1 == k or seq2 == k:
            llv.append(v)

    # flat list needed for counting gene frequencies
    flatL = [x for sublist in llv for x in sublist]
    cFlatl = Counter(flatL)

    # this gathers genes that, like gene6, belong to both COGs
    l_lonely = []
    for k in cFlatl:
        if cFlatl[k] > 1:
            l_lonely.append(k)

    # temp contains only genes that do not belong to both COGs
    temp = []
    for sublist in llv:
        newL = []
        for el in sublist:
            if el not in l_lonely:
                newL.append(el)
        temp.append(newL)

    # product connects genes from the two different COG groups
    p = product(*temp)

    for el in list(p):
        print(el)

    print(l_lonely)

Usage and output:

lt=[('gene1','COG1'),('gene1','COG1003'),('gene2','COG2'),('gene3','COG273'),('gene4','COG1'), ('gene5','COG273'),('gene5','COG71'),('gene6','COG1'),('gene6','COG273')]

findGenes('COG1','COG273',lt)

('gene1', 'gene3')
('gene1', 'gene5')
('gene4', 'gene3')
('gene4', 'gene5')
['gene6']

A basic approach:

Keep your representation as it is for now.
Initialize a dictionary with the COGs as keys; each value is an initial count of 0.

Now start building your list of enzyme coverage sets (ecs_list), one ecs at a time.  Do this by starting at the front of the gene list and working your way to the end, considering all combinations.

Write a recursive routine to solve the remaining COGs in the enzyme.  Something like this:

def pick_a_gene(gene_list, cog_list, solution_set, cog_count_dict):
   pick the first gene in the list that is in at least one cog in the list.
   let the rest of the list be remaining_gene_list.
   add the gene to the solution set.
   for each of the gene's cogs:
      increment the cog's count in cog_count_dict
      remove the cog from cog_list (if it's still there).

   is there anything left in the cog_list?
   yes:
      pick_a_gene(remaining_gene_list, cog_list, solution_set, cog_count_dict)
   no:    # we have a solution: check it for minimality
      from every non-zero entry in cog_count_dict, subtract 1.  This gives us a list of excess coverage.
      while the excess list is not empty:
         pick the next gene in the solution set, starting from the *end* (if none, break the loop)
         if the gene's cogs are all covered by the excess:
            remove the gene from the solution set.
            decrement the excess count of each of its cogs.

      The remaining set of genes is an ECS; add it to ecs_list

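Does that work for you? I believe it covers the minimal sets correctly, given a well-behaved example. Note that starting from the high end when we check for minimality guards against a case like this: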

gene1: cog1, cog5
gene2: cog2, cog5
gene3: cog3
gene4: cog1, cog2, cog4
enzyme: cog1 through cog5

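We can see that we need gene3, gene4, and either gene1 or gene2. If we eliminated gene1 from the low end, we would never find that solution. If we start from the high end, we eliminate gene2, but find that solution in a later pass of the main loop.

It is possible to construct a case with a three-way conflict of this kind. In that case, we would have to write an extra loop in the minimality check to find all of the solutions. However, I gather your data isn't that nasty to us.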
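
As a hedged alternative that does not depend on the order of the minimality check, here is a minimal recursive sketch. It is a different technique from the routine outlined above: it branches on one still-uncovered COG at a time, collects every complete cover it reaches, and then discards any cover that strictly contains another. The name `minimal_covers_recursive` is hypothetical, and the search is exhaustive, so it assumes the enzyme is short.

```python
def minimal_covers_recursive(gene_to_cogs, enzyme):
    """Return the set of minimal gene sets covering every COG in the enzyme."""
    enzyme = frozenset(enzyme)
    # Keep only the enzyme-relevant COGs of each gene; drop unrelated genes.
    relevant = {}
    for gene, cogs in gene_to_cogs.items():
        shared = enzyme & frozenset(cogs)
        if shared:
            relevant[gene] = shared

    covers = set()

    def extend(chosen, uncovered):
        if not uncovered:
            covers.add(chosen)
            return
        # Deterministically pick one uncovered COG and try every gene in it.
        target = min(uncovered)
        for gene, cogs in relevant.items():
            if target in cogs:
                extend(chosen | {gene}, uncovered - cogs)

    extend(frozenset(), enzyme)
    # Discard any cover that strictly contains another cover.
    return {c for c in covers if not any(other < c for other in covers)}

genes = {
    'gene1': ['cog1', 'cog5'],
    'gene2': ['cog2', 'cog5'],
    'gene3': ['cog3'],
    'gene4': ['cog1', 'cog2', 'cog4'],
}
enzyme = ['cog1', 'cog2', 'cog3', 'cog4', 'cog5']

result = minimal_covers_recursive(genes, enzyme)
for cover in sorted(sorted(c) for c in result):
    print(cover)
```

On the four-gene example above this yields the two expected sets, gene1 + gene3 + gene4 and gene2 + gene3 + gene4, and the final filter removes the four-gene cover that the search also reaches.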
