Finding duplicates in a list of dictionaries

Posted 2024-05-19 00:21:08


I have a list of dictionaries, like this:

regReads=[{'QNAME': 'HWI-ST1024_0272:8:2105:10935:58524', 'FLAG': 16, 'RNAME': 'chr2', 'POS': 143138210, 'MAPQ': 42, 'CIGAR': '50M', 'RNEXT': '*', 'PNEXT': 0, 'TLEN': 0, 'SEQ': 'GAGGTCCAAACTTTAAATACTCAGAAGGATTTCTGAACTAGTTCTCTGTG', 'QUAL': 'JIJJIJJJJIGGBIIIHBCHEJIJJIIJIIJJIJIIHHDFHDDDBFFCC@'}, 
{'QNAME': 'HWI-ST1024_0272:8:1106:21049:70180', 'FLAG': 0, 'RNAME': 'chr2', 'POS': 143070473, 'MAPQ': 42, 'CIGAR': '50M', 'RNEXT': '*', 'PNEXT': 0, 'TLEN': 0, 'SEQ': 'AGGGTGACCAACTTATTCCTATTTTTCTAAGACTTTCCCCATTTTAGCAC', 'QUAL': '@CCFDDFFHHHHHJJJJJJJJJJJJJIJJJJJJJJJJJJJJIJJJIJIGI'}, 
{'QNAME': 'HWI-ST1024_0272:8:1101:7474:56141', 'FLAG': 0, 'RNAME': 'chr2', 'POS': 143045262, 'MAPQ': 42, 'CIGAR': '50M', 'RNEXT': '*', 'PNEXT': 0, 'TLEN': 0, 'SEQ': 'TTTAGCCTCCATTTCTGATTCAATCACCCAAGACAGCAGACTCAGAGTTG', 'QUAL': 'CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJJJIJJJJJIJJJJJJJJFIH'}]

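I need to find duplicate reads and remove all but one of each from the list. I can't simply do "if read not in uniqueReads" because every QNAME value is unique, so I tried something like the following, but I think I'm overcomplicating the problem.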

compareReads = []
uniqueReads = []
# build the comparison key for each read
for read in regReads:
    compareReads.append((read['POS'], len(read['SEQ']), read['FLAG']))
n = 0
duplicates = {}
singles = []
addThese = []
for comp in compareReads:
    # a key that occurs exactly once belongs to a unique read
    if compareReads.count(comp) == 1:
        uniqueReads.append(regReads[n])
    dups = [n]
    if compareReads.count(comp) != 1:
        # collect the index of every read whose key equals comp
        p = 0
        for alt in compareReads:
            if comp == alt:
                dups.append(p)
            p += 1
        duplicates[n] = sorted(dups)
    n += 1
for dup in duplicates:
    # compare each duplicate's index list against the groups seen so far
    if duplicates[dup] not in singles:
        singles.append(tuple(duplicates[dup]))
        addThese.append(dup)
for i in addThese:
    uniqueReads.append(regReads[i])
uniqueReadCnt = len(uniqueReads)
print(uniqueReadCnt)

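I only need to compare three values from each dictionary (POS, FLAG, and len(SEQ)). Is there a simpler way in Python to check every entry in the list before deciding, and then add only one instance of each to the result? I can't think of a better way to "flag" the later duplicate reads as invalid so that they don't get added to my list of unique reads.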

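With the code I have now, I can't figure out how to store the values in the "singles" list as tuples (or nested lists? I'm not sure) and then check each duplicate key's value against them.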


2 answers

In Python, you can use a tuple as a dictionary key:

seen = {}
for read in regReads:
    key = (read['POS'],len(read['SEQ']),read['FLAG'])
    if key not in seen:
        # first time seeing this key
        seen[key] = read
# seen.values() now contains all unique entries according to the key
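
As a rough sketch, the loop above could be wrapped in a function that returns a plain list. The name dedupe_reads and the shortened sample reads are made up for illustration and are not part of the original question or answer; the sketch also relies on dictionaries preserving insertion order (Python 3.7+).

```python
def dedupe_reads(reads):
    """Return one read per (POS, len(SEQ), FLAG) key, keeping the first one seen."""
    seen = {}
    for read in reads:
        key = (read['POS'], len(read['SEQ']), read['FLAG'])
        if key not in seen:
            # first time this key appears: remember the read
            seen[key] = read
    return list(seen.values())


# Hypothetical, shortened reads: r1 and r2 share POS, SEQ length and FLAG.
sample_reads = [
    {'QNAME': 'r1', 'FLAG': 0, 'POS': 100, 'SEQ': 'ACGT'},
    {'QNAME': 'r2', 'FLAG': 0, 'POS': 100, 'SEQ': 'TTTT'},
    {'QNAME': 'r3', 'FLAG': 16, 'POS': 100, 'SEQ': 'ACGT'},
]

unique_reads = dedupe_reads(sample_reads)
print([r['QNAME'] for r in unique_reads])  # ['r1', 'r3']
```

Each read is looked up once in the dictionary, so this avoids the repeated list.count() calls in the question's code.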

Another way, building on the previous answer:

filtered = {(x['POS'],len(x['SEQ']),x['FLAG']): x for x in regReads}.values()

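This builds a dictionary whose keys are the values that need to be unique, which leaves one read per key. Calling .values() then gives you the unique reads. If you need a real list rather than a view, convert it afterwards with list().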
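
One difference between the two answers is worth checking on small data: in a dict comprehension a later entry overwrites an earlier one with the same key, so the one-liner keeps the last read of each duplicate group, while the explicit loop keeps the first. The sketch below illustrates this; the helper read_key and the shortened sample reads are hypothetical names introduced only for this example.

```python
def read_key(read):
    """Comparison key used by both answers: (POS, len(SEQ), FLAG)."""
    return (read['POS'], len(read['SEQ']), read['FLAG'])


# Hypothetical, shortened reads: r1 and r2 share the same key.
sample_reads = [
    {'QNAME': 'r1', 'FLAG': 0, 'POS': 100, 'SEQ': 'ACGT'},
    {'QNAME': 'r2', 'FLAG': 0, 'POS': 100, 'SEQ': 'TTTT'},
    {'QNAME': 'r3', 'FLAG': 16, 'POS': 100, 'SEQ': 'ACGT'},
]

# Dict comprehension: a later read replaces an earlier one with the same key.
keep_last = list({read_key(r): r for r in sample_reads}.values())

# setdefault only stores a read if its key is not already present.
first_by_key = {}
for r in sample_reads:
    first_by_key.setdefault(read_key(r), r)
keep_first = list(first_by_key.values())

print([r['QNAME'] for r in keep_last])   # ['r2', 'r3']
print([r['QNAME'] for r in keep_first])  # ['r1', 'r3']
```

If it doesn't matter which read of a duplicate group survives, either form works; otherwise pick the one that matches the read you want to keep.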
