Removing duplicates from a list of products in a tab-delimited file and categorizing further

def remove_duplicates(l):  # define function to remove duplicates
    return list(set(l))

input = sys.argv[1]  # command line arguments to open tab file
infile = open(input)
for lines in infile:  # split content into lines
    words = lines.split("\t")  # split lines into words i.e. columns
    dataB2.append(words[11])  # column 12 contains the desired repetitive categories
dataB2 = dataA.sort()  # sort the categories
dataB2 = remove_duplicates(dataA)  # attempting to remove duplicates but this just returns an infinite list of 0's in the print command
print(len(dataB2))
infile.close()

2 answers

User

Answer 1 · edited 2024-10-05 10:53:15

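All you need to do is read each line from the file, split it on tabs, grab column 12 of each line, and put it in a list. (If you don't need to keep the duplicates, just create column_12 = set() and use add(item) instead of append(item).) Then use len() to get the size of the collection. If you want both the total and the unique count, use a list and convert it to a set afterwards.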
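As a minimal sketch of the set-based variant described above, the values can be collected straight into a set. The helper name `unique_column_values`, the skipping of short rows, and the sample rows are assumptions made for illustration; they are not part of the original post.

```python
def unique_column_values(lines, index=11):
    """Collect the distinct values found at `index` (0-based) in tab-separated lines."""
    values = set()
    for line in lines:
        # Strip the trailing newline so a value in the last column compares cleanly.
        fields = line.rstrip("\n").split("\t")
        if len(fields) > index:  # assumption: silently skip rows that are too short
            values.add(fields[index])
    return values


# Made-up sample rows with 12 columns each; in practice pass an open file object.
rows = ["\t".join(["x"] * 11 + [cat]) + "\n" for cat in ["fruit", "veg", "fruit"]]
print(len(unique_column_values(rows)))
```

Because a set ignores repeated additions, no separate de-duplication step is needed; the trade-off is that the per-category counts are lost.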

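Edit: to count each category (thanks to Tom Morris for pointing out that I hadn't actually answered the question), iterate over the set of column 12 values so that no category is counted more than once, and use the list's built-in count() method.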

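import sys

# Path to the tab-delimited file, taken from the command line as in the question.
with open(sys.argv[1], 'r') as fob:
    column_12 = []
    for line in fob:
        # Column 12 is index 11; strip the newline in case it is the last column.
        column_12.append(line.rstrip('\n').split('\t')[11])

print('Unique values in column 12: %d' % len(set(column_12)))
print('All values in column 12: %d' % len(column_12))
print('Count per category:')
for cat in set(column_12):
    print('%s - %d' % (cat, column_12.count(cat)))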

User

Answer 2 · edited 2024-10-05 10:53:15

I'd suggest using Python's Counter for this. A Counter does almost exactly what you are asking for, so your code would look like this:

from collections import Counter
import sys

count = Counter()

# Note that the with open()... syntax is generally preferred.
with open(sys.argv[1]) as infile:
    for lines in infile:  # iterate over the file line by line
        words = lines.rstrip("\n").split("\t")  # split the line into columns
        count[words[11]] += 1  # column 12 holds the category

print(count)
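To go from the raw Counter to a per-category report, one option is its `most_common()` method, which returns (value, count) pairs from most to least frequent. The sketch below assumes Python 3; the helper name `count_column` and the sample rows are hypothetical and only there to make the example self-contained.

```python
from collections import Counter


def count_column(lines, index=11):
    """Count how often each value appears at `index` (0-based) in tab-separated lines."""
    return Counter(line.rstrip("\n").split("\t")[index] for line in lines)


# Made-up sample rows with 12 columns each; in practice pass an open file object.
rows = ["\t".join(["x"] * 11 + [cat]) + "\n" for cat in ["fruit", "veg", "fruit"]]
counts = count_column(rows)

print("Unique categories: %d" % len(counts))
for category, n in counts.most_common():
    print("%s - %d" % (category, n))
```

Unlike calling count() once per category on a list, this makes a single pass over the data, which matters for large files.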
