Searching and writing output with Python

Posted 2024-10-03 23:20:58


Please help!

A list of 150 text files

One text file with query texts: (  
    SRR1005851  
    SRR1299210  
    SRR1021605  
    SRR1299782  
    SRR1299369  
    SRR1006158  
    ...etc).   

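I want to search for each query text in the list of 150 text files.
For example, if SRR1005851 is found in at least 120 of the files, SRR1005851 is appended to the output file.
The search iterates over every query text and goes through all 150 files.

Summary: I am looking for the query texts that occur in at least 90% of the 150 files.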


1 answer
User
Answer #1 · posted 2024-10-03 23:20:58

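I don't think I fully understood your question. Posting your code and a sample file would help a lot.

This code reads the entries of every file and reduces each file to its unique entries. It then counts, for each entry, the number of files it appears in, and finally keeps only the entries that appear in at least 90% of the files.

The code could have been shorter, but for readability I created many variables with long, meaningful names.

Please read the comments ;)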

import os
from collections import Counter
from sys import argv

# adjust your cut point
PERCENT_CUT = 0.9

# here we are going to save each file's entries, so we can sum them later
files_dict = {}

# number of files read; each entry's file count is checked against this later
total_files = 0

# raw total of entries, duplicates included
total_entries = 0

# first argument is script name, so have the second one be the folder to search
search_dir = argv[1]

# list the regular files directly under search_dir - ideally only your input files
# (filter by extension here as well if the folder also holds other kinds of files)
files_list = [name for name in os.listdir(search_dir)
              if os.path.isfile(os.path.join(search_dir, name))]

total_files = len(files_list)

print('Files READ:')

# iterate over each file found at given folder
for file_name in files_list:
    print("    "+file_name)

    # os.path.join adds the path separator that plain concatenation would miss
    file_object = open(os.path.join(search_dir, file_name), 'r')

    # a real list (map() is lazy in Python 3, so len() would fail on it);
    # surrounding whitespace is stripped and blank lines are skipped
    file_entries = [line.strip() for line in file_object if line.strip()]

    # gotta count'em all
    total_entries += len(file_entries)

    # set doesn't allow duplicate entries
    entries_set = set(file_entries)

    # creates a dict from the set, setting each key's value to 1
    file_entries_dict = dict.fromkeys(entries_set, 1)

    # stored as a Counter so the per-file dicts can be summed later
    files_dict[file_name] = Counter(file_entries_dict)

    file_object.close()


print("\n\nALL ENTRIES COUNT: "+str(total_entries))

# now we create a dict that will hold each unique key's count so we can sum all dicts read from files
entries_dict = Counter()

for file_dict_key, file_dict_value in files_dict.items():
    print(str(file_dict_key)+" - "+str(file_dict_value))
    entries_dict += file_dict_value

print("\nUNIQUE ENTRIES COUNT: "+str(len(entries_dict.keys())))

# print(entries_dict)

# 90% from your question
cut_line = total_files * PERCENT_CUT
print("\nAn entry must appear in at least "+str(cut_line)+" files to be listed below")
# output_dict is the final dict: entries present in at least 90% of the files
output_dict = {}
# this is Python 3; Python 2 would use iteritems() instead of items() in the line below
for entry, count in entries_dict.items():
    # >= because the question asks for "at least" 90%
    if count >= cut_line:
        output_dict[entry] = count

print(output_dict)
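
The code above counts every entry it finds and prints the result. It does not read the separate query file or write an output file, both of which the question asks for. Below is a minimal sketch of that variant, not a definitive implementation. It assumes one ID per line in every file, exact line matches rather than substring search, and a query file that may or may not sit in the searched folder (it is skipped if it does). The function names and the fraction and pattern parameters are hypothetical, chosen only for illustration.

```python
from pathlib import Path


def ids_in_enough_files(query_path, search_dir, fraction=0.9, pattern="*"):
    """Return the query IDs found in at least `fraction` of the files in search_dir."""
    query_file = Path(query_path).resolve()

    # One query ID per line; surrounding whitespace and blank lines are dropped.
    queries = [line.strip() for line in query_file.read_text().splitlines()
               if line.strip()]

    # Regular files matching the pattern, excluding the query file itself.
    files = sorted(p for p in Path(search_dir).glob(pattern)
                   if p.is_file() and p.resolve() != query_file)

    # dict.fromkeys drops duplicate queries while keeping their order.
    counts = dict.fromkeys(queries, 0)
    for path in files:
        # A set of the file's lines, so each file counts at most once per ID.
        entries = {line.strip() for line in path.read_text().splitlines()}
        for query in counts:
            if query in entries:
                counts[query] += 1

    needed = fraction * len(files)
    return [q for q in counts if files and counts[q] >= needed]


def write_matches(query_path, search_dir, output_path, fraction=0.9):
    """Write the matching IDs to output_path, one per line, and return them."""
    matches = ids_in_enough_files(query_path, search_dir, fraction)
    Path(output_path).write_text("".join(m + "\n" for m in matches))
    return matches
```

A call such as write_matches("queries.txt", "data", "output.txt") would then write the qualifying IDs to output.txt; the three paths here are placeholders. Passing pattern="*.txt" to ids_in_enough_files restricts the search to one file type if the folder holds other files.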
