How do I print lines after a match in Python?

1 answer

User

Reply #1 · Posted 2024-10-04 11:27:31

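So it seems you need to order the UniRef90_XXXXXX records according to the order in which their IDs appear in the first file.

Here UniRef_ids.txt is the first file, UniRef_data.txt is the second file, and UniRef_data_ordered.txt is the output file.

I noticed that each UniRef90_XXXXXX record seems to start with a > and then continue over a variable number of lines until the next > or, I assume, the end of the file.

I handle only one exceptional case: a UniRef90_XXXXXX ID that appears in the first file but not in the second. In that case the script just prints a warning to the console (not to the output file).

If the rest of your files are formatted differently, this might not work. Similarly, if your files are several GB in size, my approach may not be suitable, because I read the entire contents of the second file into memory.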

# We first go through the second file, get all the UniRef90_XXXXXX IDs, and
# put their records (including the UniRef90_XXXXXX header line) into a dict.
# A record can be accessed like so: uniref_dict["UniRef90_A0A0K2VG56"]
with open("UniRef_data.txt", "rt") as f:
    data = f.read()

uniref_dict = {}
for chunk in data.split(">"):
    chunk = chunk.strip()
    if not chunk:
        # Skip the empty piece before the first ">" so that no bogus
        # record with an empty ID ends up in the dict.
        continue
    # The ID is the first whitespace-delimited token of the header line.
    uniref_id = chunk.split(None, 1)[0]
    uniref_dict[uniref_id] = f">{chunk}"

# Then we go through the first file, line by line, ID by ID, and write to
# a new file the corresponding record (again, including the UniRef90_XXXXXX
# header line, as per your output) and append the UniRef90_XXXXXX ID at the end.
with open("UniRef_ids.txt", "rt") as fin:
    with open("UniRef_data_ordered.txt", "wt") as fout:
        for line in fin:
            # split() with no argument ignores repeated spaces and blank lines.
            for uniref_id in line.split():
                try:
                    fout.write(f"{uniref_dict[uniref_id]} ##{uniref_id}\n")
                except KeyError:
                    print(f"uniref_id '{uniref_id}' found in id file but not data file. Continuing...")

UniRef_data_ordered.txt:

>UniRef90_A0A0K2VG56 - Cluster: titin isoform X29
MATQAPTFTQPLQSVVVLEGSTATFEAHISGFPVPEVSWFRDGQVISTSTLPGVQISFSD
GRAKLMIPAVTKANSGRYSLRATNGSGQATSTAELLVKAETAPPNFVQRLQSMTVRQGSQ
VRLQVRVTGIPTPVVKFYRDGAEIQSSLDFQISQEGELYSLLIVEAYPEDSGTYSVNATN
SVGRATSTAELLVQGEEVVPAKKTKTIVSTAQISKSRETRIEKKIEAHFDARSIATVEMV
IDGAAGQELPHKTPPRIPLKPKSRSPTPPSIAAKAQLARQQSPSPIRHSPSPVRHVRAPT ##UniRef90_A0A0K2VG56
>UniRef90_A0A0P5UY87 - Cluster: titin isoform X4
VRLQVRVTGIPTPVVKFYRDGAEIQSSLDFQISQEGELYSLLIVEAYPEDSGTYSVNATN
SVGRATSTAELLVQGEEVVPAKKTKTIVSTAQISKSRETRIEKKIEAHFDARSIATVEMV
GRAKLMIPAVAAGHSGRYTLQATNGSGQATSTAELLVTAETAPPNFSQRLQSTTARQGSQ ##UniRef90_A0A0P5UY87
>UniRef90_A0A095VQ09 - Cluster: LOW QUALITY PROTEIN: titin
MTTKAPTFTQPLQSVVALEGSAATFEAHISGSPVPEVSWYRDGQVLSAATLPGVQISFSD
GRAKLMIPAVAAGHSGRYTLQATNGSGQATSTAELLVTAETAPPNFSQRLQSTTARQGSQ
VRLDVRVTGIPTPVVKFYRDRAEIQSSPDFQILQEGDLYSLIIAEAYPEDSGTYSVNATN ##UniRef90_A0A095VQ09
>UniRef90_A0A0C1UI80 - Cluster: LOW QUALITY PROTEIN: lafev
GRAKLMIPAVTKANSGRYSLRATNGSGQATSTAELLVKAETAPPNFVQRLQSMTVRQGSQ
VRLQVRVTGIPTPVVKFYRDGAEIQSSLDFQISQEGLARQQSPSPIRHSPSPVRHVRAPT ##UniRef90_A0A0C1UI80
>UniRef90_A0A1M4ZSK2  - Cluster: titin isoform X54
SVGRATSTAELLVQGEEVVPAKKTKTIVSTSTAELLVTAETAPPNFSQRLQSTTARQGSQ
SVGRATSTAELLVQGEEVVPAKKTKTIVSTAQISKSRETRIEKKIEAHFDARSIATVEMV
IDGAAGQELPHKTPPRIPLKPKSRSPTPPSIAAKAQLARQQSPSPIRHSPSPVRHVRAPT ##UniRef90_A0A1M4ZSK2
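
For the multi-gigabyte case mentioned above, one possible variation is to store only the byte offsets of each record instead of the records themselves, and to seek into the data file when a record is needed. The following is a minimal sketch under assumptions, not part of the original answer: the helper names `build_offset_index`, `read_record`, and `write_ordered` are hypothetical, and the data file is assumed to be UTF-8 (or plain ASCII) with every record starting on a line that begins with >.

```python
def build_offset_index(path):
    """Map each record ID to the (start, end) byte offsets of its record."""
    index = {}
    current_id = None
    start = 0
    with open(path, "rb") as f:
        while True:
            pos = f.tell()
            line = f.readline()
            # A new header line or the end of the file closes the current record.
            if not line or line.startswith(b">"):
                if current_id is not None:
                    index[current_id] = (start, pos)
                if not line:
                    break
                # The ID is the first whitespace-delimited token after ">".
                fields = line[1:].split(None, 1)
                current_id = fields[0].decode() if fields else None
                start = pos
    return index


def read_record(fdata, span):
    """Read one record from a file opened in binary mode."""
    start, end = span
    fdata.seek(start)
    return fdata.read(end - start).decode().rstrip()


def write_ordered(ids_path, data_path, out_path):
    """Write records in the order their IDs appear in the ID file."""
    index = build_offset_index(data_path)
    with open(ids_path, "rt") as fin, \
            open(data_path, "rb") as fdata, \
            open(out_path, "wt") as fout:
        for line in fin:
            for uniref_id in line.split():
                span = index.get(uniref_id)
                if span is None:
                    print(f"uniref_id '{uniref_id}' found in id file but not data file. Continuing...")
                    continue
                fout.write(f"{read_record(fdata, span)} ##{uniref_id}\n")
```

Calling `write_ordered("UniRef_ids.txt", "UniRef_data.txt", "UniRef_data_ordered.txt")` should produce the same kind of output as the script above. The trade-off is one extra disk seek per ID in exchange for keeping only IDs and offsets in memory.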

Is it possible to create a separate file for each iteration of the loop? That is, for each row of the first file, I would like to create a file with the IDs and their corresponding sequences.

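Yes, that is possible. We just need to move the code that opens and writes the output file inside the for loop that iterates over the lines of the first file, and give each file a unique name.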

# We first go through the second file, get all the UniRef90_XXXXXX IDs, and
# put their records (including the UniRef90_XXXXXX header line) into a dict.
# A record can be accessed like so: uniref_dict["UniRef90_A0A0K2VG56"]
with open("UniRef_data.txt", "rt") as f:
    data = f.read()

uniref_dict = {}
for chunk in data.split(">"):
    chunk = chunk.strip()
    if not chunk:
        # Skip the empty piece before the first ">" so that no bogus
        # record with an empty ID ends up in the dict.
        continue
    # The ID is the first whitespace-delimited token of the header line.
    uniref_id = chunk.split(None, 1)[0]
    uniref_dict[uniref_id] = f">{chunk}"

# Then we go through the first file, line by line, and write to a new
# file the records that correspond to the IDs on that line (again,
# including the UniRef90_XXXXXX header line, as per your output).
with open("UniRef_ids.txt", "rt") as fin:
    # Each iteration of this for loop is a new line of UniRef90_XXXXXX IDs,
    # so we've moved the file writing code inside of this loop.
    # enumerate gives us a counter, i, that starts at 1 and increments by 1
    # after each iteration. We use this to give each file a unique name.
    for i, line in enumerate(fin, start=1):
        # split() with no argument also copes with repeated spaces.
        uniref_ids = line.split()
        with open(f"UniRef_data_by_id_row_{i:03}.txt", "wt") as fout:
            for uniref_id in uniref_ids:
                try:
                    fout.write(uniref_dict[uniref_id] + "\n")
                except KeyError:
                    print(f"uniref_id '{uniref_id}' found in id file but not data file. Continuing...")

By the way, this is the code that generates the file names:

f"UniRef_data_by_id_row_{i:03}.txt"

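The f prefix tells Python that this is an f-string. Python evaluates whatever is inside the {} and substitutes the result into the string. The part before the : is the value, and the part after it is the format specifier. In this case, the format specifier zero-pads i to a width of 3, giving file names such as: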

UniRef_data_by_id_row_001.txt
UniRef_data_by_id_row_999.txt

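This makes it very easy to sort the files in a file manager.

You can also name the files differently. For example, if you don't want underscores and want to pad the number with spaces instead of zeros: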

f"UniRef Data Ordered by ID - Row {i: >4}.txt"

UniRef Data Ordered by ID - Row    1.txt
UniRef Data Ordered by ID - Row 9999.txt
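
To see how the two format specifiers behave, here is a small sketch. The helper names `zero_padded_name` and `space_padded_name` are hypothetical and exist only for illustration. It also shows that the width is a minimum rather than a limit, so a number with more digits is written out in full.

```python
def zero_padded_name(i):
    # 03: pad with zeros on the left to a minimum width of 3.
    return f"UniRef_data_by_id_row_{i:03}.txt"


def space_padded_name(i):
    # " >4": fill with spaces, right-align, minimum width of 4.
    return f"UniRef Data Ordered by ID - Row {i: >4}.txt"


for i in (1, 42, 999, 1000):
    print(zero_padded_name(i))
    print(space_padded_name(i))

# Zero-padded names sort correctly as plain strings, as long as every
# number fits within the padded width.
names = [zero_padded_name(i) for i in (10, 2, 1)]
print(sorted(names))
```

If you expect more than 999 rows, a wider specifier such as `{i:05}` keeps the names sorting in numeric order.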
