How to conditionally delete a sequence of lines from a txt file using Python

Posted 2024-06-28 11:33:58


I downloaded a large text file from the MS-DIAL metabolomics MSP spectral kit containing EI-MS, MS/MS.

The file opens as a txt file of compounds, like this:

NAME: C11H11NO5; PlaSMA ID-967
PRECURSORMZ: 238.0712
PRECURSORTYPE: [M+H]+
FORMULA: C11H11NO5
Ontology: Formula predicted
INCHIKEY:
SMILES:
RETENTIONTIME: 1.74
CCS: -1
IONMODE: Positive
COLLISIONENERGY:
Comment: Annotation level-3; PlaSMA ID-967; ID title-AC_Bulb_Pos-629; Max plant tissue-LE_Ripe_Pos
Num Peaks: 2
192.06602   53
238.0757    31

NAME: Malvidin-3,5-di-O-glucoside; PlaSMA ID-3141
PRECURSORMZ: 656.19415
PRECURSORTYPE: [M+H]+
FORMULA: C29H35O17
Ontology: Anthocyanidin O-glycosides
INCHIKEY: CILLXFBAACIQNS-UHFFFAOYNA-O
SMILES: COC1=CC(=CC(OC)=C1O)C1=C(OC2OC(CO)C(O)C(O)C2O)C=C2C(OC3OC(CO)C(O)C(O)C3O)=CC(O)=CC2=[O+]1
RETENTIONTIME: 2.81
CCS: 241.3010517
IONMODE: Positive
COLLISIONENERGY:
Comment: Annotation level-1; PlaSMA ID-3141; ID title-Malvidin-3,5-di-O-glucoside; Max plant tissue-Standard only
Num Peaks: 0

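Each compound's data runs from its NAME line to the next NAME line.

What I want to do is delete every compound whose Num Peaks: value is zero (i.e. Num Peaks: 0). If a compound's Num Peaks line reads Num Peaks: 0, all of that compound's data should be removed, that is, the 12 lines above it as well.

In the compounds above, it is important to delete the lines from NAME: Malvidin-3,5-di-O-glucoside; PlaSMA ID-3141 through Num Peaks: 0. After that, I need to save the data back to txt or msp format.

All I have done so far is import the data as a list: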

with open(r'path\to\MSMS-Public-Pos-VS15.msp') as f:  # raw string, so '\t' is not read as a tab
    lines = f.readlines()

Then I create a list containing the index at which each compound starts

indices = [i for i, s in enumerate(lines) if 'NAME' in s]

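I think I now need to collect the consecutive indices whose difference is greater than 14 (meaning Num Peaks is greater than zero)

import numpy as np

# find the difference between consecutive indices
v = np.diff(indices)

Select those with a difference of 14 and add a zero element in the first position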


diff14 = np.where(v == 14)

diff14 = np.append([0],diff14[0])

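Now I only want to select the values that are not in diff14, so as to create a new list containing the compounds whose number of peaks is greater than zero.

Now I need some kind of loop to select the right indices, but I don't know how: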

lines[indices[diff14[0]]: indices[diff14[1]]]

lines[indices[diff14[1]+1] : indices[diff14[2]]]

lines[indices[diff14[2]+1] : indices[diff14[3]]]

lines[indices[diff14[3]+1] : indices[diff14[4]]]

Any better ideas or hints would be greatly appreciated.


3 Answers

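Here is a fairly simple way to process the file.

Open the data file and iterate over its lines, storing them in a list (the cache). If a line starts with NAME:, it is the start of a new record, and the cache can be printed if it is not empty.

If a line starts with Num Peaks:, check its value. If it is zero, the cache is emptied, so this record is forgotten.

Lines containing only whitespace are skipped.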

with open('data') as f:
    line_cache = []
    for line in f:
        if line.startswith('NAME:'):
            if line_cache:
                print(*line_cache, sep='')
                line_cache = []
        elif line.startswith('Num Peaks:'):
            num_peaks = int(line.partition(': ')[2])
            if num_peaks == 0:
                line_cache = []
                continue

        if line.strip():        # filter empty lines
            line_cache.append(line)

    if line_cache:    # don't forget the last record
        print(*line_cache, sep='', end='')

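The output goes to standard output, which can be redirected to a file in a shell environment. To write directly to a file instead, open it at the start and modify the print() statements: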

with open('output', 'w') as output, open('data') as f:
    ...

and change print() to

print(*line_cache, sep='', file=output)
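
Putting those two changes together, a minimal sketch of the file-writing variant might look like this. The function name filter_msp and its two parameters are hypothetical, introduced only for illustration; the loop is the same cache-and-flush logic as above, and it assumes every record starts with a NAME: line.

```python
def filter_msp(in_path, out_path):
    """Copy records from in_path to out_path, dropping those with Num Peaks: 0.

    Hypothetical helper; assumes every record starts with a 'NAME:' line.
    """
    with open(in_path) as f, open(out_path, 'w') as output:
        line_cache = []
        for line in f:
            if line.startswith('NAME:'):
                if line_cache:
                    # flush the previous record; print() adds the blank separator line
                    print(*line_cache, sep='', file=output)
                    line_cache = []
            elif line.startswith('Num Peaks:'):
                if int(line.partition(':')[2]) == 0:
                    line_cache = []   # forget this record
                    continue
            if line.strip():          # skip whitespace-only lines
                line_cache.append(line)
        if line_cache:                # don't forget the last record
            print(*line_cache, sep='', end='', file=output)
```

It could then be called as filter_msp('data', 'output'), using the same placeholder file names as above.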
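
# Open and read the tmp file created from the text you supplied
with open('tmpWrt.txt','r') as filedat:
    filelines = filedat.readlines()

# Open output file object
file_out = open('tmp_out.txt','w')

line_count = 0
# Defaults, in case the file does not start with a NAME: line
tmp_lines = []
flag = 'n'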

# Iterate through all file lines
for line in filelines:
    # If line is beginning of section
    # reset tmp variables
    if line != "\n" and line.split()[0] == "NAME:":
        tmp_lines = []
        flag = 'n'

    tmp_lines.append(line)
    line_count += 1

    # If line is the end of a section and peaks > 0
    # write to file
    if (line == "\n" or line_count == len(filelines)) and flag == 'y':
        #tmp_lines.append("\n")
        for tmp_line in tmp_lines:
            file_out.write(tmp_line)
        # reset so repeated blank lines do not write the section twice
        tmp_lines = []
        flag = 'n'

    # If peaks > 0 set flag to "y"
    if line != "\n" and line.split()[0] == "Num":
        if int(line.split()[2]) != 0:
            flag = "y"

file_out.close()

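This is not as compact or efficient as the other answers, but hopefully it is easier to understand and extend.

The approach I suggest is to parse your input into a list of lists, where each element holds one compound. I suggest three steps: (1) parse the data into a list of compounds, (2) iterate over that list and remove the compounds you do not want, (3) write the list back to a file. Depending on the size of the file, you can do this in one loop over the data or in three separate loops.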

# Step (1) Parse the file
compounds = list() # store all compounds
with open('compound.txt', 'r') as f:
    # stores a single compound as a list of rows for a given compound.
    # Note: can be improved to e.g. a dictionary or a custom class
    current_compound = list()
    for line in f:
        if line.strip() == '': # assumes each compound is split by empty line(s)
            print('Empty line')
            # Store previous compound
            if len(current_compound) != 0:
                compounds.append(list(current_compound))

            # prepare for next compound
            current_compound = list()
        else:
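            # At this point we could parse this more,
            # e.g. separate into key/value, but let's just append the whole line with trailing newline
            print('Adding', line.strip())
            current_compound.append(line)

    # Store the last compound if the file does not end with an empty line
    if len(current_compound) != 0:
        compounds.append(current_compound)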

OK, now let's check how we are doing

for item in compounds:
    print('\n===Compound===\n', item)

This results in

===Compound===
 ['NAME: C11H11NO5; PlaSMA ID-967\n', 'PRECURSORMZ: 238.0712\n', 'PRECURSORTYPE: [M+H]+\n', 'FORMULA: C11H11NO5\n', 'Ontology: Formula predicted\n', 'INCHIKEY:\n', 'SMILES:\n'\
, 'RETENTIONTIME: 1.74\n', 'CCS: -1\n', 'IONMODE: Positive\n', 'COLLISIONENERGY:\n', 'Comment: Annotation level-3; PlaSMA ID-967; ID title-AC_Bulb_Pos-629; Max plant tissue-LE\
_Ripe_Pos\n', 'Num Peaks: 2\n', '192.06602   53\n', '238.0757    31\n']

===Compound===
 ['NAME: Malvidin-3,5-di-O-glucoside; PlaSMA ID-3141\n', 'PRECURSORMZ: 656.19415\n', 'PRECURSORTYPE: [M+H]+\n', 'FORMULA: C29H35O17\n', 'Ontology: Anthocyanidin O-glycosides\n\
', 'INCHIKEY: CILLXFBAACIQNS-UHFFFAOYNA-O\n', 'SMILES: COC1=CC(=CC(OC)=C1O)C1=C(OC2OC(CO)C(O)C(O)C2O)C=C2C(OC3OC(CO)C(O)C(O)C3O)=CC(O)=CC2=[O+]1\n', 'RETENTIONTIME: 2.81\n', '\
CCS: 241.3010517\n', 'IONMODE: Positive\n', 'COLLISIONENERGY:\n', 'Comment: Annotation level-1; PlaSMA ID-3141; ID title-Malvidin-3,5-di-O-glucoside; Max plant tissue-Standard\
 only\n', 'Num Peaks: 0\n']

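You can then iterate over this list of compounds and remove those whose Num Peaks is set to 0 before writing the list back to a file. Let me know if you need help with that part too.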
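
Steps (2) and (3) could be sketched roughly as follows. This is an illustration under assumptions, not a definitive implementation: the helper names has_peaks and write_compounds and the output file name compounds_filtered.txt are hypothetical, and the small compounds list here only stands in for the list of lists built in step (1).

```python
def has_peaks(compound):
    """Return True unless the compound has a 'Num Peaks:' line with value 0."""
    for line in compound:
        if line.startswith('Num Peaks:'):
            return int(line.split(':', 1)[1]) != 0
    return True  # assumption: keep records that lack a Num Peaks line


def write_compounds(compounds, path):
    """Write the compounds to path, separated by one empty line."""
    with open(path, 'w') as f:
        f.write('\n'.join(''.join(compound) for compound in compounds))


# Tiny stand-in for the list built in step (1)
compounds = [
    ['NAME: A\n', 'Num Peaks: 2\n', '192.06602   53\n', '238.0757    31\n'],
    ['NAME: B\n', 'Num Peaks: 0\n'],
]

# Step (2): keep only compounds with at least one peak
kept = [c for c in compounds if has_peaks(c)]

# Step (3): write the result back to a file
write_compounds(kept, 'compounds_filtered.txt')
```

Building a new list instead of deleting items from compounds while looping over it avoids skipping elements during iteration.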
