How do I find matches between two lists and write output based on the matches?

list_headers = ['gene_id', 'gene_name', 'trans_id']
# these are the features to be mined from each line of `attri_values`

attri_values =
['gene_id "scaffold_200001.1"', 'gene_version "1"', 'gene_source "jgi"', 'gene_biotype "protein_coding"']
['gene_id "scaffold_200001.1"', 'gene_version "1"', 'trans_id "scaffold_200001.1"', 'transcript_version "1"', 'exon_number "1"', 'gene_source "jgi"', 'gene_biotype "protein_coding"', 'transcript_source "jgi"', 'transcript_biotype "protein_coding"', 'exon_id "scaffold_200001.1.exon1"', 'exon_version "1"']
['gene_id "scaffold_200002.1"', 'gene_version "1"', 'trans_id "scaffold_200002.1"', 'transcript_version "1"', 'exon_number "3"', 'gene_source "jgi"', 'gene_biotype "protein_coding"', 'transcript_source "jgi"', 'transcript_biotype "protein_coding"', 'exon_id "scaffold_200002.1.exon3"', 'exon_version "1"']

output = open('gtf_table', 'w')
output.write('\t'.join(list_headers) + '\n')  # this will first write the header

# then I want to read each line
for values in attri_values:
    for list in list_headers:
        if values.startswith(list):
            attr_id = ''.join([x for x in attri_values if list in x])
            attr_id = attr_id.replace('"', '').split(' ')[1]
            output.write('\t' + '\t'.join([attr_id]))
        elif not values.startswith(list):
            attr_id = 'NA'
            output.write('\t' + '\t'.join([attr_id]))
    output.write('\n')

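3 Answers

User

#1 · edited 2024-05-19 16:35:01

I wrote a function that should help you parse your data. I tried to adapt the code you originally posted; what complicates the problem is the way you store the data that needs to be parsed, though that is not for me to judge. Here is my code: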

def searchHeader(title, values):
    r"""
    searchHeader(title, values) -> list

    Return all the words of the strings in an iterable in which title is a
    substring, without including title itself. For every string that does not
    contain title, 'N\A' is added instead.

    Example:
        >>> seq = ['spam and ham', 'spam is awesome', 'Ham is...!', 'eat cake but not pizza']
        >>> searchHeader('spam', seq)
        ['and', 'ham', 'is', 'awesome', 'N\\A', 'N\\A']
    """
    res = []
    for x in values:
        if title in x:
            res.append(x)
        else:
            res.append('N\\A')  # no match: append N\A for this string

    res = ' '.join(res)
    # res = res.replace('"', '')  # optional: strip the quotes here, or on the result after the call
    res = res.split(' ')
    res = [x for x in res if x != title]  # remove the title string itself from res
    return res

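Regular expressions would also be handy in this case. Use this function to parse the data, then format the result to write the table to a file. The function uses only one for loop and one list comprehension, whereas your code uses two nested for loops and a list comprehension.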

Pass each header string to the function separately, like this:

for title in list_headers:
    result = searchHeader(title, attri_values)
    # ...format as table...
    # ...write to file...
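
The regular-expression route mentioned above could look roughly like this. It is only a minimal sketch, assuming every attribute has the form `key "value"`; the names `ATTR_RE` and `find_value` are made up for illustration and do not come from the answer.

```python
import re

# Assumed attribute layout: a key, one space, then a double-quoted value.
ATTR_RE = re.compile(r'(\w+) "([^"]*)"')


def find_value(header, record, default='NA'):
    """Return the value stored under `header` in one record, or `default`."""
    for item in record:
        m = ATTR_RE.fullmatch(item)
        # Compare the whole key, so a header never matches a longer key by accident.
        if m and m.group(1) == header:
            return m.group(2)
    return default


list_headers = ['gene_id', 'gene_name', 'trans_id']
record = ['gene_id "scaffold_200001.1"', 'gene_version "1"',
          'gene_source "jgi"', 'gene_biotype "protein_coding"']

# One table row: one cell per header, 'NA' where the record has no such key.
row = [find_value(header, record) for header in list_headers]
print('\t'.join(row))
```

Matching on the full key instead of a substring test is a design choice here: it avoids false hits when one header happens to be contained in another attribute name.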

If possible, consider moving attri_values from a plain list into a dictionary, so that you can group the strings by their headers:

attri_values = {'header': ('data1', 'data2',...)}

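In my opinion this is much better than using a list. Also note that your code rebinds the name list, which is not a good idea, because list is the built-in class used to create lists.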
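
As a rough illustration of the dictionary idea, each record can be turned into a mapping from attribute name to value, after which a missing header is a plain lookup with a default. This sketch assumes one dictionary per record rather than the exact layout shown above, and `record_to_dict` is a hypothetical helper. The loop variable is called `header` so that the built-in `list` stays untouched.

```python
def record_to_dict(record):
    """Map each attribute name to its unquoted value."""
    # Split only on the first space: 'gene_id "x"' -> ('gene_id', '"x"').
    pairs = (item.split(' ', 1) for item in record)
    return {key: value.strip('"') for key, value in pairs}


list_headers = ['gene_id', 'gene_name', 'trans_id']
record = ['gene_id "scaffold_200002.1"', 'gene_version "1"',
          'trans_id "scaffold_200002.1"', 'exon_number "3"']

attrs = record_to_dict(record)
# dict.get supplies 'NA' for headers that this record does not contain.
row = [attrs.get(header, 'NA') for header in list_headers]
print(row)
```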

User

#2 · edited 2024-05-19 16:35:01

Python strings have a find method, which you can use while iterating over every list header for every attribute value. Try this function:

def Get_Match(search_space,search_string):
    start_character = search_space.find(search_string)

    if start_character == -1:
        return "N/A"
    else:
        return search_space[(start_character + len(search_string)):]

# attri_values_1 is not defined in this answer; presumably it is one record
# (a single list of attribute strings).
for value in attri_values_1:
    for header in list_headers:
        print(Get_Match(value, header))
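
A quick check of what this returns may help: the slice keeps the leading space and the quotes, so the result usually needs some cleanup before it goes into a table. The following is a minimal Python 3 sketch; `get_match` repeats the logic of Get_Match so the example runs on its own, and `get_match_clean` is a hypothetical helper, not part of the answer.

```python
def get_match(search_space, search_string):
    """Same logic as Get_Match: the text after the first occurrence, or 'N/A'."""
    start = search_space.find(search_string)
    if start == -1:
        return "N/A"
    return search_space[start + len(search_string):]


def get_match_clean(search_space, search_string):
    """Hypothetical helper: drop surrounding whitespace and double quotes."""
    match = get_match(search_space, search_string)
    return match if match == "N/A" else match.strip().strip('"')


raw = get_match('gene_id "scaffold_200001.1"', 'gene_id')
clean = get_match_clean('gene_id "scaffold_200001.1"', 'gene_id')
missing = get_match_clean('gene_version "1"', 'gene_id')
print(repr(raw), clean, missing)
```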

User

#3 · edited 2024-05-19 16:35:01

My answer uses pandas:

import pandas as pd

# input data
list_headers = ['gene_id', 'gene_name', 'trans_id']

attri_values = [
['gene_id "scaffold_200001.1"', 'gene_version "1"', 'gene_source "jgi"', 'gene_biotype "protein_coding"'],
['gene_id "scaffold_200001.1"', 'gene_version "1"', 'trans_id "scaffold_200001.1"', 'transcript_version "1"', 'exon_number "1"', 'gene_source "jgi"', 'gene_biotype "protein_coding"', 'transcript_source "jgi"', 'transcript_biotype "protein_coding"', 'exon_id "scaffold_200001.1.exon1"', 'exon_version "1"'],
['gene_id "scaffold_200002.1"', 'gene_version "1"', 'trans_id "scaffold_200002.1"', 'transcript_version "1"', 'exon_number "3"', 'gene_source "jgi"', 'gene_biotype "protein_coding"', 'transcript_source "jgi"', 'transcript_biotype "protein_coding"', 'exon_id "scaffold_200002.1.exon3"', 'exon_version "1"']]

# process input data
attri_values_X = [dict([tuple(b.split())[:2] for b in a]) for a in attri_values]

# Create DataFrame with the desired columns
df = pd.DataFrame(attri_values_X, columns=list_headers)

# print dataframe (print is a function in Python 3)
print(df)

Output

               gene_id  gene_name             trans_id
0  "scaffold_200001.1"        NaN                  NaN
1  "scaffold_200001.1"        NaN  "scaffold_200001.1"
2  "scaffold_200002.1"        NaN  "scaffold_200002.1"

It is easy without pandas too. I have already given you attri_values_X, so you are almost there: just drop the keys you don't want from each dictionary.
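
One way to do that without pandas is sketched below with the standard `csv` module. It assumes a list of dictionaries like the `attri_values_X` built above (shortened here, with the quotes already removed) and writes to an in-memory buffer instead of the `gtf_table` file. `restval='NA'` fills in missing headers and `extrasaction='ignore'` drops the keys you do not want.

```python
import csv
import io

list_headers = ['gene_id', 'gene_name', 'trans_id']

# Shortened stand-in for attri_values_X, with the quotes already removed.
attri_values_X = [
    {'gene_id': 'scaffold_200001.1', 'gene_version': '1'},
    {'gene_id': 'scaffold_200002.1', 'gene_version': '1',
     'trans_id': 'scaffold_200002.1'},
]

# Swap the buffer for open('gtf_table', 'w', newline='') to write a real file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list_headers, delimiter='\t',
                        restval='NA', extrasaction='ignore',
                        lineterminator='\n')
writer.writeheader()               # header row: the three wanted columns
writer.writerows(attri_values_X)   # one tab-separated row per record

table = buffer.getvalue()
print(table)
```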
