Converting a text file to CSV with Python

NAME IMP4 DESCRIPTION small nucleolar ribonucleoprotein CLASS Genetic Information Processing Translation Ribosome biogenesis in eukaryotes DBLINKS NCBI-GI: 15529982 NCBI-GeneID: 92856 OMIM: 612981 /// NAME COMMD9 DESCRIPTION COMM domain containing 9 ORGANISM H.sapiens DBLINKS NCBI-GI: 156416007 NCBI-GeneID: 29099 OMIM: 612299 /// .....

NAME IMP4\tDESCRIPTION small nucleolar ribonucleoprotein\tCLASS Genetic Information Processing\t Translation\t Ribosome biogenesis in eukaryotes\tDBLINKS NCBI-GI: 15529982\t NCBI-GeneID: 92856\t OMIM: 612981 NAME COMMD9\tDESCRIPTION COMM domain containing 9\tORGANISM H.sapiens\tDBLINKS NCBI-GI: 156416007\t NCBI-GeneID: 29099\t OMIM: 612299

NAME IMP4\tDESCRIPTION small nucleolar ribonucleoprotein\tNA\tCLASS Genetic Information Processing; Translation; Ribosome biogenesis in eukaryotes\tDBLINKS NCBI-GI: 15529982; NCBI-GeneID: 92856; OMIM: 612981 NAME COMMD9\tDESCRIPTION COMM domain containing 9\tORGANISM H.sapiens\tNA\tDBLINKS NCBI-GI: 156416007; NCBI-GeneID: 29099; OMIM: 612299

2 answers

User

Answer 1 · edited 2024-06-26 14:42:19

This script converts your text file into a valid CSV file (one that Excel can read, for example):

import sys

# Both the input and the output path are required
if len(sys.argv) < 3:
    print('Usage: %s <input-file> <output-file>' % sys.argv[0])
    sys.exit(1)

entries = []
entry = {}
key = None  # no field has been seen yet

# Read the input file
with open(sys.argv[1]) as infile:
    lines = infile.readlines()

for line in lines:
    # '///' terminates the current entry
    if line.strip() == '///':
        if entry:
            entries.append(entry)
        entry = {}
        key = None
        continue

    # Non-blank text in the first 13 columns starts a new field
    possible_key = line[:13].strip()
    if possible_key != '':
        key = possible_key
        entry[key] = []

    # Assemble the value; continuation lines belong to the current key
    if key:
        entry[key].append(line[13:].strip())

# Append the last entry if the file does not end with '///'
if entry:
    entries.append(entry)

# 'entries' is now a list of dicts, each mapping a key to a list of value lines

# Collect every key that occurs in any entry
all_keys = set()
for entry in entries:
    all_keys.update(entry.keys())
columns = sorted(all_keys)

# Write all entries to the output file
# newline='' keeps the explicit '\r\n' from being translated again on Windows
with open(sys.argv[2], 'w', newline='') as output:
    # The first line contains the keys
    output.write(','.join('"%s"' % key for key in columns))
    output.write('\r\n')

    # Write each entry, leaving the cell empty when a key is missing
    for entry in entries:
        output.write(','.join('"%s"' % ';'.join(entry[key]) if key in entry else '' for key in columns))
        output.write('\r\n')
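The script above builds the quoted fields by hand, so a value that itself contains a double quote would produce a malformed row. As a possible alternative, here is a minimal sketch of the writing step using the standard csv module, which handles that escaping. The function name write_entries and the sample data are hypothetical, chosen only to mirror the list-of-dicts shape that the parsing loop produces.

```python
import csv
import io

def write_entries(entries, output):
    """Write a list of {key: [value lines]} dicts as CSV to a file-like object."""
    # Union of all keys, sorted so the column order is stable
    columns = sorted(set().union(*(entry.keys() for entry in entries)))
    writer = csv.DictWriter(output, fieldnames=columns, restval='',
                            quoting=csv.QUOTE_ALL)
    writer.writeheader()
    for entry in entries:
        # Join multi-line values with ';' as the script above does
        writer.writerow({key: ';'.join(values) for key, values in entry.items()})

# Hypothetical sample data in the shape the parsing loop produces
sample = [
    {'NAME': ['IMP4'], 'CLASS': ['Genetic Information Processing', 'Translation']},
    {'NAME': ['COMMD9'], 'ORGANISM': ['H.sapiens']},
]
buffer = io.StringIO()
write_entries(sample, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

When writing to a real file instead of a StringIO, open it with newline='' so the csv module controls the line endings.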

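User

Answer 2 · edited 2024-06-26 14:42:19

You can use itertools.groupby twice: once to collect the lines into records, and a second time to collect multi-line fields into iterators: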

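import itertools

def is_end_of_record(line):
    # A line starting with '///' separates records
    return line.startswith('///')

class FieldClassifier(object):
    """Key function that remembers the field name of the last unindented line."""
    def __init__(self):
        self.field = ''
    def __call__(self, row):
        # An unindented line starts a new field; indented lines continue the previous one
        if not row[0].isspace():
            self.field = row.split(' ', 1)[0]
        return self.field

fields = 'NAME DESCRIPTION ORGANISM CLASS DBLINKS'.split()
with open('data', 'r') as f:
    # First groupby: split the file into records at the '///' lines
    for end_of_record, lines in itertools.groupby(f, is_end_of_record):
        if not end_of_record:
            classifier = FieldClassifier()
            record = {}
            # Second groupby: gather each field together with its continuation lines
            for fieldname, row in itertools.groupby(lines, classifier):
                record[fieldname] = '; '.join(r.strip() for r in row)
            # Fields missing from a record are printed as 'NA'
            print('\t'.join(record.get(fieldname, 'NA') for fieldname in fields))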

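The printed output matches the desired output you posted, assuming you were showing the repr of that output (that is, with each tab character written as \t).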
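To illustrate the difference between the printed text and its repr, here is a small sketch. The two-field line is a hypothetical example, joined with a tab the same way the answer joins its fields.

```python
# Hypothetical two-field line, joined with a tab as in the answer
line = '\t'.join(['NAME IMP4', 'NA'])

printed = str(line)   # what print(line) emits: contains a real tab character
shown = repr(line)    # what the interactive prompt echoes: the escape sequence \t

print(printed)
print(shown)
```

The first print shows the fields separated by whitespace, while the second shows the quoted string with a literal backslash followed by t.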

References for the tools used:

itertools.groupby
a class with a __call__ method
str.join used together with a generator expression; it may help to understand list comprehensions first
dict.get method with a default value
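The tools listed above can be seen working together in a reduced sketch that needs no input file. The class name RunLabeller and the sample lines are hypothetical; the point is only that a callable object can serve as a stateful key for groupby, and that dict.get supplies a default for fields a record lacks.

```python
import itertools

class RunLabeller(object):
    """Stateful key: remembers the label of the last unindented line."""
    def __init__(self):
        self.label = ''
    def __call__(self, line):
        if not line[0].isspace():
            self.label = line.split(' ', 1)[0]
        return self.label

# Hypothetical lines: 'CLASS' has two indented continuation lines
lines = ['NAME  IMP4', 'CLASS Genetic', '      Translation', '      Ribosome']

record = {}
for label, group in itertools.groupby(lines, RunLabeller()):
    # Generator expression inside str.join: strip and join each run of lines
    record[label] = '; '.join(item.strip() for item in group)

# dict.get with a default fills in the field this record does not have
row = [record.get(name, 'NA') for name in ['NAME', 'ORGANISM', 'CLASS']]
print(row)
```

Because groupby only groups consecutive items with equal keys, the key object has to return the same label for a field line and for every continuation line that follows it, which is why it keeps state between calls.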
