Reading multiple files in Python using a parser (a short lesson needed)

Published 2024-09-30 08:38:17


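As part of my weekly self-education day, I am playing with parser utilities in Python.

I have made some progress, but I have run into a very silly problem.

The problem: I have an input file that contains a single column (n rows), and I have m dictionary files. I want to extract the values for the terms in my input file from the different dictionaries.

The input file is: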

SGIP1
SLC45A1
NECAP2
AGBL4

Dictionary file 1

NM_032291   chr1    66999824    67210768    0   SGIP1   4694    6   1.0586e-02
NM_001080397    chr1    8384389 8404227 0   SLC45A1 2401    0   0.0000e+00
NM_018090   chr1    16767166    16786584    0   NECAP2  2081    3673    1.4617e+01
NM_032785   chr1    48998526    50489626    -0  AGBL4   2988    0   0.0000e+00
NM_001145278    chr1    16767166    16786584    0   NECAP2  2003    3534    1.4612e+01
NM_013943   chr1    25071759    25170815    0   CLIC4   4434    5646    1.0545e+01
NM_001145277    chr1    16767166    16786584    0   NECAP2  2005    3504    1.4473e+01
NM_052998   chr1    33546713    33585995    0   ADC 2182    4   1.5182e-02
NM_001195683    chr1    92145899    92351836    -0  TGFBR3  6464    59  7.5590e-02

Dictionary file 2

NM_032291   chr1    66999824    67210768    +   SGIP1   4694    44  9.5755e-02
NM_001080397    chr1    8384389 8404227 +   SLC45A1 2401    4   1.7018e-02
NM_018090   chr1    16767166    16786584    +   NECAP2  2081    1815    8.9095e+00
NM_032785   chr1    48998526    50489626    -   AGBL4   2988    4   1.3675e-02
NM_001145278    chr1    16767166    16786584    +   NECAP2  2003    1760    8.9760e+00
NM_013943   chr1    25071759    25170815    +   CLIC4   4434    3859    8.8906e+00
NM_001145277    chr1    16767166    16786584    +   NECAP2  2005    1719    8.7581e+00
NM_052998   chr1    33546713    33585995    +   ADC 2182    14  6.5543e-02
NM_001195683    chr1    92145899    92351836    -   TGFBR3  6464    49  7.7436e-02

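There can be one or more dictionary files, depending on the user, and each can have many rows.

If the value in column 6 of a dictionary file matches a value in the input file, and the value in column 8 of that dictionary file is greater than 5, the script should print columns 6 and 9 and combine the results into one final file, like this: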

SGIP1   1.0586e-02  9.5755e-02
NECAP2  1.4617e+01  8.9095e+00
NECAP2  1.4612e+01  8.9760e+00

This is the format I need.
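
For a single dictionary line, the rule described above could be sketched roughly as follows. The function name `matching_value` and the `min_count` parameter are illustrative assumptions, not part of the original script; columns are counted from 1 in the text, so column 6 is index 5, column 8 is index 7, and column 9 is index 8.

```python
def matching_value(line, wanted, min_count=5):
    """Return (gene, value) if the line passes the rule, else None.

    Hypothetical helper: 'wanted' is the set of names from the input file.
    """
    fields = line.split()          # split on any run of whitespace
    if len(fields) < 9:            # skip short or blank lines
        return None
    gene, count, value = fields[5], int(fields[7]), fields[8]
    if gene in wanted and count > min_count:
        return gene, value
    return None


wanted = {"SGIP1", "SLC45A1", "NECAP2", "AGBL4"}
row = "NM_032291   chr1    66999824    67210768    0   SGIP1   4694    6   1.0586e-02"
print(matching_value(row, wanted))  # ('SGIP1', '1.0586e-02')
```

The value is kept as a string here so that it is written out exactly as it appears in the dictionary file.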

What I have done:

#!/usr/bin/python

import sys
from optparse import OptionParser

VERSION = "1.0"

######## process the options ##########

usage = "usage: %prog -l <FILE> -i <FILE>,<FILE>,<FILE>... -n <STRING>"
parser = OptionParser(usage=usage, version=VERSION)
# long option names must not contain spaces
parser.add_option("-l", "--genelist", dest="input_file", help="gene list file, one string per line", metavar="FILE")
parser.add_option("-i", "--rnaseq", dest="data_file", help="RNASeq files generated from Arjen's script (separated by commas)", metavar="FILE")
parser.add_option("-n", "--name", dest="name", help="name of output file", metavar="STRING")
(options, args) = parser.parse_args()

#### check whether all required options are present ####
if not options.input_file or not options.name:
    parser.print_help()
    sys.exit(0)

#### reading input file ######
items = []
for item in open(options.input_file):
    # keep every name instead of overwriting the loop variable
    items.append(item.replace("\n", ""))

####### reading of data files, matching the components and assembling the final file ##########

This is where I am lost; I don't know how to do it. If there is more than one data file, the file names will be separated by commas.


I have done a similar thing with a quick and dirty solution for one data file. The code for it is below (in case it is needed):


#! /usr/bin/python
inputfile="genelist.txt"
rnafile="datafile.txt"
for item in open(inputfile):

    item=item.replace("\n","")
    for line in open(rnafile):
        line = line.split("\t")
        if line[5] == item:
            print (line[5] + "\t" + line[8].replace("\n",""))

Can any of you point me in the right direction?

Thanks


1 answer
User
Answer 1 · posted 2024-09-30 08:38:17

OK, here is my untested code. It should do just what you are looking for, bar exception handling and some possible formatting issues:

# edit - have to remove trailing \n from input lines
valid_items = [ line.strip() for line in open('input') ]

with open('dictionary1') as dict1:

  for dict2_line in open('dictionary2'):
    dict1_line = dict1.readline()

    # protect against dict1 being shorter
    if dict1_line == '':
      break

    fields1 = dict1_line.split()
    if fields1[5] in valid_items and int(fields1[7]) > 5:
      fields2 = dict2_line.split()
      print(fields1[5].ljust(8) + fields1[8] + '  ' + fields2[8])
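
The code above handles exactly two dictionary files and pairs them line by line. For one or more files, a possible generalization is sketched below. It makes several assumptions that are not in the original post: rows are paired by the ID in column 1 rather than by line position, the "greater than 5" test is applied to the first file only (as in the code above), and a missing row in a later file is written as `NA`. The names `load_values` and `merge_files` are hypothetical.

```python
def load_values(path, wanted, min_count=None):
    """Map the ID in column 1 to (gene, column-9 value) for wanted genes.

    If min_count is given, keep only rows whose column 8 exceeds it.
    """
    values = {}
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) < 9 or fields[5] not in wanted:
                continue
            if min_count is not None and int(fields[7]) <= min_count:
                continue
            values[fields[0]] = (fields[5], fields[8])
    return values


def merge_files(gene_list_path, data_paths, min_count=5):
    """Return tab-separated rows: gene, then one value per data file."""
    with open(gene_list_path) as handle:
        wanted = {line.strip() for line in handle if line.strip()}

    # Assumption: the count filter is applied to the first file only.
    first = load_values(data_paths[0], wanted, min_count)
    others = [load_values(path, wanted) for path in data_paths[1:]]

    rows = []
    for row_id, (gene, value) in first.items():
        cells = [gene, value]
        for table in others:
            # "NA" is a placeholder chosen here for IDs missing from a file.
            cells.append(table.get(row_id, (gene, "NA"))[1])
        rows.append("\t".join(cells))
    return rows
```

With a comma-separated option value such as the `-i` argument from the question, the list of paths could be obtained with `options.data_file.split(",")` and passed as `data_paths`; the returned rows can then be written to the output file one per line.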

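Note that using split without arguments splits on any whitespace, produces no empty fields, and should remove the trailing newline. This is probably what you are looking for, since the separators in your example are inconsistent.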
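
A small illustration of that difference, using a made-up line that mixes spaces and tabs (the line itself is an assumption, built from one row of the example data):

```python
# One row with inconsistent separators: runs of spaces and single tabs.
line = "NM_052998   chr1\t33546713  33585995\t0 ADC 2182 4 1.5182e-02\n"

by_whitespace = line.split()    # any whitespace run is one separator
by_tab = line.split("\t")       # only tab characters separate fields

# Nine clean fields; the gene name is at index 5 and the newline is gone.
print(len(by_whitespace), by_whitespace[5], repr(by_whitespace[-1]))

# Three fields; the last one still ends with the newline.
print(len(by_tab), repr(by_tab[-1]))
```

With `split("\t")`, index 5 would not even exist for this line, which is why the tab-based version in the question can fail on input like the example shown.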

Hope this helps!
