How to retrieve substrings from a large file

with open("\Users\Zebrafish\Desktop\IDs.txt") as f:  # will get input from the text
    for line in f:
        c = line.split("\t")
        for i, x in enumerate(c):  # passing values to start and end variables
            if i == 1:
                start = x
            elif i == 2:
                end = x
            elif i == 0:
                gene_name = x
        infile = open("/Users/Zebrafish/Desktop/complete.txt")  # file to get large string data
        for seq in infile:
            seqnew = seq.split("\t")  # get data as single line
            retrived = seqnew[int(start):int(end)]  # retrieve fragment
            print retrived

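3 Answers

User

#1 · Edited 2024-10-01 19:28:01

3 MB is not very large (on a computer that can run Windows). Just load the second file into memory as a single string and slice out the fragments: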

# populate `id -> (start, end)` map
ids = {}
with open(r"\Users\Zebrafish\Desktop\ASHISH\IDs.txt") as id_file:
    for line in id_file:
        if line.strip():  # non-blank line
            gene_id, start, end = line.split()
            ids[gene_id] = int(start), int(end)

# load the file as a single string (ignoring whitespace)
with open("/Users/Zebrafish/Desktop/ASHISH/complete.txt") as seq_file:
    s = "".join(seq_file.read().split())  # or re.sub(r"\s+", "", seq_file.read()) after `import re`

# print fragments
for gene_id, (start, end) in ids.items():
    print("{id} -> {fragment}".format(id=gene_id, fragment=s[start:end]))

If complete.txt does not fit in memory, you can use mmap to access its contents as a sequence of bytes without loading the whole file into memory:

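from mmap import ACCESS_READ, mmap

# open in binary mode: mmap exposes the raw bytes of the file
with open("complete.txt", "rb") as f, mmap(f.fileno(), 0, access=ACCESS_READ) as s:
    # use `s` here (assume that indices refer to the raw file in this case)
    for gene_id, (start, end) in ids.items():
        fragment = s[start:end]  # a bytes object, not str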
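
As a rough, self-contained sketch of the mmap approach, the helper below slices byte ranges straight out of a memory-mapped file. The function name `read_fragments` and the demo data are hypothetical, and it assumes the offsets are 0-based, half-open byte offsets into the raw file (newlines included).

```python
import mmap
import os
import tempfile


def read_fragments(path, ranges):
    """Return {name: bytes} for each (start, end) byte range in `path`.

    `ranges` maps a name to 0-based, half-open byte offsets into the raw file.
    """
    fragments = {}
    with open(path, "rb") as f:
        # Length 0 maps the whole file; mmap refuses to map an empty file.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            for name, (start, end) in ranges.items():
                fragments[name] = m[start:end]  # copies only the requested bytes
    return fragments


# Hypothetical demo data written to a temporary file.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"ACGTACGTTTGACCA")
    demo_path = tmp.name

demo = read_fragments(demo_path, {"geneA": (0, 4), "geneB": (8, 11)})
os.remove(demo_path)
print(demo)
```

The slices come back as `bytes`; call `.decode("ascii")` on them if you need text.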

User

#2 · Edited 2024-10-01 19:28:01

I don't know why you are splitting your complete.txt file on \t. Here is an optimized version of your code:

ids = {}
with open('/Users/Zebrafish/Desktop/ASHISH/IDs.txt') as f:
    for line in f:
        if line.strip():
            # This makes sure you skip blank lines
            gene_id, start, end = line.split('\t')
            ids[gene_id] = (int(start), int(end))

# Here, I assume your `complete.txt` is a file with one long line.
with open('/Users/Zebrafish/Desktop/ASHISH/complete.txt') as f:
    sequence = f.readline().rstrip('\n')

# For each id, fetch the sequence "chunk"
# (start-1 assumes IDs.txt uses 1-based, inclusive coordinates):
for gene_id, (start, end) in ids.items():  # items() works on Python 2 and 3
    print('{} {}'.format(gene_id, sequence[start-1:end]))
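
The `start-1` in the slice suggests an assumption that IDs.txt uses 1-based, inclusive coordinates, which the question does not confirm. A minimal sketch of that conversion, using a hypothetical helper name `slice_1based` and made-up sample data:

```python
def slice_1based(sequence, start, end):
    """Return the characters from position `start` to `end`, both 1-based and inclusive."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    # Python slices are 0-based and half-open, so only `start` shifts by one.
    return sequence[start - 1:end]


sequence = "ACGTACGTTT"  # hypothetical sample sequence
print(slice_1based(sequence, 1, 4))  # the first four characters
print(slice_1based(sequence, 5, 5))  # a single character
```

If the coordinates turn out to be 0-based and half-open, plain `sequence[start:end]` is the right slice instead.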

User

#3 · Edited 2024-10-01 19:28:01

Delete the line:

seqnew = seq.split("\t")

and slice the line itself instead:

retrieved = seq[int(start):int(end)]

That gives you the substring you want.

Then you can:

print retrieved
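
To see why the split gets in the way, here is a small sketch with a hypothetical, tab-free line: `split("\t")` returns a list, so slicing its result selects list elements rather than characters.

```python
seq = "ACGTACGTTT\n"  # hypothetical one-line sequence containing no tabs

as_list = seq.split("\t")  # a list holding the whole line as its only element
print(as_list[2:6])        # slices the list, not the text

print(seq[2:6])            # slices the characters of the string itself
```

With only one element in the list, any slice that starts past index 0 comes back empty, which matches the symptom of getting no fragment at all.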
