Fastest way to import 500GB of text files, taking only the parts needed

User

Reply #1 · edited 2024-05-17 05:05:35

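You say you have "about 500GB of text files". If I understand correctly, you don't have a fixed length for each observation (note that I don't mean the number of lines; I mean the total length, in bytes, of all of an observation's lines). That means you have to go through the whole file, because you can't know exactly where the newlines will be.

Now, depending on how large each individual text file is, you may need to look for a different answer. But if each file is small enough (less than 1GB?), you may be able to use the linecache module, which handles the seeking by line for you.

You might use it like this: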

import linecache

filename = 'observations1.txt'

# Start at 44th line
curline = 44
lines = []

# Keep looping until no return string is found
# getline() never throws errors, but returns an empty string ''
# if the line wasn't found (if the line was actually empty, it would have
# returned the newline character '\n')
while linecache.getline(filename, curline):
    for i in range(75):  # use xrange on Python 2
        lines.append(linecache.getline(filename, curline).rstrip())
        curline += 1

    # Perform work with the set of observation lines
    add_to_observation_log(lines)

    # Skip the unnecessary section and reset the lines list
    curline += 4
    lines = []

I tested this, and it chewed through a 23MB file in under five seconds.

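User

Reply #2 · edited 2024-05-17 05:05:35

You should consider writing the information you want to keep into a database. In Python you can use the built-in sqlite3. The docs are easy to follow.

You said you know exactly which lines in each file you want to keep, so you could try something like this.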

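    import csv

    info_to_keep = []
    obs = []
    i = 0  # 1-based row counter (it was never initialized before)
    # Python 3: csv wants a text-mode file opened with newline=""
    with open("afile.csv", newline="") as f:
        reader = csv.reader(f, delimiter="\t", quotechar='"')
        for row in reader:
            i += 1
            if i < 43:
                continue  # skip the first 42 rows
            elif i - 43 < 79 * (len(info_to_keep) + 1) - 4:
                obs.append(row)  # one of the 75 rows of the current observation
            elif i - 43 < 79 * (len(info_to_keep) + 1):
                continue  # the 4 unneeded rows between observations
            else:
                info_to_keep.append(obs)  # observation complete, start the next
                obs = [row]
    if obs:
        info_to_keep.append(obs)  # keep the last (possibly partial) observation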

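That way you end up with a list named info_to_keep that holds every entry, where each entry is a list of rows and each row is a list of the fields read from the csv file.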
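For the database suggestion at the top of this reply, here is a minimal sketch of how the collected entries could be stored with sqlite3. The function name save_observations, the table name observation_rows, and its three columns are all made up for illustration; nothing in the question fixes a schema, so adjust it to whatever fields you actually need.

```python
import sqlite3


def save_observations(conn, observations):
    """Store every row of every observation (hypothetical schema)."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "CREATE TABLE IF NOT EXISTS observation_rows ("
            "obs_id INTEGER, row_no INTEGER, content TEXT)"
        )
        conn.executemany(
            "INSERT INTO observation_rows VALUES (?, ?, ?)",
            (
                # Fields are re-joined with tabs; an assumption, not a requirement
                (obs_id, row_no, "\t".join(row))
                for obs_id, obs in enumerate(observations)
                for row_no, row in enumerate(obs)
            ),
        )


# Tiny stand-in for info_to_keep: two observations of two rows each
sample = [[["a", "1"], ["b", "2"]], [["c", "3"], ["d", "4"]]]
conn = sqlite3.connect(":memory:")  # use a file path to persist the data
save_observations(conn, sample)
row_count = conn.execute("SELECT COUNT(*) FROM observation_rows").fetchone()[0]
```

With 500GB of input you would probably call something like this once per file (or per batch of observations) rather than collecting everything in one list first.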

User

Reply #3 · edited 2024-05-17 05:05:35

opening the file. Loading it. Deleting these observations going line by line.

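What do you mean by "loading"? If you mean reading the whole thing into a single string, then yes, that would be bad. The natural way to process a file is to take advantage of the fact that a file object is an iterator over the lines of the file: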

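with open('observations1.txt') as file:  # example filename
    for line in file:  # reads one line at a time, never the whole file
        if should_use(line):
            do_something_with(line)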
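To make that concrete for the layout described in reply #1 (observations start at line 44, 75 lines per observation, 4 lines to skip in between), here is a rough sketch that streams a file with itertools.islice. The name iter_observations and its parameters are hypothetical, and the default numbers are only taken from that reply. Unlike linecache, which keeps all of a file's lines in memory, this only ever holds one observation at a time.

```python
import io
from itertools import islice


def iter_observations(lines, skip_first=43, keep=75, skip_between=4):
    """Yield one list of up to `keep` lines per observation, streaming."""
    it = iter(lines)
    # Discard the lines that come before the first observation
    for _ in islice(it, skip_first):
        pass
    while True:
        obs = [line.rstrip("\n") for line in islice(it, keep)]
        if not obs:
            return  # input exhausted
        yield obs  # the final observation may be shorter than `keep`
        # Discard the unneeded lines between observations
        for _ in islice(it, skip_between):
            pass


# Small in-memory demo: 20 lines, skip 2, keep 3, then skip 1, repeating
demo = io.StringIO("".join(f"line {n}\n" for n in range(1, 21)))
result = list(iter_observations(demo, skip_first=2, keep=3, skip_between=1))
```

For a real file you would pass the open file object instead of the StringIO, for example inside a `with open(...) as f:` block, and handle each observation as it is yielded.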
