I have a large text file, and I split it into smaller files of a fixed size. Here is what I have so far:
import math
import os

numThread = 4
inputData = r'dir\example.txt'  # raw string so the backslash is kept literally

def chunk_files():
    # Count the lines first so the chunk size can be derived from the thread count.
    with open(inputData, 'r', encoding='utf-8', errors='ignore') as f:
        nline = sum(1 for line in f)
    n_thread = int(numThread)
    chunk_size = math.floor(nline / n_thread)
    j = 0
    out = None
    with open(inputData, 'r', encoding='utf-8', errors='ignore') as file_:
        for i, line in enumerate(file_):
            # Start a new chunk on the first line and at every chunk boundary.
            if i + 1 == 1 or (j != n_thread and i + 1 == j * chunk_size):
                if out is not None:
                    out.close()
                chunk_file = '_raw' + str(j) + '.txt'
                if os.path.isfile(chunk_file):
                    break
                out = open(chunk_file, 'w+', encoding='utf-8', errors='ignore')
                j = j + 1
            if not out.closed:
                out.write(line)
            if i % 1000 == 0 and i != 0:
                print('Processing line %i ...' % i)
    # The original `i == nline` test is never true, so close the last chunk here.
    if out is not None and not out.closed:
        out.close()
    print('Done.')
Here is a sample of the text in the file:
190219 7:05:30 line3 success
line3 this is the 1st success process
line3 this process need 3sec
200219 9:10:10 line2 success
line2 this is the 1st success process
Because the split is driven only by the chunk size, a record can be cut at any line. For example:
190219 7:05:30 line3 success
line3 this is the 1st success process
line3 this process need 3sec
200219 9:10:10 line2 success
line2 this is the 1st success process
I need to use this regex:
reg = re.compile(r"\b(\d{6})(?=\s\d{1,}:\d{2}:\d{2})\b")
so that each split happens at a line that starts with a datetime, like this:
190219 7:05:30 line3 success
line3 this is the 1st success process
line3 this process need 3sec
200219 9:10:10 line2 success
line2 this is the 1st success process
I have tried, but I cannot seem to adapt it to my problem.
Can someone help me put the regex into the chunk_files function? Thanks in advance.
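Since the number of lines per record does not seem to be fixed, we can match the 6-digit number and the time, collect all the lines that follow, and then script the rest of the problem around that. Perhaps this simple expression is of interest:
Here we have our number part:
And here are our lines: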
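A minimal sketch of the idea described above, assuming every record starts with a line holding a 6-digit date followed by a time, as in the question's sample. `RECORD_START` and `group_records` are illustrative names, not the expression from this answer; the pattern is adapted from the asker's regex and anchored at the start of the line.

```python
import re

# Assumption: a record begins with a 6-digit date, whitespace, then a time like 7:05:30.
RECORD_START = re.compile(r"^(\d{6})\s(\d{1,2}:\d{2}:\d{2})\b")

def group_records(lines):
    """Group lines into records; a new record opens at every datetime line."""
    records = []
    for line in lines:
        if RECORD_START.match(line) or not records:
            records.append([line])      # a datetime line (or the very first line) opens a record
        else:
            records[-1].append(line)    # any other line belongs to the current record
    return records

sample = [
    "190219 7:05:30 line3 success",
    "line3 this is the 1st success process",
    "line3 this process need 3sec",
    "200219 9:10:10 line2 success",
    "line2 this is the 1st success process",
]

for record in group_records(sample):
    m = RECORD_START.match(record[0])
    print(m.groups() if m else None, len(record))
# ('190219', '7:05:30') 3
# ('200219', '9:10:10') 2
```

Matching line by line keeps memory use flat, which matters for a large file; a single multi-line expression over the whole text would need the file loaded at once.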
Demo
Test
Output
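I believe keeping things simple helps a lot.
Trying it with your test data gives:
You can then have the code return a generator/iterator, which makes it easy to chunk a file of any size and get back lists of chunked lines.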
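The generator approach could look roughly like the sketch below. It is an illustration under stated assumptions, not the answer's original code: records are assumed to start at a line with a 6-digit date and a time, and the names `RECORD_START`, `iter_records`, `iter_chunks`, the `min_lines` parameter, and this reworked `chunk_files` signature are all hypothetical.

```python
import re

# Assumption: a record begins with a 6-digit date, whitespace, then a time like 7:05:30.
RECORD_START = re.compile(r"^\d{6}\s\d{1,2}:\d{2}:\d{2}\b")

def iter_records(lines):
    """Yield lists of lines; each list ends just before the next datetime line."""
    record = []
    for line in lines:
        if RECORD_START.match(line) and record:
            yield record
            record = []
        record.append(line)
    if record:
        yield record

def iter_chunks(lines, min_lines):
    """Yield lists of lines holding whole records only.

    A chunk is emitted once it has at least min_lines lines, so chunks
    may run slightly over the target but never cut a record in two.
    """
    chunk = []
    for record in iter_records(lines):
        chunk.extend(record)
        if len(chunk) >= min_lines:
            yield chunk
            chunk = []
    if chunk:
        yield chunk

def chunk_files(path, min_lines, prefix="_raw"):
    """Write each chunk to <prefix><n>.txt and return the file names written."""
    names = []
    with open(path, "r", encoding="utf-8", errors="ignore") as src:
        for n, chunk in enumerate(iter_chunks(src, min_lines)):
            name = f"{prefix}{n}.txt"
            with open(name, "w", encoding="utf-8") as out:
                out.writelines(chunk)  # lines keep their own newlines
            names.append(name)
    return names
```

The `chunk_size` computed in the question (line count divided by the thread count) could be passed as `min_lines`. Because chunks can exceed the target, the number of output files may end up lower than the thread count.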