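How to split a large file by a specified size and condition

2 answers

User

#1 · edited 2024-09-30 20:37:18

Since the number of lines per record does not appear to be fixed, we can capture the six-digit number and the time, collect every remaining line, and then handle the rest of the problem in code. This simple expression may be what we are after: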

(\d{6})\s(\d{1,}:\d{2}:\d{2})|\s*(.*)\s*

This part matches the number and the time:

(\d{6})\s(\d{1,}:\d{2}:\d{2})

And this part matches the text of each line:

\s*(.*)\s*
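To make the two alternatives easier to read, here is a minimal sketch of the same expression written with `re.VERBOSE` and inline comments. The variable name `pattern` and the two sample strings are only illustrative; the expression itself is unchanged.

```python
import re

# Same expression as above, spread over several lines with re.VERBOSE
# so that each piece can carry a comment.
pattern = re.compile(
    r"""
    (\d{6})                 # group 1: the six-digit number
    \s                      # one whitespace separator
    (\d{1,}:\d{2}:\d{2})    # group 2: the time, e.g. 7:05:30
    |                       # or, when no header starts at this position:
    \s*(.*)\s*              # group 3: the text, leading whitespace skipped
    """,
    re.VERBOSE,
)

header = pattern.match("190219 7:05:30 line3 success")
text = pattern.match("   line3 this process need 3sec")

print(header.groups())  # groups 1 and 2 are set, group 3 is None
print(text.groups())    # groups 1 and 2 are None, group 3 holds the text
```

Because the header alternative comes first, it wins whenever a match starts on a six-digit number followed by a time; everything else falls through to group 3.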

Demo

Test

# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility

import re

regex = r"(\d{6})\s(\d{1,}:\d{2}:\d{2})|\s*(.*)\s*"

test_str = ("190219 7:05:30 line3 success \n"
    "               line3 this is the 1st success process\n"
    "               line3 this process need 3sec\n"
    "200219 9:10:10 line2 success \n"
    "               line2 this is the 1st success process\n"
    "190219 7:05:30 line3 success \n"
    "               line3 this is the 1st success process\n"
    "               line3 this process need 3sec\n"
    "200219 9:10:10 line2 success \n"
    "               line2 this is the 1st success process\n"
    "200219 9:10:10 line2 success \n"
    "               line2 this is the 1st success process\n"
    "               line2 this is the 1st success process\n"
    "               line2 this is the 1st success process\n"
    "               line2 this is the 1st success process\n"
    "               line2 this is the 1st success process\n"
    "               line2 this is the 1st success process")

matches = re.finditer(regex, test_str, re.MULTILINE)

for matchNum, match in enumerate(matches, start=1):
    # Report the whole match first.
    print("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum=matchNum, start=match.start(), end=match.end(), match=match.group()))

    # Then report every capture group; a group that did not take part in the
    # match has start and end -1 and the value None.
    for groupNum in range(1, len(match.groups()) + 1):
        print("Group {groupNum} found at {start}-{end}: {group}".format(groupNum=groupNum, start=match.start(groupNum), end=match.end(groupNum), group=match.group(groupNum)))

# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.

Output

Match 1 was found at 0-14: 190219 7:05:30
Group 1 found at 0-6: 190219
Group 2 found at 7-14: 7:05:30
Group 3 found at -1--1: None
Match 2 was found at 14-45:  line3 success 

Group 1 found at -1--1: None
Group 2 found at -1--1: None
Group 3 found at 15-29: line3 success 
Match 3 was found at 45-98: line3 this is the 1st success process

Group 1 found at -1--1: None
Group 2 found at -1--1: None
Group 3 found at 45-82: line3 this is the 1st success process
Match 4 was found at 98-127: line3 this process need 3sec

Group 1 found at -1--1: None
Group 2 found at -1--1: None
Group 3 found at 98-126: line3 this process need 3sec
Match 5 was found at 127-141: 200219 9:10:10
Group 1 found at 127-133: 200219
Group 2 found at 134-141: 9:10:10
Group 3 found at -1--1: None
Match 6 was found at 141-172:  line2 success 

Group 1 found at -1--1: None
Group 2 found at -1--1: None
Group 3 found at 142-156: line2 success 
Match 7 was found at 172-210: line2 this is the 1st success process

Group 1 found at -1--1: None
Group 2 found at -1--1: None
Group 3 found at 172-209: line2 this is the 1st success process
Match 8 was found at 210-224: 190219 7:05:30
Group 1 found at 210-216: 190219
Group 2 found at 217-224: 7:05:30
Group 3 found at -1--1: None
Match 9 was found at 224-255:  line3 success 

Group 1 found at -1--1: None
Group 2 found at -1--1: None
Group 3 found at 225-239: line3 success 
Match 10 was found at 255-308: line3 this is the 1st success process

Group 1 found at -1--1: None
Group 2 found at -1--1: None
Group 3 found at 255-292: line3 this is the 1st success process
Match 11 was found at 308-337: line3 this process need 3sec

Group 1 found at -1--1: None
Group 2 found at -1--1: None
Group 3 found at 308-336: line3 this process need 3sec
Match 12 was found at 337-351: 200219 9:10:10
Group 1 found at 337-343: 200219
Group 2 found at 344-351: 9:10:10
Group 3 found at -1--1: None
Match 13 was found at 351-382:  line2 success 

Group 1 found at -1--1: None
Group 2 found at -1--1: None
Group 3 found at 352-366: line2 success 
Match 14 was found at 382-420: line2 this is the 1st success process

Group 1 found at -1--1: None
Group 2 found at -1--1: None
Group 3 found at 382-419: line2 this is the 1st success process
Match 15 was found at 420-434: 200219 9:10:10
Group 1 found at 420-426: 200219
Group 2 found at 427-434: 9:10:10
Group 3 found at -1--1: None
Match 16 was found at 434-465:  line2 success 

Group 1 found at -1--1: None
Group 2 found at -1--1: None
Group 3 found at 435-449: line2 success 
Match 17 was found at 465-518: line2 this is the 1st success process

Group 1 found at -1--1: None
Group 2 found at -1--1: None
Group 3 found at 465-502: line2 this is the 1st success process
Match 18 was found at 518-571: line2 this is the 1st success process

Group 1 found at -1--1: None
Group 2 found at -1--1: None
Group 3 found at 518-555: line2 this is the 1st success process
Match 19 was found at 571-624: line2 this is the 1st success process

Group 1 found at -1--1: None
Group 2 found at -1--1: None
Group 3 found at 571-608: line2 this is the 1st success process
Match 20 was found at 624-677: line2 this is the 1st success process

Group 1 found at -1--1: None
Group 2 found at -1--1: None
Group 3 found at 624-661: line2 this is the 1st success process
Match 21 was found at 677-730: line2 this is the 1st success process

Group 1 found at -1--1: None
Group 2 found at -1--1: None
Group 3 found at 677-714: line2 this is the 1st success process
Match 22 was found at 730-767: line2 this is the 1st success process
Group 1 found at -1--1: None
Group 2 found at -1--1: None
Group 3 found at 730-767: line2 this is the 1st success process
Match 23 was found at 767-767: 
Group 1 found at -1--1: None
Group 2 found at -1--1: None
Group 3 found at 767-767:
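The "collect all the lines" step can be sketched as follows. This is only one possible way to do it, not a definitive implementation: the function name `collect_records`, the dictionary layout, and the shortened `sample` text are assumptions made for illustration.

```python
import re

regex = r"(\d{6})\s(\d{1,}:\d{2}:\d{2})|\s*(.*)\s*"


def collect_records(text):
    """Group the matches into one record per 'number time' header."""
    records = []
    for match in re.finditer(regex, text):
        number, time, rest = match.groups()
        if number is not None:
            # The header alternative matched: start a new record.
            records.append({"number": number, "time": time, "lines": []})
        elif rest and records:
            # The text alternative matched: attach non-empty text to the
            # current record (the empty match at the end is skipped).
            records[-1]["lines"].append(rest.rstrip())
    return records


# Shortened sample in the same shape as the test string above.
sample = (
    "190219 7:05:30 line3 success \n"
    "               line3 this is the 1st success process\n"
    "200219 9:10:10 line2 success \n"
    "               line2 this is the 1st success process\n"
)

for record in collect_records(sample):
    print(record["number"], record["time"], record["lines"])
```

Note that the text following the time on a header line ends up as the first entry of that record's `lines`.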

User

#2 · edited 2024-09-30 20:37:18

I believe keeping things simple helps a lot here.

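import re

# `l` is the input text, for example test_str from the first answer.
all_parts = []
part = []
for line in l.split('\n'):
    # A line that starts with "<digits> <h>:<mm>:<ss> " opens a new part.
    if re.search(r"^\d+\s\d+:\d+:\d+\s", line):
        if part:
            all_parts.append(part)
            part = []
    part.append(line)

# Flush the last part once the loop has finished.
if part:
    all_parts.append(part)

print(all_parts)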

Running it on your test data gives:

In [37]: all_parts
Out[37]: 
[['190219 7:05:30 line3 success ',
  '               line3 this is the 1st success process',
  '               line3 this process need 3sec'],
 ['200219 9:10:10 line2 success ',
  '               line2 this is the 1st success process'],
 ['190219 7:05:30 line3 success ',
  '               line3 this is the 1st success process',
  '               line3 this process need 3sec'],
 ['200219 9:10:10 line2 success ',
  '               line2 this is the 1st success process'],
 ['200219 9:10:10 line2 success ',
  '               line2 this is the 1st success process',
  '               line2 this is the 1st success process',
  '               line2 this is the 1st success process',
  '               line2 this is the 1st success process',
  '               line2 this is the 1st success process',
  '               line2 this is the 1st success process']]

From there, you can have the code return a generator/iterator, which makes it easy to chunk a file of any size and get back lists of the chunked lines.
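A minimal sketch of that generator idea, combined with the size limit from the question, might look like this. The names `iter_parts`, `iter_chunks`, and `max_bytes` are hypothetical, and packing whole parts by their UTF-8 byte size is an assumption about what "specified size" means.

```python
import io
import re

# Same header test as in the loop above.
HEADER = re.compile(r"^\d+\s\d+:\d+:\d+\s")


def iter_parts(lines):
    """Yield one list of lines per record, reading the input lazily."""
    part = []
    for line in lines:
        if HEADER.search(line) and part:
            yield part
            part = []
        part.append(line)
    if part:
        yield part


def iter_chunks(parts, max_bytes):
    """Pack whole parts into chunks of at most max_bytes where possible.

    A single part larger than max_bytes is never split; it becomes a
    chunk of its own.
    """
    chunk, size = [], 0
    for part in parts:
        part_size = sum(len(line.encode("utf-8")) for line in part)
        if chunk and size + part_size > max_bytes:
            yield chunk
            chunk, size = [], 0
        chunk.extend(part)
        size += part_size
    if chunk:
        yield chunk


# io.StringIO stands in for an open file; any iterable of lines works.
sample = io.StringIO(
    "190219 7:05:30 line3 success \n"
    "               line3 this process need 3sec\n"
    "200219 9:10:10 line2 success \n"
    "               line2 this is the 1st success process\n"
)
for index, chunk in enumerate(iter_chunks(iter_parts(sample), max_bytes=80)):
    print(index, len(chunk), "lines")
```

With a real file, pass the open file object to `iter_parts` and write each chunk to its own output file with `writelines`; only one chunk is held in memory at a time.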
