<p>You don't need the <code>re</code> module here, and you don't need to load the whole file into memory either:</p>
<pre><code>with open(r"C:\Users\file1.txt", 'r') as f, open('file2.txt', "w") as file:
    seen = set()  # use a set to only keep distinct lines
    for line in f:  # iterate the input file
        lr = line.rstrip()
        if line.startswith('one') and lr.endswith('apple'):
            if lr not in seen:
                seen.add(lr)
                _ = file.write(line)
</code></pre>
<hr/>
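<p>If the search is actually more complex and does need the <code>re</code> module, I would still process one line at a time and compile the regular expression outside the loop:</p>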
<pre><code>import re

with open(r"C:\Users\file1.txt", 'r') as f, open('file2.txt', "w") as file:
    seen = set()  # use a set to only keep distinct lines
    rx = re.compile(pattern)  # pattern: your regex string, compiled once
    for line in f:  # iterate the input file
        lr = line.rstrip()
        if rx.match(lr):
            if lr not in seen:
                seen.add(lr)
                _ = file.write(line)
</code></pre>
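<p>As an illustration of what <code>pattern</code> could look like, here is a minimal sketch that expresses the earlier <code>startswith</code>/<code>endswith</code> check as a regex. The pattern and the helper name <code>keep</code> are assumptions made for this example, not part of the original question:</p>

```python
import re

# Hypothetical pattern: the line starts with "one" and ends with "apple".
# rx.match() anchors at the start of the string, so no leading ^ is needed.
rx = re.compile(r"one.*apple$")


def keep(line):
    """Return True if the line, without trailing whitespace, matches rx."""
    lr = line.rstrip()
    return rx.match(lr) is not None


print(keep("one red apple\n"))
print(keep("one apple pie\n"))
```

<p>Compiling once outside the loop avoids looking the pattern up again for every line of the file.</p>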
<hr/>
<p>If you need to search for 2 patterns and make sure the matches of the first pattern are written before those of the second, you can use:</p>
<pre><code>import re

# Raw strings keep the regex backslashes intact
patterns = [r"^\s*\[SUM\]\s*[0-9\-\.]+\s+sec(?!\s+0\.00 Bytes).*sender.*",
            r"^\s*\[SUM\]\s*[0-9\-\.]+\s+sec(?!\s+0\.00 Bytes).*receiver.*"]
rxs = [re.compile(pattern) for pattern in patterns]
with open(r"C:\Users\file1.txt", 'r') as f:
    data = [[], []]  # one list of matches per pattern
    seen = set()  # use a set to only keep distinct lines
    for line in f:  # iterate the input file
        lr = line.rstrip()
        for i, rx in enumerate(rxs):
            if rx.match(lr):
                if lr not in seen:
                    seen.add(lr)
                    data[i].append(line)
with open('file2.txt', "w") as file:
    for lst in data:  # all matches of pattern 1, then all matches of pattern 2
        for line in lst:
            _ = file.write(line)
with open('file2.txt') as file:  # read the result back to display it
    print(file.read(), end="")
</code></pre>
<p>It gives the expected result:</p>
<pre><code>[SUM] 0.00-34.53 sec 2.11 GBytes 524 Mbits/sec sender
[SUM] 0.00-34.62 sec 2.36 GBytes 586 Mbits/sec sender
[SUM] 0.00-34.75 sec 2.39 GBytes 591 Mbits/sec receiver
</code></pre>
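<p>To try the two-pattern logic without touching any files, the same idea can be wrapped in a function that takes any iterable of lines. This is only a sketch: the function name <code>filter_distinct</code> and the sample lines are made up for illustration, and the <code>break</code> is an addition that stops checking further patterns once a line has matched (the result is the same as above, because a line that is already in <code>seen</code> is never stored twice):</p>

```python
import re

PATTERNS = [
    r"^\s*\[SUM\]\s*[0-9\-\.]+\s+sec(?!\s+0\.00 Bytes).*sender.*",
    r"^\s*\[SUM\]\s*[0-9\-\.]+\s+sec(?!\s+0\.00 Bytes).*receiver.*",
]


def filter_distinct(lines, patterns):
    """Return distinct matching lines, grouped in the order of the patterns."""
    rxs = [re.compile(p) for p in patterns]  # compile once, outside the loop
    buckets = [[] for _ in rxs]              # one list of matches per pattern
    seen = set()                             # stripped lines already kept
    for line in lines:
        lr = line.rstrip()
        for i, rx in enumerate(rxs):
            if rx.match(lr):
                if lr not in seen:
                    seen.add(lr)
                    buckets[i].append(line)
                break  # first matching pattern wins
    # Flatten: all matches of pattern 0, then all matches of pattern 1, ...
    return [line for bucket in buckets for line in bucket]


# Made-up sample input: a receiver line first, a duplicated sender line,
# a "0.00 Bytes" line that the lookahead rejects, and an unrelated line.
sample = [
    "[SUM]   0.00-34.75  sec  2.39 GBytes   591 Mbits/sec  receiver\n",
    "[SUM]   0.00-34.53  sec  2.11 GBytes   524 Mbits/sec  sender\n",
    "[SUM]   0.00-34.53  sec  2.11 GBytes   524 Mbits/sec  sender\n",
    "[SUM]   0.00-1.00   sec  0.00 Bytes  0.00 bits/sec  sender\n",
    "unrelated line\n",
]

result = filter_distinct(sample, PATTERNS)
print("".join(result), end="")
```

<p>Because the function accepts any iterable, you can pass an open file object directly, so the input is still read one line at a time; only the matching lines are held in memory.</p>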