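<p>Here is an alternative approach: make two passes over the lines of text. In the first pass, we guess the start indices, meaning the indices at which a non-space character is preceded by a space. The start indices may differ slightly from line to line, but the indices that are common to <strong>all</strong> lines are likely to be the column starts. This guess is not perfect: it requires that no cell be missing in any line.</p>
<p>In the second pass, we use these indices to split each line into columns, stripping the surrounding whitespace from each cell.</p>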
<pre><code>import itertools


def split_columns(row, start_indices):
    """Split a row at the given start indices, stripping each cell."""
    # Pair each start with the next one; None lets the last slice run to the end.
    start_indices = start_indices + [None]
    a, b = iter(start_indices), iter(start_indices)
    next(b)
    columns = []
    for istart, istop in zip(a, b):
        columns.append(row[slice(istart, istop)].strip())
    return columns


def guess_start_indices(row):
    """Return the set of indices where a non-space character follows a space."""
    # Prepend a space so that a row starting with text yields index 0.
    row = ' ' + row
    prev_seq, cur_seq = iter(row), iter(row)
    next(cur_seq)
    start_indices = []
    for i, (prev, cur) in enumerate(zip(prev_seq, cur_seq)):
        if prev == ' ' and cur != ' ':
            start_indices.append(i)
    return set(start_indices)


def find_common_start_indices(rows):
    """Keep only the start indices shared by every row."""
    start_indices = set.intersection(*(guess_start_indices(row) for row in rows))
    start_indices = sorted(start_indices)
    return start_indices


if __name__ == '__main__':
    with open('columnize.txt') as rows:
        # Two independent iterators over the same lines, one per pass.
        first_pass, second_pass = itertools.tee(rows)
        start_indices = find_common_start_indices(first_pass)
        print(start_indices)
        for row in second_pass:
            print(split_columns(row, start_indices))
</code></pre>
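<p>To see why the intersection matters, here is a minimal sketch of the two passes run on a small in-memory table instead of a file. The sample data and the column widths (10 and 12) are made up for illustration, and the helper functions are compact rewrites of the same idea rather than the exact code above.</p>

```python
def guess_start_indices(row):
    # padded[i] is the character just before row[i]
    padded = ' ' + row
    return {i for i in range(len(row))
            if padded[i] == ' ' and padded[i + 1] != ' '}


def find_common_start_indices(rows):
    # Only indices seen in every row survive the intersection
    return sorted(set.intersection(*(guess_start_indices(r) for r in rows)))


def split_columns(row, starts):
    # Each slice runs from one start to the next; the last runs to the end
    stops = starts[1:] + [None]
    return [row[a:b].strip() for a, b in zip(starts, stops)]


# Hypothetical sample table, padded to fixed widths of 10 and 12 characters
table = [("name", "city", "age"),
         ("Alice", "New York", "30"),
         ("Bob", "Oslo", "41")]
rows = [a.ljust(10) + b.ljust(12) + c for a, b, c in table]

# "York" starts at index 14, but only in one row, so it is filtered out
per_row = [sorted(guess_start_indices(r)) for r in rows]
starts = find_common_start_indices(rows)
cells = [split_columns(r, starts) for r in rows]
```

<p>In this sample, the second row alone also proposes the start of "York" as a candidate, but the intersection discards it because the other rows have a space at that position.</p>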
<p>Notes</p>
<ul>
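<li>In the code, I create two iterators, <code>first_pass</code> and <code>second_pass</code>, to iterate over the lines of text. These iterators matter because they allow iterating over the lines twice without rewinding the file.</li>
<li>The theme of this approach is guessing, so some inputs will fool the code into making wrong guesses.</li>
<li>Therefore, treat this solution as a starting point and verify the output.</li>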
</ul>
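<p>One way the guess goes wrong is the missing-cell case mentioned earlier. The sketch below uses made-up sample data with an empty cell in the second row; the expected column count of 3 is an assumption that a caller would have to supply, since the code cannot infer it.</p>

```python
def guess_start_indices(row):
    # padded[i] is the character just before row[i]
    padded = ' ' + row
    return {i for i in range(len(row))
            if padded[i] == ' ' and padded[i + 1] != ' '}


# Hypothetical sample table; the "city" cell of the second row is empty
table = [("name", "city", "age"),
         ("Alice", "", "30"),
         ("Bob", "Oslo", "41")]
rows = [a.ljust(10) + b.ljust(12) + c for a, b, c in table]

common = sorted(set.intersection(*(guess_start_indices(r) for r in rows)))

# Assumed to be known by the caller; used only as a sanity check
expected_columns = 3
looks_wrong = len(common) != expected_columns
```

<p>Because the second row has no text at index 10, that index drops out of the intersection and the "city" column silently merges into the first one. Comparing the number of detected start indices against a known column count is one cheap way to catch this before trusting the output.</p>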