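<p>We can solve this with an algorithm/state machine. The code below opens a file named "datafile.txt" in the same directory as the Python script, parses it, and prints the result. The algorithm rests on two assumptions: only blank lines separate one field from the next, <em>and</em> any line that starts a description we want to record separates its code from its description by three or more spaces. Judging from your file snippet, both assumptions always hold.</p>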
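<pre><code>index = -1                 # position of the record currently being filled
record = False             # True while we are inside a "Code ... Description" table
description_block = False  # True once a description continuation line was seen
codes = []
descriptions = []
with open("datafile.txt", "r") as file:
    for line in file:
        # Split on runs of three spaces; a blank line becomes [""].
        line = [portion.strip() for portion in line.split("   ") if portion != ""]
        if record:
            if len(line) == 2:
                # Two columns: a new code with the start of its description.
                index += 1
                codes.append(line[0])
                descriptions.append(line[1])
            else:
                if line[0]:
                    description_block = True
                if description_block:
                    if not line[0]:
                        # A blank line after a description ends the table.
                        description_block = False
                        record = False
                        continue
                    else:
                        # One column: continuation of the previous description.
                        descriptions[index] += " "+line[0]
        if line[0] == "Code":
            # Table header found; start recording from the next line.
            record = True
print("codes:", codes)
print("descriptions:", descriptions)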
</code></pre>
<p>Result:</p>
<pre><code>codes: ['Blank', '1', 'Blank1', '11']
descriptions: ['Not reviewed, or reviewed and corrected', 'Reviewed and confirmed as reported: A patient had behavior code of in situ and laterality is not stated as right: origin of primary; left: origin of primary; or only one side involved, right or left origin not specified', 'Not reviewed', 'A patient had laterality coded non-specifically and extension coded specifically']
</code></pre>
<p>Tested with Python 3.8.2.</p>
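<p>To make the splitting step concrete, here is a minimal sketch of how a line is classified, assuming the column separator is three spaces as described above. The helper name <code>split_columns</code> and the sample lines are made up for illustration and are not taken from the real data file.</p>

```python
def split_columns(line, separator="   "):
    """Split a raw line on the separator, drop empty pieces, strip the rest."""
    return [portion.strip() for portion in line.split(separator) if portion != ""]


# Hypothetical sample lines (assumptions, not the real file contents).
print(split_columns("Code   Description\n"))             # two columns: table header
print(split_columns("1      Reviewed and confirmed\n"))  # two columns: new record
print(split_columns("       origin of primary\n"))       # one column: continuation
print(split_columns("\n"))                               # blank line: one empty string
```

<p>A two-element result marks a new record, a single non-empty element marks a continuation of the previous description, and a single empty string marks a blank line.</p>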
<p>Edit:
Updated the code to handle the entire data file provided in the comments.</p>
<pre><code>column_separator = "   "   # columns are separated by three or more spaces
index = -1                 # position of the record currently being filled
record = False             # True while we are inside a code table
block_exit = False
break_on_newline = False   # True after one blank line inside a description
codes = []
descriptions = []
templine = ""              # buffered continuation text not yet attached
def add(line):
    """Start a new record from a two-column line."""
    global index, block_exit
    index += 1
    block_exit = False
    codes.append(line[0])
    descriptions.append(line[1])
with open("test", "r", encoding="utf-8") as file:
    while True:
        line = file.readline()
        if not line:
            break
        if record:
            line = [portion.strip() for portion in line.split(column_separator) if portion != ""]
            if len(line) > 1:
                add(line)
            else:
                if block_exit:
                    record = False
                    block_exit = False
                else:
                    if line[0]:
                        # Continuation of the current description.
                        descriptions[index] += " "+line[0]
                    else:
                        # Blank line: look ahead to see whether the table continues.
                        while True:
                            line = [portion.strip() for portion in file.readline().split(column_separator) if portion != ""]
                            if not line:
                                break
                            if len(line) > 1:
                                if templine:
                                    descriptions[index] += templine
                                    templine = ""
                                add(line)
                                break
                            else:
                                print(line)  # debug output for single-column lines
                                if line[0] and "Instructions" not in line[0]:
                                    templine += " "+line[0]
                                else:
                                    if break_on_newline:
                                        # Second blank or "Instructions" line: the table is over.
                                        break_on_newline = False
                                        record = False
                                        templine = ""
                                        break
                                    else:
                                        templine += " "+line[0]
                                        break_on_newline = True
        else:
            # Header check that tolerates any amount of whitespace between the two words.
            if line.split() == ["Code", "Description"]:
                record = True
print("codes:", codes)
print("\n")
print("descriptions:", descriptions)
# for i in range(len(codes)):
#     print(codes[i]+"\t\t", descriptions[i])
</code></pre>
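<p>If you want each code next to its description rather than two separate lists, one possible follow-up is to pair them with <code>zip</code>. This is only a sketch: the helper name <code>pair_up</code> is hypothetical, and the two sample entries are an excerpt of the output shown earlier. A list of tuples is used instead of a dict on the assumption that the same code (for example "Blank") may appear in more than one table.</p>

```python
def pair_up(codes, descriptions):
    """Return (code, description) tuples, keeping repeated codes."""
    if len(codes) != len(descriptions):
        raise ValueError("codes and descriptions must have the same length")
    return list(zip(codes, descriptions))


# Excerpt of the sample output shown earlier in this answer.
codes = ["Blank", "Blank1"]
descriptions = ["Not reviewed, or reviewed and corrected", "Not reviewed"]

for code, description in pair_up(codes, descriptions):
    print(code + "\t\t", description)
```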