I am reading a large text file in Python that looks like the following (it contains many Code and Description entries):
Over-ride Flag for Site/Laterality/Morphology (Interfield Edit 42)
This field is used to identify whether a case was reviewed and coding confirmed
for paired-organ primary
site cases with an in situ behavior and the laterality is not coded right,
left, or one side involved, right or left
origin not specified.
Code Description
Blank Not reviewed, or reviewed and corrected
1 Reviewed and confirmed as reported: A patient had behavior
code of in situ and laterality is not
stated as right: origin of primary; left: origin of primary; or only one side
involved, right or left
origin not specified
This field is used to identify whether a case was reviewed and coding confirmed
for cases with a non-
specific laterality code.
Code Description
Blank1 Not reviewed
11 A patient had laterality
coded non-specifically and
extension coded specifically
This field, new for 2018, indicates whether a case was reviewed and coding
............
From the free text above, I only need to store the code and description values in two lists, like this:
code = ["Blank", "1", "Blank1", "11"]
des = ["Not reviewed, or reviewed and corrected", "Reviewed and confirmed as reported: A patient had behavior code of in situ and laterality is not stated as right: origin of primary; left: origin of primary; or only one side involved, right or left origin not specified", "Not reviewed", "A patient had laterality coded non-specifically and extension coded specifically"]
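How can I do this in Python?

Note: a Code can be the keyword "Blank" (or "Blank1") or a numeric value. A Description is sometimes split across several lines. In the example above, each Code and Description block contains two codes and two descriptions, but a block can contain one or more codes and descriptions.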
One option is a regular-expression approach.
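A minimal sketch of a regex-based extraction follows. It is not the code originally posted with this answer, and it rests on assumptions that are not guaranteed by the question: each block starts with a `Code Description` header line, each block ends at a blank line (or at the end of the text), and no wrapped description line begins with `Blank` or a number followed by a space. The names `extract_codes`, `BLOCK_RE`, `ENTRY_RE`, and `SAMPLE` are hypothetical.

```python
import re

# Assumption: a block is everything after a "Code Description" header
# up to the next blank line (or the end of the text).
BLOCK_RE = re.compile(
    r"^Code[ \t]+Description[ \t]*\n(.*?)(?=\n[ \t]*\n|\Z)",
    re.MULTILINE | re.DOTALL,
)

# Assumption: an entry starts at a line whose first token is "Blank",
# "Blank<digits>", or a number; it runs until the next such line.
ENTRY_RE = re.compile(
    r"^(Blank\d*|\d+)[ \t]+(.*?)(?=^(?:Blank\d*|\d+)[ \t]|\Z)",
    re.MULTILINE | re.DOTALL,
)


def extract_codes(text):
    """Return (codes, descriptions) found in every Code/Description block."""
    codes, descriptions = [], []
    for block in BLOCK_RE.findall(text):
        for match in ENTRY_RE.finditer(block):
            codes.append(match.group(1))
            # Collapse line breaks inside a wrapped description.
            descriptions.append(" ".join(match.group(2).split()))
    return codes, descriptions


SAMPLE = """\
This field is used to identify whether a case was reviewed and coding confirmed
Code Description
Blank Not reviewed, or reviewed and corrected
1 Reviewed and confirmed as reported: A patient had behavior
code of in situ and laterality is not
stated as right: origin of primary; left: origin of primary; or only one side
involved, right or left
origin not specified

This field is used to identify whether a case was reviewed and coding confirmed
for cases with a non-
specific laterality code.
Code Description
Blank1 Not reviewed
11 A patient had laterality
coded non-specifically and
extension coded specifically

This field, new for 2018, indicates whether a case was reviewed and coding
"""

code, des = extract_codes(SAMPLE)
print(code)
print(des)
```

If the real file has no blank line after a block, the block boundary has to be detected some other way, for example by a known phrase that starts the next field.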
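The idea is to match and capture only the required parts of each `Code ... Description ... Blank` section of the input text. Note that this answer assumes you have already read the text into a Python string variable.

We can also solve this with an algorithm/state machine. This approach opens a file named "datafile.txt" in the same directory as the Python script, parses it, and prints the result. The key to the algorithm is two assumptions: consecutive fields are separated by blank lines, and any line that starts a description we want to record separates its code from its description with three or more spaces. Judging from your file snippet, these assumptions always hold.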
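The state-machine idea can be sketched roughly as below. This is an illustration under the assumptions stated above, not the code originally posted with this answer: a blank line ends a block, and an entry line has three or more spaces between its code and its description. The sample text is spaced to satisfy those assumptions, and the names `parse_lines`, `HEADER_RE`, `ENTRY_RE`, and `SAMPLE` are hypothetical.

```python
import re

HEADER_RE = re.compile(r"Code\s+Description")
# Assumption: code and description are separated by three or more spaces.
ENTRY_RE = re.compile(r"^(\S+)\s{3,}(\S.*)$")


def parse_lines(lines):
    """Walk the lines once, tracking whether we are inside a block."""
    codes, descriptions = [], []
    in_block = False    # True between a header line and the next blank line
    entry_open = False  # True once the current block has produced an entry
    for raw in lines:
        line = raw.strip()
        if not line:
            # Assumption: a blank line ends the current block.
            in_block = entry_open = False
        elif HEADER_RE.fullmatch(line):
            in_block, entry_open = True, False
        elif in_block:
            match = ENTRY_RE.match(line)
            if match:
                codes.append(match.group(1))
                descriptions.append(match.group(2).strip())
                entry_open = True
            elif entry_open:
                # A wrapped line continues the previous description.
                descriptions[-1] += " " + line
    return codes, descriptions


SAMPLE = """\
This field is used to identify whether a case was reviewed and coding confirmed

Code   Description
Blank   Not reviewed, or reviewed and corrected
1       Reviewed and confirmed as reported: A patient had behavior
code of in situ and laterality is not
origin not specified

This field is used to identify whether a case was reviewed and coding confirmed

Code   Description
Blank1   Not reviewed
11       A patient had laterality
coded non-specifically and
extension coded specifically
"""

codes, descriptions = parse_lines(SAMPLE.splitlines())
print(codes)
print(descriptions)
```

Because `parse_lines` accepts any iterable of lines, an open file object can be passed directly, for example `with open("datafile.txt", encoding="utf-8") as fh: codes, descriptions = parse_lines(fh)`.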
Tested in Python 3.8.2.

Edit: Updated the code to reflect the entire data file provided in the comments.