How can I extract information from a text fragment in Python?

Over-ride Flag for Site/Laterality/Morphology (Interfield Edit 42) This field is used to identify whether a case was reviewed and coding confirmed for paired-organ primary site cases with an in situ behavior and the laterality is not coded right, left, or one side involved, right or left origin not specified. Code Description Blank Not reviewed, or reviewed and corrected 1 Reviewed and confirmed as reported: A patient had behavior code of in situ and laterality is not stated as right: origin of primary; left: origin of primary; or only one side involved, right or left origin not specified This field is used to identify whether a case was reviewed and coding confirmed for cases with a non- specific laterality code. Code Description Blank1 Not reviewed 11 A patient had laterality coded non-specifically and extension coded specifically This field, new for 2018, indicates whether a case was reviewed and coding ............

code = ["Blank", "1", "Blank1", "11"]
des = ["Not reviewed, or reviewed and corrected", "Reviewed and confirmed as reported: A patient had behavior code of in situ and laterality is not stated as right: origin of primary; left: origin of primary; or only one side involved, right or left origin not specified", "Not reviewed", "A patient had laterality coded non-specifically and extension coded specifically"]

2 Answers

User

Answer 1 · edited 2024-09-28 22:32:08

Try the following regular-expression approach:

import re

# inp is assumed to already hold the whole file as a single string.
blanks = re.findall(r'\bCode\b.*?\bDescription\s*?(\S+)\s+.*?\r?\n(\d+)\s+.*?(?=\r?\n\r?\n)', inp, flags=re.DOTALL)
print(blanks)

reviews = re.findall(r'\bCode\b.*?\bDescription\s*?\S+\s+(.*?)\r?\n\d+\s+(.*?)(?=\r?\n\r?\n)', inp, flags=re.DOTALL)
print(reviews)

This prints:

[('Blank', '1'), ('Blank1', '11')]

[('Not reviewed, or reviewed and corrected\n', 'Reviewed and confirmed as reported: A patient had behavior \ncode of in situ and laterality is not\nstated as right: origin of primary; left: origin of primary; or only one side \ninvolved, right or left\norigin not specified'), ('Not reviewed\n', 'A patient had laterality \ncoded non-specifically and\nextension coded specifically')]

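The idea here is to match each Code ... Description ... Blank portion of the input text and capture only the parts that are needed. Note that this answer assumes you have already read the text into a Python string variable (called inp above).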
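The captured descriptions above still contain embedded line breaks. The following is a minimal sketch, not part of the original answer, of how the two patterns might be merged into one and the results flattened into the two lists the question asks for. The `sample` string, the combined `pattern`, and the `normalize` helper are all hypothetical, and like the original patterns the sketch assumes each table has exactly two rows, that the second code is numeric, and that a blank line follows the table.

```python
import re

# Hypothetical sample laid out the way the patterns above expect:
# a "Code Description" header, two rows, then a blank line.
sample = (
    "Code Description\n"
    "Blank Not reviewed, or reviewed and corrected\n"
    "1 Reviewed and confirmed as reported: A patient had behavior \n"
    "code of in situ\n"
    "\n"
    "Code Description\n"
    "Blank1 Not reviewed\n"
    "11 A patient had laterality \n"
    "coded non-specifically\n"
    "\n"
)

# One pattern with four groups: first code, first description,
# second (numeric) code, second description.
pattern = re.compile(
    r"\bCode\b.*?\bDescription\s*?(\S+)\s+(.*?)\r?\n(\d+)\s+(.*?)(?=\r?\n\r?\n)",
    flags=re.DOTALL,
)


def normalize(text):
    """Collapse runs of whitespace, including line breaks, into single spaces."""
    return " ".join(text.split())


codes, descriptions = [], []
for first_code, first_desc, second_code, second_desc in pattern.findall(sample):
    codes += [first_code, second_code]
    descriptions += [normalize(first_desc), normalize(second_desc)]

print(codes)
print(descriptions)
```

To run this against a real file, `sample` could be replaced by the result of `open(path, encoding="utf-8").read()`, where `path` is whatever file holds the text.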

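User

Answer 2 · edited 2024-09-28 22:32:08

We can solve this with an algorithm that works as a small state machine. The code below opens a file named "datafile.txt" in the same directory as the Python script, parses it, and prints the result. The algorithm rests on two assumptions: consecutive fields are separated only by blank lines, and any line that starts a description we want to record separates its code from its description by three or more spaces. Judging from your file snippet, these assumptions always hold.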

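index = -1                 # position of the most recently recorded row
record = False             # True while inside a Code/Description table
description_block = False  # True while reading continuation lines
codes = []
descriptions = []
with open("datafile.txt", "r") as file:
  for line in file:
    # Split on three spaces and drop empty pieces; a blank line becomes [""].
    line = [portion.strip() for portion in line.split("   ") if portion != ""] or [""]
    if record:
      if len(line) == 2:
        # A new row: code in the first column, description in the second.
        index += 1
        codes.append(line[0])
        descriptions.append(line[1])
      else:
        if line[0]:
          description_block = True
        if description_block:
          if not line[0]:
            # A blank line after continuation text ends the table.
            description_block = False
            record = False
            continue
          else:
            # Continuation text belongs to the previous description.
            descriptions[index] += " "+line[0]
    # The table header switches recording on for the lines that follow.
    if line[0] == "Code":
      record = True
print("codes:", codes)
print("descriptions:", descriptions)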

Result:

codes: ['Blank', '1', 'Blank1', '11']
descriptions: ['Not reviewed, or reviewed and corrected', 'Reviewed and confirmed as reported: A patient had behavior code of in situ and laterality is not stated as right: origin of primary; left: origin of primary; or only one side involved, right or left origin not specified', 'Not reviewed', 'A patient had laterality coded non-specifically and extension coded specifically']

Tested with Python 3.8.2.
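For comparison, here is a minimal sketch of the same state-machine idea that runs on an in-memory sample and splits columns with `re.split(r"\s{3,}", ...)` instead of `str.split("   ")`. It is an illustration under the assumptions stated in this answer, not the answer's own code: the `sample_lines` data and the `parse_tables` function are hypothetical, and a blank line is treated as the end of a table. A continuation line that itself contains three or more consecutive spaces would be misread as a new row.

```python
import re

# Hypothetical sample: code and description separated by three or more
# spaces, continuation lines without a code, blank line after each table.
sample_lines = [
    "Code           Description",
    "Blank          Not reviewed, or reviewed and corrected",
    "1              Reviewed and confirmed as reported: A patient had",
    "               behavior code of in situ",
    "",
    "This field is used to identify something else.",
    "Code           Description",
    "Blank1         Not reviewed",
    "11             A patient had laterality coded non-specifically",
    "",
]


def parse_tables(lines):
    """Collect codes and descriptions from every Code/Description table."""
    codes, descriptions = [], []
    in_table = False
    for raw in lines:
        stripped = raw.strip()
        # Runs of three or more whitespace characters separate the columns.
        parts = re.split(r"\s{3,}", stripped)
        if not in_table:
            # Only the header line switches the parser into table mode.
            in_table = parts == ["Code", "Description"]
            continue
        if not stripped:
            # A blank line ends the current table.
            in_table = False
        elif len(parts) == 2:
            # A new row: code, then the start of its description.
            codes.append(parts[0])
            descriptions.append(parts[1])
        elif descriptions:
            # Anything else continues the previous description.
            descriptions[-1] += " " + stripped
    return codes, descriptions


codes, descriptions = parse_tables(sample_lines)
print("codes:", codes)
print("descriptions:", descriptions)
```

Because `parse_tables` accepts any iterable of lines, an open file object could be passed in place of `sample_lines`.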

Edit: updated the code to handle the entire data file provided in the comments.

# The code and description columns are assumed to be at least five spaces apart.
column_separator = "     "
index = -1
record = False
block_exit = False
break_on_newline = False
codes = []
descriptions = []
templine = ""
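def add(line):
  # Record a new code/description row and reset the block-exit flag.
  # block_exit must be declared global, otherwise the assignment below
  # only creates a local variable and has no effect.
  global index, block_exit
  index += 1
  block_exit = False
  codes.append(line[0])
  descriptions.append(line[1])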
with open("test", "r", encoding="utf-8") as file:
  while True:
    line = file.readline()
    if not line:
      break
    if record:
      line = [portion.strip() for portion in line.split(column_separator) if portion != ""]
      if len(line) > 1:
        add(line)
      else:
        if block_exit:
            record = False
            block_exit = False
        else:
          if line[0]:
            descriptions[index] += " "+line[0]
          else:
            while True:
              line = [portion.strip() for portion in file.readline().split(column_separator) if portion != ""]
              if not line:
                break
              if len(line) > 1:
                if templine:
                  descriptions[index] += templine
                  templine = ""
                add(line)
                break
              else:
                if line[0] and "Instructions" not in line[0]:
                  templine += " "+line[0]
                else:
                  if break_on_newline:
                    break_on_newline = False
                    record = False
                    templine = ""
                    break
                  else:
                    templine += " "+line[0]
                    break_on_newline = True
    else:
      if line == "Code           Description\n":
        record = True

print("codes:", codes)
print("\n")
print("descriptions:", descriptions)

# for i in range(len(codes)):
#   print(codes[i]+"\t\t", descriptions[i])
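The commented-out loop above prints each code next to its description by index. As a small sketch of an alternative, the two parallel lists can be paired with the built-in `zip`. The list contents below are copied from the result shown earlier, and the `pairs` name is hypothetical. A list of tuples is used rather than a dict because codes such as "1" may well repeat across fields in a full data file, and a dict would keep only the last description for each.

```python
codes = ["Blank", "1", "Blank1", "11"]
descriptions = [
    "Not reviewed, or reviewed and corrected",
    "Reviewed and confirmed as reported: A patient had behavior code of in situ "
    "and laterality is not stated as right: origin of primary; left: origin of "
    "primary; or only one side involved, right or left origin not specified",
    "Not reviewed",
    "A patient had laterality coded non-specifically and extension coded specifically",
]

# Pair each code with its description; order and duplicates are preserved.
pairs = list(zip(codes, descriptions))

for code, description in pairs:
    # Left-align the code in an 8-character column before the description.
    print(f"{code:<8}{description}")
```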
