Python: how to extract records from a natural-language file when the delimiter sits 5 characters in from the start of each record

Published 2024-09-20 04:13:57


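I need to extract individual records from log files generated by a fairly old system and get them ready for database input. These flat files are all I can get out of it (formatting the query alone took weeks). Below is an example of a file containing two records. The only delimiter I can see is "/11 S11-", which always sits at the same position, 5 characters into the record, rather than exactly at its beginning or end.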

For those watching: yes, this is related to my other newb question. I have looked at the Python docs, some Google results and some related questions. So, my questions are:

a) How do I use a delimiter that sits 5 characters in from the start of the record?

b) How do I grab these big chunks of natural language?

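c) How do I strip the whitespace that follows each line break? This is probably the easiest part: I can specify the length of each field in the query. Currently the accession number is 10 characters long, the serial number is 10 characters, and patMedicalRecordum is 15 characters. So the leading whitespace on finalxtext is 35 characters.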

01/01/11  S11-55555 20/444-55-6666 A.  PROSTATE AND SEMINAL VESICLES, PROSTATECTOMY:                           
                                   -  ADENOCARCINOMA.                                                      

                                   TOTAL GLEASON SCORE:  GLEASON 5+4=9                                     
                                   TUMOR LOCATION:  BILATERAL                                              
                                   TUMOR QUANTITATION:  15% OF PROSTATE INVOLVED BY TUMOR
                                   EXTRAPROSTATIC EXTENSION:  PRESENT AT RIGHT POSTERIOR                   
                                   SEMINAL VESICLE INVASION:  PRESENT                                      
                                   MARGINS:  UNINVOLVED                                                    
                                   LYMPHOVASCULAR INVASION:  PRESENT                                       
                                   PERINEURAL INVASION:  PRESENT                                           
                                   LYMPH NODES (SPECIMENS B AND C):                                        
                                      NUMBER EXAMINED:  25                                                 
                                      NUMBER INVOLVED:  1                                                  
                                      DIAMETER OF LARGEST METASTASIS:  1.7 mm                              
                                   ADDITIONAL FINDINGS:  HIGH-GRADE PROSTATIC INTRAEPITHELIAL NEOPLASIA,   
                                      ACUTE AND CHRONIC INFLAMMATION, INTRADUCTAL EXTENSION OF INVASIVE    
                                      CARCINOMA                                                            

                                   PATHOLOGIC STAGE:  pT3b N1 MX                                           

                               B.  LYMPH NODES, RIGHT PELVIC, EXCISION:                                    
                                   -  ONE OF SEVENTEEN LYMPH NODES POSITIVE FOR METASTASIS (1/17).         

                               C.  LYMPH NODES, LEFT PELVIC, EXCISION:                                     
                                   -  EIGHT LYMPH NODES NEGATIVE FOR METASTASIS (0/8).                     
01/02/11  S11-4444 20/111-22-3333 PROSTATE AND SEMINAL VESICLES, PROSTATECTOMY:                               
                                  - ADENOCARCINOMA.                                                        
                                    GLEASON SCORE:  3 + 3 = 6 WITH TERTIARY PATTERN OF 5.                                             
                                    TUMOR QUANTITATION:  APPROXIMATELY 10% BY VOLUME.                      
                                    TUMOR LOCATION:  BILATERAL.                                            
                                    EXTRAPROSTATIC EXTENSION:  NOT IDENTIFIED.                             
                                    MARGINS:  NEGATIVE.                                                    
                                    PERINEURAL INVASION:  IDENTIFIED.                                      
                                    LYMPH-VASCULAR INVASION:  NOT IDENTIFIED.                              
                                    SEMINAL VESICLE/VASA DEFERENTIA INVASION: NOT IDENTIFIED.              
                                    LYMPH NODES:  NONE SUBMITTED.                                          
                                    OTHER:  HIGH GRADE PROSTATIC INTRAEPITHELIAL NEOPLASIA.                
                                   PATHOLOGIC STAGE (pTNM):  pT2c NX. 

3 Answers

I would try something like this:

import re                                # regex module

in_string = """Text from above"""

records = []                             # list to store all records in order
record = ""                              # string to store current record

for line in in_string.splitlines():      # go through each line of the input
    if re.match(r'\d\d/\d\d/\d\d', line):  # match the date at the start (raw string keeps \d intact)
        records.append(record)           # add current record to list
        record = ""                      # start new current record
    record += line.strip()               # add line (without whitespace) to current record
records.append(record)                   # add last record to records list

This outputs the following:

['',

'01/01/11 S11-55555 20/444-55-6666 A. PROSTATE AND SEMINAL VESICLES, PROSTATECTOMY:- ADENOCARCINOMA.TOTAL GLEASON SCORE: GLEASON 5+4=9TUMOR LOCATION: BILATERALTUMOR QUANTITATION: 15% OF PROSTATE INVOLVED BY TUMOREXTRAPROSTATIC EXTENSION: PRESENT AT RIGHT POSTERIORSEMINAL VESICLE INVASION: PRESENTMARGINS: UNINVOLVEDLYMPHOVASCULAR INVASION: PRESENTPERINEURAL INVASION: PRESENTLYMPH NODES (SPECIMENS B AND C):NUMBER EXAMINED: 25NUMBER INVOLVED: 1DIAMETER OF LARGEST METASTASIS: 1.7 mmADDITIONAL FINDINGS: HIGH-GRADE PROSTATIC INTRAEPITHELIAL NEOPLASIA,ACUTE AND CHRONIC INFLAMMATION, INTRADUCTAL EXTENSION OF INVASIVECARCINOMAPATHOLOGIC STAGE: pT3b N1 MXB. LYMPH NODES, RIGHT PELVIC, EXCISION:- ONE OF SEVENTEEN LYMPH NODES POSITIVE FOR METASTASIS (1/17).C. LYMPH NODES, LEFT PELVIC, EXCISION:- EIGHT LYMPH NODES NEGATIVE FOR METASTASIS (0/8).',

'01/02/11 S11-4444 20/111-22-3333 PROSTATE AND SEMINAL VESICLES, PROSTATECTOMY:- ADENOCARCINOMA.GLEASON SCORE: 3 + 3 = 6 WITH TERTIARY PATTERN OF 5.TUMOR QUANTITATION: APPROXIMATELY 10% BY VOLUME.TUMOR LOCATION: BILATERAL.EXTRAPROSTATIC EXTENSION: NOT IDENTIFIED.MARGINS: NEGATIVE.PERINEURAL INVASION: IDENTIFIED.LYMPH-VASCULAR INVASION: NOT IDENTIFIED.SEMINAL VESICLE/VASA DEFERENTIA INVASION: NOT IDENTIFIED.LYMPH NODES: NONE SUBMITTED.OTHER: HIGH GRADE PROSTATIC INTRAEPITHELIAL NEOPLASIA.PATHOLOGIC STAGE (pTNM): pT2c NX.']

Note: this is a crude regex. It matches any line that starts with "nn/nn/nn", whether or not that is a valid date.

You will probably want to add a space between the joined lines, something like record += line.strip() + ' '
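A minimal sketch of that space-joined variant, assuming the same "date at the start of the line" rule. The helper name split_records is hypothetical, the sample is shortened from the question, and the sketch also skips blank lines and avoids the empty first entry:

```python
import re

DATE_AT_START = re.compile(r'\d\d/\d\d/\d\d')  # same loose date check as above

def split_records(text):
    """Group lines into records; a line starting with a date opens a new one."""
    records = []
    parts = []                                 # stripped lines of the current record
    for line in text.splitlines():
        if DATE_AT_START.match(line) and parts:
            records.append(' '.join(parts))    # close the previous record
            parts = []
        if line.strip():                       # ignore blank lines
            parts.append(line.strip())
    if parts:
        records.append(' '.join(parts))        # close the last record
    return records

# Shortened sample, assumed to follow the layout shown in the question.
sample = (
    "01/01/11  S11-55555 20/444-55-6666 A.  PROSTATE:   \n"
    "                                   -  ADENOCARCINOMA.   \n"
    "\n"
    "01/02/11  S11-4444 20/111-22-3333 PROSTATE:   \n"
    "                                  - ADENOCARCINOMA.   \n"
)
for rec in split_records(sample):
    print(rec)
```

Joining with a single space keeps "PROSTATECTOMY:" and "- ADENOCARCINOMA." from running together, which is what happens in the output shown above.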

Good luck!


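You can try out a regular expression (regex/re) here: put the pattern (e.g. \d\d/\d\d/\d\d S11) in the top box and your text in the bottom box. The sample has more than one space after the date, so \d\d/\d\d/\d\d\s+S11 may be the safer pattern.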
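Once a pattern like that works in the tester, one way to use it as the record delimiter is to split at the start of every line that matches it. This is only a sketch: the name split_on_delimiter is hypothetical, the sample is shortened from the question, and splitting on a zero-width match needs Python 3.7 or later.

```python
import re

# Zero-width split point: the start of any line that begins with a date,
# some whitespace and "S11-". The lookahead keeps the delimiter in the chunk.
RECORD_START = re.compile(r'^(?=\d\d/\d\d/\d\d\s+S11-)', re.MULTILINE)

def split_on_delimiter(text):
    """Return the chunks of text that begin at each delimiter line."""
    return [chunk for chunk in RECORD_START.split(text) if chunk.strip()]

# Shortened sample, assumed to follow the layout shown in the question.
sample = (
    "01/01/11  S11-55555 20/444-55-6666 A.  PROSTATE:\n"
    "                                   -  ADENOCARCINOMA.\n"
    "01/02/11  S11-4444 20/111-22-3333 PROSTATE:\n"
    "                                  - ADENOCARCINOMA.\n"
)
chunks = split_on_delimiter(sample)
print(len(chunks))
```

Because the lookahead consumes nothing, each chunk still starts with its own date and S11 number.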

Delimiter

I may be missing something, but looking at your records, in particular 01/01/11 S11-55555 20/444-55-6666, the 01/01/11 part looks a lot like a date to me.

So, judging from your input:

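  • You can check whether a line starts with a date (the format here is mm/dd/yy), for example with a very simple regex and re.match.
  • The data inside each record appears to be indented, so a line with no indentation marks the start of a new record.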
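The indentation rule from the second bullet can be sketched as follows. The function names are hypothetical, and the sketch assumes that every continuation line starts with at least one space:

```python
def is_record_start(line):
    """True for a non-blank line whose first character is not whitespace."""
    return bool(line.strip()) and not line[0].isspace()

def group_by_indentation(lines):
    """Collect lines into records; an unindented line opens a new record."""
    records = []
    for line in lines:
        if is_record_start(line):
            records.append([line.strip()])
        elif records and line.strip():
            records[-1].append(line.strip())   # indented continuation line
    return records

# Shortened sample, assumed to follow the layout shown in the question.
sample_lines = [
    "01/01/11  S11-55555 20/444-55-6666 A.  PROSTATE:",
    "                                   -  ADENOCARCINOMA.",
    "",
    "01/02/11  S11-4444 20/111-22-3333 PROSTATE:",
    "                                  - ADENOCARCINOMA.",
]
for record in group_by_indentation(sample_lines):
    print(record)
```

This version never looks at the date at all, so it keeps working if the date format changes but breaks if a continuation line ever starts in column 0.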

Whitespace

my_string.strip() returns a copy of my_string with the leading and trailing whitespace removed.
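A small illustration, using one line modelled on the sample in the question. strip() only touches the two ends; if the runs of spaces inside the line should also shrink to one space, splitting on whitespace and re-joining is a common idiom:

```python
line = "      TUMOR LOCATION:  BILATERAL        "

stripped = line.strip()              # only the ends are trimmed
collapsed = ' '.join(line.split())   # inner runs of whitespace become one space

print(repr(stripped))    # 'TUMOR LOCATION:  BILATERAL'
print(repr(collapsed))   # 'TUMOR LOCATION: BILATERAL'
```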

Here is an idea:

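chunks = []                                  # one text blob per record
with open(filename) as chunky:               # filename: path to your log file
    for line in chunky:
        if line[:1].isdigit():               # It's a starting line (not indented)
            linedata = line.split(None, 3)   # separates line in four pieces
            chunks.append(linedata[3].strip())
        elif chunks:                         # continuation of the current record
            chunks[-1] += ' ' + line.strip()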

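For newbies: a slice of a string is written line[a:b], where a is the index of the first character you want (counting from 0) and b is the index of the first character you do not want. Your S11 would be linedata[1][0:3].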
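A short sketch of both approaches on the first header line from the question. The fixed-width part assumes the column widths of 10, 10 and 15 characters mentioned in the question, and the variable names are placeholders rather than real field names:

```python
header = "01/01/11  S11-55555 20/444-55-6666 A.  PROSTATE AND SEMINAL VESICLES, PROSTATECTOMY:"

# Whitespace-based split into four pieces, then a slice of the second piece.
linedata = header.split(None, 3)
prefix = linedata[1][0:3]          # characters 0, 1 and 2 of 'S11-55555'

# Fixed-width slices, assuming columns of 10, 10 and 15 characters.
first = header[0:10].strip()
second = header[10:20].strip()
third = header[20:35].strip()
text = header[35:].strip()         # everything from column 35 onward

print(prefix, first, second, third)
```

The fixed-width version also explains the 35 leading spaces on continuation lines: 10 + 10 + 15 columns are left empty before the text column starts.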
