How can I parse an undelimited text file of well licence data into lists?

Posted 2024-09-24 22:18:17

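I am trying to parse, in Python, a text file from the AER that lists the well licences issued in Alberta each day. Essentially, I want to separate each licence's data according to the field types shown in the file header (well name, unique identifier, licence number, and so on), append each licence's values to a list, and then load them into a database.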

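The problem is that the format of the text file (see the excerpt below) is not particularly friendly to parsing. It has no delimiters; it is laid out to be read by humans. My experience with string handling is limited, and I don't know how to approach this.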

Here is an excerpt from the text file:





    DATE: 02 July 2019                                                                                  


    --------------------------------------------------------------------------------------------        
    WELL NAME               LICENCE NUMBER         MINERAL RIGHTS       GROUND ELEVATION                
    UNIQUE IDENTIFIER       SURFACE CO-ORDINATES   BOARD FIELD CENTRE   PROJECTED DEPTH                 
    LAHEE CLASSIFICATION    FIELD                                       TERMINATING ZONE                
    DRILLING OPERATION      WELL PURPOSE           WELL  TYPE           SUBSTANCE                       
    LICENSEE                                                            SURFACE LOCATION                
    --------------------------------------------------------------------------------------------        

    MEG K7N HARDY 4-7-77-5               0483923   ALBERTA CROWN        571.7M                          
    106/04-07-077-05W4/02  S  572.4M  W  278.3M    BONNYVILLE           1600.0M                         
    DEV (NC)                             HARDY                          MCMURRAY FM                     
    HORIZONTAL                           RESUMPTIONPRODUCTION (SCHEME)  CRUDE BITUMEN                   
    MEG ENERGY CORP.                                                    09-07-077-05W4                  

    SPL 11-24 HZ MARTEN 14-25-76-6       0494994   ALBERTA CROWN        705.3M                          
    100/14-25-076-06W5/00  S  566.0M  E  800.6M    ST. ALBERT           2700.0M                         
    OUT (C)                              MARTEN                         CLEARWATER FM                   
    HORIZONTAL                           NEW       PRODUCTION           CRUDE OIL                       
    SPUR PETROLEUM LTD.                                                 11-24-076-06W5                  

    SPL 10-24 HZ MARTEN 5-23-76-6        0494995   ALBERTA CROWN        705.5M                          
    100/05-23-076-06W5/00  S  566.3M  W  800.1M    ST. ALBERT           2700.0M                         
    OUT (C)                              MARTEN                         CLEARWATER FM                   
    HORIZONTAL                           NEW       PRODUCTION           CRUDE OIL                       
    SPUR PETROLEUM LTD.                                                 10-24-076-06W5                  

    SURGE ENERGY HZ103 VALHALLA 6-7-75-8 0494996   ALBERTA CROWN        770.8M                          
    103/06-07-075-08W6/00  S  372.0M  E  324.5M    GRANDE PRAIRIE       3350.0M                         
    DEV (NC)                             VALHALLA                       DOIG FM                         
    HORIZONTAL                           NEW       PRODUCTION           CRUDE OIL                       
    SURGE ENERGY INC.                                                   13-06-075-08W6                  

    CNRL ET AL HZ KARR 4-16-66-3         0494997   ALBERTA CROWN        770.7M                          
    100/04-16-066-03W6/00  N  623.4M  E  127.5M    GRANDE PRAIRIE       5295.0M                         
    DEV (NC)                             KARR                           DUNVEGAN FM                     
    HORIZONTAL                           NEW       PRODUCTION           CRUDE OIL                       
    CANADIAN NATURAL RESOURCES LIMITED                                  05-14-066-03W6     

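I don't need the header information between the dashed lines, or anything from the date line. I only need the text from each section of each line of each block, as laid out in the header. I have tried a few approaches, including basic string manipulation in Python and regular expressions, but none came close, and I am at a loss. Let me know if you need the task explained in more detail; I realize it is a big ask and somewhat complicated.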


1 answer
User
Answer 1 · posted 2024-09-24 22:18:17

This expression, or some derivative of it, might extract the desired data:

[A-Z]{1,}.*?\d+-\d+-\d+-\d+[\s\S]*?\s{3,}\d+-\d+-\d+-[A-Za-z0-9]{4}

However, we would probably do better to remove the header before passing the text through the regular expression.
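A minimal sketch of that header removal, assuming (as in the excerpt) that the column header sits between two lines made up only of dashes. The name `strip_header` and the threshold of ten dashes are illustrative choices, not part of the original answer.

```python
import re


def strip_header(text: str) -> str:
    """Return the text that follows the second dashed rule line.

    ASSUMPTION: the column header is enclosed by two lines that contain
    nothing but dashes and whitespace, as in the excerpt.
    """
    lines = text.splitlines()
    # Indices of lines consisting solely of dashes (ten or more).
    rules = [i for i, line in enumerate(lines)
             if re.fullmatch(r"\s*-{10,}\s*", line)]
    if len(rules) < 2:
        return text  # no recognizable header; leave the text untouched
    return "\n".join(lines[rules[1] + 1:])


# Small stand-in for the real file: date line, rule, header row, rule, data.
rule = "    " + "-" * 92
sample = "\n".join([
    "    DATE: 02 July 2019",
    rule,
    "    WELL NAME               LICENCE NUMBER",
    rule,
    "",
    "    MEG K7N HARDY 4-7-77-5               0483923",
])

body = strip_header(sample)
print(body)
```

If the real file repeats the header (for example once per page), this sketch only removes the first one; that case would need a loop over every pair of rule lines.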


The right-hand panel of this demo explains the expression further, if you are interested.

Test

import re

regex = r"[A-Z]{1,}.*?\d+-\d+-\d+-\d+[\s\S]*?\s{3,}\d+-\d+-\d+-[A-Za-z0-9]{4}"

test_str = (" DATE: 02 July 2019                                                                                  \n\n\n"
    "                                                          \n"
    "    WELL NAME               LICENCE NUMBER         MINERAL RIGHTS       GROUND ELEVATION                \n"
    "    UNIQUE IDENTIFIER       SURFACE CO-ORDINATES   BOARD FIELD CENTRE   PROJECTED DEPTH                 \n"
    "    LAHEE CLASSIFICATION    FIELD                                       TERMINATING ZONE                \n"
    "    DRILLING OPERATION      WELL PURPOSE           WELL  TYPE           SUBSTANCE                       \n"
    "    LICENSEE                                                            SURFACE LOCATION                \n"
    "                                                          \n\n"
    "    MEG K7N HARDY 4-7-77-5               0483923   ALBERTA CROWN        571.7M                          \n"
    "    106/04-07-077-05W4/02  S  572.4M  W  278.3M    BONNYVILLE           1600.0M                         \n"
    "    DEV (NC)                             HARDY                          MCMURRAY FM                     \n"
    "    HORIZONTAL                           RESUMPTIONPRODUCTION (SCHEME)  CRUDE BITUMEN                   \n"
    "    MEG ENERGY CORP.                                                    09-07-077-05W4                  \n\n"
    "    SPL 11-24 HZ MARTEN 14-25-76-6       0494994   ALBERTA CROWN        705.3M                          \n"
    "    100/14-25-076-06W5/00  S  566.0M  E  800.6M    ST. ALBERT           2700.0M                         \n"
    "    OUT (C)                              MARTEN                         CLEARWATER FM                   \n"
    "    HORIZONTAL                           NEW       PRODUCTION           CRUDE OIL                       \n"
    "    SPUR PETROLEUM LTD.                                                 11-24-076-06W5                  ")

# re.MULTILINE has no effect on this pattern; it only matters once ^ or $ is used.
matches = re.finditer(regex, test_str, re.MULTILINE)

for match_num, match in enumerate(matches, start=1):
    print(f"Match {match_num} was found at {match.start()}-{match.end()}: {match.group()}")

    # The pattern has no capturing groups, so this loop prints nothing as written.
    for group_num in range(1, len(match.groups()) + 1):
        print(f"Group {group_num} found at {match.start(group_num)}-{match.end(group_num)}: {match.group(group_num)}")

Suggestions

The fourth bird suggests:

[The above expression] is not anchored and causes a lot of backtracking. Perhaps anchoring it with ^[ \t]* could make it a bit more efficient.

^[ \t]*[A-Z]{1,}.*?\d+-\d+-\d+-\d+[\s\S]*?\s{3,}\d+-\d+-\d+-[A-Za-z0-9]{4}

See a demo
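One practical point about the anchored expression: in Python, `^` matches only at the very start of the string unless `re.MULTILINE` is passed. The sketch below runs the anchored pattern with and without the flag on two blocks copied from the excerpt; the variable names are illustrative.

```python
import re

# Anchored variant of the expression, as suggested above.
ANCHORED = (r"^[ \t]*[A-Z]{1,}.*?\d+-\d+-\d+-\d+"
            r"[\s\S]*?\s{3,}\d+-\d+-\d+-[A-Za-z0-9]{4}")

# Two licence blocks preceded by the date line.
text = (
    "    DATE: 02 July 2019\n"
    "\n"
    "    MEG K7N HARDY 4-7-77-5               0483923   ALBERTA CROWN        571.7M\n"
    "    106/04-07-077-05W4/02  S  572.4M  W  278.3M    BONNYVILLE           1600.0M\n"
    "    DEV (NC)                             HARDY                          MCMURRAY FM\n"
    "    HORIZONTAL                           RESUMPTIONPRODUCTION (SCHEME)  CRUDE BITUMEN\n"
    "    MEG ENERGY CORP.                                                    09-07-077-05W4\n"
    "\n"
    "    SPL 11-24 HZ MARTEN 14-25-76-6       0494994   ALBERTA CROWN        705.3M\n"
    "    100/14-25-076-06W5/00  S  566.0M  E  800.6M    ST. ALBERT           2700.0M\n"
    "    OUT (C)                              MARTEN                         CLEARWATER FM\n"
    "    HORIZONTAL                           NEW       PRODUCTION           CRUDE OIL\n"
    "    SPUR PETROLEUM LTD.                                                 11-24-076-06W5\n"
)

# Without the flag, ^ can only match at position 0 (the date line), so nothing is found.
without_flag = re.findall(ANCHORED, text)

# With the flag, ^ matches at the start of every line and each block is found.
with_flag = re.findall(ANCHORED, text, re.MULTILINE)

print(len(without_flag), len(with_flag))
```

Each match with the flag runs from the well-name line to the surface location on the last line of the block, including the leading indentation that `^[ \t]*` consumes.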

Based on the current sample data, this might also be an option:

^[ \t]*[A-Z]+(?: [A-Z0-9-]+)+[ \t]+[0-9]{7}[ \t]+.*(?:\r?\n(?![ \t]*$).*)*

See a demo
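The expressions above isolate each licence block, but the question also asks for the individual fields. Some fields touch with no whitespace between them (`RESUMPTIONPRODUCTION (SCHEME)`), and one well name is separated from its licence number by a single space, so splitting on runs of spaces is unreliable. A hedged sketch of one alternative, not taken from the answer: find blocks with the last expression, then slice every line at fixed column offsets. The offsets in `LAYOUT`, the field names, and `parse_licences` are assumptions read off the excerpt, not an AER specification, and should be checked against a full file.

```python
import re

# Block pattern from the answer above; ^ and $ rely on re.MULTILINE.
BLOCK_RE = re.compile(
    r"^[ \t]*[A-Z]+(?: [A-Z0-9-]+)+[ \t]+[0-9]{7}[ \t]+.*(?:\r?\n(?![ \t]*$).*)*",
    re.MULTILINE,
)

# ASSUMPTION: column offsets inferred from the excerpt (4-space indent included).
# One entry per line of a block: (field name, start, end); None = to end of line.
LAYOUT = [
    [("well_name", 4, 41), ("licence_number", 41, 51),
     ("mineral_rights", 51, 72), ("ground_elevation", 72, None)],
    [("unique_identifier", 4, 26), ("surface_coordinates", 26, 51),
     ("board_field_centre", 51, 72), ("projected_depth", 72, None)],
    [("lahee_classification", 4, 41), ("field", 41, 72),
     ("terminating_zone", 72, None)],
    [("drilling_operation", 4, 41), ("well_purpose", 41, 51),
     ("well_type", 51, 72), ("substance", 72, None)],
    [("licensee", 4, 72), ("surface_location", 72, None)],
]


def parse_licences(text):
    """Return one dict per licence block found in text."""
    records = []
    for match in BLOCK_RE.finditer(text):
        lines = match.group().splitlines()
        if len(lines) != len(LAYOUT):
            continue  # unexpected block shape; skip rather than guess
        record = {}
        for line, columns in zip(lines, LAYOUT):
            for name, start, end in columns:
                record[name] = line[start:end].strip()
        records.append(record)
    return records


SAMPLE = (
    "    DATE: 02 July 2019\n"
    "\n"
    "    MEG K7N HARDY 4-7-77-5               0483923   ALBERTA CROWN        571.7M\n"
    "    106/04-07-077-05W4/02  S  572.4M  W  278.3M    BONNYVILLE           1600.0M\n"
    "    DEV (NC)                             HARDY                          MCMURRAY FM\n"
    "    HORIZONTAL                           RESUMPTIONPRODUCTION (SCHEME)  CRUDE BITUMEN\n"
    "    MEG ENERGY CORP.                                                    09-07-077-05W4\n"
    "\n"
    "    SURGE ENERGY HZ103 VALHALLA 6-7-75-8 0494996   ALBERTA CROWN        770.8M\n"
    "    103/06-07-075-08W6/00  S  372.0M  E  324.5M    GRANDE PRAIRIE       3350.0M\n"
    "    DEV (NC)                             VALHALLA                       DOIG FM\n"
    "    HORIZONTAL                           NEW       PRODUCTION           CRUDE OIL\n"
    "    SURGE ENERGY INC.                                                   13-06-075-08W6\n"
)

records = parse_licences(SAMPLE)
for record in records:
    print(record["licence_number"], record["well_name"], record["well_purpose"])
```

Each record is a dict keyed by field name; `[list(r.values()) for r in records]` yields one plain list per licence in header order, ready for a database insert. Blocks that do not have exactly five lines are skipped so that a layout change surfaces as missing records instead of silently shifted fields.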
