Extracting text into a dictionary in Python

Posted 2024-06-16 13:28:01


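I'm a beginner. Can anyone help me convert the text below into a dictionary, using a regular expression or any other technique?

The text "Bus Number: Departure" is common to all messages/blocks.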

KPN_Sleeper: Bus Number: Departure 
Bus code: Kpn-866489 KA-01-7233 Bangalore 
AC Sleeper/56 Seats
24 Seats booked 

SRS: Bus Number: Departure 
Bus code: SRS-5858 KA-31-5985 Bangalore 


SAM: Bus Number: Departure 
Bus code: SAM-0077 TN-23-0777 Chennai 

Expected output:

{0:{
  "Bus_name": "KPN_Sleeper",
  "Bus code":"Kpn-866489",
  "Bus Number": "KA-01-7233",
  "Departure": "Bangalore",
  "others": "AC Sleeper/56 Seats 24 Seats booked "
},
1:{
  "Bus_name": "SRS",
  "Bus code":"SRS-5858",
  "Bus Number": "KA-31-5985",
  "Departure": "Bangalore",
  "others": ""
}}

Since I'm new to coding and to regular expressions, I'm finding this hard to construct.


1 answer
User
#1 · Posted 2024-06-16 13:28:01

Based on your comments, I think you can try the following:

^(.*):\s*Bus Number: Departure\s*\nBus code:\s*([^ ]+)\s([^ ]+)\s([^\n]+)[ \t]*(?:\n|$)((?:[^\n]+(?:\n|$))+)?


Sample code:

import re

regex = r"^(.*):\s*Bus Number: Departure\s*\nBus code:\s*([^ ]+)\s([^ ]+)\s([^\n]+)[ \t]*(?:\n|$)((?:[^\n]+(?:\n|$))+)?"

test_str = ("KPN_Sleeper: Bus Number: Departure \n"
    "Bus code: Kpn-866489 KA-01-7233 Bangalore dfdf\n"
    "AC Sleeper/56 Seats\n"
    "24 Seats booked \n\n"
    "SRS: Bus Number: Departure \n"
    "Bus code: SRS-5858 KA-31-5985 Bangalore dfdf dfd\n\n\n"
    "SAM: Bus Number: Departure \n"
    "Bus code: SAM-0077 TN-23-0777 Chennai \n"
    "asdfadf ;kasdjlfads;f lkadsjf")

matches = re.finditer(regex, test_str, re.MULTILINE)


for match in matches:
    print("Bus Name: "+match.group(1)+" Bus Code: "+match.group(2)+" Bus No: "+match.group(3)+" Departure: "+match.group(4))


# The "others" value is available in match.group(5); that group is optional,
# so it is None when a block has no extra lines.
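
The sample code above only prints the captured fields, while the question asks for a dictionary keyed by index. A minimal sketch of that last step follows, reusing the same regex. The helper name parse_buses and the choice to join the extra lines with single spaces are my own assumptions, not part of the original answer.

```python
import re

# Same pattern as in the answer above, compiled once with re.MULTILINE.
PATTERN = re.compile(
    r"^(.*):\s*Bus Number: Departure\s*\nBus code:\s*([^ ]+)\s([^ ]+)\s([^\n]+)[ \t]*(?:\n|$)((?:[^\n]+(?:\n|$))+)?",
    re.MULTILINE,
)


def parse_buses(text):
    """Return {index: {...}} with one entry per matched block (hypothetical helper)."""
    result = {}
    for index, match in enumerate(PATTERN.finditer(text)):
        # Group 5 is None when a block has no extra lines.
        others = match.group(5) or ""
        result[index] = {
            "Bus_name": match.group(1),
            "Bus code": match.group(2),
            "Bus Number": match.group(3),
            # [^\n]+ is greedy and also captures trailing spaces, so strip them.
            "Departure": match.group(4).strip(),
            # Collapse the extra lines into one space-separated string.
            "others": " ".join(others.split()),
        }
    return result


sample = (
    "KPN_Sleeper: Bus Number: Departure \n"
    "Bus code: Kpn-866489 KA-01-7233 Bangalore \n"
    "AC Sleeper/56 Seats\n"
    "24 Seats booked \n"
    "\n"
    "SRS: Bus Number: Departure \n"
    "Bus code: SRS-5858 KA-31-5985 Bangalore \n"
    "\n"
    "\n"
    "SAM: Bus Number: Departure \n"
    "Bus code: SAM-0077 TN-23-0777 Chennai \n"
)

buses = parse_buses(sample)
for index, bus in buses.items():
    print(index, bus)
```

Note that this relies on blocks being separated by at least one blank line. Without a blank line, the optional "others" group would swallow the header of the next block.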

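Explanation:

  1. ^(.*):\s* --> (.*) is the first capture group and gets the bus name; \s* covers the whitespace after the colon

  2. Bus Number: Departure\s*\n --> the literal text "Bus Number: Departure", followed by whitespace and a newline

  3. Bus code:\s* --> the next line starts with "Bus code", a colon, and optional whitespace

  4. ([^ ]+)\s([^ ]+)\s([^\n]+)[ \t]*

    a) ([^ ]+) --> the bus code; \s --> whitespace

    b) ([^ ]+) --> the bus number; \s --> whitespace

    c) ([^\n]+) --> the departure, which may consist of several words

    d) [ \t]* --> covers trailing whitespace after the departure

  5. (?:\n|$) --> covers a newline or the end of the string

  6. ((?:[^\n]+(?:\n|$))+)?

    a) [^\n]+(?:\n|$) --> matches anything except a newline, followed by a newline or the end of the string

    b) ?: makes the group non-capturing

    c) + means there can be several such lines

    d) the outermost () collects all of the "other" lines into one group

    e) the final ? makes the whole "others" part optional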
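
The numbered steps above can also be written into the pattern itself. The following is a sketch only, assuming you are free to restructure the regex: it uses re.VERBOSE so each step gets its own commented line, and named groups (the names name, code, number, departure, and others are my own choice) instead of numbered ones. In verbose mode, whitespace outside a character class is ignored, so the literal spaces are written as [ ].

```python
import re

BUS_BLOCK = re.compile(
    r"""
    ^(?P<name>.*):\s*                 # 1. bus name, colon, optional whitespace
    Bus[ ]Number:[ ]Departure\s*\n    # 2. literal header text, then a newline
    Bus[ ]code:\s*                    # 3. start of the second line
    (?P<code>[^ ]+)\s                 # 4a. bus code
    (?P<number>[^ ]+)\s               # 4b. bus number
    (?P<departure>[^\n]+)[ \t]*       # 4c/4d. departure, may be several words
    (?:\n|$)                          # 5. newline or end of string
    (?P<others>(?:[^\n]+(?:\n|$))+)?  # 6. optional extra lines
    """,
    re.MULTILINE | re.VERBOSE,
)

text = (
    "SAM: Bus Number: Departure \n"
    "Bus code: SAM-0077 TN-23-0777 Chennai \n"
)

match = BUS_BLOCK.search(text)
fields = match.groupdict()  # maps each group name to its captured text (or None)
print(fields)
```

Because [^\n]+ is greedy, the departure value keeps any trailing spaces from the input, so you will probably want to call .strip() on it before storing it.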
