How do I parse some text data?

Posted 2024-10-03 15:25:01


I have a text file in this format:

B2100 Door Driver Key Cylinder Switch Failure B2101 Head Rest Switch Circuit Failure B2102 Antenna Circuit Short to Ground (plus 1000 more lines)

This is what I want:

B2100*Door Driver Key Cylinder Switch Failure B2101*Head Rest Switch Circuit Failure B2102*Antenna Circuit Short to Ground B2103*Antenna Not Connected B2104*Door Passenger Key Cylinder Switch Failure

That way I can paste the data into LibreOffice Calc and have it split into two columns: the code and its meaning.

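My thought process:
Apply a regular expression to match Bxxxx, put an asterisk after it (to act as the delimiter), put a \n before the meaning (I don't know whether that would work), and remove the whitespace up to the next character.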

I am trying to isolate B2100 and have failed so far. My naive attempt:

import re

text = """B2100 Door Driver Key Cylinder Switch Failure B2101   Head Rest Switch Circuit Failure B2102  Antenna Circuit Short to Ground B2103   Antenna Not Connected B2104 Door Passenger Key Cylinder Switch Failure B2105    Throttle Position Input Out of Range Low B2106  Throttle Position Input Out of Range High B2107 Front Wiper Motor Relay Circuit Short to Vbatt B2108    Trunk Key Cylinder Switch Failure"""
# text_arr = text.split("\^B[0-9][0-9][0-9][0-9]$\gi");
l = re.compile('\^B[0-9][0-9][0-9][0-9]$\gi').split(text)
print(l)

This outputs:

['B2100\tDoor Driver Key Cylinder Switch Failure B2101\tHead Rest Switch Circuit Failure B2102\tAntenna Circuit Short to Ground B2103\tAntenna Not Connected B2104\tDoor Passenger Key Cylinder Switch Failure B2105\tThrottle Position Input Out of Range Low B2106\tThrottle Position Input Out of Range High B2107\tFront Wiper Motor Relay Circuit Short to Vbatt B2108\tTrunk Key Cylinder Switch Failure']

How do I achieve the desired result?

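To break it down further, what I want to do is:
Split everything into arrays of codes (such as B1001) and meanings (the text that follows), then apply each operation (the \n thing) to them separately. If you have a better idea of how to do this, even better. I'd love to hear it.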


Tags: to, key, rest, failure, driver, short, switch, antenna
3 Answers
import re
text = """B2100 Door Driver Key Cylinder Switch Failure B2101   Head Rest Switch Circuit Failure B2102  Antenna Circuit Short to Ground B2103   Antenna Not Connected B2104 Door Passenger Key Cylinder Switch Failure B2105    Throttle Position Input Out of Range Low B2106  Throttle Position Input Out of Range High B2107 Front Wiper Motor Relay Circuit Short to Vbatt B2108    Trunk Key Cylinder Switch Failure"""

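# Split on each code; the capturing group keeps the codes in the result.
l = [i for i in re.split(r'(B[0-9]{4}\s+)', text) if i]
# Even indices hold the codes, odd indices hold the meanings.
print('\n'.join('{}*{}'.format(id_.strip(), label.strip()) for id_, label in zip(l[0::2], l[1::2])))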

re.split keeps the delimiters in the result if the regular expression wraps them in a capturing group (). The code above produces:

B2100*Door Driver Key Cylinder Switch Failure
B2101*Head Rest Switch Circuit Failure
B2102*Antenna Circuit Short to Ground
B2103*Antenna Not Connected
B2104*Door Passenger Key Cylinder Switch Failure
B2105*Throttle Position Input Out of Range Low
B2106*Throttle Position Input Out of Range High
B2107*Front Wiper Motor Relay Circuit Short to Vbatt
B2108*Trunk Key Cylinder Switch Failure
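To make the capturing-group behaviour concrete, here is a minimal sketch comparing the two forms of the pattern. The shortened `sample` string and the variable names are made up for illustration; only `re.split` itself comes from the answer.

```python
import re

# Hypothetical shortened sample in the same format as the question's data.
sample = "B2100 Door Switch Failure B2101   Head Rest Failure"

# Without a capturing group, the matched codes are dropped from the result.
without_group = re.split(r'B[0-9]{4}\s+', sample)
print(without_group)

# With a capturing group, each matched code is kept as its own list item.
with_group = re.split(r'(B[0-9]{4}\s+)', sample)
print(with_group)
```

In both cases the first item is an empty string, because the text starts with a match; that is why the answer filters with `if i`.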

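Basically, you want to:

  • Find every Bxxxx string in the input.
  • Replace the whitespace before each one with a newline.
  • Replace the whitespace after each one with *.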

All of this can be done with a single re.sub():

re.sub(r'\s*(B\d{4})\s*', r'\n\1*', text).strip()

Matching pattern:

\s*              # Any amount of whitespace
   (B\d{4})      # "B" followed by exactly 4 digits
           \s*   # Any amount of whitespace

Replacement pattern:

\n               # Newline
  \1             # The first parenthesized sequence from the matching pattern (B####)
    *            # Literal "*"

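The purpose of strip() is to trim any leading or trailing whitespace, including the newline that the substitution produces in front of the first B#### sequence.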
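As a runnable sketch of the same substitution, the pattern can be written with `re.VERBOSE` so that the commented breakdown above becomes the actual pattern. The shortened `text` and the names `pattern` and `result` are assumptions for illustration.

```python
import re

# Shortened sample in the same format as the question's data (illustrative only).
text = ("B2100 Door Driver Key Cylinder Switch Failure B2101   Head Rest Switch "
        "Circuit Failure B2102  Antenna Circuit Short to Ground")

# re.VERBOSE ignores whitespace in the pattern and allows inline comments.
pattern = re.compile(r"""
    \s*          # any amount of whitespace before the code
    (B\d{4})     # "B" followed by exactly 4 digits (group 1)
    \s*          # any amount of whitespace after the code
""", re.VERBOSE)

# Replace each match with: newline, the captured code, a literal "*".
result = pattern.sub(r'\n\1*', text).strip()
print(result)
```

Each line of `result` then has the form `code*meaning`, which can be pasted into a spreadsheet with `*` chosen as the separator.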

First of all, your regular expression is wrong: “^B[0-9][0-9][0-9][0-9]$\gi”

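  1. Modifiers such as gi do not work that way in Python.
  2. ^ and $ anchor the match to the start and end of a line, so the pattern matches nothing inside your text.
  3. The repeated [0-9] can be replaced with “[0-9]{4}”.
  4. If you want to ignore case, use the corresponding Python regex flag (re.IGNORECASE).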
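Points 1, 2 and 4 can be illustrated with a small sketch. The mixed-case `text` and the variable names are invented for this example; the calls used (`re.compile`, `findall`, `re.IGNORECASE`) are standard `re` functions.

```python
import re

# Hypothetical sample with one lower-case code, to show the effect of the flag.
text = "B2100 Door Switch Failure b2101 Head Rest Failure"

# Python takes flags as a separate argument, not as a "/gi"-style suffix.
code = re.compile(r'B[0-9]{4}', re.IGNORECASE)

# findall already returns every match, so no "global" modifier is needed.
matches = code.findall(text)
print(matches)

# With ^ and $ the whole string would have to be a single code, so nothing is found.
anchored = re.findall(r'^B[0-9]{4}$', text, re.IGNORECASE)
print(anchored)
```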

With that in mind, simple code that does what you need is:

l = [x.strip() for x in re.compile(r'\s*(B\d{4})\s*', re.IGNORECASE).split(text)]
# l[0] is the empty string before the first code, so the code/meaning pairs start at index 1.
lines = ['*'.join(l[i:i+2]) for i in range(1, len(l), 2)]
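For the stated goal of getting two columns into LibreOffice Calc, one possible sketch pairs each code with its meaning and writes the rows with the standard `csv` module, using `*` as the delimiter. The shortened `text`, the variable names, and the in-memory `buffer` are assumptions for illustration, not part of the answer.

```python
import csv
import io
import re

# Shortened sample in the same format as the question's data (illustrative only).
text = ("B2100 Door Driver Key Cylinder Switch Failure B2101   Head Rest Switch "
        "Circuit Failure B2102  Antenna Circuit Short to Ground")

# Split on the codes and keep them; parts[0] is the empty text before the first code.
parts = [p.strip() for p in re.split(r'\s*(B\d{4})\s*', text)]

# Codes sit at odd indices, meanings at the even indices that follow them.
pairs = list(zip(parts[1::2], parts[2::2]))

# Write one row per code, with "*" as the column delimiter.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter='*', lineterminator='\n')
writer.writerows(pairs)
print(buffer.getvalue())
```

To produce a file instead, the buffer could be swapped for `open('codes.csv', 'w', newline='')` (the file name is hypothetical), and `*` selected as the separator when importing.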
