Extracting multiple data fields from inconsistently formatted strings

import re

example = '23.12 22.12.09 Verfügung Geldautomat\t63050000 / 9000481400\tGA NR00002317 BLZ63050000 0\t22.12/14.17UHR ESELSBERGW EUR 50,00\t-50,00'

x = re.search(r'(\S+) (\S+) ([\S| ]+)\t(\S+) / (\S+)\t([\S| ]+)\t([\S| ]+)\t([\S| ]+)', example)
print(x.groups())
>>> ('23.12', '22.12.09', 'Verfügung Geldautomat', '63050000', '9000481400', 'GA NR00002317 BLZ63050000 0', '22.12/14.17UHR ESELSBERGW EUR 50,00', '-50,00')

2 Answers

User

#1 · Edited 2024-06-28 14:56:22

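I'm going to make a couple of assumptions: 1) you will probably never use this code again; 2) there are only a few possible formats.

I wouldn't bother crafting a regex for this, because it doesn't need to be that robust (see assumption 1).

Instead, I would look for some way to tell which format the current line uses, then use a few if statements to send the line through the matching set of splitting steps so the fields come out in the desired order (see assumption 2).

I threw an example together quickly. You will obviously need to change a lot to make it fit your situation, but you get the idea. The hardest part is probably deciding which decoder to use; in my example I used the "location of the tab", meaning the number of comma-separated fields before it.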

def decoder1(line):
    parts = line.split("\t")
    d1, d2 = parts[0].split(",")
    d3, d4, d5, d6, d7, d8, d9 = parts[1].split(",")
    return [d1, d2, d3, d4, d5, d6, d7, d8, d9]


def decoder2(line):
    parts = line.split("\t")
    d1 = parts[0]
    d2, d3, d4, d5, d6, d7, d8, d9 = parts[1].split(",")
    return [d1, d2, d3, d4, d5, d6, d7, d8, d9]


def decoder3(line):
    parts = line.split("\t")
    d1, d2, d3, d4, d5, d6, d7 = parts[0].split(",")
    d8, d9 = parts[1].split(",")

    return [d1, d2, d3, d4, d5, d6, d7, d8, d9]


if __name__ == "__main__":
    lines = [
        "1,2\t3,4,5,6,7,8,9",
        "1\t2,3,4,5,6,7,8,9",
        "1,2,3,4,5,6,7\t8,9",
    ]

    for line in lines:
        # The number of comma-separated fields before the tab identifies the format.
        tablocation = len(line.split("\t")[0].split(","))
        if tablocation == 2:
            res = decoder1(line)
        elif tablocation == 1:
            res = decoder2(line)
        elif tablocation == 7:
            res = decoder3(line)
        else:
            # No decoder is registered for this layout.
            print("Must be a new format for %s" % line)
            res = "NA"
        print(res)

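If you had many more "decoder options", it might be worth spending the time to develop some regexes. Without knowing all the possible variations, though, it is hard to offer more help than the approach shown above.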
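As a rough sketch of what the regex route might look like for the toy lines above: each layout becomes one compiled pattern, and the first pattern that matches decides the format. The helper names `layout` and `decode` and the `FIELD` pattern are made up for this illustration; real data would need its own patterns.

```python
import re

# Assumption: a field is any run of characters containing no comma and no tab.
FIELD = r"[^,\t]+"


def layout(n_before, n_after):
    """Build a regex for n_before comma fields, one tab, then n_after comma fields."""
    before = ",".join(f"({FIELD})" for _ in range(n_before))
    after = ",".join(f"({FIELD})" for _ in range(n_after))
    return re.compile(rf"^{before}\t{after}$")


# One pattern per known format, mirroring decoder1, decoder2 and decoder3.
LAYOUTS = [layout(2, 7), layout(1, 8), layout(7, 2)]


def decode(line):
    """Return the fields of the first layout that matches, or None if none does."""
    for pattern in LAYOUTS:
        match = pattern.match(line)
        if match:
            return list(match.groups())
    return None


print(decode("1,2\t3,4,5,6,7,8,9"))
print(decode("1,2,3\t4"))  # unknown layout
```

Adding a format then means adding one entry to `LAYOUTS` rather than another decoder function and another elif branch.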

User

#2 · Edited 2024-06-28 14:56:22

Your question is a bit confusing, but I think what you are asking is:

How do I specify multiple delimiters to split on, some of which may be more than one character long?

The answer is to use re.split():

s = '1 2 3\t7 / 6\t5\t4\t8'

import re

re.split(r'\s/\s|\s|\t',s)
Out[13]: ['1', '2', '3', '7', '6', '5', '4', '8']

You can then rearrange the final output however you see fit.
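Applied to the sample line from the question, splitting on every whitespace character would also break up fields such as "Verfügung Geldautomat". A possible variation, assuming that " / " only ever appears between the two account numbers, is to split on tabs and " / " only and then handle the first column separately:

```python
import re

example = ('23.12 22.12.09 Verfügung Geldautomat\t63050000 / 9000481400\t'
           'GA NR00002317 BLZ63050000 0\t'
           '22.12/14.17UHR ESELSBERGW EUR 50,00\t-50,00')

# Split on a tab or on " / " only, so spaces inside a field survive.
columns = re.split(r'\t| / ', example)

# The first column holds two dates followed by free text: split at most twice.
date1, date2, description = columns[0].split(' ', 2)

fields = [date1, date2, description] + columns[1:]
print(fields)
```

This yields the same eight fields as the regex in the question, and `fields` is an ordinary list that can be reordered or indexed as needed.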

Note: in most of these multi-delimiter problems you can list the tokens to split on in any order. That is not the case here.


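You need to put \s/\s before \s in the pattern, because the latter matches a substring of what the former matches, and regex alternation tries the alternatives from left to right.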
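To illustrate the ordering issue, here is a small comparison on the same string `s`. The `\t` alternative is left out because `\s` already matches a tab; the names `good` and `bad` are just for this example.

```python
import re

s = '1 2 3\t7 / 6\t5\t4\t8'

# Longer alternative first: " / " is consumed as a single delimiter.
good = re.split(r'\s/\s|\s', s)

# Shorter alternative first: the space before "/" matches \s on its own,
# so the slash is left behind as a separate token.
bad = re.split(r'\s|\s/\s', s)

print(good)  # ['1', '2', '3', '7', '6', '5', '4', '8']
print(bad)   # ['1', '2', '3', '7', '/', '6', '5', '4', '8']
```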
