Parsing an inconsistent but insidious text file

Clnr   Kontonr     Konto         Valuta  Bokföringsdatum  Transaktionsdatum  Referens         Kontohändelse         Belopp
12345  1234567890  vardagskonto  SEK     13-09-30         13-09-30           Hyresgästför     Autogiro              -15,00
12345  1234567890  vardagskonto  SEK     13-09-30         13-09-30           SPOTIFY SPOTIFY  Kortköp/uttag         -19,00
12345  1234567890  vardagskonto  SEK     13-09-30         13-09-30           +46123456789     Swish mottagen        80,00
12345  1234567890  vardagskonto  SEK     13-09-30         13-09-30           PRIS NYCKELKUND  Debiteringsavgift     -49,00
12345  1234567890  vardagskonto  SEK     13-09-27         13-09-27           12345678         direktbetalning       -301,00
12345  1234567890  vardagskonto  SEK     13-09-27         13-09-27           Unionen          Bg-bet. via internet  -125,00
12345  1234567890  vardagskonto  SEK     13-09-26         13-09-26           123456789012345  Överföring            -1 000,00

3 Answers

User

Answer 1 · edited 2024-09-30 06:15:46

For convenience, here is a way to parse this format "automatically":

import re

# Find the whitespace positions common to all rows.
# `data` is the list of lines read from the file, header included.
spaces = sorted(set.intersection(*[
    {m.end() for m in re.finditer(r'\s', line)}
    for line in data
]))

# Split each row at these positions.
for line in data:
    row = []
    p = 0
    for s in spaces:
        row.append(line[p:s])
        p = s
    row.append(line[p:])
    # Drop the empty pieces produced by runs of shared whitespace.
    row = [cell.strip() for cell in row if cell.strip()]
    print(' | '.join(row))  # or whatever you want...

For your data, this prints:

Clnr | Kontonr | Konto | Valuta | Bokföringsdatum | Transaktionsdatum | Referens | Kontohändelse | Belopp
12345 | 1234567890 | vardagskonto | SEK | 13-09-30 | 13-09-30 | Hyresgästför | Autogiro | -15,00
12345 | 1234567890 | vardagskonto | SEK | 13-09-30 | 13-09-30 | SPOTIFY SPOTIFY | Kortköp/uttag | -19,00
12345 | 1234567890 | vardagskonto | SEK | 13-09-30 | 13-09-30 | +46123456789 | Swish mottagen | 80,00
12345 | 1234567890 | vardagskonto | SEK | 13-09-30 | 13-09-30 | PRIS NYCKELKUND | Debiteringsavgift | -49,00
12345 | 1234567890 | vardagskonto | SEK | 13-09-27 | 13-09-27 | 12345678 | direktbetalning | -301,00
12345 | 1234567890 | vardagskonto | SEK | 13-09-27 | 13-09-27 | Unionen | Bg-bet. via internet | -125,00
12345 | 1234567890 | vardagskonto | SEK | 13-09-26 | 13-09-26 | 123456789012345 | Överföring | -1 000,00
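
To try the idea end to end, here is a minimal self-contained sketch in Python 3. The function name `split_on_common_spaces` and the three-column sample are illustrative assumptions, not part of the original answer; the sample is built with fixed-width padding so that it mimics the aligned layout of the file.

```python
import re

def split_on_common_spaces(lines):
    """Split aligned rows at whitespace positions shared by every row."""
    lines = [line.rstrip("\n") for line in lines if line.strip()]
    # Offsets just past a whitespace character, kept only when every row
    # has whitespace at that same offset.
    common = sorted(set.intersection(*[
        {m.end() for m in re.finditer(r"\s", line)} for line in lines
    ]))
    rows = []
    for line in lines:
        cells, start = [], 0
        for pos in common:
            cells.append(line[start:pos])
            start = pos
        cells.append(line[start:])
        # Runs of shared whitespace yield empty cells; drop them.
        rows.append([c.strip() for c in cells if c.strip()])
    return rows

# Hypothetical sample: three of the columns, padded to fixed widths.
raw = [
    ("Referens", "Kontohändelse", "Belopp"),
    ("SPOTIFY SPOTIFY", "Kortköp/uttag", "-19,00"),
    ("123456789012345", "Bg-bet. via internet", "-1 000,00"),
]
sample = ["{:<15}  {:<20}  {}".format(*r) for r in raw]

for row in split_on_common_spaces(sample):
    print(" | ".join(row))
```

Because a split point must be whitespace in every row, values with internal single spaces such as "SPOTIFY SPOTIFY" or "-1 000,00" stay intact as long as at least one other row has a non-space character at that offset.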

User

Answer 2 · edited 2024-09-30 06:15:46

You can use re.split to separate the values. Example:

import re

with open("test.csv") as f:
    raw_data = f.readlines()
header = raw_data[0]
data = raw_data[1:]

for line in data:
    # Split on runs of two or more whitespace characters.
    values = re.split(r"\s{2,}", line.strip())
    print(values)  # show as a list
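
As a small illustration of the same split, the sketch below applies it to a header line and one data line and pairs the results up. The helper name `split_row` and the shortened three-column lines are assumptions made for the example.

```python
import re

COLUMN_GAP = re.compile(r"\s{2,}")  # two or more whitespace characters

def split_row(line):
    """Split one row on runs of at least two whitespace characters."""
    return COLUMN_GAP.split(line.strip())

# Hypothetical shortened lines in the same layout as the file.
header = split_row("Referens         Kontohändelse         Belopp\n")
row = split_row("123456789012345  Överföring            -1 000,00\n")

# Pair each value with its column name.
record = dict(zip(header, row))
print(record)
```

The single space inside "-1 000,00" is not treated as a gap, so the amount stays in one piece. Note that this approach would shift the columns if a cell were empty, since the neighbouring gaps merge into one.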

User

Answer 3 · edited 2024-09-30 06:15:46

Why not use this regular expression:

(.*?)(  +|\r\n|\n|$)

It looks like all the columns are separated by at least two spaces.
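
One possible way to apply this pattern in Python is sketched below. The helper name `fields` and the shortened sample line are assumptions for illustration. The pattern can also match the empty string, for instance at the very end of the line, so the sketch skips empty captures.

```python
import re

# The pattern from the answer: a lazy field followed by a gap of two or
# more spaces, a line break, or the end of the string.
FIELD = re.compile(r"(.*?)(  +|\r\n|\n|$)")

def fields(line):
    """Return the non-empty field captures found in one line."""
    return [m.group(1) for m in FIELD.finditer(line) if m.group(1)]

# Hypothetical shortened line in the same layout as the file.
line = "12345  vardagskonto  SEK  Swish mottagen  80,00\n"
print(fields(line))
```

Skipping empty captures also means that a genuinely empty cell would be dropped, the same limitation as splitting on two or more spaces.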
