Reading a flat file with a mix of single-space and multi-space delimiters

Posted 2024-06-28 20:16:58


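I have a space-delimited file in which the number of spaces between fields varies, and I would like to know the best way to read such a file. I tried pandas with many different delimiter settings, but nothing has worked so far.

The data format I currently have: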

STBID DOCUMENTNO   DOCDATE    CUSTID    CT TOWNID           PRDID     PRD                                                BATCHNO    PRICE        QUANTITY     BONUS        DISCOUNT     AMOUNT       NETAMOUNT    REASON
642    752633       07-07-2021 0092      01 026              4419      OAD X-MEN TAB . 20S                                T-0987     1105.00      2            0            0.00         2210.00      2210.00      R

The data format I need:

STBID,DOCUMENTNO,DOCDATE,CUSTID,CT,TOWNID,PRDID,PRD,BATCHNO,PRICE,QUANTITY,BONUS,DISCOUNT,AMOUNT,NETAMOUNT,REASON
642,752633,07-07-2021,0092,01,026,4419,OAD X-MEN TAB . 20S,T-0987,1105.00,2,0,0.00,2210.00,2210.00,R

2 Answers

Here is a working solution that treats the data as a fixed-width file, using pandas.read_fwf:

import re
import pandas as pd

# read the header line (without the trailing newline)
with open('multi_space.csv', 'r') as f:
    header = f.readline().rstrip('\n')

# get the starting position of each word in the header
starts = [m.start() for m in re.finditer(r'\w+', header)]

# define colspecs (start, stop) for each column; read_fwf treats them as
# half-open intervals [start, stop), so each column stops where the next starts
cols = list(zip(starts, starts[1:] + [len(header)]))

# read fixed width
df = pd.read_fwf('multi_space.csv', colspecs=cols)

Output:

   STBID  DOCUMENTNO     DOCDATE  CUSTID  CT  TOWNID  PRDID                  PRD BATCHNO   PRICE  QUANTITY  BONUS  DISCOUNT  AMOUNT  NETAMOUNT REASON
0    642      752633  07-07-2021      92   1      26   4419  OAD X-MEN TAB . 20S  T-0987  1105.0         2      0       0.0  2210.0     2210.0      R

Info:

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 16 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   STBID       1 non-null      int64  
 1   DOCUMENTNO  1 non-null      int64  
 2   DOCDATE     1 non-null      object 
 3   CUSTID      1 non-null      int64  
 4   CT          1 non-null      int64  
 5   TOWNID      1 non-null      int64  
 6   PRDID       1 non-null      int64  
 7   PRD         1 non-null      object 
 8   BATCHNO     1 non-null      object 
 9   PRICE       1 non-null      float64
 10  QUANTITY    1 non-null      int64  
 11  BONUS       1 non-null      int64  
 12  DISCOUNT    1 non-null      float64
 13  AMOUNT      1 non-null      float64
 14  NETAMOUNT   1 non-null      float64
 15  REASON      1 non-null      object 
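
The dtypes above show CUSTID, CT and TOWNID inferred as int64, so the leading zeros the question asks for (0092, 01, 026) would be lost. A minimal sketch of one way around that, assuming the same header-derived column positions: pass dtype=str to read_fwf and then write the CSV. The sample text and its column widths here are made up for illustration and cover only the first nine columns; a real run would read the file from disk.

```python
import io
import re
import pandas as pd

# Hypothetical sample with made-up column widths (not the real file layout).
widths = [6, 11, 11, 7, 3, 7, 6, 22, 8]
names = ["STBID", "DOCUMENTNO", "DOCDATE", "CUSTID", "CT", "TOWNID",
         "PRDID", "PRD", "BATCHNO"]
values = ["642", "752633", "07-07-2021", "0092", "01", "026",
          "4419", "OAD X-MEN TAB . 20S", "T-0987"]
text = "\n".join(
    "".join(v.ljust(w) for v, w in zip(row, widths)).rstrip()
    for row in (names, values)
) + "\n"

# Column starts come from the header words, as in the answer above.
header = text.splitlines()[0]
starts = [m.start() for m in re.finditer(r"\w+", header)]

# Let the last column run to the end of the longest line.
end = max(len(line) for line in text.splitlines())
colspecs = list(zip(starts, starts[1:] + [end]))

# dtype=str disables numeric inference, so "0092" is not read as 92.
df = pd.read_fwf(io.StringIO(text), colspecs=colspecs, dtype=str)

# index=False keeps the row index out of the CSV.
csv_text = df.to_csv(index=False)
print(csv_text)
```

With the sample above, the data row comes out as 642,752633,07-07-2021,0092,01,026,4419,OAD X-MEN TAB . 20S,T-0987. Reading everything as strings means any numeric work afterwards needs an explicit conversion on the columns that require it.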

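You can try the following:

import pandas as pd

# the file contains no commas, so each line is read into a single column
df = pd.read_csv("texttest", header=None)
print(df)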
                                                                                                                                                                                                                          0
0  STBID DOCUMENTNO   DOCDATE    CUSTID    CT TOWNID           PRDID     PRD                                                BATCHNO    PRICE        QUANTITY     BONUS        DISCOUNT     AMOUNT       NETAMOUNT    REASON
1      642    752633       07-07-2021 0092      01 026              4419      OAD X-MEN TAB . 20S                                T-0987     1105.00      2            0            0.00         2210.00      2210.00      R

Now use replace to convert the runs of whitespace into commas:

df = df.replace(r'\s+', ',', regex=True)
print(df)
                                                                                                                   0
0  STBID,DOCUMENTNO,DOCDATE,CUSTID,CT,TOWNID,PRDID,PRD,BATCHNO,PRICE,QUANTITY,BONUS,DISCOUNT,AMOUNT,NETAMOUNT,REASON
1               642,752633,07-07-2021,0092,01,026,4419,OAD,X-MEN,TAB,.,20S,T-0987,1105.00,2,0,0.00,2210.00,2210.00,R
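
In the output above, PRD has been split into five fields because \s+ also matches the single spaces inside the product name. Requiring two or more spaces does not fully solve it either, since some neighbouring columns in this file are separated by just one space. A small sketch of both effects, using a hypothetical fragment of a data line (the spacing is made up):

```python
import re

# Hypothetical fragment: DOCDATE and CUSTID are one space apart, and the
# product name contains single spaces as well.
line = "07-07-2021 0092   4419   OAD X-MEN TAB . 20S   T-0987"

# Any run of whitespace becomes a comma: the product name is broken up.
any_whitespace = re.sub(r"\s+", ",", line)

# Only runs of 2+ spaces become commas: the product name survives,
# but DOCDATE and CUSTID stay glued together.
two_or_more = re.sub(r"\s{2,}", ",", line)

print(any_whitespace)  # 07-07-2021,0092,4419,OAD,X-MEN,TAB,.,20S,T-0987
print(two_or_more)     # 07-07-2021 0092,4419,OAD X-MEN TAB . 20S,T-0987
```

Neither pattern can tell a separator from a space inside a value, which is why splitting by column position tends to be more reliable for this kind of file.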

Finally, save it to a file without the index and header:

df.to_csv('new_csv2', index=False, header=False)

$ cat new_csv2

"STBID,DOCUMENTNO,DOCDATE,CUSTID,CT,TOWNID,PRDID,PRD,BATCHNO,PRICE,QUANTITY,BONUS,DISCOUNT,AMOUNT,NETAMOUNT,REASON"
"642,752633,07-07-2021,0092,01,026,4419,OAD,X-MEN,TAB,.,20S,T-0987,1105.00,2,0,0.00,2210.00,2210.00,R"

Edit:

Because the data is inconsistent, you can use the following as a workaround for a single case. It is not dynamic, though, and it would be better to clean the data while processing it.

df = df.replace(r"OAD,X-MEN,TAB,\.,20S", "OAD X-MEN TAB . 20S", regex=True)
print(df)
                                                                                                                   0
0  STBID,DOCUMENTNO,DOCDATE,CUSTID,CT,TOWNID,PRDID,PRD,BATCHNO,PRICE,QUANTITY,BONUS,DISCOUNT,AMOUNT,NETAMOUNT,REASON
1               642,752633,07-07-2021,0092,01,026,4419,OAD X-MEN TAB . 20S,T-0987,1105.00,2,0,0.00,2210.00,2210.00,R
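
One possible way to make this cleanup dynamic, rather than hard-coding a replacement for each product name, is to cut every line at the column offsets taken from the header instead of splitting on whitespace. The sketch below uses only the standard library and assumes that each data field starts at or after the start of its header word; the sample lines and their widths are hypothetical.

```python
import csv
import io
import re

# Hypothetical sample with made-up column widths; a real script would read
# the lines from the input file instead.
widths = [6, 11, 11, 7, 3, 7, 6, 22, 8]
names = ["STBID", "DOCUMENTNO", "DOCDATE", "CUSTID", "CT", "TOWNID",
         "PRDID", "PRD", "BATCHNO"]
values = ["642", "752633", "07-07-2021", "0092", "01", "026",
          "4419", "OAD X-MEN TAB . 20S", "T-0987"]
lines = ["".join(v.ljust(w) for v, w in zip(row, widths)).rstrip()
         for row in (names, values)]

# Column boundaries: each column starts where its header word starts.
starts = [m.start() for m in re.finditer(r"\w+", lines[0])]
stops = starts[1:] + [None]  # None slices to the end of the line

out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
for line in lines:
    # Slice by position, then strip the padding around each value.
    writer.writerow([line[a:b].strip() for a, b in zip(starts, stops)])

csv_text = out.getvalue()
print(csv_text)
```

Because the values are never parsed as numbers, leading zeros such as 0092 are kept, and spaces inside PRD are left alone.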
