修复在pandas中读取csv/txt时下载的错误行

2024-10-01 22:34:13 发布

您现在位置:Python中文网/ 问答频道 /正文

我每天从amazon aws下载一个数据集。问题是有一些行下载错误(参见图片。也可以下载示例here)。以“ref”开头的那两行应附加在以“001ec214-97e6-4d84-a64a-d1bee0079fd8”开头的前一行中,以使该行正确。我有上百个这样的案例需要解决,所以我想知道是否有一种方法可以使用pandas read_csv函数来完成这种追加。此外,我不能忽略这些行,因为我会丢失有用的数据

enter image description here


Tags: csv数据方法awsref示例amazonpandas
2条回答

此oneliner应具备以下功能:

from io import StringIO
import pandas as pd

df = pd.read_csv(StringIO(open('filename.csv').read().replace('\n\\\nref :', ' ref :')), sep="\t", header=None)

它读取文件并替换多余的换行符,然后将csv文件的字符串表示形式加载到带有StringIO的pandas中

鉴于问题Google doc的案文:

001ea9a6-2c30-4ce6-b4ee-f445803303db    596d89bf-e641-4e9c-9374-695241694a6d    45640   MUNDO JURIDICO  26204   20350   \N  A0000000031010  10530377    MATEO FLORES ELMER  ASOC INTEGRACION LOS OLIVOS AV NAPOLES MZ I LTE 31  visanet 650011581   1   0   mpos    contactless 0   0   0   visa    421410  5969    12  24  c6d06de1-c9f0-4b4a-992b-ad3a5f263039    000066  666407  2021-07-20 09:31:23 000 301201522830426 995212014496392 158348  \N  \N  \N  PEN 200.00  200.00  0.00    1   0   0.00    200026214   5999    ASOC INTEGRACION LOS OLIVOS AV NAPOLES MZ I LTE 31  MUNDO JURIDICO  947761525   AUTHORIZED  2021-07-20 09:31:24 2021-07-20 09:32:00 210720  093123
001ec0fd-8d0e-4332-851a-bcd93fdf0a37    ee32d094-8fc4-4d92-b788-58ca2ae2a590    36750   Chifa Maryori   18923   25313   \N  A0000000031010  46753818    Chifa Maryori   San isidro Peru visanet 650011581   1   0   mpos    contactless 0   0   0   visa    421355  5765    04  26  7d00f10e-d0b7-40ec-9b97-d20590cb7710    000708  620744  2021-06-27 16:52:21 000 301178787424243 996211783887688 100513  \N  \N  \N  PEN 17.00   17.00   0.00    0   0   0.00    400034782   5499    San isidro Peru Chifa Maryori   +51 01 5179805  AUTHORIZED  2021-06-27 16:52:23 2021-06-27 16:52:23 210627  165221
001ec214-97e6-4d84-a64a-d1bee0079fd8    3c4a98cc-d279-4f9e-af8b-5198647889c5    33214   Inversiones Polleria el rey MPOS    15699   23053   \N  \N  88846910264 JoseOrbegoso Puelle     D 5 Lt 2 urb Mariscal
\
ref : altura de elektra visanet 650011581   1   0   mpos    chip    0   0   0   visa    455788  2123    09  22  8b3fd975-140a-42d5-8601-1b4bcc16bd60    000022  170790  2020-10-20 11:32:44 687 \N  \N  \N  \N  \N  \N  PEN 1.00    1.00    0.00    0   0   0.00    200020168   5999    D 5 Lt 2 urb Mariscal
\
ref : altura de elektra Inversiones Polleria el rey MPOS    964974226   DENIED  2020-10-20 11:32:44 2020-10-20 11:32:44 201020  113244
001ec66f-6350-4a33-a1fe-b9375dac2161    34169a7a-a66f-4258-80c2-8c0d4512aa36    27044   ABRAHAM LORENZO MELGAREJO   10353   13074   \N  \N  99944748991 ABRAHAM LORENZO MELGAREJO   JR RENOVACION 399   visanet 650011581   1   0   mpos    chip    0   0   0   visa    455788  4712    08  24  83915520-1a1f-4d5e-b118-f0c57a9d96ae    000367  161286  2020-10-14 15:59:51 000 300288755920408 995202889603642 242407  \N  \N  \N  PEN 15.00   15.00   0.00    1   0   0.00    200012523   5811    JR RENOVACION 399   ABRAHAM LORENZO MELGAREJO   957888909   AUTHORIZED  2020-10-14 15:59:52 2020-10-14 16:00:21 201014  155951
001eebaf-bccc-47a7-87a3-be8b09eb971b    c5e14889-d61c-4f18-8000-d3bac0564cfb    27605   Polleria Arroyo Express 10792   21904   \N  A0000000031010  41429707    Polleria Arroyo Express San isidro Peru visanet 650011581   1   0   mpos    contactless 0   0   0   visa    421355  9238    09  25  3c5268e8-5731-4fea-905d-45b03ed623d2    000835  379849  2021-01-30 19:43:36 000 301031026163928 995210303271986 928110  \N  \N  \N  PEN 24.00   24.00   0.00    0   0   0.00    400026444   5499    San isidro Peru Polleria Arroyo Express +51 01 5179805  AUTHORIZED  2021-01-30 19:43:37 2021-01-30 19:43:37 210130  194336

…使用来自Concatenate lines with previous line based on number of letters in first column的代码,如果“好的行”都从001开始(这就是下面正则表达式检查的内容),您可以尝试

代码:

import re
import pandas as pd

all_the_data = ""

pattern = r'^001.*$'

with open('aws.txt') as f, open('aws_out.csv','w') as output:
    all_the_data = ""
    for line in f:
        if not re.search(pattern, line):
            all_the_data = re.sub("\n$", "", all_the_data)
        all_the_data = "".join([all_the_data, line])
        
    output.write(all_the_data)

df = pd.read_csv('aws_out.csv', header=None, sep='\t')

print(df)

输出示例:

enter image description here

相关问题 更多 >

    热门问题