Parsing a CSV file with commas in its values in pandas

Posted 2024-06-26 14:53:52


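I need to create a pandas.DataFrame from a CSV file, and I am using pandas.read_csv(...) to do it. The problem with this file is that one or more columns contain commas inside their values (I do not control the file format). I tried to implement the solution from this question, but I get the following error: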

pandas.errors.EmptyDataError: No columns to parse from file 

For some reason, after applying that solution, the CSV file I was trying to fix ends up empty.

Here is the code I am using:

^{pr2}$

Any ideas?

Data overview:

 Id0    Id 1    Id 2    Country    Company    Title    Email
 23     123     456     AR         name       cargador    email@email.com
 24     123     456     AR         name       Executive assistant    email@email.com
 25     123     456     AR         name       Asistente Administrativo    email@email.com
 26     123     456     AR         name       Atención al cliente vía telefónica    vía online    email@email.com
 39     123     456     AR         name       Asesor de ventas    email@email.com
 40     123     456     AR         name       inc.    International company representative    email@email.com
 41     123     456     AR         name       Vendedor de campo    email@email.com
 42     123     456     AR         name       PUBLICIDAD    ATENCIÓN AL CLIENTE    email@email.com
 43     123     456     AR         name       Asistente de Marketing    email@email.com
 44     123     456     AR         name       SOLDADOR    email@email.com
 217    123     456     AR         name       Se requiere vendedores    Loja    Quevedo    Guayas)    email@email.com
 218    123     456     AR         name       Ing. Civil recién graduado    Yaruquí    email@email.com
 219    123     456     AR         name       ayudantes enfermeria    email@email.com
 220    123     456     AR         name       Trip Leader for International Youth Exchange    email@email.com
 221    123     456     AR         name       COUNTRY MANAGER / DIRECTOR COMERCIAL    email@email.com
 250    123     456     AR         name       Ayudante de Pasteleria    email@email.com    Asesor    email@email.com    email@email.com

Pre-parsed CSV:

#,Id 1,Id 2,Country,Company,Title,Email,,,,
23,123,456,AR,name,cargador,email@email.com,,,,
24,123,456,AR,name,Executive assistant,email@email.com,,,,
25,123,456,AR,name,Asistente Administrativo,email@email.com,,,,
26,123,456,AR,name,Atención al cliente vía telefónica , vía online,email@email.com,,,
39,123,456,AR,name,Asesor de ventas,email@email.com,,,,
40,123,456,AR,name, inc.,International company representative,email@email.com,,,
41,123,456,AR,name,Vendedor de campo,email@email.com,,,,
42,123,456,AR,name,PUBLICIDAD, ATENCIÓN AL CLIENTE,email@email.com,,,
43,123,456,AR,name,Asistente de Marketing,email@email.com,,,,
44,123,456,AR,name,SOLDADOR,email@email.com,,,,
217,123,456,AR,name,Se requiere vendedores,, Loja , Quevedo, Guayas),email@email.com
218,123,456,AR,name,Ing. Civil recién graduado, Yaruquí,email@email.com,,,
219,123,456,AR,name,ayudantes enfermeria,email@email.com,,,,
220,123,456,AR,name,Trip Leader for International Youth Exchange,email@email.com,,,,
221,123,456,AR,name,COUNTRY MANAGER / DIRECTOR COMERCIAL,email@email.com,,,,
250,123,456,AR,name,Ayudante de Pasteleria,email@email.com, Asesor,email@email.com,email@email.com,
251,123,456,AR,name,Ejecutiva de Ventas,email@email.com,,,,

1 answer
User
#1 · Posted 2024-06-26 14:53:52

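If you can assume that any comma inside Company is followed by a space, and that all remaining stray commas fall in the column just before the e-mail address, then you can write a small parser to handle this.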
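The two heuristics this relies on can be tried on individual fields first. The sketch below is only an illustration: `looks_like_email` and `is_company_continuation` are hypothetical helper names, the regular expression is the loose pattern used by the parser in this answer, and the sample fields are copied from the question's data.

```python
import re

# Loose e-mail pattern: something@something.something
VALID_EMAIL = re.compile(r'[^@]+@[^@]+\.[^@]+')


def looks_like_email(field):
    """Return True if the field starts with an e-mail-like token."""
    return VALID_EMAIL.match(field) is not None


def is_company_continuation(field):
    """Assumption: a stray comma inside Company is followed by a space."""
    return field.startswith(' ')


# Fields taken from the sample rows in the question
print(looks_like_email('email@email.com'))    # True
print(looks_like_email(' vía online'))        # False
print(is_company_continuation(' inc.'))       # True
print(is_company_continuation('cargador'))    # False
```

Both checks are deliberately crude. A title fragment that happens to contain an `@`, or a Title value that itself begins with a space, would be misclassified, so they are only as reliable as the assumptions stated above.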

Code:

import csv
import re

VALID_EMAIL = re.compile(r'[^@]+@[^@]+\.[^@]+')

def read_my_csv(file_handle):
    # build csv reader
    reader = csv.reader(file_handle)

    # get the header, and find the e-mail and title columns
    header = next(reader)
    email_column = header.index('Email')
    title_column = header.index('Title')

    # yield the header up to the e-mail column
    yield header[:email_column+1]

    # for each row, rebuild the columns that were split on stray commas
    for row in reader:

        # for each row, put the Company column back together
        while row[title_column].startswith(' '):
            row[title_column-1] += ',' + row[title_column]
            del row[title_column]

        # for each row, put the Title column back together
        while not VALID_EMAIL.match(row[email_column]):
            row[email_column-1] += ',' + row[email_column]
            del row[email_column]
        yield row[:email_column+1]

Test code:

^{pr2}$

Result:

      # Id 1 Id 2 Country     Company  \
0    23  123  456      AR        name   
1    24  123  456      AR        name   
2    25  123  456      AR        name   
3    26  123  456      AR        name   
4    39  123  456      AR        name   
5    40  123  456      AR  name, inc.   
6    41  123  456      AR        name   
7    42  123  456      AR        name   
8    43  123  456      AR        name   
9    44  123  456      AR        name   
10  217  123  456      AR        name   
11  218  123  456      AR        name   
12  219  123  456      AR        name   
13  220  123  456      AR        name   
14  221  123  456      AR        name   
15  250  123  456      AR        name   
16  251  123  456      AR        name   

                                               Title            Email  
0                                           cargador  email@email.com  
1                                Executive assistant  email@email.com  
2                           Asistente Administrativo  email@email.com  
3    Atención al cliente vía telefónica , vía online  email@email.com  
4                                   Asesor de ventas  email@email.com  
5               International company representative  email@email.com  
6                                  Vendedor de campo  email@email.com  
7                    PUBLICIDAD, ATENCIÓN AL CLIENTE  email@email.com  
8                             Asistente de Marketing  email@email.com  
9                                           SOLDADOR  email@email.com  
10  Se requiere vendedores,, Loja , Quevedo, Guayas)  email@email.com  
11               Ing. Civil recién graduado, Yaruquí  email@email.com  
12                              ayudantes enfermeria  email@email.com  
13      Trip Leader for International Youth Exchange  email@email.com  
14              COUNTRY MANAGER / DIRECTOR COMERCIAL  email@email.com  
15                            Ayudante de Pasteleria  email@email.com  
16                               Ejecutiva de Ventas  email@email.com  
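The approach can also be exercised without a file on disk. The following is a minimal sketch rather than the original test code: it restates the parser under the hypothetical name `repair_rows` so that the block runs on its own, and feeds it three rows copied from the question's pre-parsed CSV through `io.StringIO`.

```python
import csv
import io
import re

VALID_EMAIL = re.compile(r'[^@]+@[^@]+\.[^@]+')

# Header plus three rows copied from the pre-parsed CSV in the question
SAMPLE = (
    "#,Id 1,Id 2,Country,Company,Title,Email,,,,\n"
    "23,123,456,AR,name,cargador,email@email.com,,,,\n"
    "40,123,456,AR,name, inc.,International company representative,email@email.com,,,\n"
    "42,123,456,AR,name,PUBLICIDAD, ATENCIÓN AL CLIENTE,email@email.com,,,\n"
)


def repair_rows(file_handle):
    """Yield the trimmed header, then each row with Company and Title rejoined."""
    reader = csv.reader(file_handle)
    header = next(reader)
    email_col = header.index('Email')
    title_col = header.index('Title')
    yield header[:email_col + 1]
    for row in reader:
        # A field starting with a space is treated as the tail of Company
        while row[title_col].startswith(' '):
            row[title_col - 1] += ',' + row[title_col]
            del row[title_col]
        # Merge fields into Title until the e-mail column holds an e-mail
        while not VALID_EMAIL.match(row[email_col]):
            row[email_col - 1] += ',' + row[email_col]
            del row[email_col]
        yield row[:email_col + 1]


rows = list(repair_rows(io.StringIO(SAMPLE)))
header, records = rows[0], rows[1:]
for record in records:
    print(record)
```

Once every record has the same seven fields as the header, the rows could be handed to pandas, for example with `pandas.DataFrame(records, columns=header)`. Note that a row with no e-mail-like field at all would run past the end of the list and raise an `IndexError`, so this sketch assumes every row contains at least one e-mail address.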
