Reading a CSV with a regular-expression separator

6 Rotterdam NLD Zuid-Holland 593321
19 Zaanstad NLD Noord-Holland 135621
214 Porto Alegre BRA Rio Grande do Sul 1314032
397 Lauro de Freitas BRA Bahia 109236
547 Dobric BGR Varna 100399
552 Bujumbura BDI Bujumbura 300000
554 Santiago de Chile CHL Santiago 4703954
626 al-Minya EGY al-Minya 201360
646 Santa Ana SLV Santa Ana 139389
762 Bahir Dar ETH Amhara 96140
123 Chicago 10000
222 New York 200000

My attempt

import numpy as np
import pandas as pd
from io import StringIO

s = """6 Rotterdam NLD Zuid-Holland 593321
19 Zaanstad NLD Noord-Holland 135621
214 Porto Alegre BRA Rio Grande do Sul 1314032
397 Lauro de Freitas BRA Bahia 109236
547 Dobric BGR Varna 100399
552 Bujumbura BDI Bujumbura 300000
554 Santiago de Chile CHL Santiago 4703954
626 al-Minya EGY al-Minya 201360
646 Santa Ana SLV Santa Ana 139389
762 Bahir Dar ETH Amhara 96140
123 Chicago 10000
222 New York 200000
"""

sep = r'(\d+)\s+|([\D]+)\s+|(\d+)\s+'
df = pd.read_csv(StringIO(s), sep=sep, engine='python')
df

I get a lot of NaN values. How can I end up with just 3 columns?

The column names are: ID, CITY, POPULATION

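2 answers

User

Answer 1 · edited 2024-10-06 12:20:57

Just to offer an alternative solution that does not use a regular expression:

You can also parse the text file in plain Python. In some cases this is easier to maintain than a fairly complex regular expression.

For this particular format, we know that the first and the last number on each line have a special meaning: the first is the ID and the last is the population. So I would pick them out with split and rsplit, and treat everything in between as the city.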

import pandas as pd
from io import StringIO

s = """6 Rotterdam NLD Zuid-Holland 593321 
19 Zaanstad NLD Noord-Holland 135621 
214 Porto Alegre BRA Rio Grande do Sul 1314032 
397 Lauro de Freitas BRA Bahia 109236 
547 Dobric BGR Varna 100399 
552 Bujumbura BDI Bujumbura 300000 
554 Santiago de Chile CHL Santiago 4703954 
626 al-Minya EGY al-Minya 201360 
646 Santa Ana SLV Santa Ana 139389 
762 Bahir Dar ETH Amhara 96140 
123 Chicago 10000 
222 New York 200000  """

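data = []
for line in StringIO(s):
    line = line.strip()
    if not line:
        # skip blank lines
        continue
    # the first space-separated token is the ID
    id_value, line = line.split(" ", 1)
    # the last space-separated token is the population; the rest is the city
    city, population = line.rsplit(" ", 1)

    data.append((id_value, city, population))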

df = pd.DataFrame(data, columns=["id", "city", "population"])
df["id"] = pd.to_numeric(df["id"])
df["population"] = pd.to_numeric(df["population"])
print(df)

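I have not done any speed measurements. Depending on the file size, though, speed may not be an issue at all. And even if it is, I would use this script to preprocess the data first (and only once), so that afterwards the file can be read with the regular old pd.read_csv without any extra arguments.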
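The one-time preprocessing step mentioned above could look roughly like the sketch below. It uses only the standard library; the function name `to_csv_text` and the shortened sample are made up for illustration, and the split/rsplit logic is the same as in the answer's loop.

```python
import csv
from io import StringIO

# Shortened sample of the data from the question.
raw = """6 Rotterdam NLD Zuid-Holland 593321
214 Porto Alegre BRA Rio Grande do Sul 1314032
123 Chicago 10000
"""

def to_csv_text(text):
    """Convert the space-separated lines into ordinary comma-separated CSV text."""
    out = StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["id", "city", "population"])  # header row
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        id_value, rest = line.split(" ", 1)      # first token is the ID
        city, population = rest.rsplit(" ", 1)   # last token is the population
        writer.writerow([id_value, city, population])
    return out.getvalue()

csv_text = to_csv_text(raw)
print(csv_text)
```

Once this text is written to a file, a plain pd.read_csv(path) call should be enough to load it, because the result is a regular comma-separated file with a header row.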

User

Answer 2 · edited 2024-10-06 12:20:57

You wrote your pattern to match (extract) the text, but in the pandas method the pattern is used to split the line.
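To make the match-versus-split distinction concrete, here is a small sketch with Python's `re.split`. It assumes the pandas `python` engine splits each line with the compiled `sep` pattern in roughly this way, and the sample line is made up for illustration. When a split pattern contains capturing groups, `re.split` returns the groups as extra fields (with `None` for groups that did not take part in the match), which would explain the extra NaN columns.

```python
import re

line = "6 Rotterdam 593321"  # illustrative sample line

# The pattern from the question: every alternative contains a capturing group.
capturing = r'(\d+)\s+|([\D]+)\s+|(\d+)\s+'
fields_capturing = re.split(capturing, line)

# A plain split pattern with no capturing groups.
fields_plain = re.split(r'\s+', line)

print(fields_capturing)  # more than three fields, including None entries
print(fields_plain)      # ['6', 'Rotterdam', '593321']
```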

If there can only be 1, 2, or 3 digits at the start of each line, use

sep = r'(?:(?<=^\d)|(?<=^\d{2})|(?<=^\d{3}))\s+|\s+(?=\S+\s*$)'

See the regex demo. You can extend it to longer IDs by adding more lookbehinds to the first non-capturing group.
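As a sketch of that extension, the separator can be generated for any maximum ID length. The helper name `make_sep` is hypothetical, and the sample line is made up. Each lookbehind has a fixed width, which Python's `re` module requires, so one alternative per digit count is needed.

```python
import re

def make_sep(max_digits):
    """Build the separator with one fixed-width lookbehind per allowed ID length."""
    lookbehinds = "|".join(
        rf"(?<=^\d{{{n}}})" for n in range(1, max_digits + 1)
    )
    return rf"(?:{lookbehinds})\s+|\s+(?=\S+\s*$)"

line = "12345 Some City 99"  # illustrative line with a 5-digit ID

print(re.split(make_sep(5), line))  # ['12345', 'Some City', '99']
print(re.split(make_sep(3), line))  # ['12345 Some City', '99']
```

With only three lookbehinds the 5-digit ID is no longer split off, so the maximum has to cover the longest ID in the data.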

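Details

(?:(?<=^\d)|(?<=^\d{2})|(?<=^\d{3}))\s+ - 1+ whitespace characters (\s+) that are preceded by 1 digit (\d), 2 digits (\d{2}), or 3 digits (\d{3}) at the start of the string (^)
| - or
\s+(?=\S+\s*$) - 1+ whitespace characters that are followed by 1+ non-whitespace characters and then any 0+ trailing whitespace characters before the end of the string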
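Put together, a minimal sketch of using this separator with pd.read_csv could look like the following. The sample is shortened from the question's data. Passing header=None together with names is my assumption, since the data has no header row and the question asks for the columns ID, CITY, POPULATION.

```python
from io import StringIO

import pandas as pd

# Shortened sample of the data from the question.
sample = """6 Rotterdam NLD Zuid-Holland 593321
19 Zaanstad NLD Noord-Holland 135621
214 Porto Alegre BRA Rio Grande do Sul 1314032
123 Chicago 10000
"""

# Split after the leading 1-3 digit ID and before the last token on the line.
sep = r'(?:(?<=^\d)|(?<=^\d{2})|(?<=^\d{3}))\s+|\s+(?=\S+\s*$)'

df = pd.read_csv(
    StringIO(sample),
    sep=sep,
    engine="python",   # regex separators need the python engine
    header=None,       # the data has no header row
    names=["ID", "CITY", "POPULATION"],
)
print(df)
```

Because the pattern has no capturing groups, each line is split into exactly three fields, so no extra NaN columns appear.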

This works well.
