Troubleshooting invalid rows in a CSV file

+------------+-------------+------------+
| ID         | Date        | String     |
+------------+-------------+------------+
| 123456     | 09-20-2019  | ABCDEFG    |
| 123abc456  | 10-30-2019  | HIJKLMN    |
| 7891011    | jdqhouehwf  | OPQRSTU    |
| 1010101    | 03-15-2018  | 8473737    |
| 4823.00    | 02-11-2015  | VWXYZ      |
| 2348813.0  | 01-23-2016  | BAZ        |
+------------+-------------+------------+

"ID","Date","String"
123456,"09-20-2019","ABCDEFG"
123abc456,"10-30-2019","HIJKLMN"
7891011,"jdqhouehwf","OPQRSTU"
1010101,"03-15-2018",8473737
4823.00,"02-11-2015","VWXYZ"
"2348813.0","01-23-2016","BAZ"

"ID","Date","String"
123456,"09-20-2019","ABCDEFG"
# 123abc456,"10-30-2019","HIJKLMN" was deleted because the ID wasn't a number
# 7891011,"jdqhouehwf","OPQRSTU" was deleted because the data was not a date
1010101,"03-15-2018","8473737" # The last number could be converted to string
4823,"02-11-2015","VWXYZ" # The first number could be converted to integer
2348813,"01-23-2016","BAZ" # The ID number could be converted to int

2 Answers

User

Reply #1 · edited 2024-10-04 09:26:00

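Since you tagged sed, here is a command that should do the job quite efficiently, although it is not very readable. It relies on \+ and \|, which are GNU extensions to POSIX basic regular expressions, so portability to other sed implementations is not guaranteed.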

sed -n '1p;s/^"\{0,1\}\([0-9]\+\)\(\.[0-9]*\)\{0,1\}"\{0,1\}\(,"\(0[0-9]\|1[0-2]\)-\([0-2][0-9]\|3[01]\)-2[0-9]\{3\}",\)"\{0,1\}\([^"]*\)"\{0,1\}$/\1\3"\6"/p' file

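What it does:

- Prints the header, i.e. the first line (1p).
- Attempts the substitution (s) command on every line and prints the result only when the substitution succeeds, that is, only when the line matches the search pattern (s/…/…/p). Because of -n, nothing else is printed.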
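The two steps above can be mimicked in Python as a rough sketch. The helper name `filter_like_sed` and the toy pattern are hypothetical and chosen only for illustration; they are not the pattern from the sed command.

```python
import re

def filter_like_sed(lines, pattern, repl):
    """Mimic `sed -n '1p;s/pattern/repl/p'` (hypothetical helper)."""
    out = []
    for number, line in enumerate(lines, start=1):
        if number == 1:
            out.append(line)  # 1p: the first line is always printed
        new_line, count = re.subn(pattern, repl, line, count=1)
        if count:  # s/.../.../p: print only when the substitution succeeded
            out.append(new_line)
    return out

rows = ["id,name", "12,abc", "x3,def"]
print(filter_like_sed(rows, r"^([0-9]+),(.*)$", r"\1;\2"))
# ['id,name', '12;abc']
```

As in sed, the substitution is also attempted on the first line, so a header that happened to match the pattern would be printed twice.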

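As for the replacement pattern \1\3"\6", each escaped number refers to the corresponding capture group (\(…\); remember that groups are numbered by the order in which their opening \( appears). Specifically:

\1 refers to the leading number ([0-9]\+), which may or may not (\{0,1\}) come with the following three things, none of which is kept in the output:
- a leading "
- a trailing decimal part \.[0-9]*
- a closing "
\3 refers to the date wrapped in ", together with the commas on either side ("\(0[0-9]\|1[0-2]\)-\([0-2][0-9]\|3[01]\)-2[0-9]\{3\}"; note that I am being imprecise in this regex, since it also matches nonexistent dates such as February 31);
"\6" refers to the final alphanumeric string (and puts it between "), about which I make almost no assumptions ([^"]*).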

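For readers more at home with Python's `re` module, the capture groups described above can be sketched as follows. This is an approximate translation of the sed pattern rather than part of the original answer, and the names `ROW` and `clean` are hypothetical.

```python
import re

# Approximate Python translation of the sed pattern (illustration only).
ROW = re.compile(
    r'^"?([0-9]+)'         # group 1: integer part of the ID
    r'(\.[0-9]*)?"?'       # group 2: optional decimal part (discarded)
    r'(,"(0[0-9]|1[0-2])'  # group 3 opens at the comma; group 4: month
    r'-([0-2][0-9]|3[01])' # group 5: day
    r'-2[0-9]{3}",)'       # year 2xxx; group 3 closes after the comma
    r'"?([^"]*)"?$'        # group 6: final string without its quotes
)

def clean(line):
    """Return the normalized line, or None when the line does not match."""
    m = ROW.match(line)
    if m is None:
        return None
    # Same shape as the sed replacement \1\3"\6"
    return f'{m.group(1)}{m.group(3)}"{m.group(6)}"'

print(clean('4823.00,"02-11-2015","VWXYZ"'))
print(clean('123abc456,"10-30-2019","HIJKLMN"'))
```

The first call yields the row with the ID truncated to `4823`; the second returns `None` because the ID is not purely numeric.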
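This should match dates somewhat better (except that February 29 always matches regardless of the year, and a month or day of 00 is still accepted):

sed -n '1p;s/^"\{0,1\}\([0-9]\+\)\(\.[0-9]*\)\{0,1\}"\{0,1\}\(,"\(\(0[0-9]\|1[0-2]\)-[0-2][0-9]\|\(0[13-9]\|1[0-2]\)-30\|\(0[13578]\|1[02]\)-31\)-2[0-9]\{3\}",\)"\{0,1\}\([^"]*\)"\{0,1\}$/\1\3"\8"/p' file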
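Where the regex becomes unwieldy, the calendar check (including leap years) can be delegated to a date parser. A minimal Python sketch, assuming the MM-DD-YYYY layout of the sample data; `is_real_date` is a hypothetical helper name.

```python
from datetime import datetime

def is_real_date(text):
    """True only for dates that exist on the calendar (MM-DD-YYYY)."""
    try:
        datetime.strptime(text, "%m-%d-%Y")
    except ValueError:
        return False
    return True

print(is_real_date("02-29-2016"))  # True: 2016 is a leap year
print(is_real_date("02-29-2015"))  # False: 2015 is not
print(is_real_date("02-31-2019"))  # False: February never has 31 days
```

One difference from the regex: `strptime` also accepts months and days without a leading zero, so it is slightly more lenient about the layout while being stricter about the calendar.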

User

Reply #2 · edited 2024-10-04 09:26:00

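import csv
import datetime as dt
import sys
from pathlib import Path


def main():
    # newline="" lets the csv module handle line endings itself
    with Path("thing.csv").open("r", newline="") as file:
        for row in csv.DictReader(file):
            try:
                # "4823.00" becomes 4823; a non-numeric ID raises ValueError
                row["ID"] = int(float(row["ID"]))
                # Raises ValueError unless the text is a real MM-DD-YYYY date
                row["Date"] = dt.datetime.strptime(row["Date"], "%m-%d-%Y")
            except (KeyError, ValueError):
                # Skip rows with a missing column or an invalid value
                continue
            print(*row.values())

    return 0


if __name__ == "__main__":
    sys.exit(main())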
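The loop above prints the parsed values separated by spaces, with the date rendered as a `datetime` object. To get back a CSV shaped like the desired output, one option is `csv.writer` with `csv.QUOTE_NONNUMERIC`, which quotes strings and leaves numbers bare. This is a sketch under that assumption; `clean_csv` is a hypothetical helper that works on text in memory rather than on `thing.csv`.

```python
import csv
import io
from datetime import datetime

def clean_csv(text):
    """Return the cleaned CSV as a string (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.writer(out, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
    writer.writerow(reader.fieldnames)  # header fields are strings, so they are quoted
    for row in reader:
        try:
            row_id = int(float(row["ID"]))              # "4823.00" -> 4823
            datetime.strptime(row["Date"], "%m-%d-%Y")  # validate only; keep the original text
        except (KeyError, ValueError, TypeError):
            continue  # drop rows with a bad ID, a bad date, or missing fields
        writer.writerow([row_id, row["Date"], row["String"]])
    return out.getvalue()

sample = '"ID","Date","String"\n4823.00,"02-11-2015","VWXYZ"\n7891011,"jdqhouehwf","OPQRSTU"\n'
print(clean_csv(sample), end="")
```

The date is only validated and then written back unchanged, so the output keeps the MM-DD-YYYY text from the input.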
