Cleaning a dataset with Python

Posted 2024-09-27 22:08:25


I'm new to Python. I have a CSV file whose tweet entries are formatted like this:

15,Oct 11,785816454042124288,/realDonaldTrump/status/785816454042124288,False,"Despite winning the second debate in a landslide (every poll), it is hard to do well when Paul Ryan and others give zero support!",DonaldTrump

And another one:

16,Oct 10,785563318652178432,/realDonaldTrump/status/785563318652178432,False,"Wow, @CNN got caught fixing their ""focus group"" in order to make Crooked Hillary look better. Really pathetic and totally dishonest!",DonaldTrump

In Python, I load the contents with pandas as follows:

data = pd.read_csv(arg, sep=',')

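Now I want to clean the CSV file and keep only the user ID (the 3rd entry of each row) and the tweet itself (the 6th entry, I think). As you can see, I split on sep=','. The problem is that some tweets contain commas, and I don't want that character to be removed by the split. It would be much easier if the separator between the tweet number, date, user ID, and so on were not a comma. Any suggestions? I just want a new CSV file without the information I don't need.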


1 answer
User
Reply #1 · Posted 2024-09-27 22:08:25

The problem is that some tweets contain commas, and I don't want that character to be removed by the split.

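The csv module in the Python standard library handles this case well: it treats commas inside a double-quoted field as part of the field, and reads a doubled quote ("") as one literal quote character.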

>>> import csv
>>> s = '''15,Oct 11,785816454042124288,/realDonaldTrump/status/785816454042124288,False,"Despite winning the second debate in a landslide (every poll), it is hard to do well when Paul Ryan and others give zero support!",DonaldTrump
16,Oct 10,785563318652178432,/realDonaldTrump/status/785563318652178432,False,"Wow, @CNN got caught fixing their ""focus group"" in order to make Crooked Hillary look better. Really pathetic and totally dishonest!",DonaldTrump
'''.splitlines()
>>> for fields in csv.reader(s):
        print(fields[2], fields[5])


785816454042124288 Despite winning the second debate in a landslide (every poll), it is hard to do well when Paul Ryan and others give zero support!
785563318652178432 Wow, @CNN got caught fixing their "focus group" in order to make Crooked Hillary look better. Really pathetic and totally dishonest!
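To get the new CSV file the question asks for, the same reader can be paired with csv.writer. The following is only a minimal sketch: the function name extract_id_and_text and the parameters id_col and text_col are made up for illustration, and it assumes every data row has the seven-field layout shown in the question and that the file has no header row.

```python
import csv
import io


def extract_id_and_text(src, dst, id_col=2, text_col=5):
    """Copy only the ID and tweet-text columns from src to dst.

    src and dst are text file objects; id_col and text_col are
    zero-based positions (hypothetical names, assumed from the sample rows).
    """
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for fields in reader:
        # Skip blank or truncated rows instead of raising IndexError.
        if len(fields) <= max(id_col, text_col):
            continue
        # csv.writer re-quotes the tweet because it contains commas/quotes.
        writer.writerow([fields[id_col], fields[text_col]])


# The two sample rows from the question.
SAMPLE = '''15,Oct 11,785816454042124288,/realDonaldTrump/status/785816454042124288,False,"Despite winning the second debate in a landslide (every poll), it is hard to do well when Paul Ryan and others give zero support!",DonaldTrump
16,Oct 10,785563318652178432,/realDonaldTrump/status/785563318652178432,False,"Wow, @CNN got caught fixing their ""focus group"" in order to make Crooked Hillary look better. Really pathetic and totally dishonest!",DonaldTrump
'''

out = io.StringIO()
extract_id_and_text(io.StringIO(SAMPLE), out)
result = out.getvalue()
print(result)
```

With real files, the csv documentation recommends opening both with newline='', for example open('tweets.csv', newline='') for reading and open('tweets_clean.csv', 'w', newline='') for writing (both file names are placeholders). pandas' read_csv also honors double quotes by default, so passing header=None and usecols=[2, 5] to the call in the question would likely be another route.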
