Python data wrangling questions

Posted 2024-09-29 21:25:46


I'm currently stuck on a few basic problems with a small data set. Here are the first three lines, to illustrate the data format:

"Sport","Entry","Contest_Date_EST","Place","Points","Winnings_Non_Ticket","Winnings_Ticket","Contest_Entries","Entry_Fee","Prize_Pool","Places_Paid"

"NBA","NBA 3K Crossover #3 [3,000 Guaranteed] (Early Only) (1/15)","2015-03-01 13:00:00",35,283.25,"13.33","0.00",171,"20.00","3000.00",35

"NBA","NBA 1500 Layup 4 [1500 Guaranteed] (Early Only) (1/25)","2015-03-01 13:00:00",148,283.25,"3.00","0.00",862,"2.00","1500.00",200

The problems I ran into after creating a DataFrame with read_csv:

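  1. Commas in the values of some columns (e.g. Prize_Pool) cause Python to treat those entries as strings. I need to convert them to floats so that I can do some calculations. I've been using Python's replace() function to strip the commas, but that's as far as I've gotten.

  2. The Contest_Date_EST column contains timestamps, but some of them are repeated. I'd like to subset the whole data set into one that has only unique timestamps. It would be nice to be able to choose which of the duplicate entries is removed, but for now I'd just like to be able to filter the data down to unique timestamps.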


2 answers

Use the thousands=',' argument for numbers that contain commas

In [1]: from pandas import read_csv
In [2]: d = read_csv('data.csv', thousands=',')

You can then check that Prize_Pool is numeric, for example by inspecting d['Prize_Pool'].dtype.
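As a minimal sketch of that check, the snippet below parses a small in-memory sample instead of data.csv and compares the column type with and without thousands=','. The sample rows and the use of io.StringIO are assumptions made for illustration; only the column names come from the question.

```python
import io

import pandas as pd

# Hypothetical sample standing in for data.csv (values are made up).
SAMPLE = '''"Sport","Entry_Fee","Prize_Pool","Places_Paid"
"NBA","20.00","3,000.00",35
"NBA","2.00","1,500.00",200
'''

# Without thousands=',' the value "3,000.00" cannot be parsed as a number,
# so the column is read as strings.
raw = pd.read_csv(io.StringIO(SAMPLE))

# With thousands=',' pandas strips the separators and parses floats.
d = pd.read_csv(io.StringIO(SAMPLE), thousands=',')

print(raw['Prize_Pool'].dtype)
print(d['Prize_Pool'].dtype)
print(d['Prize_Pool'].sum())
```

Once the column is a float column, ordinary arithmetic such as sum() or mean() works on it directly.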

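To drop the duplicate rows, keep the first one observed; you can also keep the last one with keep='last' (older pandas versions used the take_last argument for this)

In [7]: d.drop_duplicates('Contest_Date_EST', keep='first')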
Out[7]:
  Sport                                              Entry  \
0   NBA  NBA 3K Crossover #3 [3,000 Guaranteed] (Early ...

      Contest_Date_EST  Place  Points  Winnings_Non_Ticket  Winnings_Ticket  \
0  2015-03-01 13:00:00     35  283.25                13.33                0

   Contest_Entries  Entry_Fee  Prize_Pool  Places_Paid
0              171         20        3000           35
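In recent pandas versions, the choice of which duplicate survives is made with the keep argument of drop_duplicates. The sketch below shows the three common variants on made-up rows (the data is hypothetical; only the column names come from the question).

```python
import pandas as pd

# Hypothetical rows: the first two share a timestamp.
d = pd.DataFrame({
    'Contest_Date_EST': ['2015-03-01 13:00:00',
                         '2015-03-01 13:00:00',
                         '2015-03-02 19:00:00'],
    'Place': [35, 148, 7],
})

# Keep the first row seen for each timestamp.
first = d.drop_duplicates(subset='Contest_Date_EST', keep='first')

# Keep the last row seen for each timestamp instead.
last = d.drop_duplicates(subset='Contest_Date_EST', keep='last')

# Drop every row whose timestamp occurs more than once.
unique_only = d[~d.duplicated(subset='Contest_Date_EST', keep=False)]

print(first['Place'].tolist())        # [35, 7]
print(last['Place'].tolist())         # [148, 7]
print(unique_only['Place'].tolist())  # [7]
```

drop_duplicates returns a new DataFrame, so d itself is left unchanged unless the result is assigned back.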

Edit: Just realized you're using pandas - should have looked at that. I'll leave this here for now in case it's applicable but if it gets downvoted I'll take it down by virtue of peer pressure :)

I'll try and update it to use pandas later tonight

It seems like itertools.groupby() is the tool for this job

Something like this?

import csv
import itertools
from operator import itemgetter


class CsvImport:

    def Run(self, filename):
        # Get the normalized rows from the CSV file, grouped by timestamp
        rows = self.readCsv(filename)
        for key, values in rows.items():
            print("\nKey: " + key)
            for i, value in enumerate(values, start=1):
                print("\nValue {index} : {value}".format(index=i, value=value))

    def readCsv(self, fileName):
        with open(fileName, newline='') as csvfile:
            reader = csv.DictReader(csvfile)
            # Column names may or may not be read with extra padding, so strip them
            reader.fieldnames = [name.strip() for name in reader.fieldnames]
            # groupby() only merges adjacent rows, so sort by the grouping key first
            byDate = itemgetter("Contest_Date_EST")
            rows = sorted((self.normalizeRow(row) for row in reader), key=byDate)
            return {k: list(g) for k, g in itertools.groupby(rows, byDate)}

    def normalizeRow(self, row):
        # Strip thousands separators so the value can be parsed as a float
        row["Prize_Pool"] = float(row["Prize_Pool"].replace(',', ''))
        # and so on
        return row


if __name__ == "__main__":
    CsvImport().Run("./Test1.csv")

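One caveat with itertools.groupby() is that it only groups consecutive items that share a key, so rows with the same timestamp have to be adjacent (for example, sorted) to end up in a single group. A minimal sketch with made-up rows:

```python
import itertools
from operator import itemgetter

# Hypothetical rows: the first and last share a timestamp but are not adjacent.
rows = [
    {'Contest_Date_EST': '2015-03-01 13:00:00', 'Place': 35},
    {'Contest_Date_EST': '2015-03-02 19:00:00', 'Place': 7},
    {'Contest_Date_EST': '2015-03-01 13:00:00', 'Place': 148},
]
by_date = itemgetter('Contest_Date_EST')

# Unsorted input: the repeated timestamp shows up as two separate groups.
unsorted_keys = [key for key, _ in itertools.groupby(rows, by_date)]

# Sorted input: each timestamp appears as exactly one group.
grouped = {
    key: [row['Place'] for row in group]
    for key, group in itertools.groupby(sorted(rows, key=by_date), by_date)
}

print(len(unsorted_keys))  # 3
print(grouped)
```

Sorting on the timestamp string works here because the "YYYY-MM-DD HH:MM:SS" format orders the same way lexically and chronologically.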

More information:

https://docs.python.org/2/library/itertools.html

Hope this helps :)
