Python.exe hangs when running a script with a list implementation

Posted 2024-09-29 23:29:33

I developed a script that processes a CSV file and generates another file with the results. The script runs successfully on limited test data, but when I execute it against the actual data file (25 million rows, 15 columns), it hangs and then closes abruptly. Please see the attached error screenshot.

So, is there a maximum number of records that pandas can read from a CSV file, or that can be stored in a list?

Please share your ideas for optimizing the script below.

[Error Screen Shot]

Below is the script.

import csv
import operator
import pandas as pd
import time

print time.strftime('Script Start Time : ' + "%Y-%m-%d %H:%M:%S")
sourceFile = raw_input('Enter file name along with path : ')
searchParam1 = raw_input('Enter first column name containing MSISDN : ').lower()
searchParam2 = raw_input('Enter second column name containing DATE-TIME : ').lower()
searchParam3 = raw_input('Enter file separator (,/#/|/:/;) : ')

df = pd.read_csv(sourceFile, sep=searchParam3)
df.columns = df.columns.str.lower()
df = df.rename(columns={searchParam1 : 'msisdn', searchParam2 : 'datetime'})

destFileWritter = csv.writer(open(sourceFile + ' - ProcessedFile.csv','wb'))
destFileWritter.writerow(df.keys().tolist())
sortedcsvList = df.sort_values(['msisdn','datetime']).values.tolist()

rows = [row for row in sortedcsvList]
col_1 = [row[df.columns.get_loc('msisdn')] for row in rows]
col_2 = [row[df.columns.get_loc('datetime')] for row in rows]

# rows are sorted by msisdn then datetime, so a row whose msisdn differs
# from the next row's is the last (latest) row of its msisdn group
for i in range(0, len(col_1)-1):
    if col_1[i] == col_1[i+1]:
        # same msisdn as the next row: not the last of its group, skip
        continue
    else:
        # rescan the full list for the matching row and write it out
        for row in rows:
            if col_1[i] in row:
                if col_2[i] in row:
                    destFileWritter.writerow(row)
# the final row is always the last of its group
destFileWritter.writerow(rows[len(rows)-1])
print('Processing Completed, Kindly Check Response File On Same Location.')
print time.strftime('Script End Time : ' + "%Y-%m-%d %H:%M:%S")
raw_input('Press Enter to Exit...')
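
For reference, the nested loop above boils down to keeping the latest datetime row of each msisdn group, which pandas can express directly without rescanning the whole list. A minimal sketch of that idea, reusing the imports and inputs read above (the one-row-per-MSISDN output is an assumption about the intent, not part of the original script):

df = pd.read_csv(sourceFile, sep=searchParam3)
df.columns = df.columns.str.lower()
df = df.rename(columns={searchParam1: 'msisdn', searchParam2: 'datetime'})

# sort so the latest datetime of each msisdn comes last, then keep that row
result = (df.sort_values(['msisdn', 'datetime'])
            .drop_duplicates('msisdn', keep='last'))
result.to_csv(sourceFile + ' - ProcessedFile.csv', index=False)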

Updated script:

import csv
import operator
import pandas as pd
import time
import sys

print time.strftime('Script Start Time : ' + "%Y-%m-%d %H:%M:%S")
sourceFile = raw_input('Enter file name along with path : ')
searchParam1 = raw_input('Enter first column name containing MSISDN : ').lower()
searchParam2 = raw_input('Enter second column name containing DATE-TIME : ').lower()
searchParam3 = raw_input('Enter file separator (,/#/|/:/;) : ')

def csvSortingFunc(sourceFile, searchParam1, searchParam2, searchParam3):
    CHUNKSIZE = 10000
    # accumulate results across all chunks, not per chunk
    resultList = []
    for chunk in pd.read_csv(sourceFile, chunksize=CHUNKSIZE, sep=searchParam3):
        df = chunk
        df.columns = df.columns.str.lower()
        df = df.rename(columns={searchParam1 : 'msisdn', searchParam2 : 'datetime'})
        # write the header row only once, before the first chunk's data
        if not resultList:
            resultList.append(df.keys().tolist())
        sortedcsvList = df.sort_values(['msisdn','datetime']).values.tolist()
        rows = [row for row in sortedcsvList]
        col_1 = [row[df.columns.get_loc('msisdn')] for row in rows]
        col_2 = [row[df.columns.get_loc('datetime')] for row in rows]
        # keep only the last (latest datetime) row of each msisdn group
        for i in range(0, len(col_1)-1):
            if col_1[i] == col_1[i+1]:
                continue
            else:
                for row in rows:
                    if col_1[i] in row:
                        if col_2[i] in row:
                            resultList.append(row)
        # the final row of each chunk is always the last of its group
        resultList.append(rows[len(rows)-1])
    writedf = pd.DataFrame(resultList)
    writedf.to_csv(sourceFile + ' - ProcessedFile.csv', header=False, index=False)


csvSortingFunc(sourceFile, searchParam1, searchParam2, searchParam3)
print('Processing Completed, Kindly Check Response File On Same Location.')
print time.strftime('Script End Time : ' + "%Y-%m-%d %H:%M:%S")
raw_input('Press Enter to Exit...')

1 Answer

#1 · Posted 2024-09-29 23:29:33

If you can aggregate the results easily, you should definitely consider using the chunksize parameter of pd.read_csv. It lets you read the .csv file in large chunks, for example 100000 records at a time.

chunksize = 10000
for chunk in pd.read_csv(filename, chunksize=chunksize):
    df = chunk
    # your code

After that, append the result of each chunk's computation to the previous ones. Hope this helps; I have used this approach on files with several million rows.
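
A minimal sketch of that aggregation, assuming (as the question's code suggests) that the goal is the latest row per MSISDN; sourceFile and the searchParam variables are the inputs from the question's script, and the overall structure is an illustration rather than the answerer's exact code:

import pandas as pd

chunksize = 100000
partials = []
for chunk in pd.read_csv(sourceFile, sep=searchParam3, chunksize=chunksize):
    chunk.columns = chunk.columns.str.lower()
    chunk = chunk.rename(columns={searchParam1: 'msisdn', searchParam2: 'datetime'})
    # reduce each chunk to its last row per msisdn before accumulating
    partials.append(chunk.sort_values(['msisdn', 'datetime'])
                         .drop_duplicates('msisdn', keep='last'))

# a msisdn can span chunk boundaries, so deduplicate once more globally
result = (pd.concat(partials)
            .sort_values(['msisdn', 'datetime'])
            .drop_duplicates('msisdn', keep='last'))
result.to_csv(sourceFile + ' - ProcessedFile.csv', index=False)

Reducing each chunk before concatenating keeps the accumulated data close to one row per distinct MSISDN per chunk, instead of holding all 25 million rows in memory at once.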

Follow-up:

    i = 0
    for chunk in pd.read_csv(sourceFile, chunksize=10):
        print('chunk_no', i)
        i+=1

Could you run just these lines? Do they print some chunk numbers?
