Converting a huge JSON file to CSV

Published 2024-06-26 14:07:24


I have a huge JSON file to convert to CSV. I've searched all over the internet and also tried writing Python myself, but nothing has worked. I've been stuck on this for a week. Can anyone help me? The JSON file has this format:

{"Gid": "5999043768223797248", 
"rights": [{"grantorContext": "Freemium right added by Netlife", "sku": "CMO-STO-2-FREE", "rightId": "5340e29a6dc01000", "grantorId": "NETLIFE_B2C"}], 
"used_quota": "16.95", 
"creationtime": "2001-04-29 12:58:33", 
"devices": [{"last_connection": "2001-05-30 22:06:08", "os_version": "4.2.2", "auto_upload": "wifi", "last_upload": "2002-04-29 13:12:26", "device_name": "i-mobile i-STYLE 7.5", "platform": "unknow", "client_version": "2.0.0"}], 
"total_quota": 2.0, 
"Uid": ["666927729520"]}


{"Gid": "5999043740151320576", 
"rights": [{"grantorContext": "Freemium right added by Netlife", "sku": "CMO-STO-2-FREE", "rightId": "5340e29f72c05000", "grantorId": "NETLIFE_B2C"}, 
           {"grantorContext": null, "sku": "CMO-STO-25-M", "rightId": "53b5d2d8b0400000", "grantorId": "DTN"}], 
"used_quota": "480.85", 
"creationtime": "2001-04-29 12:58:38", 
"devices": [{"last_connection": "2001-08-02 03:46:05", "os_version": "8.4", "auto_upload": "wifi", "last_upload": "2015-08-02 03:46:05", "device_name": "Nokia", "platform": "unknow", "client_version": "1.0.0"}], 
"total_quota": 27.0, 
"Uid": ["465949097714"]}


{"Gid": "5999043675907166208", 
"rights": [{"grantorContext": null, "sku": "CMO-STO-25-M", "rightId": "53b5d2e161000000", "grantorId": "DTN"}, 
           {"grantorContext": "Freemium right added by Netlife", "sku": "CMO-STO-2-FREE", "rightId": "5340e29b42805000", "grantorId": "NETLIFE_B2C"}], 
"used_quota": "8.26", 
"creationtime": "2001-04-29 12:58:35", 
"devices": [{"last_connection": "2001-04-29 13:08:24", "os_version": "4.2.2", "auto_upload": "wifi", "last_upload": "2002-04-29 13:03:25", "device_name": "Nokia V797", "platform": "unknow", "client_version": "2.0.0"}], 
"total_quota": 27.0, 
"Uid": ["666994575443"]}

3 Answers

Here is a slightly less brute-force approach using re.split(). I've tested it: on a Core i3 with 8 GB of RAM, the following takes only a couple of seconds.

import re

# build a ~500 MB test string of fake single-record lines
huge = "{" + "a" * 1000 + "}\n"
huge = huge * 500000
len(huge) / 1000000.0
# gives 501.5 (megabytes)
jsons = re.split(r'\}\s*\{', huge)
len(jsons)
# gives 500000; took about 2 seconds
del huge  # a good idea to free half a gigabyte as soon as possible

The split yields the individual JSON elements, each in its own string, minus the opening brace (except on the first) and the closing brace (except on the last). Note that the pattern only fires where a '}' is separated from a '{' by pure whitespace, which in this data happens only at record boundaries; the inner objects in "rights" and "devices" are always joined by commas or brackets. So the remaining work (untested) would be something like the sketch below.

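A minimal sketch of that step (untested; 'output.csv' and the flat column choice are assumptions, and the nested "rights"/"devices" lists are skipped):

import csv
import json

# `jsons` is the list produced by re.split() above
with open('output.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    writer.writerow(['Gid', 'used_quota', 'creationtime', 'total_quota', 'Uid'])
    for piece in jsons:
        # re.split() consumed the braces at each boundary; restore them
        piece = piece.strip()
        if not piece.startswith('{'):
            piece = '{' + piece
        if not piece.endswith('}'):
            piece = piece + '}'
        obj = json.loads(piece)
        writer.writerow([obj['Gid'], obj['used_quota'], obj['creationtime'],
                         obj['total_quota'], obj['Uid'][0]])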

Here is a brute-force approach, for when the file is small enough to be handled as a single string without running out of memory:

import json
import re

with open('file.json', 'r') as f:
    multijsons = f.read()

# put a comma between adjacent objects, then wrap the whole thing in a list
sep = re.compile(r'\}\s*\{')
jsonlist = '[' + re.sub(sep, '}, {', multijsons) + ']'

load = json.loads(jsonlist)

# quick debug:
for item in load:
    print(item)
    print('\n -\n')
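To get from the loaded list to the CSV the question asks for, one possibility (a sketch; the flat columns are an assumption, and the nested "rights" and "devices" fields are simply dropped) is:

import csv

fields = ['Gid', 'used_quota', 'creationtime', 'total_quota', 'Uid']
with open('file.csv', 'w', newline='') as out:
    writer = csv.DictWriter(out, fieldnames=fields, extrasaction='ignore')
    writer.writeheader()
    for item in load:
        row = dict(item)
        row['Uid'] = row['Uid'][0]  # unwrap the one-element list
        writer.writerow(row)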

This is what I used, since my file isn't very large. I considered coding up the approach above, but didn't need it, and it looked like a tricky mess to get right.

I had a similar file myself!

It is not a valid JSON file. It is a set of JSON documents concatenated into one file. From the Python json.dump documentation:

Unlike pickle and marshal, JSON is not a framed protocol, so trying to serialize multiple objects with repeated calls to dump() using the same fp will result in an invalid JSON file.

That means that even if the whole thing fits in memory at once, you can't use json.load to read this file unless you first edit it: put '[' at the start and ']' at the end, and add a comma between each pair of elements (i.e. in the blank line between each '}' and the following '{').

You can use the Python json module and have it do what I think you want: read each group of seven elements sequentially, as a Python dict with keys 'Gid', 'rights', and so on.

You'll have to use the raw_decode method of the JSONDecoder class. It stops at the closing '}' and returns an index into the string it was scanning, so you can slice off what has just been processed.

So: read a large chunk of the big file, then attempt a raw decode of one element inside an exception handler. If it succeeds, save the decoded result, remove the successfully decoded part, and repeat. If you get an exception, read another chunk from the file, append it to the string being decoded, and repeat. If you still get an exception, the JSON element you're sitting on is corrupt (or is longer than your chunk size, or you aren't handling end-of-file correctly).
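A sketch of that loop (the filename and the 1 MB chunk size are assumptions):

import json

decoder = json.JSONDecoder()
records = []
buf = ''
with open('file.json', 'r') as f:
    while True:
        chunk = f.read(1 << 20)  # 1 MB at a time; the size is an arbitrary choice
        buf += chunk
        while True:
            buf = buf.lstrip()  # raw_decode rejects leading whitespace
            if not buf:
                break
            try:
                obj, end = decoder.raw_decode(buf)
            except ValueError:
                if not chunk:
                    raise  # end of file with undecodable leftovers: corrupt element
                break  # incomplete element; go read another chunk
            records.append(obj)
            buf = buf[end:]  # slice off what was just decoded
        if not chunk and not buf:
            break  # clean end of file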

The code is much easier to write if your file is only a few tens (hundreds?) of megabytes. Then just read the whole thing into one string and chomp JSON elements off the front until only whitespace is left or you hit a decoder error.
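And a sketch of this simpler whole-file variant, under the same assumptions:

import json

decoder = json.JSONDecoder()
with open('file.json', 'r') as f:
    text = f.read()

records = []
idx = 0
while idx < len(text):
    if text[idx].isspace():
        idx += 1  # skip the whitespace between elements
        continue
    obj, idx = decoder.raw_decode(text, idx)  # raises on a corrupt element
    records.append(obj)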
