Python3.X中的JSON到DataFrames

2024-05-08 15:22:59 发布

您现在位置:Python中文网/ 问答频道 /正文

我有一个JSON数据如下表所示。每一行都是一本独立的词典

["{'asin': '0001048791', 'salesRank': {'Books': 6334800}, 'imUrl': 'http://ecx.images-amazon.com/images/I/51MKP0T4DBL.jpg', 'categories': [['Books']], 'title': 'The Crucible: Performed by Stuart Pankin, Jerome Dempsey & Cast'}\n",
 "{'asin': '0000143561', 'categories': [['Movies & TV', 'Movies']], 'description': '3Pack DVD set - Italian Classics, Parties and Holidays.', 'title': 'Everyday Italian (with Giada de Laurentiis), Volume 1 (3 Pack): Italian Classics, Parties, Holidays', 'price': 12.99, 'salesRank': {'Movies & TV': 376041}, 'imUrl': 'http://g-ecx.images-amazon.com/images/G/01/x-site/icons/no-img-sm._CB192198896_.gif', *'related'*: {'also_viewed': ['B0036FO6SI', 'B000KL8ODE', '000014357X', 'B0037718RC', 'B002I5GNVU', 'B000RBU4BM'], 'buy_after_viewing': ['B0036FO6SI', 'B000KL8ODE', '000014357X', 'B0037718RC']}}\n"]

如果你仔细看这些数据,你可以观察下面的情况

  1. related/also_bought/also_viewed间歇性可用
  2. 它在每对{}之后都有\n。你知道吗

下面是此数据可以包含的最大列数。 Max columns that a single dict in the each line of file can contain

我的最终目标是将上面指示的列数据移动到数据帧,并且不确定是否可以移动它。你知道吗

请帮忙!你知道吗


Tags: 数据comhttpamazontitlemoviesbooksalso
1条回答
网友
1楼 · 发布于 2024-05-08 15:22:59

您的数据不是JSON

好的-你尝试做的有几个问题。首先,将列表复制并粘贴到交互式口译员会话中:

>>> data = ["{'asin': '0001048791', 'salesRank': {'Books': 6334800}, 'imUrl': 'http://ecx.images-amazon.com/images/I/51MKP0T4DBL.jpg', 'categories': [['Books']], 'title': 'The Crucible: Performed by Stuart Pankin, Jerome Dempsey & Cast'}\n",
...  "{'asin': '0000143561', 'categories': [['Movies & TV', 'Movies']], 'description': '3Pack DVD set - Italian Classics, Parties and Holidays.', 'title': 'Everyday Italian (with Giada de Laurentiis), Volume 1 (3 Pack): Italian Classics, Parties, Holidays', 'price': 12.99, 'salesRank': {'Movies & TV': 376041}, 'imUrl': 'http://g-ecx.images-amazon.com/images/G/01/x-site/icons/no-img-sm._CB192198896_.gif', *'related'*: {'also_viewed': ['B0036FO6SI', 'B000KL8ODE', '000014357X', 'B0037718RC', 'B002I5GNVU', 'B000RBU4BM'], 'buy_after_viewing': ['B0036FO6SI', 'B000KL8ODE', '000014357X', 'B0037718RC']}}\n"]
>>> [type(x) for x in data]
[<class 'str'>, <class 'str'>]

您只需要一个包含两个字符串的列表。这些不是字典,更重要的是,它们不是有效的JSON。请注意,字符串有单引号,并且有如下键:

*'related'*: {'also_viewed': ['B0036FO6SI', 'B000KL8ODE', '000014357X', 'B0037718RC', 'B002I5GNVU', 'B000RBU4BM'], 'buy_after_viewing': ['B0036FO6SI', 'B000KL8ODE', '000014357X', 'B0037718RC']}

它周围有两个星号(*)在单引号外。这肯定不是有效的JSON。一开始,我怀疑您是使用如下代码在中读取数据的:

with open('data.json','r') as f:
    data = f.readlines()

这意味着你的.json文件一开始是无效的。但是,您发布的屏幕截图显示了有效的json代码,因此无论您做了什么处理,都会将有效的json文件转换为无效的json字符串。对于您给出的示例,我们可以通过使用json模块函数loads将有效的JSON字符串反序列化为Python对象,快速地将它们转换为有效的Python对象:

>>> import json
>>> valid_data = [json.loads(s.replace("'",'"').replace('*','')) for s in data]

注意,对于原始列表中的每个字符串,我用双引号(")替换单引号('),并删除星号(*)。现在,这就是你想要的:

>>> from pprint import pprint
>>> pprint(valid_data)
[{'asin': '0001048791',
  'categories': [['Books']],
  'imUrl': 'http://ecx.images-amazon.com/images/I/51MKP0T4DBL.jpg',
  'salesRank': {'Books': 6334800},
  'title': 'The Crucible: Performed by Stuart Pankin, Jerome Dempsey &amp; '
           'Cast'},
 {'asin': '0000143561',
  'categories': [['Movies & TV', 'Movies']],
  'description': '3Pack DVD set - Italian Classics, Parties and Holidays.',
  'imUrl': 'http://g-ecx.images-amazon.com/images/G/01/x-site/icons/no-img-sm._CB192198896_.gif',
  'price': 12.99,
  'related': {'also_viewed': ['B0036FO6SI',
                              'B000KL8ODE',
                              '000014357X',
                              'B0037718RC',
                              'B002I5GNVU',
                              'B000RBU4BM'],
              'buy_after_viewing': ['B0036FO6SI',
                                    'B000KL8ODE',
                                    '000014357X',
                                    'B0037718RC']},
  'salesRank': {'Movies & TV': 376041},
  'title': 'Everyday Italian (with Giada de Laurentiis), Volume 1 (3 Pack): '
           'Italian Classics, Parties, Holidays'}]

现在,您只需使用来自pandasDataFrame构造函数:

>>> df = pd.DataFrame(valid_data)
>>> df
         asin               categories  \
0  0001048791                [[Books]]   
1  0000143561  [[Movies & TV, Movies]]   

                                         description  \
0                                                NaN   
1  3Pack DVD set - Italian Classics, Parties and ...   

                                               imUrl  price  \
0  http://ecx.images-amazon.com/images/I/51MKP0T4...    NaN   
1  http://g-ecx.images-amazon.com/images/G/01/x-s...  12.99   

                                             related                salesRank  \
0                                                NaN       {'Books': 6334800}   
1  {'also_viewed': ['B0036FO6SI', 'B000KL8ODE', '...  {'Movies & TV': 376041}   

                                               title  
0  The Crucible: Performed by Stuart Pankin, Jero...  
1  Everyday Italian (with Giada de Laurentiis), V...  
>>> data[0]
"{'asin': '0001048791', 'salesRank': {'Books': 6334800}, 'imUrl': 'http://ecx.images-amazon.com/images/I/51MKP0T4DBL.jpg', 'categories': [['Books']], 'title': 'The Crucible: Performed by Stuart Pankin, Jerome Dempsey &amp; Cast'}\n"
>>> df['asin']
0    0001048791
1    0000143561
Name: asin, dtype: object
>>> df.columns
Index(['asin', 'categories', 'description', 'imUrl', 'price', 'related',
       'salesRank', 'title'],
      dtype='object')

这是有效的,正如您所见,它通过填充NaN来处理某些列的间歇值。其中一些列,如'related'只是存储字典。实际上,这可能不是您想要的,但是从这一点上,您必须决定您到底要如何清理或重塑这些数据。如果你在那里遇到问题,也许值得再问一个问题。你知道吗

最后

在将数据转换成这些字符串之后,您可能只想执行以下操作(假设您正在处理的文件名为'数据.json'):

>>> with open('valid_json.json') as f:
...    data = pd.read_json(f)
... 
>>> data
      asin               categories  \
0  1048791                [[Books]]   
1   143561  [[Movies & TV, Movies]]   

                                         description  \
0                                                NaN   
1  3Pack DVD set - Italian Classics, Parties and ...   

                                               imUrl  price  \
0  http://ecx.images-amazon.com/images/I/51MKP0T4...    NaN   
1  http://g-ecx.images-amazon.com/images/G/01/x-s...  12.99   

                                             related                salesRank  \
0                                                NaN       {'Books': 6334800}   
1  {'also_viewed': ['B0036FO6SI', 'B000KL8ODE', '...  {'Movies & TV': 376041}   

                                               title  
0  The Crucible: Performed by Stuart Pankin, Jero...  
1  Everyday Italian (with Giada de Laurentiis), V...  
>>> type(data)
<class 'pandas.core.frame.DataFrame'>

相关问题 更多 >