How to extract some information from a JSON file using pandas

Posted 2024-09-29 01:33:53


I'm a beginner with the pandas library, and I have the following dataset as a JSON file:

array(
[
    {'paragraphs': 
        [
            {'qas': 
                [
                    {'question': "Quel astronome a émit l'idée en premier d'une planète entre les orbites de Mars et Jupiter ?",
                     'id': '9f38825f-1bd3-4171-9d3b-b0c2c71e7de2',
                     'answers': [
                            {'text': 'Johann Elert Bode', 'answer_start': 136}
                        ]
                    },
                    {'question': 'Quel astronome découvrit Uranus ?',
                     'id': 'c2415641-9a62-4052-b57b-9a239da7599c',
                     'answers': [
                            {'text': 'William Herschel', 'answer_start': 404}
                        ]
                    },
                    {'question': 'Quelles furent les découvertes finales des vingt-quatre astronomes ?', 
                    'id': '5c59e19a-066c-4dc0-aa16-2871dcb12d39', 
                    'answers': [
                            {'text': 'plusieurs autres astéroïdes', 'answer_start': 733}
                        ]
                    }
                ],
                'context': "L'idée selon laquelle une planète inconnue pourrait..."
            }
        ]
    }
]
) 

I'd like a script that extracts the question, text, and context from this JSON file. I tried the following script:

import pandas as pd
df = pd.read_json('train.json',  orient='columns')
print(df.head()['data'])

The result I get is:

0    {'paragraphs': [{'qas': [{'question': "Quel as...
1    {'paragraphs': [{'qas': [{'question': 'America...
2    {'paragraphs': [{'qas': [{'question': "A quell...
3    {'paragraphs': [{'qas': [{'question': "Pourquo...
4    {'paragraphs': [{'qas': [{'question': "Quels s...

1 answer
User
#1 · Posted 2024-09-29 01:33:53

jmespath may help here, since it makes it easy to traverse nested JSON data:

import json

import jmespath
import pandas as pd

# Load the raw JSON. Assumption: the records sit under the top-level
# "data" key, as the df['data'] output in the question suggests.
with open('train.json', encoding='utf-8') as f:
    data = json.load(f)['data']

# jmespath represents lists/arrays with the list symbol ([])
# and keys with a . (dot).
# The top level is a list whose items have a paragraphs key: [].paragraphs
# paragraphs is itself a list whose items hold qas: [].paragraphs[].qas
# qas is again a list whose items hold question: [].paragraphs[].qas[].question
# Passing the data to search() on the compiled expression returns the values.
# The same idea applies to the answer text.
# Note that context sits directly inside each paragraphs item,
# so its path is shorter than the other two.
questions = jmespath.compile('[].paragraphs[].qas[].question').search(data)
text = jmespath.compile('[].paragraphs[].qas[].answers[].text').search(data)
context = jmespath.compile('[].paragraphs[].context').search(data)

# To align context with the others (questions, text),
# repeat the list so every question/text row gets a context.
# This only lines up when there is a single paragraph, as in the sample.
context = context * len(questions)

# Zip the three lists together, pairing them row by row.
content = zip(questions, text, context)

# Read the data into pandas.
df = pd.DataFrame(content, columns=['questions', 'text', 'context'])
print(df)

questions   text    context
0   Quel astronome a émit l'idée en premier d'une ...   Johann Elert Bode   L'idée selon laquelle une planète inconnue pou...
1   Quel astronome découvrit Uranus ?   William Herschel    L'idée selon laquelle une planète inconnue pou...
2   Quelles furent les découvertes finales des vin...   plusieurs autres astéroïdes L'idée selon laquelle une planète inconnue pou...
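
One caveat: the `context * len(questions)` step lines up only when the data contains a single paragraph, and the flattened `text` list assumes exactly one answer per question. If the full file has many paragraphs, a plain nested loop keeps each question paired with its own context. The following is only a rough sketch, not part of the jmespath approach; `flatten_qas` and the `sample` data are hypothetical names and values made up for illustration, and only the key names (`paragraphs`, `qas`, `question`, `answers`, `text`, `context`) come from the question.

```python
def flatten_qas(data):
    """Return one row per (question, answer) pair with its paragraph's context."""
    rows = []
    for article in data:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    rows.append({
                        "questions": qa["question"],
                        "text": answer["text"],
                        "context": context,
                    })
    return rows


# Made-up sample with two paragraphs, mirroring the structure in the question.
sample = [
    {"paragraphs": [
        {"qas": [
            {"question": "Q1", "id": "a",
             "answers": [{"text": "A1", "answer_start": 0}]},
            {"question": "Q2", "id": "b",
             "answers": [{"text": "A2", "answer_start": 5}]},
         ],
         "context": "C1"},
        {"qas": [
            {"question": "Q3", "id": "c",
             "answers": [{"text": "A3", "answer_start": 9}]},
         ],
         "context": "C2"},
    ]}
]

rows = flatten_qas(sample)
for row in rows:
    print(row)
```

Passing `rows` to `pd.DataFrame(rows)` should then give the same three columns as in the output above, with each row carrying the context of the paragraph it came from.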
