使用一个数据结构在pyspark中创建多个表(XML解析为字典)

2024-09-28 21:37:31 发布

您现在位置:Python中文网/ 问答频道 /正文

我正在从几个具有嵌套结构的XML文件创建一个sqlite数据库

我花了大量时间编写了几个不同的数据解析器,使用显式函数、递归和解析的一般规则等等,但是没有一个能够以快速简单的方式解析完整的信息

幸运的是,在做了一些挖掘之后,我发现已经有一个包可以解析我正在处理的许多类型的XML文件;但是,解析器解析到一个记录列表,每个记录中都保留嵌套结构。 这似乎是一个极为常见的问题(给定的记录将包含一个值列表或字典,这表明需要一个单独的表。)-因此可能存在一个简单的解决方案

我在这个问题的底部提供了一个非常简单的xml文件字符串,它抓住了这个问题

使用上述解析器并应用pandas数据帧来可视化我们在这里看到的问题:

# Parse each row of the main table in a generator
import Bio
with open("test.xml") as f:
  gen = Bio.Entrez.parse(f)
  rec1 = next(gen) # first record
  rec2 = next(gen) # second record
  print(rec2)

# Recursively explode the columns into new tables
# [rec1, rec2] is just a operating as a small record array here
import pandas as pd
display(pd.DataFrame([rec1, rec2]))
display(pd.DataFrame([rec1, rec2]).set_index('DescriptorUI')['TreeNumberList'].explode().apply(pd.Series))
display(pd.DataFrame([rec1, rec2]).set_index('DescriptorUI')['ConceptList'].explode().apply(pd.Series))
display(pd.DataFrame([rec1, rec2]).set_index('DescriptorUI')['ConceptList'].explode().apply(pd.Series)['ConceptRelationList'].explode().apply(pd.Series))

这将输出记录的字典表示形式以及pandas数据帧,我在这里稍微整理了一下以说明我的观点:

>>> print(rec2)

DictElement({'DescriptorUI': 'D000002', 'DescriptorName': {'String': 'Temefos'}, 'DateCreated': {'Year': '1999', 'Month': '01', 'Day': '01'}, 'DateRevised': {'Year': '2013', 'Month': '07', 'Day': '08'}, 'TreeNumberList': ['D02.705.400.625.800', 'D02.705.539.345.800', 'D02.886.300.692.800'], 'ConceptList': [DictElement({'ConceptUI': 'M0000002', 'ConceptName': {'String': 'Temefos'}, 'CASN1Name': "Phosphorothioic acid, O,O'-(thiodi-4,1-phenylene) O,O,O',O'-tetramethyl ester", 'RegistryNumber': 'ONP3ME32DL', 'ScopeNote': 'An organothiophosphate insecticide.\n    ', 'RelatedRegistryNumberList': ['3383-96-8 (Temefos)'], 'ConceptRelationList': [DictElement({'Concept1UI': 'M0000002', 'Concept2UI': 'M0352201'}, attributes={'RelationName': 'NRW'}), DictElement({'Concept1UI': 'M0000002', 'Concept2UI': 'M0352200'}, attributes={'RelationName': 'NRW'})], 'TermList': [DictElement({'TermUI': 'T000008', 'String': 'Temefos', 'DateCreated': {'Year': '1999', 'Month': '01', 'Day': '01'}, 'ThesaurusIDlist': ['FDA SRS (2014)', 'INN (19XX)', 'USAN (1974)']}, attributes={'ConceptPreferredTermYN': 'Y', 'IsPermutedTermYN': 'N', 'LexicalTag': 'NON', 'RecordPreferredTermYN': 'Y'}), DictElement({'TermUI': 'T000007', 'String': 'Temephos', 'DateCreated': {'Year': '1994', 'Month': '12', 'Day': '01'}, 'ThesaurusIDlist': ['NLM (1996)']}, attributes={'ConceptPreferredTermYN': 'N', 'IsPermutedTermYN': 'N', 'LexicalTag': 'TRD', 'RecordPreferredTermYN': 'N'})]}, attributes={'PreferredConceptYN': 'Y'})]}, attributes={'DescriptorClass': '1'})

enter image description here

当然,在pandas中对xmls或已解析的字典进行任何此类处理都是一个糟糕的想法,因为这会非常缓慢

但是,我已经看到通过pyspark创建拼花地板文件,通过pyspark.sql.RowflatMap功能可以非常快地工作,例如:

from pyspark.sql import Row, SQLContext, Window
from pyspark import SparkConf, SparkContext
from pyspark.sql.functions import rank, max, sum, desc

conf = SparkConf().setAppName('mesh_spark')\
    .setMaster('local[8]')\
    .set('executor.memory', '8g')\
    .set('driver.memory', '8g')\
    .set('spark.driver.maxResultSize', '0')

sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

path_rdd = sc.parallelize(['test.xml'], numSlices=1)
parse_results_rdd = path_rdd.flatMap(lambda x: [Row(file_name=os.path.basename(x), **mesh_dict)
                       for mesh_dict in parse_generator_func(x)])

mesh_tables = parse_results_rdd.toDF()  #.partitionBy('table') maybe partition by table element?
print(medline_df)
mesh_tables.write.parquet(os.path.join(save_dir, 'mesh_raw.parquet'), mode='overwrite')


sc.stop()

我看到了一些关于使用.partitionBy()和其他一些想法的主张,但我所看到的想法都没有奏效

是否有一种更可靠、更简单的方法来展平此数据结构,或通过pyspark将多个值递归插入到单独的列中,以便自动处理嵌套数据以生成其他表?

下面是一个简单的测试xml:

"""
<?xml version="1.0"?>
<!DOCTYPE DescriptorRecordSet SYSTEM "https://www.nlm.nih.gov/databases/dtd/nlmdescriptorrecordset_20190101.dtd">
<DescriptorRecordSet>
<DescriptorRecord>
  <DescriptorUI>D000001</DescriptorUI>
  <DescriptorName>
   <String>Calcimycin</String>
  </DescriptorName>
  <DateCreated>
   <Year>1974</Year>
   <Month>11</Month>
   <Day>19</Day>
  </DateCreated>
  <TreeNumberList>
   <TreeNumber>D03.633.100.221.173</TreeNumber>
  </TreeNumberList>
  <ConceptList>
   <Concept PreferredConceptYN="Y">
    <ConceptUI>M0000001</ConceptUI>
    <ConceptName>
     <String>Calcimycin</String>
    </ConceptName>
   </Concept>
   <Concept PreferredConceptYN="N">
    <ConceptUI>M0353609</ConceptUI>
    <ConceptName>
     <String>A-23187</String>
    </ConceptName>
    <RegistryNumber>0</RegistryNumber>
    <ConceptRelationList>
     <ConceptRelation RelationName="NRW">
     <Concept1UI>M0000001</Concept1UI>
     <Concept2UI>M0353609</Concept2UI>
     </ConceptRelation>
    </ConceptRelationList>
   </Concept>
  </ConceptList>
 </DescriptorRecord>
<DescriptorRecord DescriptorClass = "1">
  <DescriptorUI>D000002</DescriptorUI>
  <DescriptorName>
   <String>Temefos</String>
  </DescriptorName>
  <DateCreated>
   <Year>1999</Year>
   <Month>01</Month>
   <Day>01</Day>
  </DateCreated>
  <DateRevised>
   <Year>2013</Year>
   <Month>07</Month>
   <Day>08</Day>
  </DateRevised>
  <TreeNumberList>
   <TreeNumber>D02.705.400.625.800</TreeNumber>
   <TreeNumber>D02.705.539.345.800</TreeNumber>
   <TreeNumber>D02.886.300.692.800</TreeNumber>
  </TreeNumberList>
  <ConceptList>
   <Concept PreferredConceptYN="Y">
    <ConceptUI>M0000002</ConceptUI>
    <ConceptName>
     <String>Temefos</String>
    </ConceptName>
    <CASN1Name>Phosphorothioic acid, O,O'-(thiodi-4,1-phenylene) O,O,O',O'-tetramethyl ester</CASN1Name>
    <RegistryNumber>ONP3ME32DL</RegistryNumber>
    <ScopeNote>An organothiophosphate insecticide.
    </ScopeNote>
    <RelatedRegistryNumberList>
     <RelatedRegistryNumber>3383-96-8 (Temefos)</RelatedRegistryNumber>
    </RelatedRegistryNumberList>
    <ConceptRelationList>
     <ConceptRelation RelationName="NRW">
     <Concept1UI>M0000002</Concept1UI>
     <Concept2UI>M0352201</Concept2UI>
     </ConceptRelation>
     <ConceptRelation RelationName="NRW">
     <Concept1UI>M0000002</Concept1UI>
     <Concept2UI>M0352200</Concept2UI>
     </ConceptRelation>
    </ConceptRelationList>
    <TermList>
     <Term  ConceptPreferredTermYN="Y"  IsPermutedTermYN="N"  LexicalTag="NON"  RecordPreferredTermYN="Y">
      <TermUI>T000008</TermUI>
      <String>Temefos</String>
      <DateCreated>
       <Year>1999</Year>
       <Month>01</Month>
       <Day>01</Day>
      </DateCreated>
      <ThesaurusIDlist>
       <ThesaurusID>FDA SRS (2014)</ThesaurusID>
       <ThesaurusID>INN (19XX)</ThesaurusID>
       <ThesaurusID>USAN (1974)</ThesaurusID>
      </ThesaurusIDlist>
     </Term>
     <Term  ConceptPreferredTermYN="N"  IsPermutedTermYN="N"  LexicalTag="TRD"  RecordPreferredTermYN="N">
      <TermUI>T000007</TermUI>
      <String>Temephos</String>
      <DateCreated>
       <Year>1994</Year>
       <Month>12</Month>
       <Day>01</Day>
      </DateCreated>
      <ThesaurusIDlist>
       <ThesaurusID>NLM (1996)</ThesaurusID>
      </ThesaurusIDlist>
     </Term>
    </TermList>
   </Concept>
  </ConceptList>
 </DescriptorRecord>
</DescriptorRecordSet>
"""

Tags: stringyearpddaymonthrec2thesaurusidconceptui