我正在从几个具有嵌套结构的XML文件创建一个sqlite数据库
我花了大量时间编写了几个不同的数据解析器,使用显式函数、递归和解析的一般规则等等,但是没有一个能够以快速简单的方式解析完整的信息
幸运的是,在做了一些挖掘之后,我发现已经有一个包可以解析我正在处理的许多类型的XML文件;但是,解析器解析到一个记录列表,每个记录中都保留嵌套结构。 这似乎是一个极为常见的问题(给定的记录将包含一个值列表或字典,这表明需要一个单独的表。)-因此可能存在一个简单的解决方案
我在这个问题的底部提供了一个非常简单的xml文件字符串,它抓住了这个问题
使用上述解析器并应用pandas数据帧来可视化我们在这里看到的问题:
# Parse each row of the main table in a generator
import Bio
with open("test.xml") as f:
gen = Bio.Entrez.parse(f)
rec1 = next(gen) # first record
rec2 = next(gen) # second record
print(rec2)
# Recursively explode the columns into new tables
# [rec1, rec2] is just a operating as a small record array here
import pandas as pd
display(pd.DataFrame([rec1, rec2]))
display(pd.DataFrame([rec1, rec2]).set_index('DescriptorUI')['TreeNumberList'].explode().apply(pd.Series))
display(pd.DataFrame([rec1, rec2]).set_index('DescriptorUI')['ConceptList'].explode().apply(pd.Series))
display(pd.DataFrame([rec1, rec2]).set_index('DescriptorUI')['ConceptList'].explode().apply(pd.Series)['ConceptRelationList'].explode().apply(pd.Series))
这将输出记录的字典表示形式以及pandas数据帧,我在这里稍微整理了一下以说明我的观点:
>>> print(rec2)
DictElement({'DescriptorUI': 'D000002', 'DescriptorName': {'String': 'Temefos'}, 'DateCreated': {'Year': '1999', 'Month': '01', 'Day': '01'}, 'DateRevised': {'Year': '2013', 'Month': '07', 'Day': '08'}, 'TreeNumberList': ['D02.705.400.625.800', 'D02.705.539.345.800', 'D02.886.300.692.800'], 'ConceptList': [DictElement({'ConceptUI': 'M0000002', 'ConceptName': {'String': 'Temefos'}, 'CASN1Name': "Phosphorothioic acid, O,O'-(thiodi-4,1-phenylene) O,O,O',O'-tetramethyl ester", 'RegistryNumber': 'ONP3ME32DL', 'ScopeNote': 'An organothiophosphate insecticide.\n ', 'RelatedRegistryNumberList': ['3383-96-8 (Temefos)'], 'ConceptRelationList': [DictElement({'Concept1UI': 'M0000002', 'Concept2UI': 'M0352201'}, attributes={'RelationName': 'NRW'}), DictElement({'Concept1UI': 'M0000002', 'Concept2UI': 'M0352200'}, attributes={'RelationName': 'NRW'})], 'TermList': [DictElement({'TermUI': 'T000008', 'String': 'Temefos', 'DateCreated': {'Year': '1999', 'Month': '01', 'Day': '01'}, 'ThesaurusIDlist': ['FDA SRS (2014)', 'INN (19XX)', 'USAN (1974)']}, attributes={'ConceptPreferredTermYN': 'Y', 'IsPermutedTermYN': 'N', 'LexicalTag': 'NON', 'RecordPreferredTermYN': 'Y'}), DictElement({'TermUI': 'T000007', 'String': 'Temephos', 'DateCreated': {'Year': '1994', 'Month': '12', 'Day': '01'}, 'ThesaurusIDlist': ['NLM (1996)']}, attributes={'ConceptPreferredTermYN': 'N', 'IsPermutedTermYN': 'N', 'LexicalTag': 'TRD', 'RecordPreferredTermYN': 'N'})]}, attributes={'PreferredConceptYN': 'Y'})]}, attributes={'DescriptorClass': '1'})
当然,在pandas中对xmls或已解析的字典进行任何此类处理都是一个糟糕的想法,因为这会非常缓慢
但是,我已经看到通过pyspark创建拼花地板文件,通过pyspark.sql.Row
和flatMap
功能可以非常快地工作,例如:
from pyspark.sql import Row, SQLContext, Window
from pyspark import SparkConf, SparkContext
from pyspark.sql.functions import rank, max, sum, desc
conf = SparkConf().setAppName('mesh_spark')\
.setMaster('local[8]')\
.set('executor.memory', '8g')\
.set('driver.memory', '8g')\
.set('spark.driver.maxResultSize', '0')
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
path_rdd = sc.parallelize(['test.xml'], numSlices=1)
parse_results_rdd = path_rdd.flatMap(lambda x: [Row(file_name=os.path.basename(x), **mesh_dict)
for mesh_dict in parse_generator_func(x)])
mesh_tables = parse_results_rdd.toDF() #.partitionBy('table') maybe partition by table element?
print(medline_df)
mesh_tables.write.parquet(os.path.join(save_dir, 'mesh_raw.parquet'), mode='overwrite')
sc.stop()
我看到了一些关于使用.partitionBy()
和其他一些想法的主张,但我所看到的想法都没有奏效
是否有一种更可靠、更简单的方法来展平此数据结构,或通过pyspark将多个值递归插入到单独的列中,以便自动处理嵌套数据以生成其他表?
下面是一个简单的测试xml:
"""
<?xml version="1.0"?>
<!DOCTYPE DescriptorRecordSet SYSTEM "https://www.nlm.nih.gov/databases/dtd/nlmdescriptorrecordset_20190101.dtd">
<DescriptorRecordSet>
<DescriptorRecord>
<DescriptorUI>D000001</DescriptorUI>
<DescriptorName>
<String>Calcimycin</String>
</DescriptorName>
<DateCreated>
<Year>1974</Year>
<Month>11</Month>
<Day>19</Day>
</DateCreated>
<TreeNumberList>
<TreeNumber>D03.633.100.221.173</TreeNumber>
</TreeNumberList>
<ConceptList>
<Concept PreferredConceptYN="Y">
<ConceptUI>M0000001</ConceptUI>
<ConceptName>
<String>Calcimycin</String>
</ConceptName>
</Concept>
<Concept PreferredConceptYN="N">
<ConceptUI>M0353609</ConceptUI>
<ConceptName>
<String>A-23187</String>
</ConceptName>
<RegistryNumber>0</RegistryNumber>
<ConceptRelationList>
<ConceptRelation RelationName="NRW">
<Concept1UI>M0000001</Concept1UI>
<Concept2UI>M0353609</Concept2UI>
</ConceptRelation>
</ConceptRelationList>
</Concept>
</ConceptList>
</DescriptorRecord>
<DescriptorRecord DescriptorClass = "1">
<DescriptorUI>D000002</DescriptorUI>
<DescriptorName>
<String>Temefos</String>
</DescriptorName>
<DateCreated>
<Year>1999</Year>
<Month>01</Month>
<Day>01</Day>
</DateCreated>
<DateRevised>
<Year>2013</Year>
<Month>07</Month>
<Day>08</Day>
</DateRevised>
<TreeNumberList>
<TreeNumber>D02.705.400.625.800</TreeNumber>
<TreeNumber>D02.705.539.345.800</TreeNumber>
<TreeNumber>D02.886.300.692.800</TreeNumber>
</TreeNumberList>
<ConceptList>
<Concept PreferredConceptYN="Y">
<ConceptUI>M0000002</ConceptUI>
<ConceptName>
<String>Temefos</String>
</ConceptName>
<CASN1Name>Phosphorothioic acid, O,O'-(thiodi-4,1-phenylene) O,O,O',O'-tetramethyl ester</CASN1Name>
<RegistryNumber>ONP3ME32DL</RegistryNumber>
<ScopeNote>An organothiophosphate insecticide.
</ScopeNote>
<RelatedRegistryNumberList>
<RelatedRegistryNumber>3383-96-8 (Temefos)</RelatedRegistryNumber>
</RelatedRegistryNumberList>
<ConceptRelationList>
<ConceptRelation RelationName="NRW">
<Concept1UI>M0000002</Concept1UI>
<Concept2UI>M0352201</Concept2UI>
</ConceptRelation>
<ConceptRelation RelationName="NRW">
<Concept1UI>M0000002</Concept1UI>
<Concept2UI>M0352200</Concept2UI>
</ConceptRelation>
</ConceptRelationList>
<TermList>
<Term ConceptPreferredTermYN="Y" IsPermutedTermYN="N" LexicalTag="NON" RecordPreferredTermYN="Y">
<TermUI>T000008</TermUI>
<String>Temefos</String>
<DateCreated>
<Year>1999</Year>
<Month>01</Month>
<Day>01</Day>
</DateCreated>
<ThesaurusIDlist>
<ThesaurusID>FDA SRS (2014)</ThesaurusID>
<ThesaurusID>INN (19XX)</ThesaurusID>
<ThesaurusID>USAN (1974)</ThesaurusID>
</ThesaurusIDlist>
</Term>
<Term ConceptPreferredTermYN="N" IsPermutedTermYN="N" LexicalTag="TRD" RecordPreferredTermYN="N">
<TermUI>T000007</TermUI>
<String>Temephos</String>
<DateCreated>
<Year>1994</Year>
<Month>12</Month>
<Day>01</Day>
</DateCreated>
<ThesaurusIDlist>
<ThesaurusID>NLM (1996)</ThesaurusID>
</ThesaurusIDlist>
</Term>
</TermList>
</Concept>
</ConceptList>
</DescriptorRecord>
</DescriptorRecordSet>
"""
目前没有回答
相关问题 更多 >
编程相关推荐