How to copy from XML to SQL using Spark


I need to open several XML files stored in Azure Data Lake Storage and copy their contents into an Azure SQL DB. This is the XML file structure:

<pre><code><?xml version="1.0" encoding="utf-8"?>
<FileSummary xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:noNamespaceSchemaLocation="invoices.xsd">
  <Header>
    <SequenceNumber>1</SequenceNumber>
    <Description>Hello</Description>
    <ShipDate>20180101</ShipDate>
  </Header>
  <FileInvoices>
    <InvoiceNumber>000000A</InvoiceNumber>
    <InvoiceHeader>
      <InvoiceHeaderDate>201800201</InvoiceHeaderDate>
      <InvoiceHeaderDescription>XYZ</InvoiceHeaderDescription>
    </InvoiceHeader>
    <InvoiceItems>
      <ItemId>000001</ItemId>
      <ItemQuantity>000010</ItemQuantity>
      <ItemPrice>000100</ItemPrice>
    </InvoiceItems>
  </FileInvoices>
  <FileInvoices>
    <InvoiceNumber>000000B</InvoiceNumber>
    <InvoiceHeader>
      <InvoiceHeaderDate>201800301</InvoiceHeaderDate>
      <InvoiceHeaderDescription>ABC</InvoiceHeaderDescription>
    </InvoiceHeader>
    <InvoiceItems>
      <ItemId>000002</ItemId>
      <ItemQuantity>000020</ItemQuantity>
      <ItemPrice>000200</ItemPrice>
    </InvoiceItems>
  </FileInvoices>
</FileSummary>
</code></pre>

I used Azure Databricks to mount the Data Lake store as "/mnt/testdata", and then tried to open the sample file above with the following command:

^{pr2}$

which returns the following:

<pre><code>dfXml:pyspark.sql.dataframe.DataFrame
  FileInvoices:array
    element:struct
      InvoiceHeader:struct
        InvoiceHeaderDate:long
        InvoiceHeaderDescription:string
      InvoiceItems:struct
        ItemId:long
        ItemPrice:long
        ItemQuantity:long
      InvoiceNumber:string
  Header:struct
    Description:string
    SequenceNumber:long
    ShipDate:long
  xmlns:xsi:string
  xsi:noNamespaceSchemaLocation:string

Number of records in this dataframe: 1

root
 |-- FileInvoices: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- InvoiceHeader: struct (nullable = true)
 |    |    |    |-- InvoiceHeaderDate: long (nullable = true)
 |    |    |    |-- InvoiceHeaderDescription: string (nullable = true)
 |    |    |-- InvoiceItems: struct (nullable = true)
 |    |    |    |-- ItemId: long (nullable = true)
 |    |    |    |-- ItemPrice: long (nullable = true)
 |    |    |    |-- ItemQuantity: long (nullable = true)
 |    |    |-- InvoiceNumber: string (nullable = true)
 |-- Header: struct (nullable = true)
 |    |-- Description: string (nullable = true)
 |    |-- SequenceNumber: long (nullable = true)
 |    |-- ShipDate: long (nullable = true)
 |-- xmlns:xsi: string (nullable = true)
 |-- xsi:noNamespaceSchemaLocation: string (nullable = true)
</code></pre>

So the command above does seem to read the file correctly. I can also connect to my well-normalized Azure SQL DB and write the records into a specific table:

<pre><code>dfXml.write.jdbc(url=jdbcUrl, table="dest_table", mode="overwrite", properties=connectionProperties)
</code></pre>

However, this approach means setting up nested loops and doing a lot of manual work to track the keys of each table and respect referential integrity, none of which takes advantage of the Spark architecture. Is there a best practice (or a pre-built library) that does this in a more automated and scalable way?

I expect this is a common need, so ideally I would use a library that reads the full XML structure shown at the top and automatically extracts the information to insert into the normalized tables.

Thanks very much for any suggestion.

Mauro
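To make the target shape concrete, here is a minimal sketch of the normalization the question describes: one row set per table, with the parent key copied down to each child row. It is plain Python using the standard library's `xml.etree.ElementTree`, not Spark, so it only illustrates how the keys propagate and is not a scalable solution. The function name `normalize`, the three row sets (`files`, `invoices`, `items`), and the choice of `SequenceNumber` and `InvoiceNumber` as keys are assumptions for illustration; the real destination schema is not shown in the question.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample file from the question
# (XML declaration and xsi attributes omitted for brevity).
SAMPLE_XML = """
<FileSummary>
  <Header>
    <SequenceNumber>1</SequenceNumber>
    <Description>Hello</Description>
    <ShipDate>20180101</ShipDate>
  </Header>
  <FileInvoices>
    <InvoiceNumber>000000A</InvoiceNumber>
    <InvoiceHeader>
      <InvoiceHeaderDate>201800201</InvoiceHeaderDate>
      <InvoiceHeaderDescription>XYZ</InvoiceHeaderDescription>
    </InvoiceHeader>
    <InvoiceItems>
      <ItemId>000001</ItemId>
      <ItemQuantity>000010</ItemQuantity>
      <ItemPrice>000100</ItemPrice>
    </InvoiceItems>
  </FileInvoices>
  <FileInvoices>
    <InvoiceNumber>000000B</InvoiceNumber>
    <InvoiceHeader>
      <InvoiceHeaderDate>201800301</InvoiceHeaderDate>
      <InvoiceHeaderDescription>ABC</InvoiceHeaderDescription>
    </InvoiceHeader>
    <InvoiceItems>
      <ItemId>000002</ItemId>
      <ItemQuantity>000020</ItemQuantity>
      <ItemPrice>000200</ItemPrice>
    </InvoiceItems>
  </FileInvoices>
</FileSummary>
"""


def normalize(xml_text):
    """Split one FileSummary document into rows for three hypothetical tables."""
    root = ET.fromstring(xml_text.strip())
    header = root.find("Header")
    seq = header.findtext("SequenceNumber")

    # One row per file; SequenceNumber is assumed to be the file-level key.
    files = [{
        "SequenceNumber": seq,
        "Description": header.findtext("Description"),
        "ShipDate": header.findtext("ShipDate"),
    }]

    invoices, items = [], []
    for inv in root.findall("FileInvoices"):
        number = inv.findtext("InvoiceNumber")
        invoices.append({
            "SequenceNumber": seq,  # assumed foreign key back to the file row
            "InvoiceNumber": number,
            "InvoiceHeaderDate": inv.findtext("InvoiceHeader/InvoiceHeaderDate"),
            "InvoiceHeaderDescription": inv.findtext(
                "InvoiceHeader/InvoiceHeaderDescription"),
        })
        for item in inv.findall("InvoiceItems"):
            items.append({
                "InvoiceNumber": number,  # assumed foreign key back to the invoice row
                "ItemId": item.findtext("ItemId"),
                "ItemQuantity": int(item.findtext("ItemQuantity")),
                "ItemPrice": int(item.findtext("ItemPrice")),
            })
    return files, invoices, items


files, invoices, items = normalize(SAMPLE_XML)
for row in invoices:
    print(row)
```

In Spark, the same three row sets would typically come from selecting the `Header` fields and applying `pyspark.sql.functions.explode` to the `FileInvoices` array column, then writing each resulting DataFrame to its own table with `write.jdbc`, parent tables first so that foreign keys resolve.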
