How to copy data from XML to SQL using Spark

<?xml version="1.0" encoding="utf-8"?>
<FileSummary xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="invoices.xsd">
  <Header>
    <SequenceNumber>1</SequenceNumber>
    <Description>Hello</Description>
    <ShipDate>20180101</ShipDate>
  </Header>
  <FileInvoices>
    <InvoiceNumber>000000A</InvoiceNumber>
    <InvoiceHeader>
      <InvoiceHeaderDate>201800201</InvoiceHeaderDate>
      <InvoiceHeaderDescription>XYZ</InvoiceHeaderDescription>
    </InvoiceHeader>
    <InvoiceItems>
      <ItemId>000001</ItemId>
      <ItemQuantity>000010</ItemQuantity>
      <ItemPrice>000100</ItemPrice>
    </InvoiceItems>
  </FileInvoices>
  <FileInvoices>
    <InvoiceNumber>000000B</InvoiceNumber>
    <InvoiceHeader>
      <InvoiceHeaderDate>201800301</InvoiceHeaderDate>
      <InvoiceHeaderDescription>ABC</InvoiceHeaderDescription>
    </InvoiceHeader>
    <InvoiceItems>
      <ItemId>000002</ItemId>
      <ItemQuantity>000020</ItemQuantity>
      <ItemPrice>000200</ItemPrice>
    </InvoiceItems>
  </FileInvoices>
</FileSummary>

dfXml:pyspark.sql.dataframe.DataFrame
  FileInvoices:array
    element:struct
      InvoiceHeader:struct
        InvoiceHeaderDate:long
        InvoiceHeaderDescription:string
      InvoiceItems:struct
        ItemId:long
        ItemPrice:long
        ItemQuantity:long
      InvoiceNumber:string
  Header:struct
    Description:string
    SequenceNumber:long
    ShipDate:long
  xmlns:xsi:string
  xsi:noNamespaceSchemaLocation:string

Number of records in this dataframe: 1

root
 |-- FileInvoices: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- InvoiceHeader: struct (nullable = true)
 |    |    |    |-- InvoiceHeaderDate: long (nullable = true)
 |    |    |    |-- InvoiceHeaderDescription: string (nullable = true)
 |    |    |-- InvoiceItems: struct (nullable = true)
 |    |    |    |-- ItemId: long (nullable = true)
 |    |    |    |-- ItemPrice: long (nullable = true)
 |    |    |    |-- ItemQuantity: long (nullable = true)
 |    |    |-- InvoiceNumber: string (nullable = true)
 |-- Header: struct (nullable = true)
 |    |-- Description: string (nullable = true)
 |    |-- SequenceNumber: long (nullable = true)
 |    |-- ShipDate: long (nullable = true)
 |-- xmlns:xsi: string (nullable = true)
 |-- xsi:noNamespaceSchemaLocation: string (nullable = true)

3 answers

User

Answer 1 · edited 2024-09-29 06:31:39

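I used the Spark shell to run the steps below. I believe the XML structure is repetitive, so you need to create or reference a schema that matches the XML file. You can also make use of the Brickhouse UDF jar. Then:

1. Create the function as follows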

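sql("""create temporary function numeric_range as 'brickhouse.udf.collect.NumericRange'""")
// array_index is used in step 4 as well; the class name below is assumed from the Brickhouse jar
sql("""create temporary function array_index as 'brickhouse.udf.collect.ArrayIndexUDF'""")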

2. Load the schema

var df=sqlContext.read.format("com.databricks.spark.xml").option("rowTag","FileSummary").load("location of schema file")

val schema=df.schema

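3. Read the actual XML file with that schema and register it as a temporary table

var df1=sqlContext.read.format("com.databricks.spark.xml").option("rowTag","FileSummary").schema(schema).load("location of actual xml file")

df1.registerTempTable("XML_Data")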

4. You need to flatten FileInvoices, as shown below

val df2=sql("""select array_index(FileInvoices,n) as FileInvoices from XML_Data lateral view numeric_range(size(FileInvoices)) n1 as n""")
df2.registerTempTable("xmlData2")

Once everything has been converted to a struct, it is easier to traverse, or to break out individual fields using paths such as FileInvoices.InvoiceHeader.InvoiceHeaderDate.
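As a possible alternative to the Brickhouse functions, the same flattening can be sketched with Spark SQL's built-in explode. This is only a sketch under assumptions: the case classes (InvoiceHeaderRec, FileInvoiceRec, FileSummaryRec) and the object name ExplodeSketch are hypothetical, and the in-memory rows stand in for the DataFrame that spark-xml would return for the sample file.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical case classes mirroring part of the schema printed in the question.
case class InvoiceHeaderRec(InvoiceHeaderDate: Long, InvoiceHeaderDescription: String)
case class FileInvoiceRec(InvoiceHeader: InvoiceHeaderRec, InvoiceNumber: String)
case class FileSummaryRec(FileInvoices: Seq[FileInvoiceRec])

object ExplodeSketch {
  // Produces one output row per element of the FileInvoices array.
  def explodeInvoices(spark: SparkSession, xmlData: DataFrame): DataFrame = {
    xmlData.createOrReplaceTempView("XML_Data")
    spark.sql(
      """select inv as FileInvoices
        |from XML_Data
        |lateral view explode(FileInvoices) t as inv""".stripMargin)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("explode-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // In-memory stand-in for the single FileSummary row read from the XML file.
    val xmlData = Seq(FileSummaryRec(Seq(
      FileInvoiceRec(InvoiceHeaderRec(201800201L, "XYZ"), "000000A"),
      FileInvoiceRec(InvoiceHeaderRec(201800301L, "ABC"), "000000B")))).toDF()

    val exploded = explodeInvoices(spark, xmlData)
    // After the explode, nested fields are reachable with plain dotted paths.
    exploded.select(
      "FileInvoices.InvoiceNumber",
      "FileInvoices.InvoiceHeader.InvoiceHeaderDate").show(false)

    spark.stop()
  }
}
```

With the built-in function no extra jar has to be registered, which may be simpler if Brickhouse is not already on the cluster.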

val jdbcUsername = "<username>"
val jdbcPassword = "<password>"
val jdbcHostname = "<hostname>" // typically in the form of servername.database.windows.net
val jdbcPort = 1433
val jdbcDatabase ="<database>"

val jdbc_url = s"jdbc:sqlserver://${jdbcHostname}:${jdbcPort};database=${jdbcDatabase};encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=60;"

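import java.util.Properties

val connectionProperties = new Properties()
connectionProperties.put("user", s"${jdbcUsername}")
connectionProperties.put("password", s"${jdbcPassword}")

// Write the temp table registered in step 4 to a SQL table of the same name
spark.table("xmlData2").write.jdbc(jdbc_url, "xmlData2", connectionProperties)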

User

Answer 2 · edited 2024-09-29 06:31:39

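It depends on what you want to do and what the table structure looks like. I assume you are trying to process many files with Spark and want to load the data into several normalized tables.

For example, you may want to write the header to one table. Header -> FileInvoices is a one-to-many relationship, so the invoices can go into another table.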

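When you read multiple XML files with load("filename*.xml"), set FileSummary as the rowTag. You will then have multiple rows in the DataFrame, one per FileSummary.
You can select the Header columns into another DataFrame and write that to one table.
FileInvoices is an array of structs, so you can explode it into rows and store those in another table.
Further, if each invoice can contain multiple items, you can do another explode to turn the items into rows and store them in yet another table.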
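The split into normalized tables described above could look roughly like the following sketch. The case classes, the object name NormalizeSketch, and the choice of SequenceNumber and InvoiceNumber as linking keys are assumptions made for illustration; the in-memory row stands in for the DataFrame read with rowTag FileSummary. In the sample schema InvoiceItems is a single struct rather than an array, so no second explode is needed here.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

// Hypothetical case classes mirroring part of the schema printed in the question.
case class HeaderRec(Description: String, SequenceNumber: Long, ShipDate: Long)
case class ItemRec(ItemId: Long, ItemPrice: Long, ItemQuantity: Long)
case class InvoiceRec(InvoiceItems: ItemRec, InvoiceNumber: String)
case class SummaryRec(FileInvoices: Seq[InvoiceRec], Header: HeaderRec)

object NormalizeSketch {
  // Parent table: one row per FileSummary, with the Header fields as columns.
  def headerTable(df: DataFrame): DataFrame = df.select(col("Header.*"))

  // Child table: one row per invoice; SequenceNumber is carried along as an assumed foreign key.
  def invoiceTable(df: DataFrame): DataFrame =
    df.select(col("Header.SequenceNumber").as("SequenceNumber"),
              explode(col("FileInvoices")).as("inv"))
      .select(col("SequenceNumber"), col("inv.InvoiceNumber").as("InvoiceNumber"))

  // Item table: one row per invoice item, keyed by InvoiceNumber (also an assumption).
  def itemTable(df: DataFrame): DataFrame =
    df.select(explode(col("FileInvoices")).as("inv"))
      .select(col("inv.InvoiceNumber").as("InvoiceNumber"),
              col("inv.InvoiceItems.ItemId").as("ItemId"),
              col("inv.InvoiceItems.ItemQuantity").as("ItemQuantity"),
              col("inv.InvoiceItems.ItemPrice").as("ItemPrice"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("normalize-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // In-memory stand-in for the DataFrame loaded from the XML files.
    val df = Seq(SummaryRec(
      Seq(InvoiceRec(ItemRec(1L, 100L, 10L), "000000A"),
          InvoiceRec(ItemRec(2L, 200L, 20L), "000000B")),
      HeaderRec("Hello", 1L, 20180101L))).toDF()

    headerTable(df).show(false)
    invoiceTable(df).show(false)
    itemTable(df).show(false)

    spark.stop()
  }
}
```

Each resulting DataFrame can then be written to its own table with write.jdbc, in the same way as the JDBC example in the first answer.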

Alternatively, you can do both explodes and load the resulting DataFrame into one large denormalized table.

Here is an article on how explode works: https://hadoopist.wordpress.com/2016/05/16/how-to-handle-nested-dataarray-of-structures-or-multiple-explodes-in-sparkscala-and-pyspark/

User

Answer 3 · edited 2024-09-29 06:31:39

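Thank you, Subash and Anand. Regarding Subash's answer, I don't have a schema file, so I modified his step 2, replacing "location of schema file" with "location of actual xml file", and it actually worked. After step 3, if I just run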

df2=sql("select * from XML_Data")

Then I ran

[code block not preserved]

So it repeats the same Header struct across multiple rows, and in the FileInvoices column each row holds a single invoice struct. [image: exploded FileInvoices]

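So it looks like I am getting closer to my final goal, but I am still missing a way to create the records automatically in the right order, so that referential integrity is not violated.

Before going further, though, I would appreciate your feedback.

Thanks again

Mauro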
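One way to think about the ordering problem is to load parent tables before the tables that reference them. The sketch below computes such an order from a map of foreign-key dependencies. The table names (Header, Invoice, InvoiceItem), the object name LoadOrderSketch, and the function loadOrder are hypothetical; the thread does not say what the target schema looks like.

```scala
object LoadOrderSketch {
  // dependsOn maps each table to the parent tables it references through foreign keys.
  // Returns an order in which every table comes after all of its parents.
  def loadOrder(dependsOn: Map[String, Set[String]]): List[String] = {
    @annotation.tailrec
    def loop(remaining: Map[String, Set[String]], done: List[String]): List[String] =
      if (remaining.isEmpty) done.reverse
      else {
        // Tables whose parents have all been scheduled already.
        val ready = remaining.collect {
          case (table, parents) if parents.forall(done.contains) => table
        }.toList.sorted
        require(ready.nonEmpty, "cyclic or unknown foreign-key dependency")
        loop(remaining -- ready, ready.reverse ::: done)
      }
    loop(dependsOn, Nil)
  }

  def main(args: Array[String]): Unit = {
    val order = loadOrder(Map(
      "Header"      -> Set.empty[String],
      "Invoice"     -> Set("Header"),
      "InvoiceItem" -> Set("Invoice")))
    println(order.mkString(" -> ")) // prints "Header -> Invoice -> InvoiceItem"
  }
}
```

The DataFrames for each table could then be written with write.jdbc in that order, so that a child row is never inserted before the parent row it points to.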
