Parsing/loading a huge XML file with Spark

<?xml version="1.0" encoding="utf-8"?>
<SomeRoottag>
  <row Id="47513849" PostTypeId="1" />
  <row Id="4751323" PostTypeId="4" />
  <row Id="475546" PostTypeId="1" />
  <row Id="47597" PostTypeId="2" />
</SomeRoottag>

1 answer

User

#1 · Posted 2024-09-30 14:33:16

Don't use SomeRoottag as the rowTag. That tells Spark to treat the whole document as a single row. Instead, use:

df = (sqlContext.read.format('xml')
    .option("rowTag", "row")
    .load("/tmp/xmlfile.xml"))

There is no need to explode now either:

^{pr2}$

Edit:

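Given your edit, you are hitting a known bug. See Self-closing tags are not supported as top-level rows #92. There appears to be no progress on resolving it at the moment, so you may have to do one of the following:

Fix the issue yourself.

Parse the file manually. If each element always sits on a single line, this is easy to do with a udf.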

from pyspark.sql.functions import col, udf
from lxml import etree

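# Returns a struct (id, postTypeId); malformed lines yield null.
@udf("struct<id: string, postTypeId: string>")
def parse(s):
    try:
        attrib = etree.fromstring(s).attrib
    except etree.XMLSyntaxError:
        # Skip lines that are not well-formed XML instead of failing the job.
        return None
    return attrib.get("Id"), attrib.get("PostTypeId")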

(spark.read.text("/tmp/someXML.xml")
    .where(col("value").rlike("^\\s*<row "))
    .select(parse("value").alias("value"))
    .select("value.*")
    .show())

# +--------+----------+
# |      id|postTypeId|
# +--------+----------+
# |47513849|         1|
# | 4751323|         4|
# |  475546|         1|
# |   47597|         2|
# +--------+----------+
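
If you want to check the per-line parsing logic before running it on a cluster, something like the following may help. It is only a minimal sketch: it swaps lxml for the standard library's xml.etree.ElementTree, runs without Spark, and the names parse_row and sample are made up for illustration. It assumes, as above, that every row element sits on its own line.

```python
import xml.etree.ElementTree as ET


def parse_row(line):
    """Return (Id, PostTypeId) for a single-line <row ... /> element.

    Hypothetical helper mirroring the udf above; returns None when the
    line is not well-formed XML.
    """
    try:
        attrib = ET.fromstring(line.strip()).attrib
    except ET.ParseError:
        return None
    return attrib.get("Id"), attrib.get("PostTypeId")


# Sample document copied from the question.
sample = """<?xml version="1.0" encoding="utf-8"?>
<SomeRoottag>
  <row Id="47513849" PostTypeId="1" />
  <row Id="4751323" PostTypeId="4" />
  <row Id="475546" PostTypeId="1" />
  <row Id="47597" PostTypeId="2" />
</SomeRoottag>"""

# Same idea as the rlike filter: keep only lines that start with "<row ".
rows = [
    parse_row(line)
    for line in sample.splitlines()
    if line.lstrip().startswith("<row ")
]

for row in rows:
    print(row)
```

A missing attribute comes back as None rather than raising, which matches how attrib.get behaves in the udf version.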
