PySpark: how to explode a column containing an array of dictionaries with embedded dictionaries (Spark NLP annotator output)

    # content of one cell as an example:
    d = [{"annotatorType": "chunk", "begin": 2740, "end": 2747, "result": "•Ability",
          "metadata": {"entity": "ORG", "sentence": "8", "chunk": "22"},
          "embeddings": [], "sentence_embeddings": []},
         {"annotatorType": "chunk", "begin": 2740, "end": 2747, "result": "Fedex",
          "metadata": {"entity": "ORG", "sentence": "8", "chunk": "22"},
          "embeddings": [], "sentence_embeddings": []}]

    # IntegerType is used in the schema below, so it has to be imported as well
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType
    from array import array

    schema = StructType([StructField('annotatorType', StringType(), True),
                         StructField('begin', IntegerType(), True),
                         StructField('end', IntegerType(), True),
                         StructField('result', StringType(), True),
                         StructField('sentence', StringType(), True),
                         StructField('chunk', StringType(), True),
                         StructField('metadata', StructType((StructField('entity', StringType(), True),
                                                             StructField('sentence', StringType(), True),
                                                             StructField('chunk', StringType(), True))), True),
                         StructField('embeddings', StringType(), True),
                         StructField('sentence_embeddings', StringType(), True)])

    df = spark.createDataFrame(d, schema=schema)
    df.show()

    +-------------+-----+----+----------------+------------------------+----------+-------------------+
    |annotatorType|begin| end|result          |metadata                |embeddings|sentence_embeddings|
    +-------------+-----+----+----------------+------------------------+----------+-------------------+
    |        chunk|  166| 169|Lyft            |[MISC]                  |        []|                 []|
    |        chunk|   11|  14|Lyft            |[MISC]                  |        []|                 []|
    |        chunk|   52|  55|Lyft.           |[MISC]                  |        []|                 []|
    |        chunk| [..]|[..]|[Lyft,Lyft,     |[MISC,MISC,MISC,        |        []|                 []|
    |             |     |    |FedEx Ground..] |ORG,LOC,ORG,ORG,ORG,ORG]|          |                   |
    +-------------+-----+----+----------------+------------------------+----------+-------------------+

    new_df = sqlContext.read.json(ent2.rdd.map(lambda r: r.entities2))
    new_df.show()

    +-------------+-----+----------+----+------------+----------------+-------------------+
    |annotatorType|begin|embeddings| end|    metadata|          result|sentence_embeddings|
    +-------------+-----+----------+----+------------+----------------+-------------------+
    |        chunk|  166|        []| 169|[0, MISC, 0]|            Lyft|                 []|
    |        chunk|   11|        []|  14|[0, MISC, 0]|            Lyft|                 []|
    |        chunk|   52|        []|  55|[0, MISC, 1]|            Lyft|                 []|
    |        chunk|    0|        []|  11| [0, ORG, 0]|    FedEx Ground|                 []|
    |        chunk|  717|        []| 720| [1, LOC, 4]|            Dock|                 []|
    |        chunk|  811|        []| 816| [2, ORG, 5]|          Parcel|                 []|
    |        chunk| 1080|        []|1095| [3, ORG, 6]|Parcel Assistant|                 []|
    |        chunk| 1102|        []|1108| [4, ORG, 7]|         • Daily|                 []|
    |        chunk| 1408|        []|1417| [5, ORG, 8]|      Assistants|                 []|
    +-------------+-----+----------+----+------------+----------------+-------------------+

    def flatten(my_dict):
        d_result = defaultdict(list)
        for sub in my_dict:
            val = sub['result']
            d_result["result"].append(val)
        return d_result["result"]

    ent = ent.withColumn('result', flatten(df.entities))

    TypeError: Column is not iterable

1 Answer

User

#1 · Posted on 2024-09-26 17:42:12

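The reason you are getting null is that the schema variable does not exactly represent the list of dictionaries you are passing in as data: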

    from pyspark.shell import *
    from pyspark.sql.types import *

    schema = StructType([StructField('result', StringType(), True),
                 StructField('metadata', StructType((StructField('entity', StringType(), True),
                                                     StructField('sentence', StringType(), True),
                                                     StructField('chunk', StringType(), True))), True)])

    # d1 is the list of dictionaries passed as data (named d in the question)
    df = spark.createDataFrame(d1, schema=schema)
    df.show()
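
One way to spot this kind of mismatch before calling `createDataFrame` is to compare the keys that actually occur in the records with the field names the schema declares. The following is a minimal plain-Python sketch, not part of the original answer: `schema_mismatch` and the hard-coded `declared` list are hypothetical names, and the list simply mirrors the top-level fields of the schema in the question.

```python
def schema_mismatch(records, declared_fields):
    """Return (fields declared but absent from every record,
    keys present in the records but not declared)."""
    seen = set()
    for record in records:
        seen.update(record.keys())
    declared = set(declared_fields)
    return sorted(declared - seen), sorted(seen - declared)


# One record shaped like the cell shown in the question.
records = [{
    "annotatorType": "chunk", "begin": 2740, "end": 2747, "result": "Fedex",
    "metadata": {"entity": "ORG", "sentence": "8", "chunk": "22"},
    "embeddings": [], "sentence_embeddings": [],
}]

# Top-level field names as declared in the question's schema (assumption:
# copied by hand here rather than read from a StructType).
declared = ["annotatorType", "begin", "end", "result", "sentence", "chunk",
            "metadata", "embeddings", "sentence_embeddings"]

missing, undeclared = schema_mismatch(records, declared)
print(missing)     # fields with no matching key in the data
print(undeclared)  # keys in the data that the schema does not declare
```

In this example `sentence` and `chunk` are flagged: they are declared at the top level but only exist inside `metadata`, which is consistent with the nulls described above.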

If you prefer a custom solution, you can try a pure Python/pandas approach.

(code sample not preserved in the source)
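
As an illustration of what a pure-Python approach could look like (this is a sketch under assumptions, not the answer's original snippet), each annotation dictionary can be copied with the entries of its nested `metadata` dictionary lifted to the top level. The helper name `flatten_annotation` and the `metadata_` prefix are hypothetical choices.

```python
def flatten_annotation(annotation, nested_key="metadata"):
    """Copy one annotation dict, lifting the nested dict's entries to the top level."""
    flat = {k: v for k, v in annotation.items() if k != nested_key}
    for key, value in annotation.get(nested_key, {}).items():
        # Prefix the nested keys so they cannot collide with top-level keys.
        flat[f"{nested_key}_{key}"] = value
    return flat


# The example cell from the question: a list of annotation dictionaries.
cell = [
    {"annotatorType": "chunk", "begin": 2740, "end": 2747, "result": "•Ability",
     "metadata": {"entity": "ORG", "sentence": "8", "chunk": "22"},
     "embeddings": [], "sentence_embeddings": []},
    {"annotatorType": "chunk", "begin": 2740, "end": 2747, "result": "Fedex",
     "metadata": {"entity": "ORG", "sentence": "8", "chunk": "22"},
     "embeddings": [], "sentence_embeddings": []},
]

# One flat dictionary per annotation, i.e. one row per exploded element.
rows = [flatten_annotation(annotation) for annotation in cell]
for row in rows:
    print(row)
```

If pandas is available, `pandas.DataFrame(rows)` would turn these rows into a table, and `pandas.json_normalize(cell)` should give a similar flat layout directly, with dotted column names such as `metadata.entity`.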

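Edit

After reading through everything you have tried, I think {} can still be used even in fairly complex cases. I don't have your original variable, but I could extract it from your image, though the classroom-teacher and instructional values are no longer included. Hopefully it is useful all the same.

You can always create a mock dataframe with the desired structure and get its schema.

For complex cases with nested data types, you can use the SparkContext and read the resulting JSON format: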

    import itertools

    from pyspark.shell import *
    from pyspark.sql.functions import *
    from pyspark.sql.types import *

    # assume two lists under two dictionary keys, to make four cells
    # since I only have entities2, I just replicate it
    sample = {
        'single_list': [{'annotatorType': 'chunk', 'begin': '166', 'end': '169', 'result': 'Lyft',
                         'metadata': {'entity': 'MISC', 'sentence': '0', 'chunk': '0'}, 'embeddings': [],
                         'sentence_embeddings': []},
                        {'annotatorType': 'chunk', 'begin': '11', 'end': '14', 'result': 'Lyft',
                         'metadata': {'entity': 'MISC', 'sentence': '0', 'chunk': '0'}, 'embeddings': [],
                         'sentence_embeddings': []},
                        {'annotatorType': 'chunk', 'begin': '52', 'end': '55', 'result': 'Lyft',
                         'metadata': {'entity': 'MISC', 'sentence': '1', 'chunk': '0'}, 'embeddings': [],
                         'sentence_embeddings': []}],
        'frankenstein': [
            {'annotatorType': 'chunk', 'begin': '0', 'end': '11', 'result': 'FedEx Ground',
             'metadata': {'entity': 'ORG', 'sentence': '0', 'chunk': '0'}, 'embeddings': [],
             'sentence_embeddings': []},
            {'annotatorType': 'chunk', 'begin': '717', 'end': '720', 'result': 'Dock',
             'metadata': {'entity': 'LOC', 'sentence': '4', 'chunk': '1'}, 'embeddings': [],
             'sentence_embeddings': []},
            {'annotatorType': 'chunk', 'begin': '811', 'end': '816', 'result': 'Parcel',
             'metadata': {'entity': 'ORG', 'sentence': '5', 'chunk': '2'}, 'embeddings': [],
             'sentence_embeddings': []},
            {'annotatorType': 'chunk', 'begin': '1080', 'end': '1095', 'result': 'Parcel Assistant',
             'metadata': {'entity': 'ORG', 'sentence': '6', 'chunk': '3'}, 'embeddings': [],
             'sentence_embeddings': []},
            {'annotatorType': 'chunk', 'begin': '1102', 'end': '1108', 'result': '* Daily',
             'metadata': {'entity': 'ORG', 'sentence': '7', 'chunk': '4'}, 'embeddings': [],
             'sentence_embeddings': []},
            {'annotatorType': 'chunk', 'begin': '1408', 'end': '1417', 'result': 'Assistants',
             'metadata': {'entity': 'ORG', 'sentence': '8', 'chunk': '5'}, 'embeddings': [],
             'sentence_embeddings': []}]
    }

    # since they are structurally different, get two dataframes
    df_single_list = spark.read.json(sc.parallelize(sample.get('single_list')))
    df_frankenstein = spark.read.json(sc.parallelize(sample.get('frankenstein')))

    # print a blank line so the first table border renders cleanly
    print('\n')

    # list to create a dataframe schema
    annotatorType = []
    begin = []
    embeddings = []
    end = []
    metadata = []
    result = []
    sentence_embeddings = []

    # PEP 8 would prefer a named function (def) over assigning a lambda here;
    # a dictionary of per-column actions could also replace the IF statements
    function_metadata = lambda x: [x.entity]
    for k, i in enumerate(df_frankenstein.columns):
        if i == 'annotatorType':
            annotatorType.append(df_frankenstein.select(i).rdd.flatMap(lambda x: x).collect())
        if i == 'begin':
            begin.append(df_frankenstein.select(i).rdd.flatMap(lambda x: x).collect())
        if i == 'embeddings':
            embeddings.append(df_frankenstein.select(i).rdd.flatMap(lambda x: x).collect())
        if i == 'end':
            end.append(df_frankenstein.select(i).rdd.flatMap(lambda x: x).collect())
        if i == 'metadata':
            _temp = list(map(function_metadata, df_frankenstein.select(i).rdd.flatMap(lambda x: x).collect()))
            metadata.append(list(itertools.chain.from_iterable(_temp)))
        if i == 'result':
            result.append(df_frankenstein.select(i).rdd.flatMap(lambda x: x).collect())
        if i == 'sentence_embeddings':
            sentence_embeddings.append(df_frankenstein.select(i).rdd.flatMap(lambda x: x).collect())

    # headers
    annotatorType_header = 'annotatorType'
    begin_header = 'begin'
    embeddings_header = 'embeddings'
    end_header = 'end'
    metadata_header = 'metadata'
    result_header = 'result'
    sentence_embeddings_header = 'sentence_embeddings'
    metadata_entity_header = 'metadata.entity'

    frankenstein_schema = StructType(
        [StructField(annotatorType_header, ArrayType(StringType())),
         StructField(begin_header, ArrayType(StringType())),
         StructField(embeddings_header, ArrayType(StringType())),
         StructField(end_header, ArrayType(StringType())),
         StructField(metadata_header, ArrayType(StringType())),
         StructField(result_header, ArrayType(StringType())),
         StructField(sentence_embeddings_header, ArrayType(StringType()))
         ])

    # list of lists of lists of lists of ... lists
    frankenstein_list = [[annotatorType, begin, embeddings, end, metadata, result, sentence_embeddings]]
    df_frankenstein = spark.createDataFrame(frankenstein_list, schema=frankenstein_schema)

    print(df_single_list.schema)
    print(df_frankenstein.schema)

    # let's see how it is
    df_single_list.select(
        annotatorType_header,
        begin_header,
        end_header,
        result_header,
        array(metadata_entity_header),
        embeddings_header,
        sentence_embeddings_header).show()

    # let's see again
    df_frankenstein.select(
        annotatorType_header,
        begin_header,
        end_header,
        result_header,
        metadata_header,
        embeddings_header,
        sentence_embeddings_header).show()

Output:

    StructType(List(StructField(annotatorType,StringType,true),StructField(begin,StringType,true),StructField(embeddings,ArrayType(StringType,true),true),StructField(end,StringType,true),StructField(metadata,StructType(List(StructField(chunk,StringType,true),StructField(entity,StringType,true),StructField(sentence,StringType,true))),true),StructField(result,StringType,true),StructField(sentence_embeddings,ArrayType(StringType,true),true)))
    StructType(List(StructField(annotatorType,ArrayType(StringType,true),true),StructField(begin,ArrayType(StringType,true),true),StructField(embeddings,ArrayType(StringType,true),true),StructField(end,ArrayType(StringType,true),true),StructField(metadata,ArrayType(StringType,true),true),StructField(result,ArrayType(StringType,true),true),StructField(sentence_embeddings,ArrayType(StringType,true),true)))

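    +-------------+-----+---+------+----------------------+----------+-------------------+
    |annotatorType|begin|end|result|array(metadata.entity)|embeddings|sentence_embeddings|
    +-------------+-----+---+------+----------------------+----------+-------------------+
    |        chunk|  166|169|  Lyft|                [MISC]|        []|                 []|
    |        chunk|   11| 14|  Lyft|                [MISC]|        []|                 []|
    |        chunk|   52| 55|  Lyft|                [MISC]|        []|                 []|
    +-------------+-----+---+------+----------------------+----------+-------------------+
    +--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
    |       annotatorType|               begin|                 end|              result|            metadata|          embeddings| sentence_embeddings|
    +--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
    |[[chunk, chunk, c...|[[0, 717, 811, 10...|[[11, 720, 816, 1...|[[FedEx Ground, D...|[[ORG, LOC, ORG, ...|[[[], [], [], [],...|[[[], [], [], [],...|
    +--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+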

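You have to select from each dataframe separately because their data types differ, but the content is ready (if I understood your requirement correctly from the output).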
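
For reference, the column-by-column loop above amounts to turning a list of records into one list per field while keeping only `entity` from `metadata`. A plain-Python sketch of that reshaping, without Spark, might look like the following; `collapse_columns` is a hypothetical helper, and the two records are copied from the `frankenstein` sample.

```python
def collapse_columns(records, nested_key="metadata", keep="entity"):
    """Turn a list of annotation dicts into one dict of lists (one list per field)."""
    columns = {}
    for record in records:
        for key, value in record.items():
            if key == nested_key:
                value = value[keep]  # keep a single entry of the nested dict
            columns.setdefault(key, []).append(value)
    return columns


# Two records taken from the 'frankenstein' list in the sample above.
records = [
    {'annotatorType': 'chunk', 'begin': '0', 'end': '11', 'result': 'FedEx Ground',
     'metadata': {'entity': 'ORG', 'sentence': '0', 'chunk': '0'}, 'embeddings': [],
     'sentence_embeddings': []},
    {'annotatorType': 'chunk', 'begin': '717', 'end': '720', 'result': 'Dock',
     'metadata': {'entity': 'LOC', 'sentence': '4', 'chunk': '1'}, 'embeddings': [],
     'sentence_embeddings': []},
]

collapsed = collapse_columns(records)
print(collapsed['result'])    # every 'result' value, in record order
print(collapsed['metadata'])  # only the 'entity' entries
```

Each value in `collapsed` corresponds to one array cell of the single-row dataframe, which is the shape of the last row in the desired output.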

（͜͡ʖ͡͡）
