PySpark UDF only receives None as input

@udf(returnType=StringType())
def get_asn(ip_addr):
    from fm_kafka2parquet.asn_lookup import AsnLookup
    result = AsnLookup\
        .get_instance(ASN_DB_PATH)\
        .get().lookup(ip_addr)[0]  # first record from tuple is ASN number
    if result is None:
        return "n/a"
    return result

# data frame for netflow reading
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", CONFIG_KAFKA_BOOTSTRAP) \
    .option("subscribe", CONFIG_KAFKA_TOPIC) \
    .option("startingOffsets", "latest") \
    .load() \
    .selectExpr("CAST(value AS STRING)") \
    .withColumn("net", from_json("value", Structures.get_ipfix_structure())) \
    .select("net.*")

# remove ipfix prefix in case of ipfixv1 collector
temp_list = []
for c in df.columns:
    new_name = c.replace('ipfix.', '')
    temp_list.append(new_name)
df = df.toDF(*temp_list)

# enrichment
edf = df \
    .withColumn("sourceAS", get_asn('sourceIPv4Address')) \
    .withColumn("destinationAS", get_asn('destinationIPv4Address'))

2 answers

User

#1 · edited 2024-09-30 01:27:06

Try passing the column explicitly, as shown below (F here is pyspark.sql.functions):

.withColumn("sourceAS", get_asn(F.col('sourceIPv4Address')))

User

#2 · edited 2024-09-30 01:27:06

Also, this looks suspicious:

# remove ipfix prefix in case of ipfixv1 collector
temp_list = []
for c in df.columns:
    new_name = c.replace('ipfix.', '')
    temp_list.append(new_name)
df = df.toDF(*temp_list)

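You are changing the column names and then selecting them, but the new column names are not in the data frame, right? So it must be returning an empty data frame.

If you want to rename the columns, use: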

df = df.withColumnRenamed(c, c.replace('ipfix.', ''))
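
The line above renames a single column `c`. Applied to every column, it might look roughly like the sketch below. The helper name `strip_column_prefix` is made up for illustration, and it strips the prefix only at the start of a name, which is slightly stricter than the `replace` call in the question.

```python
def strip_column_prefix(df, prefix="ipfix."):
    """Return a DataFrame with `prefix` removed from the start of each column name.

    `df` is expected to be a Spark DataFrame; only `df.columns` and
    `df.withColumnRenamed` are used. The function name is hypothetical.
    """
    # df.columns is read once, before any renaming takes place
    for name in df.columns:
        if name.startswith(prefix):
            # withColumnRenamed returns a new DataFrame, so reassign each time
            df = df.withColumnRenamed(name, name[len(prefix):])
    return df
```

In the question's pipeline this would be called as `df = strip_column_prefix(df)` right after `.select("net.*")`.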

For details on how to clean up column names in PySpark, see https://www.youtube.com/watch?v=vAHPAP9Oagc&t=1s
