Unable to read a gz file from GCS in Beam via Pub/Sub

Posted 2024-09-29 19:21:42


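We are trying to load data from GCS into Beam by way of Pub/Sub: as soon as new data is uploaded to GCS, Pub/Sub should let Beam pick it up promptly. However, the pipeline fails to load the data from GCS.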

My pipeline is:


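import json

import apache_beam as beam
from apache_beam.metrics import Metrics


class ParseAndFilterDo(beam.DoFn):
    def __init__(self):  # was misspelled __int__, so the counter was never created
        super(ParseAndFilterDo, self).__init__()
        self.num_parse_errors = Metrics.counter(self.__class__, 'num_parse_errors')

    def process(self, element):
        # Each element is expected to be a single line of JSON text.
        text_line = element.strip()
        data = {}
        try:
            data = json.loads(text_line)
            print(data)
            yield data
        except Exception as ex:
            # Count lines that fail to parse instead of failing the bundle.
            print("Parse json exception:", ex)
            self.num_parse_errors.inc()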

 ...

    pipeline_args.extend([
        '--runner=DirectRunner',
        '--staging_location=gs://my-transform-bucket/stage',
        '--temp_location=gs://my-transform-bucket/temp',
        '--job_name=test-sub-job',
    ])
    options = PipelineOptions(pipeline_args)
    options.view_as(SetupOptions).save_main_session = True
    options.view_as(StandardOptions).streaming = True

    with beam.Pipeline(options=options) as p:
        events = p | "ReadPubSub" >> beam.io.ReadFromPubSub(topic=args.topic)

        raw_events = (
            events
            | 'DecodeString' >> beam.Map(lambda b: b.decode('utf-8'))
            | "ParseAndFilterDo" >> beam.ParDo(ParseAndFilterDo())
        )

I also set up the topic as the notification target for the GCS bucket:

gsutil notification create -t testtopic -f json -e OBJECT_FINALIZE gs://my-test-bucket

The Google Cloud Pub/Sub API is also enabled.

I then uploaded gzipped JSON data (a .gz file) to my-test-bucket, and the log shows:

DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): oauth2.googleapis.com:443
DEBUG:urllib3.connectionpool:https://oauth2.googleapis.com:443 "POST /token HTTP/1.1" 200 None
{u'kind': u'storage#object', u'contentType': u'application/x-gzip', u'name': u'log_2019-08-12T00.4763-4caf-b712-cd1b815c203932.log.gz', u'timeCreated': u'2019-08-14T05:47:19.664Z', u'generation': u'1565761639664269', u'md5Hash': u'7mAixitzv6WDVVa1ar37Vw==', u'bucket': u'my-test-bucket', u'updated': u'2019-08-14T05:47:19.664Z', u'crc32c': u'UHiIrQ==', u'metageneration': u'1', u'mediaLink': u'https://www.googleapis.com/download/storage/v1/b/my-test-bucket/o/log_2019-08-12T00.4763-4caf-b712-cd1b815c203932.log.gz?generation=15657616399&alt=media', u'storageClass': u'MULTI_REGIONAL', u'timeStorageClassUpdated': u'2019-08-14T05:47:19.664Z', u'etag': u'CI2V19LEAE=', u'id': u'my-test-bucket/log_2019-08-12T00.4763-4caf-b712-cd1b815c203932.log.gz/1565761639664269', u'selfLink': u'https://www.googleapis.com/storage/v1/b/my-test-bucket/o/log_2019-08-12T00.4763-4caf-b712-cd1b815c203932.log.gz', u'size': u'55259'}
DEBUG:root:Connecting using Google Application Default Credentials.
DEBUG:root:Attempting to flush to all destinations. Total buffered: 0

It seems that only the storage object event is delivered here; there is no data payload for Beam to read.

Is something wrong with my configuration, or am I missing something?

  • Beam version: 2.14.0
  • google-cloud-pubsub: 0.45.0
  • grpcio: 1.22.0

1 Answer
Answer 1 · posted 2024-09-29 19:21:42

Pub/Sub notifications contain only the event metadata; the uploaded object itself is not sent in the Pub/Sub message.

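If I understand the use case correctly and you want to read the file contents, you need to parse the notification to get the full file path, and then pass the resulting PCollection to beam.io.ReadAllFromText(), like this: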

import logging

import apache_beam as beam


class ExtractFn(beam.DoFn):
    def process(self, element):
        file_name = 'gs://' + "/".join(element['id'].split("/")[:-1])
        logging.info('File: ' + file_name)
        yield file_name

Note that I used the id field of the sample message you provided, and dropped its last part, which I guess is used for versioning.
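As an alternative to trimming the id field, the path could also be assembled from the bucket and name fields that appear in the sample notification in the question. The following is only a sketch under that assumption: `gcs_path_from_notification` is a hypothetical helper name, and it falls back to the id-based approach when those fields are absent.

```python
import json


def gcs_path_from_notification(payload):
    """Build a gs:// path from a Cloud Storage notification payload.

    `payload` is the raw Pub/Sub message data (bytes or str), assumed to be
    JSON shaped like the sample message logged in the question.
    """
    if isinstance(payload, bytes):
        payload = payload.decode('utf-8')
    event = json.loads(payload)

    # Preferred: use the explicit bucket and object name fields.
    if 'bucket' in event and 'name' in event:
        return 'gs://{}/{}'.format(event['bucket'], event['name'])

    # Fallback (same idea as ExtractFn): drop the last "/"-separated part
    # of the id, which looks like "bucket/object/<trailing number>".
    bucket_and_name, _, _trailing = event['id'].rpartition('/')
    return 'gs://' + bucket_and_name
```

A function like this could be wrapped in beam.Map in place of the JSON-parsing and ExtractFn steps, since it accepts the undecoded message bytes directly.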

My main pipeline is:

(p
  | 'Read Messages' >> beam.io.ReadFromPubSub(topic="projects/PROJECT/topics/TOPIC")
  | 'Convert Message to JSON' >> beam.Map(lambda message: json.loads(message))
  | 'Extract File Names' >> beam.ParDo(ExtractFn())
  | 'Read Files' >> beam.io.ReadAllFromText()
  | 'Write Results' >> beam.ParDo(LogFn()))

The full code is here.
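To sanity-check a sample file outside the pipeline, the per-line work that would follow the 'Read Files' step (decompress, then parse each line as JSON, as ParseAndFilterDo does in the question) can be mimicked locally. This is a standard-library sketch only; `parse_gzipped_json_lines` is a hypothetical helper and not part of Beam. Inside the pipeline, ReadAllFromText is expected to handle the decompression itself, since its compression_type defaults to automatic detection from the file extension.

```python
import gzip
import io
import json


def parse_gzipped_json_lines(raw_bytes):
    """Decompress gzip bytes and parse every non-empty line as JSON.

    Returns a tuple (records, num_parse_errors).
    """
    records = []
    num_parse_errors = 0
    with gzip.open(io.BytesIO(raw_bytes), mode='rt', encoding='utf-8') as lines:
        for line in lines:
            text_line = line.strip()
            if not text_line:
                continue  # skip blank lines rather than counting them as errors
            try:
                records.append(json.loads(text_line))
            except ValueError:  # json.JSONDecodeError is a ValueError subclass
                num_parse_errors += 1
    return records, num_parse_errors
```

Running this against the bytes of one of the uploaded .gz files gives a quick indication of how many lines the pipeline's parse step would accept or reject.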

I tested it with the DirectRunner and the 2.14.0 SDK, the public file gs://apache-beam-samples/shakespeare/kinglear.txt, and a test message (not a real notification):

python notifications.py --streaming
gcloud pubsub topics publish $TOPIC_NAME --message='{"id": "apache-beam-samples/shakespeare/kinglear.txt/1565795872"}'

It started printing Shakespeare's King Lear:

INFO:root:File: gs://apache-beam-samples/shakespeare/kinglear.txt
INFO:oauth2client.transport:Attempting refresh to obtain initial access_token
...
INFO:root:  KING LEAR
INFO:root:
INFO:root:
INFO:root:  DRAMATIS PERSONAE
INFO:root:
INFO:root:
INFO:root:LEAR  king of Britain  (KING LEAR:)
INFO:root:
INFO:root:KING OF FRANCE:
