I'm trying to write a Python script that streams data from a Google Cloud Storage bucket into BigQuery with the help of a Dataflow pipeline. I can launch a job, but it runs as a batch job rather than a streaming job, and we are not allowed to use Pub/Sub.
Below is the code I'm trying, with the project-specific details replaced by generic placeholders:
from __future__ import absolute_import

import argparse
import json
import logging

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
# This class has all the functions which facilitate data transposition
class WordExtractingDoFn(beam.DoFn):
    def process(self, element):
        # Create a BigQuery row: turn one JSON line from the file into a
        # dict (the real field mapping is redacted; json.loads stands in)
        yield json.loads(element)
def run_bq(argv=None):
    parser = argparse.ArgumentParser()
    schema1 = 'your schema'  # e.g. 'field1:STRING,field2:INTEGER'
    # All command-line arguments being added to the parser
    parser.add_argument(
        '--input', dest='input', required=False,
        default='gs://your-bucket-path/')
    parser.add_argument(
        '--output', dest='output', required=False,
        default='yourdataset.yourtable')
    known_args, pipeline_args = parser.parse_known_args(argv)
    pipeline_args.extend([
        '--runner=DataflowRunner',
        '--project=your-project',
        '--staging_location=gs://your-staging-bucket-path/',
        '--temp_location=gs://your-temp-bucket-path/',
        '--job_name=pubsubbql1',
        '--streaming',
    ])
    pushtobq = WordExtractingDoFn()

    # Pipeline creation begins
    p = beam.Pipeline(options=PipelineOptions(pipeline_args))
    (p
     | 'Read from a File' >> beam.io.ReadFromText(known_args.input)
     | 'String To BigQuery Row' >> beam.ParDo(pushtobq)
     | 'Write to BigQuery' >> beam.io.WriteToBigQuery(
         known_args.output,
         schema=schema1))

    # Run pipeline
    p.run().wait_until_finish()
# Main entry point
if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    run_bq()
With the code above I can create jobs, but they come out as batch jobs. My main goal is to take the data from the bucket, which is in JSON format, and insert it into BigQuery.
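For clarity, here is a minimal, self-contained sketch of the conversion I have in mind. The file name, field names, and schema below are made up for illustration, since my real data is redacted:

import json
import apache_beam as beam

# Hypothetical input line from the bucket (my real JSON is redacted):
# {"name": "alice", "score": 42}
def to_bq_row(line):
    # Each line of the file is one JSON object; the parsed dict's keys
    # must match the BigQuery column names.
    return json.loads(line)

# Schema string pairing each (made-up) column with its BigQuery type.
schema = 'name:STRING,score:INTEGER'

with beam.Pipeline() as p:
    (p
     | beam.io.ReadFromText('gs://your-bucket-path/sample.json')
     | beam.Map(to_bq_row)
     | beam.io.WriteToBigQuery(
         'yourdataset.yourtable',
         schema=schema,
         create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
         write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))

This version works, but only as a bounded (batch) read; what I need is for new JSON files landing in the bucket to keep flowing into BigQuery.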