I'm trying to write a Python script that streams data from a Google Cloud Storage bucket into BigQuery with the help of a Dataflow pipeline. I can launch a job, but it runs as a batch job rather than a streaming job, and we are not allowed to use Pub/Sub.
Below is the code I'm trying, with the project-specific details replaced by generic placeholders:
from __future__ import absolute_import

import argparse
import json
import logging

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
# This class has all the functions which facilitate data transposition
class WordExtractingDoFn(beam.DoFn):
    def process(self, element):
        # Create a BigQuery row: turn one JSON line from the file into a
        # dict (the real field mapping is redacted; json.loads stands in)
        yield json.loads(element)
def run_bq(argv=None):
    parser = argparse.ArgumentParser()
    schema1 = 'your schema'  # e.g. 'field1:STRING,field2:INTEGER'
    # All command-line arguments being added to the parser
    parser.add_argument(
        '--input', dest='input', required=False,
        default='gs://your-bucket-path/')
    parser.add_argument(
        '--output', dest='output', required=False,
        default='yourdataset.yourtable')
    known_args, pipeline_args = parser.parse_known_args(argv)
    pipeline_args.extend([
        '--runner=DataflowRunner',
        '--project=your-project',
        '--staging_location=gs://your-staging-bucket-path/',
        '--temp_location=gs://your-temp-bucket-path/',
        '--job_name=pubsubbql1',
        '--streaming',
    ])
    pushtobq = WordExtractingDoFn()

    # Pipeline creation begins
    p = beam.Pipeline(options=PipelineOptions(pipeline_args))
    (p
     | 'Read from a File' >> beam.io.ReadFromText(known_args.input)
     | 'String To BigQuery Row' >> beam.ParDo(pushtobq)
     | 'Write to BigQuery' >> beam.io.WriteToBigQuery(
         known_args.output,
         schema=schema1))

    # Run pipeline
    p.run().wait_until_finish()
# Main entry point
if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    run_bq()
With the code above I can create jobs, but they come out as batch jobs. My main goal is to take the data from the bucket, which is in JSON format, and insert it into BigQuery.
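For clarity, here is a minimal, self-contained sketch of the conversion I have in mind. The file name, field names, and schema below are made up for illustration, since my real data is redacted:

import json
import apache_beam as beam

# Hypothetical input line from the bucket (my real JSON is redacted):
# {"name": "alice", "score": 42}
def to_bq_row(line):
    # Each line of the file is one JSON object; the parsed dict's keys
    # must match the BigQuery column names.
    return json.loads(line)

# Schema string pairing each (made-up) column with its BigQuery type.
schema = 'name:STRING,score:INTEGER'

with beam.Pipeline() as p:
    (p
     | beam.io.ReadFromText('gs://your-bucket-path/sample.json')
     | beam.Map(to_bq_row)
     | beam.io.WriteToBigQuery(
         'yourdataset.yourtable',
         schema=schema,
         create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
         write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))

This version works, but only as a bounded (batch) read; what I need is for new JSON files landing in the bucket to keep flowing into BigQuery.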