Running database write and read tasks sequentially in Apache Beam

with beam.Pipeline(options=pipeline_options) as p:
    # Executes first
    proc_id_result = (p
        | 'Create Proc Info Record' >> beam.Create([{'pipeline_name': 'cleansed_data_pipeline'}])
        | 'Make Processing Id' >> relational_db.Write(
            source_config=source_config,
            table_config=proc_table_config))

    # Executes second
    proc_id_record = p | relational_db.ReadFromDB(
        source_config=source_config,
        table_name='processing_info',
        query='SELECT pi.id FROM processing_info pi WHERE processing_date_time = '
              ' (SELECT MAX(pi1.processing_date_time) from processing_info pi1 '
              f' where pi1.pipeline_name = \'cleansed_data_pipeline\')'
    )
    ...
    # This code executes later, and is automatically deferred until the side input is available
    | 'Add \'processing_info_id\'' >> (beam.ParDo(AddKeyValuePairToDict(), 'processing_info_id', AsSingleton(proc_id_record)))
    ...

1 answer

Answer 1 · posted 2024-09-24 04:27:01

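Your idea is correct: you can achieve this with an unused side input. You can do something like the following (Beam itself uses this approach for ReadFromBigQuery):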

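import apache_beam as beam

class PassThrough(beam.DoFn):
  def process(self, element):
    yield element

# `input` is the PCollection that must be fully processed before the read starts.
output = input | beam.ParDo(PassThrough()).with_outputs(
    'cleanup_signal', main='main')
main_output = output['main']
# Nothing is ever emitted to 'cleanup_signal'. It only becomes available once
# the ParDo above has finished, which is what makes it usable as a signal.
cleanup_signal = output['cleanup_signal']

# A single dummy element whose processing waits on the (unused) side input.
single_element = (
    input.pipeline
    | beam.Create([None])
    | beam.Map(lambda x, nothing: x, beam.pvalue.AsSingleton(cleanup_signal)))

single_element | relational_db.ReadFromDB(...)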

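The remaining question is how to do this with your ReadFromDB transform, since I suspect it does not accept an input PCollection like this. Is there a way to make that transform accept one?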
