How to handle operations on local files across multiple ParDo transforms in Apache Beam / Google Cloud Dataflow

Published 2024-09-30 04:33:29


I'm developing an ETL pipeline for Google Cloud Dataflow in which I have several branching ParDo transforms, each of which requires a local audio file. The branch results are then merged and exported as text.

This was initially a Python script running on a single machine that I'm attempting to adapt to GC Dataflow for parallelisation across VM workers.

The extraction step downloads each file from a single GCS bucket location and deletes it once the transform completes, to keep storage under control. This is because the preprocessing module needs local access to the files. It could be re-engineered to handle a byte stream instead of a file by rewriting some of the preprocessing libraries; however, some attempts at that haven't gone well, and I'd first like to explore how Apache Beam / GC Dataflow handles parallel local file operations, to better understand the framework.

In this rough implementation each branch downloads and deletes the file, which means a lot of duplicated work. My implementation has 8 branches, so every file is downloaded and deleted 8 times. Could a GCS bucket instead be mounted on each worker, rather than downloading the files from the remote?

Or is there another way to hand workers the correct references to the files, so that:

  • a single DownloadFilesDoFn() can download a batch of files,
  • the local file references in the PCollection can then be fanned out to all branches,
  • and a final CleanUpFilesDoFn() can remove them?
  • How can local file references be parallelised?

If local file operations cannot be avoided, what is the best branched ParDo strategy for Apache Beam / GC Dataflow?


For simplicity, some sample code of the existing implementation with two branches:

# Singleton decorator: caches one instance per decorated class.
# getinstance must forward constructor arguments, otherwise
# Predict(self.model) below raises a TypeError.
def singleton(cls):
  instances = {}
  def getinstance(*args, **kwargs):
      if cls not in instances:
          instances[cls] = cls(*args, **kwargs)
      return instances[cls]
  return getinstance
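One caveat with the decorator above: the cache is keyed by class only, so a second call with a different model argument silently returns the first instance. That matters here, since two models are run per worker. A minimal standalone demonstration (using a hypothetical `Cache` class as a stand-in for `Predict`, and a `getinstance` that forwards constructor arguments):

```python
# Singleton decorator, with getinstance forwarding constructor arguments.
def singleton(cls):
    instances = {}
    def getinstance(*args, **kwargs):
        if cls not in instances:
            instances[cls] = cls(*args, **kwargs)
        return instances[cls]
    return getinstance

@singleton
class Cache:  # hypothetical stand-in for Predict()
    def __init__(self, name):
        self.name = name

a = Cache('model1')
b = Cache('model2')  # arguments ignored: the cached instance is returned
print(a is b)   # True
print(b.name)   # model1 -- the second model never gets its own instance
```

If one instance per model is wanted, the cache would need to be keyed by the constructor arguments as well as the class.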

@singleton
class Predict():
  def __init__(self, model):
    self.model = model

  def process(self, filename):
    '''Processes audio read in from filename; returns a prediction.'''
    # simplified pseudocode
    audio = preprocess.load(filename=filename)
    prediction = inference(self.model, audio)
    return prediction

class PredictDoFn(beam.DoFn):
  def __init__(self, model):
    self.model = model

  def start_bundle(self):
    self.localfiles = []

  def process(self, element):
    # Construct Predict() object singleton per worker
    predict = Predict(self.model)

    # Download the file next to the working directory
    # (cwd is assumed to be defined elsewhere)
    subprocess.run(['gsutil', 'cp', element['GCSPath'], './'], cwd=cwd, shell=False)
    localfile = cwd + "/" + element['GCSPath'].split('/')[-1]
    self.localfiles.append(localfile)

    res = predict.process(localfile)
    return [{
        'Index': element['Index'],
        'Title': element['Title'],
        'File' : element['GCSPath'],
        self.model + 'Prediction': res
        }]

  def finish_bundle(self):
    # Remove every file downloaded in this bundle, not just the last one
    for f in self.localfiles:
      subprocess.run(['rm', f], cwd=cwd, shell=False)


# DoFn to split a CSV line into a dict element
# (the GCS bucket could perhaps be read as a PCollection instead)
class Split(beam.DoFn):
    def process(self, element):
        Index, Title, GCSPath = element.split(",")
        GCSPath = 'gs://mybucket/' + GCSPath
        return [{
            'Index': int(Index),
            'Title': Title,
            'GCSPath': GCSPath
        }]
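A side note on the parse above: `element.split(",")` breaks if a Title itself contains a comma, whereas Python's `csv` module handles quoted fields. A hedged alternative sketch of the same parse (assuming the same field order and bucket prefix):

```python
import csv

def parse_row(line, bucket='gs://mybucket/'):
    """Parse one CSV line into the dict shape used by the pipeline."""
    index, title, gcs_path = next(csv.reader([line]))
    return {
        'Index': int(index),
        'Title': title,
        'GCSPath': bucket + gcs_path,
    }

row = parse_row('3,"Song, with comma",audio/track3.wav')
print(row['Title'])    # Song, with comma
print(row['GCSPath'])  # gs://mybucket/audio/track3.wav
```

A plain function like this could be applied with `beam.Map(parse_row)` instead of a full DoFn.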

A simplified version of the pipeline:

with beam.Pipeline(argv=pipeline_args) as p:
    files = (
        p | 'Read From CSV' >> beam.io.ReadFromText(known_args.input)
          | 'Parse CSV into Dict' >> beam.ParDo(Split())
    )
    # prediction 1 branch
    preds1 = files | 'Prediction 1' >> beam.ParDo(PredictDoFn(model1))
    # prediction 2 branch
    preds2 = files | 'Prediction 2' >> beam.ParDo(PredictDoFn(model2))

    # join branches into a single PCollection
    joined = (preds1, preds2) | 'Merge Branches' >> beam.Flatten()

    # output to file
    output = joined | 'WriteToText' >> beam.io.WriteToText(known_args.output)


1 Answer

Answered 2024-09-30 04:33:29

To avoid downloading the file repeatedly, you can put the file's contents into the PCollection:

class DownloadFilesDoFn(beam.DoFn):
  def __init__(self):
    import re
    self.gcs_path_regex = re.compile(r'gs://([^/]+)/(.*)')

  def start_bundle(self):
    import google.cloud.storage
    self.gcs = google.cloud.storage.Client()

  def process(self, element):
    file_match = self.gcs_path_regex.match(element['GCSPath'])
    bucket = self.gcs.get_bucket(file_match.group(1))
    blob = bucket.get_blob(file_match.group(2))
    element['file_contents'] = blob.download_as_bytes()
    yield element
     

PredictDoFn then becomes:

class PredictDoFn(beam.DoFn):
  def __init__(self, model):
    self.model = model

  def start_bundle(self):
    self.predict = Predict(self.model)

  def process(self, element):
    # NOTE: Predict.process must be adapted to accept raw bytes
    # rather than a filename for this to work
    res = self.predict.process(element['file_contents'])
    return [{
        'Index': element['Index'],
        'Title': element['Title'],
        'File' : element['GCSPath'],
        self.model + 'Prediction': res
        }]
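If the preprocessing library insists on a filename and rewriting it to accept bytes proves difficult, the downloaded bytes can be bridged through a temporary file inside `process()`, keeping the download-once benefit while preserving the local-file API. A stdlib-only sketch, where `process_local` is a hypothetical stand-in for the real filename-based `preprocess.load` call:

```python
import os
import tempfile

def with_local_file(file_contents, process_local, suffix='.wav'):
    """Write bytes to a temp file, run a filename-based function, clean up."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, 'wb') as f:
            f.write(file_contents)
        return process_local(path)
    finally:
        os.remove(path)  # the file is deleted even if processing raises

# Example: "processing" just measures the file size.
size = with_local_file(b'\x00' * 1024, lambda p: os.path.getsize(p))
print(size)  # 1024
```

Because each call cleans up after itself, no `finish_bundle` bookkeeping of downloaded paths is needed.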

The pipeline:

with beam.Pipeline(argv=pipeline_args) as p:
    files = (
        p | 'Read From CSV' >> beam.io.ReadFromText(known_args.input)
          | 'Parse CSV into Dict' >> beam.ParDo(Split())
          | 'Read files' >> beam.ParDo(DownloadFilesDoFn())
    )
    # prediction 1 branch
    preds1 = files | 'Prediction 1' >> beam.ParDo(PredictDoFn(model1))
    # prediction 2 branch
    preds2 = files | 'Prediction 2' >> beam.ParDo(PredictDoFn(model2))

    # join branches into a single PCollection
    joined = (preds1, preds2) | 'Merge Branches' >> beam.Flatten()

    # output to file
    output = joined | 'WriteToText' >> beam.io.WriteToText(known_args.output)
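Note that `beam.Flatten` only concatenates the two branches' outputs; to combine them into one record per file, a common Beam pattern is to key each branch by `Index` and apply `beam.CoGroupByKey`, then merge the grouped dicts. The per-key merge logic that such a step would apply can be sketched in plain Python (record shapes assumed to follow PredictDoFn's output):

```python
from collections import defaultdict

def merge_branches(*branches):
    """Merge per-branch prediction dicts into one record per Index."""
    merged = defaultdict(dict)
    for branch in branches:
        for record in branch:
            # Shared keys (Index, Title, File) are identical across
            # branches; prediction keys differ per model, so update()
            # accumulates them into a single record.
            merged[record['Index']].update(record)
    return [merged[i] for i in sorted(merged)]

preds1 = [{'Index': 1, 'Title': 'a', 'm1Prediction': 0.9}]
preds2 = [{'Index': 1, 'Title': 'a', 'm2Prediction': 0.2}]
rows = merge_branches(preds1, preds2)
print(rows)  # [{'Index': 1, 'Title': 'a', 'm1Prediction': 0.9, 'm2Prediction': 0.2}]
```

In the pipeline itself this would look roughly like `({'m1': keyed1, 'm2': keyed2}) | beam.CoGroupByKey()` followed by a `beam.Map` applying the same merge.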
