How do I write a CSV extracted from a zip file to HDFS in Python?

Posted 2024-10-03 23:31:06


I want to write a CSV file to HDFS.

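The CSV file arrives zipped, as the body of an HTTP response. The response content is converted into a ZipFile object. How do I correctly extract the CSV from that ZipFile, and how do I write it to HDFS?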
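For reference, the extraction step on its own can be sketched without any network or HDFS access. The helper name read_first_csv and the sample archive are made up for illustration. The point it shows is that a member read from a ZipFile comes back as bytes, not str.

```python
from io import BytesIO
from zipfile import ZipFile


def read_first_csv(zip_bytes):
    """Return (member_name, raw_bytes) for the first .csv entry in a zip archive.

    Hypothetical helper for illustration; not part of the code in the question.
    """
    with ZipFile(BytesIO(zip_bytes)) as archive:
        # Look for a .csv entry instead of assuming it is at index 0.
        csv_names = [n for n in archive.namelist() if n.lower().endswith(".csv")]
        if not csv_names:
            raise ValueError("no CSV file found in the archive")
        name = csv_names[0]
        # ZipFile.read returns the decompressed member as bytes.
        return name, archive.read(name)


# Build a small archive in memory so the sketch runs without an HTTP request.
buffer = BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("sample.csv", "id,name\n1,alpha\n")

name, payload = read_first_csv(buffer.getvalue())
```

In the real case, zip_bytes would be the body of the HTTP response, for example the result of content.read().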

So far I have tried:

import os
from hdfs import InsecureClient
from hdfs.util import HdfsError
from http_wrapper import HttpWrapper
from io import BytesIO
from zipfile import ZipFile

unite_legale_data = HttpWrapper.get_zip_data(args.url_unite_legale)
unite_legale_name = unite_legale_data['content_name']
unite_legale_content = unite_legale_data['content']
log("INFO", "start writing to HDFS")
cli_hdfs = InsecureClient('http://' + os.environ['HDFS_IP'] + ':' + str(os.environ['HDFS_PORT']), user="hdfs")

with cli_hdfs.write(args.unit_legale_output_path, encoding='utf-8', overwrite=True) as writer:
    with unite_legale_content.open(unite_legale_name) as file:
        writer.write(file.read())

My HttpWrapper class looks like this:

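from io import BytesIO
from urllib.request import urlopen
from zipfile import ZipFile


class HttpWrapper:

    @staticmethod
    def get_zip_data(url):
        print("get zip data from {}".format(url))
        content = urlopen(url)
        # Load the whole response into memory and open it as a zip archive.
        zipped_content = ZipFile(BytesIO(content.read()))
        # Assumes the CSV is the first entry in the archive.
        content_name = zipped_content.namelist()[0]
        print("got data for file named {}".format(content_name))
        return {"content_name": content_name,
                "content": zipped_content}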

This produces the following error:

AttributeError: 'bytes' object has no attribute 'encode'

on this line:

   writer.write(file.read())
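
A likely cause, assuming the hdfs (HdfsCLI) package behaves as documented: passing encoding='utf-8' to write() makes the writer call .encode() on whatever it is given, so it expects str, while file.read() on a zip member returns bytes. The sketch below shows two possible ways around that. Both function names are hypothetical, and client is assumed to be an InsecureClient or something with the same write() interface.

```python
def write_member_to_hdfs(client, archive, member_name, hdfs_path):
    """Copy one zip member to HDFS as raw bytes.

    `client` is assumed to be an hdfs.InsecureClient (or compatible) instance.
    """
    # Option 1: no `encoding` argument, so the bytes are written unchanged.
    with archive.open(member_name) as member:
        with client.write(hdfs_path, overwrite=True) as writer:
            writer.write(member.read())


def write_member_as_text(client, archive, member_name, hdfs_path):
    """Copy one zip member to HDFS, decoding it to str first."""
    # Option 2: keep encoding="utf-8" but hand the writer a str.
    # Assumes the CSV really is UTF-8; decode() raises UnicodeDecodeError otherwise.
    with archive.open(member_name) as member:
        text = member.read().decode("utf-8")
    with client.write(hdfs_path, encoding="utf-8", overwrite=True) as writer:
        writer.write(text)
```

Option 1 is probably the safer default when the file only needs to be copied as-is, since it never has to guess the CSV's character encoding.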
