BigQuery Python API: preserve null fields during table extract job


I have the following code:

from google.cloud import bigquery

# client, table, bucket_name and gcs_filename are defined elsewhere.
job_config = bigquery.ExtractJobConfig()
job_config.compression = bigquery.Compression.GZIP
job_config.destination_format = (
    bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON)

destination_uri = 'gs://{}/{}'.format(bucket_name, gcs_filename)

extract_job = client.extract_table(
    table,
    destination_uri,
    job_config=job_config,
    location='US')  # API request
extract_job.result()  # Waits for job to complete.

(Note that I'm getting the table object elsewhere.)

This works, and dumps the requested table into GCS as newline-delimited JSON. However, some of the columns in the table are nullable, and some of them do contain null values. To keep all the data consistent, I'd like to preserve the nulls in the JSON output. Is there a way to do this without resorting to Avro?
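To illustrate what I mean (hypothetical column names, not from my actual table): given a row where a nullable column optional_col is null, the default NEWLINE_DELIMITED_JSON extract omits the field entirely,

{"id": 1, "name": "a"}

whereas I'd like the output to keep it explicitly:

{"id": 1, "name": "a", "optional_col": null}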

This post: Big Query table extract in JSON, preserve nulls? ... suggests actually querying the table. I don't believe that's an option for me, since the tables I'm extracting contain millions of rows each. One I'm looking at has nearly 100 million rows and over 25 GB of data. But I haven't found a way to set up the extract job so that it preserves nulls.


Tags: data, code, config, json, format, table, job, extract
2 Answers

This question has come up on SO before. I suggest you review this post, which includes both an explanation of your issue and a workaround.

There are some good answers there, for example this one from Mosha (a software engineer at Google):

This is standard behavior of NULL in SQL, and all SQL databases (Oracle, Microsoft SQL Server, PostgreSQL, MySQL etc) have exactly same behavior. If the IS NULL check is too tedious, alternative solution is to use IFNULL or COALESCE function to convert NULL into non-NULL, i.e.

select * from
(select NULL as some_nullable_col, "name1" as name),
(select 4 as some_nullable_col, "name2" as name),
(select 1 as some_nullable_col, "name3" as name),
(select 7 as some_nullable_col, "name4" as name),
(select 3 as some_nullable_col, "name5" as name)
WHERE ifnull(some_nullable_col,0) != 3
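(The quoted example is legacy SQL; a rough standard SQL equivalent, keeping the same column names, might look like this:)

SELECT *
FROM UNNEST([
  STRUCT(CAST(NULL AS INT64) AS some_nullable_col, 'name1' AS name),
  STRUCT(4 AS some_nullable_col, 'name2' AS name),
  STRUCT(1 AS some_nullable_col, 'name3' AS name),
  STRUCT(7 AS some_nullable_col, 'name4' AS name),
  STRUCT(3 AS some_nullable_col, 'name5' AS name)
])
WHERE IFNULL(some_nullable_col, 0) != 3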

I think the best approach is to use a query job first:

  1. Run a query job over the table you want to extract, writing the result to a temporary table (as shown below)
  2. Extract that table as CSV with no header

Here is code that does this:

import time

from google.cloud import bigquery

# client, dataset_id and bucket_name are assumed to be defined elsewhere.
job_config = bigquery.QueryJobConfig()
gcs_filename = 'file_with_nulls*.json.gzip'  # wildcard shards large exports

# Step 1: materialize TO_JSON_STRING output into a temporary table.
table_ref = client.dataset(dataset_id).table('my_null_table')
job_config.destination = table_ref
job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE

# Start the query, passing in the extra configuration.
query_job = client.query(
    """#standardSQL
    SELECT TO_JSON_STRING(t) AS json FROM `project.dataset.table` AS t;""",
    location='US',
    job_config=job_config)

# Wait until the destination table has been written
# (query_job.result() would also block until completion).
while not query_job.done():
    time.sleep(1)
print("query completed")

# Step 2: extract the single JSON column as headerless CSV.
job_config = bigquery.ExtractJobConfig()
job_config.compression = bigquery.Compression.GZIP
job_config.destination_format = bigquery.DestinationFormat.CSV
job_config.print_header = False

destination_uri = 'gs://{}/{}'.format(bucket_name, gcs_filename)

extract_job = client.extract_table(
    table_ref,
    destination_uri,
    job_config=job_config,
    location='US')  # API request
extract_job.result()  # Waits for job to complete.
print("extract completed")

Once everything is done, you can delete the temporary table created in step 1. If you clean up quickly the cost is negligible: storage is $20 per TB per month, so even a full TB kept for one hour costs 20/30/24 ≈ 3 cents, and 25 GB for an hour is a small fraction of that.
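For example, reusing client and table_ref from the code above (a minimal sketch):

# Drop the temporary table created in step 1.
client.delete_table(table_ref)
print("temporary table deleted")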
