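Python dbt-spark package - PyPI

The Spark SQL plugin for dbt (data build tool)

Detailed description of the dbt-spark Python project

dbt-spark

Documentation

For more information on using Spark with dbt, consult the dbt documentation.

Installation

This plugin can be installed via pip: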

$ pip install dbt-spark

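Configuring your profile

Connection Method

Connections can be made to Spark in two different modes. The http mode is used when connecting to a managed service such as Databricks, which provides an HTTP endpoint; the thrift mode is used to connect directly to the master node of a cluster (either on-premise or in the cloud).

A dbt profile can be configured to run against Spark using the following configuration: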

Option	Description	Required?	Example
method	Specify the connection method (thrift or http)	Required	http
schema	Specify the schema (database) to build models into	Required	analytics
host	The hostname to connect to	Required	yourorg.sparkhost.com
port	The port to connect to the host on	Optional (default: 443 for http, 10001 for thrift)	443
token	The token to use for authenticating to the cluster	Required for http	abc123
cluster	The name of the cluster to connect to	Required for http	01234-23423-coffeetime
user	The username to use to connect to the cluster	Optional	hadoop
connect_timeout	The number of seconds to wait before retrying to connect to a Pending Spark cluster	Optional (default: 10)	60
connect_retries	The number of times to try connecting to a Pending Spark cluster before giving up	Optional (default: 0)	5
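
The required and optional rules in the table above can be checked before a profile is handed to dbt. The following is a minimal Python sketch of that check, not code from dbt or dbt-spark: the function name normalize_spark_target and the constant names are hypothetical, and the port defaults assume that 443 belongs to the http method and 10001 to the thrift method, as in the example profiles.

```python
# Hypothetical helper for illustration; not part of dbt or dbt-spark.
DEFAULT_PORTS = {"http": 443, "thrift": 10001}  # assumed mapping of the table defaults
REQUIRED_ALWAYS = ("method", "schema", "host")
REQUIRED_FOR_HTTP = ("token", "cluster")


def normalize_spark_target(target):
    """Return a copy of `target` with the table defaults applied, or raise ValueError."""
    missing = [key for key in REQUIRED_ALWAYS if not target.get(key)]
    if missing:
        raise ValueError("missing required option(s): " + ", ".join(missing))

    method = target["method"]
    if method not in DEFAULT_PORTS:
        raise ValueError("method must be 'http' or 'thrift', got %r" % (method,))

    if method == "http":
        missing = [key for key in REQUIRED_FOR_HTTP if not target.get(key)]
        if missing:
            raise ValueError("http method also requires: " + ", ".join(missing))

    normalized = dict(target)
    normalized.setdefault("port", DEFAULT_PORTS[method])
    normalized.setdefault("connect_timeout", 10)
    normalized.setdefault("connect_retries", 0)
    return normalized


thrift_target = normalize_spark_target(
    {"method": "thrift", "schema": "analytics", "host": "127.0.0.1"}
)
print(thrift_target["port"])  # 10001
```

Returning a copy rather than mutating the input keeps the original profile dictionary unchanged, which makes the helper easier to reuse in tests.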

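Usage with Amazon EMR

To connect to Spark running on an Amazon EMR cluster, you will need to run sudo /usr/lib/spark/sbin/start-thriftserver.sh on the master node of the cluster to start the Thrift server (see https://aws.amazon.com/premiumsupport/knowledge-center/jdbc-connection-emr/ for further context). You will also need to connect to port 10001, which connects to the Spark backend Thrift server; port 10000 instead connects to a Hive backend, which will not work correctly with dbt.

Example profiles.yml entries: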

your_profile_name:
  target: dev
  outputs:
    dev:
      method: http
      type: spark
      schema: analytics
      host: yourorg.sparkhost.com
      port: 443
      token: abc123
      cluster: 01234-23423-coffeetime
      connect_retries: 5
      connect_timeout: 60

your_profile_name:
  target: dev
  outputs:
    dev:
      method: thrift
      type: spark
      schema: analytics
      host: 127.0.0.1
      port: 10001
      user: hadoop
      connect_retries: 5
      connect_timeout: 60
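
The connect_retries and connect_timeout settings in the profiles above can be read as a simple retry loop. The sketch below illustrates that reading in Python under stated assumptions: it treats connect_retries as the number of retries after one initial attempt and connect_timeout as the pause between attempts. The names connect_with_retries and fake_connect are hypothetical, and the plugin's actual behavior may differ.

```python
import time


def connect_with_retries(connect, connect_retries=0, connect_timeout=10, sleep=time.sleep):
    """Call `connect()` until it succeeds or the retry budget is spent.

    Assumption: one initial attempt plus `connect_retries` retries, with a
    pause of `connect_timeout` seconds between attempts.
    """
    attempts = 1 + connect_retries
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except ConnectionError:
            if attempt == attempts:
                raise  # retry budget exhausted, surface the last error
            sleep(connect_timeout)


# Stand-in for a cluster that is still pending during the first two attempts.
calls = {"count": 0}


def fake_connect():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("cluster is still pending")
    return "connected"


waits = []  # record the pauses instead of actually sleeping
result = connect_with_retries(
    fake_connect, connect_retries=5, connect_timeout=60, sleep=waits.append
)
print(result, calls["count"], waits)  # connected 3 [60, 60]
```

Passing the sleep function as a parameter lets the loop be exercised without real waiting.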

Usage Notes

Model Configuration

The following configurations can be supplied to models run with the dbt-spark plugin:

Option	Description	Required?	Example
file_format	The file format to use when creating tables	Optional	parquet

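Incremental Models

Spark does not natively support delete, update, or merge statements. As such, incremental models are implemented differently than usual in this plugin. To use incremental models, specify a partition_by clause in your model config. dbt will use an insert overwrite query to overwrite the partitions included in your query. Be sure to re-select all of the relevant data for a partition when using incremental models.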

{{ config(
    materialized='incremental',
    partition_by=['date_day'],
    file_format='parquet'
) }}

/*
  Every partition returned by this query will be overwritten
  when this model runs
*/

select
    date_day,
    count(*) as users

from {{ ref('events') }}
where cast(date_day as date) >= '2019-01-01'
group by 1
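
The reason for re-selecting all of a partition's data can be seen in a small in-memory model of partition overwrite. This Python sketch is a simplification for illustration only, not how Spark or dbt implements insert overwrite; the function insert_overwrite and the sample rows are hypothetical.

```python
from collections import defaultdict


def insert_overwrite(table, new_rows, partition_key):
    """Replace every partition of `table` that appears in `new_rows`.

    `table` maps a partition value to its list of rows. Partitions that do
    not appear in `new_rows` are left untouched.
    """
    incoming = defaultdict(list)
    for row in new_rows:
        incoming[row[partition_key]].append(row)

    result = dict(table)
    for partition, rows in incoming.items():
        result[partition] = rows  # the whole partition is replaced, not appended to
    return result


existing = {
    "2019-01-01": [{"date_day": "2019-01-01", "users": 10}],
    "2019-01-02": [{"date_day": "2019-01-02", "users": 7}],
}
rerun = [{"date_day": "2019-01-02", "users": 9}]

updated = insert_overwrite(existing, rerun, "date_day")
print(updated["2019-01-01"])  # [{'date_day': '2019-01-01', 'users': 10}]
print(updated["2019-01-02"])  # [{'date_day': '2019-01-02', 'users': 9}]
```

Because a returned partition is replaced wholesale, a query that selects only part of a partition's data would drop the rows it left out.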

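Reporting bugs and contributing code

Want to report a bug or request a feature? Let us know on Slack, or open an issue.

Code of Conduct

Everyone interacting in the dbt project's codebases, issue trackers, chat rooms, and mailing lists is expected to follow the PyPA Code of Conduct.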

dbt-spark 0.13.0
