Databricks

Omnata's integration service replicates data from Databricks and loads it into your destination (Snowflake) at regular intervals, as well as providing direct query capabilities into your Databricks instance.


Supported Services

  • Databricks on AWS

  • Databricks on Azure

  • Databricks on GCP


Connectivity Options

Omnata's Snowflake Native App can connect to your Databricks instance in three ways: directly (if publicly accessible), via Privatelink, or via Omnata's secure ngrok tunnel.

Direct

Connect to your Databricks instance using its fully qualified hostname and port (default 443). The instance must be network-accessible from Snowflake.
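For the direct option, Snowflake needs an egress network rule permitting traffic to the Databricks hostname. A minimal sketch in Snowflake SQL, assuming a hypothetical workspace hostname (the app may create an equivalent rule for you during setup):

```sql
-- Hypothetical egress rule allowing Snowflake to reach the Databricks workspace
CREATE NETWORK RULE databricks_direct_rule
  MODE = EGRESS
  TYPE = HOST_PORT
  VALUE_LIST = ('adb-1234567890.1.azuredatabricks.net:443');
```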

Privatelink

Privatelink provides a private network path between your Snowflake account and your Databricks workspace without traffic traversing the public internet.

Requirements:

  • Snowflake Business Critical Edition

  • Snowflake account and Databricks workspace on the same cloud provider and region

To configure, provision a Privatelink endpoint in your Snowflake account pointing to your Databricks workspace, then choose the Privatelink option when creating the connection in Omnata and provide the endpoint domain.

A prerequisite for creating a successful connection is setting up a network rule for the authentication endpoints. This can be done by calling the network rule setup procedure provided by the app.

This creates a new network rule that includes the authentication endpoints and your storage endpoints. Keep a note of the rule name, as you will need it when creating new connections.

Authentication

Personal Access Token

Provide your Databricks personal access token. Tokens can be generated in your Databricks workspace under Settings > Developer > Access tokens.

Service Principal (OAuth M2M)

Omnata supports OAuth machine-to-machine (M2M) authentication via a Service Principal. Provide:

  • Client ID — the Service Principal's application (client) ID

  • Client Secret — the Service Principal's client secret

For Azure Entra ID based Service Principals, also provide:

  • Tenant ID — the Azure tenant associated with the Service Principal


Connection Configuration

When configuring a new connection, you will be asked to create an external access integration. If you are using Privatelink, add the network rule you created earlier to the list of allowed rules.
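As a sketch of what such an external access integration looks like in Snowflake SQL (the integration and rule names are hypothetical; the app may generate this for you):

```sql
-- Hypothetical external access integration referencing a previously created network rule
CREATE EXTERNAL ACCESS INTEGRATION omnata_databricks_eai
  ALLOWED_NETWORK_RULES = (databricks_network_rule)
  ENABLED = TRUE;
```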

When creating a connection you will need to provide:

  • Server Fully Qualified Name — the hostname of your Databricks workspace (e.g. adb-1234567890.1.azuredatabricks.net)

  • HTTP Path — the HTTP path of the SQL Warehouse or cluster to connect to. Found in your Databricks workspace under SQL Warehouses > [Your Warehouse] > Connection Details > HTTP Path (e.g. /sql/1.0/warehouses/abc123)

  • Cloud Provider — the cloud provider hosting your Databricks workspace: AWS, Azure, or GCP

  • External Storage Accounts — a comma-separated list of storage account names used for unloading data from Databricks. Found in your Databricks workspace under Catalog > Connect > External Data. The storage account name is extracted from the URL field: for Azure, it is the part after @ and before .dfs.core.windows.net; for AWS, it is the part before .amazonaws.com (i.e. bucket.s3.region). For GCP this is not required.

Required Permissions

The Databricks user or service principal used for the connection requires:

  • USE CATALOG on the target catalog

  • USE SCHEMA on the target schema(s)

  • SELECT on the tables and views to be synced

Example grants in Databricks SQL:
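A minimal sketch, assuming a catalog named main, a schema main.sales, a table orders, and a service principal omnata-sync@example.com (all hypothetical names; substitute your own):

```sql
-- Grant the minimum privileges needed to sync a single table
GRANT USE CATALOG ON CATALOG main TO `omnata-sync@example.com`;
GRANT USE SCHEMA ON SCHEMA main.sales TO `omnata-sync@example.com`;
GRANT SELECT ON TABLE main.sales.orders TO `omnata-sync@example.com`;
```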

For broader access, grant at the catalog or schema level:
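For example, assuming the same hypothetical catalog, schema, and principal names, a schema- or catalog-level SELECT grant covers all current and future tables beneath it:

```sql
-- SELECT on the schema covers every table and view within it
GRANT SELECT ON SCHEMA main.sales TO `omnata-sync@example.com`;

-- or, for everything in the catalog:
GRANT SELECT ON CATALOG main TO `omnata-sync@example.com`;
```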


Inbound Syncs

Omnata supports the following sync strategies for Databricks.

Full Refresh

Retrieves all records from the source table or view on every sync run. Use this for tables or views that do not have a suitable cursor column.

Incremental (High Watermark)

Retrieves only records that have changed since the last sync run, using a cursor column to track progress. Omnata automatically detects suitable cursor columns based on their data type.

Supported cursor column types: date, timestamp, timestamp_ntz, timestamp_ltz, bigint, long, int, integer.

If no suitable cursor column is present, only Full Refresh will be available for that stream.
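The high-watermark pattern can be illustrated with a plain SQL sketch (the table and cursor column names are hypothetical, and this is an illustration of the strategy, not Omnata's exact generated query):

```sql
-- Fetch only rows changed since the last recorded watermark
SELECT *
FROM main.sales.orders
WHERE updated_at > TIMESTAMP '2024-06-01 00:00:00';  -- the previous run's maximum updated_at
```

On each run, the maximum cursor value from the retrieved rows becomes the watermark for the next run.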

Outbound Syncs

Coming soon


Managing Performance

Tuning Parameters

Three parameters control the concurrency and batch size of sync operations:

Maximum Concurrent Streams (MCS)

The number of tables/views processed in parallel. Default: 8, range: 1–64.

Maximum Concurrent Queries (MCQ)

The number of concurrent queries issued per stream when fetching data. Default: 4, range: 1–64.

Target Records Per Query (MRQ)

The target row count per individual query. Omnata splits large tables into chunks based on index distribution. Default: 1,000,000, range: 100,000–25,000,000.

Example

  • Streams — 16 tables × 5M rows each = 80M rows total

  • MCS — 8

  • MCQ — 4

  • MRQ — 1,000,000

This results in 80 queries of ~1M rows each, with a maximum of 32 (MCS × MCQ) concurrent queries at any time.

Snowflake Warehouse

Omnata is designed to run on a single-cluster XS warehouse. During an initial historical backfill, consider temporarily enabling multi-cluster scaling with Standard scaling policy (not Economy) to reduce query queuing. Return to single-cluster once the initial sync completes.
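For example, temporarily enabling multi-cluster scaling on a hypothetical warehouse named OMNATA_WH might look like:

```sql
-- Allow up to 3 clusters during the historical backfill
ALTER WAREHOUSE OMNATA_WH SET
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 3
  SCALING_POLICY = 'STANDARD';

-- Return to single-cluster once the initial sync completes
ALTER WAREHOUSE OMNATA_WH SET MAX_CLUSTER_COUNT = 1;
```

Note that multi-cluster warehouses require Snowflake Enterprise Edition or higher.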


Functions

Omnata exposes a SQL table function that lets you query your Databricks instance directly from Snowflake.

SQL_QUERY

Executes a SQL SELECT statement against the Databricks source and returns results as a table.

Parameters:

  • CONNECTION_SLUG (STRING) — the slug associated with the Omnata connection

  • QUERY (STRING) — the SQL query to execute

Example:
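A sketch of invoking the function (the connection slug, catalog, schema, and table names are placeholders; the function may need to be qualified with the app's database and schema in your account):

```sql
SELECT *
FROM TABLE(SQL_QUERY('my_databricks_connection',
                     'SELECT order_id, amount FROM main.sales.orders LIMIT 100'));
```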
