# Databricks

Omnata's integration service replicates data from Databricks and loads it into your destination (Snowflake) at regular intervals, as well as providing direct query capabilities into your Databricks instance.

***

### Supported Services

* Databricks on AWS
* Databricks on Azure
* Databricks on GCP

***

### Connectivity Options

Omnata's Snowflake Native App can connect to your Databricks instance in three ways: directly (if publicly accessible), via Privatelink, or via Omnata's secure ngrok tunnel.

#### Direct

Connect to your Databricks instance using its fully qualified hostname and port (default 443). The instance must be network-accessible from Snowflake.

#### Privatelink

Privatelink provides a private network path between your Snowflake account and your Databricks workspace without traffic traversing the public internet.

Requirements:

* Snowflake Business Critical Edition
* Snowflake account and Databricks workspace on the same cloud provider and region

To configure, provision a Privatelink endpoint in your Snowflake account pointing to your Databricks workspace, then choose the Privatelink option when creating the connection in Omnata and provide the endpoint domain.
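
As a rough sketch, assuming Snowflake's outbound private connectivity feature is enabled on your account, provisioning an endpoint on AWS might look like the following. Both arguments are placeholders (your VPC endpoint service name and workspace hostname); consult Snowflake's private connectivity documentation for the exact procedure on your cloud.

```
SELECT SYSTEM$PROVISION_PRIVATELINK_ENDPOINT(
    'com.amazonaws.vpce.us-east-1.vpce-svc-xxxxxxxxxxxxxxxxx',  -- placeholder: Databricks VPC endpoint service
    'adb-1234567890.1.cloud.databricks.com'                     -- placeholder: workspace hostname
);
```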

A prerequisite for a successful connection is a network rule covering the authentication endpoints. This can be created by calling:

```
call OMNATA_DATABRICKS_PLUGIN.PLUGIN_API.CREATE_PUBLIC_ENDPOINT_NETWORK_RULE('<<cloud provider>>', '<<network rule name>>', ['<<list of storage accounts>>'])
```

This creates a new network rule that includes the authentication endpoints and your storage endpoints. Keep a note of the network rule name, as you will need it when creating new connections.
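
For example, for an Azure workspace with two external storage accounts, the call might look like this (the rule name and storage account names are illustrative):

```
call OMNATA_DATABRICKS_PLUGIN.PLUGIN_API.CREATE_PUBLIC_ENDPOINT_NETWORK_RULE(
    'Azure',                                  -- cloud provider
    'DATABRICKS_AUTH_NETWORK_RULE',           -- illustrative network rule name
    ['mystorageacct1', 'mystorageacct2']      -- illustrative storage account names
);
```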

### Authentication

#### Personal Access Token

Provide your Databricks personal access token. Tokens can be generated in your Databricks workspace under **Settings > Developer > Access tokens**.

#### Service Principal (OAuth M2M)

Omnata supports OAuth machine-to-machine authentication via a Service Principal. Provide:

* **Client ID** — the Service Principal's application (client) ID
* **Client Secret** — the Service Principal's client secret

For Azure Entra ID-based Service Principals, also provide:

* **Tenant ID** — The Azure tenant associated with the Service Principal

***

### Connection Configuration

When configuring a new connection, you will be asked to create an external access integration. If you are using Privatelink, add the network rule you created earlier to the list of allowed rules.

When creating a connection you will need to provide:

| Field                           | Description                                                                                                                                                                                                                                                                                                                                                                                                    |
| ------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Server Fully Qualified Name** | The hostname of your Databricks workspace (e.g. `adb-1234567890.1.azuredatabricks.net`)                                                                                                                                                                                                                                                                                                                        |
| **HTTP Path**                   | The HTTP path of the SQL Warehouse or cluster to connect to. Found in your Databricks workspace under **SQL Warehouses > \[Your Warehouse] > Connection Details > HTTP Path** (e.g. `/sql/1.0/warehouses/abc123`)                                                                                                                                                                                              |
| **Cloud Provider**              | The cloud provider hosting your Databricks workspace: `AWS`, `Azure`, or `GCP`                                                                                                                                                                                                                                                                                                                                 |
| **External Storage Accounts**   | A comma-separated list of storage account names used for unloading data from Databricks. Found in your Databricks workspace under **Catalog > Connect > External Data**. The storage account name is extracted from the URL field: for Azure, it is the part after `@` and before `.dfs.core.windows.net`; for AWS, it is the part before `.amazonaws.com` (i.e. `bucket.s3.region`). For GCP, this is not required. |
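
As an illustration of the Azure extraction rule, Snowflake's `SPLIT_PART` can pull the storage account name out of an External Data URL (the URL below is made up):

```
-- Azure: take the part after '@' and before '.dfs.core.windows.net'
SELECT SPLIT_PART(
    SPLIT_PART('abfss://mycontainer@mystorageacct.dfs.core.windows.net/path', '@', 2),
    '.', 1
) AS storage_account_name;   -- returns 'mystorageacct'
```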

### Required Permissions

The Databricks user or service principal used for the connection requires:

* `USE CATALOG` on the target catalog
* `USE SCHEMA` on the target schema(s)
* `SELECT` on the tables and views to be synced

Example grants in Databricks SQL:

```
GRANT USE CATALOG ON CATALOG my_catalog TO `omnata_user`;
GRANT USE SCHEMA ON SCHEMA my_catalog.my_schema TO `omnata_user`;
GRANT SELECT ON TABLE my_catalog.my_schema.my_table TO `omnata_user`;
```

For broader access, grant at the catalog or schema level:

```
GRANT USE CATALOG, USE SCHEMA, SELECT ON CATALOG my_catalog TO `omnata_user`;
```

***

### Inbound Syncs

Omnata supports the following sync strategies for Databricks.

#### Full Refresh

Retrieves all records from the source table or view on every sync run. Use this for tables or views that do not have a suitable cursor column.

#### Incremental (High Watermark)

Retrieves only records that have changed since the last sync run, using a cursor column to track progress. Omnata automatically detects suitable cursor columns based on their data type.

Supported cursor column types: `date`, `timestamp`, `timestamp_ntz`, `timestamp_ltz`, `bigint`, `long`, `int`, `integer`.

If no suitable cursor column is present, only Full Refresh will be available for that stream.
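
Where a source table lacks a suitable cursor column, one can be added in Databricks SQL, provided your pipelines populate it on every insert and update (the column name below is illustrative):

```
-- Add a timestamp column to serve as a high-watermark cursor (illustrative name).
-- Writers must set this column on each insert/update for incremental syncs to work.
ALTER TABLE my_catalog.my_schema.my_table
  ADD COLUMNS (last_modified TIMESTAMP);
```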

### Outbound Syncs

Coming soon

***

### Managing Performance

#### Tuning Parameters

Three parameters control the concurrency and batch size of sync operations:

**Maximum Concurrent Streams (MCS)**

The number of tables/views processed in parallel. Default: `8`, range: `1–64`.

**Maximum Concurrent Queries (MCQ)**

The number of concurrent queries issued per stream when fetching data. Default: `4`, range: `1–64`.

**Target Records Per Query (MRQ)**

The target row count per individual query. Omnata splits large tables into chunks based on index distribution. Default: `1,000,000`, range: `100,000–25,000,000`.

**Example**

| Setting | Value                                     |
| ------- | ----------------------------------------- |
| Streams | 16 tables × 5M rows each = 80M rows total |
| MCS     | 8                                         |
| MCQ     | 4                                         |
| MRQ     | 1,000,000                                 |

This results in 80 queries of \~1M rows each, with a maximum of 32 (MCS × MCQ) concurrent queries at any time.

#### Snowflake Warehouse

Omnata is designed to run on a single-cluster XS warehouse. During an initial historical backfill, consider temporarily enabling multi-cluster scaling with **Standard** scaling policy (not Economy) to reduce query queuing. Return to single-cluster once the initial sync completes.
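
The scaling change described above can be applied with standard Snowflake warehouse DDL (the warehouse name is illustrative):

```
-- Enable multi-cluster scaling for the initial backfill (illustrative warehouse name)
ALTER WAREHOUSE OMNATA_WH SET
    MIN_CLUSTER_COUNT = 1
    MAX_CLUSTER_COUNT = 4
    SCALING_POLICY = 'STANDARD';

-- Revert to a single cluster once the backfill completes
ALTER WAREHOUSE OMNATA_WH SET MAX_CLUSTER_COUNT = 1;
```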

***

### Functions

Omnata exposes a SQL table function that lets you query your Databricks instance directly from Snowflake.

#### SQL\_QUERY

Executes a SQL `SELECT` statement against the Databricks source and returns results as a table.

**Parameters:**

| Parameter         | Type   | Description                                    |
| ----------------- | ------ | ---------------------------------------------- |
| `CONNECTION_SLUG` | STRING | The slug associated with the Omnata connection |
| `QUERY`           | STRING | The SQL query to execute                       |

**Example:**

```
SELECT *
FROM TABLE(OMNATA_DATABRICKS_PLUGIN.UDFS.SQL_QUERY(
    '<<connection slug>>',
    'SELECT * FROM my_catalog.my_schema.my_table'
));
```
