
Databricks + Tensorlake Integration Examples

Transform Unstructured Data into Queryable, AI-Ready Data on Databricks

Databricks is a unified data analytics platform built on Apache Spark, designed for data engineering, machine learning, and analytics at scale. When you combine it with Tensorlake's document parsing and serverless agentic application runtime, you can build AI workflows and agents that automate the processing of documents and other unstructured data and land the results in Databricks.

This repository demonstrates how to build a full ingestion pipeline on Tensorlake. Orchestration happens on Tensorlake, so you can write a distributed, durable ingestion pipeline in pure Python; Tensorlake automatically queues requests as they arrive and scales the cluster to process them. The platform is serverless, so you only pay for the compute used to process your data.

Table of Contents

SEC Filings Analysis Pipeline

This repository demonstrates how to use Tensorlake Applications to extract AI-related risk mentions from SEC filings, storing and querying results in Databricks.

The Tensorlake Application receives document URLs over HTTP, uses vision language models (VLMs) to classify which pages contain risk factors, calls an LLM for structured extraction from only the relevant pages, and then uses the Databricks SQL Connector to write structured data into your Databricks SQL Warehouse. Once the data is in Databricks, you can run complex analytics to track trends, compare companies, and discover emerging risk patterns.
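
For illustration, the final load step with the Databricks SQL Connector might look roughly like the sketch below. This is a hedged example rather than the exact code in process-sec.py: the ai_risks table and column names mirror the sample query later in this README, and the named-parameter style assumes databricks-sql-connector 3.x.

def write_risks(connection, rows):
    # "connection" is a databricks.sql connection (see Databricks Setup below).
    # "rows" is a list of dicts produced by the extraction step, e.g.
    # {"company_name": ..., "ticker": ..., "risk_category": ...,
    #  "risk_description": ..., "citation": ...}
    insert = """
        INSERT INTO ai_risks
            (company_name, ticker, risk_category, risk_description, citation)
        VALUES (:company_name, :ticker, :risk_category, :risk_description, :citation)
    """
    with connection.cursor() as cursor:
        for row in rows:
            cursor.execute(insert, row)  # one INSERT per extracted risk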

The Application is written in Python, without any external orchestration engine, so you can build and test it like any ordinary application. You can use any document AI API in the Application, or even run open-source VLMs on GPUs by annotating functions with GPU-enabled hardware resources.

Tensorlake automatically queues requests and scales out the cluster; no extra configuration is required to handle spiky ingestion.

Key Features:

  • Page Classification with VLMs: Reduces processing from ~200 pages to ~20 relevant pages per document
  • Structured Extraction: Extracts AI risk categories, descriptions, and severity indicators using Pydantic schemas (see the schema sketch after this list)
  • Parallel Processing: Uses .map() to process multiple documents simultaneously
  • Dual Table Design: Summary table for aggregations, detailed table for deep analysis
  • Pre-built Queries: 6 analytics queries for risk distribution, trends, and company comparisons
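
As a rough sketch, the extraction schema could look something like the Pydantic models below. The field names are illustrative, not necessarily the exact ones used in process-sec.py.

from pydantic import BaseModel, Field

class AIRisk(BaseModel):
    # One AI-related risk mention extracted from a relevant filing page.
    risk_category: str = Field(description="e.g. Operational, Regulatory, Competitive")
    risk_description: str = Field(description="The risk language, verbatim or summarized")
    severity: str = Field(description="Severity indicator inferred from the filing text")
    citation: str = Field(description="Where in the filing the risk appears")

class FilingRisks(BaseModel):
    company_name: str
    ticker: str
    risks: list[AIRisk]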

Architecture

Document URLs → Tensorlake Application → Page Classification (VLM)
                                      → Structured Extraction (LLM)
                                      → Databricks SQL Warehouse
                                      → SQL Analytics & Dashboards

The architecture separates document processing from querying:

  • Processing Application: Handles ingestion, classification, extraction, and loading
  • Query Application: Provides pre-built analytics queries as an API

Both applications are deployed as serverless functions and can be called via HTTP or programmatically.
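
As an illustration of the HTTP path, a request to a deployed application could look something like the snippet below. The endpoint URL, auth header, and payload shape are placeholders rather than the actual Tensorlake API; copy the real invocation details from your application's page on cloud.tensorlake.ai, or use process-sec-remote.py for the programmatic route.

import os
import requests  # not in the install list below; pip install requests

# Placeholder endpoint; replace with the invocation URL shown for your
# deployed application on cloud.tensorlake.ai.
ENDPOINT = "https://<your-application-endpoint>"

response = requests.post(
    ENDPOINT,
    headers={"Authorization": f"Bearer {os.environ['TENSORLAKE_API_KEY']}"},  # assumed auth scheme
    json={"document_urls": ["https://example.com/filing.pdf"]},  # hypothetical payload shape
)
response.raise_for_status()
print(response.json())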

Quick Start

Prerequisites

  • Python 3.11+
  • Tensorlake API Key
  • Databricks SQL Warehouse credentials:
    • Server Hostname
    • HTTP Path
    • Access Token

Databricks Setup

You need access to a Databricks SQL Warehouse. Find your connection details in the Databricks workspace under SQL Warehouses → Connection Details.
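
To verify your credentials before running the pipeline, a quick connectivity check with databricks-sql-connector looks roughly like this (a minimal sketch; it assumes the environment variables from the Local Testing section below are already set):

import os
from databricks import sql

# Connect to the SQL Warehouse with the same credentials the pipeline uses.
with sql.connect(
    server_hostname=os.environ["DATABRICKS_SERVER_HOSTNAME"],
    http_path=os.environ["DATABRICKS_HTTP_PATH"],
    access_token=os.environ["DATABRICKS_ACCESS_TOKEN"],
) as connection:
    with connection.cursor() as cursor:
        cursor.execute("SELECT 1")
        print(cursor.fetchone())  # any row back means the warehouse is reachable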

Local Testing

1. Clone and Install

git clone https://github.com/tensorlakeai/databricks
cd databricks
pip install --upgrade tensorlake databricks-sql-connector pandas pyarrow

2. Set Environment Variables

export TENSORLAKE_API_KEY=your_tensorlake_api_key
export DATABRICKS_SERVER_HOSTNAME=your_hostname
export DATABRICKS_HTTP_PATH=your_http_path
export DATABRICKS_ACCESS_TOKEN=your_access_token

Tip: If you encounter SSL certificate issues on macOS or other environments, you can set:

export DATABRICKS_SQL_CONNECTOR_VERIFY_SSL=false

3. Process a Test Filing

Run the processing script to extract data from a single test SEC filing:

python process-sec.py

4. Query the Data

Open the Databricks SQL Editor and run this sample query, which finds each company's longest 'Operational' AI-risk description:

WITH ranked_risks AS (
    SELECT 
        company_name,
        ticker,
        risk_description,
        citation,
        LENGTH(risk_description) as description_length,
        ROW_NUMBER() OVER (PARTITION BY company_name ORDER BY LENGTH(risk_description) DESC) as rn
    FROM ai_risks
    WHERE risk_category = 'Operational'
)
SELECT 
    company_name,
    ticker,
    citation,
    risk_description,
    description_length
FROM ranked_risks
WHERE rn = 1
ORDER BY company_name

Databricks SQL Editor showing query results
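
You can also run the same kind of query from Python and pull the results into pandas, which is why pandas and pyarrow appear in the install step above. A rough sketch:

import os
from databricks import sql

with sql.connect(
    server_hostname=os.environ["DATABRICKS_SERVER_HOSTNAME"],
    http_path=os.environ["DATABRICKS_HTTP_PATH"],
    access_token=os.environ["DATABRICKS_ACCESS_TOKEN"],
) as connection:
    with connection.cursor() as cursor:
        cursor.execute(
            "SELECT company_name, ticker, risk_category, risk_description FROM ai_risks"
        )
        df = cursor.fetchall_arrow().to_pandas()  # Arrow table -> pandas DataFrame

print(df.head())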

Deploying to Tensorlake Cloud

1. Verify Tensorlake Connection

tensorlake whoami

2. Set Secrets

Store your credentials securely in Tensorlake:

tensorlake secrets set TENSORLAKE_API_KEY='your_key'
tensorlake secrets set DATABRICKS_SERVER_HOSTNAME='your_hostname'
tensorlake secrets set DATABRICKS_HTTP_PATH='your_path'
tensorlake secrets set DATABRICKS_ACCESS_TOKEN='your_token'

3. Verify Secrets

tensorlake secrets list

4. Deploy Applications

Deploy the processing application:

tensorlake deploy process-sec.py

Once your applications are deployed, you can see them under Applications on cloud.tensorlake.ai:

A screenshot of the Tensorlake dashboard showing the deployed application document_ingestion

5. Run the Full Pipeline

Process all SEC filings using the deployed application:

python process-sec-remote.py

Quick Overview: Tensorlake Applications

Tensorlake Applications are Python programs that:

  1. Run as serverless applications
  2. Can be triggered by HTTP requests, message queues, or scheduled events
  3. Can use any Python package or model
  4. Can run on CPU or GPU
  5. Automatically scale out based on load
  6. Have built-in queuing and fault tolerance
  7. Support function composition with .map() for parallel processing

Why This Integration Matters

The integration between Tensorlake and Databricks provides several key benefits:

  1. Simplified ETL for Unstructured Data: Convert documents, images, and other unstructured data into structured formats without complex orchestration tools like Apache Airflow or Prefect.
  2. Serverless Architecture: No infrastructure management required - just write Python code and deploy.
  3. Automatic Scaling: Handle varying loads without manual intervention or cluster configuration.
  4. GPU Support: Run ML models and VLMs efficiently when needed for document classification or embedding generation.
  5. Databricks Integration: Leverage Databricks' powerful analytics capabilities, Unity Catalog, and Delta Lake with properly structured data.
  6. Production Ready: Built-in error handling, retries, and observability for enterprise workloads.

Project Structure

databricks/
├── process-sec.py             # Main processing application
├── process-sec-remote.py      # Script to call the deployed processing app
├── databricks-query.png       # Screenshot of Databricks SQL query
├── deployed-applications.png  # Screenshot of Tensorlake dashboard
├── LICENSE                    # MIT License
└── README.md                  # This file

Resources

License

This project is licensed under the MIT License - see the LICENSE file for details.

Support
