Databricks is a unified data analytics platform built on Apache Spark, designed for data engineering, machine learning, and analytics at scale. Combined with Tensorlake's document parsing and serverless agentic application runtime, you can build AI workflows and agents that automate the processing of documents and other unstructured data and land the results in Databricks.
This repository demonstrates building a full ingestion pipeline on Tensorlake. Orchestration happens on Tensorlake, so you can write a distributed, durable pipeline in pure Python; Tensorlake automatically queues requests as they arrive and scales the cluster to process them. The platform is serverless, so you only pay for the compute used to process your data.
- SEC Filings Analysis Pipeline
- Quick Start
- Local Testing
- Deploying to Tensorlake Cloud
- Example Queries
- Quick Overview: Tensorlake Applications
- Why This Integration Matters
- Resources
## SEC Filings Analysis Pipeline

This repository demonstrates how to use Tensorlake Applications to extract AI-related risk mentions from SEC filings, storing and querying the results in Databricks.
The Tensorlake Application receives document URLs over HTTP, uses Vision Language Models (VLMs) to classify which pages contain risk factors, calls an LLM for structured extraction from only the relevant pages, and then uses the Databricks SQL Connector to write the structured data into your Databricks SQL Warehouse. Once the data is inside Databricks, you can run complex analytics to track trends, compare companies, and discover emerging risk patterns.
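As a rough sketch of that flow (the classification and extraction helpers below are hypothetical placeholders, not the actual functions in `process-sec.py`; only the `databricks.sql` connector calls reflect a real API):

```python
import os
from databricks import sql

def classify_pages(document_url: str) -> list[int]:
    # Placeholder for the VLM step: return indices of pages that discuss risk factors.
    return [12, 13, 14]

def extract_risks(document_url: str, pages: list[int]) -> list[dict]:
    # Placeholder for the LLM step: structured extraction from only the relevant pages.
    return [{
        "company_name": "Acme Corp",
        "ticker": "ACME",
        "risk_category": "Operational",
        "risk_description": "Reliance on third-party AI models may disrupt operations.",
        "citation": "Item 1A, page 13",
    }]

def process_filing(document_url: str) -> None:
    pages = classify_pages(document_url)
    risks = extract_risks(document_url, pages)
    # Real API: the Databricks SQL Connector writes structured rows to the warehouse.
    # (:name placeholders use the connector's native parameter style, v3+.)
    with sql.connect(
        server_hostname=os.environ["DATABRICKS_SERVER_HOSTNAME"],
        http_path=os.environ["DATABRICKS_HTTP_PATH"],
        access_token=os.environ["DATABRICKS_ACCESS_TOKEN"],
    ) as conn, conn.cursor() as cursor:
        for risk in risks:
            cursor.execute(
                "INSERT INTO ai_risks (company_name, ticker, risk_category, "
                "risk_description, citation) VALUES (:company_name, :ticker, "
                ":risk_category, :risk_description, :citation)",
                risk,
            )
```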
The Application is written in pure Python, without any external orchestration engine, so you can build and test it like any normal application. You can use any document AI API inside the Application, or even run open-source VLMs on GPUs by annotating functions with GPU-enabled hardware resources.
Tensorlake automatically queues requests and scales out the cluster; no extra configuration is required to handle spiky ingestion.
Key Features:
- Page Classification with VLMs: Reduces processing from ~200 pages to ~20 relevant pages per document
- Structured Extraction: Extracts AI risk categories, descriptions, and severity indicators using Pydantic schemas (an example schema follows this list)
- Parallel Processing: Uses `.map()` to process multiple documents simultaneously
- Dual Table Design: Summary table for aggregations, detailed table for deep analysis
- Pre-built Queries: 6 analytics queries for risk distribution, trends, and company comparisons
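For illustration, an extraction target along these lines could be expressed with Pydantic; the exact field names and categories here are assumptions, not the schema shipped in `process-sec.py`:

```python
from typing import Literal, Optional
from pydantic import BaseModel, Field

# Hypothetical schema: field names and categories are illustrative only.
class AIRisk(BaseModel):
    risk_category: Literal["Operational", "Regulatory", "Competitive", "Reputational"] = Field(
        description="High-level category of the AI-related risk"
    )
    risk_description: str = Field(description="Risk language extracted from the filing")
    severity: Optional[str] = Field(default=None, description="Severity indicator, if stated")
    citation: str = Field(description="Page or section reference within the filing")

class FilingRisks(BaseModel):
    company_name: str
    ticker: str
    risks: list[AIRisk]
```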
```
Document URLs → Tensorlake Application → Page Classification (VLM)
                                       → Structured Extraction (LLM)
                                       → Databricks SQL Warehouse
                                       → SQL Analytics & Dashboards
```
The architecture separates document processing from querying:
- Processing Application: Handles ingestion, classification, extraction, and loading
- Query Application: Provides pre-built analytics queries as an API
Both applications are deployed as serverless functions and can be called via HTTP or programmatically.
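For example, triggering the processing application over HTTP might look roughly like the snippet below; the endpoint URL and payload shape are assumptions for illustration, so check the Tensorlake documentation for the actual invocation format:

```python
import os
import requests

# Hypothetical endpoint and payload: consult the Tensorlake docs for the
# real invocation URL and request schema of a deployed application.
response = requests.post(
    "https://api.tensorlake.ai/applications/process-sec/invoke",  # illustrative URL
    headers={"Authorization": f"Bearer {os.environ['TENSORLAKE_API_KEY']}"},
    json={"document_urls": ["https://example.com/filing-10k-2024.pdf"]},
)
response.raise_for_status()
print(response.json())
```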
## Quick Start

Prerequisites:

- Python 3.11+
- Tensorlake API Key
- Databricks SQL Warehouse credentials:
  - Server Hostname
  - HTTP Path
  - Access Token
You need access to a Databricks SQL Warehouse. Find your connection details in the Databricks workspace under SQL Warehouses → Connection Details.
```bash
git clone https://github.com/tensorlakeai/databricks
cd databricks
pip install --upgrade tensorlake databricks-sql-connector pandas pyarrow
```

Set your credentials as environment variables:

```bash
export TENSORLAKE_API_KEY=your_tensorlake_api_key
export DATABRICKS_SERVER_HOSTNAME=your_hostname
export DATABRICKS_HTTP_PATH=your_http_path
export DATABRICKS_ACCESS_TOKEN=your_access_token
```

Tip: If you encounter SSL certificate issues on macOS or other environments, you can set:

```bash
export DATABRICKS_SQL_CONNECTOR_VERIFY_SSL=false
```
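Optionally, you can verify the Databricks credentials before running the pipeline with a minimal connectivity check using the same environment variables:

```python
import os
from databricks import sql

# Minimal sanity check: opens a connection to the SQL Warehouse and runs SELECT 1.
with sql.connect(
    server_hostname=os.environ["DATABRICKS_SERVER_HOSTNAME"],
    http_path=os.environ["DATABRICKS_HTTP_PATH"],
    access_token=os.environ["DATABRICKS_ACCESS_TOKEN"],
) as connection:
    with connection.cursor() as cursor:
        cursor.execute("SELECT 1")
        print(cursor.fetchone())
```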
## Local Testing

Run the processing script to extract data from a single test SEC filing:

```bash
python process-sec.py
```

Head into the Databricks SQL Editor and query the data using this sample query:

```sql
WITH ranked_risks AS (
SELECT
company_name,
ticker,
risk_description,
citation,
LENGTH(risk_description) as description_length,
ROW_NUMBER() OVER (PARTITION BY company_name ORDER BY LENGTH(risk_description) DESC) as rn
FROM ai_risks
WHERE risk_category = 'Operational'
)
SELECT
company_name,
ticker,
citation,
risk_description,
description_length
FROM ranked_risks
WHERE rn = 1
ORDER BY company_name
```

## Deploying to Tensorlake Cloud

Make sure you are authenticated with the Tensorlake CLI:

```bash
tensorlake whoami
```

Store your credentials securely in Tensorlake:
```bash
tensorlake secrets set TENSORLAKE_API_KEY='your_key'
tensorlake secrets set DATABRICKS_SERVER_HOSTNAME='your_hostname'
tensorlake secrets set DATABRICKS_HTTP_PATH='your_path'
tensorlake secrets set DATABRICKS_ACCESS_TOKEN='your_token'
```

Verify the secrets were stored:

```bash
tensorlake secrets list
```

Deploy the processing application:
```bash
tensorlake deploy process-sec.py
```

Once your applications have been deployed, you should be able to see them under Applications on cloud.tensorlake.ai.
Process all SEC filings using the deployed application:
```bash
python process-sec-remote.py
```

## Quick Overview: Tensorlake Applications

Tensorlake Applications are Python programs that:
- Run as serverless applications
- Can be triggered by HTTP requests, message queues, or scheduled events
- Can use any Python package or model
- Can run on CPU or GPU
- Automatically scale out based on load
- Have built-in queuing and fault tolerance
- Support function composition with `.map()` for parallel processing (see the conceptual sketch below)
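Conceptually, `.map()` fans a function out over a collection and gathers the results. The sketch below uses `concurrent.futures` purely as a local analogy; on Tensorlake the fan-out runs across the cluster with queuing and retries handled for you, and `classify_filing` is a hypothetical stand-in:

```python
from concurrent.futures import ThreadPoolExecutor

def classify_filing(document_url: str) -> dict:
    # Hypothetical stand-in for a per-document processing function.
    return {"url": document_url, "relevant_pages": []}

document_urls = [
    "https://example.com/filing-10k-2023.pdf",
    "https://example.com/filing-10k-2024.pdf",
]

# Local analogy only: Tensorlake's .map() distributes these calls across
# the cluster with durable queuing; ThreadPoolExecutor runs them in threads.
with ThreadPoolExecutor() as pool:
    results = list(pool.map(classify_filing, document_urls))
```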
## Why This Integration Matters

The integration between Tensorlake and Databricks provides several key benefits:
- Simplified ETL for Unstructured Data: Convert documents, images, and other unstructured data into structured formats without complex orchestration tools like Apache Airflow or Prefect.
- Serverless Architecture: No infrastructure management required - just write Python code and deploy.
- Automatic Scaling: Handle varying loads without manual intervention or cluster configuration.
- GPU Support: Run ML models and VLMs efficiently when needed for document classification or embedding generation.
- Databricks Integration: Leverage Databricks' powerful analytics capabilities, Unity Catalog, and Delta Lake with properly structured data.
- Production Ready: Built-in error handling, retries, and observability for enterprise workloads.
Repository structure:

```
databricks/
├── process-sec.py            # Main processing application
├── process-sec-remote.py     # Script to call deployed process app
├── databricks-query.png      # Screenshot of Databricks SQL query
├── deployed-applications.png # Screenshot of Tensorlake dashboard
├── LICENSE                   # MIT License
└── README.md                 # This file
```
## Resources

- Tensorlake Documentation
- Databricks Documentation
- Tutorial: Query SEC Filings in Databricks
- Integration Guide
- Community Support
This project is licensed under the MIT License - see the LICENSE file for details.

