Databricks Banking Lakehouse – Architecture Documentation

Document purpose
This document describes the end-to-end data architecture, migration strategy, and implementation patterns for a simulated enterprise banking system migrated to the Databricks Lakehouse Platform.
All schemas and data are synthetic and designed to reflect real-world banking complexity.


1. Executive Summary

1.1 Business Context

  • Type of organization (e.g. Tier-1 retail bank – simulated)
  • Primary business drivers:
    • Legacy system modernization
    • Regulatory reporting
    • Advanced analytics & ML readiness
    • Cost optimization and scalability

1.2 Problem Statement

  • Description of legacy architecture limitations
  • Pain points (data silos, slow reporting, high infra cost, limited scalability)

1.3 Target Outcome

  • Unified Lakehouse architecture on Databricks
  • Near real-time data availability
  • Governed, auditable, and scalable platform

2. High-Level Architecture Overview

2.1 Logical Architecture

  • Source systems
  • Ingestion layer
  • Lakehouse (Bronze / Silver / Gold)
  • Governance & security
  • Consumption layer

(Insert diagram – logical view)

2.2 Physical Architecture

  • Cloud provider (abstracted / optional)
  • Databricks workspace layout
  • Storage (object storage + Delta Lake)
  • Network & connectivity assumptions

(Insert diagram – physical view)


3. Source Systems

3.1 Banking Domain Model (Conceptual)

This project models a universal retail banking domain, designed to be realistic but fully synthetic.

Core entities:

  • Customer
  • Account
  • Product
  • Transaction
  • Branch
  • Card

Key relationships:

  • A Customer can own multiple Accounts
  • An Account is associated with one Product
  • An Account has many Transactions
  • Transactions may originate from Cards or Branches

(Insert conceptual ER diagram here)
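
Until the diagram is in place, the relationships listed above can be sketched as plain Python dataclasses. This is a minimal illustration, not the project's actual schema: all field names are assumptions, Branch and Card are reduced to optional identifiers on Transaction, and treating both as optional (so a transaction may have neither) is also an assumption.

```python
from dataclasses import dataclass, field
from datetime import datetime
from decimal import Decimal
from typing import Optional


@dataclass
class Product:
    product_id: str
    name: str


@dataclass
class Transaction:
    transaction_id: str
    account_id: str
    amount: Decimal
    booked_at: datetime
    # Origin of the transaction: a card, a branch, or neither (assumption).
    card_id: Optional[str] = None
    branch_id: Optional[str] = None


@dataclass
class Account:
    account_id: str
    customer_id: str
    product: Product  # an account is associated with exactly one product
    transactions: list[Transaction] = field(default_factory=list)


@dataclass
class Customer:
    customer_id: str
    full_name: str
    accounts: list[Account] = field(default_factory=list)  # one customer, many accounts


customer = Customer("C001", "Synthetic Person")
account = Account("A001", customer.customer_id, Product("P01", "Current Account"))
customer.accounts.append(account)
account.transactions.append(
    Transaction("T001", account.account_id, Decimal("-25.00"),
                datetime(2024, 1, 15, 9, 30), card_id="CARD01")
)
```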


3.2 Core Banking System (OLTP – Simulated)

Purpose: System of record for accounts and financial transactions.

Technology assumptions:

  • Relational OLTP database (MSSQL / Oracle-like)
  • Strong consistency
  • Change Data Capture (CDC) enabled

Core tables:

  • customers
  • accounts
  • products
  • transactions

Data characteristics:

  • High write throughput
  • Strict data integrity
  • Regulatory sensitivity

3.3 Supporting Systems

  • Payments system – card and transfer events
  • Customer Master Data – enriched customer attributes
  • Reference Data – currencies, countries, transaction types
  • Event Streams – real-time transaction events (Kafka-like)

3.4 Data Classification

Category  | Examples                   | Sensitivity
PII       | name, address, national_id | High
Financial | balances, transactions     | Very High
Reference | currency codes             | Low
Metadata  | ingestion timestamps       | Low
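
One way to make the classification above usable by pipelines is a small lookup from column to category to sensitivity. The sketch below assumes hypothetical column names for the Financial, Reference, and Metadata rows (`balance`, `currency_code`, `ingested_at`) and an assumed policy of treating unclassified columns as the most sensitive.

```python
from enum import Enum


class Sensitivity(Enum):
    LOW = 1
    HIGH = 2
    VERY_HIGH = 3


# Category -> sensitivity, taken from the classification table.
CATEGORY_SENSITIVITY = {
    "PII": Sensitivity.HIGH,
    "Financial": Sensitivity.VERY_HIGH,
    "Reference": Sensitivity.LOW,
    "Metadata": Sensitivity.LOW,
}

# Column -> category. Apart from the PII examples, column names are hypothetical.
COLUMN_CATEGORY = {
    "name": "PII",
    "address": "PII",
    "national_id": "PII",
    "balance": "Financial",
    "currency_code": "Reference",
    "ingested_at": "Metadata",
}


def sensitivity_of(column: str) -> Sensitivity:
    """Return the sensitivity of a column; unknown columns default to VERY_HIGH."""
    category = COLUMN_CATEGORY.get(column)
    if category is None:
        # Fail closed (assumed policy): unclassified data is treated as most sensitive.
        return Sensitivity.VERY_HIGH
    return CATEGORY_SENSITIVITY[category]
```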

4. Ingestion Strategy

4.1 Ingestion Patterns

Source Type | Pattern   | Technology             | Frequency
OLTP        | CDC       | Lakeflow / Auto Loader | Near real-time
Files       | Batch     | Auto Loader            | Daily
Events      | Streaming | Structured Streaming   | Real-time

4.2 Full Load vs Incremental Load

  • Historical backfill approach
  • Cutover strategy
  • Reconciliation logic
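
As an illustration of what the reconciliation logic might look like, the sketch below compares a source and a target row set by business key and row fingerprint. It is plain Python over in-memory rows; the function names, the `account_id` key, and the sample data are all assumptions, and a real implementation would run the equivalent comparison on the platform.

```python
import hashlib


def row_fingerprint(row: dict) -> str:
    """Hash a row's values in a stable column order."""
    canonical = "|".join(f"{key}={row[key]}" for key in sorted(row))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def reconcile(source_rows: list[dict], target_rows: list[dict], key: str) -> dict:
    """Compare two row sets by business key and report the differences."""
    source = {row[key]: row_fingerprint(row) for row in source_rows}
    target = {row[key]: row_fingerprint(row) for row in target_rows}
    return {
        "missing_in_target": sorted(source.keys() - target.keys()),
        "unexpected_in_target": sorted(target.keys() - source.keys()),
        "mismatched": sorted(
            k for k in source.keys() & target.keys() if source[k] != target[k]
        ),
    }


# Synthetic sample data (hypothetical).
source_rows = [
    {"account_id": "A1", "balance": "100.00"},
    {"account_id": "A2", "balance": "250.00"},
    {"account_id": "A3", "balance": "0.00"},
]
target_rows = [
    {"account_id": "A1", "balance": "100.00"},
    {"account_id": "A2", "balance": "205.00"},
]
report = reconcile(source_rows, target_rows, key="account_id")
```

Here `report` flags A3 as missing in the target and A2 as mismatched, which is the kind of output a cutover checklist could gate on.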

4.3 Schema Evolution Handling

  • Additive changes
  • Breaking changes
  • Versioning strategy
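
The distinction between additive and breaking changes can be made concrete with a small classifier over two schema versions. This sketch represents a schema as a `{column: type}` dictionary, which is an assumption, and deliberately simplifies by treating every type change as breaking even though some type widenings may be safe in practice.

```python
def classify_schema_change(old: dict[str, str], new: dict[str, str]) -> str:
    """Classify a change between two {column: type} schemas.

    Returns "none", "additive" (only new columns were added), or
    "breaking" (a column was removed or its type changed).
    """
    removed = old.keys() - new.keys()
    retyped = {col for col in old.keys() & new.keys() if old[col] != new[col]}
    if removed or retyped:
        return "breaking"
    if new.keys() - old.keys():
        return "additive"
    return "none"


# Hypothetical schema versions for an accounts table.
v1 = {"account_id": "string", "balance": "decimal(18,2)"}
v2 = {**v1, "currency_code": "string"}               # new column only
v3 = {"account_id": "string", "balance": "double"}   # type change
```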

5. Lakehouse Architecture (Medallion)

5.1 Bronze Layer – Raw Data

  • Purpose
  • Data format (Delta)
  • CDC handling
  • Metadata captured
  • Retention policy
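
A possible shape for a Bronze record is the untouched source payload plus a few ingestion metadata fields. The sketch below is illustrative only: the underscore-prefixed field names, the source system name, and the file path are assumptions rather than the project's actual conventions.

```python
import json
from datetime import datetime, timezone


def to_bronze_record(raw_payload: str, source_system: str, source_file: str,
                     ingested_at: datetime) -> dict:
    """Wrap an untouched raw payload with ingestion metadata."""
    return {
        "raw_payload": raw_payload,  # stored as-is, no parsing or cleaning
        "_source_system": source_system,
        "_source_file": source_file,
        "_ingested_at": ingested_at.isoformat(),
    }


# Hypothetical CDC update event for the accounts table.
cdc_event = json.dumps(
    {"op": "u", "table": "accounts", "account_id": "A1", "balance": "120.00"}
)
record = to_bronze_record(
    cdc_event,
    source_system="core_banking",
    source_file="cdc/accounts/0001.json",
    ingested_at=datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc),
)
```

Keeping the payload as an unparsed string means schema problems surface in Silver rather than blocking ingestion, at the cost of deferring validation.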

5.2 Silver Layer – Conformed Data

  • Deduplication logic
  • Data quality rules
  • Business key definitions
  • Slowly Changing Dimension (SCD) strategy
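
The document does not yet fix an SCD type. Assuming Type 2 (full history with validity ranges) is chosen, the core logic can be sketched in plain Python as follows. The column names `valid_from`, `valid_to`, and `is_current` are assumptions, and on the platform this would more likely be a merge into a Delta table than a loop over a list.

```python
from datetime import date


def apply_scd2(history: list[dict], incoming: dict, key: str, effective: date) -> list[dict]:
    """Apply one incoming record to an SCD Type 2 history and return a new list.

    Each history row carries valid_from, valid_to (None means open) and is_current.
    """
    result = [dict(row) for row in history]  # do not mutate the caller's rows
    current = next(
        (r for r in result if r[key] == incoming[key] and r["is_current"]), None
    )

    if current is not None:
        unchanged = all(current.get(col) == val for col, val in incoming.items())
        if unchanged:
            return result  # same attribute values, so no new version is needed
        # Close the existing version at the effective date.
        current["valid_to"] = effective
        current["is_current"] = False

    result.append(
        {**incoming, "valid_from": effective, "valid_to": None, "is_current": True}
    )
    return result


# Hypothetical customer address changes.
history: list[dict] = []
history = apply_scd2(history, {"customer_id": "C1", "address": "1 Old Street"},
                     "customer_id", date(2024, 1, 1))
history = apply_scd2(history, {"customer_id": "C1", "address": "2 New Road"},
                     "customer_id", date(2024, 6, 1))
history = apply_scd2(history, {"customer_id": "C1", "address": "2 New Road"},
                     "customer_id", date(2024, 7, 1))  # unchanged, ignored
```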

5.3 Gold Layer – Business Models

  • Reporting-ready tables
  • Regulatory datasets
  • Aggregations & KPIs
  • ML feature tables
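
As an example of the kind of aggregation a Gold table might hold, the sketch below computes a transaction count and net amount per account per day. The input field names (`account_id`, `booked_date`, `amount`) and the output measures are assumptions chosen for illustration.

```python
from collections import defaultdict
from decimal import Decimal


def daily_totals(transactions: list[dict]) -> dict[tuple[str, str], dict]:
    """Aggregate transactions into one row per (account_id, booked_date)."""
    totals = defaultdict(lambda: {"txn_count": 0, "net_amount": Decimal("0")})
    for txn in transactions:
        bucket = totals[(txn["account_id"], txn["booked_date"])]
        bucket["txn_count"] += 1
        # Decimal avoids floating-point rounding on monetary amounts.
        bucket["net_amount"] += Decimal(txn["amount"])
    return dict(totals)


# Synthetic sample data (hypothetical).
transactions = [
    {"account_id": "A1", "booked_date": "2024-01-15", "amount": "-25.00"},
    {"account_id": "A1", "booked_date": "2024-01-15", "amount": "100.00"},
    {"account_id": "A1", "booked_date": "2024-01-16", "amount": "-10.00"},
]
gold = daily_totals(transactions)
```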

6. Data Quality & Reliability

6.1 Data Quality Framework

  • Expectations (DLT)
  • Validation rules
  • Reject vs quarantine strategy
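
The quarantine approach can be illustrated without any platform dependency: rows that fail one or more rules are set aside together with the names of the failed rules instead of being silently dropped. The rule names and checks below are hypothetical, and this plain-Python sketch only mirrors the idea behind DLT expectations rather than using their API.

```python
from decimal import Decimal, InvalidOperation
from typing import Callable


def is_decimal(value) -> bool:
    """True if the value parses as a decimal number."""
    try:
        Decimal(str(value))
        return True
    except InvalidOperation:
        return False


# Rule name -> predicate. Names and checks are illustrative only.
RULES: dict[str, Callable[[dict], bool]] = {
    "transaction_id_present": lambda r: bool(r.get("transaction_id")),
    "amount_is_numeric": lambda r: is_decimal(r.get("amount")),
    "currency_has_three_chars": lambda r: len(str(r.get("currency", ""))) == 3,
}


def split_by_quality(rows: list[dict]) -> tuple[list[dict], list[dict]]:
    """Return (valid_rows, quarantined_rows); quarantined rows list their failed rules."""
    valid, quarantined = [], []
    for row in rows:
        failed = [name for name, check in RULES.items() if not check(row)]
        if failed:
            quarantined.append({**row, "_failed_rules": failed})
        else:
            valid.append(row)
    return valid, quarantined


# Synthetic sample data (hypothetical).
rows = [
    {"transaction_id": "T1", "amount": "10.50", "currency": "EUR"},
    {"transaction_id": "", "amount": "abc", "currency": "EUR"},
]
valid, quarantined = split_by_quality(rows)
```

Recording the failed rule names alongside the row makes it possible to fix and re-process quarantined data later, which a hard reject would not allow.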

6.2 Failure Scenarios

  • Late-arriving data
  • Duplicate events
  • Partial ingestion failures

6.3 Replay & Backfill Strategy

  • CDC replay
  • Point-in-time recovery
  • Idempotency guarantees
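
One common way to obtain an idempotency guarantee under CDC replay is to apply an event only when its sequence number is higher than the last one applied for the same key. The sketch below assumes each event carries a monotonically increasing `seq` field and an `account_id` key; both names are hypothetical.

```python
def apply_events(state: dict[str, dict], events: list[dict]) -> dict[str, dict]:
    """Apply CDC events to keyed state, ignoring stale or repeated ones.

    An event is applied only if its sequence number is greater than the
    last one applied for the same key, so replaying a batch changes nothing.
    """
    result = {key: dict(row) for key, row in state.items()}
    for event in sorted(events, key=lambda e: e["seq"]):
        current = result.get(event["account_id"])
        if current is not None and event["seq"] <= current["seq"]:
            continue  # duplicate or out-of-date event
        result[event["account_id"]] = {"balance": event["balance"], "seq": event["seq"]}
    return result


# Synthetic CDC batch (hypothetical).
batch = [
    {"account_id": "A1", "seq": 1, "balance": "100.00"},
    {"account_id": "A1", "seq": 2, "balance": "80.00"},
    {"account_id": "A2", "seq": 3, "balance": "500.00"},
]
once = apply_events({}, batch)
twice = apply_events(once, batch)  # replaying the same batch
```

Because `twice` equals `once`, a failed run can be retried from the start of the batch without double-applying changes.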

7. Governance & Security

7.1 Unity Catalog Design

  • Catalog structure
  • Schema ownership
  • Environment separation

7.2 Access Control

  • Role-based access control (RBAC)
  • Row-level security
  • Column masking for PII
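
The intent of column masking can be shown with a small role-aware function. On the platform this would normally be expressed as a Unity Catalog column mask rather than application code; the version below is a plain-Python sketch, and the role names and the "keep the last four characters" rule are assumptions.

```python
# PII columns follow the data classification examples; role names are hypothetical.
PII_COLUMNS = {"name", "address", "national_id"}
ROLES_WITH_PII_ACCESS = {"compliance_officer"}


def mask_value(value) -> str:
    """Keep the last 4 characters and replace the rest with '*'."""
    text = str(value)
    if len(text) <= 4:
        return "*" * len(text)
    return "*" * (len(text) - 4) + text[-4:]


def apply_column_masks(row: dict, role: str) -> dict:
    """Return a copy of the row with PII columns masked for unprivileged roles."""
    if role in ROLES_WITH_PII_ACCESS:
        return dict(row)
    return {
        col: mask_value(val) if col in PII_COLUMNS else val
        for col, val in row.items()
    }


# Synthetic sample row (hypothetical).
row = {"customer_id": "C1", "name": "Synthetic Person", "national_id": "1234567890"}
masked = apply_column_masks(row, role="analyst")
```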

7.3 Audit & Lineage

  • Data lineage tracking
  • Access audit logs
  • Compliance considerations

8. Orchestration & Pipelines

8.1 Pipeline Types

  • DLT pipelines
  • Batch Spark jobs
  • Streaming jobs

8.2 Scheduling & Dependencies

  • Job orchestration approach
  • Dependency management
  • SLA definitions
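
Dependency management between layers amounts to running tasks in topological order. The sketch below uses Python's standard `graphlib` module (Python 3.9+) to derive a valid run order; the task names are hypothetical, and a real deployment would declare the same dependencies in the chosen job orchestrator.

```python
from graphlib import TopologicalSorter

# Task -> set of tasks it depends on. Task names are hypothetical.
PIPELINE = {
    "bronze_transactions": set(),
    "bronze_accounts": set(),
    "silver_transactions": {"bronze_transactions"},
    "silver_accounts": {"bronze_accounts"},
    "gold_daily_balances": {"silver_transactions", "silver_accounts"},
}

# static_order() yields every task after all of its dependencies.
run_order = list(TopologicalSorter(PIPELINE).static_order())
```

`TopologicalSorter` raises `graphlib.CycleError` if the dependencies form a cycle, which is a useful check to run before deploying a job definition.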

9. Performance & Cost Optimization

9.1 Compute Strategy

  • Job clusters vs all-purpose clusters
  • Autoscaling configuration
  • Photon usage

9.2 Storage Optimization

  • Partitioning strategy
  • Z-Ordering
  • Vacuum & retention

9.3 Cost Trade-offs

  • Streaming vs batch
  • CDC granularity
  • Data retention policies

10. Consumption & Use Cases

10.1 BI & Reporting

  • Semantic layer design
  • Example dashboards

10.2 Regulatory Reporting

  • Data accuracy guarantees
  • Reproducibility
  • Audit support

10.3 Advanced Analytics & ML

  • Feature engineering approach
  • Real-time scoring readiness

11. Migration Strategy

11.1 Migration Phases

  1. Assessment
  2. Historical load
  3. Dual-run period
  4. Cutover

11.2 Risk Mitigation

  • Data reconciliation
  • Rollback strategy
  • Parallel validation

12. Key Architecture Decisions

12.1 Technology Choices

  • Why Databricks Lakehouse
  • Why Delta Lake
  • Why DLT vs custom Spark jobs

12.2 Alternatives Considered

  • Traditional DW
  • Lambda architecture
  • Custom open-source stack

13. Assumptions & Limitations

  • What is simulated
  • What is out of scope
  • Known limitations

14. Appendix

14.1 Glossary

14.2 Reference Diagrams

14.3 Sample Schemas


Author: Benjamin Ibrulj
Role: Senior Data Engineer / Architect
Repository: GitHub link
License: Apache 2.0