Databricks Banking Lakehouse – Architecture Documentation
Document purpose
This document describes the end-to-end data architecture, migration strategy, and implementation patterns for a simulated enterprise banking system migrated to the Databricks Lakehouse Platform.
All schemas and data are synthetic and designed to reflect real-world banking complexity.
1. Executive Summary
1.1 Business Context
- Type of organization (e.g. Tier-1 retail bank – simulated)
- Primary business drivers:
  - Legacy system modernization
  - Regulatory reporting
  - Advanced analytics & ML readiness
  - Cost optimization and scalability
1.2 Problem Statement
- Description of legacy architecture limitations
- Pain points (data silos, slow reporting, high infra cost, limited scalability)
1.3 Target Outcome
- Unified Lakehouse architecture on Databricks
- Near real-time data availability
- Governed, auditable, and scalable platform
2. High-Level Architecture Overview
2.1 Logical Architecture
- Source systems
- Ingestion layer
- Lakehouse (Bronze / Silver / Gold)
- Governance & security
- Consumption layer
(Insert diagram – logical view)
2.2 Physical Architecture
- Cloud provider (abstracted / optional)
- Databricks workspace layout
- Storage (object storage + Delta Lake)
- Network & connectivity assumptions
(Insert diagram – physical view)
3. Source Systems
3.1 Banking Domain Model (Conceptual)
This project models a generic retail banking domain that is realistic in structure but fully synthetic.
Core entities:
- Customer
- Account
- Product
- Transaction
- Branch
- Card
Key relationships:
- A Customer can own multiple Accounts
- An Account is associated with one Product
- An Account has many Transactions
- Transactions may originate from Cards or Branches
(Insert diagram – conceptual ER view)
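As a rough illustration of the relationships listed above, the conceptual model could be sketched with plain Python dataclasses. All field names (for example `booked_at`, `card_id`) are assumptions made for this sketch and are not part of a defined schema.

```python
from dataclasses import dataclass, field
from datetime import datetime
from decimal import Decimal
from typing import Optional


@dataclass
class Product:
    product_id: str
    name: str


@dataclass
class Transaction:
    transaction_id: str
    account_id: str
    amount: Decimal
    booked_at: datetime
    # A transaction may originate from a card or a branch, so both are optional.
    card_id: Optional[str] = None
    branch_id: Optional[str] = None


@dataclass
class Account:
    account_id: str
    customer_id: str
    product: Product  # an account is associated with exactly one product
    transactions: list[Transaction] = field(default_factory=list)


@dataclass
class Customer:
    customer_id: str
    accounts: list[Account] = field(default_factory=list)  # one customer, many accounts


# Hypothetical sample data, for illustration only.
customer = Customer("C-001")
account = Account("A-001", customer.customer_id, Product("P-001", "Current Account"))
customer.accounts.append(account)
account.transactions.append(
    Transaction("T-001", account.account_id, Decimal("-25.00"),
                datetime(2024, 1, 15, 9, 30), card_id="CARD-001")
)
```

Branch and Card are left as identifiers here to keep the sketch short; in a fuller model they would be entities of their own.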
3.2 Core Banking System (OLTP – Simulated)
Purpose: System of record for accounts and financial transactions.
Technology assumptions:
- Relational OLTP database (Microsoft SQL Server or Oracle-like)
- Strong consistency
- Change Data Capture (CDC) enabled
Core tables:
- customers
- accounts
- products
- transactions
Data characteristics:
- High write throughput
- Strict data integrity
- Regulatory sensitivity
3.3 Supporting Systems
- Payments system – card and transfer events
- Customer Master Data – enriched customer attributes
- Reference Data – currencies, countries, transaction types
- Event Streams – real-time transaction events (Kafka-like)
3.4 Data Classification
| Category | Examples | Sensitivity |
|---|---|---|
| PII | name, address, national_id | High |
| Financial | balances, transactions | Very High |
| Reference | currency codes | Low |
| Metadata | ingestion timestamps | Low |
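One way to make the classification above usable by pipelines is a small lookup from column to sensitivity. The sketch below is an assumption about how that could look: the column-to-category tagging and the fail-closed default for untagged columns are hypothetical choices, not documented behavior.

```python
from enum import Enum


class Sensitivity(Enum):
    LOW = 1
    HIGH = 2
    VERY_HIGH = 3


# Category -> sensitivity, taken from the classification table.
CATEGORY_SENSITIVITY = {
    "PII": Sensitivity.HIGH,
    "Financial": Sensitivity.VERY_HIGH,
    "Reference": Sensitivity.LOW,
    "Metadata": Sensitivity.LOW,
}

# Hypothetical column tagging; real column names would come from the schemas.
COLUMN_CATEGORY = {
    "name": "PII",
    "address": "PII",
    "national_id": "PII",
    "balance": "Financial",
    "currency_code": "Reference",
    "ingested_at": "Metadata",
}


def sensitivity_of(column: str) -> Sensitivity:
    """Return the sensitivity of a column, treating untagged columns as most sensitive."""
    category = COLUMN_CATEGORY.get(column)
    if category is None:
        # Assumption: fail closed so that new, untagged columns are not exposed by default.
        return Sensitivity.VERY_HIGH
    return CATEGORY_SENSITIVITY[category]
```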
4. Ingestion Strategy
4.1 Ingestion Patterns
| Source Type | Pattern | Technology | Frequency |
|---|---|---|---|
| OLTP | CDC | Lakeflow / Auto Loader (over CDC files landed in object storage) | Near real-time |
| Files | Batch | Auto Loader | Daily |
| Events | Streaming | Structured Streaming | Real-time |
4.2 Full Load vs Incremental Load
- Historical backfill approach
- Cutover strategy
- Reconciliation logic
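Reconciliation after a backfill or cutover is often done by comparing row counts and an order-independent checksum between source and target. The following is a minimal sketch of that idea in plain Python; the function names and the choice of SHA-256 are assumptions, and on a real platform the same comparison would run as queries on both sides.

```python
import hashlib


def table_fingerprint(rows, key, columns):
    """Return (row_count, checksum); the checksum does not depend on row order."""
    digests = []
    for row in rows:
        payload = "|".join(str(row[c]) for c in (key, *columns))
        digests.append(hashlib.sha256(payload.encode("utf-8")).hexdigest())
    combined = hashlib.sha256("".join(sorted(digests)).encode("utf-8")).hexdigest()
    return len(rows), combined


def reconcile(source_rows, target_rows, key, columns):
    """Compare source and target on count, checksum, and key coverage."""
    src_count, src_sum = table_fingerprint(source_rows, key, columns)
    tgt_count, tgt_sum = table_fingerprint(target_rows, key, columns)
    src_keys = {r[key] for r in source_rows}
    tgt_keys = {r[key] for r in target_rows}
    return {
        "count_match": src_count == tgt_count,
        "checksum_match": src_sum == tgt_sum,
        "missing_in_target": sorted(src_keys - tgt_keys),
        "unexpected_in_target": sorted(tgt_keys - src_keys),
    }


# Hypothetical sample data; values are kept as strings so that formatting
# differences (e.g. "100.0" vs "100.00") surface as mismatches.
source = [
    {"account_id": "A1", "balance": "100.00"},
    {"account_id": "A2", "balance": "50.00"},
]
target_complete = list(reversed(source))
target_partial = [source[0]]

report_ok = reconcile(source, target_complete, "account_id", ["balance"])
report_bad = reconcile(source, target_partial, "account_id", ["balance"])
```

Here `report_ok` reports matching counts and checksums despite the different row order, while `report_bad` lists `A2` as missing in the target.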
4.3 Schema Evolution Handling
- Additive changes
- Breaking changes
- Versioning strategy
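To make the distinction between additive and breaking changes concrete, a schema comparison could look roughly like the sketch below. It assumes schemas are represented as simple column-to-type mappings and treats any removed or retyped column as breaking; both are simplifying assumptions.

```python
def classify_schema_change(old: dict, new: dict) -> dict:
    """Classify the difference between two {column: type} schemas."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    retyped = sorted(c for c in set(old) & set(new) if old[c] != new[c])

    if removed or retyped:
        kind = "breaking"   # existing readers may fail or misinterpret data
    elif added:
        kind = "additive"   # existing readers can ignore the new columns
    else:
        kind = "none"
    return {"kind": kind, "added": added, "removed": removed, "retyped": retyped}


# Hypothetical schema versions for the accounts table.
v1 = {"account_id": "string", "balance": "decimal(18,2)"}
v2 = {"account_id": "string", "balance": "decimal(18,2)", "opened_on": "date"}
v3 = {"account_id": "string", "balance": "string"}

additive_change = classify_schema_change(v1, v2)
breaking_change = classify_schema_change(v1, v3)
```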
5. Lakehouse Architecture (Medallion)
5.1 Bronze Layer – Raw Data
- Purpose
- Data format (Delta)
- CDC handling
- Metadata captured
- Retention policy
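As an illustration of the kind of metadata a Bronze record might carry, the sketch below wraps a raw record with ingestion metadata and a content hash. The metadata column names (`_source_system`, `_source_file`, `_ingested_at`, `_record_hash`) are hypothetical, and the ingestion timestamp is passed in explicitly so the function stays deterministic.

```python
import hashlib
import json
from datetime import datetime, timezone


def to_bronze(record: dict, source_system: str, source_file: str,
              ingested_at: datetime) -> dict:
    """Return the raw record unchanged, plus ingestion metadata columns."""
    # Sorting keys makes the hash independent of field order in the source payload.
    payload = json.dumps(record, sort_keys=True, default=str)
    return {
        **record,
        "_source_system": source_system,
        "_source_file": source_file,
        "_ingested_at": ingested_at.isoformat(),
        "_record_hash": hashlib.sha256(payload.encode("utf-8")).hexdigest(),
    }


# Hypothetical sample record and source names.
ts = datetime(2024, 1, 1, tzinfo=timezone.utc)
bronze_a = to_bronze({"account_id": "A1", "balance": "100.00"},
                     "core_banking", "accounts_0001.json", ts)
bronze_b = to_bronze({"balance": "100.00", "account_id": "A1"},
                     "core_banking", "accounts_0001.json", ts)
```

The record hash gives downstream layers a cheap way to detect exact duplicates without comparing every column.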
5.2 Silver Layer – Conformed Data
- Deduplication logic
- Data quality rules
- Business key definitions
- Slowly Changing Dimensions (SCD strategy)
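If the SCD strategy is Type 2, which this document does not yet state, the core logic is to close the current row and append a new one whenever a tracked attribute changes. The sketch below shows that logic on in-memory rows; the column names `valid_from`, `valid_to`, and `is_current` are assumptions, and on Delta Lake the same pattern would normally be expressed as a merge.

```python
from datetime import date


def apply_scd2(history, incoming, key, tracked, effective):
    """Return a new history list with one incoming record applied as SCD Type 2."""
    result = []
    needs_new_row = True  # stays True unless an identical current row exists
    for row in history:
        if row[key] == incoming[key] and row["is_current"]:
            if all(row[c] == incoming[c] for c in tracked):
                needs_new_row = False          # nothing changed, keep as-is
                result.append(row)
            else:
                # Close the previous version instead of overwriting it.
                result.append({**row, "valid_to": effective, "is_current": False})
        else:
            result.append(row)
    if needs_new_row:
        result.append({
            **{c: incoming[c] for c in (key, *tracked)},
            "valid_from": effective,
            "valid_to": None,
            "is_current": True,
        })
    return result


# Hypothetical customer address change.
history = apply_scd2([], {"customer_id": "C1", "address": "1 Main St"},
                     "customer_id", ["address"], date(2024, 1, 1))
history = apply_scd2(history, {"customer_id": "C1", "address": "9 Oak Ave"},
                     "customer_id", ["address"], date(2024, 6, 1))
```

After the two calls, `history` holds a closed row for the old address and a current row for the new one; applying the second record again leaves the history unchanged.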
5.3 Gold Layer – Business Models
- Reporting-ready tables
- Regulatory datasets
- Aggregations & KPIs
- ML feature tables
6. Data Quality & Reliability
6.1 Data Quality Framework
- Expectations (Delta Live Tables, DLT)
- Validation rules
- Reject vs quarantine strategy
6.2 Failure Scenarios
- Late-arriving data
- Duplicate events
- Partial ingestion failures
6.3 Replay & Backfill Strategy
- CDC replay
- Point-in-time recovery
- Idempotency guarantees
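Replay is only safe if applying the same CDC batch twice produces the same result. One common way to get there is to order events by a monotonically increasing sequence number and skip anything not newer than what is already stored. The sketch below assumes such a sequence (`seq`) exists in the CDC feed and keeps deletes as tombstones so that replayed older events cannot resurrect a deleted row; the event shape is hypothetical.

```python
def apply_cdc(state: dict, events: list, key: str) -> dict:
    """Apply CDC events to a keyed state; safe to call repeatedly with the same events."""
    new_state = dict(state)
    for event in sorted(events, key=lambda e: e["seq"]):
        current = new_state.get(event[key])
        if current is not None and event["seq"] <= current["seq"]:
            continue  # duplicate or stale event: already reflected in the state
        new_state[event[key]] = {
            **event.get("data", {}),
            key: event[key],
            "seq": event["seq"],
            "deleted": event["op"] == "delete",  # tombstone instead of removing the row
        }
    return new_state


def live_rows(state: dict) -> dict:
    """Return only rows that have not been deleted."""
    return {k: v for k, v in state.items() if not v["deleted"]}


# Hypothetical CDC batch for the accounts table.
events = [
    {"op": "upsert", "seq": 1, "account_id": "A1", "data": {"balance": "100.00"}},
    {"op": "upsert", "seq": 2, "account_id": "A1", "data": {"balance": "80.00"}},
    {"op": "upsert", "seq": 3, "account_id": "A2", "data": {"balance": "10.00"}},
    {"op": "delete", "seq": 4, "account_id": "A2"},
]

once = apply_cdc({}, events, "account_id")
twice = apply_cdc(once, events, "account_id")  # replaying the batch changes nothing
```

The same guard also absorbs duplicate events within a batch, since the second copy carries a sequence number that is no longer newer than the stored one.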
7. Governance & Security
7.1 Unity Catalog Design
- Catalog structure
- Schema ownership
- Environment separation
7.2 Access Control
- Role-based access control (RBAC)
- Row-level security
- Column masking for PII
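The intent of column masking can be illustrated with a small role-aware function. This is only a sketch of the behavior: the role name `compliance_officer` and the masking rules are hypothetical, and on Databricks the enforcement would typically live in Unity Catalog column masks and row filters rather than in application code.

```python
PII_COLUMNS = {"name", "address", "national_id"}
PRIVILEGED_ROLES = {"compliance_officer"}  # hypothetical role allowed to see raw PII


def mask_value(column: str, value: str) -> str:
    """Mask a PII value; national IDs keep their last four characters."""
    if column == "national_id" and len(value) > 4:
        return "*" * (len(value) - 4) + value[-4:]
    return "***"


def mask_row(row: dict, role: str) -> dict:
    """Return a copy of the row with PII columns masked for non-privileged roles."""
    if role in PRIVILEGED_ROLES:
        return dict(row)
    return {
        column: mask_value(column, str(value)) if column in PII_COLUMNS else value
        for column, value in row.items()
    }


# Hypothetical customer row.
customer_row = {"customer_id": "C1", "name": "Jane Doe", "national_id": "AB1234567"}
analyst_view = mask_row(customer_row, "analyst")
compliance_view = mask_row(customer_row, "compliance_officer")
```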
7.3 Audit & Lineage
- Data lineage tracking
- Access audit logs
- Compliance considerations
8. Orchestration & Pipelines
8.1 Pipeline Types
- DLT pipelines
- Batch Spark jobs
- Streaming jobs
8.2 Scheduling & Dependencies
- Job orchestration approach
- Dependency management
- SLA definitions
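Dependency management between layers amounts to running tasks in topological order. The sketch below derives a valid run order with Python's standard `graphlib`; the task names are hypothetical, and a real deployment would declare the same dependencies in the job orchestrator rather than compute them by hand.

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: each task maps to the set of tasks it depends on.
dependencies = {
    "silver_accounts": {"bronze_accounts"},
    "silver_transactions": {"bronze_transactions"},
    "gold_daily_balances": {"silver_accounts", "silver_transactions"},
}

# static_order() yields every task after all of its dependencies.
run_order = list(TopologicalSorter(dependencies).static_order())
```

Several orders can be valid, so only the relative position of dependent tasks is guaranteed. A cycle in the graph raises `graphlib.CycleError`, which is a useful early check on pipeline definitions.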
9. Performance & Cost Optimization
9.1 Compute Strategy
- Job clusters vs all-purpose clusters
- Autoscaling configuration
- Photon usage
9.2 Storage Optimization
- Partitioning strategy
- Z-Ordering
- Vacuum & retention
9.3 Cost Trade-offs
- Streaming vs batch
- CDC granularity
- Data retention policies
10. Consumption & Use Cases
10.1 BI & Reporting
- Semantic layer design
- Example dashboards
10.2 Regulatory Reporting
- Data accuracy guarantees
- Reproducibility
- Audit support
10.3 Advanced Analytics & ML
- Feature engineering approach
- Real-time scoring readiness
11. Migration Strategy
11.1 Migration Phases
- Assessment
- Historical load
- Dual-run period
- Cutover
11.2 Risk Mitigation
- Data reconciliation
- Rollback strategy
- Parallel validation
12. Key Architecture Decisions
12.1 Technology Choices
- Why Databricks Lakehouse
- Why Delta Lake
- Why DLT vs custom Spark jobs
12.2 Alternatives Considered
- Traditional DW
- Lambda architecture
- Custom open-source stack
13. Assumptions & Limitations
- What is simulated
- What is out of scope
- Known limitations
14. Appendix
14.1 Glossary
14.2 Reference Diagrams
14.3 Sample Schemas
Author: Benjamin Ibrulj
Role: Senior Data Engineer / Architect
Repository: GitHub link
License: Apache 2.0