Code Blue Database Intelligence Platform
I built an AI-powered database intelligence platform that transforms raw telemetry into actionable insights, providing automated performance analysis, regression detection, and optimization recommendations with full explainability and human-in-the-loop controls.
At a glance
- Detects regressions before user impact
- AI recommendations with explainability
- GDPR-compliant data classification
Code Blue Platform Documentation
Platform Purpose
Problem Statement
Database management teams face a critical challenge: understanding what is happening inside their databases in real time, detecting performance degradation before it impacts users, and maintaining compliance with data governance requirements. Traditional monitoring tools provide metrics but lack intelligence—they tell you what is happening, but not why it matters or what to do about it.
Code Blue addresses this gap by providing database intelligence, not just monitoring. It collects, correlates, and analyzes database telemetry to provide actionable insights backed by evidence and explanation.
Why This Platform Exists
Modern database environments generate massive amounts of metadata, performance metrics, and execution data. Without intelligent analysis, this data becomes noise rather than signal. Code Blue exists to:
- Transform raw database telemetry into actionable intelligence through automated analysis and pattern recognition
- Detect performance regressions and anomalies before they escalate into production incidents
- Maintain historical context for all database changes, enabling impact analysis and trend identification
- Enforce data governance and classification by tracking sensitive data and schema changes
- Provide AI-assisted recommendations while maintaining human oversight and explainability
Difference from Simple Monitoring
Traditional monitoring tools:
- Collect point-in-time metrics (CPU, memory, query duration)
- Alert when thresholds are exceeded
- Provide dashboards showing current state
- Require manual correlation and analysis
Code Blue provides database intelligence through:
- Historical baselines and trend analysis: Compare current performance against established patterns to detect meaningful deviations
- Root cause correlation: Link performance issues to specific changes (schema modifications, index changes, query plan alterations)
- Predictive detection: Identify emerging problems through statistical regression analysis before they become critical
- Contextual recommendations: Provide optimization suggestions tied to specific issues with confidence scoring
- Change impact assessment: Analyze how metadata changes affect dependent queries and workloads
- Automated data classification: Identify and track sensitive data for GDPR and compliance requirements
Core Capabilities
1. Metadata Intelligence and Change Tracking
What It Does:
- Captures complete schema metadata (tables, columns, indexes, statistics, stored procedures, functions, views)
- Records all schema changes with timestamps and before/after states
- Tracks index modifications (creation, deletion, rebuild, reorganization)
- Monitors statistics updates and staleness
- Maintains append-only change history for full audit trail
How It Works:
- Connects to SQL Server system catalog views (`sys.tables`, `sys.indexes`, `sys.stats`, etc.)
- Periodically snapshots current metadata state
- Compares against previous snapshots to detect changes
- Stores change events with complete context (database, schema, object type, tenant)
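The snapshot-and-compare step above can be sketched as a dictionary diff. This is an illustrative sketch only; the function name and event shape are assumptions, and the real platform records far richer context (database, schema, object type, tenant).

```python
def diff_snapshots(previous: dict, current: dict) -> list:
    """Compare two metadata snapshots (object name -> definition dict)
    and emit discrete change events. Illustrative sketch only."""
    events = []
    # Objects present now but not before: creates
    for name in current.keys() - previous.keys():
        events.append({"object": name, "change": "create"})
    # Objects present before but not now: drops
    for name in previous.keys() - current.keys():
        events.append({"object": name, "change": "drop"})
    # Objects in both snapshots whose definition changed: alters,
    # stored with before/after state for the audit trail
    for name in previous.keys() & current.keys():
        if previous[name] != current[name]:
            events.append({"object": name, "change": "alter",
                           "before": previous[name], "after": current[name]})
    return events
```

Because events are emitted as discrete records rather than updates, they append naturally to the platform's immutable change history.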
Use Cases:
- Audit trail for compliance requirements
- Root cause analysis (correlate performance issues with recent schema changes)
- Change impact assessment (identify queries affected by index or table modifications)
- Rollback planning (understand what changed and when)
Data Collected:
- Object metadata (names, types, definitions)
- Change events (create, alter, drop)
- Index properties (columns, includes, filters, fragmentation)
- Statistics metadata (last update, modification counter, row count)
2. Data Classification and Governance
What It Does:
- Identifies columns containing sensitive data types (SSN, credit cards, emails, phone numbers, account numbers)
- Classifies data according to sensitivity levels for GDPR and regulatory compliance
- Tracks data lineage and usage patterns
- Provides opt-out enforcement for AI analysis per database or tenant
How It Works:
- Pattern-based detection using regular expressions on sample data
- Classification metadata stored per column
- Integration with AI audit layer to mask sensitive data before LLM analysis
- Tenant-based policies control whether data is included in AI recommendations
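The pattern-based detection described above can be sketched as follows. The detector patterns, the match-fraction threshold, and the function name are illustrative assumptions, not the platform's actual classification rules.

```python
import re

# Illustrative detectors keyed by sensitivity label; the real platform
# supports more types (phone numbers, account numbers, etc.).
DETECTORS = {
    "SSN": re.compile(r"\d{3}-\d{2}-\d{4}"),
    "Email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "CreditCard": re.compile(r"\d{16}"),
}

def classify_column(samples: list, threshold: float = 0.8):
    """Label a column as sensitive when most sampled values match a
    pattern. Read-only: operates on sampled values, never modifies data."""
    for label, pattern in DETECTORS.items():
        hits = sum(1 for value in samples if pattern.fullmatch(value))
        if samples and hits / len(samples) >= threshold:
            return label
    return None
```

The threshold guards against false positives from a few coincidental matches; the resulting label is stored as classification metadata per column.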
GDPR Relevance:
- Identifies personally identifiable information (PII) automatically
- Tracks where sensitive data resides across databases
- Enables data subject access requests (DSAR) by showing what data exists and where
- Supports right to deletion by identifying all locations of personal data
- Provides audit trail of data access and processing
Safety Features:
- All classification is read-only (no data modification)
- Sensitive data masking before external processing
- Tenant-level opt-out policies respected throughout system
- Audit logging of all access to classified data
3. Query and Workload Monitoring
What It Does:
- Captures performance metrics for resource-intensive queries (CPU, duration, logical reads, physical reads, row counts)
- Stores execution plans for analysis and comparison
- Detects query performance regressions through statistical analysis
- Identifies parameter sniffing issues via plan hash variance
- Recommends missing indexes based on SQL Server DMV data
How It Works:
- Queries SQL Server DMVs (`sys.dm_exec_query_stats`, `sys.dm_exec_cached_plans`, `sys.dm_exec_sql_text`)
- Collects top N queries by resource consumption (configurable, default top 100)
- Establishes performance baselines over time
- Compares current execution metrics against baselines
- Flags queries with >50% performance degradation as regressions
Regression Detection:
- Statistical comparison: (current_metric - baseline_metric) / baseline_metric
- Confidence scoring based on sample size and variance
- Severity classification:
- Critical: >200% degradation
- High: >100% degradation
- Medium: >50% degradation
- Low: <50% degradation (tracked, but below the regression flagging threshold)
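The statistical comparison and severity tiers above translate directly into code. This sketch mirrors the documented formula and thresholds; the function name is illustrative.

```python
def regression_severity(baseline: float, current: float):
    """Compute percentage degradation against the baseline and classify
    it using the documented severity tiers."""
    # Documented formula: (current_metric - baseline_metric) / baseline_metric
    degradation = (current - baseline) / baseline * 100.0
    if degradation > 200:
        severity = "Critical"
    elif degradation > 100:
        severity = "High"
    elif degradation > 50:
        severity = "Medium"
    else:
        severity = "Low"
    return degradation, severity
```

For example, a query whose elapsed time rises from 150ms to 450ms shows 200% degradation and classifies as High under strict greater-than tier boundaries.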
Missing Index Detection:
- Reads `sys.dm_db_missing_index_details` and related DMVs
- Correlates recommendations with affected queries
- Calculates impact score (user seeks × average improvement)
- Calculates impact score (user seeks × average improvement)
- Provides index creation scripts
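The impact score and script generation above can be sketched as below. The scoring follows the documented "user seeks × average improvement" formula; the index naming convention and function signatures are illustrative assumptions.

```python
def missing_index_impact(user_seeks: int, avg_user_impact: float) -> float:
    """Impact score: user seeks x average estimated improvement,
    both as reported by the missing-index DMVs."""
    return user_seeks * avg_user_impact

def build_create_index(table: str, equality_cols: list,
                       inequality_cols: list, included_cols: list) -> str:
    """Sketch of an index creation script from missing-index DMV columns.
    Equality columns precede inequality columns in the key, per the
    usual missing-index guidance; naming convention is illustrative."""
    key_cols = equality_cols + inequality_cols
    name = "IX_" + table.split(".")[-1] + "_" + "_".join(key_cols)
    sql = f"CREATE INDEX {name} ON {table} ({', '.join(key_cols)})"
    if included_cols:
        sql += f" INCLUDE ({', '.join(included_cols)})"
    return sql
```

Running this on the example recommendation from later in this document produces a script such as `CREATE INDEX IX_Orders_CustomerId_OrderDate ON dbo.Orders (CustomerId, OrderDate) INCLUDE (TotalAmount)`.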
Parameter Sniffing Detection:
- Analyzes query hash stability (multiple plan hashes for same query hash)
- Measures execution time variance across plan changes
- Flags queries with high variance as potential parameter sniffing candidates
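The two signals above—multiple plan hashes for one query hash, plus high execution-time variance—can be combined as sketched here. The coefficient-of-variation threshold and input shape are illustrative assumptions.

```python
from statistics import mean, pstdev

def sniffing_candidate(executions: list, variance_threshold: float = 0.5) -> bool:
    """Flag a query as a potential parameter-sniffing candidate when it
    has multiple plan hashes AND high relative variance in elapsed time.
    executions: list of (plan_hash, elapsed_ms) tuples."""
    plan_hashes = {plan for plan, _ in executions}
    if len(plan_hashes) < 2:
        # A single stable plan cannot be a plan-variance problem
        return False
    times = [elapsed for _, elapsed in executions]
    # Coefficient of variation: spread relative to the mean
    cv = pstdev(times) / mean(times)
    return cv > variance_threshold
```

A query that runs in ~10ms under one plan and ~500ms under another produces a high coefficient of variation and is flagged for review.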
Data Collected:
- Query text and normalized form
- Execution plan XML
- Performance metrics (CPU time, elapsed duration, reads/writes, row counts)
- Execution counts and last execution timestamp
- Plan hash and query hash for tracking changes
4. Server, Instance, and Database Observability
Server-Level Metrics:
- CPU utilization (SQL Server process and total system)
- Memory utilization and available memory
- Disk I/O latency (read/write)
- Network I/O (packets sent/received, errors)
Instance-Level Metrics:
- Active connections and session counts
- Buffer cache hit ratio
- Page Life Expectancy (PLE)
- TempDB usage (data file and version store)
- Lock waits, timeouts, and deadlock counts
- Batch requests per second
- SQL compilations and recompilations
- Wait statistics (categorized by type: CPU, I/O, locking, memory, network)
Database-Level Metrics:
- Database size and growth trends
- Transaction log usage and growth
- Index fragmentation (percentage and page count)
- Statistics staleness (time since last update, modification counter)
Health Scoring:
- Composite score (0-100) calculated from weighted factors:
- Fragmentation (20% weight)
- Statistics freshness (20% weight)
- Blocking and deadlocks (25% weight)
- Query performance (20% weight)
- Index health (15% weight)
- Status classification:
- Excellent: 90-100
- Good: 70-89
- Fair: 50-69
- Poor: 30-49
- Critical: 0-29
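The composite health score above is a straightforward weighted sum. This sketch uses the documented weights and status bands; the per-factor score names are illustrative (the platform's internal factor names may differ).

```python
# Documented weights; they sum to 1.0
HEALTH_WEIGHTS = {
    "fragmentation": 0.20,
    "statistics_freshness": 0.20,
    "blocking_deadlocks": 0.25,
    "query_performance": 0.20,
    "index_health": 0.15,
}

def health_score(factor_scores: dict):
    """Weighted composite of per-factor scores (each 0-100), classified
    into the documented status bands."""
    score = sum(HEALTH_WEIGHTS[f] * factor_scores[f] for f in HEALTH_WEIGHTS)
    if score >= 90:
        status = "Excellent"
    elif score >= 70:
        status = "Good"
    elif score >= 50:
        status = "Fair"
    elif score >= 30:
        status = "Poor"
    else:
        status = "Critical"
    return score, status
```

The heavier weight on blocking and deadlocks reflects that contention tends to cause immediate user-visible impact, whereas fragmentation degrades gradually.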
Anomaly Detection:
- Identifies metric spikes (>50% increase from baseline)
- Detects resource exhaustion patterns
- Flags unexpected behavior (plan changes, configuration drift)
- Categorizes anomalies by type and severity
5. AI-Assisted Troubleshooting and Optimization
What AI Does:
- Analyzes query performance data and execution plans to generate optimization recommendations
- Suggests specific index creations or modifications based on missing index data and query patterns
- Recommends statistics updates when staleness contributes to poor performance
- Proposes query rewrites for inefficient patterns
- Provides natural language explanations for performance issues
What AI Does NOT Do:
- Execute any changes automatically (all recommendations require human approval)
- Access production databases directly (operates on collected metadata only)
- Make decisions without explainability (every recommendation includes reasoning and confidence score)
- Override human judgment (acts as advisor, not autonomous agent)
Safety Principles:
- Read-Only Data Collection: All database interactions are SELECT queries only; no INSERT, UPDATE, DELETE, or DDL operations on monitored databases
- Human-in-the-Loop: Every AI recommendation requires explicit approval before documentation; no automated execution
- Explainability: Each recommendation includes:
- Reasoning (why this change is suggested)
- Expected impact (estimated performance improvement)
- Confidence score (0-100, based on data quality and historical patterns)
- Risk assessment (potential side effects or dependencies)
- Audit Trail: Complete logging of all AI interactions in the `AIAuditLog` table:
- Input prompt and output response
- Model name and version
- Timestamp and requesting user
- Confidence score and safety validation results
- Token count and latency metrics
- Sensitive Data Masking: Pattern-based masking applied before AI analysis:
- SSN patterns (###-##-#### format)
- Credit card numbers (16 digits)
- Email addresses
- Phone numbers (various formats)
- Account numbers
- Opt-Out Enforcement: Tenants or databases can disable AI analysis entirely through configuration
- Multi-LLM Consensus (future): Support for comparing recommendations from multiple models to increase confidence
- Destructive Operation Detection: Recommendations involving DROP or DELETE operations automatically flagged for elevated review
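The pattern-based masking described under Sensitive Data Masking can be sketched as an ordered substitution pass. The exact patterns and replacement tokens here are illustrative assumptions, not the platform's actual rules.

```python
import re

# Illustrative (pattern, replacement) pairs applied in order before
# any text is handed to an LLM for analysis.
MASK_PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "***-**-****"),        # SSN ###-##-####
    (re.compile(r"\b\d{16}\b"), "****************"),              # 16-digit card number
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "<email>"),      # email address
    (re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"), "<phone>"),  # common phone formats
]

def mask_sensitive(text: str) -> str:
    """Apply pattern-based masking so sensitive values never leave
    the platform during AI analysis."""
    for pattern, replacement in MASK_PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```

Masking happens on the collected metadata before prompt construction, so opt-out tenants and unmasked values alike never reach an external model.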
Confidence Scoring:
- Based on historical data volume and quality
- Adjusts for query complexity and workload patterns
- Downgraded for operations identified as risky
- Includes statistical measures (sample size, variance)
Recommendation Workflow:
- System generates recommendation based on analysis
- Recommendation stored with "Pending" status
- DBA or authorized user reviews recommendation details
- User approves or rejects with optional comments
- Approval/rejection logged to audit trail with timestamp, user, and IP address
- Approved recommendations documented for implementation (platform does not execute)
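The workflow above amounts to a small state machine with an audit side effect. This sketch is illustrative; the class name, field names, and audit record shape are assumptions about a possible implementation.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Recommendation:
    """Minimal sketch of the human-in-the-loop approval workflow."""
    rec_id: int
    status: str = "Pending"
    audit: list = field(default_factory=list)

    def review(self, user: str, approved: bool,
               comment: str = "", ip: str = "") -> None:
        """Move Pending -> Approved/Rejected and log the decision.
        The platform documents approved changes; it never executes them."""
        if self.status != "Pending":
            raise ValueError("recommendation already reviewed")
        self.status = "Approved" if approved else "Rejected"
        self.audit.append({
            "user": user, "action": self.status, "comment": comment,
            "ip": ip, "timestamp": datetime.now(timezone.utc).isoformat(),
        })
```

Guarding the transition (only Pending recommendations may be reviewed) keeps the audit trail append-only: a decision can never be silently overwritten.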
6. Change Tracking and Notifications
Change Impact Analysis:
- When metadata changes are detected (index creation/deletion, table schema alterations), system analyzes impact on dependent objects
- Identifies queries that reference affected tables or columns
- Categorizes risk levels:
- Critical: >5 high-risk queries or >50 total affected queries
- High: >0 high-risk queries or >20 total affected queries
- Medium: >5 medium-risk queries or >10 total affected queries
- Low: >0 affected queries
- None: No affected queries
- Estimates performance impact (positive, negative, or neutral)
- Generates recommendations for testing or further analysis
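The risk tiers above can be evaluated from most to least severe (an assumption: first match wins, which resolves the overlap between tiers). Function and parameter names are illustrative.

```python
def change_risk(high_risk: int, medium_risk: int, total_affected: int) -> str:
    """Categorize the impact of a metadata change on dependent queries
    using the documented tiers, checked in descending severity."""
    if high_risk > 5 or total_affected > 50:
        return "Critical"
    if high_risk > 0 or total_affected > 20:
        return "High"
    if medium_risk > 5 or total_affected > 10:
        return "Medium"
    if total_affected > 0:
        return "Low"
    return "None"
```

Checking tiers top-down means, for example, that a change touching even one high-risk query is at least High, regardless of the total count.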
Change Event Types Tracked:
- Schema changes (table/column additions, modifications, deletions)
- Index changes (creation, alteration, rebuild, drop)
- Statistics updates (manual or automatic)
- Configuration changes (server or database settings)
- Query plan changes (plan hash modifications)
Notification Foundation:
- Event severity classification (Informational, Warning, Error, Critical)
- Alert service foundation for threshold-based notifications
- Configurable alert rules (extensible for email, SMS, webhook integrations)
- Event aggregation to prevent alert fatigue
- Correlation of related events for consolidated notifications
Intended Integration Points (infrastructure needed):
- Email notifications for critical events
- SMS alerts for high-severity issues
- Webhook integration for ITSM platforms (ServiceNow, PagerDuty)
- Slack/Teams integration for collaborative response
- Dashboard widgets for real-time event streams
Role of AI
How AI is Used
AI integration in Code Blue is designed as an advisory layer, not an autonomous agent. The platform uses Large Language Models (LLMs) to:
- Pattern Recognition: Analyze historical query performance data to identify optimization opportunities that may not be obvious from individual metrics
- Contextual Recommendations: Generate specific, actionable suggestions (e.g., "Create index on Orders(CustomerId, OrderDate) INCLUDE (TotalAmount)") rather than generic advice
- Natural Language Explanation: Translate technical metrics and execution plans into human-readable explanations of performance issues
- Root Cause Synthesis: Correlate multiple signals (query slowdown, index fragmentation, plan changes) to identify likely root causes
- Risk Assessment: Evaluate potential side effects of proposed changes based on workload patterns and schema relationships
AI Safety Architecture
The platform implements multiple safety layers:
Layer 1: Data Collection Isolation
- AI never accesses production databases directly
- Operates on metadata and metrics collected by read-only queries
- Sensitive data masked before AI processing (SSN, credit cards, PII)
Layer 2: Human Approval Gate
- All AI recommendations require explicit human approval
- No automatic execution of suggested changes
- Approval workflow includes review step with detailed explanation
Layer 3: Audit and Traceability
- Every AI interaction logged with full context
- Includes prompt, response, model version, confidence score, timestamp
- Audit trail supports compliance and incident investigation
- Logs retained according to organizational policy
Layer 4: Confidence Scoring
- Recommendations include confidence scores (0-100)
- Low confidence recommendations flagged for extra scrutiny
- Confidence based on data quality, sample size, historical accuracy
Layer 5: Opt-Out Controls
- Tenant-level or database-level AI opt-out supported
- Opt-out enforced throughout system (no AI analysis if disabled)
- Provides control for sensitive or regulated environments
What AI Does NOT Do
No Autonomous Execution:
- AI never executes DDL (CREATE, ALTER, DROP)
- AI never executes DML (INSERT, UPDATE, DELETE)
- AI never modifies configurations or settings
- All changes require human review and manual implementation
No Black-Box Decisions:
- Every recommendation includes reasoning
- Explanations reference specific metrics and patterns
- Users can validate recommendations against raw data
- No "trust the model" opaque outputs
No Direct Database Access:
- AI does not connect to monitored databases
- AI does not query production systems
- AI operates on pre-collected, sanitized metadata
No Unmonitored Behavior:
- All AI requests and responses logged
- Token usage and costs tracked
- Model version and parameters recorded
- No hidden or unaudited AI operations
Explainability Principles
- Evidence-Based: Every recommendation cites specific data points (e.g., "Query execution time increased from 150ms to 450ms, 200% regression")
- Actionable: Recommendations include concrete steps, not abstract suggestions
- Contextual: Explanations reference database-specific patterns and workload characteristics
- Confidence-Qualified: Uncertainty is explicit; low-confidence recommendations clearly marked
- Reversible: Recommendations include information about potential rollback or mitigation if change causes issues
Intended Users
Database Administrators (DBAs)
Primary Use Cases:
- Monitor query performance regressions in real time
- Identify and resolve blocking and deadlock issues
- Analyze execution plans for optimization opportunities
- Track index fragmentation and statistics staleness
- Review AI recommendations for index creation or query rewrites
- Investigate root causes of performance degradation
- Maintain audit trail of schema changes for compliance
Key Features:
- Health score dashboard for quick database status assessment
- Regression detection alerts for proactive issue resolution
- Missing index recommendations based on actual workload
- Query execution plan comparison over time
- Change impact analysis before implementing modifications
Data Engineers
Primary Use Cases:
- Optimize ETL job performance through query analysis
- Monitor data pipeline workload impact on database resources
- Track schema changes that affect data integration processes
- Ensure statistics are updated for query optimizer effectiveness
- Analyze wait statistics to identify bottlenecks in data processing
Key Features:
- Query performance snapshots showing resource-intensive operations
- Workload monitoring to understand peak usage patterns
- Historical performance trends for capacity planning
- Change tracking to correlate schema modifications with pipeline failures
Platform and Infrastructure Teams
Primary Use Cases:
- Monitor server and instance health across database fleet
- Track resource utilization trends for capacity planning
- Detect anomalies in CPU, memory, disk I/O, and network metrics
- Ensure database infrastructure meets SLAs
- Coordinate database maintenance windows based on workload patterns
Key Features:
- Server-level metrics (CPU, memory, disk I/O)
- Instance-level metrics (connections, buffer cache, TempDB usage, wait statistics)
- Anomaly detection for resource exhaustion or unusual behavior
- Multi-database observability with single pane of glass
- Historical trends for infrastructure planning
Compliance and Governance Teams
Primary Use Cases:
- Identify databases and tables containing sensitive data (PII, PHI, financial data)
- Track data classification for GDPR, HIPAA, SOC2, and other regulatory requirements
- Maintain audit trail of schema changes and data access patterns
- Verify data retention policies are enforced
- Generate compliance reports showing what data exists and where
Key Features:
- Automated data classification (SSN, credit cards, emails, account numbers)
- Metadata change tracking with append-only audit history
- AI audit logs showing all model interactions with database metadata
- Tenant-based opt-out controls for sensitive environments
- Change impact reports showing affected systems and queries
Design Principles
1. Read-Only Data Collection
Principle: All interactions with monitored databases are SELECT queries only. The platform never modifies data, schema, or configuration on monitored systems.
Implementation:
- Database connections use read-only accounts where possible
- Handler methods query DMVs and system catalogs with SELECT statements
- No INSERT, UPDATE, DELETE, or DDL operations on monitored databases
- Configuration changes and recommendations stored in platform's own database
Rationale:
- Eliminates risk of platform causing outages or data loss
- Ensures platform can be deployed in production without elevated privileges
- Enables deployment in regulated environments with strict change control
- Builds trust with DBAs who are cautious about monitoring tools
2. Append-Only History
Principle: All collected data is immutable once written. Historical records are never updated or deleted (except per data retention policies).
Implementation:
- Time-series data stored with timestamps
- Schema changes recorded as discrete events, not updates to existing records
- Query performance snapshots stored separately for each collection interval
- Audit logs are write-once records
Rationale:
- Enables accurate trend analysis without data distortion
- Supports forensic investigation of incidents ("What was the state at 2 AM on Tuesday?")
- Provides complete audit trail for compliance
- Prevents accidental or malicious tampering with historical data
3. Safety-First Automation
Principle: Automation is used for data collection and analysis, not for applying changes. All recommendations require human review and approval.
Implementation:
- Handlers execute automatically on schedules for data collection
- AI generates recommendations automatically based on analysis
- Recommendations stored with "Pending" status
- Approval workflow requires explicit human action before documentation
- No automatic execution of DDL, DML, or configuration changes
Rationale:
- Respects the critical nature of database operations
- Acknowledges that context and business requirements vary
- Provides value through intelligence without introducing risk
- Aligns with DBA workflows (review, test, implement)
4. Explainability Over Black-Box Decisions
Principle: Every insight, recommendation, and alert must be traceable to underlying data and logic. No opaque "AI magic."
Implementation:
- Regression detection shows baseline vs. current metrics with percentage change
- AI recommendations include reasoning section citing specific metrics
- Confidence scores based on measurable factors (sample size, data quality)
- Audit logs capture complete context for every AI interaction
- Users can drill down from recommendations to raw query performance data
Rationale:
- DBAs require justification before making changes to production systems
- Explainability builds trust in platform recommendations
- Enables validation of insights against other tools or manual analysis
- Supports learning and knowledge transfer within teams
- Required for regulated industries (financial services, healthcare) where black-box decisions are unacceptable
5. Multi-Tenant Isolation
Principle: Platform supports multiple tenants with strict data isolation. Tenants cannot access each other's data or insights.
Implementation:
- All domain entities include a `TenantId` field
- Database queries filter by `TenantId` automatically
- API endpoints validate tenant context from authentication token
- Schedules and handlers parameterized with tenant identifier
Rationale:
- Enables SaaS deployment model with shared infrastructure
- Meets security and compliance requirements for multi-tenant systems
- Prevents data leakage between organizations or business units
- Supports enterprise deployments with separate divisions or subsidiaries
6. Schedule-Driven Execution
Principle: All data collection and analysis tasks are defined as schedules, not hardcoded behaviors. Platform behavior is fully configurable.
Implementation:
- Schedules stored in the `dbo.Schedules` table with parameters
- Handlers are stateless and parameterized
- Execution intervals configurable per schedule
- Work distributed across nodes via `NodeID` assignment
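The schedule-driven model above can be sketched as a registry of stateless handlers dispatched from schedule rows. The `Schedule` fields, handler names, and registry mechanism here are illustrative assumptions about the design, not the platform's actual code.

```python
from dataclasses import dataclass

@dataclass
class Schedule:
    """Mirrors the rough shape of a dbo.Schedules row (illustrative)."""
    schedule_id: int
    handler_name: str
    parameters: dict
    interval_seconds: int
    node_id: str

# Handler registry: stateless functions keyed by name, so behavior is
# configured in data rather than hardcoded.
HANDLERS = {}

def handler(name: str):
    def register(fn):
        HANDLERS[name] = fn
        return fn
    return register

@handler("CollectQueryStats")
def collect_query_stats(params: dict) -> str:
    # A real handler would run read-only DMV queries; stubbed here.
    return f"collected top {params.get('top_n', 100)} queries for tenant {params['tenant_id']}"

def run_assigned_schedules(schedules: list, node_id: str) -> list:
    """Each node executes only the schedules assigned to it via NodeID;
    nodes coordinate through the database, not with each other."""
    return [HANDLERS[s.handler_name](s.parameters)
            for s in schedules if s.node_id == node_id]
```

Because handlers hold no state and take all context from parameters, intervals and targets can change per schedule row without code changes, and a schedule can be reassigned to another node after a failure.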
Rationale:
- Enables customization without code changes
- Supports different collection intervals per database or tenant
- Facilitates testing and gradual rollout of new features
- Allows load balancing across distributed nodes
7. Distributed and Fault-Tolerant Architecture
Principle: System operates as distributed fabric of independent nodes with no single point of failure. Node failures do not cause data loss or system-wide outages.
Implementation:
- Multiple worker nodes execute schedules independently
- Nodes coordinate through database, not direct communication
- Schedule state persists in database (work survives node crashes)
- Failed schedules can be reassigned to healthy nodes manually
Rationale:
- Supports horizontal scaling for large database fleets
- Enables hybrid deployment (on-premises and cloud nodes)
- Provides resilience against node failures
- Allows maintenance without service interruption
Deployment and Scalability
Code Blue's node-based architecture supports multiple deployment models:
On-Premises Deployment:
- Worker service installed on Windows Server or Linux
- Connects to central SQL Server database for coordination
- Monitors local database instances with minimal network latency
Cloud Deployment:
- Worker services deployed as Azure App Services, AWS ECS tasks, or GCP Cloud Run services
- Central database as Azure SQL, RDS SQL Server, or Cloud SQL
- Monitors cloud-hosted databases with cloud-native integrations
Container Deployment:
- Worker service packaged as Docker container
- Orchestrated via Kubernetes, Docker Swarm, or ECS
- Horizontal scaling through replica sets
- Resource isolation with CPU and memory limits
Hybrid Deployment:
- On-premises nodes monitor local databases
- Cloud nodes monitor cloud databases
- All nodes write to single central repository (on-premises or cloud)
- Unified view across hybrid infrastructure
Scaling Model:
- Add nodes by deploying the Worker service with a unique `NodeID`
- Assign schedules to nodes via database configuration
- No coordination overhead or inter-node communication
- Linear scalability limited only by database connection pool
Conclusion
Code Blue is a database intelligence platform that transforms raw telemetry into actionable insights through automated collection, intelligent analysis, and AI-assisted recommendations. It maintains strict read-only access, human-in-the-loop approval workflows, and complete audit trails to provide value while minimizing risk.
The platform serves DBAs, data engineers, infrastructure teams, and compliance officers with targeted capabilities for performance optimization, anomaly detection, change tracking, and data governance. Its design prioritizes explainability, safety, and operational simplicity, making it suitable for production deployment in enterprise environments.
By focusing on intelligence rather than simple monitoring, Code Blue enables proactive database management: detecting issues before they escalate, recommending optimizations based on evidence, and maintaining historical context for forensic analysis and compliance reporting.