Code Blue Database Intelligence Platform
I built an AI-powered database intelligence platform that transforms raw telemetry into actionable insights, providing automated performance analysis, regression detection, and optimization recommendations with full explainability and human-in-the-loop controls.
At a glance
- Detects regressions before user impact
- AI recommendations with explainability
- GDPR-compliant data classification
Code Blue Platform Documentation
Platform Purpose
Problem Statement
Database management teams face a critical challenge: understanding what is happening inside their databases in real time, detecting performance degradation before it impacts users, and maintaining compliance with data governance requirements. Traditional monitoring tools provide metrics but lack intelligence—they tell you what is happening, but not why it matters or what to do about it.
Code Blue addresses this gap by providing database intelligence, not just monitoring. It collects, correlates, and analyzes database telemetry to provide actionable insights backed by evidence and explanation.
Why This Platform Exists
Modern database environments generate massive amounts of metadata, performance metrics, and execution data. Without intelligent analysis, this data becomes noise rather than signal. Code Blue exists to:
- Transform raw database telemetry into actionable intelligence through automated analysis and pattern recognition
- Detect performance regressions and anomalies before they escalate into production incidents
- Maintain historical context for all database changes, enabling impact analysis and trend identification
- Enforce data governance and classification by tracking sensitive data and schema changes
- Provide AI-assisted recommendations while maintaining human oversight and explainability
Difference from Simple Monitoring
Traditional monitoring tools:
- Collect point-in-time metrics (CPU, memory, query duration)
- Alert when thresholds are exceeded
- Provide dashboards showing current state
- Require manual correlation and analysis
Code Blue provides database intelligence through:
- Historical baselines and trend analysis: Compare current performance against established patterns to detect meaningful deviations
- Root cause correlation: Link performance issues to specific changes (schema modifications, index changes, query plan alterations)
- Predictive detection: Identify emerging problems through statistical regression analysis before they become critical
- Contextual recommendations: Provide optimization suggestions tied to specific issues with confidence scoring
- Change impact assessment: Analyze how metadata changes affect dependent queries and workloads
- Automated data classification: Identify and track sensitive data for GDPR and compliance requirements
Core Capabilities
1. Metadata Intelligence and Change Tracking
What It Does:
- Captures complete schema metadata (tables, columns, indexes, statistics, stored procedures, functions, views)
- Records all schema changes with timestamps and before/after states
- Tracks index modifications (creation, deletion, rebuild, reorganization)
- Monitors statistics updates and staleness
- Maintains append-only change history for full audit trail
How It Works:
- Connects to SQL Server system catalog views (`sys.tables`, `sys.indexes`, `sys.stats`, etc.)
- Periodically snapshots current metadata state
- Compares against previous snapshots to detect changes
- Stores change events with complete context (database, schema, object type, tenant)
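The snapshot-and-compare step above can be sketched as a dictionary diff. This is an illustrative sketch only; the function name and event shape are assumptions, and the real platform records far richer context (database, schema, object type, tenant).

```python
def diff_snapshots(previous: dict, current: dict) -> list:
    """Compare two metadata snapshots (object name -> definition dict)
    and emit discrete change events. Illustrative sketch only."""
    events = []
    # Objects present now but not before: creates
    for name in current.keys() - previous.keys():
        events.append({"object": name, "change": "create"})
    # Objects present before but not now: drops
    for name in previous.keys() - current.keys():
        events.append({"object": name, "change": "drop"})
    # Objects in both snapshots whose definition changed: alters,
    # stored with before/after state for the audit trail
    for name in previous.keys() & current.keys():
        if previous[name] != current[name]:
            events.append({"object": name, "change": "alter",
                           "before": previous[name], "after": current[name]})
    return events
```

Because events are emitted as discrete records rather than updates, they append naturally to the platform's immutable change history.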
Use Cases:
- Audit trail for compliance requirements
- Root cause analysis (correlate performance issues with recent schema changes)
- Change impact assessment (identify queries affected by index or table modifications)
- Rollback planning (understand what changed and when)
Data Collected:
- Object metadata (names, types, definitions)
- Change events (create, alter, drop)
- Index properties (columns, includes, filters, fragmentation)
- Statistics metadata (last update, modification counter, row count)
2. Data Classification and Governance
What It Does:
- Identifies columns containing sensitive data types (SSN, credit cards, emails, phone numbers, account numbers)
- Classifies data according to sensitivity levels for GDPR and regulatory compliance
- Tracks data lineage and usage patterns
- Provides opt-out enforcement for AI analysis per database or tenant
How It Works:
- Pattern-based detection using regular expressions on sample data
- Classification metadata stored per column
- Integration with AI audit layer to mask sensitive data before LLM analysis
- Tenant-based policies control whether data is included in AI recommendations
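The pattern-based detection described above can be sketched as follows. The detector patterns, the match-fraction threshold, and the function name are illustrative assumptions, not the platform's actual classification rules.

```python
import re

# Illustrative detectors keyed by sensitivity label; the real platform
# supports more types (phone numbers, account numbers, etc.).
DETECTORS = {
    "SSN": re.compile(r"\d{3}-\d{2}-\d{4}"),
    "Email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "CreditCard": re.compile(r"\d{16}"),
}

def classify_column(samples: list, threshold: float = 0.8):
    """Label a column as sensitive when most sampled values match a
    pattern. Read-only: operates on sampled values, never modifies data."""
    for label, pattern in DETECTORS.items():
        hits = sum(1 for value in samples if pattern.fullmatch(value))
        if samples and hits / len(samples) >= threshold:
            return label
    return None
```

The threshold guards against false positives from a few coincidental matches; the resulting label is stored as classification metadata per column.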
GDPR Relevance:
- Identifies personally identifiable information (PII) automatically
- Tracks where sensitive data resides across databases
- Enables data subject access requests (DSAR) by showing what data exists and where
- Supports right to deletion by identifying all locations of personal data
- Provides audit trail of data access and processing
Safety Features:
- All classification is read-only (no data modification)
- Sensitive data masking before external processing
- Tenant-level opt-out policies respected throughout system
- Audit logging of all access to classified data
3. Query and Workload Monitoring
What It Does:
- Captures performance metrics for resource-intensive queries (CPU, duration, logical reads, physical reads, row counts)
- Stores execution plans for analysis and comparison
- Detects query performance regressions through statistical analysis
- Identifies parameter sniffing issues via plan hash variance
- Recommends missing indexes based on SQL Server DMV data
How It Works:
- Queries SQL Server DMVs (`sys.dm_exec_query_stats`, `sys.dm_exec_cached_plans`, `sys.dm_exec_sql_text`)
- Collects top N queries by resource consumption (configurable, default top 100)
- Establishes performance baselines over time
- Compares current execution metrics against baselines
- Flags queries with >50% performance degradation as regressions
Regression Detection:
- Statistical comparison: (current_metric - baseline_metric) / baseline_metric
- Confidence scoring based on sample size and variance
- Severity classification:
- Critical: >200% degradation
- High: >100% degradation
- Medium: >50% degradation
- Low: <50% degradation (tracked, but below the regression flagging threshold)
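The statistical comparison and severity tiers above translate directly into code. This sketch mirrors the documented formula and thresholds; the function name is illustrative.

```python
def regression_severity(baseline: float, current: float):
    """Compute percentage degradation against the baseline and classify
    it using the documented severity tiers."""
    # Documented formula: (current_metric - baseline_metric) / baseline_metric
    degradation = (current - baseline) / baseline * 100.0
    if degradation > 200:
        severity = "Critical"
    elif degradation > 100:
        severity = "High"
    elif degradation > 50:
        severity = "Medium"
    else:
        severity = "Low"
    return degradation, severity
```

For example, a query whose elapsed time rises from 150ms to 450ms shows 200% degradation and classifies as High under strict greater-than tier boundaries.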
Missing Index Detection:
- Reads `sys.dm_db_missing_index_details` and related DMVs
- Correlates recommendations with affected queries
- Calculates impact score (user seeks × average improvement)
- Calculates impact score (user seeks × average improvement)
- Provides index creation scripts
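The impact score and script generation above can be sketched as below. The scoring follows the documented "user seeks × average improvement" formula; the index naming convention and function signatures are illustrative assumptions.

```python
def missing_index_impact(user_seeks: int, avg_user_impact: float) -> float:
    """Impact score: user seeks x average estimated improvement,
    both as reported by the missing-index DMVs."""
    return user_seeks * avg_user_impact

def build_create_index(table: str, equality_cols: list,
                       inequality_cols: list, included_cols: list) -> str:
    """Sketch of an index creation script from missing-index DMV columns.
    Equality columns precede inequality columns in the key, per the
    usual missing-index guidance; naming convention is illustrative."""
    key_cols = equality_cols + inequality_cols
    name = "IX_" + table.split(".")[-1] + "_" + "_".join(key_cols)
    sql = f"CREATE INDEX {name} ON {table} ({', '.join(key_cols)})"
    if included_cols:
        sql += f" INCLUDE ({', '.join(included_cols)})"
    return sql
```

Running this on the example recommendation from later in this document produces a script such as `CREATE INDEX IX_Orders_CustomerId_OrderDate ON dbo.Orders (CustomerId, OrderDate) INCLUDE (TotalAmount)`.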
Parameter Sniffing Detection:
- Analyzes query hash stability (multiple plan hashes for same query hash)
- Measures execution time variance across plan changes
- Flags queries with high variance as potential parameter sniffing candidates
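The two signals above—multiple plan hashes for one query hash, plus high execution-time variance—can be combined as sketched here. The coefficient-of-variation threshold and input shape are illustrative assumptions.

```python
from statistics import mean, pstdev

def sniffing_candidate(executions: list, variance_threshold: float = 0.5) -> bool:
    """Flag a query as a potential parameter-sniffing candidate when it
    has multiple plan hashes AND high relative variance in elapsed time.
    executions: list of (plan_hash, elapsed_ms) tuples."""
    plan_hashes = {plan for plan, _ in executions}
    if len(plan_hashes) < 2:
        # A single stable plan cannot be a plan-variance problem
        return False
    times = [elapsed for _, elapsed in executions]
    # Coefficient of variation: spread relative to the mean
    cv = pstdev(times) / mean(times)
    return cv > variance_threshold
```

A query that runs in ~10ms under one plan and ~500ms under another produces a high coefficient of variation and is flagged for review.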
Data Collected:
- Query text and normalized form
- Execution plan XML
- Performance metrics (CPU time, elapsed duration, reads/writes, row counts)
- Execution counts and last execution timestamp
- Plan hash and query hash for tracking changes
4. Server, Instance, and Database Observability
Server-Level Metrics:
- CPU utilization (SQL Server process and total system)
- Memory utilization and available memory
- Disk I/O latency (read/write)
- Network I/O (packets sent/received, errors)
Instance-Level Metrics:
- Active connections and session counts
- Buffer cache hit ratio
- Page Life Expectancy (PLE)
- TempDB usage (data file and version store)
- Lock waits, timeouts, and deadlock counts
- Batch requests per second
- SQL compilations and recompilations
- Wait statistics (categorized by type: CPU, I/O, locking, memory, network)
Database-Level Metrics:
- Database size and growth trends
- Transaction log usage and growth
- Index fragmentation (percentage and page count)
- Statistics staleness (time since last update, modification counter)
Health Scoring:
- Composite score (0-100) calculated from weighted factors:
- Fragmentation (20% weight)
- Statistics freshness (20% weight)
- Blocking and deadlocks (25% weight)
- Query performance (20% weight)
- Index health (15% weight)
- Status classification:
- Excellent: 90-100
- Good: 70-89
- Fair: 50-69
- Poor: 30-49
- Critical: 0-29
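The composite health score above is a straightforward weighted sum. This sketch uses the documented weights and status bands; the per-factor score names are illustrative (the platform's internal factor names may differ).

```python
# Documented weights; they sum to 1.0
HEALTH_WEIGHTS = {
    "fragmentation": 0.20,
    "statistics_freshness": 0.20,
    "blocking_deadlocks": 0.25,
    "query_performance": 0.20,
    "index_health": 0.15,
}

def health_score(factor_scores: dict):
    """Weighted composite of per-factor scores (each 0-100), classified
    into the documented status bands."""
    score = sum(HEALTH_WEIGHTS[f] * factor_scores[f] for f in HEALTH_WEIGHTS)
    if score >= 90:
        status = "Excellent"
    elif score >= 70:
        status = "Good"
    elif score >= 50:
        status = "Fair"
    elif score >= 30:
        status = "Poor"
    else:
        status = "Critical"
    return score, status
```

The heavier weight on blocking and deadlocks reflects that contention tends to cause immediate user-visible impact, whereas fragmentation degrades gradually.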
Anomaly Detection:
- Identifies metric spikes (>50% increase from baseline)
- Detects resource exhaustion patterns
- Flags unexpected behavior (plan changes, configuration drift)
- Categorizes anomalies by type and severity
5. AI-Assisted Troubleshooting and Optimization
What AI Does:
- Analyzes query performance data and execution plans to generate optimization recommendations
- Suggests specific index creations or modifications based on missing index data and query patterns
- Recommends statistics updates when staleness contributes to poor performance
- Proposes query rewrites for inefficient patterns
- Provides natural language explanations for performance issues
What AI Does NOT Do:
- Execute any changes automatically (all recommendations require human approval)
- Access production databases directly (operates on collected metadata only)
- Make decisions without explainability (every recommendation includes reasoning and confidence score)
- Override human judgment (acts as advisor, not autonomous agent)
Safety Principles:
- Read-Only Data Collection: All database interactions are SELECT queries only; no INSERT, UPDATE, DELETE, or DDL operations on monitored databases
- Human-in-the-Loop: Every AI recommendation requires explicit approval before documentation; no automated execution
- Explainability: Each recommendation includes:
- Reasoning (why this change is suggested)
- Expected impact (estimated performance improvement)
- Confidence score (0-100, based on data quality and historical patterns)
- Risk assessment (potential side effects or dependencies)
- Audit Trail: Complete logging of all AI interactions in the `AIAuditLog` table:
- Input prompt and output response
- Model name and version
- Timestamp and requesting user
- Confidence score and safety validation results
- Token count and latency metrics
- Sensitive Data Masking: Pattern-based masking applied before AI analysis:
- SSN patterns (###-##-#### format)
- Credit card numbers (16 digits)
- Email addresses
- Phone numbers (various formats)
- Account numbers
- Opt-Out Enforcement: Tenants or databases can disable AI analysis entirely through configuration
- Multi-LLM Consensus (future): Support for comparing recommendations from multiple models to increase confidence
- Destructive Operation Detection: Recommendations involving DROP or DELETE operations automatically flagged for elevated review
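The pattern-based masking described under Sensitive Data Masking can be sketched as an ordered substitution pass. The exact patterns and replacement tokens here are illustrative assumptions, not the platform's actual rules.

```python
import re

# Illustrative (pattern, replacement) pairs applied in order before
# any text is handed to an LLM for analysis.
MASK_PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "***-**-****"),        # SSN ###-##-####
    (re.compile(r"\b\d{16}\b"), "****************"),              # 16-digit card number
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "<email>"),      # email address
    (re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"), "<phone>"),  # common phone formats
]

def mask_sensitive(text: str) -> str:
    """Apply pattern-based masking so sensitive values never leave
    the platform during AI analysis."""
    for pattern, replacement in MASK_PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```

Masking happens on the collected metadata before prompt construction, so opt-out tenants and unmasked values alike never reach an external model.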
Confidence Scoring:
- Based on historical data volume and quality
- Adjusts for query complexity and workload patterns
- Downgraded for operations identified as risky
- Includes statistical measures (sample size, variance)
Recommendation Workflow:
- System generates recommendation based on analysis
- Recommendation stored with "Pending" status
- DBA or authorized user reviews recommendation details
- User approves or rejects with optional comments
- Approval/rejection logged to audit trail with timestamp, user, and IP address
- Approved recommendations documented for implementation (platform does not execute)
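The workflow above amounts to a small state machine with an audit side effect. This sketch is illustrative; the class name, field names, and audit record shape are assumptions about a possible implementation.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Recommendation:
    """Minimal sketch of the human-in-the-loop approval workflow."""
    rec_id: int
    status: str = "Pending"
    audit: list = field(default_factory=list)

    def review(self, user: str, approved: bool,
               comment: str = "", ip: str = "") -> None:
        """Move Pending -> Approved/Rejected and log the decision.
        The platform documents approved changes; it never executes them."""
        if self.status != "Pending":
            raise ValueError("recommendation already reviewed")
        self.status = "Approved" if approved else "Rejected"
        self.audit.append({
            "user": user, "action": self.status, "comment": comment,
            "ip": ip, "timestamp": datetime.now(timezone.utc).isoformat(),
        })
```

Guarding the transition (only Pending recommendations may be reviewed) keeps the audit trail append-only: a decision can never be silently overwritten.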
6. Change Tracking and Notifications
Change Impact Analysis:
- When metadata changes are detected (index creation/deletion, table schema alterations), system analyzes impact on dependent objects
- Identifies queries that reference affected tables or columns
- Categorizes risk levels:
- Critical: >5 high-risk queries or >50 total affected queries
- High: >0 high-risk queries or >20 total affected queries
- Medium: >5 medium-risk queries or >10 total affected queries
- Low: >0 affected queries
- None: No affected queries
- Estimates performance impact (positive, negative, or neutral)
- Generates recommendations for testing or further analysis
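The risk tiers above can be evaluated from most to least severe (an assumption: first match wins, which resolves the overlap between tiers). Function and parameter names are illustrative.

```python
def change_risk(high_risk: int, medium_risk: int, total_affected: int) -> str:
    """Categorize the impact of a metadata change on dependent queries
    using the documented tiers, checked in descending severity."""
    if high_risk > 5 or total_affected > 50:
        return "Critical"
    if high_risk > 0 or total_affected > 20:
        return "High"
    if medium_risk > 5 or total_affected > 10:
        return "Medium"
    if total_affected > 0:
        return "Low"
    return "None"
```

Checking tiers top-down means, for example, that a change touching even one high-risk query is at least High, regardless of the total count.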
Change Event Types Tracked:
- Schema changes (table/column additions, modifications, deletions)
- Index changes (creation, alteration, rebuild, drop)
- Statistics updates (manual or automatic)
- Configuration changes (server or database settings)
- Query plan changes (plan hash modifications)
Notification Foundation:
- Event severity classification (Informational, Warning, Error, Critical)
- Alert service foundation for threshold-based notifications
- Configurable alert rules (extensible for email, SMS, webhook integrations)
- Event aggregation to prevent alert fatigue
- Correlation of related events for consolidated notifications
Intended Integration Points (infrastructure needed):
- Email notifications for critical events
- SMS alerts for high-severity issues
- Webhook integration for ITSM platforms (ServiceNow, PagerDuty)
- Slack/Teams integration for collaborative response
- Dashboard widgets for real-time event streams
Role of AI
How AI is Used
AI integration in Code Blue is designed as an advisory layer, not an autonomous agent. The platform uses Large Language Models (LLMs) to:
- Pattern Recognition: Analyze historical query performance data to identify optimization opportunities that may not be obvious from individual metrics
- Contextual Recommendations: Generate specific, actionable suggestions (e.g., "Create index on Orders(CustomerId, OrderDate) INCLUDE (TotalAmount)") rather than generic advice
- Natural Language Explanation: Translate technical metrics and execution plans into human-readable explanations of performance issues
- Root Cause Synthesis: Correlate multiple signals (query slowdown, index fragmentation, plan changes) to identify likely root causes
- Risk Assessment: Evaluate potential side effects of proposed changes based on workload patterns and schema relationships
AI Safety Architecture
The platform implements multiple safety layers:
Layer 1: Data Collection Isolation
- AI never accesses production databases directly
- Operates on metadata and metrics collected by read-only queries
- Sensitive data masked before AI processing (SSN, credit cards, PII)
Layer 2: Human Approval Gate
- All AI recommendations require explicit human approval
- No automatic execution of suggested changes
- Approval workflow includes review step with detailed explanation
Layer 3: Audit and Traceability
- Every AI interaction logged with full context
- Includes prompt, response, model version, confidence score, timestamp
- Audit trail supports compliance and incident investigation
- Logs retained according to organizational policy
Layer 4: Confidence Scoring
- Recommendations include confidence scores (0-100)
- Low confidence recommendations flagged for extra scrutiny
- Confidence based on data quality, sample size, historical accuracy
Layer 5: Opt-Out Controls
- Tenant-level or database-level AI opt-out supported
- Opt-out enforced throughout system (no AI analysis if disabled)
- Provides control for sensitive or regulated environments
What AI Does NOT Do
No Autonomous Execution:
- AI never executes DDL (CREATE, ALTER, DROP)
- AI never executes DML (INSERT, UPDATE, DELETE)
- AI never modifies configurations or settings
- All changes require human review and manual implementation
No Black-Box Decisions:
- Every recommendation includes reasoning
- Explanations reference specific metrics and patterns
- Users can validate recommendations against raw data
- No "trust the model" opaque outputs
No Direct Database Access:
- AI does not connect to monitored databases
- AI does not query production systems
- AI operates on pre-collected, sanitized metadata
No Unmonitored Behavior:
- All AI requests and responses logged
- Token usage and costs tracked
- Model version and parameters recorded
- No hidden or unaudited AI operations
Explainability Principles
- Evidence-Based: Every recommendation cites specific data points (e.g., "Query execution time increased from 150ms to 450ms, 200% regression")
- Actionable: Recommendations include concrete steps, not abstract suggestions
- Contextual: Explanations reference database-specific patterns and workload characteristics
- Confidence-Qualified: Uncertainty is explicit; low-confidence recommendations clearly marked
- Reversible: Recommendations include information about potential rollback or mitigation if change causes issues
Intended Users
Database Administrators (DBAs)
Primary Use Cases:
- Monitor query performance regressions in real time
- Identify and resolve blocking and deadlock issues
- Analyze execution plans for optimization opportunities
- Track index fragmentation and statistics staleness
- Review AI recommendations for index creation or query rewrites
- Investigate root causes of performance degradation
- Maintain audit trail of schema changes for compliance
Key Features:
- Health score dashboard for quick database status assessment
- Regression detection alerts for proactive issue resolution
- Missing index recommendations based on actual workload
- Query execution plan comparison over time
- Change impact analysis before implementing modifications
Data Engineers
Primary Use Cases:
- Optimize ETL job performance through query analysis
- Monitor data pipeline workload impact on database resources
- Track schema changes that affect data integration processes
- Ensure statistics are updated for query optimizer effectiveness
- Analyze wait statistics to identify bottlenecks in data processing
Key Features:
- Query performance snapshots showing resource-intensive operations
- Workload monitoring to understand peak usage patterns
- Historical performance trends for capacity planning
- Change tracking to correlate schema modifications with pipeline failures
Platform and Infrastructure Teams
Primary Use Cases:
- Monitor server and instance health across database fleet
- Track resource utilization trends for capacity planning
- Detect anomalies in CPU, memory, disk I/O, and network metrics
- Ensure database infrastructure meets SLAs
- Coordinate database maintenance windows based on workload patterns
Key Features:
- Server-level metrics (CPU, memory, disk I/O)
- Instance-level metrics (connections, buffer cache, TempDB usage, wait statistics)
- Anomaly detection for resource exhaustion or unusual behavior
- Multi-database observability with single pane of glass
- Historical trends for infrastructure planning
Compliance and Governance Teams
Primary Use Cases:
- Identify databases and tables containing sensitive data (PII, PHI, financial data)
- Track data classification for GDPR, HIPAA, SOC2, and other regulatory requirements
- Maintain audit trail of schema changes and data access patterns
- Verify data retention policies are enforced
- Generate compliance reports showing what data exists and where
Key Features:
- Automated data classification (SSN, credit cards, emails, account numbers)
- Metadata change tracking with append-only audit history
- AI audit logs showing all model interactions with database metadata
- Tenant-based opt-out controls for sensitive environments
- Change impact reports showing affected systems and queries
Design Principles
1. Read-Only Data Collection
Principle: All interactions with monitored databases are SELECT queries only. The platform never modifies data, schema, or configuration on monitored systems.
Implementation:
- Database connections use read-only accounts where possible
- Handler methods query DMVs and system catalogs with SELECT statements
- No INSERT, UPDATE, DELETE, or DDL operations on monitored databases
- Configuration changes and recommendations stored in platform's own database
Rationale:
- Eliminates risk of platform causing outages or data loss
- Ensures platform can be deployed in production without elevated privileges
- Enables deployment in regulated environments with strict change control
- Builds trust with DBAs who are cautious about monitoring tools
2. Append-Only History
Principle: All collected data is immutable once written. Historical records are never updated or deleted (except per data retention policies).
Implementation:
- Time-series data stored with timestamps
- Schema changes recorded as discrete events, not updates to existing records
- Query performance snapshots stored separately for each collection interval
- Audit logs are write-once records
Rationale:
- Enables accurate trend analysis without data distortion
- Supports forensic investigation of incidents ("What was the state at 2 AM on Tuesday?")
- Provides complete audit trail for compliance
- Prevents accidental or malicious tampering with historical data
3. Safety-First Automation
Principle: Automation is used for data collection and analysis, not for applying changes. All recommendations require human review and approval.
Implementation:
- Handlers execute automatically on schedules for data collection
- AI generates recommendations automatically based on analysis
- Recommendations stored with "Pending" status
- Approval workflow requires explicit human action before documentation
- No automatic execution of DDL, DML, or configuration changes
Rationale:
- Respects the critical nature of database operations
- Acknowledges that context and business requirements vary
- Provides value through intelligence without introducing risk
- Aligns with DBA workflows (review, test, implement)
4. Explainability Over Black-Box Decisions
Principle: Every insight, recommendation, and alert must be traceable to underlying data and logic. No opaque "AI magic."
Implementation:
- Regression detection shows baseline vs. current metrics with percentage change
- AI recommendations include reasoning section citing specific metrics
- Confidence scores based on measurable factors (sample size, data quality)
- Audit logs capture complete context for every AI interaction
- Users can drill down from recommendations to raw query performance data
Rationale:
- DBAs require justification before making changes to production systems
- Explainability builds trust in platform recommendations
- Enables validation of insights against other tools or manual analysis
- Supports learning and knowledge transfer within teams
- Required for regulated industries (financial services, healthcare) where black-box decisions are unacceptable
5. Multi-Tenant Isolation
Principle: Platform supports multiple tenants with strict data isolation. Tenants cannot access each other's data or insights.
Implementation:
- All domain entities include a `TenantId` field
- Database queries filter by `TenantId` automatically
- API endpoints validate tenant context from authentication token
- Schedules and handlers parameterized with tenant identifier
Rationale:
- Enables SaaS deployment model with shared infrastructure
- Meets security and compliance requirements for multi-tenant systems
- Prevents data leakage between organizations or business units
- Supports enterprise deployments with separate divisions or subsidiaries
6. Schedule-Driven Execution
Principle: All data collection and analysis tasks are defined as schedules, not hardcoded behaviors. Platform behavior is fully configurable.
Implementation:
- Schedules stored in the `dbo.Schedules` table with parameters
- Handlers are stateless and parameterized
- Execution intervals configurable per schedule
- Work distributed across nodes via `NodeID` assignment
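The schedule-driven model above can be sketched as a registry of stateless handlers dispatched from schedule rows. The `Schedule` fields, handler names, and registry mechanism here are illustrative assumptions about the design, not the platform's actual code.

```python
from dataclasses import dataclass

@dataclass
class Schedule:
    """Mirrors the rough shape of a dbo.Schedules row (illustrative)."""
    schedule_id: int
    handler_name: str
    parameters: dict
    interval_seconds: int
    node_id: str

# Handler registry: stateless functions keyed by name, so behavior is
# configured in data rather than hardcoded.
HANDLERS = {}

def handler(name: str):
    def register(fn):
        HANDLERS[name] = fn
        return fn
    return register

@handler("CollectQueryStats")
def collect_query_stats(params: dict) -> str:
    # A real handler would run read-only DMV queries; stubbed here.
    return f"collected top {params.get('top_n', 100)} queries for tenant {params['tenant_id']}"

def run_assigned_schedules(schedules: list, node_id: str) -> list:
    """Each node executes only the schedules assigned to it via NodeID;
    nodes coordinate through the database, not with each other."""
    return [HANDLERS[s.handler_name](s.parameters)
            for s in schedules if s.node_id == node_id]
```

Because handlers hold no state and take all context from parameters, intervals and targets can change per schedule row without code changes, and a schedule can be reassigned to another node after a failure.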
Rationale:
- Enables customization without code changes
- Supports different collection intervals per database or tenant
- Facilitates testing and gradual rollout of new features
- Allows load balancing across distributed nodes
7. Distributed and Fault-Tolerant Architecture
Principle: System operates as distributed fabric of independent nodes with no single point of failure. Node failures do not cause data loss or system-wide outages.
Implementation:
- Multiple worker nodes execute schedules independently
- Nodes coordinate through database, not direct communication
- Schedule state persists in database (work survives node crashes)
- Failed schedules can be reassigned to healthy nodes manually
Rationale:
- Supports horizontal scaling for large database fleets
- Enables hybrid deployment (on-premises and cloud nodes)
- Provides resilience against node failures
- Allows maintenance without service interruption
Deployment and Scalability
Code Blue's node-based architecture supports multiple deployment models:
On-Premises Deployment:
- Worker service installed on Windows Server or Linux
- Connects to central SQL Server database for coordination
- Monitors local database instances with minimal network latency
Cloud Deployment:
- Worker services deployed as Azure App Services, AWS ECS tasks, or GCP Cloud Run services
- Central database as Azure SQL, RDS SQL Server, or Cloud SQL
- Monitors cloud-hosted databases with cloud-native integrations
Container Deployment:
- Worker service packaged as Docker container
- Orchestrated via Kubernetes, Docker Swarm, or ECS
- Horizontal scaling through replica sets
- Resource isolation with CPU and memory limits
Hybrid Deployment:
- On-premises nodes monitor local databases
- Cloud nodes monitor cloud databases
- All nodes write to single central repository (on-premises or cloud)
- Unified view across hybrid infrastructure
Scaling Model:
- Add nodes by deploying the Worker service with a unique `NodeID`
- Assign schedules to nodes via database configuration
- No coordination overhead or inter-node communication
- Linear scalability limited only by database connection pool
Conclusion
Code Blue is a database intelligence platform that transforms raw telemetry into actionable insights through automated collection, intelligent analysis, and AI-assisted recommendations. It maintains strict read-only access, human-in-the-loop approval workflows, and complete audit trails to provide value while minimizing risk.
The platform serves DBAs, data engineers, infrastructure teams, and compliance officers with targeted capabilities for performance optimization, anomaly detection, change tracking, and data governance. Its design prioritizes explainability, safety, and operational simplicity, making it suitable for production deployment in enterprise environments.
By focusing on intelligence rather than simple monitoring, Code Blue enables proactive database management: detecting issues before they escalate, recommending optimizations based on evidence, and maintaining historical context for forensic analysis and compliance reporting.