Abbott

Libre

Patient-Critical Real-Time Glucose Monitoring

Problem
Alert delivery failures during peak load; sync delays affecting 4M+ patients; caregiver notification gaps
Solution
Rebuilt real-time pipeline achieving 99.99% alert delivery with sub-second latency for patient safety
Abbott Libre CGM Platform
4M+ Patients monitored
99.99% Alert delivery
<1s Alert latency
0 Missed critical alerts

When alerts fail, patients are at risk.

Abbott's FreeStyle Libre is a continuous glucose monitor worn by 4+ million diabetes patients. The system sends real-time readings to mobile apps and can alert patients and caregivers when glucose levels are dangerously high or low.

The existing alert infrastructure was showing cracks. During peak usage (mornings, meal times), alert delivery was delayed. Some caregiver notifications were failing silently. Sync between devices could lag by minutes—unacceptable for a patient safety system.

I embedded with the reliability engineering team, mapped every data path from sensor to notification, and identified the single points of failure and capacity bottlenecks.

4M+ patients at risk
2-5 min alert delays
3% silent failures

Patient Engagement Funnel

Tracking patient progression through platform features—gaps indicate reliability issues.

Total Patients: 4.2M (100%)
App Connected: 3.8M (90.5% of total; 9.5% drop-off)
Real-time Sync: 3.2M (84.2% of connected; 15.8% drop-off)
Alert Enabled: 2.8M (87.5% of synced; 12.5% drop-off)
Caregiver Linked: 2.1M (75.0% of alert-enabled; 25.0% drop-off)
Overall: 2.1M of 4.2M patients (50.0%) reach full caregiver-linked monitoring
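The step-to-step rates above are simple ratios of consecutive stage counts. A quick sketch of the arithmetic, using the counts from the funnel (in millions of patients):

```python
# Funnel stage counts from the chart above, in millions of patients.
stages = [
    ("Total Patients", 4.2),
    ("App Connected", 3.8),
    ("Real-time Sync", 3.2),
    ("Alert Enabled", 2.8),
    ("Caregiver Linked", 2.1),
]

# Step conversion = this stage / previous stage; drop-off is the remainder.
for (prev_name, prev_count), (name, count) in zip(stages, stages[1:]):
    step_rate = count / prev_count
    print(f"{name}: {step_rate:.1%} of {prev_name} ({1 - step_rate:.1%} drop-off)")

# Overall conversion = final stage / first stage.
overall = stages[-1][1] / stages[0][1]
print(f"Overall: {overall:.1%} of patients reach caregiver-linked monitoring")
```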

Every millisecond matters.

Redesigned the real-time pipeline with patient safety as the primary constraint. Alert delivery is now a separate, prioritized path with its own capacity allocation and failover.

Architecture principles: no single points of failure for critical alerts, multi-channel delivery (push + SMS + email), automatic retry with escalation, and complete delivery confirmation tracking; a sketch of this delivery flow follows the list below.

Priority Queues: Critical alerts bypass normal processing
Multi-Channel: Push, SMS, and email redundancy
Confirmation: Delivery verification for every alert
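A minimal sketch of how a prioritized, multi-channel delivery path with retry, escalation, and confirmation tracking can be structured. The channel ordering, retry counts, and function names are illustrative assumptions for this sketch, not the production implementation; the transport call is a placeholder.

```python
import time
from dataclasses import dataclass, field

# Illustrative channel order and retry policy (assumptions, not the
# production configuration).
CHANNEL_ORDER = ["push", "sms", "email"]
MAX_ATTEMPTS_PER_CHANNEL = 2
RETRY_BACKOFF_SECONDS = 0.2

@dataclass
class DeliveryRecord:
    """Confirmation tracking: one entry per attempt, kept for auditing."""
    alert_id: str
    attempts: list = field(default_factory=list)
    confirmed_channel: str | None = None

def send(channel: str, alert_id: str, payload: dict) -> bool:
    """Placeholder transport call; in practice this would hit the push,
    SMS, or email provider and return the provider acknowledgement."""
    raise NotImplementedError

def deliver_critical_alert(alert_id: str, payload: dict) -> DeliveryRecord:
    """Try each channel in priority order; escalate to the next channel
    only after the retry budget for the current one is exhausted."""
    record = DeliveryRecord(alert_id=alert_id)
    for channel in CHANNEL_ORDER:
        for attempt in range(1, MAX_ATTEMPTS_PER_CHANNEL + 1):
            ok = False
            try:
                ok = send(channel, alert_id, payload)
            except Exception as exc:
                record.attempts.append((channel, attempt, f"error: {exc}"))
            else:
                record.attempts.append((channel, attempt, "ok" if ok else "nack"))
            if ok:
                record.confirmed_channel = channel
                return record
            time.sleep(RETRY_BACKOFF_SECONDS)
    # No channel confirmed: hand off to the on-call escalation path.
    record.attempts.append(("escalation", 1, "paged on-call"))
    return record
```

Every attempt lands in the delivery record, so "no confirmation" is itself a signal that pages a human rather than failing silently.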

Libre CGM System Context

End-to-end data flow from sensor to patient and caregiver notifications.


Live Glucose Monitor Simulation

Interactive demonstration of real-time CGM data with alerts and trend analysis.

[Simulation snapshot: current reading 105 mg/dL, trend steady; chart shows Target (70-180 mg/dL), Caution, and Urgent bands over the last 5 hours on a 70-250 mg/dL axis]
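As a minimal sketch of how a reading maps to the bands shown in the simulation: the 70-180 mg/dL target range comes from the chart above, while the 54 mg/dL urgent-low and 250 mg/dL urgent-high cutoffs are illustrative assumptions for this sketch, not Abbott's clinical thresholds.

```python
def classify_reading(glucose_mg_dl: float) -> str:
    """Map a CGM reading to an alert band.

    The 70-180 mg/dL target range is from the chart above; the 54 and
    250 mg/dL urgent cutoffs are assumptions for this sketch.
    """
    if glucose_mg_dl < 54 or glucose_mg_dl > 250:
        return "urgent"    # immediate patient + caregiver alert
    if glucose_mg_dl < 70 or glucose_mg_dl > 180:
        return "caution"   # out of target range, lower-priority alert
    return "target"        # in range, no alert

assert classify_reading(105) == "target"   # the 105 mg/dL snapshot above
assert classify_reading(62) == "caution"
assert classify_reading(260) == "urgent"
```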

99.99% is not optional. It's the floor.

Implemented SRE practices with error budgets specifically designed for patient safety systems. The error budget for critical alerts is essentially zero—any missed alert triggers immediate incident response.

Built comprehensive observability: real-time dashboards for alert delivery, automatic anomaly detection, and on-call escalation for any delivery degradation.
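A minimal sketch of the error-budget arithmetic behind the dashboard below, assuming a 30-day rolling window for illustration; the helper names are mine, not the team's tooling.

```python
def error_budget_minutes(slo: float, window_days: float = 30) -> float:
    """Total allowed 'bad minutes' for a given SLO over the rolling window."""
    return (1.0 - slo) * window_days * 24 * 60

def burn_rate(consumed_min: float, slo: float, elapsed_days: float,
              window_days: float = 30) -> float:
    """Consumption relative to an even burn; values above 1.0 mean the
    budget runs out before the window ends and should page the on-call."""
    ideal = error_budget_minutes(slo, window_days) * (elapsed_days / window_days)
    return consumed_min / ideal

# A 99.9% objective over 30 days allows 43.2 minutes of failed delivery;
# the stricter 99.99% critical-alert SLO allows only about 4.3 minutes.
print(round(error_budget_minutes(0.999), 1))    # 43.2
print(round(error_budget_minutes(0.9999), 2))   # 4.32
```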

99.99% delivery SLO
<1s alert latency
0 missed criticals

Alert Delivery Error Budget

Real-time tracking of error budget consumption against 99.99% SLO target.

Budget Remaining: 34.0%
Minutes Left: 14.7
Burn Rate: 1.8x
Days Left: 8
Error Budget: 28.5 / 43.2 min consumed (day 22 of the period)
[Budget consumption trend: actual consumption vs. ideal burn across days 1-28, with incidents marked]

Service Level Indicators
Availability: 99.87% (target: 99.9%)
Latency p99: 185 ms (target: 200 ms)
Error Rate: 0.08% (target: 0.1%)
Throughput: 1250 rps (target: 1000 rps)

Recent Incidents
INC-2847 (Nov 20, 12 min): DB connection pool exhaustion. Budget impact: -2.2 min
INC-2831 (Nov 17, 8 min): Certificate expiration. Budget impact: -2.1 min
INC-2815 (Nov 11, 18 min): Upstream timeout cascade. Budget impact: -3.5 min

When the alert fires, everyone knows what to do.

Created comprehensive runbooks for every failure mode. Trained on-call engineers on patient safety implications. Established clear escalation paths with medical team involvement for critical incidents.

Regular chaos engineering exercises: we deliberately introduce failures to verify that redundancy works and that the team responds correctly.
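A minimal sketch of what one such exercise can look like: deliberately fail the primary push channel and assert that a critical alert still confirms over a fallback channel within the latency budget. The fault-injection hook, client object, and channel names are illustrative assumptions, not the team's actual tooling.

```python
import time

class FakePushOutage:
    """Hypothetical fault-injection hook: simulates the push provider
    rejecting sends for the duration of the exercise."""
    def __init__(self, delivery_client):
        self.client = delivery_client
    def __enter__(self):
        self.client.force_channel_failure("push")
        return self
    def __exit__(self, *exc):
        self.client.clear_channel_failure("push")

def chaos_test_push_outage(delivery_client, latency_budget_s: float = 1.0):
    """Verify that a critical alert still confirms on a fallback channel
    (SMS or email) within the latency budget while push is down."""
    with FakePushOutage(delivery_client):
        start = time.monotonic()
        # "chaos-drill-001" and the payload are illustrative test values.
        record = delivery_client.deliver_critical_alert(
            "chaos-drill-001", {"glucose_mg_dl": 48})
        elapsed = time.monotonic() - start
    assert record.confirmed_channel in ("sms", "email"), "fallback did not confirm"
    assert elapsed <= latency_budget_s, f"fallback took {elapsed:.2f}s"
    return record
```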

Before

  • ✕ 2-5 min alert delays
  • ✕ Silent delivery failures
  • ✕ No delivery confirmation

After

  • ✓ Sub-second delivery
  • ✓ Multi-channel redundancy
  • ✓ 100% confirmation tracking

Incident Response Timeline

Sample incident showing detection, response, and resolution within SLA.

Time to Detect: 2m
Time to Ack: 2m
Time to Mitigate: 16m
Time to Resolve: 72m
Customers Impacted: 12,450

14:32 · Alert Triggered (Datadog) · PagerDuty alert: Payment API latency > 500ms p99
14:34 · Incident Acknowledged (Mike Johnson) · On-call engineer acknowledged alert
14:36 · SEV-1 Declared (Sarah Chen) · Escalated to SEV-1, incident commander assigned
14:38 · War Room Opened (System) · Slack channel #inc-2847 created, Zoom bridge started
14:42 · Initial Triage (David Kim) · DB connection pool exhaustion identified as likely cause
14:48 · Mitigation Attempted (Mike Johnson) · Increased connection pool size from 100 to 200
14:52 · Partial Recovery (Sarah Chen) · Latency improved but still elevated, investigating further
15:05 · Root Cause Found (Lisa Wang) · Slow query from recent deployment causing connection leak
15:12 · Rollback Initiated (Mike Johnson) · Rolling back deployment v2.4.1 → v2.4.0
15:28 · Rollback Complete (System) · All pods running v2.4.0, connections stabilizing
15:35 · Monitoring Recovery (Sarah Chen) · Latency returned to normal, error rate dropping
15:44 · Incident Resolved (Sarah Chen) · All metrics nominal, incident closed
Metric Impact
Latency p99: 180 ms → 2400 ms peak → 175 ms after recovery
Error Rate: 0.02% → 4.8% peak → 0.03% after recovery
Throughput: 1200 rps → 450 rps at the trough → 1180 rps after recovery
Impacted Services: payment-api, checkout-service, order-service
Incident Commander: Sarah Chen
Postmortem scheduled: 11/22/2026 (0/3 action items complete)

The Result

Alert delivery now runs at 99.99% with sub-second latency. Zero missed critical alerts since the new architecture launched. Caregiver notification reliability improved from 97% to 99.95%.

The platform now handles 4M+ patients with room to scale. The reliability improvements have directly contributed to better patient outcomes—caregivers trust the alerts, and patients trust the system.

Alert Latency: 2-5 min → <1 sec
Delivery Rate: 97% → 99.99%
Critical Alerts: unknown → 0 missed

Technology Stack

Core technologies powering the patient-critical real-time platform.

Real-Time Infrastructure
AWS
Global infrastructure
Redis
Real-time data store
Kubernetes
Container platform
Monitoring
Prometheus
Time-series metrics
Grafana
Dashboards
Datadog
Application monitoring
Compliance
HIPAA
PHI protection
FDA
Medical device software