Abbott

Libre

Patient-Critical Real-Time Glucose Monitoring

Problem
Alert delivery failures during peak load; sync delays affecting 4M+ patients; caregiver notification gaps
Solution
Rebuilt real-time pipeline achieving 99.99% alert delivery with sub-second latency for patient safety
Libre Platform
4M+ Patients monitored
99.99% Alert delivery
<1s Alert latency
0 Missed critical alerts

When alerts fail, patients are at risk.

Abbott's FreeStyle Libre is a continuous glucose monitor worn by 4+ million diabetes patients. The system sends real-time readings to mobile apps and can alert patients and caregivers when glucose levels are dangerously high or low.

The existing alert infrastructure was showing cracks. During peak usage (mornings, meal times), alert delivery was delayed. Some caregiver notifications were failing silently. Sync between devices could lag by minutes—unacceptable for a patient safety system.

I embedded with the reliability engineering team. Mapped every data path from sensor to notification. Identified single points of failure and capacity bottlenecks.

4M+ patients at risk
2-5 min alert delays
3% silent failures
Diagnose

Patient Engagement Funnel

Tracking patient progression through platform features—gaps indicate reliability issues.

Total Patients: 4,200K (100%)
App Connected: 3,800K (90.5% of previous stage, 9.5% drop-off)
Real-time Sync: 3,200K (84.2% of previous stage, 15.8% drop-off)
Alert Enabled: 2,800K (87.5% of previous stage, 12.5% drop-off)
Caregiver Linked: 2,100K (75.0% of previous stage, 25.0% drop-off)
Overall rate (Total Patients → Caregiver Linked): 2,100K of 4,200K (50.00%)
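The percentages in this funnel are relative to the previous stage, not to the total. A small sketch of the arithmetic, using the stage counts shown above (illustrative script, not part of the platform):

```python
# Stage counts from the funnel above, in thousands of patients.
stages = [
    ("Total Patients", 4200),
    ("App Connected", 3800),
    ("Real-time Sync", 3200),
    ("Alert Enabled", 2800),
    ("Caregiver Linked", 2100),
]

# Step rate = share of the *previous* stage; drop-off is its complement.
for (prev_name, prev), (name, count) in zip(stages, stages[1:]):
    step_rate = count / prev * 100
    print(f"{name}: {step_rate:.1f}% of {prev_name} ({100 - step_rate:.1f}% drop-off)")

# Overall rate = final stage over first stage (2,100K / 4,200K = 50.00%).
overall = stages[-1][1] / stages[0][1] * 100
print(f"Overall (Total Patients -> Caregiver Linked): {overall:.2f}%")
```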

Every millisecond matters.

Redesigned the real-time pipeline with patient safety as the primary constraint. Alert delivery is now a separate, prioritized path with its own capacity allocation and failover.

Architecture principles: no single points of failure for critical alerts, multi-channel delivery (push + SMS + email), automatic retry with escalation, and complete delivery confirmation tracking.

Priority Queues: Critical alerts bypass normal processing
Multi-Channel: Push, SMS, and email redundancy
Confirmation: Delivery verification for every alert
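A minimal sketch of that prioritized, multi-channel delivery path with retry and escalation, assuming hypothetical channel senders for push, SMS, and email; the real pipeline's queueing, capacity allocation, and provider APIs are not shown:

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical channel sender: (patient_id, message) -> delivery confirmed?
Channel = Callable[[str, str], bool]

@dataclass
class DeliveryAttempt:
    channel: str
    delivered: bool
    timestamp: float

def deliver_critical_alert(
    patient_id: str,
    message: str,
    channels: List[Tuple[str, Channel]],
    retries_per_channel: int = 2,
) -> List[DeliveryAttempt]:
    """Try each channel in priority order; escalate when one is exhausted.

    Every attempt is recorded so delivery-confirmation tracking can verify
    that the alert was actually received. Illustrative sketch only.
    """
    log: List[DeliveryAttempt] = []
    for name, send in channels:
        for _ in range(retries_per_channel + 1):
            confirmed = send(patient_id, message)
            log.append(DeliveryAttempt(name, confirmed, time.time()))
            if confirmed:
                return log                 # delivered; stop escalating
        # Channel exhausted its retries: escalate to the next channel.
    # No channel confirmed delivery: page the on-call engineer (not shown).
    return log
```

In the production design, this critical path runs on its own priority queue with dedicated capacity and failover, per the principles above.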
Architect

Libre CGM System Context

End-to-end data flow from sensor to patient and caregiver notifications.


99.99% is not optional. It's the floor.

Implemented SRE practices with error budgets specifically designed for patient safety systems. The error budget for critical alerts is essentially zero—any missed alert triggers immediate incident response.

Built comprehensive observability: real-time dashboards for alert delivery, automatic anomaly detection, and on-call escalation for any delivery degradation.
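As one example of what that automatic degradation alerting can look like, here is a hedged sketch of a multi-window burn-rate check, assuming a hypothetical delivery_sli(window) query that returns the fraction of critical alerts confirmed delivered over that window; the actual dashboards, anomaly detection, and paging integration are not shown:

```python
from typing import Callable

SLO = 0.9999  # 99.99% alert-delivery objective

def burn_rate(sli: float, slo: float = SLO) -> float:
    """Error-budget burn rate: 1.0 means burning exactly on budget."""
    return (1.0 - sli) / (1.0 - slo)

def should_page(delivery_sli: Callable[[str], float]) -> bool:
    """Multi-window check: page only when both a short and a long window
    show the budget burning far faster than planned (14.4x is a commonly
    used fast-burn threshold; tune toward zero tolerance in practice)."""
    fast = burn_rate(delivery_sli("5m"))
    slow = burn_rate(delivery_sli("1h"))
    return fast > 14.4 and slow > 14.4
```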

99.99% delivery SLO
<1s alert latency
0 missed criticals
Engineer

Alert Delivery Error Budget

Real-time tracking of error budget consumption against 99.99% SLO target.

34.0% Budget Remaining
14.7 Minutes Left
1.8x Burn Rate
8 Days Left
Error Budget: 28.5 / 43.2 min consumed
Budget Consumption Trend (Day 22 of the period): chart of Actual Consumption vs. Ideal Burn with incident markers, Day 1 through Day 28.
Service Level Indicators
Availability: 99.87% (target: 99.9%)
Latency p99: 185ms (target: 200ms)
Error Rate: 0.08% (target: 0.1%)
Throughput: 1250rps (target: 1000rps)
Recent Incidents
INC-2847 (Nov 20, 12 min): DB connection pool exhaustion. Budget impact: -2.2 min
INC-2831 (Nov 17, 8 min): Certificate expiration. Budget impact: -2.1 min
INC-2815 (Nov 11, 18 min): Upstream timeout cascade. Budget impact: -3.5 min
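For reference, the budget figures in the cards above reconcile under the conventional 30-day window at the 99.9% availability target listed in the Service Level Indicators (a sketch of the arithmetic, not the production tooling):

```python
MINUTES_PER_DAY = 24 * 60
window_days = 30                 # assumed 30-day SLO window
availability_target = 0.999      # availability target from the SLI list above

budget_min = window_days * MINUTES_PER_DAY * (1 - availability_target)
consumed_min = 28.5              # value from the dashboard card

remaining_min = budget_min - consumed_min
remaining_pct = remaining_min / budget_min * 100

print(f"budget:    {budget_min:.1f} min")    # 43.2 min
print(f"remaining: {remaining_min:.1f} min") # 14.7 min
print(f"remaining: {remaining_pct:.1f}%")    # 34.0%

# Burn rate and days-to-exhaustion depend on the measurement window
# (e.g. last 24h vs. whole period), so they are not recomputed here.
```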

When the alert fires, everyone knows what to do.

Created comprehensive runbooks for every failure mode. Trained on-call engineers on patient safety implications. Established clear escalation paths with medical team involvement for critical incidents.

Regular chaos engineering exercises: we deliberately introduce failures to verify that redundancy works and that the team responds correctly.
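One such exercise, sketched as a test against the hypothetical deliver_critical_alert from the architecture section (assumed here to live in a module named alert_pipeline); the real exercises run against staging infrastructure rather than unit tests:

```python
# Chaos-style regression check, assuming the deliver_critical_alert sketch
# above is saved as a hypothetical module named alert_pipeline.
from alert_pipeline import deliver_critical_alert

def test_push_outage_falls_back_to_sms():
    """Inject a dead push channel and verify SMS still confirms delivery."""
    dead_push = lambda patient_id, msg: False   # simulated push outage
    healthy_sms = lambda patient_id, msg: True  # fallback channel stays up

    log = deliver_critical_alert(
        patient_id="patient-123",
        message="Glucose LOW: 54 mg/dL",  # illustrative alert text
        channels=[("push", dead_push), ("sms", healthy_sms)],
    )

    # Escalation must reach SMS, and the final attempt must be confirmed.
    assert log[-1].channel == "sms"
    assert log[-1].delivered
```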

Before

  • ✕ 2-5 min alert delays
  • ✕ Silent delivery failures
  • ✕ No delivery confirmation

After

  • ✓ Sub-second delivery
  • ✓ Multi-channel redundancy
  • ✓ 100% confirmation tracking
Enable

Incident Response Timeline

Sample incident showing detection, response, and resolution within SLA.

2m Time to Detect
2m Time to Ack
16m Time to Mitigate
72m Time to Resolve
12,450 Customers Impacted
14:32 · Alert Triggered · PagerDuty alert: Payment API latency > 500ms p99 (Datadog)
14:34 · Incident Acknowledged · On-call engineer acknowledged the alert (Mike Johnson)
14:36 · SEV-1 Declared · Escalated to SEV-1, incident commander assigned (Sarah Chen)
14:38 · War Room Opened · Slack channel #inc-2847 created, Zoom bridge started (System)
14:42 · Initial Triage · DB connection pool exhaustion identified as likely cause (David Kim)
14:48 · Mitigation Attempted · Increased connection pool size from 100 to 200 (Mike Johnson)
14:52 · Partial Recovery · Latency improved but still elevated, investigating further (Sarah Chen)
15:05 · Root Cause Found · Slow query from recent deployment causing connection leak (Lisa Wang)
15:12 · Rollback Initiated · Rolling back deployment v2.4.1 → v2.4.0 (Mike Johnson)
15:28 · Rollback Complete · All pods running v2.4.0, connections stabilizing (System)
15:35 · Monitoring Recovery · Latency returned to normal, error rate dropping (Sarah Chen)
15:44 · Incident Resolved · All metrics nominal, incident closed (Sarah Chen)
Metric Impact (baseline → during incident → recovered)
Latency p99: 180ms → 2400ms → 175ms
Error Rate: 0.02% → 4.8% → 0.03%
Throughput: 1200rps → 450rps → 1180rps
Impacted Services: payment-api, checkout-service, order-service
Incident Commander: Sarah Chen
Postmortem scheduled: 11/22/2026
0/3 action items complete

The Result

Alert delivery now runs at 99.99% with sub-second latency. Zero missed critical alerts since the new architecture launched. Caregiver notification reliability improved from 97% to 99.95%.

The platform now handles 4M+ patients with room to scale. The reliability improvements have directly contributed to better patient outcomes—caregivers trust the alerts, and patients trust the system.

Alert Latency: 2-5 min → <1 sec
Delivery Rate: 97% → 99.99%
Critical Alerts: Unknown → 0 missed
Impact