1. 🎯 Phase Summary
| Field | Detail |
|---|---|
| Phase | Hi-End Server Upgrade |
| Duration | 24 Dec 2026 - 6 Jan 2027 (2 weeks) |
| Goal | Migrate from mid-tier inference (1× L40S equivalent) to Hi-End (4× H100 SXM or 4× L40S budget). Self-host Whisper · Llama · Qwen · embed model · ColBERT reranker. End OpenAI cloud burst dependency for sensitive tenants. |
| Capacity | 2.5 FTE (1 DevOps + 0.5 Eng Lead + 0.5 Founder + 0.5 BE) |
| Critical milestone | 6 Jan 2027 · Hi-End live · MOH audit prep starts 7 Jan |
| Blocked by | Hardware procurement (lead time 4-6 weeks · order in Nov) |
| Blocks | MOH audit prep (7 Jan) requires demonstrable on-prem inference |
2. 🤔 Why Upgrade Now
- Tenant capacity: 4 tenants today · 10+ projected · mid-tier saturates at ~7 concurrent ambient SOAP sessions
- OpenAI cost trajectory: RM 8K/month at 4 tenants · projected to reach RM 25K/month at 10 tenants · Hi-End amortises in ~14 months
- Sensitive-tenant on-prem mandate: 2 of 5 prospect clinics require zero cloud leakage of clinical data · for those tenants this means 100% local inference
- MOH audit defensibility: Easier compliance story when inference is on-prem · audit log + model version + GPU allocation traceable
- Performance ceiling: Mid-tier ambient SOAP p95 ≈ 6s · Hi-End targets ≤ 2s · meaningful UX win
- Multi-model freedom: Run Whisper-large + Llama-70B + Qwen-72B + embed concurrently without VRAM thrash
3. 🖥️ Hardware Spec (Two Options)
PREFERRED · H100 SXM
Premium Tier
- 4× NVIDIA H100 SXM 80GB
- 2× AMD EPYC 9654 96-core
- 1.5TB DDR5 ECC
- 4× 7.68TB NVMe Gen5
- 2× 100GbE NIC redundant
- Dual 3kW redundant PSU
Capex: ~RM 1.2M · TDP ~6kW
BUDGET · L40S
Pragmatic Tier
- 4× NVIDIA L40S 48GB
- 2× AMD EPYC 9354 32-core
- 768GB DDR5 ECC
- 2× 7.68TB NVMe Gen4
- 2× 25GbE NIC
- 2× 1.6kW redundant PSU
Capex: ~RM 280K · TDP ~3kW
Decision: Founder + Eng Lead decide by 15 Nov based on cohort onboarding traction. Default = L40S Budget tier (~14-month payback) · upgrade to H100 SXM later if 10+ tenants are confirmed.
4. 🏗️ Architecture Plan
┌──────────────────────────────────────────────┐
│ MediEco App Tier (srv151 LSWS · 2 nodes) │
│ Laravel 11 · Filament 3 · Redis cluster │
└─────────────────┬────────────────────────────┘
│ HTTPS / mTLS
▼
┌──────────────────────────────────────────────┐
│ Inference Gateway (LiteLLM proxy) │
│ Routes: │
│ /audio → Whisper-large-v3 │
│ /chat → Llama-3.1-70B-Instruct │
│ /reason → Qwen-2.5-72B (clinical) │
│ /embed → bge-m3 + ColBERT reranker │
│ /vision → LLaVA-1.6 (document/photo) │
└─────────────────┬────────────────────────────┘
│ NCCL · TensorRT-LLM
▼
┌──────────────────────────────────────────────┐
│ Hi-End GPU Server (4× H100 / 4× L40S) │
│ vLLM · Triton Inference Server │
│ Per-model GPU allocation: │
│ GPU 0: Whisper + embed │
│ GPU 1: Llama 70B (sharded) │
│ GPU 2: Llama 70B (sharded) │
│ GPU 3: Qwen 72B + LLaVA │
│ Concurrent capacity: 12 ambient SOAP, 30 chat │
└─────────────────┬────────────────────────────┘
│ pgvector replicas
▼
┌──────────────────────────────────────────────┐
│ Storage Tier (NVMe + cold backup) │
│ MariaDB 11 cluster │
│ pgvector for CPG + RAG │
│ MinIO (S3-compatible) for audio + DICOM │
└──────────────────────────────────────────────┘
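Whether the per-GPU allocation in the diagram fits depends mostly on weight precision. The sketch below is a rough planning estimate, not a measurement: it assumes memory is dominated by weights (parameters × bytes per parameter), reserves a hypothetical 20% of each card for KV cache and activations, applies one precision to every model, and uses approximate sizes that are not stated in this plan (Whisper-large-v3 ≈ 1.55B, bge-m3 ≈ 0.57B, ColBERT ≈ 0.11B, LLaVA taken as a 7B variant).

```python
# Rough VRAM fit check for the per-GPU allocation shown in the diagram.
# ASSUMPTIONS (not from the plan): weights-only footprint, 20% headroom for
# KV cache/activations, one precision for all models, approximate model sizes,
# and LLaVA-1.6 taken as a 7B variant.

BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}

# GPU index -> list of (model, billions of parameters, share of model on this GPU)
ALLOCATION = {
    0: [("whisper-large-v3", 1.55, 1.0), ("bge-m3", 0.57, 1.0), ("colbert", 0.11, 1.0)],
    1: [("llama-3.1-70b", 70.0, 0.5)],  # sharded across GPU 1 and GPU 2
    2: [("llama-3.1-70b", 70.0, 0.5)],
    3: [("qwen-2.5-72b", 72.0, 1.0), ("llava-1.6", 7.0, 1.0)],
}


def weights_gb(params_billion, precision, share=1.0):
    """Weight footprint in decimal GB: 1B params at 1 byte/param is 1 GB."""
    return params_billion * BYTES_PER_PARAM[precision] * share


def gpu_fits(gpu, vram_gb, precision, headroom=0.20):
    """Return (GB needed, True if it fits inside VRAM minus the headroom reserve)."""
    need = sum(weights_gb(p, precision, s) for _, p, s in ALLOCATION[gpu])
    budget = vram_gb * (1.0 - headroom)
    return need, need <= budget


if __name__ == "__main__":
    for card, vram in (("L40S", 48), ("H100", 80)):
        for precision in BYTES_PER_PARAM:
            for gpu in ALLOCATION:
                need, ok = gpu_fits(gpu, vram, precision)
                verdict = "fits" if ok else "too big"
                print(f"{card} {precision} GPU{gpu}: {need:6.1f} GB needed -> {verdict}")
```

Under these assumptions GPU 3 is the tightest slot: Qwen 72B plus a 7B LLaVA needs about 158 GB at FP16, so it only fits a single 80 GB card when quantised, and it does not fit a 48 GB card with the 20% reserve even at INT4. That suggests the allocation may need revisiting before the tier decision, particularly for the L40S option.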
5. 📅 Day-by-Day Plan
D1 · Wed 24 Dec · Hardware Receipt + Rack
Hardware delivered · rack inventory · power audit · UPS sized.
D2 · Thu 25 Dec · Cabling + Network
Power · cooling · 100G NIC trunk to switch · DNS records prep.
D3 · Fri 26 Dec · OS + Driver Install
Ubuntu 24.04 LTS · CUDA 12.4 · NVIDIA driver · cuDNN · NCCL test.
D4 · Mon 29 Dec · Inference Stack
vLLM + Triton + LiteLLM proxy install · per-model config.
D5 · Tue 30 Dec · Model Pull + Smoke Test
Whisper-large-v3 · Llama-3.1-70B · Qwen-2.5-72B · bge-m3. Smoke benchmark.
D6 · Wed 31 Dec · Year-End Burn-In
24h burn-in load test · thermals · stability · sustained throughput.
D7 · Thu 1 Jan · Holiday · Monitor Only
On-call monitoring · auto-recover testing.
D8 · Fri 2 Jan · Traffic Mirror Mode
Production traffic mirrored to Hi-End · response compared · accuracy verified.
D9 · Mon 5 Jan · Cutover Tenant 1 (Doc Zam)
Tenant 1 routed to Hi-End · monitor 24h · rollback path tested.
D10 · Tue 6 Jan · Cutover All Tenants
All tenants on Hi-End · OpenAI fallback only for emergencies · MOH audit prep starts 7 Jan.
6. 🔄 Migration Strategy
- Mirror first, cutover second: On Day 8 the Hi-End server receives mirrored traffic but its responses are not returned to users · they are only compared against production for accuracy and latency
- Per-tenant cutover: Day 9 cuts over Doc Zam first (highest comfort) · remaining tenants follow on Day 10
- OpenAI fallback retained: LiteLLM proxy keeps OpenAI as a backup route · auto-failover if a Hi-End response takes > 10s or the error rate exceeds 5%
- Per-model rollback: Each model can be rolled back independently · a Whisper rollback doesn't affect Llama
- Audit log: M9 records which model + which version + which GPU served each request
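The failover rule above (latency over 10s or error rate over 5%) can be sketched as a small routing decision. This is plain Python for illustration, not LiteLLM configuration or its API. The sliding-window size, the class and function names, and the `tenant_allows_cloud` flag are all hypothetical. The flag reflects an assumption that tenants with a zero-cloud mandate must never be failed over to OpenAI.

```python
from collections import deque

LATENCY_LIMIT_S = 10.0   # from the plan: fail over if a response takes > 10s
ERROR_RATE_LIMIT = 0.05  # from the plan: fail over if error rate > 5%


class RouteHealth:
    """Sliding window of recent on-prem request outcomes (window size is hypothetical)."""

    def __init__(self, window=100):
        self.outcomes = deque(maxlen=window)  # True = success, False = error

    def record(self, ok):
        self.outcomes.append(ok)

    def error_rate(self):
        if not self.outcomes:
            return 0.0
        failures = sum(1 for ok in self.outcomes if not ok)
        return failures / len(self.outcomes)


def choose_backend(health, observed_latency_s, tenant_allows_cloud):
    """Pick a backend for the next request.

    ASSUMPTION: tenants with a zero-cloud mandate stay on the local server
    even when it is degraded (to be queued or retried locally).
    """
    degraded = (
        observed_latency_s > LATENCY_LIMIT_S
        or health.error_rate() > ERROR_RATE_LIMIT
    )
    if degraded and tenant_allows_cloud:
        return "openai"
    return "hi-end"


if __name__ == "__main__":
    health = RouteHealth(window=10)
    for _ in range(10):
        health.record(True)
    print(choose_backend(health, 1.2, tenant_allows_cloud=True))    # healthy: local
    print(choose_backend(health, 12.0, tenant_allows_cloud=True))   # slow: cloud backup
    print(choose_backend(health, 12.0, tenant_allows_cloud=False))  # sensitive: stay local
```

Keeping the per-tenant flag inside the routing decision, rather than in a separate check, means a degraded server cannot silently send sensitive-tenant traffic to the cloud.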
7. 💰 Cost Breakdown
| Item | Capex | Monthly Opex | Notes |
|---|---|---|---|
| Hardware (L40S budget tier) | RM 280K | — | 3-year amortisation ≈ RM 7.8K/mo |
| Colocation rack (full unit) | — | RM 1.5K | Power + cooling + bandwidth |
| OS + maintenance | — | RM 0.5K | DevOps share-of-cost |
| OpenAI cloud burst (residual) | — | RM 1K | Down from RM 8K · backup only |
| Total monthly run-rate | — | RM 10.8K | vs cloud-only ~RM 25K projected |
| Payback period | — | — | ~14 months at projected 10-tenant load |
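The table's arithmetic can be reproduced in a few lines. The sketch below uses only the figures above. The one assumption is the payback definition: capex divided by the monthly cash saving against the projected cloud-only bill, which is a simplification because the plan does not state its method.

```python
# Figures taken from the cost table; amounts in RM.
CAPEX_RM = 280_000
AMORTISE_MONTHS = 36  # 3-year amortisation
OPEX_RM = {
    "colocation": 1_500,
    "os_maintenance": 500,
    "openai_residual": 1_000,
}
CLOUD_ONLY_RM = 25_000  # projected monthly cloud bill at 10 tenants


def monthly_amortisation():
    """Capex spread evenly over the amortisation period."""
    return CAPEX_RM / AMORTISE_MONTHS


def monthly_run_rate():
    """Amortised hardware plus recurring opex."""
    return monthly_amortisation() + sum(OPEX_RM.values())


def simple_payback_months():
    """ASSUMPTION: steady-state payback = capex / (cloud bill - cash opex)."""
    monthly_saving = CLOUD_ONLY_RM - sum(OPEX_RM.values())
    return CAPEX_RM / monthly_saving


if __name__ == "__main__":
    print(f"Amortisation: RM {monthly_amortisation() / 1000:.1f}K/mo")
    print(f"Run-rate:     RM {monthly_run_rate() / 1000:.1f}K/mo")
    print(f"Payback:      {simple_payback_months():.1f} months")
```

The amortisation and run-rate match the table (about RM 7.8K and RM 10.8K per month). The steady-state payback comes out nearer 13 months. The ~14 months quoted would be consistent with savings ramping up as tenants onboard, but that reading is an assumption.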
8. 👥 Team Capacity
| Role | Allocation | Focus |
|---|---|---|
| DevOps | 1.0 FTE | Hardware install · OS · driver · inference stack |
| Eng Lead | 0.5 FTE | Architecture · LiteLLM config · failover |
| BE Dev | 0.5 FTE | Inference gateway integration · audit log extension |
| Founder | 0.5 FTE | Vendor liaison · cost approval · risk decisions |
9. ⚡ Performance Targets
| Metric | Mid-tier baseline | Hi-End target |
|---|---|---|
| Whisper p95 (10-min audio) | ~8s | ≤ 2.5s |
| SOAP generation p95 | ~6s | ≤ 2s |
| Concurrent ambient SOAP | 7 | ≥ 12 |
| Concurrent chat sessions | 15 | ≥ 30 |
| CPG retrieval p95 | ~400ms | ≤ 150ms |
| Inference availability | 99.5% | ≥ 99.9% |
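The latency targets above are p95 values, so the burn-in and mirror-mode runs need a consistent way to compute them. The sketch below is one minimal option, assuming the nearest-rank percentile definition (other definitions interpolate and give slightly different numbers). The metric keys and the sample data are hypothetical.

```python
# Hi-End p95 targets from the table, in seconds.
TARGETS_S = {
    "whisper_10min": 2.5,
    "soap_generation": 2.0,
    "cpg_retrieval": 0.150,
}


def p95(samples):
    """Nearest-rank 95th percentile: smallest value with >= 95% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def meets_target(metric, samples):
    """True if the measured p95 is within the Hi-End target for this metric."""
    return p95(samples) <= TARGETS_S[metric]


if __name__ == "__main__":
    # Hypothetical latencies: 100 samples from 0.01s to 1.00s.
    latencies = [i / 100 for i in range(1, 101)]
    print("p95:", p95(latencies))
    print("SOAP target met:", meets_target("soap_generation", latencies))
    print("CPG target met:", meets_target("cpg_retrieval", latencies))
```

Using the same function for the mid-tier baseline and the Hi-End measurements keeps the before/after comparison like for like.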
10. 🛡️ Contingency
| Risk | Trigger | Response |
|---|---|---|
| Hardware delivery slip | > 4 weeks late | Cloud burst extended · cloud GPU rental · MOH audit prep delayed 2 weeks |
| Power/cooling inadequate | Thermal alarm | De-rate to 3 GPUs · upgrade colo cooling · or relocate |
| Driver instability | NCCL crash | Pin known-stable driver · LTS branch only · TensorRT-LLM fallback |
| Mirror mode reveals accuracy gap | Quality regression | Hold cutover · iterate prompts · upgrade model size if needed |
| Tenant rejects on-prem cutover | Wants OpenAI assurance | LiteLLM route allow-list · keep tenant on cloud · charge premium |