MediEco · Hi-End Server Upgrade · 24 Dec 2026

1. 🎯 Phase Summary

Phase	Hi-End Server Upgrade
Duration	24 Dec 2026 - 6 Jan 2027 (2 weeks)
Goal	Migrate from mid-tier inference (1× L40S equivalent) to Hi-End (4× H100 SXM, or 4× L40S as the budget option). Self-host Whisper · Llama · Qwen · embedding model · ColBERT reranker. End the OpenAI cloud-burst dependency for sensitive tenants.
Capacity	2.5 FTE (1 DevOps + 0.5 Eng Lead + 0.5 Founder + 0.5 BE)
Critical milestone	6 Jan 2027 · Hi-End live · MOH audit prep starts 7 Jan
Blocked by	Hardware procurement (lead time 4-6 weeks · order in Nov; a 6-week lead time needs the order placed by 12 Nov for 24 Dec delivery)
Blocks	MOH audit prep (7 Jan) requires demonstrable on-prem inference

2. 🤔 Why Upgrade Now

Tenant capacity: 4 tenants today · 10+ projected · mid-tier saturates at ~7 concurrent ambient SOAP sessions
OpenAI cost trajectory: RM 8K/month at 4 tenants · projected RM 25K/month at 10 tenants · Hi-End (L40S budget tier) pays back in ~14 months
Sensitive-tenant on-prem mandate: 2 of 5 prospect clinics require zero cloud leakage of clinical data · that means 100% local inference
MOH audit defensibility: Easier compliance story when inference is on-prem · audit log + model version + GPU allocation traceable
Performance ceiling: Mid-tier ambient SOAP p95 ≈ 6s · Hi-End targets ≤ 2s · meaningful UX win
Multi-model freedom: Run Whisper-large + Llama-70B + Qwen-72B + embed concurrently without VRAM thrash

3. 🖥️ Hardware Spec (Two Options)

PREFERRED · H100 SXM

Premium Tier

4× NVIDIA H100 SXM 80GB
2× AMD EPYC 9654 96-core
1.5TB DDR5 ECC
4× 7.68TB NVMe Gen5
2× 100GbE NIC redundant
Dual 3kW redundant PSU

Capex: ~RM 1.2M · TDP ~6kW

BUDGET · L40S

Pragmatic Tier

4× NVIDIA L40S 48GB
2× AMD EPYC 9354 32-core
768GB DDR5 ECC
2× 7.68TB NVMe Gen4
2× 25GbE NIC
2× 1.6kW redundant PSU

Capex: ~RM 280K · TDP ~3kW

Decision: Founder + Eng Lead choose by 15 Nov based on cohort onboarding traction. Default = L40S budget tier (~14-month payback at projected 10-tenant load) · upgrade to H100 SXM later if 10+ tenants are confirmed.

4. 🏗️ Architecture Plan

┌────────────────────────────────────────────────┐
│ MediEco App Tier (srv151 LSWS · 2 nodes)       │
│ Laravel 11 · Filament 3 · Redis cluster        │
└─────────────────┬──────────────────────────────┘
                  │ HTTPS / mTLS
                  ▼
┌────────────────────────────────────────────────┐
│ Inference Gateway (LiteLLM proxy)              │
│ Routes:                                        │
│   /audio  → Whisper-large-v3                   │
│   /chat   → Llama-3.1-70B-Instruct             │
│   /reason → Qwen-2.5-72B (clinical)            │
│   /embed  → bge-m3 + ColBERT reranker          │
│   /vision → LLaVA-1.6 (document/photo)         │
└─────────────────┬──────────────────────────────┘
                  │ HTTP · OpenAI-compatible API
                  ▼
┌────────────────────────────────────────────────┐
│ Hi-End GPU Server (4× H100 / 4× L40S)          │
│ vLLM · Triton Inference Server                 │
│ Per-model GPU allocation:                      │
│   GPU 0: Whisper + embed                       │
│   GPU 1: Llama 70B (sharded)                   │
│   GPU 2: Llama 70B (sharded)                   │
│   GPU 3: Qwen 72B + LLaVA                      │
│ Concurrent capacity: 12 ambient SOAP, 30 chat  │
└─────────────────┬──────────────────────────────┘
                  │ pgvector replicas
                  ▼
┌────────────────────────────────────────────────┐
│ Storage Tier (NVMe + cold backup)              │
│ MariaDB 11 cluster                             │
│ PostgreSQL + pgvector for CPG + RAG            │
│ MinIO (S3-compatible) for audio + DICOM        │
└────────────────────────────────────────────────┘
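
The per-model GPU allocation above can be sanity-checked with a weights-only VRAM estimate. The sketch below is a rough check, not a sizing tool: the parameter counts are approximations, the LLaVA variant is assumed to be 7B (the plan does not say), the 80% headroom factor is an assumption, and KV cache, activations and runtime overhead are ignored, so real headroom will be smaller. All names are hypothetical.

```python
# Weights-only VRAM estimate per GPU group. All figures are rough assumptions.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}

# Approximate parameter counts in billions (assumed; LLaVA variant is not stated in the plan).
PARAMS_B = {
    "whisper-large-v3": 1.55,
    "bge-m3": 0.57,
    "llama-70b": 70.0,
    "qwen-72b": 72.0,
    "llava-1.6-7b": 7.0,
}


def weights_gb(model: str, precision: str) -> float:
    """Weights-only footprint in GB (1 GB = 1e9 bytes)."""
    return PARAMS_B[model] * BYTES_PER_PARAM[precision]


def fits(models: list[str], precision: str, gpus: int, vram_gb: float,
         headroom: float = 0.8) -> bool:
    """True if total weights fit within `headroom` of the group's combined VRAM."""
    need = sum(weights_gb(m, precision) for m in models)
    return need <= gpus * vram_gb * headroom


if __name__ == "__main__":
    # GPU groups as drawn in the architecture diagram.
    groups = {
        "GPU 0": (["whisper-large-v3", "bge-m3"], 1),
        "GPU 1-2": (["llama-70b"], 2),
        "GPU 3": (["qwen-72b", "llava-1.6-7b"], 1),
    }
    for card, vram in (("H100 80GB", 80.0), ("L40S 48GB", 48.0)):
        for name, (models, gpus) in groups.items():
            ok = [p for p in BYTES_PER_PARAM if fits(models, p, gpus, vram)]
            print(f"{card} {name}: fits at {ok or 'none of the listed precisions'}")
```

Under these assumptions, Qwen 72B plus LLaVA on a single GPU fits only at 4-bit on an H100 and at none of the listed precisions on an L40S, so the GPU 3 allocation and the quantisation choice per model are worth confirming before the Day 4 per-model config.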

5. 📅 Day-by-Day Plan

D1 · Wed 24 Dec · Hardware Receipt + Rack
Hardware delivered · rack inventory · power audit · UPS sized.

D2 · Thu 25 Dec · Cabling + Network
Power · cooling · NIC trunk to switch (100GbE on H100 tier · 25GbE on L40S tier) · DNS records prep.

D3 · Fri 26 Dec · OS + Driver Install
Ubuntu 24.04 LTS · CUDA 12.4 · NVIDIA driver · cuDNN · NCCL test.

D4 · Mon 29 Dec · Inference Stack
vLLM + Triton + LiteLLM proxy install · per-model config.

D5 · Tue 30 Dec · Model Pull + Smoke Test
Whisper-large-v3 · Llama-3.1-70B · Qwen-2.5-72B · bge-m3. Smoke benchmark.

D6 · Wed 31 Dec · Year-End Burn-In
24h burn-in load test · thermals · stability · sustained throughput.

D7 · Thu 1 Jan · Holiday (Monitor Only)
On-call monitoring · auto-recovery testing.

D8 · Fri 2 Jan · Traffic Mirror Mode
Production traffic mirrored to Hi-End · responses compared · accuracy verified.

D9 · Mon 5 Jan · Cutover Tenant 1 (Doc Zam)
Tenant 1 routed to Hi-End · monitor 24h · rollback path tested.

D10 · Tue 6 Jan · Cutover All Tenants
All tenants on Hi-End · OpenAI fallback only for emergencies · MOH audit prep starts 7 Jan.
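
The Day 8 mirror-mode comparison could be summarised along these lines. This is a minimal sketch: the `MirrorPair` record, the 0.9 similarity threshold, and the use of character-level similarity from `difflib` are all assumptions standing in for whatever clinical accuracy check the team actually applies.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class MirrorPair:
    """One mirrored request: the production answer and the Hi-End shadow answer."""
    prod_text: str
    shadow_text: str
    prod_latency_s: float
    shadow_latency_s: float


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a placeholder for a clinical accuracy metric."""
    return SequenceMatcher(None, a, b).ratio()


def summarise(pairs: list[MirrorPair], min_similarity: float = 0.9) -> dict:
    """Share of shadow responses close enough to production, plus mean latency change."""
    if not pairs:
        raise ValueError("no mirrored pairs to compare")
    close = sum(similarity(p.prod_text, p.shadow_text) >= min_similarity for p in pairs)
    delta = sum(p.shadow_latency_s - p.prod_latency_s for p in pairs) / len(pairs)
    return {"agreement_rate": close / len(pairs), "mean_latency_delta_s": delta}
```

A negative `mean_latency_delta_s` means the Hi-End server answered faster than production on average; a low `agreement_rate` is the kind of signal that would hold the Day 9 cutover.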

6. 🔄 Migration Strategy

Mirror first, cutover second: On Day 8 the Hi-End server receives mirrored traffic but returns no responses to users · outputs are only compared for accuracy and latency
Per-tenant cutover: Day 9 cuts over Doc Zam first (highest comfort) · remaining tenants follow on Day 10
OpenAI fallback retained: LiteLLM proxy keeps OpenAI as a backup route · auto-failover if a Hi-End response takes > 10s or the error rate exceeds 5%
Per-model rollback: Each model can roll back independently · a Whisper rollback doesn't affect Llama
Audit log: M9 records which model + which version + which GPU served each request
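
The auto-failover rule above (response > 10s or error rate > 5%) can be sketched as a small sliding-window monitor. This is illustrative only and is not LiteLLM's own configuration or API: the class name, the 100-request window, and the choice to measure the error rate over recent requests are assumptions, since the plan does not state a measurement window.

```python
from collections import deque


class FailoverMonitor:
    """Tracks recent Hi-End outcomes and decides when to route to the backup."""

    def __init__(self, window: int = 100, max_latency_s: float = 10.0,
                 max_error_rate: float = 0.05):
        # Each entry is True for a failed request, False for a successful one.
        self.errors = deque(maxlen=window)
        self.max_latency_s = max_latency_s
        self.max_error_rate = max_error_rate

    def record(self, ok: bool) -> None:
        """Store the outcome of one Hi-End request; the oldest entry drops out when full."""
        self.errors.append(not ok)

    def error_rate(self) -> float:
        """Fraction of failed requests in the current window (0.0 when empty)."""
        return sum(self.errors) / len(self.errors) if self.errors else 0.0

    def should_failover(self, last_latency_s: float) -> bool:
        """Plan's rule: last response slower than 10 s, or windowed error rate above 5%."""
        return (last_latency_s > self.max_latency_s
                or self.error_rate() > self.max_error_rate)
```

In use, the gateway would call `record()` after every Hi-End request and consult `should_failover()` before deciding whether the next request goes to the OpenAI backup route.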

7. 💰 Cost Breakdown

Item	Capex	Monthly Opex	Notes
Hardware (L40S budget tier)	RM 280K	—	Amortised over 3 years = RM 7.8K/mo
Colocation rack (full unit)	—	RM 1.5K	Power + cooling + bandwidth
OS + maintenance	—	RM 0.5K	DevOps share-of-cost
OpenAI cloud burst (residual)	—	RM 1K	Down from RM 8K · backup only
Total monthly run-rate	—	RM 10.8K	Includes RM 7.8K/mo hardware amortisation · vs cloud-only ~RM 25K projected
Payback period	—	—	~14 months at projected 10-tenant load
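
The run-rate and payback arithmetic in the table can be reproduced with a short sketch. It assumes a flat RM 25K/month cloud cost from day one and treats payback as capex divided by monthly cash savings; the function names are hypothetical and tenant ramp-up is not modelled.

```python
def monthly_run_rate(capex_rm: float, amortise_months: int, opex_rm: list[float]) -> float:
    """Amortised hardware cost plus recurring opex, in RM per month."""
    return capex_rm / amortise_months + sum(opex_rm)


def payback_months(capex_rm: float, cloud_rm_per_month: float, opex_rm: list[float]) -> float:
    """Months until cumulative cash savings versus cloud-only cover the capex."""
    saving = cloud_rm_per_month - sum(opex_rm)
    if saving <= 0:
        raise ValueError("on-prem opex is not below the cloud cost; no payback")
    return capex_rm / saving


if __name__ == "__main__":
    opex = [1_500, 500, 1_000]  # colocation, OS + maintenance, residual OpenAI
    print(round(monthly_run_rate(280_000, 36, opex)))       # 10778, i.e. ~RM 10.8K
    print(round(payback_months(280_000, 25_000, opex), 1))  # 12.7 months at flat load
```

At a flat 10-tenant load this gives roughly 12.7 months; the table's ~14 months is plausibly explained by savings being smaller while the tenant count is still growing from 4, which this sketch leaves out.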

8. 👥 Team Capacity

Role	Allocation	Focus
DevOps	1.0 FTE	Hardware install · OS · driver · inference stack
Eng Lead	0.5 FTE	Architecture · LiteLLM config · failover
BE Dev	0.5 FTE	Inference gateway integration · audit log extension
Founder	0.5 FTE	Vendor liaison · cost approval · risk decisions

9. ⚡ Performance Targets

Metric	Mid-tier baseline	Hi-End target
Whisper p95 (10-min audio)	~8s	≤ 2.5s
SOAP generation p95	~6s	≤ 2s
Concurrent ambient SOAP	7	≥ 12
Concurrent chat sessions	15	≥ 30
CPG retrieval p95	~400ms	≤ 150ms
Inference availability	99.5%	≥ 99.9%
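
Checking measured latencies against the p95 targets above needs an agreed percentile definition. The sketch below uses the nearest-rank method, which is one common choice and an assumption here; the metric keys and function names are hypothetical, and the target values are taken from the table.

```python
def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile: smallest sample with at least 95% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) using integer arithmetic
    return ordered[rank - 1]


# Hi-End p95 targets from the table, in seconds.
TARGETS_S = {
    "whisper_10min": 2.5,
    "soap_generation": 2.0,
    "cpg_retrieval": 0.150,
}


def meets_target(metric: str, samples: list[float]) -> bool:
    """True if the measured p95 for `metric` is at or below its Hi-End target."""
    return p95(samples) <= TARGETS_S[metric]
```

With 20 samples the nearest-rank p95 is the 19th smallest value, so a single slow outlier does not fail the target but two do.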

10. 🛡️ Contingency

Risk	Trigger	Response
Hardware delivery slip	> 4 weeks late	Cloud burst extended · cloud GPU rental · MOH audit prep delayed 2 weeks
Power/cooling inadequate	Thermal alarm	De-rate to 3 GPUs · upgrade colo cooling · or relocate
Driver instability	NCCL crash	Pin known-stable driver · LTS branch only · TensorRT-LLM fallback
Mirror mode reveals accuracy gap	Quality regression	Hold cutover · iterate prompts · upgrade model size if needed
Tenant rejects on-prem cutover	Wants OpenAI assurance	LiteLLM route allow-list · keep tenant on cloud · charge premium