⚡ Hi-End Server Upgrade · 24 Dec 2026 - 6 Jan 2027 · 4× H100 GPU Deployment

The mid-tier inference server graduates to Hi-End: 4× H100 SXM (or 4× L40S as the budget alternative) · 2× redundant 100G NIC · NVMe storage tier · doubled DRAM. Targets: 10+ concurrent tenants · ambient SOAP p95 ≤ 2s · all LLM inference self-hosted. Ends routine dependence on OpenAI cloud burst; OpenAI stays only as an emergency fallback.

1. 🎯 Phase Summary

Phase: Hi-End Server Upgrade
Duration: 24 Dec 2026 - 6 Jan 2027 (2 weeks)
Goal: Migrate from mid-tier inference (1× L40S equivalent) to Hi-End (4× H100 SXM or 4× L40S budget). Self-host Whisper · Llama · Qwen · embed model · ColBERT reranker. End OpenAI cloud burst dependency for sensitive tenants.
Capacity: 2.5 FTE (1 DevOps + 0.5 Eng Lead + 0.5 Founder + 0.5 BE)
Critical milestone: 6 Jan 2027 · Hi-End live · MOH audit prep starts 7 Jan
Blocked by: Hardware procurement (lead time 4-6 weeks · order in Nov)
Blocks: MOH audit prep (7 Jan) requires demonstrable on-prem inference

2. 🤔 Why Upgrade Now

  • Tenant capacity: 4 tenants today · 10+ projected · mid-tier saturates at ~7 concurrent ambient SOAP sessions
  • OpenAI cost trajectory: RM 8K/month at 4 tenants · projected RM 25K/month at 10 tenants · Hi-End (L40S budget tier) pays back in ~14 months
  • Sensitive-tenant on-prem mandate: 2 of 5 prospect clinics require zero cloud leakage of clinical data · this means 100% local inference
  • MOH audit defensibility: Easier compliance story when inference is on-prem · audit log + model version + GPU allocation traceable
  • Performance ceiling: Mid-tier ambient SOAP p95 ≈ 6s · Hi-End targets ≤ 2s · meaningful UX win
  • Multi-model freedom: Run Whisper-large + Llama-70B + Qwen-72B + embed concurrently without VRAM thrash

3. 🖥️ Hardware Spec (Two Options)

PREFERRED · H100 SXM
Premium Tier
  • 4× NVIDIA H100 SXM 80GB
  • 2× AMD EPYC 9654 96-core
  • 1.5TB DDR5 ECC
  • 4× 7.68TB NVMe Gen5
  • 2× 100GbE NIC redundant
  • Dual 3kW redundant PSU
Capex: ~RM 1.2M · TDP ~6kW

BUDGET · L40S
Pragmatic Tier
  • 4× NVIDIA L40S 48GB
  • 2× AMD EPYC 9354 32-core
  • 768GB DDR5 ECC
  • 2× 7.68TB NVMe Gen4
  • 2× 25GbE NIC
  • 2× 1.6kW redundant PSU
Capex: ~RM 280K · TDP ~3kW
Decision: Founder + Eng Lead pick by 15 Nov, based on cohort onboarding traction. Default = L40S budget tier (~14-month payback) · upgrade to H100 SXM later if 10+ tenants are confirmed.
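
To make the trade-off between the two options more concrete, here is a rough weights-only VRAM estimate for a 70B-parameter model sharded across two GPUs, as in the Architecture Plan. It assumes weight memory is roughly parameter count times bytes per parameter and ignores KV cache, activations and runtime overhead. The precisions listed (FP16, FP8, INT4) are illustrative assumptions, not decisions recorded in this plan.

```python
# Rough weights-only VRAM estimate; ignores KV cache, activations and runtime overhead.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}  # assumed precisions

OPTIONS = {"H100 SXM": 80, "L40S": 48}  # VRAM per GPU in GB, from the spec above


def weights_gb(params_billion: float, precision: str) -> float:
    """Approximate weight footprint in GB (billions of params x bytes per param)."""
    return params_billion * BYTES_PER_PARAM[precision]


def fits(params_billion: float, precision: str, gpus: int, vram_gb: int) -> bool:
    """True if the weights alone fit in the pooled VRAM of `gpus` cards."""
    return weights_gb(params_billion, precision) <= gpus * vram_gb


if __name__ == "__main__":
    for name, vram in OPTIONS.items():
        for precision in BYTES_PER_PARAM:
            ok = fits(70, precision, gpus=2, vram_gb=vram)
            print(f"{name}: 70B @ {precision} on 2 GPUs -> "
                  f"{weights_gb(70, precision):.0f} GB of {2 * vram} GB, fits={ok}")
```

Under these assumptions a 70B model at FP16 (about 140 GB of weights) fits across two H100s (160 GB) but not across two L40S cards (96 GB), so the budget tier would likely need quantised weights. Real headroom is smaller once KV cache is counted.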

4. 🏗️ Architecture Plan

┌──────────────────────────────────────────────┐
│ MediEco App Tier (srv151 LSWS · 2 nodes)     │
│ Laravel 11 · Filament 3 · Redis cluster      │
└─────────────────┬────────────────────────────┘
                  │ HTTPS / mTLS
                  ▼
┌──────────────────────────────────────────────┐
│ Inference Gateway (LiteLLM proxy)            │
│ Routes:                                      │
│   /audio  → Whisper-large-v3                 │
│   /chat   → Llama-3.1-70B-Instruct           │
│   /reason → Qwen-2.5-72B (clinical)          │
│   /embed  → bge-m3 + ColBERT reranker        │
│   /vision → LLaVA-1.6 (document/photo)       │
└─────────────────┬────────────────────────────┘
                  │ NCCL · TensorRT-LLM
                  ▼
┌──────────────────────────────────────────────┐
│ Hi-End GPU Server (4× H100 / 4× L40S)        │
│ vLLM · Triton Inference Server               │
│ Per-model GPU allocation:                    │
│   GPU 0: Whisper + embed                     │
│   GPU 1: Llama 70B (sharded)                 │
│   GPU 2: Llama 70B (sharded)                 │
│   GPU 3: Qwen 72B + LLaVA                    │
│ Concurrent: 12 ambient SOAP · 30 chat        │
└─────────────────┬────────────────────────────┘
                  │ pgvector replicas
                  ▼
┌──────────────────────────────────────────────┐
│ Storage Tier (NVMe + cold backup)            │
│ MariaDB 11 cluster                           │
│ pgvector for CPG + RAG                       │
│ MinIO (S3-compatible) for audio + DICOM      │
└──────────────────────────────────────────────┘
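
The gateway's routing can be pictured as a prefix-to-model map. The sketch below is plain Python for illustration only and is not LiteLLM configuration. `ROUTES` and `resolve_route` are hypothetical names, the GPU indices mirror the allocation in the diagram, and placing the ColBERT reranker on GPU 0 alongside the embed model is an assumption.

```python
# Illustrative only: hypothetical names, not LiteLLM's actual config format.
ROUTES = {
    "/audio":  {"models": ["Whisper-large-v3"], "gpus": [0]},
    "/chat":   {"models": ["Llama-3.1-70B-Instruct"], "gpus": [1, 2]},
    "/reason": {"models": ["Qwen-2.5-72B"], "gpus": [3]},
    # Assumption: the reranker shares GPU 0 with the embed model.
    "/embed":  {"models": ["bge-m3", "ColBERT reranker"], "gpus": [0]},
    "/vision": {"models": ["LLaVA-1.6"], "gpus": [3]},
}


def resolve_route(path: str) -> dict:
    """Return the route entry whose prefix matches the request path."""
    for prefix, target in ROUTES.items():
        # Match the prefix exactly or as a parent path, so "/chatter" is not "/chat".
        if path == prefix or path.startswith(prefix + "/"):
            return target
    raise KeyError(f"no inference route for {path!r}")
```

For example, `resolve_route("/chat")` returns the Llama entry with GPUs 1 and 2, while an unknown path raises `KeyError` rather than silently falling through to a default model.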

5. 📅 Day-by-Day Plan

D1 · Wed 24 Dec · Hardware Receipt + Rack
Hardware delivered · rack inventory · power audit · UPS sized.
D2 · Thu 25 Dec · Cabling + Network
Power · cooling · 100G NIC trunk to switch · DNS records prep.
D3 · Fri 26 Dec · OS + Driver Install
Ubuntu 24.04 LTS · CUDA 12.4 · NVIDIA driver · cuDNN · NCCL test.
D4 · Mon 29 Dec · Inference Stack
vLLM + Triton + LiteLLM proxy install · per-model config.
D5 · Tue 30 Dec · Model Pull + Smoke Test
Whisper-large-v3 · Llama-3.1-70B · Qwen-2.5-72B · bge-m3. Smoke benchmark.
D6 · Wed 31 Dec · Year-End Burn-In
24h burn-in load test · thermals · stability · sustained throughput.
D7 · Thu 1 Jan · Holiday (Monitor Only)
On-call monitoring · auto-recover testing.
D8 · Fri 2 Jan · Traffic Mirror Mode
Production traffic mirrored to Hi-End · response compared · accuracy verified.
D9 · Mon 5 Jan · Cutover Tenant 1 (Doc Zam)
Tenant 1 routed to Hi-End · monitor 24h · rollback path tested.
D10 · Tue 6 Jan · Cutover All Tenants
All tenants on Hi-End · OpenAI fallback only for emergencies · MOH audit prep starts 7 Jan.
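
The Day 8 mirror step compares what Hi-End produced for a mirrored request against what production actually returned. A minimal sketch of such a comparison follows. The `MirrorSample` fields, the 0.9 similarity threshold, and the use of `difflib` text similarity as a crude stand-in for clinical accuracy review are all assumptions, not part of the plan.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class MirrorSample:
    """One request served by production and replayed on Hi-End (hypothetical shape)."""
    prod_text: str
    hiend_text: str
    prod_latency_s: float
    hiend_latency_s: float


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a crude proxy for output agreement."""
    return SequenceMatcher(None, a, b).ratio()


def summarise(samples: list[MirrorSample], min_similarity: float = 0.9) -> dict:
    """Count samples whose text diverges or whose Hi-End latency is worse."""
    divergent = sum(
        similarity(s.prod_text, s.hiend_text) < min_similarity for s in samples
    )
    slower = sum(s.hiend_latency_s > s.prod_latency_s for s in samples)
    return {"total": len(samples), "divergent": divergent, "slower": slower}
```

Divergent samples would still need human review: string similarity only flags where outputs differ, not which one is clinically correct.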

6. 🔄 Migration Strategy

  • Mirror first, cutover second: On Day 8 Hi-End receives mirrored traffic but returns no responses to users · outputs are only compared for accuracy and latency
  • Per-tenant cutover: Day 9 cuts over Doc Zam first (highest comfort) · remaining tenants follow on Day 10
  • OpenAI fallback retained: LiteLLM proxy keeps OpenAI as a backup route · auto-failover if Hi-End response time > 10s or error rate > 5%
  • Per-model rollback: Each model can be rolled back independently · a Whisper rollback doesn't affect Llama
  • Audit log: M9 records which model, which version, and which GPU served each request
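
The failover rule above (Hi-End response slower than 10s, or error rate above 5%) can be written as a small policy check. This is a sketch of the decision logic only, not LiteLLM's own fallback mechanism. The `FailoverPolicy` class, its method names, and the sliding window of the last 100 requests are assumptions, since the plan does not say over what period the error rate is measured.

```python
from collections import deque


class FailoverPolicy:
    """Tracks recent Hi-End outcomes and decides when to use the backup route."""

    def __init__(self, max_latency_s: float = 10.0, max_error_rate: float = 0.05,
                 window: int = 100):
        self.max_latency_s = max_latency_s
        self.max_error_rate = max_error_rate
        # Sliding window of recent results: True = error, False = success.
        self.errors = deque(maxlen=window)

    def record(self, ok: bool) -> None:
        """Record the outcome of one Hi-End request."""
        self.errors.append(not ok)

    def error_rate(self) -> float:
        """Fraction of failed requests in the window (0.0 when empty)."""
        return sum(self.errors) / len(self.errors) if self.errors else 0.0

    def should_failover(self, latency_s: float) -> bool:
        """True if this response was too slow or recent errors exceed the limit."""
        return (latency_s > self.max_latency_s
                or self.error_rate() > self.max_error_rate)
```

Both thresholds are strict, matching the ">" in the rule: exactly 5 errors in 100 requests does not trigger failover, while 6 does.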

7. 💰 Cost Breakdown

Item | Capex | Monthly Opex | Notes
Hardware (L40S budget tier) | RM 280K | - | 3-year amortisation = RM 7.8K/mo (included in total)
Colocation rack (full unit) | - | RM 1.5K | Power + cooling + bandwidth
OS + maintenance | - | RM 0.5K | DevOps share-of-cost
OpenAI cloud burst (residual) | - | RM 1K | Down from RM 8K · backup only
Total monthly run-rate | - | RM 10.8K | vs cloud-only ~RM 25K projected
Payback period | - | - | ~14 months at projected 10-tenant load
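
The run-rate and payback figures can be reproduced with a few lines of arithmetic. The sketch uses the numbers from the table. The payback formula (capex divided by monthly cash savings against the projected cloud-only bill, at a steady 10-tenant load) is an assumption, since the plan does not state how its ~14 months was derived.

```python
CAPEX_RM = 280_000          # L40S budget tier hardware
AMORTISE_MONTHS = 36        # 3-year amortisation
COLO_RM = 1_500             # colocation rack, per month
MAINT_RM = 500              # OS + maintenance, per month
OPENAI_RESIDUAL_RM = 1_000  # backup-only cloud burst, per month
CLOUD_ONLY_RM = 25_000      # projected cloud-only bill at 10 tenants


def monthly_run_rate() -> float:
    """Amortised hardware plus recurring monthly costs."""
    return CAPEX_RM / AMORTISE_MONTHS + COLO_RM + MAINT_RM + OPENAI_RESIDUAL_RM


def simple_payback_months() -> float:
    """Capex divided by monthly cash savings versus staying cloud-only (assumed method)."""
    cash_opex = COLO_RM + MAINT_RM + OPENAI_RESIDUAL_RM
    return CAPEX_RM / (CLOUD_ONLY_RM - cash_opex)


print(f"run-rate: RM {monthly_run_rate():,.0f}/month")
print(f"simple payback: {simple_payback_months():.1f} months")
```

This gives roughly RM 10.8K per month, matching the table, and a simple payback of about 12.7 months at a steady 10-tenant load. That is somewhat shorter than the ~14 months quoted, which would be consistent with lower savings while tenant count ramps up, though the plan does not spell this out.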

8. 👥 Team Capacity

Role | Allocation | Focus
DevOps | 1.0 FTE | Hardware install · OS · driver · inference stack
Eng Lead | 0.5 FTE | Architecture · LiteLLM config · failover
BE Dev | 0.5 FTE | Inference gateway integration · audit log extension
Founder | 0.5 FTE | Vendor liaison · cost approval · risk decisions

9. ⚡ Performance Targets

Metric | Mid-tier baseline | Hi-End target
Whisper p95 (10-min audio) | ~8s | ≤ 2.5s
SOAP generation p95 | ~6s | ≤ 2s
Concurrent ambient SOAP | 7 | ≥ 12
Concurrent chat sessions | 15 | ≥ 30
CPG retrieval p95 | ~400ms | ≤ 150ms
Inference availability | 99.5% | ≥ 99.9%
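
For checking the targets during the smoke benchmark and burn-in, a p95 can be computed from raw latency samples and the availability figures translated into a downtime budget. The sketch below uses the nearest-rank percentile definition and a 30-day month, both of which are assumptions. The plan does not say how p95 or availability will be measured.

```python
def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    ordered = sorted(samples)
    # ceil(0.95 * n) in integer arithmetic, as a 1-based rank.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def meets_target(samples: list[float], target_s: float) -> bool:
    """True if the measured p95 is at or below the target."""
    return p95(samples) <= target_s


def downtime_minutes_per_month(availability: float, days: int = 30) -> float:
    """Allowed downtime in minutes for a given availability over `days` days."""
    return (1 - availability) * days * 24 * 60
```

Under this definition, moving from 99.5% to 99.9% availability shrinks the monthly downtime budget from about 216 minutes to about 43 minutes.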

10. 🛡️ Contingency

Risk | Trigger | Response
Hardware delivery slip | > 4 weeks late | Cloud burst extended · cloud GPU rental · MOH audit prep delayed 2 weeks
Power/cooling inadequate | Thermal alarm | De-rate to 3 GPUs · upgrade colo cooling · or relocate
Driver instability | NCCL crash | Pin known-stable driver · LTS branch only · TensorRT-LLM fallback
Mirror mode reveals accuracy gap | Quality regression | Hold cutover · iterate prompts · upgrade model size if needed
Tenant rejects on-prem cutover | Wants OpenAI assurance | LiteLLM route allow-list · keep tenant on cloud · charge premium