Enterprise IT services — ITSM operations
AI-Powered Change Risk Engine That Predicts Incidents Before They Happen
Built an end-to-end change-risk assessment platform that mines years of ITSM history to predict whether a planned infrastructure change will cause an incident, and tells implementers exactly how to prevent it.
The challenge
What needed solving.
A global IT services provider was approving infrastructure changes with no systematic way to know which ones would blow up into incidents. The history existed (tens of thousands of change requests and the incidents they caused), but it was scattered across inconsistent CSV exports with drifting schemas. The hard part wasn't classification; it was making the output specific. Generic "be careful" advice gets ignored: implementers needed verdicts tied to actual past incidents, with concrete prevention steps and an auditable explanation of why the risk level was assigned.
What we built
The system.
- Ingestion pipeline normalizing ~100K change/incident records across drifting CSV schemas, linking changes to the incidents they caused
- Embedding pipeline (OpenAI text-embedding-3-small) populating two Qdrant collections — change-level and incident-level — for company-filtered semantic retrieval
- Deterministic risk-scoring service computing incident rate, severity weighting, and recency factors from retrieved history
- Agent-based LLM assessment layer that fuses retrieved evidence and computed metrics into a High/Medium/Low verdict with a grounded explanation, plus a rule-based fallback when the LLM is unavailable
- FastAPI backend with prediction, ticket-analysis, feedback, and history endpoints; daily retraining/re-embedding pipeline with run telemetry
- React + TypeScript frontend for analysts: change input, risk summary, similar-incident evidence, monthly patterns, and prompt customization
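To make the ingestion step in the first bullet concrete, here is a minimal sketch of mapping drifting CSV headers onto one canonical schema and then linking incidents back to the changes that caused them. The alias map, the field names, and the `caused_by_change` column are assumptions for illustration; the real exports and linkage rules are not described here.

```python
import csv
import io
from collections import defaultdict

# Hypothetical alias map: each canonical field lists header spellings that
# different export versions might use. All names here are assumptions.
COLUMN_ALIASES = {
    "change_id": {"change id", "change number", "cr number"},
    "description": {"description", "short description", "summary"},
    "company": {"company", "client", "organization"},
}


def normalize_header(name: str) -> str:
    """Lower-case and collapse separators so header variants compare equal."""
    cleaned = name.strip().lower().replace("-", " ").replace("_", " ")
    return " ".join(cleaned.split())


def canonical_rows(csv_text: str) -> list[dict[str, str]]:
    """Read one export and return rows keyed by canonical field names."""
    reader = csv.DictReader(io.StringIO(csv_text))
    mapping = {}  # raw header -> canonical field
    for raw in reader.fieldnames or []:
        key = normalize_header(raw)
        for canonical, aliases in COLUMN_ALIASES.items():
            if key in aliases:
                mapping[raw] = canonical
    return [
        {canonical: (record[raw] or "").strip() for raw, canonical in mapping.items()}
        for record in reader
    ]


def link_incidents(incidents: list[dict[str, str]]) -> dict[str, list[str]]:
    """Group incident ids under the change each one is attributed to."""
    links = defaultdict(list)
    for incident in incidents:
        change_id = incident.get("caused_by_change", "")
        if change_id:  # incidents with no attributed change are skipped
            links[change_id].append(incident["incident_id"])
    return dict(links)
```

Keeping the alias map as data rather than code means a newly observed header spelling becomes a one-line addition instead of a parser change.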
Architecture
The core engineering judgment was refusing to let a single model do everything. An early iteration tried exactly that — a binary classifier (Random Forest over embedding-derived features) predicting “will this change cause an incident?” It worked statistically but failed operationally: a probability with no evidence trail is something a change advisory board can’t act on. The production system splits the problem into three layers that each do what they’re best at.
Retrieval layer. Every historical change description is embedded with text-embedding-3-small and stored in Qdrant across two collections: one keyed to changes (filterable by client organization) and one keyed to the incidents those changes caused. When a new change comes in, the system pulls up to 200 semantically similar past changes above a similarity threshold, then walks the change→incident linkage to surface what actually went wrong last time someone did something like this. Dot-product search over normalized vectors keeps retrieval in the tens of milliseconds.
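The retrieval flow can be illustrated with an in-memory stand-in for the vector store. This is a sketch of the idea, not the Qdrant client API: the `ChangeRecord` type, the function names, and the 0.75 threshold are assumptions, while the limit of 200 comes from the description above. With unit-length vectors, the dot product equals cosine similarity, which is why a plain dot product suffices.

```python
import math
from dataclasses import dataclass


@dataclass
class ChangeRecord:
    change_id: str
    company: str
    vector: list[float]  # assumed to be unit-normalized when stored


def normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def similar_changes(query, records, company, threshold=0.75, limit=200):
    """Return (score, record) pairs for one company, best match first.

    threshold is an assumed value; limit mirrors the 200-result cap.
    """
    q = normalize(query)
    scored = [(dot(q, r.vector), r) for r in records if r.company == company]
    hits = [pair for pair in scored if pair[0] >= threshold]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return hits[:limit]


def linked_incidents(hits, change_to_incidents):
    """Walk the change -> incident linkage, keeping order and dropping repeats."""
    seen, ordered = set(), []
    for _, record in hits:
        for incident_id in change_to_incidents.get(record.change_id, []):
            if incident_id not in seen:
                seen.add(incident_id)
                ordered.append(incident_id)
    return ordered
```

In a real deployment the company filter and score threshold would be pushed down into the vector database query rather than applied in application code.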
Deterministic risk math. Rather than asking an LLM to “feel out” the risk, the backend computes hard metrics from the retrieved set — high-impact incident rate, a severity-weighted score, and a recency factor that up-weights incidents from the last few months. These numbers are auditable, reproducible, and independent of any model’s mood.
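A rough sketch of what such deterministic metrics might look like follows. The severity weights, the set of "high-impact" severities, the 90-day recency window, and the 1.5 boost are all assumed values chosen for illustration; the production formulas are not given here.

```python
from dataclasses import dataclass
from datetime import date

# All constants below are illustrative assumptions, not production values.
SEVERITY_WEIGHTS = {"critical": 1.0, "high": 0.7, "medium": 0.4, "low": 0.1}
HIGH_IMPACT = {"critical", "high"}
RECENT_DAYS = 90    # stand-in for "the last few months"
RECENT_BOOST = 1.5  # maximum up-weighting when every incident is recent


@dataclass
class Incident:
    severity: str
    opened: date


def risk_metrics(n_similar_changes: int, incidents: list[Incident], today: date) -> dict:
    """Compute reproducible metrics from the retrieved history."""
    if n_similar_changes == 0 or not incidents:
        return {"high_impact_rate": 0.0, "severity_score": 0.0, "recency_factor": 1.0}

    high_impact = sum(1 for i in incidents if i.severity in HIGH_IMPACT)
    weighted = sum(SEVERITY_WEIGHTS.get(i.severity, 0.0) for i in incidents)
    recent = sum(1 for i in incidents if (today - i.opened).days <= RECENT_DAYS)

    return {
        # Share of similar changes that led to a high-impact incident.
        "high_impact_rate": high_impact / n_similar_changes,
        # Average severity weight per similar change.
        "severity_score": weighted / n_similar_changes,
        # 1.0 when no incident is recent, RECENT_BOOST when all are.
        "recency_factor": 1.0 + (RECENT_BOOST - 1.0) * recent / len(incidents),
    }
```

Because the function depends only on its inputs and a passed-in `today`, the same retrieved set always yields the same numbers, which is the property that makes the metrics auditable.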
The LLM never decides the risk alone — it interprets evidence the system has already retrieved and quantified, which is what makes the output defensible to an enterprise change board.
Agent assessment layer. The metrics, similar incidents, monthly failure patterns, and severity breakdown are assembled into a structured prompt for a hosted LLM agent, which returns a constrained JSON verdict (High/Medium/Low) plus a short grounded explanation and prevention recommendations referencing the retrieved incidents. A rule-based fallback mirrors the same thresholds, so the API degrades gracefully instead of failing when the LLM is unreachable. A feedback loop lets analysts inject corrections that are fed into subsequent assessments, and a daily pipeline re-ingests new tickets, regenerates embeddings, and records run telemetry (merge counts, vector-creation timing, failures) for operational visibility.
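The graceful-degradation behavior can be sketched as a validation step wrapped around the LLM call. The JSON field names (`risk`, `explanation`), the thresholds of 0.30 and 0.10, and the `call_llm` callable are hypothetical; only the High/Medium/Low verdict and the existence of a rule-based fallback come from the description above.

```python
import json

ALLOWED_LEVELS = {"High", "Medium", "Low"}


def rule_based_verdict(metrics: dict) -> dict:
    """Fallback verdict from computed metrics. Thresholds are assumed values."""
    score = metrics["high_impact_rate"] * metrics["recency_factor"]
    if score >= 0.30:
        level = "High"
    elif score >= 0.10:
        level = "Medium"
    else:
        level = "Low"
    return {
        "risk": level,
        "explanation": f"Rule-based fallback: adjusted high-impact rate {score:.2f}.",
        "source": "fallback",
    }


def parse_verdict(raw):
    """Return a validated verdict dict, or None if the reply is unusable."""
    try:
        data = json.loads(raw)
    except (TypeError, ValueError):  # not a string, or not valid JSON
        return None
    if not isinstance(data, dict):
        return None
    risk, explanation = data.get("risk"), data.get("explanation")
    if not isinstance(risk, str) or risk not in ALLOWED_LEVELS:
        return None
    if not isinstance(explanation, str):
        return None
    return {"risk": risk, "explanation": explanation, "source": "llm"}


def assess(metrics: dict, call_llm) -> dict:
    """Prefer the LLM verdict; fall back to rules on any failure.

    call_llm is a hypothetical callable that takes the metrics and returns
    the model's raw text reply.
    """
    try:
        raw = call_llm(metrics)
    except Exception:  # network errors, timeouts, provider outages
        return rule_based_verdict(metrics)
    return parse_verdict(raw) or rule_based_verdict(metrics)
```

Tagging each verdict with its `source` lets the UI and the run telemetry distinguish LLM-backed assessments from fallback ones.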
         ┌────────────────────────────────────────────────────┐
         │               DAILY PIPELINE (CRON)                │
         │  CSV exports → normalize → link CR↔incident →      │
         │  embed → upsert Qdrant → run telemetry             │
         └───────────────┬────────────────────────────────────┘
                         │
┌────────────┐    ┌──────▼──────┐     ┌──────────────────┐
│  React UI  │───▶│   FastAPI   │────▶│ Qdrant (2 colls) │
│ (analyst)  │◀───│   backend   │◀────│ changes/incidents│
└────────────┘    │             │     └──────────────────┘
                  │  risk math  │
                  │ rate·sev·rec│     ┌──────────────────┐
                  │             │────▶│ LLM agent layer  │
                  │  fallback   │◀────│  (JSON verdict)  │
                  └──────┬──────┘     └──────────────────┘
                         │
                  ┌──────▼──────┐
                  │ PostgreSQL  │
                  │  history +  │
                  │  feedback   │
                  └─────────────┘
Results
Outcomes.
Stack
What it runs on.
- Python / FastAPI
- Qdrant
- OpenAI embeddings
- PostgreSQL
- pandas / scikit-learn
- React + TypeScript (Vite, Tailwind)