138 changes: 80 additions & 58 deletions docs/reservations/committed-resource-reservations.md
@@ -1,102 +1,124 @@
# Committed Resource Reservation System

The committed resource reservation system manages capacity commitments, i.e. strict reservation guarantees usable by projects.
The committed resource (CR) reservation system manages capacity commitments, i.e. strict reservation guarantees.
When customers pre-commit to resource usage, Cortex reserves capacity on hypervisors to guarantee availability.
The system integrates with Limes (via the LIQUID protocol) to receive commitments, expose usage and capacity data, and provide acceptance/rejection feedback.

## File Structure

```text
internal/scheduling/reservations/commitments/
├── config.go # Configuration (intervals, API flags, secrets)
├── controller.go # Reconciliation of reservations
├── syncer.go # Periodic sync task with Limes, ensures local state matches Limes' commitments
├── reservation_manager.go # Reservation CRUD operations
├── api.go # HTTP API initialization
├── api_change_commitments.go # Handles commitment changes from Limes and updates local reservations accordingly
├── api_report_usage.go # Report VM usage per project, accounting to commitments or PAYG
├── api_report_capacity.go # Report capacity per AZ
├── api_info.go # Readiness endpoint with versioning (of underlying flavor group configuration)
├── capacity.go # Capacity calculation from Hypervisor CRDs
├── usage.go # VM-to-commitment assignment logic
├── flavor_group_eligibility.go # Validates VMs belong to correct flavor groups
└── state.go # Commitment state helper functions
```
Cortex receives commitments, exposes usage and capacity data, and provides acceptance/rejection via APIs.

## Implementation

The CR reservation implementation is located in `internal/scheduling/reservations/commitments/`. Key components include:
- Controller logic (`controller.go`)
- API endpoints (`api_*.go`)
- Capacity and usage calculation logic (`capacity.go`, `usage.go`)
- Syncer for periodic state sync (`syncer.go`)

## Operations
## Configuration and Observability

### Configuration
**Configuration**: Helm values for intervals, API flags, and pipeline configuration are defined in `helm/bundles/cortex-nova/values.yaml`. Key configuration includes:
- API endpoint toggles (change-commitments, report-usage, report-capacity)
- Reconciliation intervals (grace period, active monitoring)
- Scheduling pipeline selection per flavor group

| Helm Value | Description |
|------------|-------------|
| `committedResourceEnableChangeCommitmentsAPI` | Enable/disable the change-commitments endpoint |
| `committedResourceEnableReportUsageAPI` | Enable/disable the usage reporting endpoint |
| `committedResourceEnableReportCapacityAPI` | Enable/disable the capacity reporting endpoint |
| `committedResourceRequeueIntervalActive` | How often to revalidate active reservations |
| `committedResourceRequeueIntervalRetry` | Retry interval when knowledge not ready |
| `committedResourceChangeAPIWatchReservationsTimeout` | Timeout waiting for reservations to become ready while processing commitment changes via API |
| `committedResourcePipelineDefault` | Default scheduling pipeline |
| `committedResourceFlavorGroupPipelines` | Map of flavor group to pipeline name |
| `committedResourceSyncInterval` | How often the syncer reconciles Limes commitments to Reservation CRDs |
**Metrics and Alerts**: Defined in `helm/bundles/cortex-nova/alerts/nova.alerts.yaml` with prefixes:
- `cortex_committed_resource_change_api_*`
- `cortex_committed_resource_usage_api_*`
- `cortex_committed_resource_capacity_api_*`

Each API endpoint can be disabled independently. The periodic sync task can be disabled by removing it (`commitments-sync-task`) from the list of enabled tasks in the `cortex-nova` Helm chart.
## Lifecycle Management

### Observability
### State (CRDs)
Defined in `api/v1/reservations_types.go`, which contains definitions for CR reservations and failover reservations (see [./failover-reservations.md](./failover-reservations.md)).

Alerts and metrics are defined in `helm/bundles/cortex-nova/alerts/nova.alerts.yaml`. Key metric prefixes:
- `cortex_committed_resource_change_api_*` - Change API metrics
- `cortex_committed_resource_usage_api_*` - Usage API metrics
- `cortex_committed_resource_capacity_api_*` - Capacity API metrics
A Reservation CRD represents a single reservation slot on a hypervisor; one slot can hold multiple VMs.
A single CR entry typically refers to multiple reservation CRDs (slots).

## Architecture Overview

### CR Reservation Lifecycle

```mermaid
flowchart LR
subgraph State
Res[(Reservation CRDs)]
end

ChangeAPI[Change API]
UsageAPI[Usage API]
Syncer[Syncer Task]
ChangeAPI[Change API]
CapacityAPI[Capacity API]
Controller[Controller]
UsageAPI[Usage API]
Scheduler[Scheduler API]

ChangeAPI -->|CRUD| Res
Syncer -->|CRUD| Res
UsageAPI -->|read| Res
CapacityAPI -->|read| Res
CapacityAPI -->|capacity request| Scheduler
Res -->|watch| Controller
Controller -->|update spec/status| Res
Controller -->|placement request| Scheduler
Controller -->|reservation placement request| Scheduler
```

Reservations are managed through the Change API, Syncer Task, and Controller reconciliation.

| Component | Event | Timing | Action |
|-----------|-------|--------|--------|
| **Change API / Syncer** | CR Create, Resize, Delete | Immediate/Hourly | Create/update/delete Reservation CRDs |
| **Controller** | Placement | On creation | Find host via scheduler API, set `TargetHost` |
| **Controller** | Optimize unused slots | >> minutes | Assign PAYG VMs or re-place reservations |

### VM Lifecycle

VM allocations are tracked within reservations:

```mermaid
flowchart LR
subgraph State
Res[(Reservation CRDs)]
end
A[Nova Scheduler] -->|VM Create/Migrate/Resize| B[Scheduling Pipeline]
B -->|update Spec.Allocations| Res
Res -->|watch| Controller
Res -->|>>min reconcile| Controller
Controller -->|update spec/status| Res
Controller --> E{Verify allocations}
```
| Component | Event | Timing | Action |
|-----------|-------|--------|--------|
| **Scheduling Pipeline** | Placement call for: VM Create, Migrate, Resize | Immediate | Update VM in `Spec.Allocations` |
| **Controller** | Reservation CRD update: `Status`/`Spec` `.Allocations` | 1 min | Verify via Nova API and Hypervisor CRD; update `Status`/`Spec` `.Allocations` |
| **Controller** | Regular VM lifecycle check (VM off, deletion); maybe watch Hypervisor CRD VMs | >> min | Verify allocations |

Reservations are managed through the Change API, Syncer Task, and Controller reconciliation. The Usage API provides read-only access to report usage data back to Limes.
**Allocation States**:
- `Spec.Allocations` — Expected VMs (from scheduling events)
- `Status.Allocations` — Confirmed VMs (verified on host)

**Note**: VM allocations may not consume all resources of a reservation slot. A reservation with 128 GB may have VMs totaling only 96 GB if that's what fits the project's needs. Conversely, allocations may exceed the reservation capacity (e.g., after a VM resize).
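
The arithmetic behind this note can be sketched as follows. This is a minimal illustration, not the Cortex implementation: `slotUsage`, its parameters, and the use of whole gigabytes are assumptions made for the example.

```go
package main

import "fmt"

// slotUsage is a hypothetical helper. It sums the memory of the VMs allocated
// to one reservation slot and reports how much of the slot is still free, or
// by how much the allocations exceed it. All values are in GB.
func slotUsage(reservedGB int, vmGB []int) (usedGB, freeGB, overGB int) {
	for _, gb := range vmGB {
		usedGB += gb
	}
	if usedGB <= reservedGB {
		return usedGB, reservedGB - usedGB, 0
	}
	return usedGB, 0, usedGB - reservedGB
}

func main() {
	// A 128 GB slot with VMs totaling 96 GB leaves 32 GB of the slot unused.
	used, free, over := slotUsage(128, []int{64, 32})
	fmt.Println(used, free, over) // prints "96 32 0"

	// After resizing the 32 GB VM to 96 GB, the allocations exceed the slot by 32 GB.
	used, free, over = slotUsage(128, []int{64, 96})
	fmt.Println(used, free, over) // prints "160 0 32"
}
```

Reporting the excess separately, instead of a negative free value, keeps the two cases from the note distinguishable for callers.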

### Change-Commitments API

The change-commitments API receives batched commitment changes from Limes. A request can contain multiple commitment changes across different projects and flavor groups. The semantic is **all-or-nothing**: if any commitment in the batch cannot be fulfilled (e.g., insufficient capacity), the entire request is rejected and rolled back.
The change-commitments API receives batched commitment changes from Limes and manages reservations accordingly.

**Request Semantics**: A request can contain multiple commitment changes across different projects and flavor groups. The semantic is **all-or-nothing** — if any commitment in the batch cannot be fulfilled (e.g., insufficient capacity), the entire request is rejected and rolled back.

Cortex performs CRUD operations on local Reservation CRDs to match the new desired state:
**Operations**: Cortex performs CRUD operations on local Reservation CRDs to match the new desired state:
- Creates new reservations for increased commitment amounts
- Deletes existing reservations
- Cortex preserves existing reservations that already have VMs allocated when possible
- Deletes existing reservations for decreased commitments
- Preserves existing reservations that already have VMs allocated when possible
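
The all-or-nothing semantic can be illustrated with a small sketch. It is an assumption-laden simplification: the `Change` type, the `applyBatch` function, and the flavor group names are hypothetical, and capacity is modeled as a plain count of free slots per flavor group. Instead of creating reservations and rolling them back, the sketch applies the batch to a working copy and only returns that copy if every change fits.

```go
package main

import (
	"errors"
	"fmt"
)

// Change is a hypothetical, simplified commitment change: a number of
// reservation slots added to (positive) or removed from (negative) one
// project's commitment in one flavor group.
type Change struct {
	Project     string
	FlavorGroup string
	SlotDelta   int
}

// ErrInsufficientCapacity marks a batch that cannot be fulfilled.
var ErrInsufficientCapacity = errors.New("insufficient capacity")

// applyBatch applies every change to a copy of the free-slot map. If any
// change does not fit, it returns an error and the caller's map is untouched,
// so the whole batch is rejected.
func applyBatch(free map[string]int, batch []Change) (map[string]int, error) {
	next := make(map[string]int, len(free))
	for group, n := range free {
		next[group] = n
	}
	for _, c := range batch {
		if next[c.FlavorGroup]-c.SlotDelta < 0 {
			return nil, fmt.Errorf("%s/%s: %w", c.Project, c.FlavorGroup, ErrInsufficientCapacity)
		}
		next[c.FlavorGroup] -= c.SlotDelta
	}
	return next, nil
}

func main() {
	free := map[string]int{"group-a": 3, "group-b": 1}

	// Both changes fit, so the batch is accepted.
	next, err := applyBatch(free, []Change{{"p1", "group-a", 2}, {"p2", "group-b", 1}})
	fmt.Println(next, err)

	// The second change does not fit, so the first one is not applied either.
	next, err = applyBatch(free, []Change{{"p1", "group-a", 2}, {"p2", "group-b", 2}})
	fmt.Println(next, err, free)
}
```

Working on a copy avoids an explicit undo step; a system that persists each reservation as it goes, as described above, needs a real rollback instead.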

### Syncer Task

The syncer task runs periodically and fetches all commitments from Limes. It syncs the local Reservation CRD state to match Limes' view of commitments.
The syncer task runs periodically and fetches all commitments from Limes. It syncs the local Reservation CRD state to match Limes' view of commitments. In theory, this sync should find no differences between the local state and the Limes state.
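
A rough sketch of the comparison such a sync performs is shown below. The `slotDiff` function, the commitment IDs, and the representation of state as slot counts per commitment are assumptions for illustration; the actual syncer works on Reservation CRDs.

```go
package main

import "fmt"

// slotDiff is a hypothetical helper. It compares how many reservation slots
// each commitment should have (desired, as reported by Limes) with how many
// exist locally. A positive value means slots to create, a negative value
// means slots to delete. Commitments that are already in sync are omitted.
func slotDiff(desired, local map[string]int) map[string]int {
	diff := make(map[string]int)
	for id, want := range desired {
		if d := want - local[id]; d != 0 {
			diff[id] = d
		}
	}
	// Local slots whose commitment no longer exists in Limes must be deleted.
	for id, have := range local {
		if _, ok := desired[id]; !ok && have != 0 {
			diff[id] = -have
		}
	}
	return diff
}

func main() {
	desired := map[string]int{"commitment-a": 2, "commitment-b": 1}
	local := map[string]int{"commitment-a": 2, "commitment-b": 3, "commitment-c": 1}
	fmt.Println(slotDiff(desired, local)) // prints "map[commitment-b:-2 commitment-c:-1]"
}
```

An empty result corresponds to the expected case in which the Change API has already kept the local state aligned with Limes.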

### Controller (Reconciliation)

The controller watches Reservation CRDs and performs reconciliation:
The controller watches Reservation CRDs and performs two types of reconciliation:

1. **For new reservations** (no target host assigned):
- Calls Cortex for scheduling to find a suitable host
- Assigns the target host and marks the reservation as Ready
**Placement** - Finds hosts for new reservations (calls scheduler API)

2. **For existing reservations** (already have a target host):
- Validates that allocated VMs are still on the expected host
- Updates allocations if VMs have migrated or been deleted
- Requeues for periodic revalidation
**Allocation Verification** - Tracks VM lifecycle on reservations:
- New VMs (< 15min): Checked via Nova API every 1 minute
- Established VMs: Checked via Hypervisor CRD every 5 min - 1 hour
- Missing VMs: Removed after verification
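
The choice between the short and the long recheck interval can be sketched as follows. The `Intervals` type and the `nextRequeue` and `demo` functions are hypothetical names for this example. The 15-minute window and 1-minute interval come from the list above; the 5-minute value is the lower bound of the range given for established VMs and is used here as an assumption.

```go
package main

import (
	"fmt"
	"time"
)

// Intervals holds the durations the verification loop needs (hypothetical type).
type Intervals struct {
	GracePeriod   time.Duration // window in which a new VM may still be unconfirmed
	RequeueNew    time.Duration // recheck interval while any VM is inside the window
	RequeueActive time.Duration // recheck interval once all VMs are established
}

// nextRequeue returns the short interval if at least one VM was allocated
// less than GracePeriod ago, and the long interval otherwise.
func nextRequeue(iv Intervals, allocatedAt []time.Time, now time.Time) time.Duration {
	for _, t := range allocatedAt {
		if now.Sub(t) < iv.GracePeriod {
			return iv.RequeueNew
		}
	}
	return iv.RequeueActive
}

// demo evaluates one reservation with a fresh VM and one with only old VMs.
func demo() (fresh, established time.Duration) {
	iv := Intervals{
		GracePeriod:   15 * time.Minute,
		RequeueNew:    1 * time.Minute,
		RequeueActive: 5 * time.Minute,
	}
	now := time.Date(2025, 1, 1, 12, 0, 0, 0, time.UTC)
	fresh = nextRequeue(iv, []time.Time{now.Add(-2 * time.Minute), now.Add(-time.Hour)}, now)
	established = nextRequeue(iv, []time.Time{now.Add(-time.Hour)}, now)
	return fresh, established
}

func main() {
	fresh, established := demo()
	fmt.Println(fresh, established) // prints "1m0s 5m0s"
}
```

Passing `now` explicitly instead of calling `time.Now()` inside the helper keeps the decision deterministic and easy to test.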

### Usage API

17 changes: 16 additions & 1 deletion internal/scheduling/reservations/commitments/config.go
@@ -11,10 +11,17 @@ import (

type Config struct {

// RequeueIntervalActive is the interval for requeueing active reservations for verification.
// RequeueIntervalActive is the interval for requeueing active reservations for periodic verification.
RequeueIntervalActive time.Duration `json:"committedResourceRequeueIntervalActive"`
// RequeueIntervalRetry is the interval for requeueing when retrying after knowledge is not ready.
RequeueIntervalRetry time.Duration `json:"committedResourceRequeueIntervalRetry"`
// AllocationGracePeriod is the time window after a VM is allocated to a reservation
// during which it's expected to appear on the target host. VMs not confirmed within
// this period are considered stale and removed from the reservation.
AllocationGracePeriod time.Duration `json:"committedResourceAllocationGracePeriod"`
// RequeueIntervalGracePeriod is the interval for requeueing when VMs are in grace period.
// Shorter than RequeueIntervalActive for faster verification of new allocations.
RequeueIntervalGracePeriod time.Duration `json:"committedResourceRequeueIntervalGracePeriod"`
// PipelineDefault is the default pipeline used for scheduling committed resource reservations.
PipelineDefault string `json:"committedResourcePipelineDefault"`

@@ -68,6 +75,12 @@ func (c *Config) ApplyDefaults() {
if c.RequeueIntervalRetry == 0 {
c.RequeueIntervalRetry = defaults.RequeueIntervalRetry
}
if c.RequeueIntervalGracePeriod == 0 {
c.RequeueIntervalGracePeriod = defaults.RequeueIntervalGracePeriod
}
if c.AllocationGracePeriod == 0 {
c.AllocationGracePeriod = defaults.AllocationGracePeriod
}
if c.PipelineDefault == "" {
c.PipelineDefault = defaults.PipelineDefault
}
@@ -88,6 +101,8 @@ func DefaultConfig() Config {
return Config{
RequeueIntervalActive: 5 * time.Minute,
RequeueIntervalRetry: 1 * time.Minute,
RequeueIntervalGracePeriod: 1 * time.Minute,
AllocationGracePeriod: 15 * time.Minute,
PipelineDefault: "kvm-general-purpose-load-balancing",
SchedulerURL: "http://localhost:8080/scheduler/nova/external",
ChangeAPIWatchReservationsTimeout: 10 * time.Second,