7. Monitoring
Monitoring is where the platform proves that the secure path is usable. A pipeline standard, runner tier, or policy gate is only valuable if the platform team can see when it is slow, failing, saturated, misconfigured, or quietly drifting away from the intended design.
For this platform, monitoring has two main surfaces:
- the EKS cluster that runs GitLab Runner workloads
- GitLab metrics that show whether developers can move changes through the delivery workflow
The goal is not to collect every possible metric. The goal is to collect enough signal to answer practical questions quickly: Are runners available? Are jobs waiting too long? Are pods failing to schedule? Are protected deployment runners healthy? Are pipelines failing because of application code, platform controls, registry access, or cluster capacity?
Monitoring in plain language
Monitoring means watching the system for known signals. Observability means collecting enough metrics, logs, traces, and events to investigate problems you did not predict.
In a delivery platform, both matter. Monitoring tells the platform team when a known threshold has been crossed. Observability helps explain why it happened.
Start with the user journeys that matter:
- a developer pushes code and opens a merge request
- a pipeline starts without unusual delay
- required scans complete successfully
- a runner pod schedules in the expected tier
- a Terraform plan is produced and reviewed
- a protected deployment job runs only through the approved path
- an artifact, SBOM, signature, and provenance record are produced
Those journeys turn monitoring into something concrete. Instead of asking whether “the platform” is up, the team asks whether the important delivery paths are working.
EKS cluster monitoring
EKS monitoring should be deployed as part of the platform baseline, not added manually after the cluster is live. AWS guidance emphasizes starting with critical business, application, infrastructure, and security metrics, deploying monitoring through infrastructure-as-code, validating the monitoring path, and refining coverage over time.
For runner clusters, the platform should monitor four layers:
| Layer | Signals to watch |
|---|---|
| Control plane | API server latency, API errors, admission failures, authentication failures, audit events, scheduler behavior |
| Nodes | CPU, memory, disk pressure, network errors, node readiness, autoscaler activity |
| Pods | pending pods, failed scheduling, restarts, image pull errors, eviction, pod startup time |
| Runner tiers | queue time, running jobs, request concurrency, failed jobs, protected runner availability |
Amazon EKS can expose control plane logs to CloudWatch Logs for audit and diagnostic use. For a runner platform, audit and authenticator logs are especially useful because they help explain who or what interacted with the cluster and whether authentication is failing.
For workload metrics, AWS now recommends OTel Container Insights for EKS through the amazon-cloudwatch-observability add-on. That path gives teams a managed way to collect Kubernetes workload telemetry while still leaving room for Prometheus-compatible collection where the organization already uses Prometheus, Amazon Managed Service for Prometheus, or Grafana.
Runner-specific monitoring
GitLab Runner exposes native Prometheus metrics from an embedded /metrics endpoint when the metrics server is enabled. For Helm-based runner deployments, the runner values can enable the metrics port and, where Prometheus Operator is used, a ServiceMonitor.
The platform should label runner metrics by tier. A single global runner dashboard hides the difference between a sandbox tier under normal feature-branch load and a protected deployment tier that cannot run production jobs.
Useful runner signals include:
- gitlab_runner_jobs_running_total
- gitlab_runner_concurrent
- gitlab_runner_limit
- gitlab_runner_request_concurrency
- gitlab_runner_request_concurrency_exceeded_total
- runner API request errors
- runner version information
- pod scheduling failures by namespace and tier
The exact metric names can vary by runner version and deployment model, so the implementation should scrape the live endpoint and build dashboards from the metrics the deployed runner actually exposes.
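To make the per-tier view concrete, here is a minimal sketch that parses a scraped /metrics payload and computes how full each tier is. It assumes the scrape pipeline has attached a `tier` label to each sample (the runner does not add one itself), that the metric names match the list above, and that the sample payload is invented for illustration. The parser handles only simple exposition lines without timestamps or escaped quotes.

```python
import re
from collections import defaultdict

# Matches simple exposition lines such as: name{label="value"} 12
# Assumption: no timestamps and no escaped quotes inside label values.
_LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_metrics(text):
    """Parse Prometheus text exposition into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _LINE.match(line)
        if not match:
            continue  # skip anything this simple parser does not understand
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def saturation_by_tier(samples,
                       running="gitlab_runner_jobs_running_total",
                       limit="gitlab_runner_concurrent"):
    """Return running jobs divided by configured concurrency, per tier."""
    running_sum = defaultdict(float)
    limit_sum = defaultdict(float)
    for name, labels, value in samples:
        tier = labels.get("tier", "unknown")  # hypothetical label added at scrape time
        if name == running:
            running_sum[tier] += value
        elif name == limit:
            limit_sum[tier] += value
    return {
        tier: running_sum[tier] / limit_sum[tier]
        for tier in limit_sum
        if limit_sum[tier] > 0
    }


# Invented sample payload; real output depends on the deployed runner version.
SAMPLE = """\
# sample data for illustration only
gitlab_runner_concurrent{tier="sandbox"} 20
gitlab_runner_jobs_running_total{tier="sandbox"} 5
gitlab_runner_concurrent{tier="protected"} 4
gitlab_runner_jobs_running_total{tier="protected"} 4
"""

if __name__ == "__main__":
    for tier, ratio in saturation_by_tier(parse_metrics(SAMPLE)).items():
        print(f"{tier}: {ratio:.0%} of configured concurrency in use")
```

In practice a Prometheus query would usually do this aggregation. The point of the sketch is that the metric names are passed in as parameters, so the dashboard logic can follow whatever the live endpoint actually exposes.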
GitLab delivery metrics
GitLab metrics should answer whether the delivery workflow is healthy from the developer’s point of view. There are several useful sources:
- GitLab Runner metrics show runner process and capacity behavior.
- GitLab Prometheus metrics expose internal GitLab service metrics for self-managed instances.
- CI/CD analytics show pipeline runs, duration, failure rate, and success rate.
- Pipeline efficiency guidance highlights pipeline health, job duration, pipeline duration, and API-based collection for external monitoring.
- DORA metrics show delivery outcomes such as deployment frequency, lead time for changes, time to restore service, and change failure rate.
For this platform, the baseline GitLab dashboard should include:
| Area | Metrics |
|---|---|
| Pipeline health | total runs, success rate, failure rate, canceled rate, median duration |
| Queue health | pending time, runner request concurrency, jobs waiting by runner tier |
| Job reliability | top failing jobs, retry rate, scan failure rate, artifact upload failures |
| Merge flow | merge request age, review latency, time to merge, approval bottlenecks |
| Deployment flow | deployment frequency, failed protected deployments, rollback events |
| Security flow | SAST failures, secret detection failures, dependency scan failures, policy-blocked merges |
| Cost and capacity | runner saturation, cost per pipeline minute, wasted minutes from repeated failures |
Some of these metrics come from GitLab UI analytics. Some come from Prometheus scraping. Some may need API polling or a purpose-built exporter if the organization wants long-term, cross-project trends. That is acceptable as long as the platform is clear about which source owns each metric.
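Where the pipeline health row is fed by API polling or an exporter, the aggregation itself is small. The sketch below assumes pipeline records have already been exported into dictionaries with a `status` field and a hypothetical `duration_s` field. It does not call the GitLab API, and the field names are illustrative rather than a documented schema.

```python
from statistics import median

# Only finished pipelines count toward the rates.
FINISHED = {"success", "failed", "canceled"}


def pipeline_health(pipelines):
    """Summarize finished pipelines into the baseline health metrics."""
    finished = [p for p in pipelines if p["status"] in FINISHED]
    total = len(finished)
    if total == 0:
        return {"total_runs": 0}

    def rate(status):
        return sum(1 for p in finished if p["status"] == status) / total

    durations = [p["duration_s"] for p in finished
                 if p.get("duration_s") is not None]
    return {
        "total_runs": total,
        "success_rate": rate("success"),
        "failure_rate": rate("failed"),
        "canceled_rate": rate("canceled"),
        "median_duration_s": median(durations) if durations else None,
    }


# Invented records for illustration.
EXAMPLE_PIPELINES = [
    {"status": "success", "duration_s": 300},
    {"status": "success", "duration_s": 420},
    {"status": "failed", "duration_s": 180},
    {"status": "canceled", "duration_s": 60},
    {"status": "running", "duration_s": None},  # ignored: not finished
]

if __name__ == "__main__":
    print(pipeline_health(EXAMPLE_PIPELINES))
```

Keeping the summary function separate from the collection step makes it easier to state which source owns each number, because the same function can be fed from analytics exports or from a custom exporter.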
Alerts
Alerts should focus on user impact and control failure, not every noisy implementation detail.
Good first alerts include:
- standard runner queue time stays above the SLO threshold
- protected deployment runner has no healthy capacity
- runner pods fail to schedule because of node, quota, image pull, or admission issues
- GitLab Runner cannot reach GitLab
- GitLab cannot reach required registries or package proxies
- required security scans fail for platform reasons across many projects
- artifact upload, SBOM generation, provenance generation, or signing fails across many projects
- EKS node pressure threatens runner capacity
- API server latency or authentication failures spike
Each alert should have an owner, severity, dashboard link, and runbook. If an alert does not lead to action, it should be removed or rewritten.
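The ownership rule can be checked mechanically before an alert is allowed to page anyone. The following is a small sketch that assumes alert definitions are kept as dictionaries in a hypothetical in-house catalog. The field names, alert names, and URLs are illustrative and are not tied to any particular alerting tool.

```python
REQUIRED_FIELDS = ("owner", "severity", "dashboard", "runbook")


def missing_alert_fields(alerts):
    """Return {alert name: [missing fields]} for incomplete alert definitions."""
    problems = {}
    for alert in alerts:
        # A field counts as missing if it is absent or empty.
        missing = [field for field in REQUIRED_FIELDS if not alert.get(field)]
        if missing:
            problems[alert.get("name", "<unnamed>")] = missing
    return problems


# Hypothetical catalog entries; names and URLs are placeholders.
EXAMPLE_ALERTS = [
    {
        "name": "protected-runner-no-capacity",
        "owner": "platform-team",
        "severity": "page",
        "dashboard": "https://grafana.example.internal/d/runner-tiers",
        "runbook": "https://docs.example.internal/runbooks/protected-runner-outage",
    },
    {
        "name": "runner-cannot-reach-gitlab",
        "owner": "platform-team",
        "severity": "page",
        "dashboard": "",
    },
]

if __name__ == "__main__":
    for name, fields in missing_alert_fields(EXAMPLE_ALERTS).items():
        print(f"{name}: missing {', '.join(fields)}")
```

A check like this fits naturally in the merge request pipeline for the alert catalog, so an incomplete alert is caught in review instead of during an incident.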
SLO seeds
The first SLOs should map to the delivery paths teams use every day:
```yaml
slos:
  - name: standard-runner-queue-latency
    target: 95
    objective: p95 <= 120s
    window: 7d
    source: gitlab-runner-prometheus
  - name: protected-deploy-runner-availability
    target: 99.5
    window: 30d
    source: runner-tier-health
  - name: platform-owned-pipeline-success-rate
    target: 99.0
    window: 30d
    source: gitlab-ci-analytics
  - name: eks-runner-pod-scheduling-success
    target: 99.0
    window: 7d
    source: kubernetes-events
```

These are seeds, not universal targets. The real thresholds should be tuned after the platform team has baseline data. A new platform should avoid pretending it knows the correct SLO before it has measured normal behavior.
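As a rough illustration of how two of these seeds could be evaluated, the sketch below computes a nearest-rank p95 over queue-time samples and converts an availability target into an error budget in minutes. The 120-second objective and the 99.5 percent, 30-day target come from the seeds above. The sample values are invented, and nearest-rank is only one of several percentile definitions, so a real implementation might rely on Prometheus histograms instead.

```python
def percentile(values, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, which avoids floating point rounding.
    rank = max(1, -(-pct * len(ordered) // 100))
    return ordered[rank - 1]


def queue_latency_met(queue_seconds, objective_seconds=120, pct=95):
    """True when the chosen percentile of queue time is within the objective."""
    return percentile(queue_seconds, pct) <= objective_seconds


def error_budget_minutes(target_percent, window_days):
    """Minutes of allowed unavailability for a target over a window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - target_percent) / 100


# Invented queue-time samples in seconds: mostly fast, with one slow outlier.
EXAMPLE_QUEUE_SECONDS = [30] * 18 + [110, 400]

if __name__ == "__main__":
    print("p95 queue time:", percentile(EXAMPLE_QUEUE_SECONDS, 95), "s")
    print("objective met:", queue_latency_met(EXAMPLE_QUEUE_SECONDS))
    print("protected runner budget:", error_budget_minutes(99.5, 30), "minutes")
```

The error budget figure is often the more useful number in conversation: a 99.5 percent target over 30 days leaves 216 minutes of protected runner unavailability before the SLO is missed.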
Retention and cost
Monitoring can become expensive if every pod, container, job, and label produces high-cardinality telemetry forever. AWS guidance calls out retention policy, sampling rate, aggregation, and log archival as part of a cost-aware monitoring design.
The baseline should define:
- short retention for high-cardinality pod and job details
- longer retention for aggregated SLO and trend data
- separate retention for audit logs and incident evidence
- label rules that avoid unbounded dimensions such as branch names where they are not needed
- dashboards that use aggregated views first and detailed views only for investigation
This keeps monitoring useful without turning it into another platform cost problem.
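One way to enforce the label rule is to count distinct values per label and flag anything above an agreed budget. This is a minimal sketch that assumes the label sets of active series have been exported as dictionaries. The budget of 3 and the sample labels are arbitrary illustrations, not recommended values.

```python
from collections import defaultdict


def label_cardinality(series):
    """Count distinct values per label across an iterable of label dicts."""
    values = defaultdict(set)
    for labels in series:
        for key, value in labels.items():
            values[key].add(value)
    return {key: len(seen) for key, seen in values.items()}


def labels_over_budget(series, budget):
    """Return the sorted label names whose distinct value count exceeds budget."""
    counts = label_cardinality(series)
    return sorted(key for key, count in counts.items() if count > budget)


# Invented label sets: "branch" grows with every feature branch, "tier" does not.
EXAMPLE_SERIES = [
    {"tier": "sandbox", "branch": "feature/a"},
    {"tier": "sandbox", "branch": "feature/b"},
    {"tier": "sandbox", "branch": "feature/c"},
    {"tier": "protected", "branch": "main"},
    {"tier": "sandbox", "branch": "feature/d"},
]

if __name__ == "__main__":
    print(label_cardinality(EXAMPLE_SERIES))
    print("over budget:", labels_over_budget(EXAMPLE_SERIES, budget=3))
```

Running a check like this against a sample of series before a new label ships gives the team a chance to drop or aggregate the label while the change is still cheap.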
Runbooks
Monitoring is not complete until the team knows what to do when a signal turns red. Runner and EKS runbooks should cover:
- runner queue saturation
- protected runner outage
- failed pod scheduling
- image pull and registry failures
- runner token or registration problems
- CloudWatch or Prometheus scrape failures
- sudden scan failure spikes
- Terraform plan or apply delays caused by shared platform dependencies
The runbooks should point to the owning team, the dashboard, the likely causes, the first checks, and the escalation path. They should also explain how to distinguish an application problem from a platform problem.
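One simple first check for separating platform problems from application problems is breadth: a job that fails in many unrelated projects at once is more likely to be failing for a shared reason. The sketch below encodes that heuristic. The threshold of three projects, the record fields, and the two labels are assumptions for illustration, and the result is a triage hint, not a diagnosis.

```python
from collections import defaultdict


def classify_failures(failures, min_projects=3):
    """Label each failing job by how many distinct projects it fails in."""
    projects_by_job = defaultdict(set)
    for failure in failures:
        projects_by_job[failure["job"]].add(failure["project"])
    return {
        job: "platform-suspect" if len(projects) >= min_projects
        else "application-suspect"
        for job, projects in projects_by_job.items()
    }


# Invented failure records from a single time window.
EXAMPLE_FAILURES = [
    {"job": "secret-detection", "project": "team-a/api"},
    {"job": "secret-detection", "project": "team-b/web"},
    {"job": "secret-detection", "project": "team-c/batch"},
    {"job": "unit-tests", "project": "team-a/api"},
    {"job": "unit-tests", "project": "team-a/api"},  # repeat in the same project
]

if __name__ == "__main__":
    for job, verdict in classify_failures(EXAMPLE_FAILURES).items():
        print(f"{job}: {verdict}")
```

A runbook can use the output to decide who looks first: broad failures go to the platform on-call, while failures confined to one project go back to the owning team.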
Baseline decision
The reference platform should start with these monitoring decisions:
- enable EKS control plane logging for audit and diagnostic value
- deploy OTel Container Insights or the organization’s approved Prometheus path through Terraform
- enable GitLab Runner Prometheus metrics on every runner tier
- label runner metrics by tier and environment
- collect GitLab CI/CD analytics and DORA metrics for project and group trends
- define SLO seeds for queue time, runner availability, scheduling success, and platform-owned pipeline success
- create runbooks before turning monitoring thresholds into paging alerts
Monitoring connects runner isolation to the supply chain and SRE work that comes next. Once the platform can see whether the delivery path is healthy, it can trust the evidence produced by that path and improve the system when teams feel friction.