7. Monitoring

Monitoring is where the platform proves that the secure path is usable. A pipeline standard, runner tier, or policy gate is only valuable if the platform team can see when it is slow, failing, saturated, misconfigured, or quietly drifting away from the intended design.

For this platform, monitoring has two main surfaces:

the EKS cluster that runs GitLab Runner workloads
GitLab metrics that show whether developers can move changes through the delivery workflow

The goal is not to collect every possible metric. The goal is to collect enough signal to answer practical questions quickly: Are runners available? Are jobs waiting too long? Are pods failing to schedule? Are protected deployment runners healthy? Are pipelines failing because of application code, platform controls, registry access, or cluster capacity?

Monitoring in plain language

Monitoring means watching the system for known signals. Observability means collecting enough metrics, logs, traces, and events to investigate problems you did not predict.

In a delivery platform, both matter. Monitoring tells the platform team when a known threshold has been crossed. Observability helps explain why it happened.

Start with the user journeys that matter:

a developer pushes code and opens a merge request
a pipeline starts without unusual delay
required scans complete successfully
a runner pod schedules in the expected tier
a Terraform plan is produced and reviewed
a protected deployment job runs only through the approved path
an artifact, SBOM, signature, and provenance record are produced

Those journeys turn monitoring into something concrete. Instead of asking whether “the platform” is up, the team asks whether the important delivery paths are working.

EKS cluster monitoring

EKS monitoring should be deployed as part of the platform baseline, not added manually after the cluster is live. AWS guidance emphasizes starting with critical business, application, infrastructure, and security metrics, deploying monitoring through infrastructure-as-code, validating the monitoring path, and refining coverage over time.

For runner clusters, the platform should monitor four layers:

Layer	Signals to watch
Control plane	API server latency, API errors, admission failures, authentication failures, audit events, scheduler behavior
Nodes	CPU, memory, disk pressure, network errors, node readiness, autoscaler activity
Pods	pending pods, failed scheduling, restarts, image pull errors, eviction, pod startup time
Runner tiers	queue time, running jobs, request concurrency, failed jobs, protected runner availability

Amazon EKS can expose control plane logs to CloudWatch Logs for audit and diagnostic use. For a runner platform, audit and authenticator logs are especially useful because they help explain who or what interacted with the cluster and whether authentication is failing.

For workload metrics, AWS now recommends OTel Container Insights for EKS through the amazon-cloudwatch-observability add-on. That path gives teams a managed way to collect Kubernetes workload telemetry while still leaving room for Prometheus-compatible collection where the organization already uses Prometheus, Amazon Managed Service for Prometheus, or Grafana.
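For the pod layer in the table above, the signals can be summarized from the standard Kubernetes pod status fields. The sketch below is a minimal illustration, not the platform's implementation: it assumes pod JSON shaped like the output of kubectl get pods -A -o json, and it assumes runner pods carry a label named runner-tier, which is a hypothetical name chosen for this example.

```python
from collections import Counter
from typing import Optional

# Container waiting reasons that usually point at registry or image problems.
IMAGE_PULL_REASONS = {"ErrImagePull", "ImagePullBackOff"}


def pod_problem(pod: dict) -> Optional[str]:
    """Return a short problem label for one pod, or None if nothing stands out."""
    status = pod.get("status") or {}
    # Evicted pods report the reason at the top level of the status.
    if status.get("reason") == "Evicted":
        return "Evicted"
    # A PodScheduled condition with status False means scheduling failed.
    for cond in status.get("conditions") or []:
        if cond.get("type") == "PodScheduled" and cond.get("status") == "False":
            return cond.get("reason") or "NotScheduled"
    # Image pull errors show up as a waiting reason on a container.
    for cs in status.get("containerStatuses") or []:
        waiting = (cs.get("state") or {}).get("waiting") or {}
        if waiting.get("reason") in IMAGE_PULL_REASONS:
            return "ImagePull"
    return None


def summarize_pod_problems(pod_list: dict, tier_label: str = "runner-tier") -> Counter:
    """Count problem pods by (namespace, tier, problem)."""
    counts: Counter = Counter()
    for pod in pod_list.get("items") or []:
        problem = pod_problem(pod)
        if problem is None:
            continue
        meta = pod.get("metadata") or {}
        # tier_label is an assumed label name; pods without it are grouped together.
        tier = (meta.get("labels") or {}).get(tier_label, "unlabeled")
        counts[(meta.get("namespace", "default"), tier, problem)] += 1
    return counts
```

Grouping by namespace and tier keeps a sandbox scheduling problem from being confused with a protected-tier one. In practice the same counts would more likely come from the cluster's metrics pipeline than from a script, so this is best read as a description of which fields matter.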

Runner-specific monitoring

GitLab Runner exposes native Prometheus metrics from an embedded /metrics endpoint when the metrics server is enabled. For Helm-based runner deployments, the runner values can enable the metrics port and, where Prometheus Operator is used, a ServiceMonitor.

The platform should label runner metrics by tier. A single global runner dashboard hides the difference between a sandbox tier under normal feature-branch load and a protected deployment tier that cannot run production jobs.

Useful runner signals include:

gitlab_runner_jobs_running_total
gitlab_runner_concurrent
gitlab_runner_limit
gitlab_runner_request_concurrency
gitlab_runner_request_concurrency_exceeded_total
runner API request errors
runner version information
pod scheduling failures by namespace and tier

The exact metric names can vary by runner version and deployment model, so the implementation should scrape the live endpoint and build dashboards from the metrics the deployed runner actually exposes.
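As a rough illustration of building from what the endpoint actually exposes, the sketch below parses a scraped /metrics payload and computes running jobs divided by configured concurrency for each tier. Several things here are assumptions: the two metric names are taken from the list above and may differ on the deployed runner version, the tier label is assumed to be attached at scrape time, and the parser handles only simple sample lines rather than the full Prometheus exposition format.

```python
import re
from collections import defaultdict

# Matches lines such as: metric_name{label="value"} 3
SAMPLE_LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_PAIR = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')

# Assumed names; confirm against the live endpoint before relying on them.
RUNNING_METRIC = "gitlab_runner_jobs_running_total"
CONCURRENT_METRIC = "gitlab_runner_concurrent"


def parse_metrics(text):
    """Parse simple exposition-format lines into (name, labels, value) tuples."""
    samples = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines plus HELP and TYPE comments
        match = SAMPLE_LINE.match(line)
        if not match:
            continue
        name, labels, value = match.groups()
        samples.append((name, dict(LABEL_PAIR.findall(labels or "")), float(value)))
    return samples


def saturation_by_tier(samples, tier_label="tier"):
    """Return running jobs / configured concurrency for each tier."""
    running = defaultdict(float)
    capacity = defaultdict(float)
    for name, labels, value in samples:
        tier = labels.get(tier_label, "unlabeled")
        if name == RUNNING_METRIC:
            running[tier] += value
        elif name == CONCURRENT_METRIC:
            capacity[tier] += value
    # Tiers with no reported capacity are left out rather than divided by zero.
    return {tier: running[tier] / cap for tier, cap in capacity.items() if cap > 0}
```

A value near 1.0 for a tier suggests that tier is close to its concurrency ceiling. If a metric name is missing from the scrape, the tier simply reports 0.0 or is absent, which is itself a useful prompt to check the names.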

GitLab delivery metrics

GitLab metrics should answer whether the delivery workflow is healthy from the developer’s point of view. There are several useful sources:

GitLab Runner metrics show runner process and capacity behavior.
GitLab Prometheus metrics expose internal GitLab service metrics for self-managed instances.
CI/CD analytics show pipeline runs, duration, failure rate, and success rate.
Pipeline efficiency guidance highlights pipeline health, job duration, pipeline duration, and API-based collection for external monitoring.
DORA metrics show delivery outcomes such as deployment frequency, lead time for changes, time to restore service, and change failure rate.

For this platform, the baseline GitLab dashboard should include:

Area	Metrics
Pipeline health	total runs, success rate, failure rate, canceled rate, median duration
Queue health	pending time, runner request concurrency, jobs waiting by runner tier
Job reliability	top failing jobs, retry rate, scan failure rate, artifact upload failures
Merge flow	merge request age, review latency, time to merge, approval bottlenecks
Deployment flow	deployment frequency, failed protected deployments, rollback events
Security flow	SAST failures, secret detection failures, dependency scan failures, policy-blocked merges
Cost and capacity	runner saturation, cost per pipeline minute, wasted minutes from repeated failures

Some of these metrics come from GitLab UI analytics. Some come from Prometheus scraping. Some may need API polling or a purpose-built exporter if the organization wants long-term, cross-project trends. That is acceptable as long as the platform is clear about which source owns each metric.
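Where API polling is the chosen source, the pipeline health row of the table reduces to simple arithmetic over pipeline records. The sketch below assumes each record is a dictionary with a status field and an optional duration in seconds, similar in shape to what the GitLab pipelines API returns. The function name and output keys are hypothetical.

```python
from statistics import median

# Only finished pipelines count toward the rates below.
FINISHED_STATUSES = {"success", "failed", "canceled"}


def pipeline_health(pipelines):
    """Summarize finished pipelines: run count, outcome rates, median duration."""
    finished = [p for p in pipelines if p.get("status") in FINISHED_STATUSES]
    total = len(finished)
    if total == 0:
        return {"total_runs": 0}

    def rate(status):
        return sum(1 for p in finished if p["status"] == status) / total

    # Canceled pipelines often have no duration, so missing values are skipped.
    durations = [p["duration"] for p in finished if p.get("duration") is not None]
    return {
        "total_runs": total,
        "success_rate": rate("success"),
        "failure_rate": rate("failed"),
        "canceled_rate": rate("canceled"),
        "median_duration_s": median(durations) if durations else None,
    }
```

Running and pending pipelines are excluded on purpose so that an in-progress burst does not distort the success rate. Whether skipped or manual pipelines should count is a policy choice this sketch leaves open.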

Alerts

Alerts should focus on user impact and control failure, not every noisy implementation detail.

Good first alerts include:

standard runner queue time stays above the SLO threshold
protected deployment runner has no healthy capacity
runner pods fail to schedule because of node, quota, image pull, or admission issues
GitLab Runner cannot reach GitLab
GitLab cannot reach required registries or package proxies
required security scans fail for platform reasons across many projects
artifact upload, SBOM generation, provenance generation, or signing fails across many projects
EKS node pressure threatens runner capacity
API server latency or authentication failures spike

Each alert should have an owner, severity, dashboard link, and runbook. If an alert does not lead to action, it should be removed or rewritten.
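The ownership rule above can be checked mechanically before an alert is allowed to page anyone. The following is a minimal sketch, assuming alerts are described by a small record type. The field names and the severity levels are hypothetical and would follow whatever the organization's alerting tool uses.

```python
from dataclasses import dataclass

# Fields every alert must fill in before it goes live.
REQUIRED_FIELDS = ("owner", "severity", "dashboard_url", "runbook_url")
# Hypothetical severity levels for this example.
KNOWN_SEVERITIES = {"page", "ticket", "info"}


@dataclass
class Alert:
    name: str
    owner: str = ""
    severity: str = ""
    dashboard_url: str = ""
    runbook_url: str = ""


def alert_gaps(alert: Alert) -> list:
    """Return the list of problems that should block this alert from going live."""
    gaps = [field for field in REQUIRED_FIELDS if not getattr(alert, field).strip()]
    if alert.severity.strip() and alert.severity not in KNOWN_SEVERITIES:
        gaps.append("severity (unknown level)")
    return gaps
```

For example, an alert defined with only a name and an owner would come back with severity, dashboard_url, and runbook_url listed as gaps, which makes the missing pieces visible in review instead of during an incident.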

SLO seeds

The first SLOs should map to the delivery paths teams use every day:

slos:
  # Queue latency: 95% of standard-tier jobs should wait no more than 120 seconds.
  - name: standard-runner-queue-latency
    target: 95        # percent
    objective: p95 <= 120s
    window: 7d
    source: gitlab-runner-prometheus
  - name: protected-deploy-runner-availability
    target: 99.5      # percent
    window: 30d
    source: runner-tier-health
  - name: platform-owned-pipeline-success-rate
    target: 99.0      # percent
    window: 30d
    source: gitlab-ci-analytics
  - name: eks-runner-pod-scheduling-success
    target: 99.0      # percent
    window: 7d
    source: kubernetes-events

These are seeds, not universal targets. The real thresholds should be tuned after the platform team has baseline data. A new platform should avoid pretending it knows the correct SLO before it has measured normal behavior.
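To make the seed values concrete, the sketch below shows one way to evaluate the queue latency objective and to translate an availability target into an error budget. It assumes queue times are available as a list of seconds per job and uses the nearest-rank percentile method, which is one of several reasonable definitions. The function names are hypothetical.

```python
import math


def percentile_nearest_rank(values, pct):
    """Return the pct-th percentile of values using the nearest-rank method."""
    if not values:
        raise ValueError("no samples to evaluate")
    ordered = sorted(values)
    # Rank is 1-based: the smallest rank covering at least pct percent of samples.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def queue_latency_ok(queue_seconds, pct=95, threshold_s=120):
    """True when the pct-th percentile queue time is within the threshold."""
    return percentile_nearest_rank(queue_seconds, pct) <= threshold_s


def error_budget_minutes(target_pct, window_days):
    """Minutes of allowed unavailability for a target over a window."""
    return (100 - target_pct) * window_days * 24 * 60 / 100
```

With these definitions, a 99.5 percent target over 30 days leaves 216 minutes of budget, and a 99.0 percent target over the same window leaves 432 minutes. Seeing the budget in minutes often makes it easier to judge whether a seed is realistic once baseline data arrives.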

Retention and cost

Monitoring can become expensive if every pod, container, job, and label produces high-cardinality telemetry forever. AWS guidance calls out retention policy, sampling rate, aggregation, and log archival as part of a cost-aware monitoring design.

The baseline should define:

short retention for high-cardinality pod and job details
longer retention for aggregated SLO and trend data
separate retention for audit logs and incident evidence
label rules that avoid unbounded dimensions such as branch names where they are not needed
dashboards that use aggregated views first and detailed views only for investigation

This keeps monitoring useful without turning it into another platform cost problem.
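The label rule in the list above can be checked with a simple cardinality count. The sketch below assumes each time series is represented by its label dictionary. The limit of 50 distinct values is an arbitrary placeholder for illustration, not a recommended threshold.

```python
from collections import defaultdict


def label_cardinality(series):
    """Count distinct values seen for each label name across all series."""
    values = defaultdict(set)
    for labels in series:
        for name, value in labels.items():
            values[name].add(value)
    return {name: len(seen) for name, seen in values.items()}


def unbounded_labels(series, limit=50):
    """Return label names whose distinct value count exceeds the limit."""
    counts = label_cardinality(series)
    return sorted(name for name, count in counts.items() if count > limit)
```

A label such as a branch name would typically show up here long before it shows up on the bill, while labels like tier or environment stay small and stable.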

Runbooks

Monitoring is not complete until the team knows what to do when a signal turns red. Runner and EKS runbooks should cover:

runner queue saturation
protected runner outage
failed pod scheduling
image pull and registry failures
runner token or registration problems
CloudWatch or Prometheus scrape failures
sudden scan failure spikes
Terraform plan or apply delays caused by shared platform dependencies

The runbooks should point to the owning team, the dashboard, the likely causes, the first checks, and the escalation path. They should also explain how to distinguish an application problem from a platform problem.
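One possible first check for separating application problems from platform problems is the failure reason recorded on each failed job. The sketch below assumes job records with status and failure_reason fields, as returned by the GitLab jobs API. The specific reason values and their grouping are assumptions that should be verified against the organization's GitLab version, and the mapping is a triage aid rather than a verdict.

```python
# Assumed failure_reason values that usually point at the platform.
PLATFORM_REASONS = {
    "runner_system_failure",
    "stuck_or_timeout_failure",
    "api_failure",
    "scheduler_failure",
    "data_integrity_failure",
}
# Assumed value for a job whose own script exited non-zero.
APPLICATION_REASONS = {"script_failure"}


def classify_failure(job):
    """Label a failed job as platform, application, or needs-triage."""
    reason = job.get("failure_reason")
    if reason in PLATFORM_REASONS:
        return "platform"
    if reason in APPLICATION_REASONS:
        return "application"
    return "needs-triage"


def platform_failure_share(jobs):
    """Fraction of failed jobs attributed to the platform (0.0 if none failed)."""
    failed = [job for job in jobs if job.get("status") == "failed"]
    if not failed:
        return 0.0
    platform = sum(1 for job in failed if classify_failure(job) == "platform")
    return platform / len(failed)
```

A script failure can still have a platform cause, for example when a registry outage makes a pull inside the script fail, so a sudden rise in application failures across many unrelated projects deserves the same attention as a rise in the platform share.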

Baseline decision

The reference platform should start with these monitoring decisions:

enable EKS control plane logging for audit and diagnostic value
deploy OTel Container Insights or the organization’s approved Prometheus path through Terraform
enable GitLab Runner Prometheus metrics on every runner tier
label runner metrics by tier and environment
collect GitLab CI/CD analytics and DORA metrics for project and group trends
define SLO seeds for queue time, runner availability, scheduling success, and platform-owned pipeline success
create runbooks before turning monitoring thresholds into paging alerts

Monitoring connects runner isolation to the supply chain and SRE work that comes next. Once the platform can see whether the delivery path is healthy, it can trust the evidence produced by that path and improve the system when teams feel friction.