
Hey everyone!
I'm happy to share that I've passed the Prometheus Certified Associate (PCA) exam. This was a really interesting deep dive into observability. Even though I had some experience with Prometheus, this exam forced me to really understand the internal mechanics—like how Service Discovery actually works, the nuances of rate vs irate, and how to construct complex PromQL queries without exploding the database.
I documented my entire learning process in the notes below. I hope they help you as much as they helped me!
My Secret Weapon for Passing
While official documentation is great, nothing beats practicing with realistic scenarios. I used this specific Udemy course to validate my knowledge before the real exam. It covered every edge case I encountered:
https://www.udemy.com/course/ultimate-prometheus-certified-associate-pca-practice-tests/
If you can score well on these practice tests, you are ready for the real thing.
Prerequisite Notes & "Gotchas"
I only cover the sections that will show up in the PCA exam here, because some of them you and I might never use or even hear about otherwise xD.
- I assume you have basic knowledge of Prometheus: how it works, how to install it, how it scrapes metrics, and how to query metrics via PromQL (Prometheus Query Language).
Sections you need to understand
Push Gateway
- It is used for short-lived (ephemeral) jobs, e.g. a CronJob/Job that dies after 15-30 seconds.
- Prometheus scrapes metrics on an interval, so such a job may be gone before the next scrape; the Pushgateway holds the metrics the CronJob/Job pushed to it, and Prometheus then scrapes them from the Pushgateway (see the scrape-config sketch below).
- Trap in the PCA exam: if they ask whether you should use the Pushgateway for a long-running service, mark it false!
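For context, here is a minimal prometheus.yml sketch for scraping a Pushgateway (the address is a placeholder); honor_labels: true is the important part, so the job/instance labels pushed by the batch job are kept instead of being overwritten by the scrape:
scrape_configs:
  - job_name: 'pushgateway'
    honor_labels: true   # keep the labels that the batch job pushed
    static_configs:
      - targets: ['pushgateway.example.local:9091']   # placeholder address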
Which operator is used for regex matching?
- Answer: =~, remember it xD
In the context of metrics, what is a Label?
- A key-value pair attached to a metric (e.g., method="POST") to add context and allow filtering.
What is a Gauge metric?
- A metric that represents a single numerical value that can arbitrarily go up and down.
- Example: memory_usage_bytes, cpu_temp_celsius
- PCA exam rule: never use rate() or increase() on a Gauge; Gauges are commonly used with min(), max(), avg(), sum(). Reason: a Gauge can go down, so rate() would give a negative or otherwise meaningless result.
How do you monitor requests per customer without exploding Prometheus?
- You don't do it with raw metrics; Prometheus should only count aggregates, e.g. how many requests returned a 5xx status.
- For individual requests you have to use logs/traces (Elastic/Loki/VictoriaLogs). I won't go into much detail since we are preparing for the PCA exam.
- But there is a secret weapon: Exemplars. Issue: you see a spike on a CPU-usage graph and want to jump to the request that caused it. Solution: use Exemplars, which let you attach a little metadata such as a trace_id to a sample, but the app needs instrumentation code to inject the trace_id into the metric (commonly done with the Prometheus or OpenTelemetry SDK). An example of what that looks like is shown below.
- PCA notes for Exemplars: they are used to link metrics to logs/traces, are used most with Histograms, and do not increase cardinality.
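To make that concrete, here is a sketch of a single sample carrying an exemplar in the OpenMetrics exposition format (the metric name, trace_id, value, and timestamp are made up for illustration):
http_request_duration_seconds_bucket{le="0.25"} 12345 # {trace_id="abc123def456"} 0.182 1700000000.123
Everything after the # is the exemplar: its labels (the trace_id), the observed value, and an optional timestamp.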
Instant Vector vs Range Vector
I had no idea about Instant/Range Vectors before, so this was my chance xD. An Instant Vector can be understood simply as a snapshot of the data at a single point in time.
- The functions rate(), increase(), delta() need a Range Vector as input, but their output is an Instant Vector.
- So: metric_name is an Instant Vector, metric_name[5m] is a Range Vector.
- Example: http_requests[5m] is a data set (Range Vector); rate(...) is a function that calculates the average rate over that data set, so the result, e.g. 2 req/s, is an Instant Vector.
Difference between rate and irate
rate() uses all data points in the range (smooth); irate() uses only the last two data points (spiky/instant), and it's good to remember that it ignores the rest of the range provided in []. A side-by-side sketch follows below.
- rate(): average rate, smooth line, should be used for alerting, bad for debugging.
- irate(): instant rate, jagged line, should not be used for alerting (too noisy), good for debugging.
- rate() = (Last - First) / Time (an average over the range).
- irate() with data points over 5 minutes (10, 20, 30, 40, 50) and a 15s scrape interval => (50 - 40) / 15 ≈ 0.67 per second (an instant look).
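A quick side-by-side in PromQL (the metric name is just an example):
rate(http_requests_total[5m])    # average per-second rate over the whole 5m window, use in alerts
irate(http_requests_total[5m])   # per-second rate from only the last two samples in the window, use for zoomed-in debugging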
What is a Counter metric?
- A metric that only increases; it never goes down (it only resets to 0 when the process restarts).
- Detailed documentation here: https://prometheus.io/docs/concepts/metric_types/#counter
- It is used for counts like http_requests_total or errors_total. For example, the metric http_requests_total at 9:00 AM has a counter value of 1000 requests; at 9:10 AM it is 1200 requests. In reality we don't care about the raw number 1200, we only care about how many requests arrived between 9:00 and 9:10. That is exactly what increase(http_requests_total[10m]) gives you.
Black-box Monitoring / Blackbox Exporter
- Link: https://github.com/prometheus/blackbox_exporter
- I think this was the first time I had heard of it, but after reading the documentation, it works with the same logic as the http/tcp monitoring plugins in Nagios/CheckMK.
- The Blackbox exporter returns some important metrics like these:
probe_success{instance="https://api.example.com/health"} 1 # 1 = OK, 0 = FAIL
probe_http_status_code{...} 200
probe_duration_seconds{...} 0.123
probe_ssl_earliest_cert_expiry{...} 1234567890
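For reference, a rough sketch of the usual scrape config for the Blackbox exporter (the exporter address and module name are assumptions; the relabeling passes the real target as the ?target= parameter of the probe):
scrape_configs:
  - job_name: 'blackbox-http'
    metrics_path: /probe
    params:
      module: [http_2xx]   # assumed module defined in blackbox.yml
    static_configs:
      - targets:
          - https://api.example.com/health
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target        # send the URL as ?target=
      - source_labels: [__param_target]
        target_label: instance              # keep the URL as the instance label
      - target_label: __address__
        replacement: blackbox-exporter:9115 # where the exporter actually runs (assumed)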
What is a Recording Rule?
- Key points for the exam: performance, no backfilling (a new rule only produces data from the moment it is created), and rules run every evaluation_interval.
- A little more on performance: recording rules exist to make Prometheus faster; instead of computing an expensive query every time a user refreshes a dashboard, the result is already calculated in the background.
- Documents: https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/
- Some best practices https://prometheus.io/docs/practices/rules/
- Naming convention: level:metric:operations. Example: job:http_inprogress_requests:sum
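A minimal rule-file sketch that matches this naming convention (the group name is arbitrary; the expression mirrors the example from the official docs):
groups:
  - name: example
    rules:
      - record: job:http_inprogress_requests:sum
        expr: sum by (job) (http_inprogress_requests)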
What is the suffix for the total count of observations in a Histogram?
- Answer: _count
- From the docs: the count of events that have been observed, exposed as _count (identical to _bucket{le="+Inf"} above)
Service Discovery
Wow, I never really understood this before, until I started preparing for the PCA exam. So, what have I learned from this section?
First, read the documentation.
To scrape metrics from a K8s cluster using Service Discovery, Prometheus requires a Service Account bound to a ClusterRole. This grants permissions to fetch targets/metadata (not metrics) directly from the K8s API. It uses the WATCH mechanism, so the K8s API actively notifies Prometheus of changes (New Pod Added, Pod Deleted...) instead of polling constantly.
The flow is:
- Discover: Get target list from API.
- Relabel: Filter or modify targets (e.g., keep only pods with specific annotations).
- Scrape: Prometheus connects to the Pod's IP to pull metrics.
# 1. Declare job in prometheus.yml
scrape_configs:
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod # <-- Tell it to get the fucking list of Pod for me.
# 2. Filtering step (Relabeling)
relabel_configs:
# If Pod doesn't have annotation 'scrape=true', drop it.
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
But wait, I wonder how it scrapes infrastructure metrics like CPU/Memory/Network usage? Node Exporter won't get that information for individual containers, right?
Exactly! It gets those metrics from the Kubelet API Port (10250). Here is the flow:
- Prometheus calls the API https://<Node-IP>:10250/metrics/cadvisor with Authorization: Bearer <SA_TOKEN>. Example for kube-prometheus-stack: link here
- The Kubelet (listening on port 10250) receives the scrape request.
- The Kubelet forwards the request to its internal cAdvisor module.
- cAdvisor reads directly from Cgroups (Linux Control Groups – where the Linux kernel tracks resource usage for each process/container) on the node.
- The Kubelet aggregates this data and returns it to Prometheus in metrics format.
Document to prove what I said: https://kubernetes.io/docs/concepts/cluster-administration/system-metrics/#metrics-in-kubernetes
You will see section related to cAdvisor like:
Note that kubelet also exposes metrics in /metrics/cadvisor, /metrics/resource and /metrics/probes endpoints.
Document for cAdvisor
cAdvisor (Container Advisor) provides container users an understanding of the resource usage and performance characteristics of their running containers. It is a running daemon that collects, aggregates, processes, and exports information about running containers. Specifically, for each container it keeps resource isolation parameters, historical resource usage, histograms of complete historical resource usage and network statistics. This data is exported by container and machine-wide.
Haha, even though I passed the CKA exam, I never knew about this deep dive! What great information to finally understand!
So with this, you can answer the question: how does kube-state-metrics differ from node-exporter? (In short: kube-state-metrics exposes the state of Kubernetes objects read from the API server, node-exporter exposes OS/hardware metrics from each node, and cAdvisor covers per-container resource usage.)
Bonus complete flow
┌─────────────────────────────────────────────────────────────┐
│ STEP 1: Prometheus ServiceAccount (SA) │
│ - SA has permission to list/watch nodes from API Server │
│ - SA receives token (JWT) automatically mounted into pod │
└────────────┬────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ STEP 2: Kubernetes Service Discovery │
│ - Prometheus uses SA token to query API Server │
│ - Gets list of nodes: GET /api/v1/nodes │
│ - API Server checks RBAC: Does SA have "list nodes" perm? │
│ - Returns: node1 (10.0.1.5), node2 (10.0.1.6)... │
└────────────┬────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ STEP 3: Scrape Kubelet API (IMPORTANT!) │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ Prometheus sends request: │ │
│ │ GET https://10.0.1.5:10250/metrics/cadvisor │ │
│ │ Header: Authorization: Bearer <SA_TOKEN> │ │
│ └───────────────────────────────────────────────────────┘ │
│ │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ Kubelet receives request: │ │
│ │ 1. Authentication: Verify token via API Server │ │
│ │ (Kubelet flag: --authentication-token-webhook) │ │
│ │ │ │
│ │ 2. Authorization: Check permissions │ │
│ │ (Kubelet flag: --authorization-mode=Webhook) │ │
│ │ - Kubelet sends SubjectAccessReview to API Server │ │
│ │ - API Server checks RBAC rules │ │
│ │ - Allow/Deny based on ClusterRole │ │
│ └───────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
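For Steps 1-3 above, the ClusterRole bound to the Prometheus ServiceAccount looks roughly like this (a sketch based on typical kube-prometheus setups; the exact resources and nonResourceURLs may differ in your distribution):
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
  - apiGroups: [""]
    resources: ["nodes", "nodes/metrics", "services", "endpoints", "pods"]
    verbs: ["get", "list", "watch"]
  - nonResourceURLs: ["/metrics", "/metrics/cadvisor"]
    verbs: ["get"]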
Prometheus core metric types?
Prometheus has 4 core types: Counter, Gauge, Histogram, Summary.
Histogram
Samples observations (like request durations or response sizes) and counts them in configurable buckets. It also provides a sum of all observed values. You can calculate quantiles from the server side using these buckets. Great for aggregating data across multiple instances.
Summary
Also samples observations but calculates quantiles (like p50, p95, p99) directly on the client side. It provides pre-calculated quantiles, count, and sum. Can't aggregate quantiles across multiple instances though - the quantiles are calculated per instance.
Difference between Summary and Histogram
- Histogram does server-side quantile calculation (more flexible for aggregation)
- Summary does client-side quantile calculation (more accurate but can't aggregate).
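To make the difference concrete, here is how you would typically get a p95 latency from each type (a sketch; http_request_duration_seconds is an assumed metric name):
# Histogram: quantile computed server-side from buckets, can be aggregated across instances
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
# Summary: quantile pre-computed client-side, just select it (cannot be meaningfully aggregated)
http_request_duration_seconds{quantile="0.95"}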
What does the RED Method stand for in Prometheus?
The 'RED Method' in monitoring, especially for microservices with tools like Prometheus, stands for three key request-focused metrics: Rate (requests per second), Errors (count of failed requests), and Duration (time taken by requests), giving a streamlined view of service health and performance from the user's perspective. (Copied from Google AI xD)
What does the 'USE Method' stand for?
The Utilization Saturation and Errors (USE) Method is a methodology for analyzing the performance of any system.
Some basic query operators you need to remember
Document: https://prometheus.io/docs/prometheus/latest/querying/basics/
- = : equal
- != : not equal
- =~ : regex match
- !~ : regex does not match
Some examples:
- Inside { ... }: use != (or !~) to exclude label values.
- To filter or compare sample values, use ==, >, ... outside the { ... }.
What does the offset modifier do?
Shifts the time of the query evaluation (e.g., http_requests_total offset 1w returns values from 1 week ago)
What is 'White-box Monitoring'?
Monitoring based on metrics exposed by the internals of the system (e.g., HTTP handler, JVM memory)
clamp_max() function?
- Used to cap the value of a metric at a maximum.
- Example: clamp_max(metric_here, 100). If metric_here is 120, it is capped at 100; if metric_here is 80, it stays 80. Get the idea?
- What is it used for? Fixing graph display: sometimes, somehow, CPU usage shows more than 100%, so capping it at 100 makes the graph look better! Or removing outlier/spike values like 99999 that make the graph nonsense xD
Vector matching
Another piece of knowledge I had never heard of before! Document here!
Enrichment - the most common use case
- The metric container_cpu_usage_seconds_total from cAdvisor tells you how much CPU Pod X is eating, but it doesn't tell you which Node Pod X runs on or which Image it is using.
- Your leader asks: "Hey, show me how much CPU containers running image version 2.0 are eating compared with containers running version 1.0." The original metric doesn't have an image label, so how do you group by it?
- Solution: join with the metric kube_pod_info; this guy carries all that information about the pod. See the queries below.
# Metric 1: CPU Usage (only have pod name, namespace)
container_cpu_usage_seconds_total
# Metric 2: Pod Info (have pod name, namespace and image version, host ip...)
kube_pod_info
# JOIN TOGETHER:
container_cpu_usage_seconds_total * on(pod, namespace) group_left(image) kube_pod_info
# Attach node and image
container_cpu * on(pod, namespace) group_left(node, image) kube_pod_info
# attach more label (if needed!)
container_cpu * on(pod, namespace) group_left(image, created_by_name, host_ip) kube_pod_info{created_by_kind="ReplicaSet"}
- Result: you still have the CPU usage series, but now with the image label attached.
- Explanation 1: group_left(image) means "take the image label from the right-hand side and attach it to the left-hand side".
- Explanation 2: it copies the image label from the matching series on the right to the series on the left.
Common Pitfalls
# BAD - Cardinality explosion
container_cpu * on() group_left(image) kube_pod_info
# GOOD - Specify matching labels
container_cpu * on(pod, namespace) group_left(image) kube_pod_info
# BAD - Wrong side (left has more series than right)
kube_pod_info * on(pod) group_left container_cpu
# GOOD
container_cpu * on(pod) group_left(image) kube_pod_info
The ignoring case:
metric1 * ignoring(instance, job) metric2: Prometheus ignores those labels when matching. It doesn't mean metric2 lacks them; both metrics can have instance and job, but during matching Prometheus ignores those labels.
What does absent(metric_name) return?
- It returns 1 if the metric does not exist, and an empty result if it does.
- Use case: alerting when a service has died or stopped sending a metric (see the rule sketch below).
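A minimal alerting-rule sketch built on absent(); the alert name, job label value, and thresholds are assumptions:
groups:
  - name: availability
    rules:
      - alert: MyAppMetricsMissing
        expr: absent(up{job="my-app"})
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "No up series for job my-app for 5 minutes"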
What is evaluation_interval?
evaluation_interval determines how frequently Prometheus evaluates recording rules and alerting rules.
Or simply: how frequently to evaluate rules.
Why is 'Sampling' (in instrumentation) bad for metrics?
Metrics should account for 100% of events (e.g., a total request count); sampling loses accuracy. Sampling belongs to traces, which are too heavy to keep in full.
Which function predicts the value of a gauge in 4 hours based on recent trend?
predict_linear(metric[range], 4*3600) - linear-regression prediction. Great for 'Disk Full in X hours' alerts.
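A classic example of the kind of alert expression this enables (a sketch; node_filesystem_avail_bytes and the mountpoint label are the usual node-exporter ones):
# Fire if, based on the last 6h trend, the root filesystem will be full within 4 hours
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 4*3600) < 0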
What is the difference between sum and sum_over_time?
sum() input: Instant Vector.
Example: sum(node_cpu_seconds_total) --> Adds up CPU usage of all nodes right now.
sum_over_time() input: Range Vector
Example: sum_over_time(http_requests_total[5m]) --> Adds up all scrape values of one specific metric over the last 5 minutes.
To clarify further: sum_over_time adds up all data points in the range for each time series; if there are multiple series (multiple instances), it returns multiple results, each one being the sum over a single series.
What is relabel_configs used for?
Modifying labels before the scrape (e.g., filtering targets, changing addresses)
How do you drop a target during discovery if a certain label is present?
To drop a target during discovery, you use relabel_configs with the action: drop.
- Example 1:
scrape_configs:
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
# Scenario: Drop if the label 'do_not_scrape' is PRESENT (any value)
- source_labels: [__meta_kubernetes_pod_label_do_not_scrape]
action: drop
regex: '.+' # Matches any string with at least 1 character
- Example 2:
scrape_configs:
- job_name: 'my-app'
# ...
metric_relabel_configs:
- source_labels: [__name__]
action: drop
# Use | to build the list of metrics to drop
regex: 'go_gc_duration_seconds|kubelet_volume_stats_inodes_used|other_metric_to_drop'
- Example 3:
relabel_configs:
# Rule 1: If env=test -> DROP
- source_labels: [__meta_kubernetes_pod_label_env]
action: drop
regex: test
# Rule 2: If role=backup -> DROP (a target that passes Rule 1 is still checked by Rule 2!)
- source_labels: [__meta_kubernetes_pod_label_role]
action: drop
regex: backup
Which operator has the highest precedence in Prometheus?
Remember this for the exam! From highest to lowest precedence:
- ^ (power)
- *, /, %
- +, -
- ==, !=, <=, <, >=, >
- and, unless
- or
If you want to attach a constant label env=prod to all metrics in a job, where do you put it?
Example
scrape_configs:
# Job 1: Scrape env PROD
- job_name: 'web-app-prod'
static_configs:
- targets: ['192.168.1.100:9090']
labels:
env: prod # <--- Here we go
region: us-east
# Job 2: Scrape env STAGING
- job_name: 'web-app-staging'
static_configs:
- targets: ['192.168.1.200:9090']
labels:
env: staging # <--- Different Label for this target!
region: us-east
What is Federation in Prometheus?
Short answer for exam:
A mechanism for a Prometheus server to scrape metrics from another Prometheus server
Document: https://prometheus.io/docs/prometheus/latest/federation/#use-cases
There are different use cases for federation. Commonly, it is used to either achieve scalable
Prometheus monitoring setups or to pull related metrics from one service's Prometheus into another.
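The documentation illustrates this with a /federate scrape job roughly like the sketch below (the match[] selectors and the source address are examples you would adapt):
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="prometheus"}'
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets:
          - 'source-prometheus-1:9090'   # the Prometheus being federated from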
What is the external_labels config in Prometheus used for?
Seems the same as labels under static_configs? No, it's not!
Instead of setting a label for a specific job, external_labels is set for the whole Prometheus instance.
It exists to identify the Prometheus instance.
Example: we have 2 Prometheus instances for HA, both scraping the same targets.
- Issue when alerting: both Prometheus instances will yell at the fucking Alertmanager. Without external_labels, Alertmanager would see them as 2 different issues --> duplicate alerts!
- Remote write: when writing data into Thanos for long-term storage, Thanos needs to know which Prometheus instance the metrics came from in order to deduplicate.
From the documentation page:
# The labels to add to any time series or alerts when communicating with
# external systems (federation, remote storage, Alertmanager).
# Environment variable references `${var}` or `$var` are replaced according
# to the values of the current environment variables.
# References to undefined variables are replaced by the empty string.
# The `$` character can be escaped by using `$$`.
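A minimal sketch of where this lives in prometheus.yml (the label names and values are just illustrative):
global:
  external_labels:
    cluster: prod-eu-1
    replica: prometheus-0   # differs per HA replica so downstream systems can deduplicate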
What is Subquery?
A subquery lets you run a query whose result is a Range Vector.
Syntax: query[<range>:<resolution>]
Example: rate(http_requests[5m])[30m:1m] or max_over_time( rate(http_requests_total[5m])[1h:1m] )
Explanation:
- [1h:1m] or [30m:1m] is the subquery part.
- [1h:1m]: over the last hour, at every 1-minute step, calculate the fucking rate for me!
- The result is a series of rate values (a Range Vector).
- After that, max_over_time finds the largest value in that range, which returns an Instant Vector.
Prometheus Subquery Cheat Sheet:
| Query Type | Returns | Example |
|---|---|---|
| Normal query | Instant vector | rate(http_requests_total[5m]) |
| With subquery | Range vector | rate(http_requests_total[5m])[1h:] |
| Subquery + _over_time function | Instant vector | max_over_time(rate(http_requests_total[5m])[1h:]) |
| Subquery + other aggregators (avg, min, last, etc.) | Instant vector | avg_over_time(rate(node_cpu_seconds_total[5m])[30m:10s]) |
Document: https://prometheus.io/docs/prometheus/latest/querying/basics/#subquery
You want to scrape a target via HTTPS with a custom CA. Which config param?
scrape_configs:
- job_name: 'scrape-secure-app'
# 1. must be https
scheme: https
static_configs:
- targets: ['my-internal-app.local:8443']
# 2. config TLS here
tls_config:
# path to the CA certificate file (.pem or .crt)
ca_file: /etc/prometheus/certs/root-ca.pem
# (Optional) if the server requires a client cert (mTLS), add the 2 lines below:
# cert_file: /etc/prometheus/certs/client.pem
# key_file: /etc/prometheus/certs/client.key
# (Optional) if you are lazy or just testing, you can bypass the CA check with this line (ofc, not recommended for fucking production):
# insecure_skip_verify: true
What is the danger of count(http_requests_total) without a by() clause?
Using count(http_requests_total) without a by() clause is dangerous because it aggregates everything into a single scalar value and strips away all labels, effectively destroying context.
- Returns a misleading value: without by (method, code, instance, ...), Prometheus counts the total number of time series named http_requests_total across the entire system and returns a single number.
- Logical error (count vs. sum): it is a common mistake to confuse PromQL's count() with SQL's COUNT(*).
- Reality: `count()` calculates the number of time series (the number of lines on the graph).
- Misconception: users often think it calculates the total volume of requests. To calculate total volume, you must use sum().
- Examples:
Imagine your system has 3 active time series:
http_requests_total{status="200"} = 1000
http_requests_total{status="404"} = 5
http_requests_total{status="500"} = 2
- The wrong query (logic error): count(http_requests_total) will return 3 (because there are 3 active series).
- The correct query for volume: sum(http_requests_total) will return 1007 (the total sum of all values).
- The correct query for cardinality: count(http_requests_total) by (status) will return
{status="200"} : 1
{status="404"} : 1
{status="500"} : 1
What is scalar() function used for?
Let me deep-dive into this, since I had often seen it before but never really understood how it works!
From Prometheus document:
Given an input vector that contains only one element with a float sample, scalar(v instant-vector) returns the sample value of that float sample as a scalar...
Haha, I understood nothing after reading that documentation. Let's take a real example, from KodeKloud:
$ process_start_time_seconds
process_start_time_seconds{instance="node1"} 1662763800
$ scalar(process_start_time_seconds)
scalar 1662763800
Note: it's worth mentioning that scalar() returns NaN (Not a Number) if the input vector does not have exactly one element, i.e. when there are 0 or more than 1 elements. This is a common exam "gotcha".
What does histogram_quantile output unit match?
- Document: https://prometheus.io/docs/prometheus/latest/querying/functions/#histogram_quantile
- The output unit of histogram_quantile matches the unit of the original metric (specifically, the unit of the values in the le label).
- Example:
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.05", method="POST"} 24054
http_request_duration_seconds_bucket{le="0.1", method="POST"} 33444
http_request_duration_seconds_bucket{le="0.2", method="POST"} 100392
How do you select metrics from yesterday?
Simple answer: offset
Example: http_requests_total offset 1d, or the rate for the same 5-minute window yesterday:
rate(http_requests_total[5m] offset 1d)
Note: the offset appears inside the rate() function; it means "take the 5 minutes of data from one day ago, then calculate the rate".
without - drop/remove labels
Syntax: <aggr-op> without (list-of-labels-to-remove) (metric_name)
Example: we have the metric http_requests_total with 4 labels:
{instance="10.0.0.1", method="POST", status="200", app="backend"}
and we don't care about instance, so we want to drop it from the result.
- With by: sum by (method, status, app) (http_requests_total)
- The same result with without: sum without (instance) (http_requests_total)
What is the purpose of the delta() function?
Definition: Calculates the difference between the last value and the first value in a specified time range.
Important:
- Do NOT use with Counters. It does not handle counter resets (e.g., server restarts).
- For Counters, use increase() instead.
- This appeared in the exam xD
Example: delta(node_filesystem_free_bytes[1h]) (Calculates how many bytes of disk space changed over the last hour).
Notes for the exam
- Be sure to understand the concepts of the four core metric types; they account for at least 5 questions in the exam!
- Understand how alert routing works in Alertmanager; the exam has questions like "which team will receive this alert?".
- The exam requires 75/100 to pass. To be honest, I didn't train seriously for it (lazy AF), but somehow I still passed with 79/100!
- Exam duration: 90 minutes
- Exam questions: 60 questions
That's a Wrap!
Hope these notes save you some time. If you spot any mistakes or have questions, feel free to reach out.
Now go pass that exam!!!