
Hey everyone!
I'm happy to share that I've passed the Prometheus Certified Associate (PCA) exam. This was a really interesting deep dive into observability. Even though I had some experience with Prometheus, this exam forced me to really understand the internal mechanics—like how Service Discovery actually works, the nuances of rate vs irate, and how to construct complex PromQL queries without exploding the database.
I documented my entire learning process in the notes below. I hope they help you as much as they helped me!
My Secret Weapon for Passing
While official documentation is great, nothing beats practicing with realistic scenarios. I used this specific Udemy course to validate my knowledge before the real exam. It covered every edge case I encountered:
https://www.udemy.com/course/ultimate-prometheus-certified-associate-pca-practice-tests/
If you can score well on these practice tests, you are ready for the real thing.
Prerequisite Notes & "Gotchas"
I only cover the sections that will show up in the PCA exam here, because some of them you and I might never use or even hear about otherwise xD.
- I assume you have basic knowledge of Prometheus: how it works, how to install it, how it scrapes metrics, and how to query metrics via PromQL (Prometheus Query Language).
Sections you need to understand
Push Gateway
- It is used for short-lived (ephemeral) jobs, e.g. a CronJob/Job that dies after 15-30 seconds.
- Prometheus scrapes metrics on an interval, so such a job may be gone before the next scrape; the Pushgateway holds the metrics the CronJob/Job pushed to it, and Prometheus then scrapes them from the Pushgateway (see the scrape-config sketch below).
- Trap in the PCA exam: if they ask whether you should use the Pushgateway for a long-running service, mark it false!
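For context, here is a minimal prometheus.yml sketch for scraping a Pushgateway (the address is a placeholder); honor_labels: true is the important part, so the job/instance labels pushed by the batch job are kept instead of being overwritten by the scrape:
scrape_configs:
  - job_name: 'pushgateway'
    honor_labels: true   # keep the labels that the batch job pushed
    static_configs:
      - targets: ['pushgateway.example.local:9091']   # placeholder address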
Which operator is used for regex matching?
- Answer: =~, remember it xD
In the context of metrics, what is a Label?
- A key-value pair attached to a metric (e.g., method="POST") to add context and allow filtering.
What is a Gauge metric?
- A metric that represents a single numerical value that can arbitrarily go up and down.
- Example: memory_usage_bytes, cpu_temp_celsius
- PCA exam rule: never use rate() or increase() on a Gauge; Gauges are commonly used with min(), max(), avg(), sum(). Reason: a Gauge can go down, so rate() would give a negative or otherwise meaningless result.
How do you monitor requests per customer without exploding Prometheus?
- You don't do it with raw metrics; Prometheus should only count aggregates, e.g. how many requests returned a 5xx status.
- For individual requests you have to use logs/traces (Elastic/Loki/VictoriaLogs). I won't go into much detail since we are preparing for the PCA exam.
- But there is a secret weapon: Exemplars. Issue: you see a spike on a CPU-usage graph and want to jump to the request that caused it. Solution: use Exemplars, which let you attach a little metadata such as a trace_id to a sample, but the app needs instrumentation code to inject the trace_id into the metric (commonly done with the Prometheus or OpenTelemetry SDK). An example of what that looks like is shown below.
- PCA notes for Exemplars: they are used to link metrics to logs/traces, are used most with Histograms, and do not increase cardinality.
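To make that concrete, here is a sketch of a single sample carrying an exemplar in the OpenMetrics exposition format (the metric name, trace_id, value, and timestamp are made up for illustration):
http_request_duration_seconds_bucket{le="0.25"} 12345 # {trace_id="abc123def456"} 0.182 1700000000.123
Everything after the # is the exemplar: its labels (the trace_id), the observed value, and an optional timestamp.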
Instant Vector vs Range Vector
I had no idea about Instant/Range Vectors before, so this was my chance xD. An Instant Vector can be understood simply as a snapshot of the data at a single point in time.
- The functions rate(), increase(), delta() need a Range Vector as input, but their output is an Instant Vector.
- So: metric_name is an Instant Vector, metric_name[5m] is a Range Vector.
- Example: http_requests[5m] is a data set (Range Vector); rate(...) is a function that calculates the average rate over that data set, so the result, e.g. 2 req/s, is an Instant Vector.
Difference between rate and irate
rate() uses all data points in the range (smooth); irate() uses only the last two data points (spiky/instant), and it's good to remember that it ignores the rest of the range provided in []. A side-by-side sketch follows below.
- rate(): average rate, smooth line, should be used for alerting, bad for debugging.
- irate(): instant rate, jagged line, should not be used for alerting (too noisy), good for debugging.
- rate() = (Last - First) / Time (an average over the range).
- irate() with data points over 5 minutes (10, 20, 30, 40, 50) and a 15s scrape interval => (50 - 40) / 15 ≈ 0.67 per second (an instant look).
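A quick side-by-side in PromQL (the metric name is just an example):
rate(http_requests_total[5m])    # average per-second rate over the whole 5m window, use in alerts
irate(http_requests_total[5m])   # per-second rate from only the last two samples in the window, use for zoomed-in debugging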
What is a Counter metric?
- A metric that only increases; it never goes down (it only resets to 0 when the process restarts).
- Detailed documentation here: https://prometheus.io/docs/concepts/metric_types/#counter
- It is used for counts like http_requests_total or errors_total. For example, the metric http_requests_total at 9:00 AM has a counter value of 1000 requests; at 9:10 AM it is 1200 requests. In reality we don't care about the raw number 1200, we only care about how many requests arrived between 9:00 and 9:10. That is exactly what increase(http_requests_total[10m]) gives you.
Black-box Monitoring / Blackbox Exporter
- Link: https://github.com/prometheus/blackbox_exporter
- I think this was the first time I had heard of it, but after reading the documentation, it works with the same logic as the http/tcp monitoring plugins in Nagios/CheckMK.
- The Blackbox exporter returns some important metrics like these:
probe_success{instance="https://api.example.com/health"} 1 # 1 = OK, 0 = FAIL
probe_http_status_code{...} 200
probe_duration_seconds{...} 0.123
probe_ssl_earliest_cert_expiry{...} 1234567890
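For reference, a rough sketch of the usual scrape config for the Blackbox exporter (the exporter address and module name are assumptions; the relabeling passes the real target as the ?target= parameter of the probe):
scrape_configs:
  - job_name: 'blackbox-http'
    metrics_path: /probe
    params:
      module: [http_2xx]   # assumed module defined in blackbox.yml
    static_configs:
      - targets:
          - https://api.example.com/health
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target        # send the URL as ?target=
      - source_labels: [__param_target]
        target_label: instance              # keep the URL as the instance label
      - target_label: __address__
        replacement: blackbox-exporter:9115 # where the exporter actually runs (assumed)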
What is a Recording Rule?
- Key points for the exam: performance, no backfilling (a new rule only produces data from the moment it is created), and rules run every evaluation_interval.
- A little more on performance: recording rules exist to make Prometheus faster; instead of computing an expensive query every time a user refreshes a dashboard, the result is already calculated in the background.
- Documents: https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/
- Some best practices https://prometheus.io/docs/practices/rules/
- Naming convention: level:metric:operations. Example: job:http_inprogress_requests:sum
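A minimal rule-file sketch that matches this naming convention (the group name is arbitrary; the expression mirrors the example from the official docs):
groups:
  - name: example
    rules:
      - record: job:http_inprogress_requests:sum
        expr: sum by (job) (http_inprogress_requests)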
What is the suffix for the total count of observations in a Histogram?
- Answer: _count
- From the docs: the count of events that have been observed, exposed as _count (identical to _bucket{le="+Inf"} above)
Service Discovery
Wow, I never really understood this before, until I started preparing for the PCA exam. So, what have I learned from this section?
First, read the documentation.
To scrape metrics from a K8s cluster using Service Discovery, Prometheus requires a Service Account bound to a ClusterRole. This grants permissions to fetch targets/metadata (not metrics) directly from the K8s API. It uses the WATCH mechanism, so the K8s API actively notifies Prometheus of changes (New Pod Added, Pod Deleted...) instead of polling constantly.
The flow is:
- Discover: Get target list from API.
- Relabel: Filter or modify targets (e.g., keep only pods with specific annotations).
- Scrape: Prometheus connects to the Pod's IP to pull metrics.
# 1. Declare job in prometheus.yml
scrape_configs:
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod # <-- Tell it to get the fucking list of Pod for me.
# 2. Filtering step (Relabeling)
relabel_configs:
# If Pod doesn't have annotation 'scrape=true', drop it.
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
But wait, I wonder how it scrapes infrastructure metrics like CPU/Memory/Network usage? Node Exporter won't get that information for individual containers, right?
Exactly! It gets those metrics from the Kubelet API Port (10250). Here is the flow:
- Prometheus calls the API https://<Node-IP>:10250/metrics/cadvisor with Authorization: Bearer <SA_TOKEN>. Example for kube-prometheus-stack: link here
- The Kubelet (listening on port 10250) receives the scrape request.
- The Kubelet forwards the request to its internal cAdvisor module.
- cAdvisor reads directly from Cgroups (Linux Control Groups – where the Linux kernel tracks resource usage for each process/container) on the node.
- The Kubelet aggregates this data and returns it to Prometheus in metrics format.
Document to prove what I said: https://kubernetes.io/docs/concepts/cluster-administration/system-metrics/#metrics-in-kubernetes
You will see section related to cAdvisor like:
Note that kubelet also exposes metrics in /metrics/cadvisor, /metrics/resource and /metrics/probes endpoints.
Document for cAdvisor
cAdvisor (Container Advisor) provides container users an understanding of the resource usage and performance characteristics of their running containers. It is a running daemon that collects, aggregates, processes, and exports information about running containers. Specifically, for each container it keeps resource isolation parameters, historical resource usage, histograms of complete historical resource usage and network statistics. This data is exported by container and machine-wide.
Haha, even though I passed the CKA exam, I never knew about this deep dive! What great information to finally understand!
So with this, you can answer the question: how does kube-state-metrics differ from node-exporter? (In short: kube-state-metrics exposes the state of Kubernetes objects read from the API server, node-exporter exposes OS/hardware metrics from each node, and cAdvisor covers per-container resource usage.)
Bonus complete flow
┌─────────────────────────────────────────────────────────────┐
│ STEP 1: Prometheus ServiceAccount (SA) │
│ - SA has permission to list/watch nodes from API Server │
│ - SA receives token (JWT) automatically mounted into pod │
└────────────┬────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ STEP 2: Kubernetes Service Discovery │
│ - Prometheus uses SA token to query API Server │
│ - Gets list of nodes: GET /api/v1/nodes │
│ - API Server checks RBAC: Does SA have "list nodes" perm? │
│ - Returns: node1 (10.0.1.5), node2 (10.0.1.6)... │
└────────────┬────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ STEP 3: Scrape Kubelet API (IMPORTANT!) │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ Prometheus sends request: │ │
│ │ GET https://10.0.1.5:10250/metrics/cadvisor │ │
│ │ Header: Authorization: Bearer <SA_TOKEN> │ │
│ └───────────────────────────────────────────────────────┘ │
│ │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ Kubelet receives request: │ │
│ │ 1. Authentication: Verify token via API Server │ │
│ │ (Kubelet flag: --authentication-token-webhook) │ │
│ │ │ │
│ │ 2. Authorization: Check permissions │ │
│ │ (Kubelet flag: --authorization-mode=Webhook) │ │
│ │ - Kubelet sends SubjectAccessReview to API Server │ │
│ │ - API Server checks RBAC rules │ │
│ │ - Allow/Deny based on ClusterRole │ │
│ └───────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
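For Steps 1-3 above, the ClusterRole bound to the Prometheus ServiceAccount looks roughly like this (a sketch based on typical kube-prometheus setups; the exact resources and nonResourceURLs may differ in your distribution):
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
  - apiGroups: [""]
    resources: ["nodes", "nodes/metrics", "services", "endpoints", "pods"]
    verbs: ["get", "list", "watch"]
  - nonResourceURLs: ["/metrics", "/metrics/cadvisor"]
    verbs: ["get"]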
Prometheus core metric types?
Prometheus has 4 core types: Counter, Gauge, Histogram, Summary.
Histogram
Samples observations (like request durations or response sizes) and counts them in configurable buckets. It also provides a sum of all observed values. You can calculate quantiles from the server side using these buckets. Great for aggregating data across multiple instances.
Summary
Also samples observations but calculates quantiles (like p50, p95, p99) directly on the client side. It provides pre-calculated quantiles, count, and sum. Can't aggregate quantiles across multiple instances though - the quantiles are calculated per instance.
Difference between Summary and Histogram
- Histogram does server-side quantile calculation (more flexible for aggregation)
- Summary does client-side quantile calculation (more accurate but can't aggregate).
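To make the difference concrete, here is how you would typically get a p95 latency from each type (a sketch; http_request_duration_seconds is an assumed metric name):
# Histogram: quantile computed server-side from buckets, can be aggregated across instances
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
# Summary: quantile pre-computed client-side, just select it (cannot be meaningfully aggregated)
http_request_duration_seconds{quantile="0.95"}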
What does the RED Method stand for in Prometheus?
The 'RED Method' in monitoring, especially for microservices with tools like Prometheus, stands for three key request-focused metrics: Rate (requests per second), Errors (count of failed requests), and Duration (time taken by requests), giving a streamlined view of service health and performance from the user's perspective. (Copied from Google AI xD)
What does the 'USE Method' stand for?
The Utilization Saturation and Errors (USE) Method is a methodology for analyzing the performance of any system.
Some basic query operators you need to remember
Document: https://prometheus.io/docs/prometheus/latest/querying/basics/
- = : equal
- != : not equal
- =~ : regex match
- !~ : regex does not match
Some examples:
- Inside { ... }: use != (or !~) to exclude label values.
- To filter or compare sample values, use ==, >, ... outside the { ... }.
What does the offset modifier do?
Shifts the time of the query evaluation (e.g., http_requests_total offset 1w returns values from 1 week ago)
What is 'White-box Monitoring'?
Monitoring based on metrics exposed by the internals of the system (e.g., HTTP handler, JVM memory)
clamp_max() function?
- Used to cap the value of a metric at a maximum.
- Example: clamp_max(metric_here, 100). If metric_here is 120, it is capped at 100; if metric_here is 80, it stays 80. Get the idea?
- What is it used for? Fixing graph display: sometimes, somehow, CPU usage shows more than 100%, so capping it at 100 makes the graph look better! Or removing outlier/spike values like 99999 that make the graph nonsense xD
Vector matching
Another piece of knowledge I had never heard of before! Document here!
Enrichment - the most common use case
- The metric container_cpu_usage_seconds_total from cAdvisor tells you how much CPU Pod X is eating, but it doesn't tell you which Node Pod X runs on or which Image it is using.
- Your leader asks: "Hey, show me how much CPU containers running image version 2.0 are eating compared with containers running version 1.0." The original metric doesn't have an image label, so how do you group by it?
- Solution: join with the metric kube_pod_info; this guy carries all that information about the pod. See the queries below.
# Metric 1: CPU Usage (only have pod name, namespace)
container_cpu_usage_seconds_total
# Metric 2: Pod Info (have pod name, namespace and image version, host ip...)
kube_pod_info
# JOIN TOGETHER:
container_cpu_usage_seconds_total * on(pod, namespace) group_left(image) kube_pod_info
# Attach node and image
container_cpu * on(pod, namespace) group_left(node, image) kube_pod_info
# attach more label (if needed!)
container_cpu * on(pod, namespace) group_left(image, created_by_name, host_ip) kube_pod_info{created_by_kind="ReplicaSet"}
- Result: you still have the CPU usage series, but now with the image label attached.
- Explanation 1: group_left(image) means "take the image label from the right-hand side and attach it to the left-hand side".
- Explanation 2: it copies the image label from the matching series on the right to the series on the left.
Common Pitfalls
# BAD - Cardinality explosion
container_cpu * on() group_left(image) kube_pod_info
# GOOD - Specify matching labels
container_cpu * on(pod, namespace) group_left(image) kube_pod_info
# BAD - Wrong side (left has more series than right)
kube_pod_info * on(pod) group_left container_cpu
# GOOD
container_cpu * on(pod) group_left(image) kube_pod_info
The ignoring case:
metric1 * ignoring(instance, job) metric2: Prometheus ignores those labels when matching. It doesn't mean metric2 lacks them; both metrics can have instance and job, but during matching Prometheus ignores those labels.
What does absent(metric_name) return?
- It returns 1 if the metric does not exist, and an empty result if it does.
- Use case: alerting when a service has died or stopped sending a metric (see the rule sketch below).
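A minimal alerting-rule sketch built on absent(); the alert name, job label value, and thresholds are assumptions:
groups:
  - name: availability
    rules:
      - alert: MyAppMetricsMissing
        expr: absent(up{job="my-app"})
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "No up series for job my-app for 5 minutes"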
What is evaluation_interval?
evaluation_interval determines how frequently Prometheus evaluates recording rules and alerting rules.
Or simply: how frequently to evaluate rules.
Why is 'Sampling' (in instrumentation) bad for metrics?
Metrics should account for 100% of events (e.g., a total request count); sampling loses accuracy. Sampling belongs to traces, which are too heavy to keep in full.
Which function predicts the value of a gauge in 4 hours based on recent trend?
predict_linear(metric[range], 4*3600) - linear-regression prediction. Great for 'Disk Full in X hours' alerts.
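A classic example of the kind of alert expression this enables (a sketch; node_filesystem_avail_bytes and the mountpoint label are the usual node-exporter ones):
# Fire if, based on the last 6h trend, the root filesystem will be full within 4 hours
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 4*3600) < 0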
What is the difference between sum and sum_over_time?
sum() input: Instant Vector.
Example: sum(node_cpu_seconds_total) --> Adds up CPU usage of all nodes right now.
sum_over_time() input: Range Vector
Example: sum_over_time(http_requests_total[5m]) --> Adds up all scrape values of one specific metric over the last 5 minutes.
To clarify further: sum_over_time adds up all data points in the range for each time series; if there are multiple series (multiple instances), it returns multiple results, each one being the sum over a single series.
What is relabel_configs used for?
Modifying labels before the scrape (e.g., filtering targets, changing addresses)
How do you drop a target during discovery if a certain label is present?
To drop a target during discovery, you use relabel_configs with the action: drop.
- Example 1:
scrape_configs:
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
# Scenario: Drop if the label 'do_not_scrape' is PRESENT (any value)
- source_labels: [__meta_kubernetes_pod_label_do_not_scrape]
action: drop
regex: '.+' # Matches any string with at least 1 character
- Example 2:
scrape_configs:
- job_name: 'my-app'
# ...
metric_relabel_configs:
- source_labels: [__name__]
action: drop
# Use | to build the list of metrics to drop
regex: 'go_gc_duration_seconds|kubelet_volume_stats_inodes_used|other_metric_to_drop'
- Example 3:
relabel_configs:
# Rule 1: If env=test -> DROP
- source_labels: [__meta_kubernetes_pod_label_env]
action: drop
regex: test
# Rule 2: If role=backup -> DROP (a target that passes Rule 1 is still checked by Rule 2!)
- source_labels: [__meta_kubernetes_pod_label_role]
action: drop
regex: backup
Which operator has the highest precedence in Prometheus?
Remember this for the exam! From highest to lowest precedence:
- ^ (power)
- *, /, %
- +, -
- ==, !=, <=, <, >=, >
- and, unless
- or
If you want to attach a constant label env=prod to all metrics in a job, where do you put it?
Example
scrape_configs:
# Job 1: Scrape env PROD
- job_name: 'web-app-prod'
static_configs:
- targets: ['192.168.1.100:9090']
labels:
env: prod # <--- Here we go
region: us-east
# Job 2: Scrape env STAGING
- job_name: 'web-app-staging'
static_configs:
- targets: ['192.168.1.200:9090']
labels:
env: staging # <--- Different Label for this target!
region: us-east
What is Federation in Prometheus?
Short answer for exam:
A mechanism for a Prometheus server to scrape metrics from another Prometheus server
Document: https://prometheus.io/docs/prometheus/latest/federation/#use-cases
There are different use cases for federation. Commonly, it is used to either achieve scalable
Prometheus monitoring setups or to pull related metrics from one service's Prometheus into another.
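The documentation illustrates this with a /federate scrape job roughly like the sketch below (the match[] selectors and the source address are examples you would adapt):
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="prometheus"}'
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets:
          - 'source-prometheus-1:9090'   # the Prometheus being federated from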
What is the external_labels config in Prometheus used for?
Seems the same as labels under static_configs? No, it's not!
Instead of setting a label for a specific job, external_labels is set for the whole Prometheus instance.
It exists to identify the Prometheus instance.
Example: we have 2 Prometheus instances for HA, both scraping the same targets.
- Issue when alerting: both Prometheus instances will yell at the fucking Alertmanager. Without external_labels, Alertmanager would see them as 2 different issues --> duplicate alerts!
- Remote write: when writing data into Thanos for long-term storage, Thanos needs to know which Prometheus instance the metrics came from in order to deduplicate.
From the documentation page:
# The labels to add to any time series or alerts when communicating with
# external systems (federation, remote storage, Alertmanager).
# Environment variable references `${var}` or `$var` are replaced according
# to the values of the current environment variables.
# References to undefined variables are replaced by the empty string.
# The `$` character can be escaped by using `$$`.
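A minimal sketch of where this lives in prometheus.yml (the label names and values are just illustrative):
global:
  external_labels:
    cluster: prod-eu-1
    replica: prometheus-0   # differs per HA replica so downstream systems can deduplicate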
What is Subquery?
A subquery lets you run a query whose result is a Range Vector.
Syntax: query[<range>:<resolution>]
Example: rate(http_requests[5m])[30m:1m] or max_over_time( rate(http_requests_total[5m])[1h:1m] )
Explanation:
- [1h:1m] or [30m:1m] is the subquery part.
- [1h:1m]: over the last hour, at every 1-minute step, calculate the fucking rate for me!
- The result is a series of rate values (a Range Vector).
- After that, max_over_time finds the largest value in that range, which returns an Instant Vector.
Prometheus Subquery Cheat Sheet:
| Query Type | Returns | Example |
|---|---|---|
| Normal query | Instant vector | rate(http_requests_total[5m]) |
| With subquery | Range vector | rate(http_requests_total[5m])[1h:] |
| Subquery + _over_time function | Instant vector | max_over_time(rate(http_requests_total[5m])[1h:]) |
| Subquery + other aggregators (avg, min, last, etc.) | Instant vector | avg_over_time(rate(node_cpu_seconds_total[5m])[30m:10s]) |
Document: https://prometheus.io/docs/prometheus/latest/querying/basics/#subquery
You want to scrape a target via HTTPS with a custom CA. Which config param?
scrape_configs:
- job_name: 'scrape-secure-app'
# 1. must be https
scheme: https
static_configs:
- targets: ['my-internal-app.local:8443']
# 2. config TLS here
tls_config:
# path to the CA certificate file (.pem or .crt)
ca_file: /etc/prometheus/certs/root-ca.pem
# (Optional) if the server requires a client cert (mTLS), add the 2 lines below:
# cert_file: /etc/prometheus/certs/client.pem
# key_file: /etc/prometheus/certs/client.key
# (Optional) if you are lazy or just testing, you can bypass the CA check with this line (ofc, not recommended for fucking production):
# insecure_skip_verify: true
What is the danger of count(http_requests_total) without a by() clause?
Using count(http_requests_total) without a by() clause is dangerous because it aggregates everything into a single scalar value and strips away all labels, effectively destroying context.
- Returns a misleading value: without by (method, code, instance, ...), Prometheus counts the total number of time series named http_requests_total across the entire system and returns a single number.
- Logical error (count vs. sum): it is a common mistake to confuse PromQL's count() with SQL's COUNT(*).
- Reality: `count()` calculates the number of time series (the number of lines on the graph).
- Misconception: users often think it calculates the total volume of requests. To calculate total volume, you must use sum().
- Examples:
Imagine your system has 3 active time series:
http_requests_total{status="200"} = 1000
http_requests_total{status="404"} = 5
http_requests_total{status="500"} = 2
- The wrong query (logic error): count(http_requests_total) will return 3 (because there are 3 active series).
- The correct query for volume: sum(http_requests_total) will return 1007 (the total sum of all values).
- The correct query for cardinality: count(http_requests_total) by (status) will return
{status="200"} : 1
{status="404"} : 1
{status="500"} : 1
What is scalar() function used for?
Let me deep-dive into this, since I had often seen it before but never really understood how it works!
From Prometheus document:
Given an input vector that contains only one element with a float sample, scalar(v instant-vector) returns the sample value of that float sample as a scalar...
Haha, I understood nothing after reading that documentation. Let's take a real example, from KodeKloud:
$ process_start_time_seconds
process_start_time_seconds{instance="node1"} 1662763800
$ scalar(process_start_time_seconds)
scalar 1662763800
Note: it's worth mentioning that scalar() returns NaN (Not a Number) if the input vector does not have exactly one element, i.e. when there are 0 or more than 1 elements. This is a common exam "gotcha".
What does histogram_quantile output unit match?
- Document: https://prometheus.io/docs/prometheus/latest/querying/functions/#histogram_quantile
- The output unit of histogram_quantile matches the unit of the original metric (specifically, the unit of the values in the le label).
- Example:
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.05", method="POST"} 24054
http_request_duration_seconds_bucket{le="0.1", method="POST"} 33444
http_request_duration_seconds_bucket{le="0.2", method="POST"} 100392
How do you select metrics from yesterday?
Simple answer: offset
Example: http_requests_total offset 1d, or the rate for the same 5-minute window yesterday:
rate(http_requests_total[5m] offset 1d)
Note: the offset appears inside the rate() function; it means "take the 5 minutes of data from one day ago, then calculate the rate".
without - drop/remove labels
Syntax: <aggr-op> without (list-of-labels-to-remove) (metric_name)
Example: we have the metric http_requests_total with 4 labels:
{instance="10.0.0.1", method="POST", status="200", app="backend"}
and we don't care about instance, so we want to drop it from the result.
- With by: sum by (method, status, app) (http_requests_total)
- The same result with without: sum without (instance) (http_requests_total)
What is the purpose of the delta() function?
Definition: Calculates the difference between the last value and the first value in a specified time range.
Important:
- Do NOT use with Counters. It does not handle counter resets (e.g., server restarts).
- For Counters, use increase() instead.
- This appeared in the exam xD
Example: delta(node_filesystem_free_bytes[1h]) (Calculates how many bytes of disk space changed over the last hour).
Notes for the exam
- Be sure to understand the concepts of the four core metric types; they account for at least 5 questions in the exam!
- Understand how alert routing works in Alertmanager; the exam has questions like "which team will receive this alert?".
- The exam requires 75/100 to pass. To be honest, I didn't train seriously for it (lazy AF), but somehow I still passed with 79/100!
- Exam duration: 90 minutes
- Exam questions: 60 questions
That's a Wrap!
Hope these notes save you some time. If you spot any mistakes or have questions, feel free to reach out.
Now go pass that exam!!!