Agent skill
airflow-debugging
Investigate Mozilla Airflow DAG failures. Use when user asks about: failed DAGs, Airflow task logs, DAG run errors, bqetl failures, telemetry-airflow issues, or data pipeline debugging.
Airflow DAG Failure Investigation
You help users investigate and debug Mozilla Airflow DAG failures by fetching logs, identifying root causes, and suggesting fixes.
Helper Scripts
Two scripts are bundled in the scripts/ directory relative to this skill file. Use them as the primary investigation tools.
list-failed-dags
List DAGs that failed within a time window. Queries Cloud Logging for DagRun Finished.*state=failed events.
scripts/list-failed-dags # Last 24 hours (default)
scripts/list-failed-dags --since 12h # Last 12 hours
scripts/list-failed-dags --since 3d # Last 3 days
scripts/list-failed-dags --all # Show all failures with details
fetch-task-log
Fetch and explore task logs from GCS (gs://airflow-remote-logs-prod-prod).
# List recent runs for a DAG
scripts/fetch-task-log <dag_id> --list-runs
# List tasks in a specific run
scripts/fetch-task-log <dag_id> --list-tasks --run-id <run_id>
# Fetch a task log
scripts/fetch-task-log <dag_id> <task_id> <run_id>
# Fetch only the last N lines
scripts/fetch-task-log <dag_id> <task_id> <run_id> --tail 100
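Even a tailed log can be noisy, so it may help to filter it down to lines that look like failures. The sketch below is not part of the skill: `extract_errors` is a hypothetical helper, the keyword list is a guess at common failure markers, and it assumes `fetch-task-log` writes the log to stdout.

```shell
# Hypothetical helper (not shipped with the skill): keep only lines that
# look like errors, each prefixed with its line number in the input.
extract_errors() {
  # -n: show line numbers, -i: case-insensitive, -E: extended regex.
  # "|| true" keeps the exit status at 0 when nothing matches.
  grep -n -i -E 'error|exception|traceback|killed' || true
}

# Usage (assumes the log is written to stdout):
#   scripts/fetch-task-log <dag_id> <task_id> <run_id> | extract_errors
```

The line numbers make it easier to go back to the full log and read the context around each match.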
Related Repositories
When investigating failures, check these repos (all checked out locally):
- bigquery-etl - Query definitions, metadata.yaml, DAG generation
- private-bigquery-etl - Confidential ETL code
- telemetry-airflow - DAGs, operators, GKEPodOperator
- dataservices-infra - Infrastructure (GKE, Helm, logging config)
Where DAGs Are Defined
Most DAGs are auto-generated from bigquery-etl. The task ID tells you where to find the source.
Task ID Pattern: <dataset>__<table>__<version>
Example task ID: telemetry_derived__clients_daily__v6
Source query location:
bigquery-etl/sql/moz-fx-data-shared-prod/<dataset>/<table>_<version>/
├── query.sql # The SQL query
├── metadata.yaml # Scheduling config, owner, tags
└── schema.yaml # Table schema
For the example above:
bigquery-etl/sql/moz-fx-data-shared-prod/telemetry_derived/clients_daily_v6/
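The mapping from task ID to source directory can be sketched as a small shell function. This is an illustration rather than part of the skill: `task_id_to_source_dir` is a hypothetical name, and it assumes the task ID has exactly the three double-underscore-separated parts shown above, with the directory name joining table and version by a single underscore as in the `clients_daily_v6` example.

```shell
# Hypothetical helper: map a task ID of the form <dataset>__<table>__<version>
# to its source directory in a local bigquery-etl checkout.
task_id_to_source_dir() {
  local task_id="$1"
  # Assumption: the default project is the one used in the example path.
  local project="${2:-moz-fx-data-shared-prod}"

  case "$task_id" in
    *__*__*) ;;  # looks like <dataset>__<table>__<version>
    *) echo "unexpected task ID format: $task_id" >&2; return 1 ;;
  esac

  local dataset="${task_id%%__*}"   # text before the first "__"
  local rest="${task_id#*__}"       # text after the first "__"
  local table="${rest%__*}"         # text before the last "__"
  local version="${rest##*__}"      # text after the last "__"

  printf 'bigquery-etl/sql/%s/%s/%s_%s/\n' "$project" "$dataset" "$table" "$version"
}

# Usage:
#   task_id_to_source_dir telemetry_derived__clients_daily__v6
```

Task IDs that do not follow the pattern make the function return a non-zero status instead of guessing a path.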
DAG ID Pattern: bqetl_<name>
DAGs starting with bqetl_ are auto-generated. The DAG configuration is in bigquery-etl/dags.yaml.
Non-bqetl DAGs
DAGs not starting with bqetl_ are manually defined in:
telemetry-airflow/dags/<dag_name>.py
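The prefix rule can be expressed as a short lookup. This is a sketch under one loud assumption: for manually defined DAGs it guesses that the file name equals the DAG ID, which the source does not guarantee, so treat the second branch as a starting point for a search. `dag_definition_location` is a hypothetical name.

```shell
# Hypothetical helper: suggest where a DAG is defined, based on its ID prefix.
dag_definition_location() {
  local dag_id="$1"
  case "$dag_id" in
    # Auto-generated DAGs are configured in bigquery-etl.
    bqetl_*) echo "bigquery-etl/dags.yaml" ;;
    # Assumption: the Python file is named after the DAG ID.
    *)       echo "telemetry-airflow/dags/${dag_id}.py" ;;
  esac
}

# Usage (both DAG IDs are made up for illustration):
#   dag_definition_location bqetl_example
#   dag_definition_location example_dag
```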
Private/Confidential DAGs
Some DAGs are in private-bigquery-etl with the same structure:
private-bigquery-etl/sql/<project>/<dataset>/<table>_<version>/
GCP Projects & Namespaces
Airflow runs across two GCP projects:
| Project | Purpose | Namespace |
|---|---|---|
| moz-fx-dataservices-high-prod | Airflow workers, scheduler | telemetry-airflow-prod |
| moz-fx-data-airflow-gke-prod | GKEPodOperator jobs (queries, scripts) | default |
Cloud Logging (Fallback)
Start with GCS logs via fetch-task-log. Fall back to Cloud Logging if you suspect infrastructure issues or if GCS logs are missing/incomplete.
| Aspect | GCS (fetch-task-log) | Cloud Logging |
|---|---|---|
| Content | Complete Airflow task logs (same as UI) | Raw container stdout/stderr |
| Retention | 360 days | 30 days |
| Best for | Task failures (SQL errors, exceptions) | Pod-level issues (OOM kills, scheduling failures) |
Airflow scheduler/worker logs:
gcloud logging read 'resource.type="k8s_container" AND resource.labels.namespace_name="telemetry-airflow-prod" AND textPayload=~"<DAG_ID>"' \
--project=moz-fx-dataservices-high-prod \
--limit=200
GKEPodOperator job logs (query execution errors):
gcloud logging read 'resource.type="k8s_container" AND resource.labels.namespace_name="default" AND textPayload=~"<DAG_ID>"' \
--project=moz-fx-data-airflow-gke-prod \
--limit=200
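The two commands differ only in project and namespace, so they can be wrapped in one function. The sketch below assumes `gcloud` is installed and authenticated; `build_log_filter` and `read_dag_logs` are hypothetical names, and the `airflow`/`pods` target labels are invented for this example.

```shell
# Hypothetical helper: build the Cloud Logging filter for a namespace and DAG ID.
build_log_filter() {
  local namespace="$1" dag_id="$2"
  printf 'resource.type="k8s_container" AND resource.labels.namespace_name="%s" AND textPayload=~"%s"' \
    "$namespace" "$dag_id"
}

# Hypothetical wrapper. target is "airflow" (scheduler/worker logs) or
# "pods" (GKEPodOperator job logs); both labels are made up for this sketch.
read_dag_logs() {
  local target="$1" dag_id="$2" project namespace
  case "$target" in
    airflow) project="moz-fx-dataservices-high-prod"; namespace="telemetry-airflow-prod" ;;
    pods)    project="moz-fx-data-airflow-gke-prod";  namespace="default" ;;
    *) echo "target must be 'airflow' or 'pods'" >&2; return 2 ;;
  esac
  gcloud logging read "$(build_log_filter "$namespace" "$dag_id")" \
    --project="$project" --limit=200
}

# Usage:
#   read_dag_logs pods <DAG_ID>
```

`gcloud logging read` only looks back one day by default. For older failures, adding a flag such as `--freshness=3d` to the command should widen the window.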
Investigation Workflow
If the user provides a DAG name, skip straight to step 2. Only run list-failed-dags when you need to discover which DAGs failed.
1. Run scripts/list-failed-dags to discover failures (skip if DAG name is already known)
2. Run scripts/fetch-task-log <dag_id> --list-runs to find recent runs
3. Run scripts/fetch-task-log <dag_id> --list-tasks --run-id <run_id> to list tasks in the failing run
4. Run scripts/fetch-task-log <dag_id> <task_id> <run_id> --tail 100 to get the error
5. Identify root cause from the logs
6. Look at the query/script in bigquery-etl or telemetry-airflow
7. Suggest fix
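The log-fetching steps of this workflow can be folded into one function that picks the next command based on how much is already known. This is a sketch, not part of the skill: `investigate` and the `FETCH_TASK_LOG` variable are hypothetical, and it assumes it is run from the skill directory so that `scripts/fetch-task-log` resolves.

```shell
# Assumption: path to the bundled script, overridable for other checkouts.
FETCH_TASK_LOG="${FETCH_TASK_LOG:-scripts/fetch-task-log}"

# Hypothetical helper: run the next fetch-task-log step for a DAG.
#   investigate <dag_id>                      -> list recent runs
#   investigate <dag_id> <run_id>             -> list tasks in that run
#   investigate <dag_id> <run_id> <task_id>   -> last 100 lines of the task log
investigate() {
  local dag_id="$1" run_id="${2:-}" task_id="${3:-}"
  if [ -z "$run_id" ]; then
    "$FETCH_TASK_LOG" "$dag_id" --list-runs
  elif [ -z "$task_id" ]; then
    "$FETCH_TASK_LOG" "$dag_id" --list-tasks --run-id "$run_id"
  else
    "$FETCH_TASK_LOG" "$dag_id" "$task_id" "$run_id" --tail 100
  fi
}
```

Identifying the root cause, reading the source, and suggesting a fix remain manual steps; the function only removes the need to remember the argument order.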
Response Format
When reporting findings:
- State the DAG name and failure time
- Quote the key error message from logs
- Identify the root cause (SQL error, timeout, OOM, dependency failure, etc.)
- Link to the relevant source file in bigquery-etl or telemetry-airflow
- Suggest a concrete fix or next step