Agent skill
bigquery-etl-core
The core skill for working within the bigquery-etl repository. Use this skill when understanding project structure, conventions, and common patterns. Works with model-requirements, query-writer, metadata-manager, sql-test-generator, and bigconfig-generator skills.
Install this agent skill to your Project
npx add-skill https://github.com/mozilla/bigquery-etl-skills/tree/main/skills/bigquery-etl-core
SKILL.md
BigQuery ETL Core
Composable: Foundation skill that works with model-requirements, query-writer, metadata-manager, sql-test-generator, and bigconfig-generator skills When to use: Understanding project structure, conventions, common patterns, and finding schema descriptions for construction
Project Overview
The bigquery-etl project manages BigQuery table definitions, queries, and associated metadata for Mozilla. Similar to dbt, the repository maintains query definitions with associated metadata and schemas.
Each table/query typically consists of three files:
query.sqlORquery.py- The query definition (SQL or Python)metadata.yaml- Metadata about scheduling, ownership, and dependencies (see metadata-manager skill)schema.yaml- BigQuery schema definition with field types and descriptions (see metadata-manager skill)
Note: Most tables use query.sql (~95%). Use query.py for API calls, multi-project queries, or complex Python operations. See query-writer skill for details.
🚨 REQUIRED READING - Start Here
When starting work in bigquery-etl, READ these foundational references:
-
Naming Conventions: READ
references/naming_conventions.md- Table naming patterns
- Dataset organization
- Version suffix conventions
-
Dataset Organization: READ
references/dataset_naming_conventions.md- Common dataset suffixes (_derived, _stable, _live)
- When to use each dataset type
- Dataset naming rules
-
Schema Resources: READ
references/discovery_resources.md- Schema description sources (Glean Dictionary, ProbeInfo API, DataHub)
- Priority order for schema lookup during construction
- Common mozfun UDFs
-
Privacy Guidelines: READ
references/privacy_guidelines.md- Data handling requirements
- PII considerations
- Workgroup access patterns
Directory Structure
sql/{project}/{dataset}/{table_name}/
├── query.sql OR query.py
├── metadata.yaml
└── schema.yaml
See assets/directory_structure_example.txt for detailed examples.
Key principles:
- Always flat:
sql/{project}/{dataset}/{table_name}/ - Never use subdirectories within table directories
- Table names always include version suffix (
_v1,_v2, etc.)
Schema & Description Resources for Construction
Finding Schema Descriptions
Priority order for schema lookup during construction:
-
Local files first: Check
sql/*/schema.yamlandmetadata.yamlfiles- Most reliable and up-to-date source
- Contains field descriptions written by table owners
-
Glean Dictionary: For
_liveand_stabletables- URL: https://dictionary.telemetry.mozilla.org/
- Contains metric descriptions from Glean schema definitions
- Use WebFetch with targeted prompts to extract specific field descriptions
-
ProbeInfo API: For Glean metric metadata
- Endpoints:
https://probeinfo.telemetry.mozilla.org/glean/{product}/metrics - Provides metric definitions and descriptions programmatically
- Use for validating metric references in queries
- Endpoints:
-
DataHub MCP: Only as last resort
- MUST READ
references/datahub_best_practices.mdBEFORE any DataHub queries - Use for schema lookup when not available in local files or Glean Dictionary
- Extract ONLY necessary fields (column names, types, descriptions)
- Use for downstream impact analysis when modifying tables
- MUST READ
See references/discovery_resources.md for:
- Detailed guidance on each schema source
- ProbeInfo API endpoints and usage patterns
- Glean Dictionary URL patterns for different products
- DataHub MCP best practices for construction
- Common mozfun UDFs
- Key documentation links
Naming Conventions
Table Names:
- Use snake_case with version suffix:
clients_daily_event_v1 - Common suffixes:
_daily,_hourly,_aggregates,_summary
Field Names:
- Use snake_case:
submission_date,client_id,n_total_events - Prefix counts with
n_:n_events,n_sessions - Standard Mozilla fields:
submission_date,client_id,sample_id,normalized_channel,normalized_country_code,app_version
See references/naming_conventions.md for:
- Complete naming patterns and conventions
- Reserved/common patterns to avoid
- BigQuery project naming conventions
Dataset Organization
See references/dataset_naming_conventions.md for:
- Dataset naming patterns by suffix (
_derived,_external, etc.) - Common dataset prefixes by product/source
- Table versioning patterns
- Incremental vs full refresh query patterns
Privacy & Data Handling
Mozilla follows strict data privacy policies:
- No PII in derived tables
- Use client-level identifiers (
client_id) not individual identifiers - Respect data retention policies (~2 years for client-level data)
- Label client-level tables with
table_type: client_levelin metadata.yaml
See references/privacy_guidelines.md for:
- Key principles from Mozilla's data platform
- Geo IP lookup and user agent parsing policies
- Best practices for data handling
- Deletion request support
- Sample ID usage for sampling
BigQuery & Mozilla Conventions
Partitioning & Clustering
- Most tables use day partitioning on
submission_date - Clustering improves query performance for filtered/joined fields
- See metadata-manager skill for detailed partitioning and clustering configuration
Common UDFs (mozfun)
Browse available functions: https://mozilla.github.io/bigquery-etl/mozfun/
Common functions:
mozfun.map.get_key()- Extract values from key-value mapsmozfun.norm.truncate_version()- Normalize version stringsmozfun.stats.mode_last()- Statistical mode calculation
UDF source code in sql/mozfun/ directory.
Glean Overview
Glean is Mozilla's product analytics & telemetry solution, providing consistent measurement across all Mozilla products.
Key concepts:
- Metric types: Counter, boolean, string, event, etc.
- Pings: Collections of metrics (e.g.,
baseline,events,metrics) - Applications: Products using Glean (Fenix, Focus, Firefox iOS, etc.)
Common Glean datasets in BigQuery:
- Pattern:
{app_id}.{ping_name}(e.g.,org_mozilla_fenix.baseline) - All have auto-generated schemas based on metric definitions
See references/glean_overview.md for:
- What is Glean and how it differs from Firefox Desktop Telemetry
- Glean SDK and metric type details
- Common Glean datasets in BigQuery
- When to use Glean Dictionary
bigquery-etl CLI Commands
See references/bqetl_cli_commands.md for:
- Key bqetl CLI commands for query creation, validation, schema updates
- How to find the right DAG for scheduling
- Backfill creation commands
Best Practices
General principles:
- Always include field descriptions in schema.yaml (see metadata-manager skill)
- Add header comments explaining query purpose (see query-writer skill)
- Reference bug/ticket numbers for context
- Document any data exclusions or filtering logic
See assets/query_structure_example.sql for standard query structure.
Version migration:
- Create new
_v2table when making breaking schema changes - Keep
_v1running during migration period - Update views to point to new version
- Coordinate with downstream consumers before deprecating old version
For detailed best practices, see:
- Query writing: query-writer skill
- Metadata configuration: metadata-manager skill
- Performance optimization: https://docs.telemetry.mozilla.org/cookbooks/bigquery/optimization.html
- Recommended practices: https://mozilla.github.io/bigquery-etl/reference/recommended_practices/
Integration with Other Skills
bigquery-etl-core serves as the foundation skill that other skills build upon:
Works with model-requirements
- Provides naming conventions for new tables and datasets
- Supplies common field naming patterns for requirements gathering
- Offers privacy guidelines for data model planning
Works with query-writer
- Provides project structure and naming conventions
- Supplies common patterns and mozfun UDF references
- Offers schema description lookup guidance for construction
Works with metadata-manager
- Provides DAG naming patterns and scheduling conventions
- Supplies partitioning and clustering best practices
- Offers ownership and labeling patterns
Works with sql-test-generator
- Provides test structure and fixture naming conventions
- Supplies common table patterns for test creation
- Offers query parameter conventions
Works with bigconfig-generator
- Provides table naming conventions for Bigeye monitoring configuration
- Supplies dataset organization patterns
- Offers field naming standards for data quality checks
This skill is always available and does not need to be explicitly invoked - it provides foundational knowledge that other skills reference.
Reference Examples
Real query examples in the repository:
- Simple query:
sql/moz-fx-data-shared-prod/mozilla_vpn_derived/users_v1/query.sql - Aggregation with GROUP BY:
sql/moz-fx-data-shared-prod/telemetry_derived/clients_daily_event_v1/query.sql - Complex query with CTEs:
sql/moz-fx-data-shared-prod/telemetry_derived/event_events_v1/query.sql - Python ETL (INFORMATION_SCHEMA):
sql/moz-fx-data-shared-prod/monitoring_derived/bigquery_table_storage_v1/query.py - Python ETL (External API):
sql/moz-fx-data-shared-prod/bigeye_derived/user_service_v1/query.py
For more examples, explore the sql/moz-fx-data-shared-prod/ directory.
Bundled Resources
References
references/discovery_resources.md- Schema description sources (Glean Dictionary, ProbeInfo API, DataHub MCP), priority order for construction, documentation linksreferences/naming_conventions.md- Complete naming patterns for tables, fields, and projectsreferences/dataset_naming_conventions.md- Dataset organization and versioning patternsreferences/privacy_guidelines.md- Mozilla data privacy policies and best practicesreferences/glean_overview.md- Glean SDK concepts and BigQuery dataset structuresreferences/bqetl_cli_commands.md- Key CLI commands and DAG discovery
DataHub Usage (CRITICAL for Token Efficiency)
BEFORE using any DataHub MCP tools (mcp__datahub-cloud__*), you MUST:
- READ
references/datahub_best_practices.md- Comprehensive token optimization strategies - Follow priority order: local files → documentation → DataHub (only as last resort)
- Use search-first patterns and extract minimal fields from responses
Assets
assets/query_structure_example.sql- Standard query.sql structure with common patternsassets/directory_structure_example.txt- File organization examples
Recommended Agent Skills
Expand your agent's capabilities with these related and highly-rated skills.
model-requirements
Use this skill when gathering requirements for new BigQuery data models OR when asked to edit existing queries in bqetl. For new models, guides structured requirements interviews. For existing queries, understands current model, checks downstream dependencies, and gathers requirements for changes. Works as pre-planning before query-writer skill.
metadata-manager
Use this skill when creating or updating DAG configurations (dags.yaml), schema.yaml, and metadata.yaml files for BigQuery tables. Handles creating new DAGs when needed and coordinates test updates when queries are modified (invokes sql-test-generator as needed). Works with bigquery-etl-core, query-writer, and sql-test-generator skills.
bigconfig-generator
Use this skill when creating or updating Bigeye monitoring configurations (bigconfig.yml files) for BigQuery tables. Works with metadata-manager skill.
query-writer
Use this skill when writing or updating SQL queries (query.sql) or Python ETL scripts (query.py) following Mozilla BigQuery ETL conventions. ALWAYS checks for and updates existing tests when modifying queries. Coordinates downstream updates to schemas and tests. Works with bigquery-etl-core, metadata-manager, and sql-test-generator skills.
schema-readme-generator
Use this skill to create or update README.md files for BigQuery ETL tables in the mozilla bigquery-etl repository. Follows layout conventions derived from comparing README files across the repo — rich style with emoji headings, Mermaid data flow diagram, graduated example queries, and concise metadata overview table. Requires schema.yaml with complete descriptions (run schema-enricher first if needed) and a complete metadata.yaml.
sql-test-generator
ALWAYS use this skill when users ask to create, generate, or write UNIT TESTS for BigQuery SQL queries. Invoke proactively whenever the request includes "test" or "tests" with a query/table name. This skill is for unit testing ONLY (not data quality checks - use bigconfig-generator for Bigeye monitoring). Works with bigquery-etl-core skill to understand query patterns.
Didn't find tool you were looking for?