Agent skill
google-cloud-configs
Google Cloud Platform configuration templates for BigQuery ML and Vertex AI training with authentication setup, GPU/TPU configs, and cost estimation tools. Use when setting up GCP ML training, configuring BigQuery ML models, deploying Vertex AI training jobs, estimating GCP costs, configuring cloud authentication, selecting GPUs/TPUs for training, or when user mentions BigQuery ML, Vertex AI, GCP training, cloud ML setup, TPU training, or Google Cloud costs.
Install this agent skill to your Project
npx add-skill https://github.com/majiayu000/claude-skill-registry/tree/main/skills/devops/google-cloud-configs
Use when:
- Setting up BigQuery ML for SQL-based machine learning
- Configuring Vertex AI custom training jobs
- Setting up GCP authentication for ML workflows
- Selecting appropriate GPU/TPU configurations
- Estimating costs for GCP ML training
- Deploying models to Vertex AI endpoints
- Configuring distributed training on GCP
- Optimizing cost vs performance for cloud ML
Platform Overview
BigQuery ML
What it is: SQL-based machine learning directly in BigQuery
Best for:
- Quick ML prototypes using existing data warehouse data
- Classification, regression, forecasting on structured data
- Users familiar with SQL but not Python/ML frameworks
- Large-scale batch predictions
Available Models:
- Linear/Logistic Regression
- XGBoost (BOOSTED_TREE)
- Deep Neural Networks (DNN)
- AutoML Tables
- Imported models (TensorFlow, ONNX, XGBoost; PyTorch models via ONNX export)
Pricing:
- Based on data processed (same as BigQuery queries)
- $5 per TB processed for analysis
- AutoML: $19.32/hour for training
Vertex AI Training
What it is: Fully managed ML training platform
Best for:
- Custom PyTorch/TensorFlow training
- Large-scale distributed training
- GPU/TPU-accelerated workloads
- Production ML pipelines
Available Compute:
- CPUs: n1-standard, n1-highmem, n1-highcpu
- GPUs: NVIDIA T4, P4, V100, P100, A100, L4
- TPUs: v2, v3, v4, v5e (8 cores to 512 cores)
Pricing:
- CPU: $0.05-0.30/hour depending on machine type
- GPU T4: $0.35/hour
- GPU A100: $3.67/hour (40GB) or $4.95/hour (80GB)
- TPU v3: $8.00/hour (8 cores)
- TPU v4: $11.00/hour (8 cores)
GPU/TPU Selection Guide
GPU Selection (Vertex AI)
T4 (16GB VRAM):
- Use case: Inference, light training, small models
- Cost: $0.35/hour
- Good for: BERT-base, small CNNs, inference serving
V100 (16GB VRAM):
- Use case: Mid-size training, mixed precision training
- Cost: $2.48/hour
- Good for: ResNet training, medium transformers
A100 (40GB/80GB VRAM):
- Use case: Large model training, distributed training
- Cost: $3.67/hour (40GB), $4.95/hour (80GB)
- Good for: GPT-style models, large vision models, multi-GPU training
L4 (24GB VRAM):
- Use case: Modern alternative to T4, better performance
- Cost: $0.66/hour
- Good for: Mid-size models, efficient inference
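The selection guide above can be turned into a small lookup. The following is a minimal sketch, assuming the VRAM sizes and hourly prices listed in this section; the `Gpu` class and `cheapest_gpu` helper are hypothetical names, not part of the skill's scripts.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Gpu:
    name: str
    vram_gb: int
    usd_per_hour: float


# Figures copied from the selection guide above; treat them as illustrative.
GPUS = [
    Gpu("T4", 16, 0.35),
    Gpu("L4", 24, 0.66),
    Gpu("V100", 16, 2.48),
    Gpu("A100-40GB", 40, 3.67),
    Gpu("A100-80GB", 80, 4.95),
]


def cheapest_gpu(min_vram_gb: int) -> Optional[Gpu]:
    """Return the cheapest listed GPU with at least min_vram_gb of VRAM."""
    candidates = [g for g in GPUS if g.vram_gb >= min_vram_gb]
    return min(candidates, key=lambda g: g.usd_per_hour, default=None)
```

Price is only one axis: a V100 or A100 may still be the better choice when training throughput matters more than hourly cost.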
TPU Selection (Vertex AI)
TPU v2 (8 cores):
- Use case: TensorFlow/JAX training, matrix operations
- Cost: $4.50/hour
- Memory: 8GB per core (64GB total)
- Good for: Legacy TensorFlow models
TPU v3 (8 cores):
- Use case: Standard TPU training
- Cost: $8.00/hour
- Memory: 16GB per core (128GB total)
- Good for: BERT, T5, image classification
TPU v4 (8 cores):
- Use case: High-end TPU training, highest per-chip performance of the options listed here
- Cost: $11.00/hour
- Memory: 32GB per core (256GB total)
- Good for: Large language models, cutting-edge research
TPU v5e (8 cores):
- Use case: Cost-optimized TPU
- Cost: $2.50/hour
- Good for: Development, training at scale on budget
Multi-node TPU Pods:
- v3-32: 32 cores, $32/hour
- v3-128: 128 cores, $128/hour
- v4-128: 128 cores, $176/hour
- Use for: Massive distributed training (GPT-3 scale)
Usage
Setup BigQuery ML Environment
bash scripts/setup-bigquery-ml.sh
Prompts for:
- GCP Project ID
- BigQuery dataset name
- Service account credentials
- Default model type preference
Creates:
- bigquery_config.json: Project configuration
- .bigqueryrc: CLI configuration
- Example training SQL in examples/
Setup Vertex AI Training Environment
bash scripts/setup-vertex-ai.sh
Prompts for:
- GCP Project ID
- Region (us-central1, europe-west4, etc.)
- Service account credentials
- Default machine type
- GPU/TPU preference
Creates:
- vertex_config.yaml: Training job configuration
- vertex_requirements.txt: Python dependencies
- Training script template
Configure GCP Authentication
bash scripts/configure-auth.sh
Prompts for:
- Authentication method (service account, user account, workload identity)
- Service account key path (if applicable)
- IAM roles needed
Creates:
- .gcp_auth_config: Authentication configuration
- Sets GOOGLE_APPLICATION_CREDENTIALS environment variable
- Validates permissions
Required IAM Roles:
- BigQuery ML: roles/bigquery.dataEditor, roles/bigquery.jobUser
- Vertex AI: roles/aiplatform.user, roles/storage.objectAdmin
- Both: roles/serviceusage.serviceUsageConsumer
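Before checking IAM roles, it can help to confirm that the key file itself is well formed. The sketch below is one possible pre-flight check, not what scripts/configure-auth.sh actually does; `check_key_file` and `check_adc_env` are hypothetical helpers. It only inspects the local JSON file and does not contact GCP.

```python
import json
import os
from pathlib import Path

# Fields present in a standard service account key file.
REQUIRED_FIELDS = ("type", "project_id", "client_email", "private_key")


def check_key_file(path):
    """Return a list of problems found in a service account key file."""
    p = Path(path)
    if not p.is_file():
        return [f"file not found: {p}"]
    try:
        data = json.loads(p.read_text())
    except ValueError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(data, dict):
        return ["top-level JSON value is not an object"]
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in data]
    if data.get("type") != "service_account":
        problems.append("type is not 'service_account'")
    return problems


def check_adc_env(environ=os.environ):
    """Check the key file that GOOGLE_APPLICATION_CREDENTIALS points to."""
    path = environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        return ["GOOGLE_APPLICATION_CREDENTIALS is not set"]
    return check_key_file(path)
```

An empty list means the file looks structurally valid; whether the account holds the roles above still has to be verified against the project.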
Estimate GCP Training Costs
bash scripts/estimate-gcp-cost.sh
Interactive prompts:
- Platform: BigQuery ML or Vertex AI
- If BigQuery ML: Data size to process
- If Vertex AI:
- Machine type (CPU/GPU/TPU)
- Number of machines
- Training duration estimate
- Storage requirements
Output:
- Estimated compute cost
- Storage cost
- Data transfer cost (if applicable)
- Total estimated cost
- Cost comparison with other GCP options
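The compute part of such an estimate is simple arithmetic. Here is a rough sketch, assuming the hourly rates quoted earlier in this document; it is not the logic of scripts/estimate-gcp-cost.sh, the function names are hypothetical, and it ignores host machine, storage, and data transfer charges.

```python
# Rates copied from the pricing lists above; verify against current GCP pricing.
BQ_USD_PER_TB = 5.00
VERTEX_USD_PER_HOUR = {
    "T4": 0.35,
    "L4": 0.66,
    "V100": 2.48,
    "A100-40GB": 3.67,
    "A100-80GB": 4.95,
    "TPU-v3-8": 8.00,
    "TPU-v4-8": 11.00,
}


def bigquery_ml_cost(tb_processed: float) -> float:
    """Cost of a BigQuery ML run billed by data processed."""
    return tb_processed * BQ_USD_PER_TB


def vertex_cost(accelerator: str, machines: int, hours: float,
                preemptible_discount: float = 0.0) -> float:
    """Accelerator cost only: rate x machines x hours, minus any discount."""
    rate = VERTEX_USD_PER_HOUR[accelerator]
    return rate * machines * hours * (1.0 - preemptible_discount)
```

For example, `vertex_cost("A100-40GB", machines=4, hours=10)` multiplies 3.67 by 40 machine-hours, and passing `preemptible_discount=0.6` applies the lower end of the discount range mentioned under Cost Optimization Tips.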
Templates
BigQuery ML Training Template (templates/bigquery_ml_training.sql)
SQL template for creating and training models:
- Model creation syntax
- Feature engineering examples
- Training options (L1/L2 reg, learning rate, etc.)
- Evaluation queries
- Prediction queries
Supported model types:
- LINEAR_REG, LOGISTIC_REG
- BOOSTED_TREE_CLASSIFIER, BOOSTED_TREE_REGRESSOR
- DNN_CLASSIFIER, DNN_REGRESSOR
- AUTOML_CLASSIFIER, AUTOML_REGRESSOR
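To show what the template's model creation step looks like, the sketch below builds a CREATE MODEL statement for one of the model types listed above. It is an illustration, not the contents of templates/bigquery_ml_training.sql; `create_model_sql` and its parameters are hypothetical, and the names are interpolated directly, so only pass trusted identifiers.

```python
SUPPORTED_MODEL_TYPES = {
    "LINEAR_REG", "LOGISTIC_REG",
    "BOOSTED_TREE_CLASSIFIER", "BOOSTED_TREE_REGRESSOR",
    "DNN_CLASSIFIER", "DNN_REGRESSOR",
    "AUTOML_CLASSIFIER", "AUTOML_REGRESSOR",
}


def create_model_sql(model_path, model_type, label_col, source_table, feature_cols):
    """Build a BigQuery ML CREATE MODEL statement as a string."""
    if model_type not in SUPPORTED_MODEL_TYPES:
        raise ValueError(f"unsupported model type: {model_type}")
    columns = ", ".join([*feature_cols, label_col])
    return (
        f"CREATE OR REPLACE MODEL `{model_path}`\n"
        f"OPTIONS (model_type = '{model_type}', "
        f"input_label_cols = ['{label_col}']) AS\n"
        f"SELECT {columns}\n"
        f"FROM `{source_table}`\n"
        f"WHERE {label_col} IS NOT NULL"
    )
```

The `WHERE ... IS NOT NULL` filter reflects the troubleshooting advice later in this document about NULL values; drop it if your labels are already clean.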
Vertex AI Training Job Template (templates/vertex_training_job.py)
Python template for custom training:
- Training loop structure
- Distributed training setup (PyTorch DDP)
- Checkpointing and model saving
- Metrics logging to Vertex AI
- Hyperparameter tuning integration
Includes:
- Single GPU training
- Multi-GPU training (DataParallel, DistributedDataParallel)
- TPU training with PyTorch/XLA
- Cloud Storage integration
GPU Configuration Template (templates/vertex_gpu_config.yaml)
YAML configuration for GPU training jobs:
- Machine type selection
- GPU type and count
- Disk configuration
- Network configuration
- Environment variables
Presets included:
- Single T4 (budget)
- Single A100 (standard)
- 4x A100 (distributed)
- 8x A100 (large-scale)
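The presets above roughly correspond to Vertex AI worker pool specs. The sketch below is an assumption about how they might map, not the contents of templates/vertex_gpu_config.yaml: the preset keys, the n1-standard-8 host for the T4, and the `worker_pool_spec` helper are all illustrative, while the field names follow the CustomJob `workerPoolSpecs` schema.

```python
import json

# preset -> (machine type, accelerator type, accelerator count); assumed mapping.
PRESETS = {
    "t4-single": ("n1-standard-8", "NVIDIA_TESLA_T4", 1),
    "a100-single": ("a2-highgpu-1g", "NVIDIA_TESLA_A100", 1),
    "a100-4x": ("a2-highgpu-4g", "NVIDIA_TESLA_A100", 4),
    "a100-8x": ("a2-highgpu-8g", "NVIDIA_TESLA_A100", 8),
}


def worker_pool_spec(preset, image_uri, replica_count=1):
    """Build a single-pool CustomJob spec for the given preset."""
    machine_type, accelerator_type, accelerator_count = PRESETS[preset]
    return {
        "workerPoolSpecs": [
            {
                "machineSpec": {
                    "machineType": machine_type,
                    "acceleratorType": accelerator_type,
                    "acceleratorCount": accelerator_count,
                },
                "replicaCount": replica_count,
                "containerSpec": {"imageUri": image_uri},
            }
        ]
    }


if __name__ == "__main__":
    # The image URI is a placeholder; JSON output is also valid YAML.
    spec = worker_pool_spec("a100-4x", "gcr.io/example-project/trainer:latest")
    print(json.dumps(spec, indent=2))
```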
TPU Configuration Template (templates/vertex_tpu_config.yaml)
YAML configuration for TPU training jobs:
- TPU type and topology
- TPU version selection
- JAX/TensorFlow runtime
- XLA compilation flags
Presets included:
- v3-8 (single TPU)
- v4-32 (TPU pod slice)
- v5e-8 (cost-optimized)
GCP Authentication Template (templates/gcp_auth.json)
Service account configuration template:
- Project ID
- Service account email
- Key file path
- Required scopes
- IAM role assignments
Security notes:
- Uses placeholders only (never real keys)
- Documents how to create service accounts
- Includes .gitignore protection
Examples
BigQuery ML Regression Example (examples/bigquery-regression-example.sql)
Complete example:
- Dataset: NYC taxi trip data
- Task: Predict trip duration
- Model: BOOSTED_TREE_REGRESSOR
- Includes feature engineering, training, evaluation
Demonstrates:
- CREATE MODEL syntax
- TRANSFORM clause for feature engineering
- MODEL evaluation
- Batch predictions
Vertex AI PyTorch Training Example (examples/vertex-pytorch-training.py)
Complete training script:
- Dataset: IMDB sentiment analysis
- Model: DistilBERT fine-tuning
- Training: Single GPU
- Logging: Vertex AI experiments
Demonstrates:
- Loading data from GCS
- Training loop with mixed precision
- Checkpointing to GCS
- Metrics logging
- Model export to Vertex AI
Vertex AI Distributed Training Example (examples/vertex-distributed-training.py)
Multi-GPU training example:
- Dataset: ImageNet subset
- Model: ResNet-50
- Training: 4x A100 with DDP
- Scaling: Linear scaling rule
Demonstrates:
- PyTorch DistributedDataParallel
- Gradient accumulation
- Learning rate scaling
- Synchronized batch norm
- Multi-node coordination
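The linear scaling rule mentioned above multiplies the learning rate by the same factor as the global batch size. A minimal sketch follows; the `scaled_lr` helper and its parameter names are illustrative and not taken from the example script.

```python
def scaled_lr(base_lr, base_batch, per_gpu_batch, num_gpus, accum_steps=1):
    """Scale the learning rate in proportion to the global batch size."""
    global_batch = per_gpu_batch * num_gpus * accum_steps
    return base_lr * global_batch / base_batch


if __name__ == "__main__":
    # A recipe tuned at batch 256 with lr 0.1, moved to 4 GPUs x 128 samples each.
    print(scaled_lr(base_lr=0.1, base_batch=256, per_gpu_batch=128, num_gpus=4))
```

In practice this rule is usually paired with a warmup period, since the scaled rate can be unstable in the first few hundred steps.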
Hugging Face Fine-tuning on Vertex AI (examples/vertex-huggingface-finetuning.py)
Production fine-tuning template:
- Dataset: Custom text classification
- Model: BERT/RoBERTa/DeBERTa
- Training: Hugging Face Trainer API
- Deployment: Vertex AI endpoint
Demonstrates:
- Hugging Face Trainer integration
- Hyperparameter tuning with Vertex AI
- Model versioning
- Endpoint deployment
- Online predictions
Cost Optimization Tips
BigQuery ML
Reduce data processed:
- Use partitioned tables
- Filter data in WHERE clause before training
- Use table sampling for experimentation
- Cache intermediate results
Use appropriate model types:
- Start with LINEAR_REG/LOGISTIC_REG (cheapest)
- Use BOOSTED_TREE for better accuracy at moderate cost
- Reserve AutoML for when simpler models fail
Optimize queries:
- Avoid SELECT * (specify columns)
- Use clustering on filter columns
- Use materialized views (or a saved feature table) when the same training query runs repeatedly
Vertex AI
Machine type selection:
- Start with CPU for prototyping
- Use T4 for small models (cheapest GPU)
- Use A100 only for large models that need it
- Consider TPU v5e for TensorFlow/JAX (very cost-effective)
Training optimization:
- Use preemptible instances (60-70% cheaper, can be interrupted)
- Enable automatic checkpoint/resume for preemptible
- Use mixed precision training (FP16/BF16) for faster training
- Profile to eliminate CPU bottlenecks
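Checkpoint and resume is what makes preemptible capacity usable: after an interruption the job restarts and continues from the last saved step. The sketch below shows the pattern with a local JSON file and a placeholder training step; all names are hypothetical, and a real job would save model and optimizer state to Cloud Storage instead.

```python
import json
import os
from pathlib import Path


def save_checkpoint(path, state):
    """Write to a temp file, then rename, so a kill never leaves a partial file."""
    tmp = Path(str(path) + ".tmp")
    tmp.write_text(json.dumps(state))
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists."""
    p = Path(path)
    return json.loads(p.read_text()) if p.exists() else {"step": 0}


def train(path, total_steps, save_every=10, stop_at=None):
    """Run until total_steps; stop_at simulates a preemption for testing."""
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        if stop_at is not None and state["step"] >= stop_at:
            return state  # interrupted: progress since the last save is lost
        state["step"] += 1  # placeholder for one real training step
        if state["step"] % save_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

The `save_every` interval is the trade-off to tune: frequent saves cost I/O time, infrequent saves lose more work on each interruption.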
Storage optimization:
- Store datasets in Cloud Storage (cheaper than persistent disk)
- Use Filestore only when the job needs a POSIX filesystem
- Clean up old model artifacts
- Use lifecycle policies to archive old data
Multi-GPU efficiency:
- Ensure near-linear scaling before adding more GPUs
- Profile inter-GPU communication
- Use gradient accumulation to reach a larger effective batch size when per-GPU memory is the limit
- Consider 2x GPUs instead of 1x larger GPU (often same cost, better availability)
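With gradient accumulation, the effective batch size is the per-GPU batch times the GPU count times the number of accumulation steps. A small sketch for picking the step count, using a hypothetical `accumulation_steps` helper:

```python
import math


def accumulation_steps(target_global_batch, per_gpu_batch, num_gpus):
    """Smallest number of accumulation steps that reaches the target batch."""
    per_step = per_gpu_batch * num_gpus
    return max(1, math.ceil(target_global_batch / per_step))
```

For instance, a target batch of 1024 on 4 GPUs that each fit 32 samples needs 8 accumulation steps, since each optimizer step then sees 32 x 4 x 8 = 1024 samples.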
Integration with ML Training Plugin
This skill integrates with other ml-training components:
- training-patterns: Provides GCP configs for generated training scripts
- cost-calculator: Uses GCP pricing data for budget planning
- monitoring-dashboard: Integrates with Vertex AI TensorBoard
- validation-scripts: Validates GCP credentials and permissions
- integration-helpers: Deploys trained models to Vertex AI endpoints
Common Workflows
Workflow 1: Quick BigQuery ML Prototype
- Run bash scripts/setup-bigquery-ml.sh
- Copy templates/bigquery_ml_training.sql to your project
- Modify SQL for your dataset and features
- Run training query in BigQuery console
- Evaluate with built-in ML.EVALUATE()
- Export predictions with ML.PREDICT()
Time: 30 minutes setup + training time
Cost: $5 per TB of data processed
Workflow 2: Custom PyTorch Training on Vertex AI
- Run bash scripts/configure-auth.sh
- Run bash scripts/setup-vertex-ai.sh
- Copy templates/vertex_training_job.py
- Customize training loop for your model
- Copy templates/vertex_gpu_config.yaml
- Submit job: gcloud ai custom-jobs create ...
- Monitor in Vertex AI console
Time: 1 hour setup + training time
Cost: Depends on GPU/TPU selection
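The submit step leaves the gcloud arguments elided. The sketch below assembles one plausible invocation using the `--region`, `--display-name`, and `--config` flags of `gcloud ai custom-jobs create`; the job name and region are placeholders, and it assumes the config file is in the format that `--config` accepts, which this document does not confirm for templates/vertex_gpu_config.yaml.

```python
import shlex


def custom_job_command(display_name, region, config_path):
    """Build the argv list for submitting a Vertex AI custom job."""
    return [
        "gcloud", "ai", "custom-jobs", "create",
        f"--region={region}",
        f"--display-name={display_name}",
        f"--config={config_path}",
    ]


if __name__ == "__main__":
    argv = custom_job_command("my-job", "us-central1", "vertex_gpu_config.yaml")
    # Print a copy-pasteable command; subprocess.run(argv) would execute it.
    print(shlex.join(argv))
```

Building the command as a list rather than a string avoids quoting problems when a display name contains spaces.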
Workflow 3: Large-Scale Distributed Training
- Setup Vertex AI (workflow 2)
- Copy examples/vertex-distributed-training.py
- Adapt for your model architecture
- Test locally with 1 GPU
- Test with 2 GPUs to verify scaling
- Scale to 4-8 GPUs for full training
- Use preemptible instances with checkpointing
Time: 2-4 hours setup + training time
Cost: $15-60/hour depending on GPU count
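The "verify scaling" step in this workflow comes down to comparing measured throughput against ideal linear scaling. A minimal sketch, assuming you have samples-per-second figures from the 1-GPU and N-GPU test runs; both helper names are hypothetical.

```python
def scaling_efficiency(throughput_1gpu, throughput_n, n):
    """Fraction of ideal linear speedup achieved on n GPUs (1.0 is perfect)."""
    return throughput_n / (n * throughput_1gpu)


def usd_per_million_samples(usd_per_gpu_hour, n, throughput_n):
    """Cost to process one million samples at the measured throughput."""
    samples_per_hour = throughput_n * 3600
    return usd_per_gpu_hour * n / samples_per_hour * 1_000_000
```

If efficiency drops well below 1.0 when going from 2 to 4 GPUs, the cost per sample rises with each GPU added, which is the signal to profile communication before scaling further.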
Troubleshooting
BigQuery ML Issues
"Insufficient permissions":
- Verify roles/bigquery.dataEditor and roles/bigquery.jobUser
- Check dataset-level permissions
- Ensure billing is enabled
"Model training failed":
- Check for NULL values in features
- Verify data types match model expectations
- Review feature engineering TRANSFORM clause
- Check for sufficient training data
Vertex AI Issues
"Service account lacks permissions":
- Verify roles/aiplatform.user
- Add roles/storage.objectAdmin for GCS access
- Check project-level IAM policies
"GPU/TPU quota exceeded":
- Request quota increase in GCP console
- Use different region with availability
- Start with smaller GPU/TPU configuration
- Use preemptible instances (separate quota)
"Training job crashes":
- Check for CUDA OOM (reduce batch size)
- Verify dependencies in requirements.txt
- Review logs in Cloud Logging
- Test locally before submitting to Vertex
Security Best Practices
Credentials Management
DO:
- ✅ Use service accounts with minimal permissions
- ✅ Store credentials in Secret Manager
- ✅ Use Workload Identity for GKE deployments
- ✅ Rotate service account keys regularly
- ✅ Add .gitignore entries for *.json key files
DON'T:
- ❌ Hardcode credentials in code
- ❌ Commit service account keys to git
- ❌ Use overly permissive roles (e.g., Owner)
- ❌ Share service account keys across projects
- ❌ Use personal credentials for production
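One way to back up the "never commit keys" rule is to scan a working tree for files that look like service account keys before committing. The sketch below is a hypothetical pre-commit style check, not something shipped with this skill; it flags any JSON file whose `type` is `service_account` and that contains a `private_key` field.

```python
import json
from pathlib import Path


def find_key_files(root):
    """Return paths under root that look like service account key files."""
    hits = []
    for path in Path(root).rglob("*.json"):
        try:
            data = json.loads(path.read_text())
        except (OSError, ValueError):
            continue  # unreadable or not JSON: not a key file
        if (isinstance(data, dict)
                and data.get("type") == "service_account"
                and "private_key" in data):
            hits.append(path)
    return sorted(hits)
```

A non-empty result is a reason to stop and move the file out of the repository, then rotate the key if it was ever pushed.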
IAM Best Practices
- Use separate service accounts for training vs serving
- Grant roles at resource level, not project level when possible
- Use Workload Identity Federation instead of keys when possible
- Enable Cloud Audit Logs for ML API usage
- Review IAM permissions quarterly
Performance Benchmarks
BigQuery ML vs Vertex AI
BigQuery ML:
- Best for: Structured data, SQL users, quick prototypes
- Training time: Minutes to hours (depends on data size)
- Scalability: Automatic (serverless)
- Cost: $5/TB processed
Vertex AI Custom Training:
- Best for: Deep learning, custom architectures, GPU/TPU workloads
- Training time: Hours to days (configurable hardware)
- Scalability: Manual (choose machine type)
- Cost: $0.35-20/hour depending on hardware
Rule of thumb:
- Use BigQuery ML for tabular data with < 100M rows
- Use Vertex AI for images, text, audio, or custom models
- Use Vertex AI for models requiring GPU/TPU acceleration
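The rule of thumb can be written as a small decision function. This is a sketch of the three bullets above and nothing more; `choose_platform` is a hypothetical helper, and sending tabular data with 100M rows or more to Vertex AI is an assumption, since the rule does not say what to do in that case.

```python
BQML_ROW_LIMIT = 100_000_000  # threshold from the rule of thumb above


def choose_platform(data_kind, rows=0, needs_accelerator=False):
    """Return 'bigquery-ml' or 'vertex-ai' for a workload description."""
    if needs_accelerator or data_kind != "tabular":
        return "vertex-ai"
    # Assumption: very large tabular workloads also go to Vertex AI.
    return "bigquery-ml" if rows < BQML_ROW_LIMIT else "vertex-ai"
```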
Additional Resources
- GCP ML Documentation: https://cloud.google.com/vertex-ai/docs
- BigQuery ML Reference: https://cloud.google.com/bigquery-ml/docs
- Pricing Calculator: https://cloud.google.com/products/calculator
- TPU Best Practices: https://cloud.google.com/tpu/docs/best-practices
- Vertex AI Samples: https://github.com/GoogleCloudPlatform/vertex-ai-samples