aws-troubleshoot

Troubleshoot AWS services using the AWS CLI. Focus on EKS, S3, ECR, EC2, SSM, networking, site-to-site VPNs, IAM Identity Center, and IAM.

Install this agent skill to your Project

npx add-skill https://github.com/timbuchinger/loadout/tree/main/skills/aws-troubleshoot

SKILL.md

AWS Troubleshooting Skill

General Guidance

Use AWS CLI commands for all AWS service interactions, logs, metrics, and audit events.

Important: When a specific AWS profile is needed, append --profile <profile-name> as the last argument of the command. Keeping the profile flag at the end leaves the command prefix unchanged (for example, aws s3 ls), so the command can be matched by wildcard-prefix rules in auto-allow lists.

Example: aws s3 ls s3://my-bucket --profile production

All investigations should:

  1. Scope log queries by log group and time window
  2. Check CloudTrail for failed API calls
  3. Check service-specific metrics before forming a hypothesis
  4. Recommend the smallest correction that resolves the issue
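
The first two points can be sketched with the AWS CLI as follows. This is a minimal sketch, not a definitive implementation: the function names are hypothetical, and the log group, event name, time window, and profile are placeholders supplied by the caller.

```shell
# Hypothetical helpers; every argument is a placeholder supplied by the caller.

# Search one log group within a time window (epoch milliseconds).
scoped_log_search() {
  local log_group="$1" start_ms="$2" end_ms="$3" pattern="$4" profile="$5"
  aws logs filter-log-events \
    --log-group-name "$log_group" \
    --start-time "$start_ms" --end-time "$end_ms" \
    --filter-pattern "$pattern" \
    --max-items 100 \
    --profile "$profile"
}

# List CloudTrail events for one API call name within a time window.
# Failed calls carry an errorCode field inside each CloudTrailEvent document.
cloudtrail_events() {
  local event_name="$1" start="$2" end="$3" profile="$4"
  aws cloudtrail lookup-events \
    --lookup-attributes "AttributeKey=EventName,AttributeValue=${event_name}" \
    --start-time "$start" --end-time "$end" \
    --max-items 50 \
    --query 'Events[].CloudTrailEvent' \
    --profile "$profile"
}
```

Example calls, assuming a log group named /aws/lambda/my-fn and a profile named production: scoped_log_search /aws/lambda/my-fn 1704067200000 1704070800000 ERROR production, and cloudtrail_events AssumeRole 2024-01-01T00:00:00Z 2024-01-01T01:00:00Z production. Both place --profile last, in line with the guidance above.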

Core Services Covered

EKS

Common issues:

  • Image pull errors
  • Pod pending (CNI/IP exhaustion)
  • CrashLoopBackOff
  • Node NotReady

Investigations:

  • Query pod logs
  • Query pod events
  • Inspect node status and cluster metrics
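
A minimal sketch of the cluster and node checks on the AWS side, assuming managed node groups. The function names are hypothetical, and the cluster, node group, and profile are placeholders. Pod logs and events are usually read with kubectl after running aws eks update-kubeconfig for the cluster.

```shell
# Hypothetical helpers; cluster, node group, and profile are placeholders.

# Show control plane status, Kubernetes version, and API endpoint.
eks_cluster_status() {
  local cluster="$1" profile="$2"
  aws eks describe-cluster --name "$cluster" \
    --query 'cluster.{status:status,version:version,endpoint:endpoint}' \
    --profile "$profile"
}

# Show managed node group status and any health issues EKS reports.
eks_nodegroup_health() {
  local cluster="$1" nodegroup="$2" profile="$3"
  aws eks describe-nodegroup \
    --cluster-name "$cluster" --nodegroup-name "$nodegroup" \
    --query 'nodegroup.{status:status,issues:health.issues}' \
    --profile "$profile"
}
```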

S3

Common issues:

  • AccessDenied
  • Incorrect KMS key
  • BlockPublicAccess conflicts

Investigations:

  • Query S3 server access logs
  • Inspect CloudTrail for denied events
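
Before reading logs, the bucket configuration behind the common issues above can be checked directly. The following is a sketch under stated assumptions: the function name is hypothetical, and the bucket and profile are placeholders.

```shell
# Hypothetical helper; bucket and profile are placeholders.
# Prints the bucket's Block Public Access settings, its default
# encryption (including any KMS key), and whether server access
# logging is enabled and where the logs are delivered.
s3_bucket_checks() {
  local bucket="$1" profile="$2"
  aws s3api get-public-access-block --bucket "$bucket" --profile "$profile"
  aws s3api get-bucket-encryption --bucket "$bucket" --profile "$profile"
  aws s3api get-bucket-logging --bucket "$bucket" --profile "$profile"
}
```

If get-bucket-logging returns an empty result, server access logging is not enabled for the bucket, and CloudTrail becomes the main source for denied requests.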

ECR

Common issues:

  • Token expiration
  • Missing permissions
  • Architecture mismatch

Investigations:

  • Search CloudTrail for ecr:* denied actions
  • Inspect repository push/pull failures
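
A short sketch for inspecting repository permissions and image metadata. The function names are hypothetical, and the repository, tag, and profile are placeholders. The manifest media type is included as a hint for architecture mismatches, since a manifest list or image index media type suggests a multi-architecture image.

```shell
# Hypothetical helpers; repository, tag, and profile are placeholders.

# Print the repository policy that governs push/pull permissions.
ecr_repo_policy() {
  local repo="$1" profile="$2"
  aws ecr get-repository-policy --repository-name "$repo" \
    --query 'policyText' --output text \
    --profile "$profile"
}

# Show digest, push time, and manifest media type for one tag.
ecr_image_details() {
  local repo="$1" tag="$2" profile="$3"
  aws ecr describe-images \
    --repository-name "$repo" --image-ids "imageTag=${tag}" \
    --query 'imageDetails[].{digest:imageDigest,pushed:imagePushedAt,mediaType:imageManifestMediaType}' \
    --profile "$profile"
}
```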

EC2

Common issues:

  • Failed instance boot
  • ENI/network issues
  • IMDSv2 access failures

Investigations:

  • Check EC2 instance status checks
  • Inspect system logs and VPC configuration
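
The checks above might be scripted like this. It is a sketch only: the function names are hypothetical, and the instance ID and profile are placeholders.

```shell
# Hypothetical helpers; instance ID and profile are placeholders.

# System and instance status checks, including stopped instances.
ec2_status_checks() {
  local instance_id="$1" profile="$2"
  aws ec2 describe-instance-status \
    --instance-ids "$instance_id" --include-all-instances \
    --query 'InstanceStatuses[].{state:InstanceState.Name,system:SystemStatus.Status,instance:InstanceStatus.Status}' \
    --profile "$profile"
}

# Console (system log) output, useful for failed boots.
ec2_console_output() {
  local instance_id="$1" profile="$2"
  aws ec2 get-console-output --instance-id "$instance_id" \
    --query 'Output' --output text \
    --profile "$profile"
}

# Instance metadata options; HttpTokens=required means IMDSv2 only.
ec2_imds_options() {
  local instance_id="$1" profile="$2"
  aws ec2 describe-instances --instance-ids "$instance_id" \
    --query 'Reservations[].Instances[].MetadataOptions' \
    --profile "$profile"
}
```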

SSM (Systems Manager)

Common issues:

  • Agent not running
  • Missing IAM permissions
  • Instance not registered
  • Command execution failures

Investigations:

  • Check SSM agent status on instances
  • Query command execution history
  • Inspect CloudTrail for SSM API failures
  • Validate instance profile permissions
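
Registration and command history can be checked from the CLI without logging in to the instance. A minimal sketch, with hypothetical function names and placeholder instance ID and profile:

```shell
# Hypothetical helpers; instance ID and profile are placeholders.

# Registration and agent state. An empty list suggests the instance
# has not registered with Systems Manager.
ssm_registration() {
  local instance_id="$1" profile="$2"
  aws ssm describe-instance-information \
    --filters "Key=InstanceIds,Values=${instance_id}" \
    --query 'InstanceInformationList[].{ping:PingStatus,agent:AgentVersion,lastPing:LastPingDateTime}' \
    --profile "$profile"
}

# Recent command invocations on the instance, with plugin output.
ssm_command_history() {
  local instance_id="$1" profile="$2"
  aws ssm list-command-invocations \
    --instance-id "$instance_id" --details \
    --max-items 10 \
    --profile "$profile"
}
```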

Networking & VPN

Common issues:

  • Route mismatches
  • NACL/Security Group blocks
  • VPN tunnel down

Investigations:

  • Query CloudWatch metrics for VPN TunnelState
  • Validate routing tables and security groups
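
For site-to-site VPNs, current tunnel telemetry and the TunnelState metric history can be pulled as sketched below. The function names are hypothetical, and the VPN connection ID, time window, and profile are placeholders. A TunnelState value of 0 indicates a tunnel that is down.

```shell
# Hypothetical helpers; VPN ID, time window, and profile are placeholders.

# Current status of each tunnel on the connection.
vpn_tunnel_telemetry() {
  local vpn_id="$1" profile="$2"
  aws ec2 describe-vpn-connections --vpn-connection-ids "$vpn_id" \
    --query 'VpnConnections[].VgwTelemetry[].{ip:OutsideIpAddress,status:Status,message:StatusMessage}' \
    --profile "$profile"
}

# TunnelState history in 5-minute periods. The Minimum statistic
# surfaces any period in which a tunnel was down.
vpn_tunnel_state_metric() {
  local vpn_id="$1" start="$2" end="$3" profile="$4"
  aws cloudwatch get-metric-statistics \
    --namespace AWS/VPN --metric-name TunnelState \
    --dimensions "Name=VpnId,Value=${vpn_id}" \
    --start-time "$start" --end-time "$end" \
    --period 300 --statistics Minimum \
    --profile "$profile"
}
```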

IAM Identity Center (SSO)

Common issues:

  • User not assigned
  • Permission set mismatch

Investigations:

  • Inspect CloudTrail sign-in and activity events for SSO authentication issues
  • Validate permission sets
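
Assignments and permission sets can be listed with the sso-admin commands. This is a sketch under stated assumptions: the function names are hypothetical, and the instance ARN, account ID, permission set ARN, and profile are placeholders.

```shell
# Hypothetical helpers; ARNs, account ID, and profile are placeholders.

# Find the IAM Identity Center instance ARN.
sso_instances() {
  local profile="$1"
  aws sso-admin list-instances \
    --query 'Instances[].InstanceArn' \
    --profile "$profile"
}

# List who is assigned a permission set in a given account.
sso_account_assignments() {
  local instance_arn="$1" account_id="$2" permission_set_arn="$3" profile="$4"
  aws sso-admin list-account-assignments \
    --instance-arn "$instance_arn" \
    --account-id "$account_id" \
    --permission-set-arn "$permission_set_arn" \
    --profile "$profile"
}
```

If the expected user or group does not appear in the assignment list, that points to the "User not assigned" case above.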

IAM

Common issues:

  • AccessDenied
  • Incorrect role assumption

Investigations:

  • Query CloudTrail for denied API events
  • Identify missing permissions
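
To confirm which identity is in use and whether it is allowed a given action, something like the following may help. The function names are hypothetical, and the principal ARN, action, resource ARN, and profile are placeholders. Simulation results are a guide and may not reflect every control that applies to a real request.

```shell
# Hypothetical helpers; ARNs, action, and profile are placeholders.

# Show the account and role or user the profile resolves to.
iam_whoami() {
  local profile="$1"
  aws sts get-caller-identity --profile "$profile"
}

# Simulate one action on one resource for a principal's policies.
iam_simulate() {
  local principal_arn="$1" action="$2" resource_arn="$3" profile="$4"
  aws iam simulate-principal-policy \
    --policy-source-arn "$principal_arn" \
    --action-names "$action" \
    --resource-arns "$resource_arn" \
    --query 'EvaluationResults[].{action:EvalActionName,decision:EvalDecision}' \
    --profile "$profile"
}
```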

Workflow

  1. Identify service
  2. Query scoped logs
  3. Query CloudTrail for denied API calls
  4. Query metrics when relevant
  5. Diagnose using AWS-specific heuristics
  6. Provide safe remediation steps
