Feb 17, 2026 · 12 min read
The Script That Grew Legs: Building a Deterministic AWS Discovery Tool
How a 500-line bash script evolved into an automated WAF assessment engine - and why deterministic tooling beats giving AI the wheel.
TL;DR
I built a bash script to automate the painful parts of AWS presales discovery. It started at 500 lines and grew into a modular WAF assessment engine. AI wrote the code; the tool produces the reliable output.
Because sometimes the best use of AI isn't giving it the wheel - it's using it to build a better car.
If you've ever spent three days clicking through the AWS console region by region, copying resource counts into a spreadsheet, then trying to piece together how everything connects before you can even start writing a report - this article is for you.
In Part 1 I talked about Claude Skills and why context is everything when working with AI. This post is about what happens when you take that philosophy and point it at a real problem - one that's been bugging me for years.
The Problem: Presales Discovery is a Pain
Here's the typical flow when scoping a Well-Architected Review for a customer. You talk them through creating a read-only IAM role, wait for approvals, chase up, wait some more. Then you log in, pick a region, click through services, take notes, repeat for every region they use, miss things, go back. You try to map dependencies - which load balancers point to which targets, what's that Lambda talking to, why is there an RDS instance with no connections. Ask the customer. Wait for answers. Get different answers from different people.
Then you piece it all together across regions and services. Cross-reference tagging. Work out what's production vs dev vs "I forgot that was there." Write the report. Hope the puzzle makes sense. Present it with confidence you don't entirely feel.
That's days of work. Sometimes a week if the account is large or the customer is slow to respond. And honestly? Most of it is mechanical. It's not the analysis that takes the time - it's gathering the information to analyse.
The Shiny Object Detour
I'm going to be honest - my first instinct wasn't to write a bash script. That felt sooo 2020, and AWS didn't seem to offer a clean, user-friendly way to get everything out of the system from the UI.
MCPs were the new hotness. I built a DevOps agent that could interrogate AWS accounts directly, triage issues, and produce reports. It worked. Sort of. The scaffolding was messy, the output was inconsistent, and I spent more time debugging the agent than it would have taken to just do the work manually. AWS has since published some solid MCPs that would make this much easier, but at the time it was held together with duct tape and optimism.
The core issue wasn't the tooling though. It was the non-deterministic nature of it all. You give an LLM tools and a carefully crafted prompt, then hope it follows instructions and doesn't hallucinate. For internal experiments? Fine. For customer deliverables? Not fine.
So I went back to basics.
Starting Simple: 500 Lines of Bash
The first version of the AWS Discovery Tool was exactly what you'd expect. A bash script that called the AWS CLI, dumped some JSON, and gave me resource counts. 503 lines. Text output. EC2, RDS, Lambda, S3 - the basics.
I built this on my own, partly to solve a real problem, partly because my brain needed something to chew on, and partly because I'm fundamentally lazy and will spend three weeks automating something that takes two hours.
Was it sophisticated? No. Did it save me 30 minutes of clicking around? Yes. That was enough.
# The entire philosophy in four lines
DATA=$(aws ec2 describe-instances --region "$REGION" --output json)
echo "$DATA" > "$REGION_DIR/ec2-instances.json"
COUNT=$(echo "$DATA" | jq '[.Reservations[].Instances[]] | length')
log " EC2 Instances: $COUNT"
Collect. Store. Count. Move on. Nothing clever. Nothing non-deterministic. Just data.
The "Actually, Could It Also..." Phase
Version 2 doubled in size because it turns out once you have a script that collects inventory, you immediately want it to do more. Dependency mapping (which load balancers talk to which instances). IAM credential reports. CloudFormation detection. Tagging compliance. Structured JSON output instead of text.
By version 2.3 the monolith was north of 2,000 lines and covering analytics, security services, and Lightsail. Still one file. Still sequential. Still working.
Still painfully slow. Running all of this sequentially across every region was bad enough. Running it in CloudShell was worse - sessions timing out mid-collection, losing everything, starting again. Nothing quite like watching 15 minutes of API calls evaporate because AWS decided your session had been open too long. That was the moment "one big script" stopped being viable.
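The dependency mapping from v2 mostly boils down to joining the JSON files the script had already collected. A minimal sketch of the idea, using a canned sample of what `aws elbv2 describe-target-health` returns (the function name and file path are my illustration, not the tool's actual source):

```shell
# Illustrative: pull the instance IDs registered behind a target group
# out of previously collected describe-target-health output.
extract_targets() {
  jq -r '.TargetHealthDescriptions[].Target.Id' "$1"
}

# Canned sample standing in for a file a collection run wrote to disk:
cat > /tmp/tg-health.json <<'EOF'
{"TargetHealthDescriptions":[
  {"Target":{"Id":"i-0abc123","Port":80},"TargetHealth":{"State":"healthy"}},
  {"Target":{"Id":"i-0def456","Port":80},"TargetHealth":{"State":"unhealthy"}}
]}
EOF

extract_targets /tmp/tg-health.json
# i-0abc123
# i-0def456
```

Once every service's output lives on disk as JSON, "which load balancer talks to which instance" becomes a jq join rather than console archaeology.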
Then I ran it on a real customer account.
The Customer Account Moment
This is where it went from "useful side project" to "I'm not doing this manually ever again."
The customer ran the tool themselves and provided the output. What came back aligned well with the notes I'd taken during the discovery call - the deployed architecture mapped closely to what they'd described, which was reassuring. But it also surfaced gaps. Services nobody mentioned. Resources in regions that "aren't used." The kind of things you'd normally find three weeks into an engagement when someone casually says "oh, that old thing."
More importantly, it identified gaps in the tooling itself. Services I hadn't thought to check. Dependency paths I hadn't mapped. Edge cases that only show up in production accounts with five years of accumulated decisions. Those gaps drove most of the v4 features.
Report generation was still manual at that point, but I had the context I needed. I was starting from a position of confidence rather than guesswork. I knew what was there. I knew what was tagged. I knew what was connected to what. The puzzle was already half assembled.
That experience drove everything that came next.
Going Modular: v4.0
The monolith had to go. Not because it didn't work, but because adding anything new meant scrolling through 2,000 lines and hoping you didn't break something three functions away.
Version 4.0 split everything into 18 regional modules:
lib/modules/
├── networking.sh # VPCs, subnets, NAT, VPN, Direct Connect
├── compute.sh # EC2, EBS, snapshots, ASGs
├── containers.sh # ECS, EKS, ECR
├── serverless.sh # Lambda, API Gateway, Step Functions
├── databases.sh # RDS, Aurora, DynamoDB, ElastiCache
├── security.sh # KMS, Secrets Manager, WAF, GuardDuty
├── ... and 12 more
Each module has the same interface. Collect data, write JSON, output a summary fragment. The orchestrator runs regions in parallel (3 workers by default), and a build script assembles everything back into a single file for CloudShell deployment.
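To make that contract concrete, here's a stripped-down sketch of the module shape (function and file names are illustrative, not the tool's actual source, and the AWS call is stubbed so the example runs without credentials):

```shell
# Module contract: collect data, write JSON, append a summary fragment.
collect_example() {
  local region="$1" outdir="$2"
  mkdir -p "$outdir"
  # In a real module this is an AWS CLI call via the safe wrapper; stubbed here.
  echo '{"Instances":[]}' > "$outdir/example.json"
  echo "  Example resources: 0" >> "$outdir/summary.txt"
}

collect_example eu-west-1 /tmp/demo/eu-west-1
cat /tmp/demo/eu-west-1/summary.txt
```

Because every module looks like this, the orchestrator doesn't care what a module collects - it just hands over a region and a directory, then stitches the summary fragments together at the end.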
The architecture decision here was deliberate: region-level parallelism, not module-level. AWS API rate limits are per-region. Running modules in parallel within the same region would just cause throttling. Running regions in parallel uses separate quota buckets - essentially free parallelism. Most customers only use a handful of regions anyway, but I wanted full coverage. Partly because thoroughness matters for a WAF review. Partly because I wanted to be the one to tell the customer about their 17 default VPCs sitting there doing nothing.
Physics over clever.
And because bash subshells don't share variables with their parents, all communication goes through the filesystem. No shared state. No race conditions. Just files on disk.
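A sketch of that pattern, with illustrative names: each region runs in a background subshell, at most three at a time, and the only "return value" is the files it leaves behind.

```shell
run_region() {                # stand-in for running all 18 modules
  local region="$1" base="$2"
  mkdir -p "$base/$region"
  echo "complete" > "$base/$region/status"   # state lives on disk, not in variables
}

BASE=/tmp/disco
WORKERS=3
i=0
for region in eu-west-1 eu-west-2 us-east-1 ap-southeast-2; do
  run_region "$region" "$BASE" &
  i=$((i + 1))
  # Simple batching: once a batch of workers is full, wait for all of them.
  [ $((i % WORKERS)) -eq 0 ] && wait
done
wait   # drain the final partial batch
ls "$BASE"
```

The parent shell never reads a variable set by a worker - it reads the worker's files after `wait` returns, which is what makes the whole thing race-free.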
Here's how the same simple pattern from v1 evolved to handle pagination and error tracking:
# v4: Same philosophy - collect, store, count - but now handles
# pagination automatically and tracks what went wrong
DATA=$(safe_aws_paged aws ec2 describe-instances \
--region "$REGION" --output json \
'.Reservations[].Instances')
echo "$DATA" > "$REGION_DIR/ec2-instances.json"
COUNT=$(echo "$DATA" | jq 'length' 2>/dev/null || echo "0")
log " EC2 Instances: $COUNT"
And the batch parallelism pattern within modules:
# Collect independent services in parallel, wait when batch is full
safe_aws_paged aws ecs list-clusters --region "$REGION" \
--output json '.clusterArns' > "$REGION_DIR/ecs-clusters.json" &
safe_aws_paged aws eks list-clusters --region "$REGION" \
--output json '.clusters' > "$REGION_DIR/eks-clusters.json" &
wait_if_batch_full 5 # Don't exceed 5 concurrent API calls per region
Same four-step pattern. safe_aws_paged handles multi-page responses and records warnings if something fails. The batch limiter keeps concurrent calls within a module from tripping throttling. The complexity is hidden; the philosophy hasn't changed.
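For the curious, `wait_if_batch_full` doesn't need to be complicated. One plausible implementation (a sketch, not the tool's actual source) counts running background jobs and blocks until the batch drains below the limit:

```shell
wait_if_batch_full() {
  local limit="$1"
  # jobs -rp lists the PIDs of running background jobs in this shell
  while [ "$(jobs -rp | wc -l)" -ge "$limit" ]; do
    wait -n 2>/dev/null || wait   # wait -n (bash 4.3+) frees one slot; else drain all
  done
}

# Demo: three fake "API calls" in flight, batch limit of three.
sleep 1 & sleep 1 & sleep 1 &
wait_if_batch_full 3              # blocks until at least one job finishes
```

The `wait -n` fallback matters for CloudShell, where you don't always control the bash version.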
The Problems You Don't See Coming
Pagination
Here's a fun one. AWS APIs paginate their responses. Some return 20 results per page, some return 100. If you don't handle pagination, your script silently tells you the customer has 20 S3 buckets when they actually have 200.
I didn't notice this until I ran it on a large account and the numbers looked... optimistic.
safe_aws_paged now handles this automatically - detects pagination tokens in responses, fetches all pages, merges the results. Safety limit of 50 pages because infinite loops in production are bad for your health.
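The core of that loop looks roughly like this (simplified - the real wrapper also records failures; here the `aws` CLI is stubbed with a function serving two canned pages so the sketch runs without credentials):

```shell
aws() {  # stub: first call returns a NextToken, second call is the last page
  case "$*" in
    *--starting-token*) echo '{"Buckets":["b3","b4"]}' ;;
    *)                  echo '{"Buckets":["b1","b2"],"NextToken":"t1"}' ;;
  esac
}

paged_collect() {
  local jq_path="$1"; shift
  local token="" page merged="[]" pages=0
  while [ "$pages" -lt 50 ]; do         # hard page cap: no infinite loops
    if [ -n "$token" ]; then
      page=$(aws "$@" --starting-token "$token")
    else
      page=$(aws "$@")
    fi
    # Merge this page's results into the running array
    merged=$(jq -cn --argjson a "$merged" \
                    --argjson b "$(echo "$page" | jq -c "$jq_path")" '$a + $b')
    token=$(echo "$page" | jq -r '.NextToken // empty')
    pages=$((pages + 1))
    [ -z "$token" ] && break
  done
  echo "$merged"
}

paged_collect '.Buckets' s3api list-buckets
# ["b1","b2","b3","b4"]
```

Detect the token, fetch the next page, merge, stop when the token disappears. Boring code, but it's the difference between 20 buckets and 200.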
Silent Failures
The original safe_aws wrapper suppressed errors and returned empty on failure. Great for not crashing. Terrible for knowing why your Lambda count is zero - is it because there are no Lambdas, or because you don't have permission to list them?
Error classification and warning tracking now records every failed API call: what service, what operation, what error type. The executive summary surfaces access issues so you know the difference between "nothing there" and "couldn't look."
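The classify-and-record idea is simple enough to sketch (names are illustrative; the CLI is stubbed with a function that always fails with an access error so the example runs anywhere):

```shell
aws() { echo "An error occurred (AccessDenied) when calling the operation"; return 1; }

WARN_FILE=/tmp/discovery-warnings.jsonl
: > "$WARN_FILE"

safe_aws() {
  local service="$1" op="$2" out err
  shift 2
  if ! out=$(aws "$service" "$op" "$@" 2>&1); then
    case "$out" in
      *AccessDenied*|*UnauthorizedOperation*) err="access_denied" ;;
      *Throttling*|*TooManyRequests*)         err="throttled" ;;
      *)                                      err="other" ;;
    esac
    printf '{"service":"%s","operation":"%s","error":"%s"}\n' \
      "$service" "$op" "$err" >> "$WARN_FILE"
    echo ""          # still return empty - but the failure is now on record
    return 0
  fi
  echo "$out"
}

safe_aws lambda list-functions --region eu-west-1
cat "$WARN_FILE"
# {"service":"lambda","operation":"list-functions","error":"access_denied"}
```

The summary step then only has to read one warnings file to tell "nothing there" apart from "couldn't look".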
Performance
Running 18 modules sequentially across 15 regions is slow. Batched parallel API calls within modules, parallel tag fetching, and reusing data between modules (the cost module reads EBS data that compute already collected) brought things from "go make a coffee" to "wait a moment."
The WAF Scoring Moment
Here's where it gets interesting.
Each version solved a real problem, and each solution made the next step obvious. Inventory leads to dependency mapping. Dependency mapping leads to tagging compliance. Tagging compliance leads to "well, I'm already collecting all these WAF-relevant signals..."
So I mapped all 59 Well-Architected Framework questions across six pillars to the signals the tool was already collecting. Backup retention? I have that. Multi-AZ deployment? Checking it. Encryption at rest? Collecting KMS data. Current-generation instances? Classifying them.
The scoring engine evaluates each question against the evidence, produces per-pillar scores, and flags questions where the automated data isn't sufficient - where a human reviewer needs to step in with judgement.
{
"pillar": "Reliability",
"score": 3,
"max_score": 5,
"questions_assessed": 10,
"insufficient_data": 2,
"evidence_summary": "Multi-AZ: 60% of RDS instances. Backup: 85% coverage..."
}
This was the moment the tool stopped being just inventory collection. It started producing assessments. Not replacing the human review - augmenting it. Giving the reviewer a starting position that's grounded in evidence rather than gut feel.
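To make one of those evidence signals concrete, here's the kind of jq arithmetic involved, run against canned RDS data (the file path is invented for the example; `MultiAZ` is the real field name from the RDS API):

```shell
# Canned sample standing in for collected RDS output:
cat > /tmp/rds-instances.json <<'EOF'
{"DBInstances":[
  {"DBInstanceIdentifier":"prod-db","MultiAZ":true},
  {"DBInstanceIdentifier":"dev-db","MultiAZ":false}
]}
EOF

# Percentage of RDS instances deployed Multi-AZ - one reliability signal.
pct=$(jq '[.DBInstances[].MultiAZ] | (map(select(.)) | length) * 100 / length' \
      /tmp/rds-instances.json)
echo "Multi-AZ: ${pct}% of RDS instances"
# Multi-AZ: 50% of RDS instances
```

Dozens of small signals like this roll up into the per-pillar scores - each one traceable back to the JSON file it came from.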
And remember those Claude Skills from Part 1? This is where they come back. The structured, deterministic JSON output from the AWS Discovery Tool feeds directly into a WAF assessment skill that generates consistently formatted reports. Deterministic collection, AI-assisted analysis. Best of both worlds.
What I Actually Learned
Did I need the MCP agent? No. Did I need the 500-line monolith? Absolutely - because it showed me what the 2,000-line version needed to be. Did I need to over-engineer a bash script into a modular assessment engine because I was too lazy to keep doing things manually? Apparently yes.
The MCP agent was technology looking for a use case. The bash script was a use case looking for the simplest solution. Guess which one I'm still using.
When someone's paying you for an assessment, "it usually produces the right output" isn't good enough. The tool produces the same output for the same input, every time. The AI layer sits on top for analysis, where non-determinism is acceptable because a human reviews it. That's the split that matters.
Each version solved the problem I had that week. Not the problem I might have next month. 500 lines became 2,000 became 18 modules because each step earned the next one. Progressive enhancement isn't just a web development pattern.
Claude helped write every module, every test, every plan document. But the decisions - what to collect, what to skip, what matters for a WAF review - those are human calls informed by the frustration of doing this work manually. The AI doesn't know that a customer's RDS instance running MySQL 5.7 is an urgent conversation, not just a data point. It doesn't know that a security group with 0.0.0.0/0 on port 3389 in a production account means someone needs a phone call, not a report footnote.
What's Next
The tool now handles configurable sampling, three-wave region probing, coverage gap analysis, and as of this week, multi-account orchestration across an entire AWS Organisation. But that's a story for the next post.
What I will say is this: the script that started as "let me see what's in this account" now produces structured evidence for automated Well-Architected assessments across entire organisations. Each version was a direct response to a real problem encountered on a real engagement.
No roadmap. No product vision. Just "this would have been useful yesterday."
Is it perfect? No. Is it saving me days of work per engagement? Yes. Would I go back to clicking through the console? Absolutely not.
If you're doing WAFR assessments or presales discovery and want to try it, I plan to open-source this after some final polish on the multi-account orchestration. Contributions, feedback, and "you missed this service" messages are all welcome.
Next up: multi-account orchestration, coverage ledgers, and turning structured evidence into customer-ready reports with Claude Skills.
✦ Key Takeaways
- Deterministic tooling produces trustworthy customer deliverables - non-deterministic agents don't (yet)
- Start with 500 lines that solve today's problem. Tomorrow's problem tells you what to add next
- AI is brilliant at writing code. Humans are still needed for 'is this actually useful?'