The most expensive sentence in machine learning is "we'll figure out the cost when the bill comes." By then, the GPU cluster has been idle for three days, the inference endpoint has scaled to 32 instances during a 30-second traffic blip, and a single training job has burned through your monthly budget over the weekend.

Every ML team that ships to production eventually meets a cost spike they didn't see coming. Not a slow drift. A sudden, unexplained 5x or 50x line item that prompts an emergency Slack thread on Monday morning. Then a postmortem. Then a guardrail. Then, a few months later, the next spike from a completely different cause.

The frustrating part is that these spikes follow a small number of recognizable patterns. After working with dozens of ML teams running on AWS, we see the same eight categories show up again and again. Below is a field guide to each one — how it happens, what it costs, and how to catch it before you're explaining a $40,000 bill to your CFO.

10–100x: the typical magnitude of a runaway training spike. The detection lag is hours to days — usually long enough that the damage is already done by the time anyone notices.

1. Training Job Failures and Runaway Processes

The silent killer of ML budgets is the training job that looks like it's running. The GPUs are warm, CloudWatch shows activity, the dashboard is green. But the loss has been NaN for six hours, the data pipeline is blocked, and the only thing those eight p4d.24xlarge instances are doing is converting electricity into heat at $256 an hour.

A typical scenario looks something like this:

import sagemaker

# image_uri and role are required in practice; placeholders keep the example short.
job = sagemaker.estimator.Estimator(
    image_uri="<training-image-uri>",
    role="<execution-role-arn>",
    instance_type="ml.p4d.24xlarge",  # $32/hr
    instance_count=8,                 # × 8 = $256/hr
    max_run=86400                     # 24hr timeout... maybe (value is in seconds)
)
job.fit(data)                         # "data" stands in for the S3 input channel(s)

# What actually happens:
# Hour 1:   Training starts normally       $256
# Hour 6:   NaN loss, job "stalls"         $1,536
# Hour 24:  Timeout fires... or does it?   $6,144
# Hour 72:  Someone checks Slack           $18,432
# Hour 168: Weekend is over                $43,008

The failure modes are mundane and almost endless. A timeout that's set in seconds when the engineer thought it was hours. A checkpoint write that hangs and pins the cluster. A distributed job where one node dies and the others wait forever. A spot instance that gets preempted and silently retries on on-demand. We once worked with a team where a researcher set max_run=24 thinking "24 hours" — SageMaker interpreted it as 24 seconds, the job restarted on auto-retry, and 100 attempts later they had spent $12,000 on zero useful work.

The common thread: GPUs default to keep running, and "running" is not the same as "making progress."
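One cheap defense is making the training loop check its own progress. Below is a minimal, framework-agnostic sketch; the helper and its thresholds are hypothetical, not part of any SDK. The idea: abort as soon as the loss goes non-finite or stops improving, so a stalled job fails fast instead of billing until the timeout fires.

import math

def check_progress(loss_history, patience_steps=500):
    """Abort when loss is NaN/inf or hasn't improved for `patience_steps` steps."""
    latest = loss_history[-1]
    if not math.isfinite(latest):
        raise RuntimeError(f"Loss is {latest}: stopping instead of burning GPU hours.")
    if len(loss_history) > patience_steps:
        recent_best = min(loss_history[-patience_steps:])
        earlier_best = min(loss_history[:-patience_steps])
        if recent_best >= earlier_best:
            raise RuntimeError(f"No improvement in {patience_steps} steps: stopping early.")

# Inside the training loop (illustrative):
# losses.append(loss.item())
# check_progress(losses)

Raising an exception is the point: a crashed job releases the instances, while a "running" one keeps billing.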

2. Inference Autoscaling Gone Wrong

Autoscaling is supposed to save money. Misconfigured, it does the opposite, quickly and at scale. A 30-second traffic spike triggers a scale-up to 32 instances. The scale-in cooldown is 60 minutes. You just paid $128 for 30 seconds of demand.

Four traps appear over and over:

  • Cooldown mismatch. Default scale-in cooldown is 300 seconds. Your traffic spikes last 10 seconds. Net result: instances stay at full scale for 5 minutes after demand has vanished — a 30x cost multiplier on what you actually needed.
  • Wrong scaling metric. The autoscaler watches CPU utilization on a GPU workload. The GPU is at 95%, the CPU is at 15%, and the autoscaler concludes everything is fine. Meanwhile, the endpoint is timing out.
  • Min instances set to zero. A "cost optimization" that backfires when a viral spike hits and cold start for a large model is 8 to 12 minutes. Users retry, the request queue explodes, and the system scales to 50 instances to clear the backlog.
  • Multi-model endpoints. One model goes viral and the entire shared endpoint scales for it — including all the other models that are still serving 10 requests a minute on now-expensive instances.

Autoscaling is dangerous because the failure mode is silent. The endpoint stays up. The metrics look fine. The bill is the only place the problem shows up, and it shows up two weeks late.
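Most of these traps have configuration-level fixes. Here is a hedged sketch using the Application Auto Scaling API against a hypothetical SageMaker endpoint: a non-zero floor to avoid multi-minute cold starts, a hard ceiling on instance count, and cooldowns tuned to the traffic shape rather than left at defaults. The specific numbers are illustrative, not recommendations.

import boto3

autoscaling = boto3.client("application-autoscaling")
resource_id = "endpoint/my-endpoint/variant/AllTraffic"  # hypothetical endpoint

# Non-zero floor avoids cold starts; hard ceiling caps the blast radius of a blip.
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=8,
)

# Track invocations per instance (not CPU), with cooldowns sized for short spikes.
autoscaling.put_scaling_policy(
    PolicyName="invocations-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,  # target invocations per instance, illustrative
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleOutCooldown": 60,
        "ScaleInCooldown": 120,
    },
)

Load-testing the policy against a replay of real traffic is the only way to confirm the cooldown values fit your workload.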

3. The Egress Trap and Storage Creep

Most ML teams budget for compute and storage. Almost none budget for data transfer, which is where AWS quietly bills $0.02/GB for cross-region traffic, $0.09/GB for egress to the internet, and $0.01/GB even between availability zones in the same region.

Here's the math that catches teams off guard. A training job that reads 10TB of data over 50 epochs is moving 500TB. If your data lives in eu-west-1 and your dev environment spins up in us-east-1 because someone forgot to change the default region, that 500TB at $0.02/GB cross-region transfer is $10,000 — on top of the compute cost, and not counted in any cost estimate the team produced before the run.
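A cheap pre-flight check catches this class of mistake before the first epoch is read. A sketch, assuming the training data lives in a single S3 bucket (the bucket name is a placeholder):

import boto3

bucket = "my-training-data"  # hypothetical bucket name
compute_region = boto3.session.Session().region_name

# get_bucket_location returns None for us-east-1.
bucket_region = (
    boto3.client("s3").get_bucket_location(Bucket=bucket)["LocationConstraint"]
    or "us-east-1"
)

if bucket_region != compute_region:
    raise RuntimeError(
        f"Bucket {bucket} is in {bucket_region} but compute runs in {compute_region}: "
        "every epoch will pay cross-region transfer."
    )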

$2,755: typical monthly waste from storage creep on a productive ML team after six months — 50TB of abandoned checkpoints, 20TB of duplicate datasets, 5TB of verbose logs, 10TB of "temporary" files nobody touched in four months, plus EBS snapshots no one deletes.

Storage creep is the slowest of the cost spikes, but it's also the most reliable. Six months in, a team with no S3 lifecycle policies will have multiple TB of duplicate datasets ("just in case"), checkpoints from 800 dead experiments, and logs from a verbose framework default that nobody turned off. The spike doesn't happen in one month — it happens across all of them.
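The counter-measure is mechanical rather than clever: lifecycle rules that age out checkpoints and temporary files automatically. A hedged sketch, with the bucket name, prefixes, and retention windows all standing in as assumptions:

import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="ml-experiments",  # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-old-checkpoints",
                "Filter": {"Prefix": "checkpoints/"},
                "Status": "Enabled",
                "Expiration": {"Days": 90},
            },
            {
                "ID": "expire-temp-files",
                "Filter": {"Prefix": "tmp/"},
                "Status": "Enabled",
                "Expiration": {"Days": 14},
            },
            {
                "ID": "clean-up-failed-multipart-uploads",
                "Filter": {"Prefix": ""},
                "Status": "Enabled",
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
            },
        ]
    },
)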

4. Forgotten Notebook Instances

The Friday afternoon ritual: a data scientist spins up an ml.p3.8xlarge for a "quick experiment" at $14.69/hour. Lunch happens. Then a meeting. Then another meeting. Then the weekend.

When they log in Monday morning, that one notebook has cost $1,058 to do exactly nothing. Multiply by ten engineers, average one forgotten instance per month, average 40 hours of overrun, and you're looking at $70,000 a year in pure waste from a behavior that no one would defend if you asked them about it directly.

The deeper issue is right-sizing. Engineers reach for GPU notebooks for tasks that don't need them. ml.p3.2xlarge at $3.06/hour for pandas operations that would run faster on an ml.t3.medium at $0.05/hour — a 61x ratio for the same outcome. Most teams don't have a right-sizing policy because no one has ever shown them the cost of not having one.
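A blunt but effective guardrail is a scheduled sweep that stops any notebook instance still running without an explicit keep-alive tag. A sketch (the tag name and the schedule that runs it are assumptions):

import boto3

sm = boto3.client("sagemaker")

def stop_unattended_notebooks(keep_alive_tag="keep-alive"):
    """Stop InService notebook instances that aren't explicitly tagged to stay up."""
    paginator = sm.get_paginator("list_notebook_instances")
    for page in paginator.paginate(StatusEquals="InService"):
        for nb in page["NotebookInstances"]:
            tags = sm.list_tags(ResourceArn=nb["NotebookInstanceArn"])["Tags"]
            if not any(t["Key"] == keep_alive_tag for t in tags):
                sm.stop_notebook_instance(
                    NotebookInstanceName=nb["NotebookInstanceName"]
                )

# Run nightly from a scheduled job (cron, EventBridge rule, etc.).
stop_unattended_notebooks()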

5. API Token Explosions

This category is newer than the others, but it's growing the fastest. As teams add LLM calls to production pipelines — via Bedrock, OpenAI, or Anthropic — they're discovering that the cost model for token-based APIs punishes carelessness in ways that EC2 doesn't.

The most common spike comes from a missing truncation. A developer writes a function that passes {entire_document} into a Bedrock call — fine for a 50,000-token test document, less fine when it's running against 100,000 documents a day. At $0.008 per 1K input tokens, that's $40,000 a day. The intended behavior — chunked input at ~2,000 tokens — would have cost $1,600. The difference: $38,400 a day from one missing line of code.
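The fix is a hard cap applied before the text ever reaches the API. A minimal sketch that uses a rough characters-per-token heuristic instead of a real tokenizer; the 4-chars-per-token ratio and the 2,000-token budget are assumptions:

MAX_INPUT_TOKENS = 2_000
CHARS_PER_TOKEN = 4  # rough heuristic; a real tokenizer is more precise

def clip_to_token_budget(text: str, max_tokens: int = MAX_INPUT_TOKENS) -> str:
    """Hard-cap input size before it reaches a token-priced API."""
    return text[: max_tokens * CHARS_PER_TOKEN]

entire_document = "..."  # stands in for the unbounded input from the example above
prompt = clip_to_token_budget(entire_document)  # instead of passing it raw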

Three other token-cost patterns to watch:

  • Retry storms. A slow API response triggers a client timeout. With no exponential backoff and no jitter, the client retries five times. Every retry succeeds and is billed. 100 users × 5 retries × $0.01 = $5 per incident; at scale, $5,000 per API hiccup. (A backoff sketch follows this list.)
  • Cache-busting prompts. Putting datetime.now() or a user ID in the system prompt means every request is a unique cache key — you pay full price instead of the 90% cached-prefix discount. At 10M requests/day, that's a $27,000/day swing from prompt caching done right vs. wrong.
  • No token budgets. Most teams monitor token spend; few enforce it in code. By the time the dashboard alerts, the damage is hours old.
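For the retry-storm pattern above, the guardrail is a bounded retry loop with exponential backoff and jitter. A minimal sketch; the retry count and base delay are assumptions, and a real client would catch a narrower exception type:

import random
import time

def call_with_backoff(request_fn, max_retries=3, base_delay=1.0):
    """Retry with exponential backoff and jitter; give up after max_retries."""
    for attempt in range(max_retries + 1):
        try:
            return request_fn()
        except Exception:
            if attempt == max_retries:
                raise  # circuit broken: stop paying for doomed retries
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)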

6. Spot Instance Cascades

Spot instances are the right answer most of the time — until they're not. Eight p4d.24xlarge spot instances at $10/hour vs. $32/hour on-demand is a 69% savings. Until AWS hits a capacity crunch, three of your nodes get preempted, the job fails, the auto-retry requests eight new spots, spot is unavailable, and the system silently falls back to on-demand.

You're now paying $256/hour instead of $80/hour. The alert threshold is set high enough that nobody notices for two and a half hours. The job completes — at three times the expected cost — and the spike never shows up in any dashboard because the job "succeeded." This is the failure mode that makes spot a hidden risk, not just a discount.

A cascade like this can also trigger checkpoint I/O bursts that produce surprise EBS throughput charges, which compound the spike further.
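The fallback itself is detectable if the fleet is tagged, because on-demand EC2 instances carry no spot lifecycle marker. A sketch of an hourly check (the tag key and value are assumptions):

import boto3

ec2 = boto3.client("ec2")

def on_demand_fraction(tag_key="workload", tag_value="training"):
    """Fraction of running instances in the tagged fleet that are NOT spot."""
    pages = ec2.get_paginator("describe_instances").paginate(
        Filters=[
            {"Name": f"tag:{tag_key}", "Values": [tag_value]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    total = on_demand = 0
    for page in pages:
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                total += 1
                # InstanceLifecycle is present only for spot (or scheduled) instances.
                if instance.get("InstanceLifecycle") != "spot":
                    on_demand += 1
    return on_demand / total if total else 0.0

if on_demand_fraction() > 0:
    print("ALERT: part of the spot fleet is silently running on-demand.")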

7. The Cost of Cost Monitoring

There's a particular irony to this category: the infrastructure you build to watch your costs has costs of its own, and they grow faster than most teams realize.

CloudWatch custom metrics emitted every second — common in ML training scripts — produce 2.6 million datapoints per metric per month. At $0.02 per 1,000 datapoints, that's $52 per metric per month. Fifty custom metrics later, you're paying $2,600 a month just for the metrics. Verbose framework logging from PyTorch Lightning across ten training nodes can hit 240GB/day in CloudWatch ingestion at $0.50/GB — $3,600 a month before you've stored anything. X-Ray distributed tracing on a high-volume inference endpoint adds segment storage costs that often run 10x the trace cost itself.

None of these are bugs. They're defaults that were reasonable in isolation and ruinous in aggregate. The lesson is that observability is itself a cost center that needs the same scrutiny you apply to compute.
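The per-second metric case in particular has a cheap mitigation: aggregate locally and publish one statistic set per minute instead of sixty raw datapoints. A sketch (the namespace, metric name, and sample values are placeholders):

import boto3

cloudwatch = boto3.client("cloudwatch")

# Sixty per-second samples collected locally (placeholder values).
readings = [87.0, 91.5, 88.2, 90.1] * 15

cloudwatch.put_metric_data(
    Namespace="Training/Monitoring",  # hypothetical namespace
    MetricData=[{
        "MetricName": "gpu_utilization",
        "Unit": "Percent",
        "StatisticValues": {
            "SampleCount": len(readings),
            "Sum": sum(readings),
            "Minimum": min(readings),
            "Maximum": max(readings),
        },
    }],
)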

8. Experiment Proliferation

Six months into a productive team's life, the MLflow or W&B artifact bucket tells a story. 847 experiments. Two gigabytes of metadata each. Ten checkpoints per experiment at 15GB. Total: 127TB of model artifacts, costing roughly $2,921 a month in S3 storage that grows every week and is almost never deleted.

The contributing behaviors are familiar:

  • Saving a checkpoint every epoch instead of just the best one (see the sketch after this list).
  • Never deleting failed experiments — just in case.
  • Storing full model weights inside the experiment tracker rather than referencing them.
  • Logging raw data samples as artifacts.
  • No shared cleanup policy across teams, so no one feels responsible.
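On the first point, most trainers can already keep only the best checkpoints. A sketch assuming PyTorch Lightning is the training framework; the monitored metric and the save_top_k value are assumptions:

from pytorch_lightning.callbacks import ModelCheckpoint

# Keep the 3 best checkpoints by validation loss instead of one per epoch.
checkpoint_cb = ModelCheckpoint(
    monitor="val_loss",
    mode="min",
    save_top_k=3,
    every_n_epochs=1,
)
# Passed to the trainer: Trainer(callbacks=[checkpoint_cb], ...)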

Like storage creep, this isn't a single cost event. It's a slow accretion that becomes load-bearing on the budget without anyone noticing.

The Spike Taxonomy at a Glance

If you map the eight categories by magnitude and detection lag, the picture is clarifying. The dangerous spikes are the ones that combine high magnitude with long detection lag — runaway training, autoscaling bugs, and API token explosions all show up here. The most common ones — forgotten notebooks and storage creep — are slow drains rather than sudden spikes.

Spike Type                   Magnitude       Detection Lag
Runaway training             10–100x         Hours to days
Autoscaling bug              5–30x           Minutes to hours
Data egress                  2–10x           Days (invoice)
Forgotten notebook           Slow drain      Weeks
API token explosion          10–50x          Hours
Retry storm                  5–20x           Minutes to hours
Spot → on-demand fallback    3–5x            Minutes (if alerted)
Storage creep                1.5–3x/month    Months

The Guardrails That Actually Work

Every cost spike, regardless of category, ultimately comes down to the same root cause: cloud resources default to "keep running," and billing is always a lagging indicator. The fix is hard limits and automated cleanup, not better dashboards.

Cost spike detection should be multi-layered:

  • Real-time (minutes): budget alerts at 50%, 80%, and 100% of daily spend, anomaly detection on per-service cost, and an alert when instance count exceeds a threshold.
  • Near-real-time (hours): hourly cost reports to a Slack channel, an alert when GPU utilization sits below 20% for more than 30 minutes, and an alert when a training job runs more than 2x its expected duration.
  • Reactive (daily): a digest with week-over-week comparison, the top five cost drivers highlighted, and an untagged-resource report.
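The first layer is mostly configuration. A hedged sketch of the daily budget alerts using the AWS Budgets API; the daily limit and the notification address are illustrative:

import boto3

account_id = boto3.client("sts").get_caller_identity()["Account"]
budgets = boto3.client("budgets")

budgets.create_budget(
    AccountId=account_id,
    Budget={
        "BudgetName": "ml-daily-spend",
        "BudgetLimit": {"Amount": "500", "Unit": "USD"},  # illustrative daily cap
        "TimeUnit": "DAILY",
        "BudgetType": "COST",
    },
    NotificationsWithSubscribers=[
        {
            "Notification": {
                "NotificationType": "ACTUAL",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": pct,  # percentage of the budget
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": [
                {"SubscriptionType": "EMAIL", "Address": "ml-costs@example.com"}
            ],
        }
        for pct in (50, 80, 100)
    ],
)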

Beyond alerting, here's the short list of guardrails that actually prevent spikes rather than just announcing them after the fact:

Training jobs. Always set a max_run timeout. Cap retry attempts — the default is often unlimited. Tag every job with team, project, and experiment ID. Set up failure alerts, not just success alerts.

Inference. Tune scale-in/scale-out cooldowns to your actual traffic shape. Define a hard maximum instance count. Load-test autoscaling behavior before launch. Set request and token quotas per user or API key.

Development. Auto-stop on every notebook (30–60 minutes idle). A right-sizing policy — no GPU instances for data exploration. Scheduled shutdown for non-prod environments. A regular zombie-instance audit.

Storage. Lifecycle policies on every S3 bucket. Keep the best N checkpoints, not all of them. Cross-region replication only for production artifacts.

APIs. Token budgets enforced in code, not just monitored. Exponential backoff with jitter on every retry. A circuit breaker with a max retry count. Prompt caching enabled and verified working.
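Pulling the training-job items together, here is a hedged sketch against a recent SageMaker Python SDK; the image URI, role, and tag values are placeholders:

from sagemaker.estimator import Estimator

job = Estimator(
    image_uri="<training-image-uri>",
    role="<execution-role-arn>",
    instance_type="ml.p4d.24xlarge",
    instance_count=8,
    max_run=24 * 60 * 60,       # explicit 24-hour cap, expressed in seconds
    max_retry_attempts=2,       # bounded retries instead of open-ended restarts
    tags=[
        {"Key": "team", "Value": "nlp"},
        {"Key": "project", "Value": "ranker"},
        {"Key": "experiment", "Value": "exp-0421"},
    ],
)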

The Underlying Problem: You Can't Prevent What You Can't See

Every guardrail above assumes you know which resources belong to which model, team, and pipeline. In practice, that's where most ML cost programs fall apart. Tagging compliance for ML resources at typical organizations sits at 40–60%. Cross-team tag conventions don't match. EKS-managed GPU instances often don't propagate tags to the underlying EC2. The Cost and Usage Report shows you what AWS charged — but not which model spiked, or why.

This is the gap MLCostIntel was built to close. We ingest your CUR data, automatically map AWS resources to ML workloads using metadata from SageMaker, EKS, and your CI/CD pipeline, and surface anomalies at the model and team level rather than the service level. When a training job goes runaway, you find out in minutes. When a notebook gets forgotten, it shows up on the next morning's digest. When the egress bill on next month's invoice is going to be $10,000 higher than expected, you see it before the invoice arrives.

If you're tired of finding out about ML cost spikes the same week your AWS bill clears, try MLCostIntel free. We connect to your AWS account, ingest your CUR data, and show you what every model, experiment, and pipeline is actually costing — before the next surprise.

MLCostIntel gives fintech and healthcare teams running ML on AWS real-time visibility into runaway training jobs, autoscaling cost overruns, and forgotten resources — before they hit your invoice. Start free →