Cloud Infrastructure for AI: Compute, Cost, and Carbon
Abstract
AI engineering is not just code; it is logistics. The transition from a laptop CPU to a cloud-based GPU cluster is where most prototypes die, either due to technical bottlenecks (Out of Memory errors) or financial exhaustion (leaving a $32/hour instance running over the weekend). This post establishes the discipline of AI FinOps. We dissect the architecture of cloud compute, defining exactly when to use expensive GPUs, how to leverage Spot instances without losing data, and how to automate infrastructure to prevent "zombie instances" from bankrupting your project or needlessly burning carbon.
1. Why This Topic Matters
The cloud is an infinite resource, but your budget is not.
- The "Weekend" Failure: A p4d.24xlarge instance on AWS costs approximately $32/hour. Leaving it running from Friday 5 PM to Monday 9 AM (64 hours) costs $2,048 for zero work. This is a fireable offense at many startups.
- The VRAM Wall: You cannot "download more RAM." Provision an A10G (24GB of VRAM) and load Llama-3 8B in 32-bit floats (~32GB for the weights alone), and it will crash immediately with an Out of Memory error. Understanding hardware constraints is a prerequisite for deployment.
- Carbon Responsibility: Training a single large model can emit roughly as much carbon as five cars over their entire lifetimes (Strubell et al., 2019). Efficient infrastructure is not just cheaper; it is an ethical imperative.
2. Core Concepts & Mental Models
CPU vs. GPU: The Parallelism Shift
- CPU (Central Processing Unit): Few powerful cores. Great for sequential logic (databases, web servers, loops).
- GPU (Graphics Processing Unit): Thousands of weaker cores. Great for matrix multiplication (deep learning).
- Analogy: A CPU is a Ferrari (fast for one person). A GPU is a fleet of 5,000 buses (slow individually, but moves 200,000 people at once).
VRAM (Video RAM)
This is the single most critical constraint in AI hardware.
- Model Weights: FP32 (4 bytes/param), FP16 (2 bytes/param), INT8 (1 byte/param).
- Overhead: Training also stores gradients, optimizer states, and activations, typically requiring 2x-4x the memory of the weights alone.
- Rule of Thumb: To run inference on a 7B-parameter model in FP16, you need ~14GB of VRAM for the weights alone. A 16GB GPU (like a T4) is the bare minimum; a 24GB A10G gives comfortable headroom.
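The rule of thumb above can be turned into a quick estimator. The bytes-per-parameter figures come from the list above; the 1.2x overhead factor (KV cache, CUDA context) is an illustrative assumption:

```python
# Rough VRAM estimator for inference. Bytes-per-parameter match the
# precisions listed above; the 1.2x overhead factor is an assumption.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

def estimate_vram_gb(params_billion: float, precision: str, overhead: float = 1.2) -> float:
    # 1B params at 2 bytes each is ~2 GB of weights
    weights_gb = params_billion * BYTES_PER_PARAM[precision]
    return weights_gb * overhead

# A 7B model in FP16: 14 GB of weights, ~16.8 GB with overhead.
print(f"{estimate_vram_gb(7, 'fp16'):.1f} GB")
```

Run the same numbers for FP32 and you see why a 7B model will not fit on a 24GB card without quantization.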
Pricing Models
- On-Demand: Pay full price, available instantly. (Use for: Inference, dev environments).
- Reserved/Savings Plans: Commit to 1-3 years for ~40% discount. (Use for: Base load production inference).
- Spot Instances: Rent spare capacity at up to ~90% discount, with no availability guarantee. (Use for: Batch training, experiments).
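The three pricing models can be compared for a concrete job. The $32/hour on-demand rate and the discount factors come from the figures above; the 15% rerun overhead for Spot interruptions is an illustrative assumption:

```python
# Illustrative cost comparison for a 100-hour single-GPU training job.
# The $32/hr rate and the ~40% reserved / ~90% spot discounts come from
# the text; the 15% rerun overhead for spot interruptions is an assumption.
ON_DEMAND_RATE = 32.0  # $/hour

def job_cost(hours: float, rate: float, rerun_overhead: float = 0.0) -> float:
    return hours * rate * (1 + rerun_overhead)

on_demand = job_cost(100, ON_DEMAND_RATE)
reserved = job_cost(100, ON_DEMAND_RATE * 0.6)
spot = job_cost(100, ON_DEMAND_RATE * 0.1, rerun_overhead=0.15)

print(f"on-demand ${on_demand:.0f}, reserved ${reserved:.0f}, spot ${spot:.0f}")
```

Even with rerun overhead, Spot is roughly an order of magnitude cheaper, which is why it dominates batch training workloads.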
3. Theoretical Foundations (The Trade-off)
The Spot Instance Trade-off
Cost vs. Availability. Spot instances are cheap because the cloud provider can reclaim them with a 2-minute warning.
- The Engineering Challenge: If your training takes 4 days and the instance dies on day 3, you lose everything.
- The Solution: Checkpointing. You must save the model weights to persistent storage (S3/GCS) every epoch or every $N$ steps. If the instance dies, the next instance resumes from the last checkpoint, not from scratch.
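The resume logic can be sketched with a local directory standing in for persistent storage; in production the save/load calls would be S3/GCS uploads, and the directory name and state contents here are illustrative:

```python
# Checkpoint/resume sketch. A local directory stands in for
# s3://my-bucket/checkpoints/ (illustrative); in production you would
# upload via boto3 or `aws s3 cp`.
import json
import os
from typing import Optional

CKPT_DIR = "checkpoints"

def save_checkpoint(step: int, state: dict) -> None:
    os.makedirs(CKPT_DIR, exist_ok=True)
    with open(os.path.join(CKPT_DIR, f"step_{step:06d}.json"), "w") as f:
        json.dump({"step": step, "state": state}, f)

def latest_checkpoint() -> Optional[dict]:
    if not os.path.isdir(CKPT_DIR):
        return None
    files = sorted(os.listdir(CKPT_DIR))  # zero-padded names sort by step
    if not files:
        return None
    with open(os.path.join(CKPT_DIR, files[-1])) as f:
        return json.load(f)

# Resume from the last checkpoint, not from scratch.
ckpt = latest_checkpoint()
start_step = ckpt["step"] + 1 if ckpt else 0
for step in range(start_step, 10):
    # train_step()  # real work goes here
    if step % 5 == 0:  # checkpoint every N steps
        save_checkpoint(step, {"loss": 0.0})
```

If a Spot instance dies mid-run, the replacement instance calls `latest_checkpoint()` on boot and loses at most N steps of work.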
4. Production-Grade Implementation
We adopt a "Self-Destruct" Architecture for experimental workloads. The infrastructure code should explicitly define the lifespan of the resource.
Sustainability Lens:
- Region Selection: Pick regions powered by low-carbon energy.
  - Good: `us-west-2` (Oregon, hydro), `eu-north-1` (Stockholm, hydro/wind).
  - Bad: Regions powered primarily by coal grids.
- Utilization: A GPU sitting idle at 0% utilization still draws significant power. Shut it down.
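The utilization point can be quantified. The ~70W idle draw and the grid-intensity figures below are rough illustrative assumptions; real numbers vary by card and region:

```python
# Rough carbon estimate for an idle GPU. The ~70W idle draw and the
# grid intensities (gCO2e/kWh) are illustrative assumptions.
GRID_INTENSITY = {"hydro_region": 30, "coal_region": 800}  # gCO2e per kWh

def idle_emissions_kg(hours: float, watts: float = 70, grid: str = "coal_region") -> float:
    kwh = watts / 1000 * hours
    return kwh * GRID_INTENSITY[grid] / 1000  # kg CO2e

# A GPU left idle over a 64-hour weekend on a coal-heavy grid:
print(f"{idle_emissions_kg(64):.2f} kg CO2e")
```

Swap `grid="hydro_region"` into the call and the same idle weekend emits over 25x less, which is the whole argument for region selection.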
5. Hands-On Project / Exercise
Objective: Create a Python script using boto3 (AWS SDK) that launches a GPU instance for a specific task and guarantees its termination, even if the script crashes.
Constraint: We simulate the "Task" as a sleep command, but in production, this would be your training job.
The "Suicide Script" Pattern
Instead of relying on an external monitor to shut down the server, we inject a "User Data" script (startup script) into the server itself. The server manages its own death.
```python
import boto3

# Configuration
AMI_ID = "ami-0123456789abcdef0"  # Deep Learning AMI
INSTANCE_TYPE = "g4dn.xlarge"     # NVIDIA T4 GPU
KEY_NAME = "my-ssh-key"
MAX_RUNTIME_SECONDS = 3600        # Hard limit: 1 hour

# The script that runs ON the server immediately upon boot
user_data_script = f"""#!/bin/bash
# Safety net: schedule a hard shutdown in case the job hangs
shutdown -h +{MAX_RUNTIME_SECONDS // 60}

echo "Starting Training Job..."
# 1. Pull Code & Data (Simulation)
# aws s3 cp s3://my-bucket/train.py .
# python3 train.py
echo "Simulating work..."
sleep 300  # Run task for 5 minutes

# 2. Upload Results
# aws s3 cp model.bin s3://my-bucket/results/

# 3. SELF-DESTRUCT
echo "Work complete. Terminating instance to save money."
shutdown -h now
"""

def launch_ephemeral_worker():
    ec2 = boto3.client('ec2', region_name='us-west-2')
    print(f"Launching {INSTANCE_TYPE} in us-west-2 (hydro-powered)...")
    response = ec2.run_instances(
        ImageId=AMI_ID,
        InstanceType=INSTANCE_TYPE,
        KeyName=KEY_NAME,
        MinCount=1,
        MaxCount=1,
        InstanceInitiatedShutdownBehavior='terminate',  # Crucial: shutdown = terminate
        UserData=user_data_script,  # boto3 base64-encodes UserData automatically
        TagSpecifications=[{
            'ResourceType': 'instance',
            'Tags': [{'Key': 'Project', 'Value': 'Day16-FinOps'}]
        }]
    )
    instance_id = response['Instances'][0]['InstanceId']
    print(f"Launched Instance: {instance_id}")
    print("Instance will auto-terminate after the script completes.")

if __name__ == "__main__":
    launch_ephemeral_worker()
```
Why this is Production-Grade:
- Inversion of Control: The script running on the instance controls the shutdown. If your laptop loses internet, the cloud instance still shuts down.
- Hard Limits: With `InstanceInitiatedShutdownBehavior='terminate'`, the command `shutdown -h now` deletes the instance, stopping billing immediately.
6. Ethical, Security & Safety Considerations
- Quota Management: New cloud accounts usually have a quota of 0 GPUs to prevent fraud. You must request a quota increase (Service Quota) days before you need it.
- Zombie Disks: Terminating an EC2 instance deletes the boot volume, but additional EBS volumes attached during runtime might persist. Ensure your Terraform/Script deletes attached storage, or you will pay for "Ghost Storage."
- Carbon Transparency: Use tools like Cloud Carbon Footprint to visualize the environmental impact of your training runs.
7. Business & Strategic Implications
The "Buy vs. Rent" Decision:
- Rent (Cloud): High OpEx, low CapEx. Best for volatile workloads, retraining, and experimentation.
- Buy (On-Prem): High CapEx, low OpEx. If you have a GPU running 24/7/365 at full utilization, buying an H100 cluster is cheaper than AWS after ~12-18 months.
- Warning: Buying requires a dedicated Infra team to manage cooling, networking, and drivers. For most teams, Cloud is the correct starting point.
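The break-even logic above can be sketched numerically. The $250k cluster price is an illustrative assumption; the $32/hour rate comes from the earlier example:

```python
# Break-even sketch for the buy-vs-rent decision. The $250k hardware
# price is an illustrative assumption; real quotes vary widely.
def breakeven_months(hardware_cost: float, cloud_rate_per_hr: float,
                     utilization: float = 1.0) -> float:
    cloud_cost_per_month = cloud_rate_per_hr * 24 * 30 * utilization
    return hardware_cost / cloud_cost_per_month

# At 100% utilization, $250k of hardware vs a $32/hr cloud instance:
months = breakeven_months(250_000, 32.0)
print(f"break-even after ~{months:.1f} months of 24/7 use")
```

Drop `utilization` to 0.3 (a more typical figure for experimental workloads) and the break-even stretches past three years, which is why renting wins for most teams.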
8. Code Examples / Pseudocode
Spot Instance Interruption Handler (Python): If you use Spot, you must listen for the 2-minute warning.
```python
import requests

def check_for_spot_termination() -> bool:
    """Poll the instance metadata service for a Spot interruption notice."""
    try:
        # AWS metadata endpoint for the 2-minute termination notice.
        # Note: on instances that enforce IMDSv2, you must first fetch a
        # session token and pass it in the X-aws-ec2-metadata-token header.
        url = "http://169.254.169.254/latest/meta-data/spot/instance-action"
        r = requests.get(url, timeout=0.5)
        if r.status_code == 200:
            print("SPOT INTERRUPTION DETECTED! Saving checkpoint...")
            save_checkpoint_now()  # your checkpointing function
            return True
    except requests.exceptions.RequestException:
        pass  # timeout or connection error: endpoint unreachable, assume no notice
    return False

# In your training loop:
# for epoch in range(epochs):
#     train_step()
#     if check_for_spot_termination():
#         break
```
9. Common Pitfalls & Misconceptions
- "CPUs are too slow for everything."
- Correction: CPUs are perfectly fine (and cheaper) for inference on small models (<100MB) or batch data processing. Don't waste a GPU on data cleaning.
- Using `fp32` everywhere.
- Correction: Almost no modern AI training requires 32-bit precision. Use `bf16` (Brain Float 16) or `fp16` (mixed precision) to halve your VRAM usage and double your speed on Tensor Cores.
- Ignoring Data Transfer Costs.
- Correction: Moving data out of the cloud (Egress) is expensive. Moving data in is free. Keep your compute where your data is.
10. Prerequisites & Next Steps
Prerequisites:
- An active AWS/GCP/Azure account.
- `pip install boto3`
- Basic understanding of Bash.
Next Steps:
- We have the compute (Day 16). We have the API knowledge (Day 15).
- Now we need to combine them to build the most popular GenAI architecture.
- Move to Day 17: CI/CD for ML.
11. Further Reading & Resources
- Tool: AWS EC2 On-Demand Instance Pricing (Bookmark this).
- Tool: Cloud Carbon Footprint.
- Paper: Patterson et al. (2021). Carbon Emissions and Large Neural Network Training.
- Concept: Efficient Deep Learning: A Survey on Making Deep Learning Models Smaller, Faster, and Better.