#python #devops #automation

Python scripting for SREs: automating incident response

May 18, 2026·9 min read

Every SRE has done it — woken up at 3 AM to a PagerDuty alert, SSH'd into a box, and spent 20 minutes running the same diagnostic commands. What if your scripts could do the first pass for you?

Why Python for SRE?

Python is the lingua franca of infrastructure automation for good reason:

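Every major cloud ships an official Python SDK: boto3 for AWS, the google-cloud-* packages for GCP, and the azure-* packages (azure-identity for auth, azure-mgmt-* for resources) for Azure
The Kubernetes API is first-class: the kubernetes Python package covers the built-in resource types, and custom resources are reachable through its CustomObjectsApi
Fast to write, easy to read: your on-call rotation needs to understand these scripts at 3 AM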

Automated incident triage

Here's a real script that runs when an alert fires:

from kubernetes import client, config
from kubernetes.client import V1Pod
 
# Reads your local kubeconfig; use config.load_incluster_config() when running inside the cluster
config.load_kube_config()
api = client.CoreV1Api()
 
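def get_crashing_pods(namespace: str, label_selector: str) -> list[V1Pod]:
    """Get pods that have failed, are in CrashLoopBackOff, or exited with an error."""
    pods = api.list_namespaced_pod(
        namespace=namespace,
        label_selector=label_selector,
    )
    return [
        p for p in pods.items
        # 'Pending' is left out on purpose: a pod that is still scheduling is not crashing
        if p.status.phase == 'Failed'
        or any(
            # CrashLoopBackOff is a container waiting reason, not a pod phase
            (cs.state.waiting and cs.state.waiting.reason == 'CrashLoopBackOff')
            or (cs.state.terminated and cs.state.terminated.exit_code != 0)
            for cs in (p.status.container_statuses or [])
            if cs.state
        )
    ]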
 
def collect_logs(pod_name: str, namespace: str, tail_lines: int = 100) -> str:
    """Grab the last N lines of logs from a crashing pod."""
    return api.read_namespaced_pod_log(
        name=pod_name,
        namespace=namespace,
        tail_lines=tail_lines,
    )
 
if __name__ == '__main__':
    crashing = get_crashing_pods('production', 'app=my-service')
    for pod in crashing:
        logs = collect_logs(pod.metadata.name, 'production')
        print(f"--- {pod.metadata.name} ---")
        print(logs[-500:])  # Last 500 chars
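
Knowing which pods are crashing is only half of the first pass; the reason is usually sitting in the container statuses. Here is a minimal sketch, assuming the attribute names of the kubernetes client's container status objects (state.waiting.reason, last_state.terminated.exit_code). The helper failure_reasons is hypothetical and not part of the script above. It accepts any object with those attributes, so the example uses a stand-in pod rather than a live cluster.

```python
from types import SimpleNamespace


def failure_reasons(pod) -> list[str]:
    """Summarize why each waiting container in a pod is unhealthy (hypothetical helper)."""
    reasons = []
    for cs in (pod.status.container_statuses or []):
        waiting = cs.state.waiting if cs.state else None
        # last_state holds the previous run, which is where a crash-looping
        # container's exit code and reason (e.g. OOMKilled) end up.
        last = cs.last_state.terminated if cs.last_state else None
        if waiting and waiting.reason:
            detail = waiting.reason
            if last:
                detail += f" (last exit {last.exit_code}, {last.reason})"
            reasons.append(f"{cs.name}: {detail}")
    return reasons


# Stand-in object shaped like a V1Pod, for illustration only.
fake_pod = SimpleNamespace(status=SimpleNamespace(container_statuses=[
    SimpleNamespace(
        name="web",
        state=SimpleNamespace(waiting=SimpleNamespace(reason="CrashLoopBackOff")),
        last_state=SimpleNamespace(
            terminated=SimpleNamespace(exit_code=137, reason="OOMKilled")
        ),
    )
]))
print(failure_reasons(fake_pod))
```

Putting a line like this at the top of the alert often tells you whether to read the logs at all: an OOMKilled container needs a memory bump, not a stack trace.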

Key patterns

Pattern 1: Idempotent remediation

Always write remediation scripts that are safe to run multiple times:

def scale_up(namespace: str, deployment: str, replicas: int) -> None:
    """Set a deployment's replica count. Safe to call repeatedly because it sends an absolute target, not an increment."""
    apps = client.AppsV1Api()
    body = {
        "spec": {"replicas": replicas}
    }
    apps.patch_namespaced_deployment_scale(
        name=deployment,
        namespace=namespace,
        body=body,
    )
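
One refinement worth considering: during an incident you rarely want automation to scale a deployment down. The sketch below only patches when the current count is below the target. It assumes apps behaves like kubernetes.client.AppsV1Api (read_namespaced_deployment_scale and patch_namespaced_deployment_scale); ensure_min_replicas and the FakeApps stand-in are hypothetical names used for illustration.

```python
from types import SimpleNamespace


def ensure_min_replicas(apps, namespace: str, deployment: str, minimum: int) -> bool:
    """Raise the replica count to `minimum` if it is lower; return True if a patch was sent."""
    scale = apps.read_namespaced_deployment_scale(name=deployment, namespace=namespace)
    current = scale.spec.replicas or 0
    if current >= minimum:
        return False  # already at or above the target: nothing to do
    apps.patch_namespaced_deployment_scale(
        name=deployment,
        namespace=namespace,
        body={"spec": {"replicas": minimum}},
    )
    return True


class FakeApps:
    """In-memory stand-in for AppsV1Api, for demonstration only."""

    def __init__(self, replicas: int) -> None:
        self.replicas = replicas
        self.patches = 0

    def read_namespaced_deployment_scale(self, name, namespace):
        return SimpleNamespace(spec=SimpleNamespace(replicas=self.replicas))

    def patch_namespaced_deployment_scale(self, name, namespace, body):
        self.replicas = body["spec"]["replicas"]
        self.patches += 1


fake = FakeApps(replicas=2)
first = ensure_min_replicas(fake, "production", "my-service", 5)   # patches 2 -> 5
second = ensure_min_replicas(fake, "production", "my-service", 5)  # no-op
print(first, second, fake.replicas, fake.patches)
```

Running it twice sends one patch, which is the property you want when an alert fires repeatedly.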

Pattern 2: Circuit breaker for remediation

Don't let your automation make things worse:

import time
 
MAX_RETRIES = 3
RETRY_WINDOW = 300  # 5 minutes
 
def safe_restart(pod_name: str, namespace: str, retry_log: list[dict]) -> bool:
    """Only restart if we haven't tried too many times recently."""
    recent = [
        r for r in retry_log
        if time.time() - r['timestamp'] < RETRY_WINDOW
    ]
 
    if len(recent) >= MAX_RETRIES:
        print(f"Circuit breaker open: {MAX_RETRIES} retries in {RETRY_WINDOW}s")
        return False
 
    api.delete_namespaced_pod(name=pod_name, namespace=namespace)
    retry_log.append({'pod': pod_name, 'timestamp': time.time()})
    return True
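
The same idea can be packaged so it is easy to test without touching a cluster. This is a sketch, not a drop-in replacement: CircuitBreaker is a hypothetical class, and the injectable clock exists only so the example can simulate time passing.

```python
import time
from collections import deque
from typing import Callable


class CircuitBreaker:
    """Allow at most `max_calls` actions per `window` seconds (hypothetical helper)."""

    def __init__(self, max_calls: int = 3, window: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_calls = max_calls
        self.window = window
        self.clock = clock
        self._calls = deque()  # timestamps of recent allowed attempts

    def allow(self) -> bool:
        """Record an attempt and return True if it fits within the budget."""
        now = self.clock()
        # Drop attempts that have aged out of the window.
        while self._calls and now - self._calls[0] >= self.window:
            self._calls.popleft()
        if len(self._calls) >= self.max_calls:
            return False
        self._calls.append(now)
        return True


fake_now = [0.0]
breaker = CircuitBreaker(max_calls=3, window=300, clock=lambda: fake_now[0])
results = [breaker.allow() for _ in range(4)]  # three allowed, fourth refused
fake_now[0] = 301.0
results.append(breaker.allow())                # window has passed: allowed again
print(results)
```

Keep in mind that both this class and the retry_log list hold their state in memory. If the script runs somewhere short-lived, the history would need to live in an external store to survive between runs.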

Pattern 3: Structured output for Slack

Send actionable messages, not walls of text:

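def format_incident_block(pod: V1Pod, logs: str) -> dict:
    """Build a Slack Block Kit section summarizing a crashing pod."""
    # container_statuses is None until the containers have been created
    statuses = pod.status.container_statuses or []
    restarts = statuses[0].restart_count if statuses else 0
    return {
        "type": "section",
        "text": {
            "type": "mrkdwn",
            "text": (
                f"*🔴 Pod Crash: `{pod.metadata.name}`*\n"
                f"*Namespace:* `{pod.metadata.namespace}`\n"
                f"*Restarts:* `{restarts}`\n"
                f"```\n{logs[-300:]}\n```"
            ),
        },
    }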
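
To actually deliver that block, one option is a Slack incoming webhook, which accepts a JSON body with a text fallback and a blocks list. A minimal sketch using only the standard library; build_payload and post_to_slack are hypothetical helper names, and the webhook URL in the comment is a placeholder.

```python
import json
import urllib.request


def build_payload(blocks: list[dict], fallback: str) -> bytes:
    """Wrap Block Kit blocks in an incoming-webhook payload."""
    # `text` is shown in notifications and by clients that cannot render blocks.
    return json.dumps({"text": fallback, "blocks": blocks}).encode("utf-8")


def post_to_slack(webhook_url: str, payload: bytes, timeout: float = 5.0) -> int:
    """POST the payload to a Slack incoming webhook and return the HTTP status."""
    req = urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status


block = {"type": "section", "text": {"type": "mrkdwn", "text": "*Pod Crash:* `web-1`"}}
payload = build_payload([block], fallback="Pod crash: web-1")
# post_to_slack("https://hooks.slack.com/services/...", payload)  # placeholder URL
```

The timeout matters here: a hung Slack call should not keep your triage script running during an incident.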

Setting up the alert pipeline

PagerDuty webhook triggers a Lambda function
Lambda runs the triage script and collects diagnostic data
Slack message is sent with findings + one-click remediation button
Runbook link is included for human follow-up
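
The glue for the first three steps might look something like this. It is a sketch under several assumptions: the webhook reaches Lambda through an API Gateway proxy integration (so the body arrives as a JSON string), the payload follows PagerDuty's v3 webhook shape with an event_type and a data object, and triage and notify are hypothetical callables standing in for the functions shown earlier.

```python
import json
from typing import Callable


def make_handler(triage: Callable[[str], list[dict]],
                 notify: Callable[[list[dict]], None]):
    """Build a Lambda handler from a triage function and a notifier (both hypothetical)."""

    def handler(event: dict, context: object) -> dict:
        # API Gateway proxy integrations deliver the webhook body as a JSON string.
        body = json.loads(event.get("body") or "{}")
        pd_event = body.get("event", {})
        # Assumed PagerDuty v3 shape: {"event": {"event_type": ..., "data": {...}}}
        if pd_event.get("event_type") != "incident.triggered":
            return {"statusCode": 200, "body": "ignored"}
        title = pd_event.get("data", {}).get("title", "unknown incident")
        blocks = triage(title)  # collect diagnostics, return Slack blocks
        notify(blocks)          # send them to the channel
        return {"statusCode": 200, "body": json.dumps({"blocks_sent": len(blocks)})}

    return handler
```

Passing triage and notify in as arguments keeps the handler testable with plain functions. A production version would also want to verify the webhook signature before acting on the request.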

The whole thing takes about 30 seconds from alert to actionable Slack message. No more SSH at 3 AM just to run kubectl logs.

What's next

Add **anomaly detection** with simple statistical thresholds
Build a runbook generator from pod specifications
Create a post-incident report template that auto-fills from the diagnostic data
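
For the anomaly detection idea, a simple statistical threshold can be as small as a z-score check. This is one possible starting point, not a recommendation of a specific method: is_anomalous is a hypothetical helper, and the threshold of 3 standard deviations and the sample restart counts are illustrative assumptions.

```python
from statistics import mean, stdev


def is_anomalous(history: list[float], value: float, threshold: float = 3.0) -> bool:
    """Flag `value` if it sits more than `threshold` standard deviations from the mean."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return value != mu  # flat history: any change is notable
    return abs(value - mu) / sigma > threshold


# Illustrative data: restarts per hour for one service.
restarts_per_hour = [0, 1, 0, 1, 0, 1, 0, 1]
print(is_anomalous(restarts_per_hour, 1))  # within the usual range
print(is_anomalous(restarts_per_hour, 5))  # far outside it
```

A check like this works best on metrics with a fairly stable baseline; spiky or seasonal metrics would need a rolling window or a more robust statistic.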

TL;DR: A few hundred lines of Python can eliminate most of the repetitive toil from incident response. Start with triage, add remediation carefully, and always include circuit breakers.