Devsjam
#python #devops #automation

Python scripting for SREs: automating incident response

9 min read

Every SRE has done it: woken up at 3 AM to a PagerDuty alert, SSH'd into a box, and spent 20 minutes running the same diagnostic commands. What if your scripts could do the first pass for you?

Why Python for SRE?

Python is the lingua franca of infrastructure automation for good reason:

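  • Every major cloud has a Python SDK: boto3 for AWS, the google-cloud-* packages for GCP, and the azure-* packages (with azure-identity for authentication) for Azure
  • The Kubernetes API is first-class: the official kubernetes Python package is generated from the API spec, so it covers the built-in resource types and reaches custom resources through CustomObjectsApi
  • Fast to write, easy to read: your on-call rotation needs to understand these scripts at 3 AM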

Automated incident triage

Here's a real script that runs when an alert fires:

from kubernetes import client, config
from kubernetes.client import V1Pod
 
config.load_kube_config()
api = client.CoreV1Api()
 
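def get_crashing_pods(namespace: str, label_selector: str) -> list[V1Pod]:
    """Get pods that are Failed, in CrashLoopBackOff, or exited with an error."""
    pods = api.list_namespaced_pod(
        namespace=namespace,
        label_selector=label_selector,
    )

    def is_crashing(pod: V1Pod) -> bool:
        if pod.status.phase == 'Failed':
            return True
        for cs in (pod.status.container_statuses or []):
            state = cs.state
            if state is None:
                continue
            # CrashLoopBackOff is a container waiting reason, not a pod phase.
            # 'Pending' is not checked: healthy pods pass through it on startup.
            if state.waiting and state.waiting.reason == 'CrashLoopBackOff':
                return True
            if state.terminated and state.terminated.exit_code != 0:
                return True
        return False

    return [p for p in pods.items if is_crashing(p)]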
 
def collect_logs(pod_name: str, namespace: str, tail_lines: int = 100) -> str:
    """Grab the last N lines of logs from a crashing pod."""
    return api.read_namespaced_pod_log(
        name=pod_name,
        namespace=namespace,
        tail_lines=tail_lines,
    )
 
if __name__ == '__main__':
    crashing = get_crashing_pods('production', 'app=my-service')
    for pod in crashing:
        logs = collect_logs(pod.metadata.name, 'production')
        print(f"--- {pod.metadata.name} ---")
        print(logs[-500:])  # Last 500 chars
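
Knowing that a pod is crashing is only half of the first pass; the exit code and reason usually tell you where to look next. Here's a minimal sketch of that step. summarize_failures is a hypothetical helper, not part of the script above, and fake_pod is a stand-in built from SimpleNamespace so the example runs without a cluster. It assumes the pod exposes the same fields as the client's container status objects (state, last_state, waiting.reason, terminated.exit_code).

```python
from types import SimpleNamespace


def summarize_failures(pod) -> list[str]:
    """Return one line per container that is waiting or exited non-zero."""
    lines = []
    for cs in (pod.status.container_statuses or []):
        waiting = cs.state.waiting if cs.state else None
        # In CrashLoopBackOff the current state is "waiting"; the exit details
        # of the previous run live in last_state.terminated.
        terminated = (cs.state.terminated if cs.state else None) or (
            cs.last_state.terminated if cs.last_state else None
        )
        if waiting and waiting.reason:
            lines.append(f"{cs.name}: waiting ({waiting.reason})")
        if terminated and terminated.exit_code != 0:
            lines.append(
                f"{cs.name}: exited {terminated.exit_code} ({terminated.reason})"
            )
    return lines


# Stand-in for a pod whose only container was OOM-killed and is backing off.
fake_pod = SimpleNamespace(
    status=SimpleNamespace(
        container_statuses=[
            SimpleNamespace(
                name="web",
                state=SimpleNamespace(
                    waiting=SimpleNamespace(reason="CrashLoopBackOff"),
                    terminated=None,
                ),
                last_state=SimpleNamespace(
                    terminated=SimpleNamespace(exit_code=137, reason="OOMKilled"),
                ),
            )
        ]
    )
)

for line in summarize_failures(fake_pod):
    print(line)
```

An exit code of 137 with reason OOMKilled points at memory limits rather than application bugs, which is exactly the kind of hint worth surfacing before anyone opens a terminal.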

Key patterns

Pattern 1: Idempotent remediation

Always write remediation scripts that are safe to run multiple times:

def scale_up(namespace: str, deployment: str, replicas: int) -> None:
    """Scale up a deployment โ€” safe to call repeatedly."""
    apps = client.AppsV1Api()
    body = {
        "spec": {"replicas": replicas}
    }
    apps.patch_namespaced_deployment_scale(
        name=deployment,
        namespace=namespace,
        body=body,
    )
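
Setting an absolute replica count is what makes the call above repeatable. One way to take that further is to make sure the script can only ever scale up, never down, and never past a ceiling. The sketch below is an assumption about how you might do that, not part of the original script: target_replicas, scale_up_only, and the ceiling of 20 are hypothetical, and apps is expected to be an AppsV1Api instance (or anything with the same two scale methods).

```python
def target_replicas(current: int, desired: int, ceiling: int = 20) -> int:
    """Pick a replica count that never goes below current and caps new growth."""
    return max(current, min(desired, ceiling))


def scale_up_only(apps, namespace: str, deployment: str,
                  desired: int, ceiling: int = 20) -> int:
    """Scale up if needed; do nothing when the deployment is already big enough."""
    scale = apps.read_namespaced_deployment_scale(
        name=deployment, namespace=namespace
    )
    current = scale.spec.replicas or 0
    target = target_replicas(current, desired, ceiling)
    if target != current:
        # Only patch when something actually changes, so reruns are no-ops.
        apps.patch_namespaced_deployment_scale(
            name=deployment,
            namespace=namespace,
            body={"spec": {"replicas": target}},
        )
    return target
```

Running this twice in a row sends one patch, not two, and a human who has already scaled the deployment higher by hand will not have their change undone.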

Pattern 2: Circuit breaker for remediation

Don't let your automation make things worse:

import time
 
MAX_RETRIES = 3
RETRY_WINDOW = 300  # 5 minutes
 
def safe_restart(pod_name: str, namespace: str, retry_log: list[dict]) -> bool:
    """Only restart if we haven't tried too many times recently."""
    recent = [
        r for r in retry_log
        if time.time() - r['timestamp'] < RETRY_WINDOW
    ]
 
    if len(recent) >= MAX_RETRIES:
        print(f"Circuit breaker open: {MAX_RETRIES} retries in {RETRY_WINDOW}s")
        return False
 
    api.delete_namespaced_pod(name=pod_name, namespace=namespace)
    retry_log.append({'pod': pod_name, 'timestamp': time.time()})
    return True

Pattern 3: Structured output for Slack

Send actionable messages, not walls of text:

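def format_incident_block(pod: V1Pod, logs: str) -> dict:
    # container_statuses is None until the kubelet has reported on the pod,
    # so guard against it and total the restarts across all containers.
    statuses = pod.status.container_statuses or []
    restarts = sum(cs.restart_count for cs in statuses)
    return {
        "type": "section",
        "text": {
            "type": "mrkdwn",
            "text": (
                f"*🔴 Pod Crash: `{pod.metadata.name}`*\n"
                f"*Namespace:* `{pod.metadata.namespace}`\n"
                f"*Restarts:* `{restarts}`\n"
                f"```\n{logs[-300:]}\n```"
            ),
        },
    }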
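
To actually deliver that block, one option is a Slack incoming webhook, which accepts a JSON body with a fallback text field and a blocks list. The sketch below uses only the standard library. build_payload and post_to_slack are hypothetical names, and the webhook URL is something you would supply from your own Slack app configuration.

```python
import json
import urllib.request


def build_payload(blocks: list[dict], fallback: str) -> bytes:
    """Wrap Block Kit blocks in the JSON body an incoming webhook expects."""
    return json.dumps({"text": fallback, "blocks": blocks}).encode("utf-8")


def post_to_slack(webhook_url: str, blocks: list[dict], fallback: str) -> int:
    """POST the blocks to the webhook and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=build_payload(blocks, fallback),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # A short timeout keeps a slow Slack response from stalling the triage run.
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

The fallback text is what shows up in notifications and in clients that cannot render blocks, so keep it short and specific, for example the pod name and namespace.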

Setting up the alert pipeline

  1. PagerDuty webhook triggers a Lambda function
  2. Lambda runs the triage script and collects diagnostic data
  3. Slack message is sent with findings + one-click remediation button
  4. Runbook link is included for human follow-up

The whole thing takes about 30 seconds from alert to actionable Slack message. No more SSH at 3 AM just to run kubectl logs.
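
The glue for steps 2 to 4 can be sketched as a small function that the Lambda handler calls. Everything here is an assumption made for illustration: the event fields namespace and label_selector are hypothetical (the real PagerDuty payload would need to be mapped to them first), RUNBOOK_URL is a placeholder, and find_pods, fetch_logs, and notify are injected callables so the flow can be exercised without a cluster or Slack.

```python
import json

RUNBOOK_URL = "https://example.com/runbooks/pod-crash"  # hypothetical placeholder


def run_triage(event: dict, find_pods, fetch_logs, notify) -> dict:
    """Find crashing pods, collect a log tail for each, and send the findings."""
    namespace = event.get("namespace", "production")
    selector = event.get("label_selector", "app=my-service")

    findings = []
    for pod_name in find_pods(namespace, selector):
        logs = fetch_logs(pod_name, namespace)
        findings.append({"pod": pod_name, "logs": logs[-300:]})

    # Stay quiet when nothing is crashing; an empty message is just noise.
    if findings:
        notify({
            "namespace": namespace,
            "findings": findings,
            "runbook": RUNBOOK_URL,
        })

    return {"statusCode": 200, "body": json.dumps({"crashing": len(findings)})}
```

A Lambda entry point would then be a thin handler(event, context) wrapper that passes in the real Kubernetes and Slack functions.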

What's next

  • Add anomaly detection with simple statistical thresholds
  • Build a runbook generator from pod specifications
  • Create a post-incident report template that auto-fills from the diagnostic data
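
As a taste of the first item, a "simple statistical threshold" can be as small as a z-score check against recent history. This is only a sketch: is_anomalous is a hypothetical helper, the threshold of 3 standard deviations is an arbitrary starting point, and the numbers in the usage lines are made up.

```python
from statistics import mean, stdev


def is_anomalous(history: list[float], latest: float, threshold: float = 3.0) -> bool:
    """Flag latest if it sits more than `threshold` standard deviations from the mean."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat history: any change at all is a deviation.
        return latest != history[0]
    return abs(latest - mean(history)) / sigma > threshold


restarts_per_hour = [10, 12, 11, 9, 10, 11]  # made-up sample data
print(is_anomalous(restarts_per_hour, 30))
print(is_anomalous(restarts_per_hour, 11))
```

A check like this is crude, but it is easy to reason about at 3 AM, which fits the rest of the tooling.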

TL;DR: A few hundred lines of Python can eliminate most of the repetitive toil from incident response. Start with triage, add remediation carefully, and always include circuit breakers.
