Python scripting for SREs: automating incident response
Every SRE has done it: woken up at 3 AM to a PagerDuty alert, SSH'd into a box, and spent 20 minutes running the same diagnostic commands. What if your scripts could do the first pass for you?
Why Python for SRE?
Python is the lingua franca of infrastructure automation for good reason:
- Every cloud SDK has a Python client: AWS boto3, GCP google-cloud, Azure azure-identity
- Kubernetes API is first-class: the kubernetes Python package covers every resource type
- Fast to write, easy to read: your on-call rotation needs to understand these scripts at 3 AM
Automated incident triage
Here's a real script that runs when an alert fires:
from kubernetes import client, config
from kubernetes.client import V1Pod

# Load credentials from the local kubeconfig file.
config.load_kube_config()
api = client.CoreV1Api()


def get_crashing_pods(namespace: str, label_selector: str) -> list[V1Pod]:
    """Get pods that are in CrashLoopBackOff or Error state."""
    pods = api.list_namespaced_pod(
        namespace=namespace,
        label_selector=label_selector,
    )
    return [
        p for p in pods.items
        # Pending is left out: a pod that is still being scheduled is not crashing.
        if p.status.phase == 'Failed'
        or any(
            # CrashLoopBackOff is reported as a waiting state, not a terminated one.
            (cs.state.waiting and cs.state.waiting.reason == 'CrashLoopBackOff')
            or (cs.state.terminated and cs.state.terminated.exit_code != 0)
            for cs in (p.status.container_statuses or [])
            if cs.state
        )
    ]

def collect_logs(pod_name: str, namespace: str, tail_lines: int = 100) -> str:
    """Grab the last N lines of logs from a crashing pod."""
    return api.read_namespaced_pod_log(
        name=pod_name,
        namespace=namespace,
        tail_lines=tail_lines,
    )

if __name__ == '__main__':
    crashing = get_crashing_pods('production', 'app=my-service')
    for pod in crashing:
        logs = collect_logs(pod.metadata.name, 'production')
        print(f"--- {pod.metadata.name} ---")
        print(logs[-500:])  # Last 500 chars

Key patterns
Pattern 1: Idempotent remediation
Always write remediation scripts that are safe to run multiple times:
def scale_up(namespace: str, deployment: str, replicas: int) -> None:
    """Scale up a deployment; safe to call repeatedly."""
    apps = client.AppsV1Api()
    # Setting an absolute replica count (rather than incrementing it) is what
    # makes repeated calls converge on the same state.
    body = {
        "spec": {"replicas": replicas}
    }
    apps.patch_namespaced_deployment_scale(
        name=deployment,
        namespace=namespace,
        body=body,
    )

Pattern 2: Circuit breaker for remediation
Don't let your automation make things worse:
import time

MAX_RETRIES = 3
RETRY_WINDOW = 300  # 5 minutes


def safe_restart(pod_name: str, namespace: str, retry_log: list[dict]) -> bool:
    """Only restart if we haven't tried too many times recently."""
    # Count every restart inside the window, whichever pod it targeted.
    recent = [
        r for r in retry_log
        if time.time() - r['timestamp'] < RETRY_WINDOW
    ]
    if len(recent) >= MAX_RETRIES:
        print(f"Circuit breaker open: {MAX_RETRIES} retries in {RETRY_WINDOW}s")
        return False
    # A controller-managed pod is recreated after it is deleted.
    api.delete_namespaced_pod(name=pod_name, namespace=namespace)
    retry_log.append({'pod': pod_name, 'timestamp': time.time()})
    return True

Pattern 3: Structured output for Slack
Send actionable messages, not walls of text:
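def format_incident_block(pod: V1Pod, logs: str) -> dict:
    """Build a Slack section block summarizing a crashing pod."""
    # container_statuses can be None before any container state is reported.
    statuses = pod.status.container_statuses or []
    restarts = statuses[0].restart_count if statuses else 0
    return {
        "type": "section",
        "text": {
            "type": "mrkdwn",
            "text": (
                f"*🔴 Pod Crash: `{pod.metadata.name}`*\n"
                f"*Namespace:* `{pod.metadata.namespace}`\n"
                f"*Restarts:* `{restarts}`\n"
                f"```\n{logs[-300:]}\n```"
            ),
        },
    }

Setting up the alert pipeline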
- PagerDuty webhook triggers a Lambda function
- Lambda runs the triage script and collects diagnostic data
- Slack message is sent with findings + one-click remediation button
- Runbook link is included for human follow-up
The whole thing takes about 30 seconds from alert to actionable Slack message.
No more SSH at 3 AM just to run kubectl logs.
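To make the flow concrete, here is a minimal sketch of what the Lambda side could look like. It is an illustration under assumptions, not a drop-in handler: the payload fields (namespace, label_selector, runbook_url), the SLACK_WEBHOOK_URL environment variable, and the function names are all hypothetical, and the one-click remediation button is left out. Triage and notification are passed in as callables so the handler can be exercised without a cluster or a Slack workspace.

```python
import json
import os
import urllib.request
from typing import Callable

# Hypothetical environment variable holding a Slack incoming-webhook URL.
SLACK_WEBHOOK_URL = os.environ.get("SLACK_WEBHOOK_URL", "")


def post_to_slack(blocks: list[dict]) -> None:
    """POST a list of blocks to the Slack incoming webhook."""
    payload = json.dumps({"blocks": blocks}).encode("utf-8")
    request = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        response.read()


def build_blocks(findings: list[dict], runbook_url: str) -> list[dict]:
    """One section per crashing pod, followed by the runbook link."""
    blocks = [
        {
            "type": "section",
            "text": {
                "type": "mrkdwn",
                "text": f"*Pod:* {item['pod']}\n{item['logs'][-300:]}",
            },
        }
        for item in findings
    ]
    blocks.append({
        "type": "section",
        "text": {"type": "mrkdwn", "text": f"<{runbook_url}|Runbook>"},
    })
    return blocks


def handle_alert(
    event: dict,
    triage: Callable[[str, str], list[dict]],
    notify: Callable[[list[dict]], None] = post_to_slack,
) -> dict:
    """Parse the alert body, run triage, and notify Slack if anything was found."""
    body = json.loads(event.get("body") or "{}")
    # Assumed custom fields; adapt these to what your alert payload carries.
    namespace = body.get("namespace", "production")
    selector = body.get("label_selector", "")
    runbook_url = body.get("runbook_url", "https://example.com/runbook")

    findings = triage(namespace, selector)
    if findings:
        notify(build_blocks(findings, runbook_url))
    return {"statusCode": 200, "body": json.dumps({"pods": len(findings)})}
```

In a real deployment, the Lambda entry point would call handle_alert with a triage function that wraps get_crashing_pods and collect_logs and returns one dict with pod and logs keys per crashing pod.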
What's next
- Add anomaly detection with simple statistical thresholds
- Build a runbook generator from pod specifications
- Create a post-incident report template that auto-fills from the diagnostic data
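As a rough idea of what "simple statistical thresholds" could mean for the first item, here is a sketch that flags a value sitting too many standard deviations from its recent history. The function name, the three-sigma default, and the choice of metric (say, restarts per five-minute window) are assumptions for illustration, not a recommendation tuned to any real workload.

```python
from statistics import mean, stdev


def is_anomalous(history: list[float], current: float, threshold: float = 3.0) -> bool:
    """Return True if current is more than threshold standard deviations from the mean of history."""
    if len(history) < 2:
        return False  # Not enough data to estimate the spread.
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Perfectly flat baseline: treat any change as a deviation.
        return current != mu
    return abs(current - mu) / sigma > threshold
```

A triage script could call this with the last few restart counts for a service and page a human only when the latest value stands out from that baseline.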
TL;DR: A few hundred lines of Python can eliminate most of the repetitive toil from incident response. Start with triage, add remediation carefully, and always include circuit breakers.