Beyond the Red Build
Building a Self-Healing CI/CD Pipeline with AI
As an engineer, one of the most persistent “broken windows” I see in our day-to-day work is CI/CD fatigue. More than once I’ve been in the situation where a test fails, the build turns red, and, even though I am already deep in the next task, I have to context-switch, dig through logs, find the trivial syntax error, and push a fix.
What if the pipeline could heal itself?
In this article, we walk through a Proof of Concept (POC) that intercepts CI failures, uses an LLM to diagnose the root cause, and automatically opens a Pull Request with the fix.
The Problem: The High Cost of Trivial Failures
We often treat all build failures the same, but they aren’t. There’s a massive difference between a complex architectural regression and a simple slip like a type mismatch. The latter is “mechanical work” that drains senior engineering time.
For the record: every time a developer context-switches from a complex feature to fix a trivial CI failure, we lose about 20 minutes of deep-work state. Over a year, that’s a massive tax on the team’s velocity.
The goal is to improve the Mean Time to Recovery (MTTR) by creating an autonomous “front-line responder.”
The Core Architecture: The “Healer” Loop
A robust self-healing system isn’t just a wrapper around an LLM. It requires a controlled loop that follows the Analyze → Patch → Verify → Propose workflow.
Step 1: The Environment
We start with a simple Flask application designed to fail:
from flask import Flask

app = Flask(__name__)

@app.route('/add/<int:a>/<int:b>')
def add(a, b):
    result = a + b
    return "The result is: " + result  # Intentional TypeError: str + int
Step 2: Capturing the “Crime Scene”
To heal a bug, the AI needs to “see” the failure, so we pipe our test results into a log file. In a production environment, this could instead be a call to a logging provider like Datadog or CloudWatch.
pytest > test_log.txt 2>&1
Why 2>&1? Because Python stack traces are sent to stderr (Standard Error). If you only pipe stdout, the AI gets a blank log and can’t diagnose the issue.
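The orchestrator at the end of this article reads that file back through a helper called get_failed_log. A minimal sketch, assuming the test_log.txt name from the command above:

def get_failed_log():
    """Read the captured pytest output (stdout + stderr) so the LLM sees the full stack trace."""
    with open("test_log.txt", "r") as f:
        return f.read()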
Step 3: The Diagnosis (LLM Integration)
The “brain” of the operation is the ask_ai_for_fix function. We provide the LLM with two crucial pieces of context: the failing logs and the original source code.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask_ai_for_fix(logs, source_code):
    """
    The Brain: Sends the failed logs and source code to the LLM.
    We instruct the AI to return ONLY code to avoid parsing errors.
    """
    print("🤖 AI is analyzing the crash...")

    prompt = f"""
    The following Python code failed tests.

    SOURCE CODE:
    {source_code}

    ERROR LOG:
    {logs}

    Return ONLY the corrected code. No markdown formatting, no explanations.
    """

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content
Step 4: The Verification Agent (The Gatekeeper)
The most common critique of AI in DevOps is trust. An AI that opens broken PRs is just “automated noise.” To solve this, we implemented a Verification Agent—a local sandbox gatekeeper.
Before the script touches the GitHub API, it performs a “hot-swap” of the code in a local sandbox to verify the fix. Only if the tests pass does the orchestrator proceed to create a Pull Request.
import subprocess

def verify_fix(suggested_code):
    """
    The Gatekeeper: We use a 'Sandbox' approach here.
    1. Backup original code.
    2. Hot-swap the AI's fix.
    3. Run the tests.
    4. Restore original code.
    """
    print("🔬 Verification Agent: Testing the proposed fix...")

    # Step 1: Backup the original code
    with open("app.py", "r") as f:
        original_code = f.read()

    try:
        # Step 2: Write the AI fix to the actual file temporarily
        with open("app.py", "w") as f:
            f.write(suggested_code)

        # Step 3: Trigger pytest and check the OS exit code (0 = Success)
        result = subprocess.run(["pytest"], capture_output=True)
        if result.returncode == 0:
            print("✅ Verification Passed!")
            return True
        else:
            print("❌ Verification Failed.")
            return False
    finally:
        # Step 4: CRITICAL - Restore original code so we don't leave the app broken
        with open("app.py", "w") as f:
            f.write(original_code)
This ensures that we only talk to the GitHub API if we have clear proof (the test passing) that the fix works.
Step 5: Testing the Gatekeeper
After some years of software development, you know you can’t just “trust” your automation logic; you test it. To ensure the Verification Agent is actually working, we built a separate test utility (test_verification.py).
This script simulates the AI by feeding the verify_fix function two different scenarios: a “Known Good” fix and a “Known Broken” fix.
from healer import verify_fix

# 1. MOCK DATA: A fix that SHOULD pass
VALID_FIX = """
from flask import Flask

app = Flask(__name__)

@app.route('/add/<int:a>/<int:b>')
def add(a, b):
    result = a + b
    return f"The result is: {result}"

if __name__ == '__main__':
    app.run(debug=True)
"""

# 2. MOCK DATA: A fix that SHOULD fail (undefined variable)
BROKEN_FIX = """
from flask import Flask

app = Flask(__name__)

@app.route('/add/<int:a>/<int:b>')
def add(a, b):
    # This will fail because of the undefined variable 'z'
    return f"Result: {a + b + z}"
"""

def run_test_suite():
    print("--- Starting Verification Agent Test ---")

    # Test Case 1: Valid Fix
    print("\nScenario 1: Testing a valid AI suggestion...")
    success = verify_fix(VALID_FIX)
    if success:
        print("RESULT: Test Passed. The agent correctly identified a working fix.")
    else:
        print("RESULT: Test Failed. The agent rejected a working fix.")

    print("-" * 40)

    # Test Case 2: Broken Fix
    print("\nScenario 2: Testing a broken AI suggestion...")
    failure = verify_fix(BROKEN_FIX)
    if not failure:
        print("RESULT: Test Passed. The agent correctly caught the error and blocked the PR.")
    else:
        print("RESULT: Test Failed. The agent allowed a broken fix through!")

if __name__ == "__main__":
    run_test_suite()
This utility proves that our finally block works correctly—the system can “test drive” a bad fix, reject it, and restore the original code without breaking the local file permanently.
Step 6: The Human-in-the-Loop Model
Why not just let the AI push directly to main? The AI does the heavy lifting of finding and fixing the bug, but a human performs the final code review. This balances speed with safety.
The script uses the PyGithub library to orchestrate the Git flow:
Dynamic Branching: It detects if your repo uses main or master automatically
Conflict Prevention: It deletes old fix branches before starting a new one
Human-in-the-Loop: It opens a Pull Request instead of pushing to the main branch
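For reference, here is one possible shape of the create_repair_pr function using PyGithub; the repository slug, branch name, and commit message are placeholders I chose for this sketch, not fixed parts of the POC:

import os
from github import Github, GithubException

def create_repair_pr(fixed_code, repo_slug="your-org/your-repo", branch="ai-fix"):
    """The Diplomat: push the verified fix to a branch and open a PR for review."""
    repo = Github(os.environ["GITHUB_TOKEN"]).get_repo(repo_slug)
    base = repo.default_branch  # handles both 'main' and 'master'

    # Conflict prevention: delete a stale fix branch if one already exists
    try:
        repo.get_git_ref(f"heads/{branch}").delete()
    except GithubException:
        pass

    # Branch off the current default branch
    base_sha = repo.get_branch(base).commit.sha
    repo.create_git_ref(ref=f"refs/heads/{branch}", sha=base_sha)

    # Commit the verified fix and open the Pull Request for a human to review
    contents = repo.get_contents("app.py", ref=branch)
    repo.update_file(contents.path, "fix: auto-repair failing tests",
                     fixed_code, contents.sha, branch=branch)
    repo.create_pull(title="🤖 AI Fix: repair failing tests",
                     body="Automated fix generated and verified locally by the healer.",
                     head=branch, base=base)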
The Final Orchestration Flow
Here’s how it all ties together:
if __name__ == "__main__":
    # Gather Context
    logs = get_failed_log()
    with open("app.py", "r") as f:
        code = f.read()

    # Actor: Suggest a fix
    fixed_code = ask_ai_for_fix(logs, code)
    clean_code = fixed_code.replace("```python", "").replace("```", "").strip()

    # Critic: Verify in a sandbox
    if verify_fix(clean_code):
        # Diplomat: Create the PR
        create_repair_pr(clean_code)
    else:
        print("🛑 PR aborted. The AI fix failed local verification.")
The Result
When the CI fails, the “Healer”:
Identifies the error
Generates a fix
Validates the fix
Creates a new branch
Submits a PR for human review
This reduces the Mean Time to Recovery (MTTR) from minutes (or hours of context switching) to seconds.
Disclaimer
This is not a silver bullet or a complete solution but a starting point. Not all failures are created equal, and this example is designed specifically for mechanical bugs, the low-hanging fruit: missing imports, simple syntax errors, undefined variables. It is not designed to handle complex architectural issues, security vulnerabilities that need deep domain knowledge, or business logic errors that require product context.
This started as my own experiment, and I’m sharing it with the idea that if I simply demonstrate the concept, I might inspire someone out there to build their own context-specific healing agents. To take it to the next step and make it successful in your own project, you’ll need to:
Tune the prompts for your specific tech stack
Adjust the verification strategy based on your test suite complexity
Add guardrails for your team’s specific failure patterns (a sketch of one possible guardrail follows this list)
Consider integrating with your existing observability tools
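As one illustration, a guardrail can be as simple as refusing fixes that touch sensitive files or rewrite too much code at once. The helper below, its names, and its thresholds are assumptions for this sketch, not part of the POC:

import difflib

MAX_CHANGED_LINES = 20                          # reject sweeping rewrites
PROTECTED_PATHS = {"auth.py", "payments.py"}    # never auto-patch sensitive modules

def passes_guardrails(original_code, suggested_code, target_file="app.py"):
    """Reject fixes that touch protected files or change too much in one go."""
    if target_file in PROTECTED_PATHS:
        return False
    diff = difflib.unified_diff(original_code.splitlines(),
                                suggested_code.splitlines(), lineterm="")
    changed = [line for line in diff
               if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))]
    return len(changed) <= MAX_CHANGED_LINES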
Conclusion: The Path to Agentic DevOps
This POC is a small step toward Agentic DevOps, created just to showcase an option. Of course, the possibilities for further development and improvement are endless, but the main idea doesn’t change: by reducing the friction of trivial bug fixing, we can allow engineering teams to focus on more exciting tasks like system design and product features, rather than chasing a missing type in the logs.
We’ve effectively created a “vigilant colleague” who lives in our CI pipeline. It watches for errors, tries to fix them, proves they are fixed, and then politely asks for a review.
The result? We move from “The build is broken, can someone fix it already?” to “The build was broken; here’s a PR with the fix.”