Building a Collaborative Agent Framework: Automating Trajectory Analysis with GitHub Copilot

Overview

If you’ve ever found yourself repeating the same intellectual grind, poring over thousands of lines of agent trajectory data to spot patterns, you know the itch to automate. That’s exactly what sparked eval-agents, a project that transforms how a research team analyzes coding agent performance. By combining GitHub Copilot’s assistance with a reusable, shareable agent framework, the team moved from poring over hundreds of thousands of JSON lines each day to letting agents do the heavy lifting.

Building a Collaborative Agent Framework: Automating Trajectory Analysis with GitHub Copilot — Source: github.blog

This guide walks you through the same approach: creating your own set of evaluation agents that automate the tedious parts of benchmark analysis, making it easy for you and your colleagues to focus on the creative, high-level insights. The core philosophy is simple: make agents easy to share, easy to author, and make them the primary vehicle for contributions.

Prerequisites

Before diving in, you should be comfortable with:

GitHub Copilot: Familiarity with using Copilot in your editor (e.g., VS Code) for code suggestions and chat.
Python and JSON: Understanding of basic Python scripting and reading/writing JSON files.
Agent evaluation concepts: Knowing what agent trajectories are (lists of thought processes and actions) and benchmarks like SWEBench-Pro or TerminalBench2.
Git and GitHub: Basic version control and repository management for sharing your agents.

You don’t need to be an AI researcher—this framework is meant to be accessible to any developer who works with agent logs.

Step-by-Step Guide

1. Identify Your Repetitive Analysis Patterns

Open a directory of trajectory JSON files. Look for the queries you repeat: “How many tasks failed due to timeout?” “Which actions are most common in successful runs?” “Extract all tool-call sequences.” That’s your automation target. In the original project, the author noticed they were constantly using Copilot to surface patterns, then manually investigating—a loop ripe for automation.

Action: List three analysis tasks you perform on every new benchmark run. For example:

Count trajectories exceeding a file-size threshold.
Identify all tasks where the agent’s final action was a “fail” status.
Summarize the average number of steps per task.
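
The three example tasks above can be sketched as small Python helpers. This is a minimal sketch, not the original project's code: it assumes one JSON file per trajectory, and the field names task_id, steps, and a per-step status are hypothetical, so adjust them to match your benchmark's actual schema.

```python
import json
from pathlib import Path


def load_trajectory(path):
    """Read one trajectory file (assumed: one JSON object per file)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def count_large_files(paths, max_bytes):
    """Count trajectory files whose size on disk exceeds max_bytes."""
    return sum(1 for p in paths if Path(p).stat().st_size > max_bytes)


def tasks_ending_in_failure(trajectories):
    """Return the task ids whose last step carries a 'fail' status.

    Assumes each step is a dict with a 'status' key (hypothetical schema).
    """
    failed = []
    for trajectory in trajectories:
        steps = trajectory.get("steps", [])
        if steps and steps[-1].get("status") == "fail":
            failed.append(trajectory.get("task_id"))
    return failed


def average_steps(trajectories):
    """Average number of steps per task, or 0.0 when there are no tasks."""
    if not trajectories:
        return 0.0
    total = sum(len(t.get("steps", [])) for t in trajectories)
    return total / len(trajectories)
```

Each helper answers exactly one question, which makes it easy to see later which ones are worth promoting into a shared agent.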

2. Build Your First Eval Agent with Copilot

Open a fresh Python file. Using Copilot Chat, start a conversation: “I have a list of JSON trajectory files. I need a script that reads each file, extracts the steps array, and prints a summary of step counts per task.” Copilot will likely suggest a for loop using json.load(). Accept the suggestion and refine it. Then ask: “Now, for each task, calculate the success rate by checking if status equals ‘success’.”

Pro tip: Use Copilot’s inline suggestions to build modular functions—one for reading, one for parsing, one for summarizing. This makes the agent reusable. Name your main function analyze_trajectories and let Copilot fill in the rest.
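
One possible shape for the modular layout described in the tip is shown here. It is a hedged sketch rather than what Copilot will actually produce: the helper names read_trajectory, parse_trajectory, and summarize are illustrative, and the task_id field is an assumption (the file path is used as a fallback when it is absent).

```python
import glob
import json


def read_trajectory(path):
    """Reading: load one trajectory file as a Python dict."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def parse_trajectory(raw, fallback_id):
    """Parsing: keep only the fields the summary needs."""
    return {
        "task_id": raw.get("task_id", fallback_id),  # 'task_id' is assumed
        "step_count": len(raw.get("steps", [])),
        "succeeded": raw.get("status") == "success",
    }


def summarize(records):
    """Summarizing: step counts per task plus the overall success rate."""
    total = len(records)
    successes = sum(1 for r in records if r["succeeded"])
    return {
        "tasks": total,
        "success_rate": successes / total if total else 0.0,
        "steps_per_task": {r["task_id"]: r["step_count"] for r in records},
    }


def analyze_trajectories(pattern):
    """Entry point: glob pattern in, summary dict out."""
    records = [
        parse_trajectory(read_trajectory(path), fallback_id=path)
        for path in sorted(glob.glob(pattern))
    ]
    return summarize(records)
```

Calling analyze_trajectories with a glob pattern for your own trajectory files returns a plain dict, which is easy to print, diff between runs, or hand to another agent.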

Wrap your logic in a class TrajectoryAgent with methods like run_analysis() that take a file pattern as argument. This becomes the skeleton for your first eval agent. Test it on a few files.
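
A skeleton for such a class might look like the following. Only the names TrajectoryAgent and run_analysis come from the text above; the AnalysisReport container and the top-level status and steps fields are assumptions made for illustration.

```python
import glob
import json
from dataclasses import dataclass, field


@dataclass
class AnalysisReport:
    """Hypothetical result container returned by the agent."""
    files_read: int = 0
    successes: int = 0
    step_counts: list = field(default_factory=list)

    @property
    def success_rate(self):
        return self.successes / self.files_read if self.files_read else 0.0


class TrajectoryAgent:
    """Minimal eval agent: summarizes every trajectory matching a pattern."""

    def run_analysis(self, file_pattern):
        report = AnalysisReport()
        # Sort the matches so repeated runs visit files in the same order.
        for path in sorted(glob.glob(file_pattern)):
            with open(path, encoding="utf-8") as f:
                trajectory = json.load(f)
            report.files_read += 1
            report.step_counts.append(len(trajectory.get("steps", [])))
            if trajectory.get("status") == "success":
                report.successes += 1
        return report
```

Returning a structured report instead of printing directly keeps the agent usable both from a script and from a test.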

3. Share and Collaborate on Agent Libraries

Now that your agent works, package it for your team. Create a GitHub repository named eval-agents (or whatever fits your project). Structure it with an agents/ directory containing one file per agent, e.g., trajectory_summarizer.py. Write a short README.md explaining how to run each agent and what it does. Use Copilot to generate the README from the code.

Important: Add a requirements.txt listing dependencies (likely none beyond standard libraries). Then ask a colleague to try your agent. Did they need to edit anything? The goal is zero friction. In the original team, the author designed these agents so that anyone could run them on new benchmark runs without understanding the internals.
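
One way to approach zero friction is a single entry point that runs any agent by name. The sketch below is an assumption, not part of the original project: it supposes each module in agents/ exposes a function called run(pattern), and the file name run_agent.py is hypothetical.

```python
import argparse
import importlib


def build_parser():
    """Command-line interface: which agent to run, and on which files."""
    parser = argparse.ArgumentParser(
        prog="run_agent",
        description="Run one eval agent over a set of trajectory files.",
    )
    parser.add_argument(
        "agent", help="module name inside agents/, e.g. trajectory_summarizer"
    )
    parser.add_argument(
        "pattern", help="glob pattern matching trajectory JSON files"
    )
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    # Assumed convention: every agent module defines run(pattern).
    module = importlib.import_module(f"agents.{args.agent}")
    print(module.run(args.pattern))
    return 0
```

To use it as a script, save it at the repository root and call main() under an if __name__ == "__main__": guard. A colleague then needs only the agent name and a glob pattern, with no knowledge of the internals.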

4. Iterate and Extend Agent Capabilities

Once the base agent is shared, encourage contributions. Create an issue template for new agent ideas. Use Copilot to help write the next agent—maybe one that visualizes trajectory step distributions. Pair program: let Copilot suggest code, then you refine the logic. Over time, the agent library grows organically.

To maintain quality, include unit tests for each agent. Ask Copilot: “Write a test for TrajectoryAgent using a sample JSON file.” It will generate a test with dummy data. Push to the repo and set up a simple CI to run tests on pull requests.
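
A test of the kind described above could look like this sketch, which uses the standard unittest and tempfile modules. To keep the example self-contained it defines a small stand-in TrajectoryAgent that returns a list of step counts; in a real repository you would import your own class instead, and its return type may differ.

```python
import glob
import json
import os
import tempfile
import unittest


class TrajectoryAgent:
    """Stand-in for the real agent; replace with an import from agents/."""

    def run_analysis(self, file_pattern):
        step_counts = []
        for path in sorted(glob.glob(file_pattern)):
            with open(path, encoding="utf-8") as f:
                step_counts.append(len(json.load(f).get("steps", [])))
        return step_counts


class TrajectoryAgentTest(unittest.TestCase):
    def test_counts_steps_in_sample_file(self):
        # Dummy trajectory with two steps (field names are assumed).
        sample = {
            "status": "success",
            "steps": [{"action": "open"}, {"action": "edit"}],
        }
        with tempfile.TemporaryDirectory() as tmp:
            with open(os.path.join(tmp, "sample.json"), "w", encoding="utf-8") as f:
                json.dump(sample, f)
            result = TrajectoryAgent().run_analysis(os.path.join(tmp, "*.json"))
        self.assertEqual(result, [2])

    def test_no_matching_files_gives_empty_result(self):
        with tempfile.TemporaryDirectory() as tmp:
            result = TrajectoryAgent().run_analysis(os.path.join(tmp, "*.json"))
        self.assertEqual(result, [])
```

Running python -m unittest from the repository root discovers and runs tests like these, and the same command can serve as the CI step on pull requests.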

Common Mistakes

Over-automation: Don’t try to automate everything upfront. Start with the most painful repetitive task. The original author iterated on a single loop—extracting patterns—before expanding.
Ignoring edge cases: Trajectory files can have inconsistent fields or missing data. Build error handling from the start: wrap file reads in try/except blocks. Copilot can help generate these. For instance, ask: “Handle JSONDecodeError gracefully and log the filename.”
Not sharing early: The breakthrough came from enabling the whole team, not just one person. Avoid building a personal Swiss Army knife; instead, push a minimal viable agent to a shared repo. Then refine based on feedback.
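
The error handling suggested under “Ignoring edge cases” can be sketched as follows. The function name load_trajectory_safely and the logger name eval_agents are illustrative, and treating a missing steps list as malformed is an assumption about the trajectory schema.

```python
import json
import logging

logger = logging.getLogger("eval_agents")


def load_trajectory_safely(path):
    """Return the parsed trajectory, or None if the file is unusable.

    Problems are logged with the filename instead of stopping the run.
    """
    try:
        with open(path, encoding="utf-8") as f:
            data = json.load(f)
    except json.JSONDecodeError as exc:
        logger.warning("Skipping %s: invalid JSON (%s)", path, exc)
        return None
    except OSError as exc:
        logger.warning("Skipping %s: could not read file (%s)", path, exc)
        return None

    # Guard against inconsistent fields: 'steps' must be present and a list.
    if not isinstance(data, dict) or not isinstance(data.get("steps"), list):
        logger.warning("Skipping %s: missing or malformed 'steps' field", path)
        return None
    return data
```

Callers can filter out the None results and continue, so one corrupt file does not abort the analysis of an entire benchmark run.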

Summary

By following this blueprint, you can transform the way your team analyzes agent benchmarks and potentially cut analysis time from hours to minutes. The key is to identify repetitive intellectual toil, build a simple agent with GitHub Copilot’s help, share it without friction, and iterate collaboratively. Your new role? The maintainer of an ever-growing toolkit that lets everyone do more creative work.
