AI Engineer, Agents & Evaluation

  • Guild.ai
  • San Francisco, California
  • 04/02/2026

Job Description

The Opportunity

We're looking for our first AI Engineer focused on agents and evaluation, a foundational hire who will shape how we build, measure, and scale intelligent systems.

Help developers understand, evolve, and operate complex systems using autonomous and event-driven AI. Build evaluation frameworks, task harnesses, and orchestration strategies that make our agents reliable, testable, and genuinely useful. Create reusable benchmarks and artifacts that will inspire new approaches and push forward the broader foundation model ecosystem.

Enjoy designing experiments, building systems, and iterating tightly between theory and code in a 0-to-1, research-engineering-style role.

What You Will Do
  • Create Task Evaluations That Matter: Design and implement task-specific evaluations that measure and improve agent quality. Each evaluation should both drive concrete iteration on our agents and spark broader innovation around the task itself.
  • Define Tasks, Datasets, and Harnesses: Clearly specify tasks, collect and curate balanced datasets, and build robust evaluation harnesses that can be used across agents and modeling approaches. There is ample room for architectural design and systems thinking here.
  • Build and Use a Reusable Evaluation Framework: Develop frameworks and tools for running evaluations at scale. Use these frameworks to tune existing agents and to guide the development of new ones in our environment.
  • Explore Agent Orchestration Strategies: Investigate and implement orchestration patterns (tooling, routing, decomposition, multi-agent setups, etc.) that allow agents to tackle increasingly complex, multi-step, and long-horizon tasks.
  • Apply Post-Training Techniques: Experiment with post-training approaches (e.g., fine-tuning, preference optimization, reward shaping, distillation) to produce high-performance models tailored to specific tasks and workflows.
  • Run Experiments End-to-End: Design, run, and analyze experiments with rigor. Turn experimental results into clear recommendations and concrete changes to model configurations, prompts, and system design.
  • Collaborate Deeply Across the Stack: Work closely with founders, product, and infrastructure engineers to ensure evaluations, agents, and platform primitives all reinforce each other.
What You Will Bring
  • MS or Ph.D. in a relevant field (e.g., Computer Science, Machine Learning, NLP) or equivalent practical experience.
  • Strong background in machine learning and large language models, ideally including both research and hands-on implementation.
  • 2-5 years working with LLM technology, with familiarity across:
    • Prompting and interaction patterns
    • Agent and tool orchestration strategies
    • Evaluation strategies for complex, open-ended tasks
  • Proficiency in writing production-quality code, especially in Python; comfort working with TypeScript or modern web/backend stacks.
  • Experience designing and running experiments, and interpreting results in messy, real-world settings.
  • Self-motivated and comfortable operating in an unstructured, high-ambiguity environment.
  • Strong communication skills and the ability to translate vague goals into concrete, testable setups.
Bonus Points
  • Experience building agentic systems (tool-using agents, workflows, or multi-agent systems) in real products.
  • Prior work on model evaluation frameworks, benchmarking, or reliability/robustness testing.
  • Familiarity with modern ML tooling (training/inference stacks, experiment tracking, data pipelines).
  • Contributions to open-source LLM, tooling, or evaluation projects.
  • Experience at an early-stage startup or research lab where you owned projects end-to-end.
Benefits & Perks
  • Significant equity in an early-stage, venture-backed startup.
  • Comprehensive Health Benefits (Medical, Dental, Vision).
  • Flexible PTO to ensure you have the time you need to recharge.

Seniority level: Mid-Senior level

Employment type: Full-time

Job function: Engineering and Information Technology

Industries: Software Development
