A growing AI startup in San Francisco is seeking a Data Scientist to build the data and evaluation backbone for AI-native developer workflows. You'll establish the company's truth layer and develop metrics to gauge product quality and efficacy. The ideal candidate has a strong background in statistics and experimentation, along with proficiency in SQL and Python. This is a full-time role for a mid-senior level candidate, offering significant equity and comprehensive health benefits.
04/02/2026
Full time
Data Scientist at Guild.ai

The Opportunity
Build the data and evaluation backbone for AI-native developer workflows. This isn't a typical DS role focused on optimizing a mature funnel. As the first Data Scientist at Guild.ai, you'll establish the company's "truth layer", from product instrumentation and decision metrics to evaluation frameworks for autonomous, event-driven AI systems.

We're tackling one of the hardest and most important problems in software engineering: helping developers understand, evolve, and operate complex systems using autonomous and event-driven AI. Your work will ensure we ship the right things, know whether they're working, and continuously improve quality, reliability, and user trust.

If you thrive in ambiguity, love turning messy signals into crisp insight, and want to build the measurement culture for a 0-to-1 product with real technical depth, this role is for you.

What You Will Do
- Define What "Good" Means: Partner with founders, engineering, and design to define product KPIs and quality metrics, especially around AI behaviors (helpfulness, correctness, reliability, latency, cost, user trust).
- Build the Measurement Foundation: Establish event taxonomy, instrumentation standards, and core datasets. Ensure we can answer product questions quickly and confidently.
- Create AI Evaluation & Monitoring Systems: Develop offline/online evaluation approaches for agentic workflows (e.g., golden sets, human review loops, heuristic + model-based scoring, regression detection, error taxonomies).
- Run Experiments That Change Decisions: Design and analyze A/B tests and quasi-experiments; bring statistical rigor to iteration speed.
- Turn Insight into Action: Produce analyses, narratives, and recommendations that directly shape roadmap tradeoffs and product direction.
- Enable Self-Serve Analytics: Build dashboards and lightweight tooling that help the entire team understand usage, performance, and customer outcomes.
- Be a Cross-Functional Glue Layer: Work tightly with engineering on logging/telemetry, with PM on prioritization, and with GTM and customer conversations to connect product behavior to real-world impact.
- Define Data Science at Guild.ai: Establish best practices for metrics, experimentation, and decision-making frameworks that scale as the team grows.

What You Will Bring
- Strong foundations in statistics, experimentation, and causal reasoning, with a track record of driving product decisions through data.
- Fluency in SQL and Python, and comfort working across the data stack (from raw events to analysis-ready datasets).
- Experience building analytics and measurement systems in a fast-moving environment (startup and/or high-ownership teams).
- Ability to translate ambiguous questions into well-scoped analyses and clear recommendations.
- High judgment and crisp communication, especially when data is incomplete or messy.
- A founder's mentality: comfortable building from scratch, prioritizing ruthlessly, and owning outcomes end-to-end.
Bonus Points
- Experience evaluating or monitoring LLMs / agentic systems (quality measurement, human-in-the-loop evals, regression testing, safety/reliability metrics)
- Familiarity with developer tools, infrastructure, observability, or Git-based workflows
- Comfort with modern data tooling (warehouses, dbt, orchestration, BI) and event-driven architectures
- Experience establishing experimentation and analytics culture at an early-stage startup

Benefits & Perks
- Significant equity in an early-stage, venture-backed startup
- Comprehensive Health Benefits (Medical, Dental, Vision)
- Flexible PTO to ensure you have the time you need to recharge

Thank you for your interest; we can't wait to meet you.

Seniority Level: Mid-Senior level
Employment Type: Full-time
Job Function: Engineering and Information Technology
Industry: Software Development
An innovative AI startup in San Francisco seeks a Mid-Senior AI Engineer to build production agents that will help developers understand and operate complex systems through cutting-edge AI technology. You'll design workflows, integrate with developer tools, and ensure reliability and observability of agents in real-world applications. Candidates should have a strong software engineering background, hands-on LLM experience, and proficiency in Python. The role offers significant equity and flexible PTO.
04/02/2026
Full time
A pioneering AI company in San Francisco seeks a Mid-Senior AI Engineer to design and implement evaluation frameworks for intelligent systems. The ideal candidate will have a strong background in machine learning and large language models, with 2-5 years of relevant experience. Responsibilities include building robust evaluation harnesses and collaborating across teams to enhance AI models. This role offers significant equity and comprehensive health benefits in a dynamic startup environment.
04/02/2026
Full time
AI Engineer, Production Agents

Guild.ai is looking for a founding engineer focused on building production agents: someone who will push our platform to its limits by creating some of the first real-world agents that developers rely on.

Overview
We're tackling one of the hardest and most important problems in software engineering: helping developers understand, evolve, and operate complex systems using autonomous and event-driven AI. Your agents will be among the first proof points that this new way of building is not only possible, but better.

What you will do
- Build the first production agents: design, implement, and ship some of the earliest agents built on the Guild.ai platform, agents that developers will use to understand, debug, and evolve complex software systems.
- Push the platform by using it: act as both power user and core contributor, and feed experience back into the platform's APIs, abstractions, and UX.
- Own agent workflows end-to-end: take agents from idea to prototype to production, covering task scoping, architecture, prompts, tools, integrations, logging, and iteration based on real-world behavior.
- Integrate deeply with real developer environments: connect agents to source control, CI/CD, observability, and other components of modern engineering stacks so they can operate on real code and real systems.
- Make agents reliable, safe, and observable: implement guardrails, monitoring, and debugging tooling.
- Collaborate closely with product and evaluation teams: define agent behaviors, success metrics, and iteration loops. Use evaluation harnesses and telemetry to guide improvements.
- Shape the agent engineering practice at Guild.ai: help define patterns, libraries, and best practices for building agents on the platform.

What you will bring
- Strong software engineering background and experience owning complex features or systems end-to-end.
- Hands-on experience building with LLMs (prompting, tool calling, function calling, RAG, workflows) in a production or high-stakes environment.
- Proficiency in Python and comfort with TypeScript or modern web/backend stacks; ability to design and reason about distributed or event-driven systems, APIs, and integrations.
- A practical mindset around reliability: logging, observability, debugging, and iterative hardening of systems in production.
- Comfort operating in a high-ambiguity, high-ownership startup environment.
- Clear communication and a strong product sense: you care that what you build solves real problems for engineers.

Bonus points
- Experience building agentic systems (tool-using agents, workflow engines, multi-step or multi-agent setups).
- Familiarity with developer tools, infrastructure, observability, or platform products.
- Experience integrating with Git-based workflows, CI/CD, cloud services, or internal tooling used by engineering teams.
- Prior work with evaluation or monitoring of LLM-based systems in production.
- Experience at an early-stage startup or in a role where you were the primary builder for a new product area.

Benefits & perks
- Significant equity in an early-stage, venture-backed startup.
- Comprehensive health benefits (medical, dental, vision).
- Flexible PTO to ensure you have time to recharge.

Seniority level: Mid-Senior level
Employment type: Full-time
Job function: Engineering and Information Technology
Location: San Francisco, CA
AI Engineer, Agents & Evaluation Guild.ai San Francisco, CA The Opportunity We're looking for our first AI Engineer focused on agents and evaluation-a foundational hire who will shape how we build, measure, and scale intelligent systems. Help developers understand, evolve, and operate complex systems using autonomous and event driven AI. Build evaluation frameworks, task harnesses, and orchestration strategies that make our agents reliable, testable, and genuinely useful. Create reusable benchmarks and artifacts that will inspire new approaches and push forward the broader foundation model ecosystem. Enjoy designing experiments, building systems, and iterating tightly between theory and code in a 0 1 research engineering style role. What You Will Do Create Task Evaluations That Matter: Design and implement task specific evaluations that measure and improve agent quality. Each evaluation should both drive concrete iteration on our agents and spark broader innovation around the task itself. Define Tasks, Datasets, and Harnesses: Clearly specify tasks, collect and curate balanced datasets, and build robust evaluation harnesses that can be used across agents and modeling approaches. There is ample room for architectural design and systems thinking here. Build and Use a Reusable Evaluation Framework: Develop frameworks and tools for running evaluations at scale. Use these frameworks to tune existing agents and to guide the development of new ones in our environment. Explore Agent Orchestration Strategies: Investigate and implement orchestration patterns (tooling, routing, decomposition, multi agent setups, etc.) that allow agents to tackle increasingly complex, multi step, and long horizon tasks. Apply Post Training Techniques: Experiment with post training approaches (e.g., fine tuning, preference optimization, reward shaping, distillation) to produce high performance models tailored to specific tasks and workflows. Run Experiments End to End: Design, run, and analyze experiments with rigor. Turn experimental results into clear recommendations and concrete changes to model configurations, prompts, and system design. Collaborate Deeply Across the Stack: Work closely with founders, product, and infrastructure engineers to ensure evaluations, agents, and platform primitives all reinforce each other. What You Will Bring MS or Ph.D. in a relevant field (e.g., Computer Science, Machine Learning, NLP) or equivalent practical experience. Strong background in machine learning and large language models, ideally including both research and hands on implementation. 2-5 years working with LLM technology, with familiarity across: Prompting and interaction patterns Agent and tool orchestration strategies Evaluation strategies for complex, open ended tasks Proficiency writing production quality code, especially in Python; comfort working with TypeScript or modern web/backend stacks. Experience designing and running experiments, and interpreting results in messy, real world settings. Self motivated, comfortable operating in an unstructured, high ambiguity environment. Strong communication skills and the ability to translate vague goals into concrete, testable setups. Bonus Points Experience building agentic systems (tool using agents, workflows, or multi agent systems) in real products. Prior work on model evaluation frameworks, benchmarking, or reliability/robustness testing. Familiarity with modern ML tooling (training/inference stacks, experiment tracking, data pipelines). 
- Contributions to open-source LLM, tooling, or evaluation projects.
- Experience at an early-stage startup or research lab where you owned projects end to end.

Benefits & Perks
- Significant equity in an early-stage, venture-backed startup.
- Comprehensive Health Benefits (Medical, Dental, Vision).
- Flexible PTO to ensure you have the time you need to recharge.

Seniority level: Mid-Senior level
Employment type: Full-time
Job function: Engineering and Information Technology
Industries: Software Development
AI Engineer, Agents & Evaluation Guild.ai San Francisco, CA The Opportunity We're looking for our first AI Engineer focused on agents and evaluation-a foundational hire who will shape how we build, measure, and scale intelligent systems. Help developers understand, evolve, and operate complex systems using autonomous and event driven AI. Build evaluation frameworks, task harnesses, and orchestration strategies that make our agents reliable, testable, and genuinely useful. Create reusable benchmarks and artifacts that will inspire new approaches and push forward the broader foundation model ecosystem. Enjoy designing experiments, building systems, and iterating tightly between theory and code in a 0 1 research engineering style role. What You Will Do Create Task Evaluations That Matter: Design and implement task specific evaluations that measure and improve agent quality. Each evaluation should both drive concrete iteration on our agents and spark broader innovation around the task itself. Define Tasks, Datasets, and Harnesses: Clearly specify tasks, collect and curate balanced datasets, and build robust evaluation harnesses that can be used across agents and modeling approaches. There is ample room for architectural design and systems thinking here. Build and Use a Reusable Evaluation Framework: Develop frameworks and tools for running evaluations at scale. Use these frameworks to tune existing agents and to guide the development of new ones in our environment. Explore Agent Orchestration Strategies: Investigate and implement orchestration patterns (tooling, routing, decomposition, multi agent setups, etc.) that allow agents to tackle increasingly complex, multi step, and long horizon tasks. Apply Post Training Techniques: Experiment with post training approaches (e.g., fine tuning, preference optimization, reward shaping, distillation) to produce high performance models tailored to specific tasks and workflows. Run Experiments End to End: Design, run, and analyze experiments with rigor. Turn experimental results into clear recommendations and concrete changes to model configurations, prompts, and system design. Collaborate Deeply Across the Stack: Work closely with founders, product, and infrastructure engineers to ensure evaluations, agents, and platform primitives all reinforce each other. What You Will Bring MS or Ph.D. in a relevant field (e.g., Computer Science, Machine Learning, NLP) or equivalent practical experience. Strong background in machine learning and large language models, ideally including both research and hands on implementation. 2-5 years working with LLM technology, with familiarity across: Prompting and interaction patterns Agent and tool orchestration strategies Evaluation strategies for complex, open ended tasks Proficiency writing production quality code, especially in Python; comfort working with TypeScript or modern web/backend stacks. Experience designing and running experiments, and interpreting results in messy, real world settings. Self motivated, comfortable operating in an unstructured, high ambiguity environment. Strong communication skills and the ability to translate vague goals into concrete, testable setups. Bonus Points Experience building agentic systems (tool using agents, workflows, or multi agent systems) in real products. Prior work on model evaluation frameworks, benchmarking, or reliability/robustness testing. Familiarity with modern ML tooling (training/inference stacks, experiment tracking, data pipelines). 
Contributions to open source LLM, tooling, or evaluation projects. Experience at an early stage startup or research lab where you owned projects end to end. Benefits & Perks Significant equity in an early stage, venture backed startup. Comprehensive Health Benefits (Medical, Dental, Vision). Flexible PTO to ensure you have the time you need to recharge. Referrals increase your chances of interviewing at Guild.ai by 2x. Seniority level: Mid Senior level Employment type: Full time Job function: Engineering and Information Technology Industries: Software Development Get notified about new Artificial Intelligence Engineer jobs in San Francisco, CA.