1) Agent Engineering & Productionization (40%)
- Prototype, iterate, and productionize domain-aligned agent modules (plans, tool-use, task execution flows) that operate reliably within defined workflows. (Execute)
- Build and maintain versioned agent assets (prompts, policies, tool schemas, configs) with clear change logs and reproducibility. (Execute)
- Optimize agent performance for latency and token efficiency within defined constraints, particularly in edge-targeted scenarios where applicable. (Execute)
2) Evaluation, Testing & Quality Signals (25%)
- Implement an AI system testing harness for assigned agents: regression suites, golden test sets (where applicable), and comparison reports for prompt/model variants. (Execute)
- Maintain evaluation metadata (test versioning, metrics, correlation IDs) to support traceability and repeatability. (Execute)
- Contribute to safety/quality checks (hallucination, toxicity, policy compliance) as part of evaluation workflows defined by the program. (Execute/Consult)
3) Integration with Tools and MCPs (20%)
- Implement or extend MCP clients/connectors for internal data products and approved enterprise apps using standardized interfaces, scopes, and audit patterns. (Execute)
- Validate integration behaviour with sandbox credentials, representative test data, and end-to-end workflow tests with stakeholders. (Execute/Consult)
4) Operational Readiness & Collaboration (15%)
- Ensure owned components meet operational readiness expectations: logging/telemetry coverage, runbook notes, basic SLI/SLO alignment for agent health and integration reliability. (Execute/Consult)
- Collaborate with platform and transformation teams to clarify requirements, triage issues, and incorporate feedback from internal/external teams into improvements. (Execute/Consult/Informed)
- Identify and implement small process improvements that increase repeatability (evaluation templates, prompt versioning conventions, integration test scaffolds). (Execute)
Decision-Making Autonomy: Moderate — owns technical implementation for assigned agents/evals/integrations within established patterns; escalates cross-domain/security/policy decisions.
Supervision Required: Moderate — receives design review and direction from L09/L10 AI leads for evaluation approach, routing standards, and sensitive integrations.
Complexity of Role: High (for L08) — requires balancing quality/latency, integrating multiple enterprise tools, and ensuring reproducible evaluation under evolving requirements.
Cross-Functional Interactions: Yes — frequent interactions with platform, product/domain, security, SRE/observability, and enterprise app owners.