AI QA Tester

  • Full-Time
  • Remote

Job Overview

We are looking for an AI QA Engineer to join one of our projects and help ensure the quality, reliability, and safety of advanced AI systems. In this role, you will test and validate LLM-based applications, agentic workflows, and AI-powered systems, working across software, infrastructure, and hardware components.

You will play a critical role in identifying issues, validating model behavior, and building automated testing frameworks to support the deployment of high-quality, production-ready AI solutions.

Key Responsibilities

LLM & Agent Evaluation

  • Evaluate advanced language models across areas such as hallucination detection, factual consistency, prompt-injection resistance, bias/fairness auditing, and reasoning reliability.

  • Design and execute test plans for single-agent and multi-agent systems, validating reasoning paths, memory behavior, and tool usage.

  • Assess semantic correctness of model outputs, beyond simple pass/fail validation.

  • Validate RAG pipelines and retrieval accuracy within AI workflows.

  • Document failure modes and propose improvements to model performance and reliability.
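For illustration, the grounding and hallucination checks described above can be as simple as verifying that factual claims in a model answer are traceable to the source text. A minimal sketch (the function name and examples are hypothetical, not part of any specific framework):

```python
import re

def ungrounded_numbers(source: str, answer: str) -> list[str]:
    """Return numeric claims in `answer` that never appear in `source`.

    A crude grounding check: any number the model asserts should be
    traceable to the retrieved/source passage; anything else is a
    hallucination candidate worth manual review.
    """
    source_nums = set(re.findall(r"\d+(?:\.\d+)?", source))
    answer_nums = re.findall(r"\d+(?:\.\d+)?", answer)
    return [n for n in answer_nums if n not in source_nums]

source = "The model was trained on 1.4 trillion tokens over 21 days."
answer = "It was trained on 2 trillion tokens over 21 days."
print(ungrounded_numbers(source, answer))  # → ['2']
```

Real evaluation pipelines layer entity matching, NLI-based entailment, or LLM-as-judge scoring on top of checks like this, but the failure-mode documentation loop is the same: detect, triage, report.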

Test Automation & Frameworks

  • Perform manual and exploratory testing for complex AI-driven applications to identify bugs and unexpected behaviors.

  • Build and maintain automated test suites for LLM-based features using tools such as PyTest, OpenAI Evals, and Weights & Biases.

  • Develop regression testing frameworks to detect model performance changes across releases.

  • Implement and maintain CI/CD-integrated testing pipelines to monitor AI model regressions.
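As a rough sketch of the regression testing described above: because LLM output varies run to run, release gates typically compare model answers against a golden set with a similarity threshold rather than exact string equality. The example below stubs the model call (`call_model` is hypothetical; in practice it would wrap the real endpoint) and uses token-overlap F1 as a deliberately simple similarity proxy:

```python
def call_model(prompt: str) -> str:
    # Stub standing in for the deployed LLM endpoint (hypothetical).
    return "Paris is the capital of France."

def token_f1(expected: str, actual: str) -> float:
    """Token-overlap F1: a crude semantic-similarity proxy, used when
    exact-match assertions are too strict for free-form LLM output."""
    exp = [t.strip(".,!?") for t in expected.lower().split()]
    act = [t.strip(".,!?") for t in actual.lower().split()]
    common = sum(min(exp.count(t), act.count(t)) for t in set(exp))
    if not common:
        return 0.0
    precision, recall = common / len(act), common / len(exp)
    return 2 * precision * recall / (precision + recall)

# Golden prompt/answer pairs pinned at the last known-good release.
GOLDEN = [("What is the capital of France?",
           "The capital of France is Paris.")]

def test_no_regression_on_golden_set():
    for prompt, expected in GOLDEN:
        score = token_f1(expected, call_model(prompt))
        assert score >= 0.5, f"regression on {prompt!r}: F1={score:.2f}"
```

Run under PyTest in CI, a suite like this flags releases whose answers drift below the threshold; production setups usually swap the F1 proxy for embedding similarity or rubric-based grading.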

Collaboration & Quality Assurance

  • Work closely with AI engineers, product teams, and data scientists to ensure system reliability and performance.

  • Contribute to improving testing methodologies, evaluation metrics, and QA processes for AI-driven applications.

  • Ensure AI systems meet quality, reliability, and safety standards before production deployment.

Required Skills & Experience

Experience

  • 3–5+ years of experience in QA engineering

  • 1–2+ years testing AI/ML or LLM-based systems

  • Experience delivering QA validation for at least one production AI system (chatbot, agentic workflow, RAG pipeline, or similar)

  • Hands-on experience with prompt engineering and generative AI output evaluation

Technical Skills

  • Strong proficiency in Python for test scripting and automation

  • Experience with LLM evaluation tools such as:

    • OpenAI Evals

    • RAG evaluation frameworks

    • Weights & Biases

  • Familiarity with agent frameworks (LangChain, LangGraph, CrewAI, AutoGen) to trace and test agent behavior

  • Experience with API testing

  • Understanding of NLP concepts such as intent recognition and entity extraction

  • Familiarity with vector databases and RAG architectures for retrieval accuracy testing

  • Experience with version control systems (Git) and issue tracking tools (Jira, Linear)

Preferred Qualifications

  • Experience in AI safety testing, red-teaming, or adversarial testing

  • Skills in evaluation rubric design, bias and fairness auditing, and grounding verification

  • Experience testing AI systems in regulated industries (healthcare, fintech, legal)

  • Background in NLP, computational linguistics, or data science

  • Familiarity with RLHF (Reinforcement Learning from Human Feedback) and model alignment concepts

  • Security certifications such as OSCP or OSWA

  • Coursework or experience related to AI safety