At TrustLab, we have spent several years working on the problem of high-quality content moderation. During that time, the emergence of large language models has been a game changer: automating moderation work lowers costs and improves moderator wellness, but technical and non-technical challenges remain. Our work on content moderation has led to some key learnings:
- Have a well-defined, inter-rater reliable policy
- Plan to iteratively refine the policy
- Separate policy and implementation (for LLM based automation)
- Leverage human-in-the-loop for edge cases (for LLM based automation)
Below we discuss each of these learnings in turn, then consider how they apply to the emerging world of AI and agents.
Learnings from Content Moderation
Have a well-defined, inter-rater reliable policy: In our experience, many (smaller) platforms start with informal policies, or even no documented policies at all, relying only on reactive enforcement in response to user complaints. This leads to inconsistent enforcement and user confusion. A well-defined policy set that has been tested for inter-rater reliability is a big step forward. The importance of sound guidelines and inter-rater alignment is also recognized in other domains.
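As a minimal sketch of what "tested for inter-rater reliability" can look like in practice, the snippet below measures agreement between two raters applying the same draft policy. The labels and the 0.7 threshold are illustrative assumptions, not part of our policy process.

```python
# Minimal sketch: measuring inter-rater reliability on a draft policy.
# Assumes two raters have labeled the same sample of content items;
# the labels and the 0.7 threshold are illustrative, not prescriptive.
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels from two raters applying the same draft policy
rater_a = ["violating", "allowed", "allowed", "violating", "allowed", "violating"]
rater_b = ["violating", "allowed", "violating", "violating", "allowed", "allowed"]

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.2f}")

if kappa < 0.7:  # illustrative bar; choose a threshold that fits your risk tolerance
    print("Low agreement -- clarify the policy language before automating.")
```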
Plan to iteratively refine the policy: Building a content moderation policy is not a one-and-done process. As the platform and the world evolve, new issues surface. One example is AI-generated content, whose emergence forced many platforms to reconsider their policies. But policy changes can also cover more mundane “edge cases” discovered over time; for example, Meta’s restricted goods and services policy changelog shows an update on Feb 27, 2025 covering recalled goods. As necessary as they are, however, policy changes are tricky to handle: they can cause confusion internally and externally, and they require careful process and system design.
Separate policy and implementation: LLMs make it possible to implement content moderation by simply using the policy as a prompt. However, this leads to organizational issues, such as policy and machine learning teams stepping into each other’s shoes while lacking the necessary expertise, as observed by the Spotify team. In our experience, separating the policy from the implementation allows each team to operate unencumbered. The policy team can focus on tailoring the policy language to platform needs without overfitting it to a specific version of a specific LLM, and without reaching an impasse because no amount of wording changes achieves the desired precision/recall. At the same time, the machine learning team can focus on using the latest available techniques (including DSPy, fine-tuning, RL, and agentic approaches) to translate the policy into the best-performing implementation.
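One minimal way to express this separation in code is to keep the policy text and the LLM implementation in distinct, independently owned pieces. The sketch below is an assumption about structure, not our production design; `llm_client.complete` is a placeholder for whatever LLM API is in use.

```python
# Minimal sketch of keeping policy and implementation separate.
# The policy team owns POLICY_TEXT (plain language, versioned on its own);
# the ML team owns build_prompt()/classify() and can swap techniques
# (prompting, DSPy, fine-tuning) without touching the policy wording.
# `llm_client.complete` is a placeholder for the LLM API actually used.

POLICY_TEXT = """\
Policy v3.2 -- Restricted Goods (illustrative)
Content that offers to buy or sell recalled consumer products is not allowed.
"""

def build_prompt(policy: str, content: str) -> str:
    # The ML team may restructure this mapping freely; the policy text is an
    # input to the implementation, not part of the prompt engineering.
    return (
        f"You are a content moderator. Apply the following policy:\n{policy}\n"
        f"Content to review:\n{content}\n"
        "Answer with exactly one word: VIOLATING or ALLOWED."
    )

def classify(content: str, llm_client) -> str:
    return llm_client.complete(build_prompt(POLICY_TEXT, content)).strip()
```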
Human-in-the-loop for edge cases: From a trust and governance perspective, the boundaries of what is or is not allowed should be set by humans, with AI systems operating within those human-defined constraints. This is arguably automated content moderation’s version of Asimov’s Second Law of Robotics: a robot must obey humans. The way to achieve this outcome is to ensure that cases that are borderline or ambiguous under the policy are escalated to humans (even though humans aren’t free of bias either), rather than having the AI make arbitrary calls without proper societal context.
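A simple way to operationalize this is confidence-based routing: enforce automatically only when the classifier is confident, and queue everything else for human review. The sketch below assumes the classifier exposes a confidence score; the threshold and queue are illustrative.

```python
# Minimal sketch of escalating borderline cases to human reviewers.
# Assumes the automated classifier returns a label plus a confidence score;
# the 0.85 threshold and the in-memory queue are illustrative placeholders.
from dataclasses import dataclass

@dataclass
class Decision:
    label: str         # "violating" or "allowed"
    confidence: float  # 0.0 - 1.0

human_review_queue: list[tuple[str, Decision]] = []

def route(content_id: str, decision: Decision, threshold: float = 0.85) -> str:
    if decision.confidence >= threshold:
        return f"auto-enforce: {decision.label}"
    # Borderline or ambiguous under the policy: a human sets the boundary.
    human_review_queue.append((content_id, decision))
    return "escalated to human review"
```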
How/Why these Learnings Apply to AI/Agents
Content moderation is also an issue for generative AI. Indeed, one aspect that makes today’s LLMs impressive is that they are much less likely to output harmful content than previous generations of AI. However, AI / agent systems can still fail in many other ways, which is why AI / agent evaluation has become a key focus area recently. Agent evaluation is similar to content moderation in that both involve evaluating content against criteria (a rubric or policy), so the learnings from content moderation also apply to the AI eval problem, as detailed below.
Have a well-defined, inter-rater reliable policy: Human evaluation is often the “gold standard” for AI systems, but if the evaluation rubric is too subjective, the humans themselves can have low alignment. As a result, the evaluation results can be too noisy and can change every time the human raters change, which is frustrating and can lead to regressions or longer development cycles. While some tasks are inherently subjective (e.g., AI systems generating artwork), in many other cases a lack of inter-rater alignment is a sign of an underspecified rubric. The sooner this is detected and fixed, the faster the performance iteration cycle can be once development starts.
Plan to iteratively refine the policy: The need for rubric refinement for AI has already been recognized, and frameworks have been proposed for this purpose, but the importance and practical impact of this issue is likely still underestimated in the AI/agent development world. In our experience, policy iteration causes significant friction in content moderation (e.g., models learning from labels made under previous iterations of the policy leads to suboptimal behavior): documentation, training materials, training data, and machine learning models all need to be versioned, and the various subsystems involved in the overall moderation system need to handle multiple versions, as sketched below. The need for rubric evolution will bring similar challenges when developing AI / agentic systems.
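As one small, assumed illustration of that versioning discipline, labels can be tagged with the policy version they were produced under so that stale labels are excluded (or queued for relabeling) when the policy changes. Field names below are hypothetical.

```python
# Minimal sketch: tag every label with the policy version it was made under,
# so training data collected under older policy versions can be excluded
# (or relabeled) when the policy changes. Field names are illustrative.
from dataclasses import dataclass

@dataclass
class LabeledExample:
    content: str
    label: str
    policy_version: str  # e.g. "restricted-goods-v3.2"

def training_set(examples: list[LabeledExample], current_version: str) -> list[LabeledExample]:
    # Only train on labels produced under the current policy version;
    # stale labels would teach the model the previous policy's boundaries.
    return [ex for ex in examples if ex.policy_version == current_version]
```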
Note that the two learnings below come from automating content moderation; they apply in the context of automating AI evals with approaches such as LLM-As-Judge.
Separate policy and implementation: LLM-As-Judge frameworks often simply ask the user to write a prompt that aligns with the evaluation goal. But in a real-world use case with multiple stakeholders, this is not the task of one person; instead, we need a framework where multiple stakeholders can collaborate effectively to produce a high-performing LLM-As-Judge. The framework needs to make it possible to iterate independently on (a) the rubric, (b) examples / edge cases, and (c) prompts. This allows non-technical stakeholders / business owners to manage the rubric while AI engineers find the best way to map the rubric to a prompt / implementation.
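The sketch below shows one possible shape for that separation: the rubric, the edge-case examples, and the prompt template are distinct pieces that different owners can change independently. Names, prompt wording, and `judge_llm.complete` are assumptions for illustration, not a specific framework’s API.

```python
# Minimal sketch of an LLM-As-Judge setup where the rubric, the edge-case
# examples, and the prompt template can each be iterated on independently.
# Names and prompt wording are illustrative; `judge_llm.complete` is a
# placeholder for whatever LLM client is used.

# Owned by business stakeholders:
RUBRIC = "Score 1-5 for whether the agent's answer is grounded in the provided sources."

# Owned by whoever curates labeled edge cases:
EDGE_CASES = [
    {"answer": "The refund was issued on March 3.", "sources": "No refund record found.", "score": 1},
]

def judge_prompt(rubric: str, edge_cases: list[dict], candidate: dict) -> str:
    # Owned by AI engineers: free to restructure without touching the rubric or examples.
    shots = "\n".join(f"Example: {c}" for c in edge_cases)
    return (
        f"Rubric:\n{rubric}\n\n{shots}\n\n"
        f"Now score:\n{candidate}\nReply with a single integer from 1 to 5."
    )

def judge(candidate: dict, judge_llm) -> int:
    return int(judge_llm.complete(judge_prompt(RUBRIC, EDGE_CASES, candidate)).strip())
```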
Human-in-the-loop for edge cases: The recent AI evaluation blog post from Anthropic mentions that LLM-As-Judge evaluators should be closely calibrated with human experts. Doing so requires identifying cases where the LLM’s assessment differs from human assessment, then fixing the LLM evaluator to align with the humans. As the LLM evaluator improves, cases where the LLM and humans disagree become rarer, and therefore harder to identify. But when such disagreements are discovered through user / customer complaints or adversarial attacks, the business may pay a significant cost.
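One simple, assumed way to keep surfacing such cases is to periodically have human experts re-score a sample the LLM judge already scored and flag the disagreements for calibration. The data structures below are hypothetical.

```python
# Minimal sketch of surfacing judge/human disagreements for calibration.
# Assumes human experts periodically re-score a sample of cases that the
# LLM judge already scored; structure and field names are illustrative.

def disagreements(judged: dict[str, int], human: dict[str, int], tolerance: int = 0) -> list[str]:
    """Return case ids where the LLM judge and the human expert differ by more than `tolerance`."""
    return [
        case_id
        for case_id, h_score in human.items()
        if case_id in judged and abs(judged[case_id] - h_score) > tolerance
    ]

# Example usage: feed these cases back into judge-prompt iteration.
llm_scores = {"case-1": 5, "case-2": 2, "case-3": 4}
expert_scores = {"case-1": 5, "case-2": 4, "case-3": 4}
print(disagreements(llm_scores, expert_scores))  # -> ["case-2"]
```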
Parting Thoughts
Per the references cited in the previous section, we see some indication that those working on AI and agent evals have made observations similar to our learnings from content moderation, but we hypothesize that these issues will gain more prominence in that world. Hopefully, emerging AI eval frameworks will translate these learnings into innovative solutions that simplify the problem of building and evaluating AI systems and agents.
If you’re deploying AI agents in production and want to move beyond ad-hoc evals toward defensible, scalable, and continuously improving evaluation, we’d love to compare notes. We’re working with early design partners on SuperviseAI to operationalize these ideas end-to-end—from policy definition to LLM-As-Judge to human calibration.
👉 Get in touch to see how this approach would work for your agents.
