By Tom Siegel

AI in Trust & Safety: Panacea or Peril for Online Safety Efforts?


Every couple of years a new technology emerges that captures the world's imagination, promising transformative user experiences and vast new business opportunities.


A few years ago, it was Crypto. Then there was Virtual Reality. Now, it’s AI.


With recent advances in generative AI technology, computers are now as good as (and often better than) humans at creating and processing language.


Machines can now generate and classify huge amounts of content far faster and more cheaply than humans, while providing human-like insights. And they will quickly get even better, narrowing the gap with human cognitive ability and soon surpassing it. For certain tasks, AI will be much better than any human ever could be.


How will this technological progress affect Trust & Safety? 


AI is almost certain to become the most profound game changer since Online Safety tech became a ‘thing’ 15 years ago. 


With GenAI we can monitor and moderate more effectively, create better policies, and improve rater wellness dramatically. Technology demonstrations by TrustLab and others have shown up to 10x improvements in speed, cost, and accuracy for dealing with harmful and high-risk content.


But when a technology has great promise it often comes with risks that need to be managed carefully. AI for Online Safety is no exception. 


The same models that classify harmful content so effectively can also create it - lots of it, in a short amount of time, and more human-like than ever before. Quick iteration and adaptation by bad actors will not only result in more damaging abuse schemes but also in huge amounts of low-quality content that creates noise, drowning out high-quality, human-created content.


The big risk: bad actors will use generative AI tools to intentionally create misleading, manipulative, or otherwise harmful content and increasingly automate online harm.


In this article, I’ll discuss some of the key opportunities and risks that AI presents for Trust & Safety, and how we can make it a panacea rather than a peril for creating safer online experiences.


These are: 


  1. AI for Content Moderation

  2. AI Model Safety

  3. AI Safety Regulation

  4. AI Industry Standards

  5. AI Bias


Let’s dive in. 


1) AI for Content Moderation


Historically, AI in content moderation relied on painstakingly collecting and manually rating a large sample data set, constantly monitoring performance, and regularly retraining to create custom classifiers. 


The approach relies on a static underlying policy definition and is specific to each abuse category, format, and language. It is slow, expensive, and scales poorly.



Generative AI dramatically improves content policy and classifier generation in a few key areas: 


Speed: Defining policy language now takes hours, not weeks. Generative AI-based classifiers provide immediate results, and with prompt engineering, a new classifier rivaling human accuracy can be the work of an afternoon, not months.


Precision and Recall: For many standard moderation categories like hate, violence, and sexual content, measured performance is on par with humans, limited mainly by the quality of the ground-truth dataset used for alignment (a constant constraint for AI and non-AI moderation workflows alike).


Cost: Not surprisingly, classifiers are more economical than humans. With tuning, automation rates often exceed 50% of content that is currently reviewed manually. Overnight, content moderation efficiency doubles.


Customization: One-off policies for each client and use case - no problem.


Wellness: One of the most important benefits of increased automation in Content Moderation is that human moderators no longer have to be exposed to content that can affect their health negatively. 
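The GenAI-based classification flow described above can be sketched in a few lines. This is a minimal illustration, not TrustLab's implementation: the `llm` callable, the policy wording, and the label set are hypothetical placeholders for whatever model API and moderation policy you actually use.

```python
# Minimal sketch of a prompt-based GenAI content classifier.
# `llm` is any callable that takes a prompt string and returns the
# model's text completion (hypothetical; plug in your provider's API).

POLICY_PROMPT = """\
You are a content moderator. Policy: flag content that contains
hate speech, violent threats, or graphic sexual content.
Respond with exactly one word: VIOLATING or OK.

Content: {content}
Label:"""

def classify(content: str, llm) -> str:
    response = llm(POLICY_PROMPT.format(content=content))
    label = response.strip().upper()
    # Safety net: anything unexpected is routed to human review
    # instead of being trusted as a model decision.
    if label not in ("VIOLATING", "OK"):
        return "NEEDS_REVIEW"
    return label

# Usage with a stub model; a real deployment would call a hosted LLM.
fake_llm = lambda prompt: "OK"
print(classify("hello world", fake_llm))  # OK
```

Changing the policy here means editing a prompt, not retraining a model - which is exactly why iteration drops from months to an afternoon.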


While these benefits are exciting, the risks are real, and mitigating them is important.


Cost: GenAI models are expensive. Running GenAI over the entire corpus of content is often cost-prohibitive. Distillation is a promising technique to reduce this constraint. 
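Distillation here means using the expensive GenAI model as a teacher to label a sample of content, then training a much cheaper student model to imitate it across the full corpus. A toy sketch, with a deliberately trivial keyword-matching student standing in for a real lightweight classifier (all names are hypothetical):

```python
from collections import Counter

def distill(sample, teacher, ratio=2):
    """Train a trivial keyword-based student from teacher labels.

    `teacher` stands in for an expensive GenAI classifier: it takes a
    text and returns "VIOLATING" or "OK". The student is deliberately
    simplistic to keep the sketch readable.
    """
    flagged, clean = Counter(), Counter()
    for text in sample:
        counter = flagged if teacher(text) == "VIOLATING" else clean
        counter.update(text.lower().split())
    # Keep tokens that appear much more often in violating content.
    keywords = {t for t, c in flagged.items() if c > ratio * clean.get(t, 0)}

    def student(text):
        # Cheap inference: no model call, just a set intersection.
        return "VIOLATING" if set(text.lower().split()) & keywords else "OK"
    return student

# Usage: the teacher labels a small sample; the student handles the rest.
teacher = lambda t: "VIOLATING" if "attack" in t else "OK"
student = distill(["attack now", "nice day", "attack them", "hi friend"], teacher)
print(student("plan the attack"))  # VIOLATING
```

The cost saving comes from the asymmetry: the teacher runs once per sampled item, while the student runs over the entire corpus at near-zero marginal cost.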


Hallucination: Models don’t always get it right, and when they’re wrong, they can be wrong by a lot. Additional safety nets and mitigation steps are required.
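One common safety net is confidence-based routing: auto-action only the decisions the model is sure about, and send the rest to human review. A minimal sketch, assuming a model that returns a label plus a confidence score in [0, 1] (both the interface and the threshold value are illustrative):

```python
def route(item, model, auto_threshold=0.9):
    """Return ("auto", label) or ("human_review", label).

    `model` is any callable returning (label, confidence); the
    threshold is a tunable trade-off between automation rate and
    tolerance for model mistakes.
    """
    label, confidence = model(item)
    if confidence >= auto_threshold:
        return ("auto", label)
    # Low-confidence (possibly hallucinated) decisions go to a human.
    return ("human_review", label)
```

Lowering the threshold raises the automation rate but lets more mistakes through; the right value depends on how damaging an error is for each harm category.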


Model Selection: General-purpose AI models are still largely untested; there are so many that choosing the right one can be confusing, and capacity constraints for broad-scale use are real. Technology development is progressing quickly. Picking the right foundational model, and the proper fine-tuned model on top of it, requires expertise and constant testing.





2) AI Model Safety


As many researchers and social media influencers have demonstrated, it is surprisingly easy to make a GenAI model say harmful things. Most mitigation efforts on the model-provider side rely on rule-based approaches that restrict keywords, which create their own problems with over-filtering and limited coverage.


While red teaming and testing are helpful, they don’t provide an effective safety net. 


Big Tech, GenAI companies, and startups including TrustLab conduct a lot of research on how to better safeguard models from generating harmful results. But so far solutions are falling short, requiring everyone who uses these models to pay extra attention and be cautious. 



3) AI Safety Regulation


Governments, including the European Union and the US, have recognized the need to put guardrails around AI Technology. This cautious and safety-conscious approach has also been echoed by many business leaders and researchers in the field. 


The EU AI Act was announced a couple of weeks ago, making Europe the very first jurisdiction to propose specific rules for the development and use of AI. 


How these rules shape up and how they are enforced will have a major impact on the future development and safety of the technology. It’s a space we all need to watch closely. However, given the understandably slow pace at which regulation moves, the effects are unlikely to be felt before 2025.





Policy changes in response to calls to label AI are gaining traction with creator platforms, underscoring the growing concern about the influence of AI in digital content generation.


Companies operating in this space must be prepared to navigate complex and sometimes conflicting regulatory demands, a challenging but ultimately necessary task.



4) AI Industry Standards


Relying on governments to define speech can be problematic and carries many risks. This is something I’ve written about before and will continue to as we navigate these changes in the online landscape. 


AI regulation is no different. The best outcome would be transparent industry standards with broad participation by many stakeholders. New algorithms should be vetted and monitored by an independent board. It’s a great opportunity for academic institutions and think tanks to take charge.



5) AI Bias 


With so much focus on the capabilities of GenAI, the risks and dangers of bias in AI don’t receive enough attention. 


Content creation and recommendations are fundamental drivers of our user experience, influencing our beliefs and actions. Filter bubbles and echo chambers are just two examples of how they can harm society and bias against certain groups, beliefs, opinions, or viewpoints.


This bias is difficult to quantify, as it is inherent in both the data and the algorithms.

The data collection, training, and monitoring of AI is highly susceptible to bias in many forms. 


The emerging field of AI fairness has been trying to tackle this challenge for years, but unfortunately, not enough tangible progress has been made to deliver generalizable solutions for the industry. My observation is that most platforms don’t have good answers to how to mitigate bias effectively, and much more needs to be done. Transparency in those efforts would be an important first step.



Panacea or Peril? There is a tool for that, too.


Reflecting on my long tenure in the Trust & Safety field, I see GenAI as one of the most exciting opportunities our field has ever experienced.


It allows us to mitigate risk faster and more cost-effectively, and it protects content moderators from the psychological harms of their work. However, we can’t allow AI tech to kick down doors without understanding its risks and flaws. Inadequate safeguards and AI bias, for instance, need to be addressed.


Reliance on government definitions of speech can also be problematic and carries risks for freedom of expression, which is why establishing clear standards across the industry could lead to more responsible AI development and use.


The risks are high, but the opportunities for creating a safer web are much higher. 


It won’t come easy, but it’s something we need to embrace. AI is transforming Trust & Safety at its core, with all its exciting possibilities, and risks. Let’s make the most of it.


PS: Interested in finding out more about how you can leverage Gen AI tech for better content moderation? Tom and the team at TrustLab are focused on helping companies take full advantage of the new capabilities and managing the risks. Let’s chat :) 

