From the course: Making Your AI Results More Predictable
Using moderation components
- What if you could hire someone to check the inputs, and perhaps the outputs, of your AI model? Moderation APIs let you do something similar by using AI to classify text and language. You can use moderation APIs to reduce malicious use of your AI system, and perhaps to filter model output for inappropriate language. It's important to keep in mind that while these components are powerful, some malicious users can bypass them using jailbreaks.

Here's one flow we can use: an inappropriate input is picked up by the moderation API, and a canned response is conditionally returned instead, whereas an appropriate input makes it through the moderation component and on to the language model. In this case, we don't check the language model's response; whether you trust the output of your model is really up to you and depends on what the stakes are. A sketch of this gating flow appears at the end of this section.

To see moderation in action, let's head over to OpenAI's documentation. Here, you'll notice that the moderation component is pretty straightforward: you pass in an input and receive a detailed report with different categories of inappropriate content. Now I'm going to take this example, which contains some pretty harmless text, and bring it into Visual Studio Code. I'll also bring in pprint from Python's standard library; this lets me format the output in a slightly more readable way, so I'll pretty print the output as a dictionary. A minimal version of this script appears below.

Next, I'll install the OpenAI package: python3 -m pip install openai. It's important to always double-check the names of the packages we install, and you'll notice it's already installed here. I'll clear my terminal. Next, I'll set an environment variable, and for this you'll need to head over to your OpenAI account and grab an API key. So I'll run export OPENAI_API_KEY=... , and unless you want to change the client's configuration, it's important to use this exact variable name, since that's the one the OpenAI library looks for. I'm not actually going to show my key since it's a secret.

Now I should be ready to run this code: python3 moderation.py. I get back a detailed report, and it's basically telling me that this is pretty clean text. You can see false, false, false across all the categories of inappropriate content, along with numeric scores that can help you set even more fine-grained rules. Finally, it's important to note that these mechanisms can sometimes be bypassed, so you shouldn't rely solely on them, but they are a nice layer of protection to have, and they're fairly accessible through APIs such as the OpenAI API.
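Here's a minimal sketch of what the moderation.py script might look like, assuming the current openai Python package (v1 or later) with an OPENAI_API_KEY environment variable set; the exact code shown in the video may differ slightly, and the sample input is just an illustrative placeholder.

    # moderation.py -- a minimal sketch, not the exact script from the video.
    # Assumes the openai package (v1+) is installed and OPENAI_API_KEY is set.
    from pprint import pprint

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Classify a harmless example input across the moderation categories.
    response = client.moderations.create(
        input="I want to learn how to bake bread this weekend."
    )

    # model_dump() converts the result object to a plain dictionary,
    # which pprint then formats as readable, nested output.
    pprint(response.results[0].model_dump())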
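And here's a sketch of the gating flow described earlier: the input is checked first, and only clean input is forwarded to the language model. The function name, the model name, and the refusal message are illustrative choices of mine, not something prescribed in the video.

    # A sketch of the conditional flow: moderate the input, and only
    # forward it to the language model if nothing was flagged.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def answer_safely(user_input: str) -> str:
        # Step 1: classify the input with the moderation endpoint.
        moderation = client.moderations.create(input=user_input)
        if moderation.results[0].flagged:
            # Flagged input never reaches the model; return a canned refusal.
            return "Sorry, I can't help with that request."
        # Step 2: clean input goes on to the language model. Note that we
        # don't re-check the model's response here; whether to add a second
        # moderation pass on the output depends on your stakes.
        completion = client.chat.completions.create(
            model="gpt-4o-mini",  # example model name, not prescribed
            messages=[{"role": "user", "content": user_input}],
        )
        return completion.choices[0].message.content

    print(answer_safely("What's a good recipe for banana bread?"))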