As enterprises build out Generative Artificial Intelligence (GenAI) capabilities, one of the biggest security challenges we hear from customers is ensuring the integrity of the data used by large language models (LLMs) in GenAI applications. Customers tell us they face this challenge both with the private data used to fine-tune foundation models (FMs) and with output, such as text or chats, that they pass on to end users or applications.
We call this the “garbage in / garbage out” problem: if the data going into GenAI models contains malicious code, or if the AI output unintentionally passes along sensitive data, businesses can unwittingly put themselves at risk simply by adopting new technology. If enterprises are using GenAI, so are malicious actors, and businesses need to take steps to prevent abuse. This concern is especially relevant in the ‘wild west’ of malware and AI, a frontier defined by great opportunity and increasing vulnerabilities.
To protect applications integrated with GenAI, organizations should scan data for both malicious code and sensitive information, from the data used to train GenAI models through to the text or chat output delivered downstream.
GenAI Secure by Cloud Storage Security (CSS) allows customers to protect both model data and output from GenAI applications, guarding against risks such as data poisoning and data loss, without disrupting workflows.
GenAI Secure can be used with AWS’ suite of artificial intelligence-focused tools such as Amazon Bedrock, Amazon SageMaker, and AWS Trainium. As long as an AI service uses CSS-supported storage services, you can scan for both malicious code and sensitive data using GenAI Secure. CSS automates data security with support for Amazon Simple Storage Service (Amazon S3), Amazon Elastic Block Store (Amazon EBS), Amazon Elastic File System (Amazon EFS), and Amazon FSx.
This article reviews how enterprises using Amazon Bedrock can use GenAI Secure to scan data in Amazon S3 and create a quarantine pipeline that doesn’t disrupt data cleared for training GenAI models or output destined for end users.
GenAI Secure for AWS
GenAI Secure is built using modern serverless architecture and installs within your AWS account, so data never leaves your environment. Deployment via an AWS CloudFormation template or HashiCorp Terraform module makes it easy to get up and running so that you can secure AI data rapidly.
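As a minimal sketch, the CloudFormation deployment could be scripted with boto3 as shown below; the template URL, stack name, and parameter names are placeholders, not the actual values from the CSS deployment guide.

```python
import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")

# Launch the GenAI Secure stack. The TemplateURL and Parameters below are
# hypothetical placeholders; use the values from the CSS deployment guide.
cfn.create_stack(
    StackName="genai-secure",
    TemplateURL="https://example.com/genai-secure.template.yaml",  # placeholder
    Parameters=[
        {"ParameterKey": "AdminEmail", "ParameterValue": "secops@example.com"},  # placeholder
    ],
    Capabilities=["CAPABILITY_NAMED_IAM"],  # the stack creates IAM roles
)
```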
Once deployed, Amazon S3 buckets are automatically discovered and cataloged for scanning, whether in one or multiple regions or accounts, giving you confidence that AI model data in S3 is secure.
To scan data for sensitive information, you can enable default classification rule sets that include Payment Card Industry (PCI) standards, Health Insurance Portability and Accountability Act (HIPAA), and other predefined policies for common personally identifiable information (PII) items across a variety of regions. Perhaps more importantly, GenAI Secure allows you to create your own custom regular expression (RegEx) policies to detect sensitive data in user inputs and FM responses and prevent specific prompts or outputs from being exposed. Crafting RegEx policies can be challenging because the syntax can be complex and dense. Leveraging Amazon Bedrock, GenAI Secure lets you describe in plain text the pattern or value you want to detect and returns the exact RegEx to use for that rule.
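For illustration, a custom policy of this kind boils down to a pattern match like the Python sketch below; the project-ID format is a made-up example, and the real rule would be defined in the GenAI Secure console.

```python
import re

# Illustrative pattern for a hypothetical internal identifier such as
# "PROJ-2024-0042"; in practice the rule is defined as a custom RegEx
# classification policy in the GenAI Secure console.
PROJECT_ID = re.compile(r"\bPROJ-\d{4}-\d{4}\b")

def contains_sensitive(text: str) -> bool:
    """Return True if the text matches the custom sensitive-data pattern."""
    return bool(PROJECT_ID.search(text))

print(contains_sensitive("Status update for PROJ-2024-0042"))  # True
```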
You can scan over 300 different file types as large as 5 TB (the maximum object size permitted by Amazon S3) using the following scan models:
- Event-based - scan data in real time as it is pushed to your model data buckets.
- Retro - scan existing data on demand or via schedules.
- API - scan content and receive a verdict before delivering chat/text output to end users or another application (see the sketch below).
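To give a feel for the API model, here is a minimal sketch of calling a scan endpoint before releasing output; the endpoint URL and request/response fields are assumptions, so refer to the CSS API documentation for the real interface.

```python
import requests

SCAN_URL = "https://genai-secure.example.com/api/scan"  # hypothetical endpoint

def scan_before_delivery(text: str, api_key: str) -> bool:
    """Submit chat/text output for a verdict before returning it to the user."""
    resp = requests.post(
        SCAN_URL,
        headers={"Authorization": f"Bearer {api_key}"},
        json={"content": text},  # hypothetical request shape
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("verdict") == "clean"  # hypothetical response field
```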
You can enable multiple virus detection engines for increased detection efficacy. When a verdict is returned and an object is found to be infected or to contain sensitive information, you may quarantine it, delete it, or keep it in place.
Additionally, we’ve harnessed the power of SophosLabs to offer Static Analysis and Dynamic Analysis. GenAI Secure also leverages Amazon Bedrock to perform forensic analysis on files and provide recommendations for remediation. Malicious files are often cryptic or require additional analysis to understand what the malware does, and security analysts frequently turn to external tools like Google Search or VirusTotal for context, which can slow down an investigation or increase the risk of manual error. GenAI Secure keeps the investigation within your AWS environment, so you don’t have to worry about data transferring via the public internet or being sent to a third-party application.
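As a rough illustration of the kind of Bedrock call such an analysis involves, the sketch below asks a model to explain a detection; the prompt, finding details, and model ID are assumptions for the example (GenAI Secure performs this step for you).

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# Hypothetical finding details; GenAI Secure assembles the real context.
prompt = (
    "Explain what this malware detection means and suggest remediation steps: "
    "verdict=Trojan.GenericKD, file=training/batch-01.jsonl"
)

response = bedrock.invoke_model(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # one example model ID
    body=json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 512,
        "messages": [{"role": "user", "content": prompt}],
    }),
)
print(json.loads(response["body"].read())["content"][0]["text"])
```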
GenAI Secure’s user-friendly console displays findings, but also integrates with services such as AWS Security Hub or your own Security Information and Event Management (SIEM) for consolidated reporting. Refer to Proactive Notifications in the CSS Help Docs for more information.
It only takes a few clicks to implement the scan model that best fits your GenAI pipeline. Below, we explore how to protect GenAI models and secure AI data using CSS’s event-based scanning model.
Protect GenAI Model Data Inputs
When it comes to securing model data, malware scanning is best applied before the AI data is used to create or enhance models. Additionally, if you wish to restrict the use of sensitive data within your models, that information should be removed before the data is used. For example, to scan external LLM training data when it’s received, before uploading a custom AI model to Amazon Bedrock (e.g., a data input pipeline), you could use the event-based model with a staging Amazon S3 bucket. The input pipeline example in Figure 1 below comprises the following steps:
- Customers/applications upload objects to a staging S3 bucket
- As soon as an object is uploaded, Amazon SNS sends an event notification to Amazon SQS
- Amazon SQS queues the data for scanning
- Amazon SNS tells the agents to start scanning
- The agents kick off the scan task
- Results are posted to DynamoDB, which sends results to the GenAI Secure console
- Once scans are complete, the agents move malicious objects or objects that contain sensitive data to a quarantine bucket
- Safe objects are marked with appropriate tags
- Lambda moves safe objects to the S3 bucket used for custom models (see the sketch after the figure)
- The data can then be used by Amazon Bedrock for training
Figure 1: GenAI Secure Input Pipeline
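A minimal sketch of that Lambda step follows, assuming the function is invoked by an S3 event notification for scanned objects and that clean objects carry a scan-result tag; the tag key and value shown are assumptions, not the exact tags GenAI Secure applies (see the CSS Help Docs for those).

```python
import os
import boto3

s3 = boto3.client("s3")
MODEL_BUCKET = os.environ["MODEL_BUCKET"]  # destination bucket for custom model data

def handler(event, context):
    """Copy objects tagged clean by the scan agents into the model bucket."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        # Hypothetical tag key/value; check the actual scan-result tags
        # that GenAI Secure applies in your deployment.
        tags = s3.get_object_tagging(Bucket=bucket, Key=key)["TagSet"]
        verdicts = {t["Key"]: t["Value"] for t in tags}
        if verdicts.get("scan-result") == "clean":
            s3.copy_object(
                Bucket=MODEL_BUCKET,
                Key=key,
                CopySource={"Bucket": bucket, "Key": key},
            )
```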
Protect GenAI Application Outputs
If sensitive data is used to train your model but you need to prevent sharing confidential information via the chat/text output produced by your GenAI application (e.g., a data output pipeline), it’s best to classify the data before it is delivered to the end user.
Scanning AI/ML outputs via the event-based model follows a similar process to the data input pipeline above; however, you designate the Bedrock output bucket as the staging bucket and use a separate Amazon S3 bucket as the custom model output bucket. The output pipeline example in Figure 2 below comprises the following steps:
- Customers/applications interact with Bedrock to produce chat/text output
- That output goes into a staging bucket
- As soon as an object is uploaded, Amazon SNS sends an event notification to Amazon SQS
- Amazon SQS queues the data for scanning
- Amazon SNS tells the agents to start scanning
- The agents kick off the scan task
- Results are posted to DynamoDB, which sends results to the GenAI Secure console
- Once scans are complete, the agents move malicious objects or objects that contain sensitive data to a quarantine bucket
- Safe objects are marked with appropriate tags
- Lambda moves safe objects to the S3 bucket designated for custom model output
- The resulting chat/text is free of sensitive data and can be used (see the tag-check sketch after the figure)
Figure 2: GenAI Secure Output Pipeline
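Before an application serves a scanned response, it could verify the verdict tag on the output object along the lines of the sketch below; the bucket, key, and tag names are assumptions for illustration.

```python
import boto3

s3 = boto3.client("s3")

def output_is_safe(bucket: str, key: str) -> bool:
    """Return True only if the scanned output object carries a clean verdict tag."""
    tags = s3.get_object_tagging(Bucket=bucket, Key=key)["TagSet"]
    verdicts = {t["Key"]: t["Value"] for t in tags}
    # Hypothetical tag key/value; match the tags used in your deployment.
    return verdicts.get("scan-result") == "clean"

# Hypothetical bucket and key names.
if output_is_safe("bedrock-output-clean", "responses/chat-123.json"):
    obj = s3.get_object(Bucket="bedrock-output-clean", Key="responses/chat-123.json")
    body = obj["Body"].read()  # safe to deliver downstream
```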
Built for and Powered by AWS to Secure AI Data
GenAI Secure is built for and powered by AWS using AWS Services in Scope of AWS assurance programs, which comprise control mappings to help customers establish the compliance of their environments running on AWS. When software sits on authorized infrastructure, its controls are inherited from that authorized system. CSS is built on and integrates with a variety of AWS services for administration and scanning. The AWS services listed below are required to deploy and run GenAI Secure.
Figure 3: AWS Services Required to Deploy and Run GenAI Secure
Additional AWS services may be used depending on your deployment needs and how you would like to process findings. For example, if you would like to deploy GenAI Secure in a multi-Region architecture using private subnets, you would need to designate an Amazon Virtual Private Cloud (Amazon VPC). Optionally, you can send scan results to services like AWS Security Hub, AWS CloudTrail Lake, and Amazon EventBridge for forensic use cases or to alert teams or applications in other workflows.
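As one example, findings could be pushed onto the default Amazon EventBridge bus so other workflows can react; the event source, detail-type, and detail shape below are assumptions, not an interface GenAI Secure defines.

```python
import json
import boto3

events = boto3.client("events")

# Publish a scan finding as a custom event. The Source, DetailType, and
# Detail fields are hypothetical; adapt them to your own conventions.
events.put_events(
    Entries=[
        {
            "Source": "custom.genai-secure",
            "DetailType": "ScanFinding",
            "Detail": json.dumps({
                "bucket": "model-staging",
                "key": "training/batch-01.jsonl",
                "verdict": "infected",
            }),
            "EventBusName": "default",
        }
    ]
)
```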
Conclusion
In this article, we discussed how to use GenAI Secure with Amazon Bedrock to protect GenAI models and secure AI outputs. We covered integrating the serverless solution with your workflow, discussed the AWS services used by GenAI Secure, and showed you how to build a quarantine workflow for malicious files and sensitive data.
GenAI Secure can be used with any AI service that uses CSS-supported AWS storage and file services to house and serve model data. When you integrate GenAI Secure by Cloud Storage Security (CSS) into your GenAI application workflow, you can ensure the data is safe for processing, transformation, and use.
Contact us for a demo and to learn more about scanning LLMs for malicious code and sensitive data using GenAI Secure.
About Cloud Storage Security
Cloud Storage Security (CSS) is an AWS Public Sector Partner and AWS Marketplace Seller with an AWS qualified software offering, AWS Security Competency, and an AWS Global Security & Compliance Acceleration (ATO on AWS) designation. CSS helps customers around the world automate and secure data transfers and data stored in multiple AWS storage services including Amazon S3, Amazon EFS, Amazon EBS, and Amazon FSx.