Introduction
In part 2, we covered creating text and image embeddings with Amazon Nova 2 Multimodal Embeddings and storing them in Amazon S3 Vectors using the AWS Java SDK. In this part of the series, we'll take a look at audio and video embeddings.
Create and store audio embeddings
We'll reuse many parts of the process for creating and storing text and image embeddings described in part 2. The relevant business logic of our sample application can still be found in the AmazonNovaMultimodalEmbeddings class.
For demo purposes, I converted the official AWS video about the AWS Lambda function into an .mp3 file with the name defined in private static final String[] AUDIO_NAMES = { "AWS-Lambda-explained-in-90-seconds-audio" } and uploaded it into the S3 bucket defined as private final static String S3_BUCKET = "s3://vk-amazon-nova-2-mme/" (please use your own unique bucket name).
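Besides AUDIO_NAMES and S3_BUCKET, the snippets below reference a few more constants. Here is a minimal sketch of how they could be declared; MODEL_ID and EMBEDDING_DIMENSION follow the values named later in this article, while the file extensions and the client setup are my assumptions, not code from the sample application:
// A sketch of further constants used in the snippets below. MODEL_ID and
// EMBEDDING_DIMENSION follow the values named in this article; the file
// extensions and the client/mapper setup are assumptions.
// BedrockRuntimeClient: software.amazon.awssdk.services.bedrockruntime.BedrockRuntimeClient
// ObjectMapper: com.fasterxml.jackson.databind.ObjectMapper
private final static String MODEL_ID = "amazon.nova-2-multimodal-embeddings-v1:0";
private final static int EMBEDDING_DIMENSION = 384;
private final static String AUDIO_EXTENSION = ".mp3"; // assumption based on the uploaded file
private final static String VIDEO_EXTENSION = ".mp4"; // assumption based on the uploaded file
private final static BedrockRuntimeClient BEDROCK_RUNTIME_CLIENT = BedrockRuntimeClient.create();
private final static ObjectMapper MAPPER = new ObjectMapper();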
As this audio file is longer than 15 seconds (it runs a bit over 90 seconds), we need to use the asynchronous Bedrock API. It splits the audio into 15-second segments and creates an embedding for each of them. The relevant part is the method createAndStoreAudioEmbeddings:
private static void createAndStoreAudioEmbeddings() throws Exception {
    for (String audioName : AUDIO_NAMES) {
        asyncInvokeBedrockModelAndPutVectorsToS3(prepareAudioDocument(S3_BUCKET +
                audioName + AUDIO_EXTENSION), audioName, "embedding-audio.jsonl");
    }
}
Let's look at what's happening here. First, in the prepareAudioDocument method, we use the software.amazon.awssdk.core.document.Document API to create the JSON request for the audio embedding. We set taskType to SEGMENTED_EMBEDDING (because we split the audio file) and durationSeconds of the segmentationConfig to 15 seconds. Then we set embeddingPurpose to GENERIC_INDEX and embeddingDimension to 384. We can choose between 4 dimension sizes to trade off embedding accuracy against vector storage cost: 3072, 1024, 384, and 256. Then we define the audio format as mp3 and the source with the s3Location of our audio file. For the complete embeddings request and response schema, I refer to the following article. Below is the complete source code of this method:
private static Document prepareAudioDocument(String s3_audio_uri) {
    var s3locationConfig = Document.mapBuilder()
            .putString("uri", s3_audio_uri).build();
    var sourceConfig = Document.mapBuilder()
            .putDocument("s3Location", s3locationConfig).build();
    var durationConfig = Document.mapBuilder()
            .putNumber("durationSeconds", 15).build();
    var audioConfig = Document.mapBuilder().putString("format", "mp3")
            .putDocument("source", sourceConfig)
            .putDocument("segmentationConfig", durationConfig).build();
    var singleEmbeddingParams = Document.mapBuilder()
            .putString("embeddingPurpose", "GENERIC_INDEX")
            .putNumber("embeddingDimension", EMBEDDING_DIMENSION)
            .putDocument("audio", audioConfig).build();
    var request = Document.mapBuilder()
            .putString("taskType", "SEGMENTED_EMBEDDING")
            .putDocument("segmentedEmbeddingParams", singleEmbeddingParams).build();
    return request;
}
The generated Document looks like this:
{
    "taskType": "SEGMENTED_EMBEDDING",
    "segmentedEmbeddingParams": {
        "embeddingPurpose": "GENERIC_INDEX",
        "embeddingDimension": 384,
        "audio": {
            "format": "mp3",
            "source": {
                "s3Location": {
                    "uri": "s3://vk-amazon-nova-2-mme/AWS-Lambda-explained-in-90-seconds-audio.mp3"
                }
            },
            "segmentationConfig": {
                "durationSeconds": 15
            }
        }
    }
}
Now, let's explore what's happening next in the asyncInvokeBedrockModelAndPutVectorsToS3 method. We first use the created Document to start the asynchronous invocation of the Bedrock model, which happens in the startAsyncInvokeBedrockModel method. The asynchronous mode is required for audio and video because segmenting and embedding the file can take several minutes. If the file duration is 15 seconds or less, we can use the synchronous invocation as we did for creating text and image embeddings. We first build an AsyncInvokeS3OutputDataConfig object by providing it the S3_EMBEDDINGS_DESTINATION_URI, which is defined as private final static String S3_EMBEDDINGS_DESTINATION_URI = S3_BUCKET + "embeddings-output/". It is then used to build an AsyncInvokeOutputDataConfig object. Next, we build a StartAsyncInvokeRequest object by providing it with the Document and the AsyncInvokeOutputDataConfig created before, as well as the model id (amazon.nova-2-multimodal-embeddings-v1:0). Finally, we use the Bedrock Runtime Client to send the StartAsyncInvokeRequest and read the invocationArn from the response. We return this Amazon Resource Name (ARN) to the caller. Below is the complete source code of the startAsyncInvokeBedrockModel method:
private static String startAsyncInvokeBedrockModel(Document document) {
    var ais3dc = AsyncInvokeS3OutputDataConfig.builder()
            .s3Uri(S3_EMBEDDINGS_DESTINATION_URI).build();
    var aiodc = AsyncInvokeOutputDataConfig.builder()
            .s3OutputDataConfig(ais3dc).build();
    var saiRequest = StartAsyncInvokeRequest.builder()
            .modelId(MODEL_ID).modelInput(document)
            .outputDataConfig(aiodc).build();
    var saiResponse = BEDROCK_RUNTIME_CLIENT.startAsyncInvoke(saiRequest);
    return saiResponse.invocationArn();
}
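For short clips of 15 seconds or less, the synchronous path mentioned above could look roughly like this. This is a minimal sketch: invokeBedrockModelSynchronously and jsonRequest are illustrative names, and the exact shape of the request body is an assumption based on part 2 of the series:
// A hedged sketch of the synchronous invocation for clips of 15 seconds or
// less. jsonRequest is a placeholder for a request body analogous to part 2.
// InvokeModelRequest: software.amazon.awssdk.services.bedrockruntime.model.InvokeModelRequest
// SdkBytes: software.amazon.awssdk.core.SdkBytes
private static String invokeBedrockModelSynchronously(String jsonRequest) {
    var request = InvokeModelRequest.builder()
            .modelId(MODEL_ID)
            .body(SdkBytes.fromUtf8String(jsonRequest))
            .build();
    var response = BEDROCK_RUNTIME_CLIENT.invokeModel(request);
    // The embedding is returned directly in the response body as JSON
    return response.body().asUtf8String();
}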
Let's continue to explore the asyncInvokeBedrockModelAndPutVectorsToS3 method after we have started the asynchronous Bedrock model invocation. We create a GetAsyncInvokeRequest object with the invocationArn. Then we use the Bedrock Runtime Client to send it and read the status from the response. As long as the status is IN_PROGRESS, we pause for 20 seconds and repeat our GetAsyncInvokeRequest until we get the status COMPLETED. At that point, we are ready to fetch the results from the s3Uri contained in the GetAsyncInvokeResponse. For our audio file, the embeddings end up there in a file named embedding-audio.jsonl.
To retrieve the created segmented audio embeddings, we need to parse this file and split it into its individual lines. For this, we use the regular expression \r?\n|\r to get the individual embeddings. For an audio duration of a bit more than 90 seconds and a segmentation of 15 seconds, there will be 7 generated embeddings. Then, for each line, we convert its JSON representation into an AsyncEmbeddingResponse object. This object contains the embedding array, the status, and the segment metadata: the index number and the segment's start and end seconds (for example, from the 16th to the 30th second). Then we retrieve the embedding of the current AsyncEmbeddingResponse object and store it in Amazon S3 Vectors using the putVectors method described in part 2, under a key composed of the file name and the segment number (the value of the variable i). Below is the complete source code of the asyncInvokeBedrockModelAndPutVectorsToS3 method:
private static void asyncInvokeBedrockModelAndPutVectorsToS3(Document document, String fileName,
        String embeddingsResultFileName) throws Exception {
    var invocationARN = startAsyncInvokeBedrockModel(document);
    while (true) {
        var gaiRequest = GetAsyncInvokeRequest.builder()
                .invocationArn(invocationARN).build();
        var gaiResponse = BEDROCK_RUNTIME_CLIENT.getAsyncInvoke(gaiRequest);
        var status = gaiResponse.status();
        if (AsyncInvokeStatus.IN_PROGRESS.equals(status)) {
            // poll every 20 seconds until the invocation finishes
            Thread.sleep(20000);
        }
        if (AsyncInvokeStatus.FAILED.equals(status)) {
            // guard against an endless loop if the invocation fails
            throw new IllegalStateException("Async invocation failed: " + gaiResponse.failureMessage());
        }
        if (AsyncInvokeStatus.COMPLETED.equals(status)) {
            var s3Uri = gaiResponse.outputDataConfig()
                    .s3OutputDataConfig().s3Uri();
            int i = 1;
            // each line of the .jsonl result file holds one segment embedding
            for (String line : new String(getS3ObjectWithEmbeddings(s3Uri, embeddingsResultFileName))
                    .split("\\r?\\n|\\r")) {
                AsyncEmbeddingResponse asyncEmbeddingResponse = MAPPER.readValue(line, AsyncEmbeddingResponse.class);
                putVectors(asyncEmbeddingResponse.embedding(), fileName + "_" + i);
                i++;
            }
            return;
        }
    }
}
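The AsyncEmbeddingResponse class and the getS3ObjectWithEmbeddings helper are referenced above but not shown. Here is a minimal sketch of both; the field names of the .jsonl output and the assumption that the result file lives directly under the returned s3Uri are mine, not confirmed by the sample application:
// A sketch of AsyncEmbeddingResponse; the exact field names of the .jsonl
// output are assumptions based on the segment metadata described above.
// @JsonIgnoreProperties: com.fasterxml.jackson.annotation.JsonIgnoreProperties
@JsonIgnoreProperties(ignoreUnknown = true)
record AsyncEmbeddingResponse(float[] embedding, String status, SegmentMetadata segmentMetadata) {
    @JsonIgnoreProperties(ignoreUnknown = true)
    record SegmentMetadata(int index, double startSeconds, double endSeconds) {
    }
}

// A sketch of the S3 download helper; it assumes the result file is stored
// directly under the output s3Uri returned by GetAsyncInvoke.
// S3Client: software.amazon.awssdk.services.s3.S3Client
private final static S3Client S3_CLIENT = S3Client.create(); // assumption

private static byte[] getS3ObjectWithEmbeddings(String s3Uri, String fileName) {
    // s3Uri looks like "s3://bucket/prefix"; split it into bucket and key prefix
    var withoutScheme = s3Uri.substring("s3://".length());
    var bucket = withoutScheme.substring(0, withoutScheme.indexOf('/'));
    var prefix = withoutScheme.substring(withoutScheme.indexOf('/') + 1);
    var key = prefix.endsWith("/") ? prefix + fileName : prefix + "/" + fileName;
    var request = GetObjectRequest.builder().bucket(bucket).key(key).build();
    return S3_CLIENT.getObjectAsBytes(request).asByteArray();
}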
To test creating and storing the audio embeddings, we can uncomment this invocation in the main method.
public static void main(String[] args) throws Exception {
    createAndStoreAudioEmbeddings();
}
Create and store video embeddings
Creating and storing the video embeddings is very similar to the procedure described for the audio embeddings, so we'll describe only the differences.
For demo purposes, I converted the official AWS video about the AWS Lambda function into an .mp4 file with the name defined in private static final String[] VIDEO_NAMES = { "AWS-Lambda-explained-in-90-seconds-video" } and uploaded it into the S3 bucket defined as private final static String S3_BUCKET = "s3://vk-amazon-nova-2-mme/" (please use your own unique bucket name).
As this video file is longer than 15 seconds (it also runs a bit over 90 seconds), we need to use the asynchronous Bedrock API. It splits the video into 15-second segments and creates an embedding for each of them. The relevant part is the method createAndStoreVideoEmbeddings:
private static void createAndStoreVideoEmbeddings() throws Exception {
    for (String videoName : VIDEO_NAMES) {
        asyncInvokeBedrockModelAndPutVectorsToS3(prepareVideoDocument(S3_BUCKET +
                videoName + VIDEO_EXTENSION), videoName, "embedding-audio-video.jsonl");
    }
}
Let's look at what's happening here. First, in the prepareVideoDocument method, we use the software.amazon.awssdk.core.document.Document API to create the JSON request for the video embedding. We set taskType to SEGMENTED_EMBEDDING (because we split the video file) and durationSeconds of the segmentationConfig to 15 seconds. Then we set embeddingPurpose to GENERIC_INDEX and embeddingDimension to 384 (again, you can use 4 dimension sizes to trade off embedding accuracy against vector storage cost: 3072, 1024, 384, and 256). Then we define the video file format as mp4 and the embeddingMode as AUDIO_VIDEO_COMBINED. At the end, we define the source with the s3Location of our video file. For the complete embeddings request and response schema, I refer to the following article. Below is the complete source code of this method:
private static Document prepareVideoDocument(String s3_video_uri) {
    var s3locationConfig = Document.mapBuilder()
            .putString("uri", s3_video_uri).build();
    var sourceConfig = Document.mapBuilder()
            .putDocument("s3Location", s3locationConfig).build();
    var durationConfig = Document.mapBuilder()
            .putNumber("durationSeconds", 15).build();
    var videoConfig = Document.mapBuilder().putString("format", "mp4")
            .putString("embeddingMode", "AUDIO_VIDEO_COMBINED")
            .putDocument("source", sourceConfig)
            .putDocument("segmentationConfig", durationConfig).build();
    var singleEmbeddingParams = Document.mapBuilder()
            .putString("embeddingPurpose", "GENERIC_INDEX")
            .putNumber("embeddingDimension", EMBEDDING_DIMENSION)
            .putDocument("video", videoConfig).build();
    var request = Document.mapBuilder()
            .putString("taskType", "SEGMENTED_EMBEDDING")
            .putDocument("segmentedEmbeddingParams", singleEmbeddingParams)
            .build();
    return request;
}
The generated Document looks like this:
{
    "taskType": "SEGMENTED_EMBEDDING",
    "segmentedEmbeddingParams": {
        "embeddingPurpose": "GENERIC_INDEX",
        "embeddingDimension": 384,
        "video": {
            "format": "mp4",
            "embeddingMode": "AUDIO_VIDEO_COMBINED",
            "source": {
                "s3Location": {
                    "uri": "s3://vk-amazon-nova-2-mme/AWS-Lambda-explained-in-90-seconds-video.mp4"
                }
            },
            "segmentationConfig": {
                "durationSeconds": 15
            }
        }
    }
}
Everything else works the same as in the audio embeddings case described above. The only additional difference for creating video embeddings is the output of the asynchronous Bedrock model invocation: please pay attention to the name of the result file, which is now embedding-audio-video.jsonl instead of embedding-audio.jsonl.
We retrieve all generated video embeddings from the file embedding-audio-video.jsonl, parse them individually, and store the embeddings in Amazon S3 Vectors. See the asyncInvokeBedrockModelAndPutVectorsToS3 method described above.
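For readers who skipped part 2, here is a minimal sketch of what the putVectors method could look like with the Amazon S3 Vectors client. The vector bucket and index names are placeholders, and the exact call shape is an assumption based on the S3 Vectors API in the AWS SDK for Java 2.x, not the code from part 2:
// A hedged sketch of putVectors (described in part 2). The bucket and index
// names are placeholders; the S3VectorsClient call shape is an assumption.
// S3VectorsClient: software.amazon.awssdk.services.s3vectors.S3VectorsClient
private final static S3VectorsClient S3_VECTORS_CLIENT = S3VectorsClient.create(); // assumption

private static void putVectors(float[] embedding, String key) {
    // VectorData expects a List<Float>, so box the raw embedding values
    var floats = new ArrayList<Float>(embedding.length);
    for (float f : embedding) {
        floats.add(f);
    }
    var vector = PutInputVector.builder()
            .key(key)
            .data(VectorData.fromFloat32(floats))
            .build();
    var request = PutVectorsRequest.builder()
            .vectorBucketName("vk-vector-bucket") // placeholder
            .indexName("embeddings-index")        // placeholder
            .vectors(vector)
            .build();
    S3_VECTORS_CLIENT.putVectors(request);
}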
To test creating and storing the video embeddings, we can uncomment this invocation in the main method.
public static void main(String[] args) throws Exception {
    createAndStoreVideoEmbeddings();
}
Conclusion
In this part, we covered creating audio and video embeddings with Amazon Nova 2 Multimodal Embeddings and storing them in Amazon S3 Vectors using the AWS Java SDK. In the next part of the series, we'll take a look at the similarity search across all of the created embeddings (text, image, audio, and video).
Please also check out my website for more technical content and upcoming public speaking activities.

