Part 3 - Configuration Deep Dive: Empowering Conversations with Vector Storage

Intro

To build LLM applications with Azure Cognitive Search, we first need to set up its core components, as explained in the first part of this series: indexes, indexers, a knowledge store, data sources, and skillsets. In this part we will create the backend functions that support all of this.

If you prefer, check the backend folder for the complete code, which is also thoroughly documented.

create_indexes

This function is a wrapper around everything we need to do; it is called later from the backend, which is implemented as an Azure Function.
It first instantiates the DocumentIndexManager and then creates the document index resources with create_document_index_resources:

  • Index.
  • Indexer.
  • Datasource.
  • Skillset with custom skill (OpenAI embedding generator).

Remember that these resources are tied to the source documents (PDF, Word, Excel, PowerPoint, Markdown, or any other supported format). At this point we don't have any vector storage yet. However, when we create the skillset we also define a knowledge store, which means the output of the custom skill will be saved into that knowledge store; more on this later.

Then we instantiate the ChunkIndexManager, which creates the chunk index resources using create_chunk_index_resources:

  • Index.
  • Indexer.
  • Datasource.

In this second set of resources, the indexer uses a datasource pointing to our knowledge store. Remember that a knowledge store is just a storage account, and it holds the projections generated by the previous step. These projections are simply a collection of JSON files containing the embeddings produced earlier, as you will see later in the code.
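The flow above can be sketched as follows. The manager classes here are simplified stand-ins for the real implementations in the backend folder (they only record which resources would be created); the method names follow the article, but everything else is illustrative.

```python
# Hedged sketch of the create_indexes wrapper. The real managers call
# Azure Cognitive Search; these stand-ins only model the orchestration order.

class DocumentIndexManager:
    def create_document_index_resources(self, index_prefix: str) -> dict:
        # Real implementation creates: index, indexer, datasource, and a
        # skillset (OpenAI embedding custom skill + knowledge store).
        return {"index_prefix": index_prefix,
                "resources": ["index", "indexer", "datasource", "skillset"]}

class ChunkIndexManager:
    def create_chunk_index_resources(self, index_prefix: str) -> dict:
        # Real implementation creates: index, indexer, and a datasource
        # pointing at the knowledge store projections (JSON files with embeddings).
        return {"index_prefix": f"{index_prefix}-chunk",
                "resources": ["index", "indexer", "datasource"]}

def create_indexes(index_prefix: str) -> dict:
    # Document resources first: their skillset writes projections to the
    # knowledge store, which the chunk indexer then reads.
    doc = DocumentIndexManager().create_document_index_resources(index_prefix)
    chunk = ChunkIndexManager().create_chunk_index_resources(index_prefix)
    return {"document": doc, "chunk": chunk}
```

The ordering matters: the chunk datasource only has something to ingest after the document skillset has projected its output into the knowledge store.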

Document Indexing Manager

The 'Document Indexing Manager' code on GitHub.

The given code defines a Python class called *DocumentIndexManager* that facilitates the creation, management, and deletion of resources for document indexing using Azure Cognitive Search. This class encapsulates functions to set up a document index, create datasources, skillsets, and indexers, as well as to manage these resources. Let’s break down the main components of the code:

  • _create_document_index: This function creates a document index within Azure Cognitive Search. It defines the schema of the index, specifying various fields such as document ID, content, filesize, filepath, and more. It also includes searchable and retrievable attributes to enhance search and retrieval efficiency.
  • _create_document_datasource: This function establishes a blob datasource within Azure Search, allowing documents to be ingested from a specified storage container. The function takes inputs like the index prefix, storage connection string, container name, and Azure Search configuration to create the datasource.
  • _create_document_skillset: This function defines a skillset, which is a set of skills applied to the indexed content to extract meaningful information. It can include skills such as OpenAI embedding generation, OCR, merging, and image analysis. These skills enhance search accuracy by extracting relevant data from the documents. For our project we used only the OpenAI embedding skill, but the code is there for you to try the OCR, merging, and image analysis skills. When creating a skillset, a knowledge store also has to be defined. Why? Because the output of a custom skill needs a place to be saved.
  • _create_document_indexer: This function creates an indexer that connects the datasource to the document index. The indexer specifies how data should be processed and ingested into the index, including field mappings and indexing parameters. It utilizes the previously defined skillset to enhance the indexed content.
  • _create_document_index_resources: This function orchestrates the creation of all necessary resources for document indexing. It invokes the previously defined functions to create the index, datasource, skillset, and indexer. After setting up these resources, it waits for the indexer to complete its processing.
  • delete_document_index_resources: This function cleans up the resources associated with a document index. It deletes the index, indexer, datasource, skillset, and related components. Additionally, it deletes any knowledge store tables and blobs associated with the index.
The DocumentIndexManager class aims to provide a comprehensive solution for setting up and managing document indexing in Azure Cognitive Search. It encapsulates the various steps involved in creating an effective search solution for documents. By leveraging this class, developers can streamline the process of creating and managing the resources required for efficient document indexing and retrieval using Azure Cognitive Search.
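To make the skillset/knowledge-store relationship concrete, the sketch below builds a REST-style skillset payload as a plain dict: a custom WebApiSkill that calls an embedding endpoint, plus a knowledge store projection that writes the skill's output to blob storage. All names here (the endpoint, container, and field paths) are illustrative assumptions, not the repository's exact values; check the GitHub code for the real definition.

```python
# Hypothetical sketch: rough REST shape of a skillset whose custom skill
# returns OpenAI embeddings, with a knowledge store projection persisting
# the output. Field paths and container names are assumptions.

def build_embedding_skillset(index_prefix: str, embedding_endpoint: str,
                             storage_connection_string: str) -> dict:
    return {
        "name": f"{index_prefix}-skillset",
        "skills": [
            {
                "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
                "uri": embedding_endpoint,  # custom skill generating embeddings
                "context": "/document",
                "inputs": [{"name": "text", "source": "/document/content"}],
                "outputs": [{"name": "chunks", "targetName": "chunks"}],
            }
        ],
        "knowledgeStore": {
            "storageConnectionString": storage_connection_string,
            "projections": [
                {
                    "objects": [
                        {
                            # each chunk (text + embedding) lands as a JSON blob
                            "storageContainer": f"{index_prefix}-chunks",
                            "source": "/document/chunks/*",
                        }
                    ]
                }
            ],
        },
    }
```

This is why the knowledge store is mandatory here: without the projection, the embeddings produced by the custom skill would have nowhere to land for the chunk indexer to pick up.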


ChunkIndexManager

The 'ChunkIndexManager' code on GitHub.

This code defines a Python class called ChunkIndexManager, which facilitates the creation, management, and deletion of resources for chunk indexing using Azure Cognitive Search. This class encapsulates functions for setting up a chunk index, creating datasources, and creating indexers for the chunks of data within documents. Let’s break down the main components of the code:

  • _create_chunk_index: This function creates a chunk index within Azure Cognitive Search. Similar to the previous example, it defines the schema of the index with various fields, including id, source_document_id, title, text, embedding, and more. Additionally, it configures a vector search using the HNSW algorithm for the embedding field, which is used to perform similarity searches based on document embeddings.
  • _create_chunk_datasource: This function establishes a blob datasource for the chunk index. It takes inputs such as the index prefix, storage connection string, container name, and Azure Search configuration to create the datasource. This datasource allows the chunks of data (e.g., paragraphs, sections) from documents to be ingested.
  • _create_chunk_indexer: This function creates an indexer for the chunk index. It connects the datasource to the index and specifies indexing parameters, including parsing_mode set to “json”. The indexer processes the chunks of data from the datasource and indexes them in the chunk index.
  • create_chunk_index_resources: This function orchestrates the creation of resources for chunk indexing. It invokes the previously defined functions to create the chunk index, datasource, and indexer. After setting up these resources, it waits for the indexer to complete its processing.
  • delete_chunk_index_resources: This function cleans up the resources associated with chunk indexing. It deletes the chunk index, indexer, and datasource, as well as their related components.

The ChunkIndexManager class aims to provide a streamlined solution for setting up and managing chunk-based indexing in Azure Cognitive Search. It encapsulates the steps involved in creating an effective search solution for chunks of data within documents. Developers can use this class to simplify the process of creating and managing resources required for efficient chunk-based indexing and retrieval using Azure Cognitive Search.
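To make the vector part concrete, here is a hedged, stdlib-only sketch of the chunk index schema as a plain dict: an embedding field tied to an HNSW vector-search configuration. The field names follow the article, but the exact attribute names differ between preview API versions, so treat this as a shape sketch rather than the repository's literal definition. EMBEDDING_DIMENSIONS assumes OpenAI's text-embedding-ada-002 model (1536 dimensions).

```python
# Illustrative shape of the chunk index described above. Attribute names
# (e.g. "vectorSearchConfiguration", "algorithmConfigurations") follow one
# of the preview REST API shapes and are assumptions here.

EMBEDDING_DIMENSIONS = 1536  # assumes text-embedding-ada-002 output size

def build_chunk_index(index_prefix: str) -> dict:
    return {
        "name": f"{index_prefix}-chunk-index",
        "fields": [
            {"name": "id", "type": "Edm.String", "key": True},
            {"name": "source_document_id", "type": "Edm.String", "filterable": True},
            {"name": "title", "type": "Edm.String", "searchable": True},
            {"name": "text", "type": "Edm.String", "searchable": True},
            {
                # the vector field: a float collection with a fixed dimension,
                # wired to a named vector-search configuration
                "name": "embedding",
                "type": "Collection(Edm.Single)",
                "dimensions": EMBEDDING_DIMENSIONS,
                "vectorSearchConfiguration": "hnsw-config",
            },
        ],
        "vectorSearch": {
            "algorithmConfigurations": [
                # HNSW = approximate nearest-neighbour graph search
                {"name": "hnsw-config", "kind": "hnsw"}
            ]
        },
    }
```

At query time, a question is embedded with the same model and the HNSW index returns the chunks whose embedding vectors are closest to it.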


Utilities

The 'Utilities' code on GitHub.

This code provides utility functions and methods for interacting with Azure Cognitive Search services and Azure Blob Storage, particularly focused on managing index, datasource, and indexer resources. Let’s break down the key components and functionalities:

  • Environment Variable Configuration:
    The code starts by retrieving essential configuration values from environment variables. These values include AZURE_SEARCH_SERVICE_ENDPOINT, AZURE_SEARCH_API_KEY (admin key for Azure Search), and AZURE_KNOWLEDGE_STORE_STORAGE_CONNECTION_STRING (connection string for the Azure Knowledge Store, which can be a blob storage account).
  • Client Functions:
    The code defines functions get_index_client and get_indexer_client that return instances of SearchIndexClient and SearchIndexerClient respectively. These clients are used to interact with Azure Cognitive Search indexes and indexers.
  • Utility Functions:
    Several utility functions are provided to generate resource names and perform other useful operations:
    • get_index_name, get_datasource_name, get_skillset_name, get_indexer_name, get_chunk_index_blob_container_name: These functions generate the names for the Azure Search index, datasource, skillset, indexer, and the blob container for chunk indexing based on an index prefix.
    • get_knowledge_store_connection_string: This function retrieves the connection string for the Azure Knowledge Store (such as a blob storage account) from the configuration.
    • create_index: This function creates an Azure Search index with specified fields, vector search settings, and semantic configurations. It utilizes SearchIndexClient to create the index.
    • create_blob_datasource: This function creates an Azure Search datasource for Azure Blob Storage using a REST request. It sets up a connection to a specified blob container and includes a soft delete policy. The SearchIndexerClient is used to manage datasources.
    • wait_for_indexer_completion: This function waits for an Azure Search indexer to complete its indexing process. It polls the indexer status and waits until the indexer completes or encounters a transient failure.

Taken together, the code functions as a set of tools and utilities to streamline the creation, management, and monitoring of Azure Cognitive Search resources, particularly focusing on chunk indexing using Azure Blob Storage. Developers can use these utilities to interact with Azure Search services effectively and manage various aspects of the search indexing process.
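The polling logic behind wait_for_indexer_completion can be sketched with the standard library alone. Here get_status is an injected callable so the loop can be exercised without a live service; in the real utility it would wrap SearchIndexerClient.get_indexer_status. The status strings and timeout policy below are assumptions, not the repository's exact values.

```python
import time

# Stdlib-only sketch of an indexer-completion polling loop.
# get_status: callable returning the indexer's last run status, e.g.
# "inProgress", "success", or "transientFailure" (assumed values).

def wait_for_indexer_completion(get_status, poll_seconds: float = 5.0,
                                max_polls: int = 60) -> str:
    for _ in range(max_polls):
        status = get_status()
        if status in ("success", "transientFailure"):
            # terminal states per the article: completed, or gave up transiently
            return status
        time.sleep(poll_seconds)  # still running; back off before re-polling
    raise TimeoutError("indexer did not finish within the polling window")
```

Injecting the status function keeps the loop testable; swapping in the real client call is a one-line change at the call site.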


Requirements

The following pip packages are required to run the solution. Please note that azure-search-documents is pinned to a beta (pre-release) version.

# DO NOT include azure-functions-worker in this file
# The Python Worker is managed by the Azure Functions platform
# Manually managing azure-functions-worker may cause unexpected issues
azure-functions
langchain
openai
openai[datalib]
azure-storage-blob
azure-identity
azure-core
unstructured
tiktoken
# pre-release
azure-search-documents==11.4.0b6


Conclusion

Azure Cognitive Search offers powerful capabilities for search indexing, but navigating its complexities can be daunting. The utility functions and code snippets provided in this part of the series offer a practical solution to streamline the creation, configuration, and management of search indexes, datasources, and indexers. By abstracting away intricacies and automating common tasks, developers can focus on building effective search solutions that deliver actionable insights from their data.

In a world where data-driven decisions are the driving force behind success, simplifying search indexing processes with Azure Cognitive Search utilities becomes a strategic advantage for businesses aiming to unlock the full potential of their data. By incorporating these utilities into your development workflow, you can accelerate the deployment of search solutions and empower your organization to make informed decisions based on accurate and up-to-date information.

Want to know more?

This insight is part of a series where we go through the necessary steps to create and optimize Chat & AI Applications.

Below, you can find the full overview and the links to the different parts of the series:

If you want to get started with creating your own AI-powered chatbot, contact us