Intro
To build LLM applications with Azure Cognitive Search, we first need to set up its components, as explained in the first part: indexes, indexers, a knowledge store, data sources, and skillsets. In this part, we will create the backend functions that support all of this.
If you prefer, you can check the backend folder for the entire code, which is also well documented.
create_indexes
This function is a wrapper around everything we need to do; it will be called later from the backend, which is implemented as an Azure Function.
It instantiates the DocumentIndexManager and then creates the document index resources with create_document_index_resources:
- Index.
- Indexer.
- Datasource.
- Skillset with custom skill (OpenAI embedding generator).
Remember that these resources are tied to the source documents (PDF, Word, Excel, PowerPoint, Markdown, or any other supported format). Up to this point, we don't have any vector storage yet. However, when we create the skillset, we also define a knowledge store, which means the output of the custom skill is saved into that knowledge store. More on this later.
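The two-stage flow described above can be sketched as follows. The manager class and method names match the article, but the exact signatures are assumptions; here the managers are passed in as parameters to keep the sketch self-contained:

```python
def create_indexes(index_prefix, doc_manager, chunk_manager):
    """Orchestrate the two-stage resource creation (sketch).

    doc_manager is assumed to expose create_document_index_resources(),
    which builds the document index, indexer, data source, and skillset
    (including the knowledge store definition). chunk_manager is assumed
    to expose create_chunk_index_resources(), which builds the chunk-level
    index, indexer, and data source on top of the knowledge store output.
    """
    # Stage 1: document-level resources (index, indexer, datasource, skillset)
    doc_resources = doc_manager.create_document_index_resources(index_prefix)
    # Stage 2: chunk-level resources fed by the knowledge store projections
    chunk_resources = chunk_manager.create_chunk_index_resources(index_prefix)
    return {"document": doc_resources, "chunk": chunk_resources}
```

In the actual backend this wrapper is what the Azure Function calls, so all resource creation happens behind a single entry point.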
Then we instantiate the ChunkIndexManager, which creates the chunk index resources using create_chunk_index_resources:
- Index.
- Indexer.
- Datasource.
In this second set of resources, the indexer uses a data source that points to our knowledge store. Remember that a knowledge store is just a storage account, and it contains the projections generated by the previous step. These projections are simply a large number of JSON files holding the embeddings produced earlier, as you will see later in the code.
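To make the knowledge-store idea concrete, here is a minimal sketch of what one projection file might contain and how it can be parsed. The field names are assumptions for illustration, not the project's actual schema:

```python
import json

# Hypothetical content of a single projection blob written by the
# embedding skill into the knowledge store (field names are assumed).
raw = json.dumps({
    "source_document_id": "doc-001",
    "title": "Quarterly report",
    "text": "First chunk of the document...",
    "embedding": [0.012, -0.034, 0.056],  # truncated for illustration
})

# The chunk indexer later ingests these JSON blobs with parsing_mode="json".
chunk = json.loads(raw)
print(chunk["source_document_id"], len(chunk["embedding"]))
```

A real embedding vector would of course be much longer (e.g. 1536 floats for OpenAI's text-embedding-ada-002), one file per chunk.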
Document Indexing Manager
The 'Document Indexing Manager' code on GitHub.
The given code defines a Python class called *DocumentIndexManager* that facilitates the creation, management, and deletion of resources for document indexing using Azure Cognitive Search. This class encapsulates functions to set up a document index, create data sources, skillsets, and indexers, as well as to manage these resources. Let’s break down the main components of the code:
- _create_document_index: This function creates a document index within Azure Cognitive Search. It defines the schema of the index, specifying various fields such as document ID, content, filesize, filepath, and more. It also includes searchable and retrievable attributes to enhance search and retrieval efficiency.
- _create_document_datasource: This function establishes a blob datasource within Azure Search, allowing documents to be ingested from a specified storage container. The function takes inputs like the index prefix, storage connection string, container name, and Azure Search configuration to create the data source.
- _create_document_skillset: This function defines a skillset, a set of skills applied to the indexed content to extract meaningful information, such as OpenAI embedding generation, OCR, text merging, and image analysis. These skills improve search accuracy by extracting relevant data from the documents. For our project we used only the OpenAI embedding skill, but the code is there for you to try the OCR, merge, and image analysis skills. When creating a skillset, a knowledge store must also be defined. Why? Because the output of a custom skill needs a place to be saved.
- _create_document_indexer: This function creates an indexer that connects the data source to the document index. The indexer specifies how data should be processed and ingested into the index, including field mappings and indexing parameters. It utilizes the previously defined skillset to enhance the indexed content.
- create_document_index_resources: This function orchestrates the creation of all necessary resources for document indexing. It invokes the previously defined functions to create the index, data source, skillset, and indexer. After setting up these resources, it waits for the indexer to complete its processing.
- delete_document_index_resources: This function cleans up the resources associated with a document index. It deletes the index, indexer, data source, skillset, and related components. Additionally, it deletes any knowledge store tables and blobs associated with the index.
The DocumentIndexManager class aims to provide a comprehensive solution for setting up and managing document indexing in Azure Cognitive Search. It encapsulates the various steps involved in creating an effective search solution for documents. By leveraging this class, developers can streamline the process of creating and managing the resources required for efficient document indexing and retrieval using Azure Cognitive Search.
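To illustrate what _create_document_index sets up, here is a hedged sketch of an index definition in the REST payload style. The field list follows the fields named in the article (document ID, content, filesize, filepath); the real project schema includes more fields and attributes:

```python
# Sketch of a document index definition (REST payload style).
# The actual project defines additional fields and attributes.
document_index = {
    "name": "demo-index",
    "fields": [
        {"name": "document_id", "type": "Edm.String", "key": True,
         "retrievable": True, "searchable": False},
        {"name": "content", "type": "Edm.String",
         "retrievable": True, "searchable": True},
        {"name": "filesize", "type": "Edm.Int64",
         "retrievable": True, "filterable": True},
        {"name": "filepath", "type": "Edm.String",
         "retrievable": True, "filterable": True},
    ],
}

# Every Azure Cognitive Search index needs exactly one key field.
key_fields = [f["name"] for f in document_index["fields"] if f.get("key")]
```

Marking fields searchable or retrievable up front matters, because those attributes cannot be changed without rebuilding the index.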
ChunkIndexManager
The 'ChunkIndexManager' code on GitHub.
This code defines a Python class called ChunkIndexManager, which facilitates the creation, management, and deletion of resources for chunk indexing using Azure Cognitive Search. This class encapsulates functions for setting up a chunk index, creating data sources, and creating indexers for the chunks of data within documents. Let’s break down the main components of the code:
- _create_chunk_index: This function creates a chunk index within Azure Cognitive Search. Similar to the previous example, it defines the schema of the index with various fields, including id, source_document_id, title, text, embedding, and more. Additionally, it configures a vector search using the HNSW algorithm for the embedding field, which is used to perform similarity searches based on document embeddings.
- _create_chunk_datasource: This function establishes a blob datasource for the chunk index. It takes inputs such as the index prefix, storage connection string, container name, and Azure Search configuration to create the data source. This data source allows the chunks of data (e.g., paragraphs, sections) from documents to be ingested.
- _create_chunk_indexer: This function creates an indexer for the chunk index. It connects the data source to the index and specifies indexing parameters, including parsing_mode set to “json”. The indexer processes the chunks of data from the data source and indexes them in the chunk index.
- create_chunk_index_resources: This function orchestrates the creation of resources for chunk indexing. It invokes the previously defined functions to create the chunk index, data source, and indexer. After setting up these resources, it waits for the indexer to complete its processing.
- delete_chunk_index_resources: This function cleans up the resources associated with chunk indexing. It deletes the chunk index, indexer, and data source, as well as their related components.
The ChunkIndexManager class aims to provide a streamlined solution for setting up and managing chunk-based indexing in Azure Cognitive Search. It encapsulates the steps involved in creating an effective search solution for chunks of data within documents. Developers can use this class to simplify the process of creating and managing resources required for efficient chunk-based indexing and retrieval using Azure Cognitive Search.
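To illustrate the vector search part of _create_chunk_index, here is a sketch of a chunk index definition with an HNSW configuration, in the payload style of the preview API versions that azure-search-documents 11.4.0b6 targets. The names, HNSW parameter values, and the 1536-dimension value (typical for OpenAI's text-embedding-ada-002) are assumptions:

```python
# Sketch of a chunk index with an HNSW vector configuration
# (REST payload style of the 2023 preview API versions).
chunk_index = {
    "name": "demo-chunk-index",
    "fields": [
        {"name": "id", "type": "Edm.String", "key": True},
        {"name": "source_document_id", "type": "Edm.String", "filterable": True},
        {"name": "title", "type": "Edm.String", "searchable": True},
        {"name": "text", "type": "Edm.String", "searchable": True},
        {"name": "embedding", "type": "Collection(Edm.Single)",
         "searchable": True, "dimensions": 1536,
         "vectorSearchConfiguration": "hnsw-config"},
    ],
    "vectorSearch": {
        "algorithmConfigurations": [
            {"name": "hnsw-config", "kind": "hnsw",
             # Typical HNSW parameters; tune for your own data.
             "hnswParameters": {"m": 4, "efConstruction": 400,
                                "efSearch": 500, "metric": "cosine"}},
        ]
    },
}

embedding_field = next(f for f in chunk_index["fields"]
                       if f["name"] == "embedding")
```

The embedding field's vectorSearchConfiguration ties it to the named HNSW algorithm configuration, which is what enables cosine-similarity search over the chunk embeddings.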
Utilities
The 'Utilities' code on GitHub.
This code provides utility functions and methods for interacting with Azure Cognitive Search services and Azure Blob Storage, particularly focused on managing index, data source, and indexer resources. Let’s break down the key components and functionalities:
- Environment Variable Configuration:
The code starts by retrieving essential configuration values from environment variables: AZURE_SEARCH_SERVICE_ENDPOINT, AZURE_SEARCH_API_KEY (the admin key for Azure Search), and AZURE_KNOWLEDGE_STORE_STORAGE_CONNECTION_STRING (the connection string for the Azure Knowledge Store, which can be a blob storage account).
- Client Functions:
The functions get_index_client and get_indexer_client return instances of SearchIndexClient and SearchIndexerClient respectively. These clients are used to interact with Azure Cognitive Search indexes and indexers.
- Utility Functions:
Several utility functions generate resource names and perform other useful operations:
- get_index_name, get_datasource_name, get_skillset_name, get_indexer_name, get_chunk_index_blob_container_name: These functions generate the names of the Azure Search index, data source, skillset, indexer, and the blob container for chunk indexing, based on an index prefix.
- get_knowledge_store_connection_string: This function retrieves the connection string for an Azure Knowledge Store (such as blob storage) from the configuration.
- create_index: This function creates an Azure Search index with specified fields, vector search settings, and semantic configurations. It utilizes SearchIndexClient to create the index.
- create_blob_datasource: This function creates an Azure Search datasource for Azure Blob Storage using a REST request. It sets up a connection to a specified blob container and includes a soft delete policy. The SearchIndexerClient is used to manage data sources.
- wait_for_indexer_completion: This function waits for an Azure Search indexer to complete its indexing process. It polls the indexer status and waits until the indexer completes or encounters a transient failure.
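A polling loop in the spirit of wait_for_indexer_completion might look like this. The status-fetching callable is injected here so the sketch stays SDK-free; the real code queries the indexer status through SearchIndexerClient instead, and the status strings below are assumptions:

```python
import time

def wait_for_indexer_completion(get_status, timeout_s=300, poll_s=5):
    """Poll until the indexer's last run finishes (sketch).

    get_status is a zero-argument callable returning one of
    'inProgress', 'success', or 'transientFailure' (assumed values).
    Returns the terminal status, or raises TimeoutError.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = get_status()
        # Stop on completion or on a transient failure, as the article
        # describes; anything else means the indexer is still running.
        if status in ("success", "transientFailure"):
            return status
        time.sleep(poll_s)
    raise TimeoutError("indexer did not complete in time")
```

Injecting the status callable also makes the loop trivially testable without a live search service.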
The provided code functions as a set of tools and utilities to streamline the creation, management, and monitoring of Azure Cognitive Search resources, particularly focusing on chunk indexing using Azure Blob Storage. Developers can use these utilities to interact with Azure Search services effectively and manage various aspects of the search indexing process.
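The naming helpers mentioned above likely reduce to simple string formatting keyed off the index prefix. The exact suffixes below are assumptions, but the pattern is the point:

```python
# Hypothetical resource-name builders keyed off a single index prefix.
def get_index_name(prefix: str) -> str:
    return f"{prefix}-index"

def get_datasource_name(prefix: str) -> str:
    return f"{prefix}-datasource"

def get_skillset_name(prefix: str) -> str:
    return f"{prefix}-skillset"

def get_indexer_name(prefix: str) -> str:
    return f"{prefix}-indexer"

def get_chunk_index_blob_container_name(prefix: str) -> str:
    # Blob container names must be lowercase letters, digits, and hyphens.
    return f"{prefix}-chunks".lower()
```

Deriving every resource name from one prefix is what lets delete_document_index_resources and delete_chunk_index_resources find and remove everything that belongs to a given index.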
Requirements
The following pip packages are required to run the solution. Please note that azure-search-documents is pinned to a beta version:
# DO NOT include azure-functions-worker in this file
# The Python Worker is managed by Azure Functions platform
# Manually managing azure-functions-worker may cause unexpected issues
azure-functions
langchain
openai
openai[datalib]
azure-storage-blob
azure-identity
azure-core
unstructured
tiktoken
#pre release
azure-search-documents==11.4.0b6
Conclusion
Azure Cognitive Search offers powerful capabilities for search indexing, but navigating its complexities can be daunting. The utility functions and code snippets provided in this project part offer a practical solution to streamline the creation, configuration, and management of search indexes, data sources, and indexers. By abstracting away intricacies and automating common tasks, developers can focus on building effective search solutions that deliver actionable insights from their data.
In a world where data-driven decisions are the driving force behind success, simplifying search indexing processes with Azure Cognitive Search utilities becomes a strategic advantage for businesses aiming to unlock the full potential of their data. By incorporating these utilities into your development workflow, you can accelerate the deployment of search solutions and empower your organization to make informed decisions based on accurate and up-to-date information.
Want to know more
This insight is part of a series where we go through the necessary steps to create and optimize Chat & AI Applications.
Below, you can find the full overview and the links to the different parts of the series:
- Overview: Elevate Chat & AI Applications: Mastering Azure Cognitive Search with Vector Storage for LLM Applications with Langchain | element61
- Part 1 - Architecture: Building the Foundation for AI-Powered Conversations | element61
- Part 2 - Embedding Generator for Cognitive Search: Revolutionizing Conversational Context | element61
- (this article) Part 3 - Configuration Deep Dive: Empowering Conversations with Vector Storage | element61
- Part 4 - Backend Brilliance: Integrating Langchain and Cognitive Search for AI-Powered Chats | element61
- Part 5 - Frontend Flourish: Craft Immersive AI Experiences Using Streamlit | element61
If you want to get started with creating your AI-powered chatbot, contact us.