Building solutions using closed-source large language models (LLMs), such as GPT-4 from OpenAI or PaLM 2 from Google, is a markedly different process from creating private machine learning (ML) models, so traditional MLOps playbooks and best practices might appear irrelevant when applied to LLM-centric projects. Indeed, many companies currently approach LLM projects as greenfield developments that cannot fully reuse the existing MLOps infrastructure and that lack a clear strategy for creating new reusable infrastructure. This approach invariably results in significant technical debt for new LLM-based solutions, elevated infrastructure costs, operational inefficiencies, and security and safety breaches.
In this article, we describe a reusable infrastructure (platform) for solutions that are powered by closed-source LLMs. This platform aims to streamline the development and operation of LLM-based solutions by standardizing services, tools, and other generic components across applications. We focus specifically on closed-source LLMs accessible via APIs (such as OpenAI GPT-4, Google PaLM 2, or Anthropic Claude). A platform for training and hosting open-source LLMs is a different topic that requires a separate discussion.
LLMOps capabilities and features
For the sake of specificity, let us assume a retrieval-augmented generation (RAG) application that enables internal or external users to query a collection of documents (knowledge base) via a conversational interface. Let us also assume that this application has access to tools, including systems such as relational databases or search engines that can be queried or updated using API calls or SQL queries. A general layout of such a RAG system with tools is shown in the figure below:
This architecture can support a wide range of business scenarios. Consider the following examples:
A vendor of a software platform for financial advisors needs to build a knowledge management system that allows the advisors to ask questions about the platform, financial products, and tax policies. Under the hood, the system searches through an extensive collection of documents and other unstructured data.
A wholesaler of industrial goods needs to build a knowledge management system that allows customers to ask questions about products and browse product details. Internally, the system queries product specifications submitted by the supplier in PDF format.
A banking company needs to provide its associates with a user-friendly business intelligence (BI) solution for querying stock data using natural language questions like “What were the daily trading volumes for Alphabet stock over the last two weeks?”. Internally, the application uses an LLM to generate SQL and query a relational database (which we view as a tool).
Although the architecture outlined above is relatively straightforward, it necessitates the resolution of recurrent problems that manifest in virtually all RAG applications, as well as in LLM applications in general:
Document storage and search: The RAG architecture requires a vector database that indexes knowledge base documents and their components and enables nearest neighbor search. This capability is commonly implemented using embedding-based search, which in turn requires a model or service for computing text embeddings.
Document preprocessing: The documents typically need to be preprocessed to extract text from structured formats like PDF and to mask sensitive information such as PII data.
Document indexing: The preprocessed documents are indexed in the vector database, which can be a relatively complex process that requires tracking added, updated, and deleted documents in the knowledge base, enriching documents with metadata such as document type and access groups, splitting documents into chunks, and computing embeddings.
LLM gateway: The platform should support integrations with various LLM providers and models, including text generation and embedding models. The gateway can also provide features like rate limiting and automatic retries.
Model management: Providers of closed-source LLMs commonly provide the ability to fine-tune models on proprietary data, necessitating proper management of the corresponding datasets and collections of fine-tuned models.
Prompt management: LLM-based applications tend to use a large number of prompts that directly or indirectly control system behavior and user experience. The applications should use a unified framework for managing and versioning prompts.
Observability: The production instance of the application needs to be continuously monitored to detect technical and functional issues at an early stage. In particular, text generation quality should be continuously evaluated.
Cost efficiency: The platform should provide caching, cost tracking, and other services that optimize the cost of LLM usage.
Performance and latency: Text generation using LLMs is not a particularly fast operation, so the platform needs to provide means for improving latency and other performance properties of the solution.
Safety and compliance: The platform should provide a mechanism for detecting and preventing LLM hallucinations, toxic responses, and other threats to a safe and consistent user experience. The platform should also provide protection for tools/systems that are controlled by LLMs.
Protection against bad actors: Security and data privacy are very important aspects of LLM-based solutions, and the platform should provide protection on both the LLM and user sides.
In the next sections, we will demonstrate how to develop an LLMOps platform that addresses these concerns. We start with a minimal solution that is required to support basic RAG applications and add more advanced capabilities step by step.
Basic RAG platform

A basic RAG application usually includes two flows: indexing and querying. The indexing flow can be viewed as a data (document) processing pipeline that performs the following steps:
Document preprocessing such as PII data masking.
Splitting the documents into manageable chunks that fit the LLM context, and attributing metadata to each chunk.
Computing embeddings for the chunks.
Saving the chunks indexed by embeddings to the vector database.
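The four indexing steps above can be sketched end to end. The snippet below is a minimal illustration, not a production pipeline: it uses a toy hash-based bag-of-words embedding and an in-memory store in place of a real embedding model and vector database, and it masks only email addresses as a stand-in for full PII masking.

```python
import math
import re

def mask_pii(text):
    # Step 1 (toy): mask email addresses; real pipelines detect many PII types
    return re.sub(r"\S+@\S+", "[EMAIL]", text)

def split_into_chunks(text, max_words=50):
    # Step 2: split a document into word-bounded chunks that fit the LLM context
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def embed(text, dim=64):
    # Step 3 (toy): hashed bag-of-words vector, L2-normalized;
    # a real system would call an embedding model API here
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[hash(word) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

class InMemoryVectorStore:
    # Step 4 (toy): stores chunks by embedding and supports nearest-neighbor search
    def __init__(self):
        self.entries = []  # list of (embedding, chunk, metadata)

    def add(self, chunk, metadata=None):
        self.entries.append((embed(chunk), chunk, metadata or {}))

    def search(self, query, k=1):
        q = embed(query)
        scored = [(sum(a * b for a, b in zip(q, e)), c, m) for e, c, m in self.entries]
        scored.sort(key=lambda item: item[0], reverse=True)
        return scored[:k]

def index_document(doc, store, metadata=None):
    # Full indexing flow: mask -> chunk -> embed -> save
    for chunk in split_into_chunks(mask_pii(doc)):
        store.add(chunk, metadata)
```

In a real deployment, `embed` would call the provider's embedding API and `InMemoryVectorStore` would be replaced by a vector database integration.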
The querying flow is executed at runtime for each request (e.g. each turn in a chatbot dialog) and includes the following steps:
The application creates a query context (e.g. by concatenating all messages in the dialog history including the latest input from the user) and converts it into a standalone question using the text-generating LLM.
The embedding for this standalone question is computed using an embedding model.
Relevant document chunks are searched in the vector database based on the question embedding (nearest neighbor search).
The relevant chunks can be additionally filtered based on the metadata.
The chunks are retrieved, combined, and injected as the context, enriching the LLM’s ability to produce a relevant and well-grounded response.
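Put together, the querying flow looks roughly like the following sketch. The `llm` and `retriever` arguments are stand-ins for a text-generation model call and a vector-database search (steps 2-4 happen inside the retriever), and the prompt wording is purely illustrative.

```python
def condense_question(history, question, llm):
    # Step 1: rewrite the latest message as a standalone question using the LLM
    prompt = (
        "Given the conversation below, rewrite the last user message "
        "as a standalone question.\n\n"
        f"Conversation:\n{history}\n\n"
        f"Last message: {question}\n"
        "Standalone question:"
    )
    return llm(prompt)

def answer(question, history, retriever, llm, k=3):
    standalone = condense_question(history, question, llm)
    # Steps 2-4: embed the standalone question, search the vector DB, filter by metadata
    chunks = retriever(standalone, k)
    # Step 5: inject the retrieved chunks as grounding context for the final answer
    context = "\n\n".join(chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {standalone}\nAnswer:"
    )
    return llm(prompt)
```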
The minimal set of components required for these two flows includes an LLM API for text generation, an LLM API for embedding computing, a data (document) indexing pipeline, and a vector database. These essential components are shown in green in the platform architecture diagram presented below.
In this architecture, we assume that interoperability with multiple LLM providers is ensured by the LLM framework used by the application (e.g. LangChain). The LLM framework can also provide built-in document preprocessing components, integrations with vector databases, and large building blocks that implement complete RAG and Agent flows.
However, the green components alone do not address all the challenges and concerns that we outlined in the previous section. For this reason, LLMOps platforms usually include additional services that provide more advanced functionality. These services are depicted in yellow and blue in the above diagram, and we discuss them one by one in the next sections. We also discuss how these components can be integrated with the basic flow (e.g. how document preprocessing can be coordinated with query-time post-processing).
Example technology stack
- Data indexing: Spark, LangChain, Airflow
- Vector database: ChromaDB, Qdrant, Milvus, Pinecone, or FAISS
- Application stack: Python, LangChain, LlamaIndex, FastAPI
Caching and streaming
Text generation using LLMs is a relatively slow operation which often becomes a major issue from the user experience standpoint. This problem can be addressed using several techniques:
Caching: The LLM responses can be cached to handle frequent queries/contexts more efficiently. The most basic solution is to use standard caching components like Redis that perform an exact match between a new query and a cached query. However, this approach might be inefficient in LLM applications because of the high variability of queries. A more sophisticated solution lies in semantic caching, which employs embedding-based search to find cached queries that are similar to the new one. Internally, a semantic cache can rely on a vector database.
Streaming: User experience can be improved by printing out the generated text token by token (similar to the approach employed in the ChatGPT web interface) instead of waiting for a complete response in a blocking mode. This technique is commonly referred to as streaming. Streaming APIs are offered by most LLM providers, but creating a streaming application requires propagating near real-time token streams through all layers including backend services and frontend UI. Consequently, streaming needs to be incorporated in the solution design and appropriate orchestration and UI frameworks should be selected.
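The propagation aspect of streaming can be illustrated with a stub. Here, `fake_llm_stream` stands in for a provider's streaming API, and `on_token` represents the hook that pushes each token toward the UI (e.g. over server-sent events or a websocket); both names are illustrative.

```python
def fake_llm_stream(prompt):
    # stand-in for a provider streaming API: yields tokens as they are generated
    for token in ["The", " trading", " volume", " was", " high", "."]:
        yield token

def stream_response(prompt, on_token):
    # forward each token downstream as soon as it arrives,
    # instead of blocking until the full response is ready
    parts = []
    for token in fake_llm_stream(prompt):
        on_token(token)  # e.g. push to an SSE channel or websocket
        parts.append(token)
    return "".join(parts)
```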
The caching layer can also help to reduce the cost of the LLM services, although the efficiency of this solution can greatly vary depending on the application.
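The semantic caching approach described above can be sketched as follows. The embedding function is a toy hashed bag-of-words stand-in for a real embedding model, and the 0.9 similarity threshold is an arbitrary illustrative choice.

```python
import math

def embed(text, dim=64):
    # toy hashed bag-of-words embedding, L2-normalized (stand-in for a real model)
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[hash(word) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

class SemanticCache:
    # return a cached response when a new query is close enough to a cached one
    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response)

    def get(self, query):
        q = embed(query)
        best_score, best_response = 0.0, None
        for e, response in self.entries:
            score = sum(a * b for a, b in zip(q, e))
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def put(self, query, response):
        self.entries.append((embed(query), response))
```

A production semantic cache would store the embeddings in a vector database rather than scanning a list, and would tune the threshold to balance hit rate against the risk of serving a mismatched response.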
Feature store

LLM-powered applications often need to look up customer data and other contextual information. For example, a customer assistant app might fetch information about the loyalty tier and engagement level from a customer profile to personalize generated messages.
In general, LLM-powered applications can be integrated with any source of contextual data such as customer data platforms (CDPs). However, it is often beneficial to use a feature store, a concept that is well-known in traditional MLOps, for managing and sourcing such data. The rationale behind this choice is manifold:
Tailored for low-latency transactions: Feature stores are purpose-built to provide low-latency transactional access to large collections of features including basic attributes and derived values such as propensity scores. This design effectively mitigates scalability bottlenecks, ensuring swift and efficient data retrieval.
Interoperability with ML models: Feature stores provide interoperability between in-house ML models (e.g. personalization or recommendation models) and LLM applications. If the company already uses a feature store, it is easy for LLM applications to tap into this infrastructure.
Continuous feature updates: Feature stores are typically connected with supplementary infrastructure elements like data collection pipelines and scoring models, ensuring that the features remain consistently updated.
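As a toy illustration of the pattern (the class below is a stub, not the API of any particular feature store product), an online lookup can enrich a prompt with customer features such as the loyalty tier mentioned above:

```python
class OnlineFeatureStoreStub:
    # stand-in for the online-serving side of a feature store such as Feast
    def __init__(self, features):
        self.features = features  # customer_id -> feature dict

    def get_online_features(self, customer_id):
        return self.features.get(customer_id, {})

def personalize_prompt(customer_id, question, store):
    # inject customer features into the prompt to personalize the generated answer
    feats = store.get_online_features(customer_id)
    profile = ", ".join(f"{k}={v}" for k, v in sorted(feats.items()))
    return f"Customer profile: {profile}\nQuestion: {question}"
```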
Example technology stack
- Feast, Vertex AI Feature Store, or Databricks Feature Store
Prompt management

Prompts are among the most important building blocks for Generative AI (GenAI) applications, second only to LLMs themselves. Why? Because application behavior and the majority of the business logic are encoded in the prompts. Complex applications can use tens or hundreds of different prompts, many of which are parametrized with dynamic data or even generated by LLMs.
In small standalone applications, prompts can be hard coded or stored in configuration files. However, complex enterprise applications might require more solid infrastructure for prompt management that not only enables the dynamic alteration of application behavior, but also expedites the resolution of user experience issues, facilitates A/B testing, allows changes and fixes to be tested before production deployment, and even automates prompt testing on multiple models and sets of input parameters. These functions can be consolidated in a separate prompt management system. Prompt management services are available in some ML platforms and as standalone products.
Example technology stack
Guardrails: Safety, compliance, and user experience
The pursuit of user experience quality and safety within LLM-based applications presents a formidable challenge, owing to a multitude of factors, including the following:
Inherent complexity of generated text evaluation: The quality, usefulness, compliance, and safety of generated text are inherently difficult to assess.
Versatility of conversational systems: Conversational systems are extremely versatile and flexible, making it difficult or impossible to perform comprehensive testing.
Integration with vector search: Vector search used in RAG adds an additional layer of complexity and uncertainty.
Continuous LLM updates: LLM providers continuously update their products, causing shifts in application behavior that can prove unpredictable.
Testability of complex LLM chains: Given their intricate interdependencies, complex LLM chains are difficult to test and debug.
These challenges can be addressed at different levels of the LLM stack including datasets for LLM training, fine-tuning, and various interceptors on the LLM provider and application sides. From the LLMOps perspective, request/response interceptors on the application side are a very important and powerful technique. These interceptors, commonly called guardrails, can perform a broad range of checks and make corrections to steer the application behavior, improve user experience, and prevent safety and compliance issues. Examples of safety and compliance guardrails include the following:
Toxicity assessment: This component evaluates text for the presence of violent, harmful, or offensive content, generating scores (e.g. hate_speech = 0.28) that trigger mitigating actions. Toxicity checks can be performed on both user input and LLM output.
Topic bans: This is a component that detects certain topics and takes mitigating actions. For example, it can detect politics-related topics in the user input (e.g. “What do you think of the president?”) and return a blocking response (e.g. “I'm a shopping assistant, I don't like to talk about politics”). This check can be applied to both user input and LLM output.
Relevance validation: This guardrail validates that the generated response is relevant to the input question and context, often measured through similarity scores between the prompt and response. For example, the score will be low for the question “What is the current sales tax in California?” when the answer is “Yangtze River is the longest river in both China and Asia”.
Contradiction assessment: This is a component that verifies the absence of LLM output self-contradictions, contradictions to the input, or contradictions with established facts.
Hallucinations and factuality detection: This guardrail detects hallucinations and non-factual statements in the LLM output. Detecting hallucinations is generally a challenging task that can be approached in many different ways. One practical approach for closed-source LLMs is to generate multiple responses for the same query and validate their agreement.
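The guardrail pattern itself is simple to sketch: a chain of interceptors that each evaluate the text and can short-circuit with a mitigating response. The keyword checks below are deliberately naive placeholders; production guardrails use classifiers or moderation APIs for these decisions.

```python
def topic_ban_check(text):
    # toy politics detector; a real guardrail would use a topic classifier
    blocked = any(word in text.lower() for word in ("president", "election"))
    return {"name": "topic_ban", "passed": not blocked}

def toxicity_check(text):
    # toy stand-in score; a real guardrail would call a moderation model
    score = 1.0 if "hateful" in text.lower() else 0.0
    return {"name": "toxicity", "passed": score < 0.5}

def run_guardrails(text, checks, fallback="I'm a shopping assistant, I can't help with that."):
    # run interceptors in order; the first failure short-circuits with a fallback
    for check in checks:
        result = check(text)
        if not result["passed"]:
            return False, fallback
    return True, text
```

The same pipeline can wrap both the user input (before the LLM call) and the LLM output (before it reaches the user).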
The standard checks described above require limited or no configuration and can be implemented using off-the-shelf libraries. However, you might also need to create more advanced guardrails that combine custom checks and mitigating actions into complex flows. Such custom guardrails might not even be directly related to safety or compliance, but steer other aspects of application behavior that require more flexible frameworks.
Example technology stack
- Standard guardrails: LLM-Guard, OpenAI Moderation API
- Custom flows: NVIDIA NeMo Guardrails, custom guards using Hugging Face models
Applications that integrate with external tools invariably demand specialized guardrails. For example, an application that queries a relational database using text-to-SQL generation needs to validate that the generated SQL is valid, execute it, analyze the response returned by the database, fix the SQL query in case of errors, and repeat until the required data are fetched or retry limits are reached. The guardrails can become even more critical and sophisticated when the application can update data or change the tool state.
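A sketch of such a validate-execute-retry loop, with `sqlite3` standing in for the relational tool. The `generate_sql` callable represents the LLM call; the error message from a failed attempt is fed back so the model can correct the query on the next try.

```python
import sqlite3

def run_generated_sql(generate_sql, conn, max_retries=3):
    # execute LLM-generated SQL; on failure, feed the error back for a corrected query
    error = None
    for _ in range(max_retries):
        query = generate_sql(error)
        try:
            return conn.execute(query).fetchall()
        except sqlite3.Error as exc:
            error = str(exc)  # would be injected into the next generation prompt
    raise RuntimeError(f"query failed after {max_retries} attempts: {error}")
```

When the application can also update data, the loop needs additional guards, such as restricting the generated statements to read-only queries or executing them inside a transaction that can be rolled back.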
Guardrails: Security and privacy
While the previous section delved into the multifaceted challenges of quality and safety, the realm of LLM usage introduces a secondary concern of equal significance: security and privacy. Similar to the management of quality and safety, security and privacy concerns demand an equally comprehensive approach, spanning various layers of the LLM framework. This includes establishing contractual terms with LLM providers, implementing training and RAG data preprocessing and minimization methods, and orchestrating real-time guardrails that scan user inputs and generated outputs.
One possible way of planning the security strategy is to list different classes of assets (data and systems) that we need to protect and possible threats, and make sure that each asset-threat combination is covered using one or several methods.
An example of such planning is presented in the table below. It is important to note that this is a simplified illustration that doesn't account for open-source or private LLMs and associated techniques such as differential privacy.
Similar to safety and compliance, many security and privacy threats can be addressed using real-time guardrails. Standard examples of security guardrails include the following:
Anonymization/deanonymization: This guardrail consists of two parts. The first one is the input interceptor that detects sensitive data elements such as names, addresses, and other PII data, and replaces them with surrogate tokens. The second is the output interceptor that performs the reverse mapping. In some cases, the anonymization operation can be made irreversible.
Prompt injection detection: This component detects malicious prompts that attempt to override system prompts, hijack the conversation context, or instruct the LLM to perform unintended actions (jailbreak). This guardrail is particularly important for LLM agents that control internal or external systems via API.
Secret detection: This scanner detects logins, passwords, credit card numbers, and other sensitive data. This check can be applied to both user input and LLM output.
URL detection: This scanner detects malicious URLs in user input and LLM output.
Rate limiting: This guardrail throttles input querying and LLM invocation rates with the goal of preventing outages, DoS attacks, and excessive (and costly) LLM calls due to defects.
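The throttling part of the list above is commonly implemented as a token bucket. The sketch below is a simplified single-process version; production systems enforce limits per user and per API key, usually against shared storage such as Redis.

```python
import time

class TokenBucketLimiter:
    # allow bursts up to `capacity` calls, refilling at `rate` tokens per second
    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # refill proportionally to the time elapsed since the last call
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```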
One of the key considerations for designing or selecting a security guardrail framework is the ability to efficiently react to new threats and breaches. From that perspective, security guardrails should follow the protocols and best practices used in the cybersecurity industry.
It is worth noting that some of the above guardrails require coordination between the document preprocessing pipeline and query-time processing. For example, a RAG system can prevent the LLM provider from seeing the PII data using the following approach:
The PII data can be replaced with surrogate tokens as a part of the preprocessing stage.
The mapping between actual PII values and tokens is saved in a database.
Reverse mapping is performed at query time to produce the final response for the user.
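The three steps above can be sketched with a reversible anonymizer. The detector below is a toy that handles only email addresses, and the in-memory dictionaries stand in for the mapping database; real systems detect many PII types and persist the mapping securely.

```python
import re

class PIIAnonymizer:
    # reversible masking: replace PII with surrogate tokens at preprocessing time,
    # restore the real values at query time
    def __init__(self):
        self.forward = {}  # value -> token (the mapping "database")
        self.reverse = {}  # token -> value

    def anonymize(self, text):
        def replace(match):
            value = match.group(0)
            if value not in self.forward:
                token = f"<PII_{len(self.forward)}>"
                self.forward[value] = token
                self.reverse[token] = value
            return self.forward[value]
        # toy detector: email addresses only
        return re.sub(r"\b[\w.]+@[\w.]+\.\w+\b", replace, text)

    def deanonymize(self, text):
        for token, value in self.reverse.items():
            text = text.replace(token, value)
        return text
```

With this setup, the LLM provider only ever sees the surrogate tokens, while the user receives responses with the real values restored.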
Example technology stack
- LLM-Guard, Guardrail ML, custom guards using Hugging Face models
Observability

The safety, compliance, and security challenges discussed in the previous sections underscore the importance of observability, which is another well-known concept from traditional MLOps. In the case of LLMOps, observability use cases include the following:
Prompt analytics: The prompts received from users or external systems should be logged and analyzed. In particular, the prompts can be clustered and visualized to better understand the application usage.
Problematic prompts: Prompts that deviate from the typical clusters can be flagged as outliers.
User feedback capturing: As we discussed earlier, the quality of generated text can be challenging to evaluate. The quality analysis and detection of problematic prompts can be facilitated by capturing implicit or explicit user feedback (e.g. thumbs up/down).
Automatic checks: Continuous LLM updates can be countered with automatic quality, compliance, and safety checks.
Monitoring: LLM-backed applications require tracking metrics such as cache hit ratio and throughputs/latencies at different stages of the LLM chains.
Logging: Logging requests, responses, prompts, and automatic checks enable auditability, and support optimization and troubleshooting.
These and other observability capabilities can be implemented using specialized products or custom tools. Observability components should also be integrated with data lakes/warehouses and general-purpose analytics and BI tools to support deep analysis and application optimization.
Example technology stack
Fine-tuning and model management

In the previous sections, we implicitly assumed immutable foundation models. However, most LLM vendors provide the ability to fine-tune the foundation models via specialized APIs. This creates an additional layer of complexity from the LLMOps perspective. More specifically, we need to provide the following capabilities:
Training dataset management: Fine-tuning hinges on the availability of training datasets, which demand data cataloging and tracking, mirroring the conventional practices applied to datasets in traditional MLOps.
Model registry: As the number of fine-tuned models grows, maintaining a registry of such models with the associated metadata, such as training parameters, becomes an important concern. This is also a well-understood problem in traditional MLOps.
Conceptually, these capabilities can be implemented using the same tools as in traditional MLOps, but LLMOps imposes several unique challenges. For example, fine-tuned models can be hosted across multiple LLM providers, making it more difficult to create a unified registry. Another example is model metrics: quality metrics are an integral part of the metadata in traditional MLOps, but automatic evaluation of quality metrics in LLMOps is much more challenging, although possible.
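A toy registry that captures the metadata discussed above, including the hosting provider. All field names and example values are illustrative, not tied to any particular product:

```python
class FineTunedModelRegistry:
    # minimal registry of fine-tuned models hosted across multiple LLM providers
    def __init__(self):
        self.models = {}

    def register(self, model_id, provider, base_model, dataset, params):
        self.models[model_id] = {
            "provider": provider,
            "base_model": base_model,
            "dataset": dataset,  # link back to the training dataset catalog
            "params": params,    # e.g. number of epochs
        }

    def by_provider(self, provider):
        return [m for m, meta in self.models.items() if meta["provider"] == provider]
```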
Example technology stack
- Training dataset management: DataHub
- Model registry: Vertex AI, MLflow
Combining closed-source and open-source LLMs

Throughout the previous sections, we have described an LLMOps platform specifically for closed-source LLMs, but is it reasonable to build applications using only closed-source models? From a strategic perspective, it is generally recommended to consider the individual functions that LLMs perform and make design decisions for each function separately:
Auxiliary functions such as embedding computing, PII data detection and masking, and some other operations performed by guardrails do not necessarily require cutting-edge LLMs and can be implemented using small models. Consequently, these functions are good candidates for implementation using open-source models that are privately deployed and, where needed, fine-tuned.
Text generation and reasoning capabilities typically require state-of-the-art LLMs, which makes them more difficult to implement using the open-source approach. Choosing between public closed-source LLMs and privately deployed open-source LLMs for text generation is a major strategic decision that is informed by functional, security, budgeting, operational, and business considerations.
For these reasons, combining closed-source and open-source models can be particularly advantageous. For example, security concerns can be effectively addressed by using privately deployed open-source models for detecting and masking sensitive data, while subsequently, the actual business flow is executed using a public closed-source LLM based on the masked data.
GenAI adoption is expected to grow at a near-exponential pace in the next few years, and as the scale and complexity of GenAI solutions increase, the role of LLMOps will become increasingly important. Moreover, many pilot solutions that were developed in the early stages of GenAI adoption will ultimately require migration to solid, manageable, and cost-efficient LLMOps platforms. In this article, we outlined the functional and technical designs of such a platform specifically for closed-source LLMs, and provided mappings to well-established and emerging frameworks, libraries, and products that can be used for implementation.