Small LLMs: When Less Is More

The rise of Large Language Models (LLMs) like GPT-4, Bard, and Gemini has been a game-changer in AI language processing. These models are powerful, but do they tell the whole story? Enter small LLMs! At first glance, they might seem less capable because they score lower on various benchmarks. Yet businesses are investing in them for good reason. This blog post explores why smaller can sometimes be better, depending on the use case.

To discuss the differences between small and large LLMs, we identify four important evaluation axes.

Quality: High-quality answers are crucial, and the latest LLMs do deliver impressively. But not every task requires the absolute best in logical reasoning or creativity. When the cost of high-end models outweighs the benefits, a smaller LLM might be the smarter choice. Quality can also be expressed in the number of modalities your model needs to handle well. Say you have a use case that requires vision (images), text, and speech to be integrated into one solution. In that case the larger LLMs might be the better option, as they are typically multimodal to some degree. If you only need one modality, such as text, it might be worth fine-tuning a smaller LLM for the specific task.
Cost: Budget is a real concern for most businesses. The most advanced models come with a steep price, and that investment needs to be justified by the return. When the budget is tight, a smaller LLM that does the job well enough can be the right economic decision. Note, however, that the cost of your LLM also depends heavily on where it is hosted. Without going into too much detail: it is typically cheaper to use a proprietary model when traffic to and from the LLM is low, while hosting an open-source LLM yourself is typically cheaper at higher traffic volumes.
Latency: Speed matters, especially when processing large volumes of text or interacting with users in real time. No user likes to wait, and here, smaller LLMs often have the edge. They can respond faster, keeping up with the demands of instant communication.
Customizability: Not every model understands niche industry terms or company-specific jargon. Smaller LLMs can be more easily fine-tuned to handle particular vocabularies, making them more relevant for specialized applications.
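
One way to make these four axes comparable is to score each candidate model on every axis and weight the axes by how much they matter for your use case. The snippet below is only a minimal sketch of that idea: the `Candidate` class, the model names, the scores, and the weights are all made up for illustration and are not measurements of any real model.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    # Hypothetical scores on a 0-1 scale where higher is always better,
    # so a cheap, fast model gets a HIGH cost and latency score.
    quality: float
    cost: float
    latency: float
    customizability: float


def weighted_score(candidate: Candidate, weights: dict[str, float]) -> float:
    """Weighted average of the four axis scores."""
    total_weight = sum(weights.values())
    weighted_sum = sum(getattr(candidate, axis) * w for axis, w in weights.items())
    return weighted_sum / total_weight


def rank(candidates: list[Candidate], weights: dict[str, float]) -> list[Candidate]:
    """Return the candidates ordered from best to worst weighted score."""
    return sorted(candidates, key=lambda c: weighted_score(c, weights), reverse=True)


# Made-up numbers, purely for illustration.
large = Candidate("large-general-model", quality=0.9, cost=0.3, latency=0.4, customizability=0.3)
small = Candidate("small-fine-tuned-model", quality=0.7, cost=0.8, latency=0.9, customizability=0.8)

# Two hypothetical use cases with different priorities.
latency_sensitive = {"quality": 1.0, "cost": 1.0, "latency": 3.0, "customizability": 1.0}
quality_first = {"quality": 5.0, "cost": 0.5, "latency": 0.5, "customizability": 0.5}

print([c.name for c in rank([large, small], latency_sensitive)])
print([c.name for c in rank([large, small], quality_first)])
```

With these illustrative numbers, the small model ranks first when latency is weighted heavily and the large model ranks first when quality dominates. The point is not the exact figures but that the "best" model changes with the weights you choose.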

Trade-off

This trade-off is also well known at Hugging Face, a well-known French-American company that develops open-source and platform solutions for machine learning. Below you can find a screenshot from the Hugging Face leaderboard, which compares models on quality, memory (cost), and latency. As you can see, there are small LLMs that beat a lot of the larger LLMs, but those small LLMs have been fine-tuned heavily to reach that performance.

Figure: Screenshot from the Hugging Face leaderboard.

In other words: there is always a trade-off between these four axes, and they are clearly important factors when selecting the right LLM for the right use case. But which use cases should be tackled with smaller LLMs, and which with larger ones? Let's have a look:

Quality out of the box
- Legal and Medical Document Analysis: in industries where accuracy is paramount, such as the legal or medical fields, the quality of the generated answers is crucial. LLMs can be used to analyze and summarize complex documents, but any errors could have significant consequences. As large LLMs typically provide better output quality out of the box, a sensible first step is to build a solution around one of the bigger LLMs. This minimizes the time it takes to get your solution up and running.
- Educational Content Creation: when creating educational materials or tools such as intelligent tutoring systems, the quality of the content produced by the LLM is critical. The information must be accurate, well-explained, and pedagogically sound to ensure that students are learning correctly and effectively. A first solution could involve setting up a RAG (retrieval-augmented generation) system with a large LLM. This means that you first index your educational materials and later, for each question, retrieve the relevant passages and feed them to the LLM.
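
To make the RAG idea above a bit more tangible, here is a minimal sketch of the "index first, feed to the LLM later" flow. It is a simplification under stated assumptions: retrieval uses plain word counts and cosine similarity instead of the embedding models a production system would normally use, the sample materials and all function names are hypothetical, and the actual call to an LLM is deliberately left out.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents: list[str]) -> list[tuple[str, Counter]]:
    """Index step: store each document next to its word counts."""
    return [(doc, Counter(tokenize(doc))) for doc in documents]


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(index: list[tuple[str, Counter]], question: str, k: int = 1) -> list[str]:
    """Retrieval step: return the k documents most similar to the question."""
    q = Counter(tokenize(question))
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]


def build_prompt(question: str, passages: list[str]) -> str:
    """Combine the retrieved passages and the question into one prompt."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the course material below.\n\n"
        f"Material:\n{context}\n\nQuestion: {question}"
    )


# Hypothetical educational materials.
materials = [
    "Photosynthesis converts light energy into chemical energy in plants.",
    "The French Revolution began in 1789.",
    "A prime number has exactly two positive divisors.",
]

index = build_index(materials)
question = "When did the French Revolution begin?"
passages = retrieve(index, question, k=1)
prompt = build_prompt(question, passages)
print(prompt)  # this prompt would then be sent to the LLM of your choice
```

The LLM only ever sees the passages that were retrieved for the question, which is why the quality of the index and the retrieval step matters as much as the model itself.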
Cost of Compute
- MVPs (Minimum Viable Products): Projects with limited budgets might prioritize cost of compute when testing out concepts or building MVPs. They may opt for a less expensive LLM that provides "good enough" results for initial prototypes and user testing before committing to more expensive models with higher-quality outputs. Costs typically depend on two things: the size of the model and whether or not the model is open source. Hosting open-source models is typically more expensive when traffic to and from the LLM is low, so we would advise starting with a proprietary model like the ones offered by OpenAI.
- Large-Scale Data Processing: Companies needing to process vast amounts of text data (e.g., social media sentiment analysis, customer feedback categorization) may prioritize cost, especially if the task doesn't require the highest level of language understanding or creativity. Because hosting smaller open-source LLMs is usually more cost-effective in setups with high daily traffic volumes, these small LLMs are strong candidates for the end solution.
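
The low-traffic versus high-traffic argument in the two cases above comes down to a break-even calculation. The sketch below assumes a very simplified cost model: a proprietary API billed purely per token, and a self-hosted open-source model with a flat monthly hosting fee that does not grow with traffic. The prices are hypothetical placeholders, not quotes from any provider, so plug in your own numbers.

```python
def api_monthly_cost(tokens_per_month: float, price_per_million_tokens: float) -> float:
    """Monthly cost of a pay-per-token API."""
    return tokens_per_month / 1_000_000 * price_per_million_tokens


def break_even_tokens(hosting_cost_per_month: float, price_per_million_tokens: float) -> float:
    """Monthly token volume at which the API costs as much as the flat hosting fee."""
    return hosting_cost_per_month / price_per_million_tokens * 1_000_000


def cheaper_option(tokens_per_month: float, hosting_cost_per_month: float,
                   price_per_million_tokens: float) -> str:
    """Name the cheaper setup under this simplified cost model (ties go to self-hosting)."""
    api_cost = api_monthly_cost(tokens_per_month, price_per_million_tokens)
    return "pay-per-token API" if api_cost < hosting_cost_per_month else "self-hosted model"


# Hypothetical prices, in the same arbitrary currency.
PRICE_PER_MILLION_TOKENS = 10.0   # assumed API price
HOSTING_COST_PER_MONTH = 1500.0   # assumed flat cost of a GPU server

threshold = break_even_tokens(HOSTING_COST_PER_MONTH, PRICE_PER_MILLION_TOKENS)
print(f"Break-even at {threshold:,.0f} tokens per month")

for tokens in (10_000_000, 500_000_000):
    choice = cheaper_option(tokens, HOSTING_COST_PER_MONTH, PRICE_PER_MILLION_TOKENS)
    print(f"{tokens:,} tokens/month -> {choice}")
```

Under these assumed prices, an MVP with a few million tokens per month stays well below the break-even point, while a large-scale processing pipeline ends up far above it. A real comparison would also need to account for engineering time, scaling the hosting setup, and differences in output quality.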
Latency
- Real-Time Translation Services: For applications that require immediate translation, such as live customer support chat or real-time communication tools, latency is a critical factor. The LLM must deliver translations quickly to maintain the flow of conversation without noticeable delays. Smaller LLMs have an edge here as their latency is typically lower.
- Interactive Voice Assistants: In voice-operated devices and applications, quick response times are essential to provide a seamless user experience. Users expect near-instantaneous replies to their queries, making the low-latency smaller LLMs an interesting choice for these products.
Customizability
- Industry-Specific Customer Service Bots: In industries with specialized terminology, such as finance, legal, or healthcare, customer service bots can be fine-tuned with industry-specific data to understand and respond accurately to inquiries that contain niche jargon. Smaller LLMs have an edge once again.
- Legal and Medical Document Analysis: let's say that you have created a proof-of-concept with a large LLM and you find that the initial solution mainly falls short in terms of output quality. After some investigation, you discover that this is mainly due to specific jargon in your dataset. In this case, it might make sense to investigate whether fine-tuning a smaller LLM for your specific use case can increase the quality of the final solution.

Conclusion

In short, choosing the right LLM for your project isn't just about picking the biggest model available. Multiple factors decide which particular model you want to use. However, as our customers often prefer a quick and affordable solution with great quality out of the box, at element61 we typically start by implementing one of the "bigger" models. If that initial solution falls short on one of the four evaluation axes, we can still pivot to a more customized solution.
