Mistral: The French Revolution of Generative AI

The generative AI field is a fast-paced, continuously changing landscape dominated by the United States and China. To counter this dominance, European companies are increasingly recognizing the importance of creating their own Large Language Models (LLMs) trained on European data. This allows for the development of models fluent in the most widely spoken European languages, as opposed to LLMs trained primarily on English-language data. An often-heard concern is that the European Union tends to focus more on legislation than on fostering growth. However, there is positive news on the horizon!

Several EU-based companies, often supported by major corporations and governments, have taken up the challenge of developing European LLMs. We investigated one of them, Mistral, and explored how its models compare to the well-known American models. Other initiatives include Aleph Alpha, a German hub for AI innovation with funding from the Schwarz Gruppe - owners of Lidl and Kaufland - and SAP, with a strong focus on data sovereignty. At the time of writing, there is a strong French-German alliance pushing for less European regulation and more development.

Who is Mistral?

Mistral, headquartered in Paris, boasts 22 employees and was founded by three school friends with prior experience at Meta and Google. In their latest funding round, they secured €105M, valuing the company at €2B. This is, however, still a fraction of the valuation of their American counterparts. Mistral is dedicated to developing open-source LLMs that not only rival popular models in performance but are also open and transparent.

Mixtral 8x7B: new kid on the block

Mistral's ambition to offer transparent and open-source models is not just smoke and mirrors; they released their latest model, Mixtral 8x7B - not to be confused with their earlier Mistral 7B model - by posting a magnet link on X. This unconventional approach stands in stark contrast to how major players unveil models and showcases Mistral's distinctive marketing strategy. The model itself is a mixture-of-experts implementation, meaning that it combines eight expert networks into a single model, with a router deciding which experts process each token.

This technique creates so-called sparse models, which are increasingly recognized for their potential. Sparse models use a method known as conditional computation: depending on the input, a sparse model like Mixtral 8x7B activates only part of its network - in Mixtral's case, two out of eight experts per layer for each token. This is in stark contrast with the approach Google took with Gemini, which was built to be multimodal from the ground up - we covered Google Gemini in one of our previous Insights.

The advantages of Mistral's approach are manifold. First, it allows the model's size to be drastically increased without a matching increase in computational demands: only about 13B of the model's roughly 47B parameters are active for any given token, which is not only more efficient in terms of scalability but also more environmentally friendly. Second, dense models often struggle with issues like catastrophic forgetting, a phenomenon in which large models fine-tuned on specific tasks tend to forget some of the tasks they were initially trained for. This issue is further amplified when a model tries to learn too many tasks at once or in sequence. By splitting the model into eight experts, Mixtral 8x7B introduces a natural segmentation, which is beneficial for multitasking and continual learning.
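To make the routing idea concrete, here is a minimal, illustrative sketch of top-2 expert routing in Python. This is a toy example, not Mistral's actual implementation: the router weights, expert functions, and dimensions are all invented for demonstration purposes.

```python
import numpy as np

def top2_moe_layer(x, gate_w, experts):
    """Route token vector x to the 2 highest-scoring experts and mix their outputs."""
    logits = x @ gate_w                       # one router score per expert
    top2 = np.argsort(logits)[-2:]            # indices of the two best experts
    weights = np.exp(logits[top2] - logits[top2].max())
    weights /= weights.sum()                  # softmax over the selected pair
    # Only the two chosen experts run; the other six stay idle (sparsity).
    return sum(w * experts[i](x) for w, i in zip(weights, top2))

rng = np.random.default_rng(0)
d, n_experts = 16, 8
# Each "expert" here is just a random linear map, standing in for a feed-forward block.
experts = [lambda v, W=rng.normal(size=(d, d)): v @ W for _ in range(n_experts)]
gate_w = rng.normal(size=(d, n_experts))

token = rng.normal(size=d)
print(top2_moe_layer(token, gate_w, experts).shape)  # -> (16,)
```

In Mixtral, each expert is a feed-forward block inside every transformer layer rather than a standalone model, but the principle is the same: a router scores the experts, keeps the top two, and blends their outputs with softmax weights.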

Not just another LLM

What makes Mixtral 8x7B unique is that it offers performance similar to GPT-3.5 and even outperforms LLaMa 2 70B on most benchmarks, all while being small enough to run on a personal computer.

(source: Mistral - Mixtral of experts)

Fluent in five languages - English, French, Italian, German, and Spanish - this model also exhibits strong code generation and mathematical capabilities. With its smaller size and comparable performance to much larger models, Mixtral 8x7B positions itself as one of the best models in terms of the cost/performance trade-off.

Why use Mistral's models and not GPT-3.5?

When considering a Mistral model versus an OpenAI model like GPT-3.5, it's essential to grasp the pros and cons of both solutions and choose the option that works best for you.

Generally speaking, an OpenAI model like GPT-3.5 will give you a faster time to market, a lower cost at low usage (a low number of tokens per day), and an easy infrastructure setup. On top of that, it requires only minimal specialization of your workforce in the field of LLMs. However, when not set up correctly - e.g. when you are not hosting it yourself via your Azure platform - you risk exposing your data or running into higher costs when your request volume increases dramatically over time.

On the other hand, open-source models like Mistral's are typically transparent and customizable, which can lead to better performance than GPT models on domain-specific tasks. Furthermore, open-source models are typically more economically viable at high request volumes than GPT models. On the flip side, though, Mistral's models require quite a bit more infrastructure setup as well as a more specialized workforce - especially when fine-tuning the model.

How to use it

Currently, Mistral offers three models on its platform, called La Plateforme, each with its own performance/price trade-off:

  • Mistral-tiny: only available in English, scoring 7.6 on MT-Bench
  • Mistral-small (powered by Mixtral 8x7B): available in the five languages mentioned above, scoring 8.3 on MT-Bench
  • Mistral-medium: available in the five languages mentioned above, scoring 8.6 on MT-Bench

The model is available on Hugging Face under an Apache 2.0 license, and Hugging Face's AutoTrain feature can be used to fine-tune it on your own data.
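As an illustration, below is a minimal sketch of loading the instruction-tuned checkpoint (mistralai/Mixtral-8x7B-Instruct-v0.1) with the transformers library. It assumes the transformers and accelerate packages are installed and that your hardware can hold the weights; on consumer hardware, quantized loading is usually needed.

```python
# Minimal sketch: load Mixtral 8x7B Instruct from the Hugging Face Hub.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" (via accelerate) spreads the weights over available devices.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Mistral's instruct models expect the [INST] ... [/INST] prompt format.
prompt = "[INST] Explain mixture of experts in one sentence. [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```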

Another way is to access the Mistral API, though this is currently in beta and only accessible as a paid service. Billing is per token, meaning there is no monthly fee; you only pay for what you use. Via the API, a "safe mode" can be enabled, which enforces guardrails on the output of the model. This is especially important if you use Mistral models to develop products that display their output directly to the end user.
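As a sketch of what such a call looks like, the snippet below posts a chat request to Mistral's chat completions endpoint. The safe_prompt flag follows Mistral's API documentation at the time of writing; the endpoint, model names, and parameter names may evolve, so treat this as an illustration rather than a definitive reference.

```python
# Hedged sketch: query Mistral's hosted API with the safety prompt enabled.
import os
import requests

resp = requests.post(
    "https://api.mistral.ai/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['MISTRAL_API_KEY']}"},
    json={
        "model": "mistral-small",
        "messages": [{"role": "user", "content": "Bonjour, qui es-tu ?"}],
        "safe_prompt": True,  # enforce Mistral's guardrails on the output
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```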

The model can also be run locally using Ollama (e.g. via ollama run mixtral); note, however, that this requires substantial compute power.
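For illustration, here is a minimal sketch that queries a locally running Ollama server over its REST API, assuming the mixtral model has already been pulled with ollama pull mixtral:

```python
# Minimal sketch: ask a local Ollama server to run Mixtral on one prompt.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",  # Ollama's default local endpoint
    json={"model": "mixtral", "prompt": "Why is the sky blue?", "stream": False},
    timeout=300,  # generous timeout; local generation can be slow
)
resp.raise_for_status()
print(resp.json()["response"])
```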

Yet another possibility is to deploy the model in Azure AI Studio, but this in turn requires close inspection of the associated hosting costs, as these can explode if not closely monitored.

Last but not least, Mistral is also part of the Databricks Marketplace. This means that you, as a developer, can easily run the Mistral model from your Databricks compute. Great feature!

What's next?

Mistral's latest model represents just one of many small LLMs that offer performance comparable to much larger, general-purpose LLMs. It raises the intriguing possibility of smaller, task-specific LLMs dominating the future landscape, potentially outperforming their more generalized counterparts. So is Mistral like the French Revolution in Generative AI? Who knows! Only the future will tell. So sit back and enjoy the developments that are happening in the Generative AI space!


Sources