With the rapid evolution of large language models (LLMs), the world has begun to rely on the APIs of managed AI models such as OpenAI’s GPT-4. Its cutting-edge capabilities and developer-friendly interface have won it a large fan base, not only in academia but across industry as well. But as every rose has its thorn, these proprietary APIs come with downsides despite all the benefits they bring to the table: performance reliability, uptime predictability, and cost, among others. At the same time, a flurry of open-source small language models (SLMs) has become available for commercial use.
So, here we face the ultimate question:
Is GPT-4 really worth it?
Or can those open-source SLMs replace proprietary LLMs?
But before we dive into this question, another problem comes into play.
How are we going to evaluate proprietary LLMs against open-source SLMs?
In the paper, the authors introduce a new toolset, SLaM, to evaluate the performance of SLMs vs. LLMs across a wide range of metrics.
SLaM is a system that facilitates the process of hosting and evaluating language models on the AWS cloud. It works by downloading models from the Hugging Face repository and setting them up for experiments. To evaluate the quality of these models, SLaM uses both human and automated methods. For human evaluation, SLaM provides a user interface (UI) where humans can rate model responses without knowing which model generated which response. For automated evaluation, SLaM uses a Similarity Scorer and GPT-4-based scoring to assess response quality based on similarity metrics. In addition to this, SLaM performs automatic performance evaluations by measuring query latency over time.
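To make the automated path concrete, here is a minimal sketch of what a similarity-based scorer and a latency probe could look like. This is not the authors’ implementation: it assumes the sentence-transformers library and treats cosine similarity between an SLM response and a GPT-4 reference response as the quality proxy, with a simple wall-clock timer standing in for the latency measurement.

```python
# Hedged sketch of an automated similarity scorer, NOT the SLaM implementation.
# Assumes the sentence-transformers library; cosine similarity to a GPT-4
# reference answer stands in for "response quality".
import time

from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence-embedding model works


def similarity_score(reference: str, candidate: str) -> float:
    """Cosine similarity between the reference (LLM) and candidate (SLM) responses."""
    ref_vec, cand_vec = embedder.encode([reference, candidate], convert_to_tensor=True)
    return util.cos_sim(ref_vec, cand_vec).item()


def timed_generate(generate_fn, prompt: str):
    """Wrap any model call so the query latency is recorded alongside the response."""
    start = time.perf_counter()
    response = generate_fn(prompt)
    return response, time.perf_counter() - start
```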
To ground this in a real-world product, the authors consider the ‘PEP-TALK’ feature of ‘Myca’, a task management and productivity application. When users log into the app at the start of their day, it gives them a positive message tailored to their previous day’s achievements, plans, and overall progress toward their goals. The ‘PEP-TALK’ feature combines the user’s past activities and plans into a prompt, which is fed into a language model; the model generates a motivational response that is displayed to the user.
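As a rough illustration of how such a feature might assemble its prompt (the field names and template below are hypothetical, not Myca’s actual implementation):

```python
# Hypothetical sketch of a PEP-TALK-style prompt builder; the field names and
# wording are illustrative and are not Myca's actual template.
def build_pep_talk_prompt(done_yesterday, planned_today, goals):
    return (
        "You are an encouraging productivity coach.\n"
        f"Yesterday the user completed: {', '.join(done_yesterday)}.\n"
        f"Today they plan to: {', '.join(planned_today)}.\n"
        f"Their longer-term goals: {', '.join(goals)}.\n"
        "Write a short, upbeat pep talk that references this progress."
    )


prompt = build_pep_talk_prompt(
    done_yesterday=["finished the quarterly report"],
    planned_today=["prepare slides", "team sync"],
    goals=["ship v2 by June"],
)
# The prompt is then sent to the chosen model (GPT-4 or a self-hosted SLM).
```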
So, this paper discusses four questions related to moving from proprietary LLMs to self-hosted SLMs.
1. Is the quality of SLMs good enough for users?
2. How well can AI-assisted tooling automate the process of identifying SLM alternatives?
3. What are the latency implications of self-hosted SLMs in a utility-based cloud environment?
4. What are the cost tradeoffs of open-source SLMs compared to proprietary LLM APIs?
In this article, we are going to focus on the first of these questions: the quality of the content generated by SLMs.
This box plot depicts the human-rated performance of the 9 SLMs, together with their 29 quantized variants, alongside a proprietary LLM.
The plot shows that some SLMs can produce responses just as good as, or even better than, OpenAI’s larger models. Notably, the quantized versions of these models (smaller, optimized variants that need far less memory) hold up well, which is encouraging for real-world deployment. Most of the SLMs tested produce responses nearly as good as the best models, indicating that their response quality is quite close to that of OpenAI’s models. However, a few specific models, namely orca2:7b (both the base and quantized versions) and stablelm-zephyr:3b-q3, did not perform as well and had noticeably poorer response quality than the others.
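For context, model tags like orca2:7b and stablelm-zephyr:3b-q3 follow the format used by local serving tools such as Ollama, so a quantized SLM can be queried on commodity hardware with a call along these lines (a hedged sketch assuming a locally running Ollama server, not the paper’s evaluation harness):

```python
# Hedged sketch: querying a locally served quantized SLM over Ollama's REST API.
# Assumes an Ollama server on the default port with the model already pulled.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "orca2:7b",  # any locally pulled model tag works here
        "prompt": "Write a short pep talk for someone starting their workday.",
        "stream": False,
    },
    timeout=120,
)
print(resp.json()["response"])
```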
So, SLMs can indeed produce content just as good as proprietary LLMs,
but the debate on whether they can replace LLMs continues. And we can’t come to a final decision without the answers to the other three questions.