With recent developments in artificial intelligence (AI) and natural language processing (NLP), large language models (LLMs) are gaining attention from businesses for building advanced applications that aim to improve our daily lives. From customer support chatbots to virtual personal assistants, the capabilities of these models continue to grow. A proper benchmarking system that measures the reliability, limitations and effectiveness of these models for a particular business use case therefore becomes necessary.

Data security is another reason why benchmarking of various models is important. Some companies, such as banks, are concerned about data security and are not willing, or not allowed, to share their data with companies like OpenAI. For them, an in-house (on-premise or self-managed cloud) LLM system similar to GPT is the best solution, and benchmarking becomes an important and mandatory step in the process.

When evaluating LLMs, we need to test them against various scenarios such as reasoning, understanding linguistic subtleties and answering questions related to specialized domains. Since LLMs are capable of a wide range of tasks and can be judged using several performance metrics, benchmarking and evaluating them is difficult.

Different approaches for Benchmarking

There are several key methodological approaches to evaluating large language models:

Our approach and results

We decided to evaluate various LLMs (open source and OpenAI) from a question-answering point of view using two of the approaches described above: Task-Specific Evaluation and Human Evaluation.

For the Task-Specific Evaluation we chose the Google BoolQ dataset, and for the Human Evaluation we chose our own predefined set of questions of varying difficulty, based on a given text.

The goal is to get a holistic picture of how well the LLM understands the context, judged by whether it can generate meaningful and coherent answers to difficult questions.

BoolQ dataset evaluation:

BoolQ is a question answering dataset of yes/no questions gathered from anonymized, aggregated queries to the Google search engine. These questions are naturally occurring, as they were generated in unprompted and unconstrained settings.
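
As an illustration, the core of such an evaluation can be a simple accuracy loop like the sketch below. It assumes the dataset is loaded with the Hugging Face datasets library; ask_model is a hypothetical placeholder for whichever LLM is being benchmarked, not part of our actual setup.

    # Minimal BoolQ accuracy sketch (ask_model is a hypothetical placeholder).
    from datasets import load_dataset

    def ask_model(passage: str, question: str) -> bool:
        # Send the passage and question to the LLM under test and map its
        # reply to True ("yes") or False ("no").
        raise NotImplementedError

    validation = load_dataset("boolq", split="validation")

    correct = 0
    for example in validation:
        prediction = ask_model(example["passage"], example["question"])
        correct += int(prediction == example["answer"])

    print(f"Success rate: {100 * correct / len(validation):.1f}%")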

We fed this dataset to the various LLMs and calculated their success rate by measuring the number of correct answers. Below is the result of this benchmarking:

We can see that the google/flan-t5-xxl, tiiuae/falcon-180B-chat and gpt3.5 models performed better than the other models, with google/flan-t5-xxl being the best and databricks/dolly-v2-3b being the worst. It is interesting to note that gpt3.5 was outperformed by google/flan-t5-xxl and tiiuae/falcon-180B-chat, while the large Llama model meta-llama/Llama-2-70b-chat-hf performed below average.

Human evaluation:

For the Human Evaluation we selected the story The Snow Queen and created a set of benchmark questions and answers based on that story. We then evaluated the LLMs by feeding them these questions and measuring the correctness of the corresponding answers, grading them using a semi-manual approach.

We chose questions with various levels of difficulty, for example:
Easy question: Who were the two close friends mentioned in the story?
Difficult question: What object connected Karl's and Gerda’s houses, symbolizing their friendship in the story?

The full list of questions and answers can be found in Appendix 1; the answers given by the various LLMs are found in Appendix 2.

Overall, the performance of the various LLMs we tested varied a lot. The best model was tiiuae/falcon-180B-chat, which was quite accurate and even outperformed OpenAI's GPT-3.5 on this task, while smaller models like the 7B version of Falcon or Databricks Dolly performed quite poorly.

Criteria for Grading Answers:

We semi-manually graded the given answers based on the following criteria:

We used a grade-point system to score each answer from 1 to 4, including intermediate points such as 1/3 and 2/3. We then used GPT-4 to do the actual grading; please refer to Appendix 3 for more details.
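
The sketch below illustrates how such GPT-4-assisted grading could look; it assumes the official openai Python client, and the grading prompt here is a simplified stand-in rather than the exact prompt from Appendix 3.

    # Simplified GPT-4 grading sketch (not the exact prompt from Appendix 3).
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def grade_answer(question: str, reference: str, candidate: str) -> str:
        prompt = (
            "Grade the candidate answer against the reference answer on a "
            "scale from 1 to 4 (intermediate values allowed). "
            "Reply with the grade only.\n"
            f"Question: {question}\n"
            f"Reference answer: {reference}\n"
            f"Candidate answer: {candidate}"
        )
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content.strip()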

For the different LLMs we obtained the following results for the various questions:

Success rate in percentage:

In this open question answering task, the tiiuae/falcon-180B-chat, gpt3.5 and meta-llama/Llama-2-70b-chat-hf models performed better than the other models, with databricks/dolly-v2-3b being the worst. The best model is tiiuae/falcon-180B-chat, and it even outperforms gpt3.5. The google/flan-t5-xxl model, which was the best in the BoolQ benchmarking task, is only an average performer in the open question answering task. The second best model is gpt3.5, followed by meta-llama/Llama-2-70b-chat-hf, which shows that meta-llama/Llama-2-70b-chat-hf is a good competitor to gpt3.5 for this kind of task; being significantly smaller than tiiuae/falcon-180B-chat, it can also be used as an alternative to gpt3.5.

Conclusion

We can see from the above results that google/flan-t5-xxl is the best model in the BoolQ benchmarking, followed by tiiuae/falcon-180B-chat, while in the Human Evaluation task tiiuae/falcon-180B-chat is clearly leading and outperforms the other LLMs.

The best model in the BoolQ benchmarking task (google/flan-t5-xxl) is actually quite disappointing in the Human Evaluation task; this might be due to its specialized training on the BoolQ dataset.

This also shows the necessity of evaluating on open questions if one wants to use the model for tasks like knowledge extraction: purely classification-based tasks can be misleading, and good performance on them is not necessarily indicative of good performance on tasks that require text generation.

In summary, the tiiuae/falcon-180B-chat model emerges as the clear leader, surpassing GPT-3.5 in both tasks. This model represents an excellent choice for hosting an in-house, open-source LLM for question-answering systems. Its availability as a free, open-source option with multilingual capabilities is particularly advantageous. However, deploying it on a self-managed cloud or on-premise setup can be costly, often requiring multiple high-end GPUs. A more budget-friendly alternative might be the quantized version, Falcon-180B-Chat-GGUF, which operates on a single GPU (such as an Nvidia A100 or H100). Yet, its performance compared to the original model remains to be verified.
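
For reference, running such a GGUF model locally could look like the minimal sketch below; it assumes the llama-cpp-python bindings and a quantized model file already downloaded to disk, and the file name is a placeholder.

    # Minimal sketch for querying a quantized Falcon GGUF model with
    # llama-cpp-python; the model path is a placeholder.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./falcon-180b-chat.Q4_K_M.gguf",  # placeholder path
        n_gpu_layers=-1,  # offload all layers to the GPU
        n_ctx=2048,
    )

    output = llm(
        "Question: Who were the two close friends in the story?\nAnswer:",
        max_tokens=64,
    )
    print(output["choices"][0]["text"])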

Our results showed that open-source models like tiiuae/falcon-180B-chat can be competitive with gpt3.5 when it comes to answering difficult questions based on a given context, and hence can be used by businesses as an alternative to gpt3.5. Using open-source models also gives the ability to fine-tune the model according to business demands, along with greater flexibility and full control.

Future work

Possible future work involves testing the quantized version of the tiiuae/falcon-180B-chat model to see if we can get the power of the original model while working on a single GPU (which might indeed be as good as the full version as reported here); this is also cost effective and can save a lot of money for businesses. If the performance of the quantized version is too low, it might be interesting to fine-tune it with QLoRA: adding low-rank matrices to the quantized weight matrices and then fine-tuning the model on the original dataset (RefinedWeb) on which the tiiuae/falcon-180B-chat model was trained. With this approach we could hope that the fine-tuning compensates for the degradation caused by quantization, while the full-rank but quantized parameter skeleton provides a solid base for the fine-tuning. This would hopefully distill the 180B Falcon model into a much smaller model without loss of performance.
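
As a rough illustration, a QLoRA setup along these lines could look like the sketch below; it assumes the Hugging Face transformers, peft and bitsandbytes libraries with 4-bit NF4 quantization, and the hyperparameters are placeholders rather than a tested configuration.

    # Rough QLoRA sketch: load the model in 4-bit and attach low-rank adapters.
    # Hyperparameters are placeholders, not a tested configuration.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    model_id = "tiiuae/falcon-180B-chat"

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
    )

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=bnb_config, device_map="auto"
    )
    model = prepare_model_for_kbit_training(model)

    lora_config = LoraConfig(
        r=16,                                # rank of the added low-rank matrices
        lora_alpha=32,
        target_modules=["query_key_value"],  # attention projection in Falcon
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)

    # The adapted model can now be fine-tuned on RefinedWeb (or a subset of it)
    # with a standard training loop, updating only the low-rank adapter weights.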

Appendix 1

Question: Who were the two close friends mentioned in the story?

Answer: The two close friends mentioned in the story were Karl and Gerda.

Question: How did Karl's behavior change after a splinter from the shattered mirror pierced him?

Answer: After a splinter from the shattered mirror pierced him, Karl turned into a very nasty boy, often insulting others, especially Gerda.

Question: How was Gerda able to break the evil spell on Karl at the end of the story?

Answer: Gerda was able to break the evil spell on Karl by throwing her arms around him and letting her teardrops drip onto his chest and heart. This act of love and emotion caused the evil spell to be broken.

Question: What object connected Karl's and Gerda’s houses, symbolizing their friendship in the story?

Answer: The sweet pea that grew on Karl's window sill spread across the street to entwine with Gerda's little rose bush, symbolizing their friendship.

Question: Where was the kingdom of the snow queen and how to reach there?

Answer: The kingdom of the Snow Queen was in Lapland, a place where all is icy cold. To reach there, Gerda traveled on the back of a reindeer through the frozen tundra, guided by the Northern Lights.

Question: Who helped Gerda in finding her friend?

Answer: A crow and a reindeer helped Gerda in finding her friend. The crow informed Gerda about Karl's whereabouts with the Snow Queen, and the reindeer carried her to the Snow Queen's kingdom.

Question: How did the snow queen take away Karl with her?

Answer: The Snow Queen approached Karl, asked him to tie his sledge to hers, and then they sped away into the sky, eventually landing in her icy kingdom.

Download Appendix 2 - Overview Questions and Answers per LLM