With recent developments in artificial intelligence (AI) and natural language processing (NLP), large language models (LLMs) are gaining attention from businesses for building advanced applications that aim to improve our daily lives. From customer support chatbots to virtual personal assistants, the capabilities of these models continue to grow. This makes a proper benchmarking process, one that measures the reliability, limitations and effectiveness of these models for a particular business use case, a necessity.
Data security is another reason why benchmarking different models matters. Some enterprises, such as banks, are concerned about data security and are not willing, or not allowed, to share their data with companies like OpenAI. For them, running their own in-house LLM system (on-premise or on a self-managed cloud) similar to GPT is the best solution, and benchmarking becomes an important and mandatory step in choosing such a system.
When evaluating LLMs, they need to be tested against various scenarios such as reasoning, understanding linguistic subtleties and answering questions related to specialized domains. Since LLMs are capable of many different tasks and can be judged using several performance metrics, benchmarking and evaluating them is difficult.
Different approaches for Benchmarking
There are several key methodological approaches to evaluating large language models:
- Task-Specific Evaluation: In this method, the language model is evaluated based on its performance on specific tasks, such as question answering, text summarization, or machine translation. This evaluation usually involves using established datasets and metrics for each task.
- Few-Shot Learning Evaluation: In this approach, the language model is given a few examples of a task at inference time, then asked to complete a similar task. The performance on these tasks is then measured. This method tests the model's ability to generalize learning from a few examples to a new instance of a task.
- Zero-Shot Learning Evaluation: Similar to few-shot learning, but in this case the model is not given any prior examples at inference time. The model's performance on these tasks, not seen during training, is measured. This tests the model's ability to understand and complete tasks it was not specifically trained to perform (a minimal prompt sketch contrasting the few-shot and zero-shot setups follows this list).
- Fine-Tuning Evaluation: In this approach, the model is fine-tuned on a specific task with additional task-specific training, then its performance on that task is measured. This helps to understand how well the model can adapt to specific tasks after pretraining.
- Human Evaluation: Finally, human evaluation plays a crucial role in benchmarking language models. This might involve humans rating the coherence, relevance, or factual correctness of the text generated by the model. Human evaluation can also involve more specific tasks, such as assessing the model's ability to generate creative stories, answering questions based on text, etc.
- Bias and Fairness Evaluation: This involves assessing the model's output for any biases or unfair portrayals based on factors like gender, race, religion, etc. This helps in understanding if the model has inadvertently learned any societal biases from its training data.
- Safety and Robustness Evaluation: This approach tests how well the model handles malicious input, misinformation, or adversarial attacks.
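To make the few-shot and zero-shot setups concrete, here is a minimal sketch using the Hugging Face transformers pipeline; the model, task and prompts are illustrative assumptions, not the configuration used in our benchmark.

```python
# Minimal sketch: zero-shot vs. few-shot prompting of an instruction-tuned model.
# The model and prompts are illustrative only, not our actual benchmark setup.
from transformers import pipeline

generator = pipeline("text2text-generation", model="google/flan-t5-base")

# Zero-shot: the task is described, but no solved examples are given.
zero_shot_prompt = (
    "Classify the sentiment of the following review as positive or negative.\n"
    "Review: The battery died after two days.\n"
    "Sentiment:"
)

# Few-shot: a handful of solved examples precede the new instance.
few_shot_prompt = (
    "Review: The screen is gorgeous and the sound is great.\nSentiment: positive\n"
    "Review: It stopped working after a week.\nSentiment: negative\n"
    "Review: The battery died after two days.\nSentiment:"
)

for name, prompt in [("zero-shot", zero_shot_prompt), ("few-shot", few_shot_prompt)]:
    answer = generator(prompt, max_new_tokens=5)[0]["generated_text"]
    print(f"{name}: {answer.strip()}")
```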
Our approach and results
We decided to evaluate various LLMs (open source and OpenAI) from the question-answering point of view using two of the approaches described above: Task-Specific Evaluation and Human Evaluation.
For the Task-Specific Evaluation we chose the Google BoolQ dataset, and for the Human Evaluation we chose a predefined set of questions of varying difficulty based on a given text.
The goal is to get a holistic picture of how well the LLM understands the context by generating meaningful and coherent answers to difficult questions.
BoolQ dataset evaluation:
BoolQ is a question answering dataset of yes/no questions gathered from anonymized, aggregated queries to the Google search engine. These questions are naturally occurring, as they were generated in unprompted and unconstrained settings.
We fed this dataset to the various LLMs and calculated their success rate by measuring the number of correct answers. Below is the result of this benchmark:
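As a rough illustration of such a harness (not our exact prompts, models or code), the sketch below loads BoolQ from the Hugging Face hub, asks a small stand-in model for a yes/no answer and counts exact matches:

```python
# Rough sketch of a BoolQ accuracy harness (not our exact prompts or models):
# load the dataset, ask the model for a yes/no answer, and count exact matches.
from datasets import load_dataset
from transformers import pipeline

dataset = load_dataset("google/boolq", split="validation")
generator = pipeline("text2text-generation", model="google/flan-t5-base")  # stand-in model

subset = dataset.select(range(200))  # small sample to keep the sketch cheap
correct = 0
for example in subset:
    prompt = (
        f"Passage: {example['passage']}\n"
        f"Question: {example['question']}\n"
        "Answer with yes or no:"
    )
    reply = generator(prompt, max_new_tokens=3)[0]["generated_text"].strip().lower()
    predicted = reply.startswith("yes")
    correct += int(predicted == example["answer"])

print(f"Success rate: {correct / len(subset):.1%}")
```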
We can see that the google/flan-t5-xxl, tiiuae/falcon-180B-chat and GPT-3.5 models performed better than the other models, with google/flan-t5-xxl being the best and databricks/dolly-v2-3b being the worst. It is interesting to note that GPT-3.5 was outperformed by google/flan-t5-xxl and tiiuae/falcon-180B-chat, while the large Llama model meta-llama/Llama-2-70b-chat-hf performed below average.
Human evaluation:
For the Human Evaluation we selected the story The Snow Queen and created a set of benchmark questions and answers based on it. We then evaluated the LLMs by feeding them these questions and measuring the correctness of the corresponding answers, grading them using a semi-manual approach¹.
We chose questions with various levels of difficulty, for example:
Easy question: Who were the two close friends mentioned in the story?
Difficult question: What object connected Karl's and Gerda’s houses, symbolizing their friendship in the story?
The full list of questions and answers can be found in Appendix 1; the answers given by the various LLMs are in Appendix 2.
Overall, the performance of the LLMs we tested varied a lot. The best model was tiiuae/falcon-180B-chat, which was quite accurate and even outperformed OpenAI's GPT-3.5 on this task, while smaller models like the 7B version of Falcon or the Databricks Dolly performed quite poorly.
¹ Supported by GPT-4.
Criteria for Grading Answers:
We semi-manually graded the given answers based on the following criteria:
- Accuracy: The answer should be correct according to the details provided in the story.
- Completeness: The answer should cover all aspects of the question without leaving out key details.
- Clarity: The answer should be clearly written, making it easy for the reader to understand.
- Brevity: While the answer should be complete, it should also be concise, avoiding unnecessary details.
- Directness: The answer should address the question directly without deviating from the main point.
We used a grade point scale from 1 to 4 to score the answers, including intermediate values in steps of one third (e.g., 3.33 or 3.66). We then used GPT-4 to do the actual grading; please refer to Appendix 3 for more details.
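The grading step could be scripted roughly as in the sketch below; the prompt wording and helper function are illustrative assumptions rather than our exact Appendix 3 prompt, and the snippet assumes the OpenAI Python client (v1) with an API key in the environment.

```python
# Illustrative sketch of grading a model answer with GPT-4 (not our exact prompt).
# Assumes the OpenAI Python client v1.x and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

def grade_answer(question: str, reference_answer: str, model_answer: str) -> str:
    """Ask GPT-4 to grade a candidate answer from A to D against a reference answer."""
    prompt = (
        "Grade the candidate answer against the reference answer on accuracy, "
        "completeness, clarity, brevity and directness. "
        "Reply with a single letter grade from A to D, optionally with + or -.\n\n"
        f"Question: {question}\n"
        f"Reference answer: {reference_answer}\n"
        f"Candidate answer: {model_answer}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

# Example call with a question from Appendix 1 (the candidate answer is made up here):
print(grade_answer(
    "Who were the two close friends mentioned in the story?",
    "The two close friends mentioned in the story were Karl and Gerda.",
    "Karl and Gerda were close friends.",
))
```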
For the different LLMs we obtained the following results across the various questions:
Success rate in percentage:
In this open question answering task, the tiiuae/falcon-180B-chat, GPT-3.5 and meta-llama/Llama-2-70b-chat-hf models performed better than the other models, with databricks/dolly-v2-3b being the worst. The best model is tiiuae/falcon-180B-chat, which even outperforms GPT-3.5. The google/flan-t5-xxl model, which was the best in the BoolQ benchmark, is only an average performer on this open question answering task. The second best model is GPT-3.5, followed by meta-llama/Llama-2-70b-chat-hf, which shows that meta-llama/Llama-2-70b-chat-hf is a good competitor to GPT-3.5 for this kind of task and, being significantly smaller than the tiiuae/falcon-180B-chat model, can also be used as an alternative to GPT-3.5.
Conclusion
We can see from the above results that google/flan-t5-xxl is the best model in the BoolQ benchmark, followed by tiiuae/falcon-180B-chat, while in the Human Evaluation task tiiuae/falcon-180B-chat is clearly in the lead, outperforming the other LLMs.
The best model in the BoolQ benchmark (google/flan-t5-xxl) is actually quite disappointing in the Human Evaluation task; this might be due to its specialized training on the BoolQ dataset.
This also shows the necessity of evaluating on open questions if one wants to use a model for tasks like knowledge extraction: purely classification-based tasks can be misleading, and good performance on them is not necessarily indicative of good performance on tasks that require text generation.
In summary, the tiiuae/falcon-180B-chat model emerges as the clear leader, surpassing GPT-3.5 in both tasks. This model represents an excellent choice for hosting an in-house, open-source large language model (LLM) for question-answering systems. Its accessibility as a free, open-source option with multilingual capabilities is particularly advantageous. However, deploying it on a self-managed cloud or on-premise setup can be costly, often requiring multiple high-end GPUs. A more budget-friendly alternative might be the quantized version, Falcon-180B-Chat-GGUF, which operates on a single GPU (such as an Nvidia A100 or H100). Yet its performance compared to the original model remains to be verified.
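For reference, running such a GGUF build locally could look roughly like the sketch below with llama-cpp-python; the file path, quantization level and parameters are placeholders, and we have not verified this setup or its answer quality ourselves.

```python
# Sketch only: loading a quantized Falcon-180B-Chat GGUF file with llama-cpp-python.
# The file path, quantization level and context size are placeholders; we have not
# benchmarked this setup ourselves.
from llama_cpp import Llama

llm = Llama(
    model_path="models/falcon-180b-chat.Q4_K_M.gguf",  # placeholder local path
    n_gpu_layers=-1,  # offload all layers to the GPU if they fit
    n_ctx=2048,       # context window for the question-answering prompts
)

output = llm(
    "Question: Who were the two close friends mentioned in the story?\nAnswer:",
    max_tokens=64,
    stop=["\n"],
)
print(output["choices"][0]["text"].strip())
```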
Our results showed that open source models like tiiuae/falcon-180B-chat can be competitive with GPT-3.5 when it comes to answering difficult questions based on a given context, and hence can be used by businesses as an alternative to GPT-3.5. Using open source models also gives the ability to fine-tune the model according to business demands, along with greater flexibility and full control.
Future work
Possible future work involves testing the quantized version of the tiiuae/falcon-180B-chat model to see whether we can get the power of the original model while working on a single GPU (which might indeed be as good as the full version, as reported here), which is also cost effective and can save businesses a lot of money. If the performance of the quantized version is too low, it might be interesting to fine-tune it using QLoRA: adding low-rank matrices to the quantized matrices and then fine-tuning the model on the original dataset (RefinedWeb) that the tiiuae/falcon-180B-chat model was trained on. With this approach we could hope that the fine-tuning compensates for the downgrade that results from quantization, while the full-rank but quantized parameter skeleton provides a solid base for the fine-tuning. This would hopefully distill the 180B Falcon model into a much smaller model without loss of performance.
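A starting point for such a QLoRA experiment could look like the sketch below, combining 4-bit loading via bitsandbytes with PEFT LoRA adapters; the target modules, ranks and other hyperparameters are assumptions, and an actual run on a 180B model would still need substantial hardware and a full training loop.

```python
# Sketch of a possible QLoRA setup for a quantized Falcon checkpoint. The module
# names, ranks and hyperparameters are illustrative assumptions; a real run on a
# 180B model still needs substantial hardware and a proper training loop.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "tiiuae/falcon-180B-chat"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                    # 4-bit base weights stay frozen
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["query_key_value"],   # assumed attention projection name for Falcon
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()        # only the low-rank adapters are trained
```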
Appendix 1
Question: Who were the two close friends mentioned in the story?
Answer: The two close friends mentioned in the story were Karl and Gerda.
Question: How did Karl's behavior change after a splinter from the shattered mirror pierced him?
Answer: After a splinter from the shattered mirror pierced him, Karl turned into a very nasty boy, often insulting others, especially Gerda.
Question: How was Gerda able to break the evil spell on Karl at the end of the story?
Answer: Gerda was able to break the evil spell on Karl by throwing her arms around him and letting her teardrops drip onto his chest and heart. This act of love and emotion caused the evil spell to be broken.
Question: What object connected Karl's and Gerda’s houses, symbolizing their friendship in the story?
Answer: The sweet pea that grew on Karl's window sill spread across the street to entwine with Gerda's little rose bush, symbolizing their friendship.
Question: Where was the kingdom of the snow queen and how to reach there?
Answer: The kingdom of the Snow Queen was in Lapland, a place where all is icy cold. To reach there, Gerda traveled on the back of a reindeer through the frozen tundra, guided by the Northern Lights.
Question: Who helped Gerda in finding her friend?
Answer: A crow and a reindeer helped Gerda in finding her friend. The crow informed Gerda about Karl's whereabouts with the Snow Queen, and the reindeer carried her to the Snow Queen's kingdom.
Question: How did the snow queen take away Karl with her?
Answer: The Snow Queen approached Karl, asked him to tie his sledge to hers, and then they sped away into the sky, eventually landing in her icy kingdom.
Appendix 2
Download Appendix 2 - Overview Questions and Answers per LLM
Appendix 3
We first prompted GPT-4 with the set of questions and the given text and asked it to give the ideal answer for each question. We then asked GPT-4 to evaluate the answers given by the LLMs against the correct answers obtained in the previous step and grade them from A to D. We then translated the grades to the 1-4 scale using the following mapping (a small code sketch of this conversion follows the mapping):
A = 4
B = 3
C = 2
D = 1
A- = 3.66
B+ = 3.33
C- = 1.66
D- = 0.66
D+ = 1.33
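For completeness, this conversion can be expressed as a small lookup; the sketch below is illustrative only, and in particular the way the percentage is derived from the grade points is an assumption rather than a description of our exact script.

```python
# Sketch of the grade-to-score conversion above; only the grades listed in the
# mapping are included, so any other grade string raises a KeyError.
GRADE_POINTS = {
    "A": 4.0, "A-": 3.66,
    "B+": 3.33, "B": 3.0,
    "C": 2.0, "C-": 1.66,
    "D+": 1.33, "D": 1.0, "D-": 0.66,
}

def success_rate(grades: list[str]) -> float:
    """Average the grade points and express them as a share of the maximum score (4).
    Assumption: this is one plausible way to derive the percentages shown above."""
    points = [GRADE_POINTS[g] for g in grades]
    return 100.0 * sum(points) / (4.0 * len(points))

print(f"{success_rate(['A', 'B+', 'C-', 'A-']):.1f}%")  # example with made-up grades
```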