Here I am, again on my quick lunch break, bringing you the third installment in the series where I curate NLP topics/blogs/papers. I must confess, keeping up with the fast-paced world of NLP while juggling my own schedule has been overwhelming lately. In fact, as I type this, there’s a pile of travel bags eyeing me, begging to be packed for my upcoming trip. So I find myself wondering whether a good old “copy and paste” might be the way to go, just for this month.

What happened in April 2024?

Something that didn’t happen in April, and isn’t exactly LLM news, but:

My notes on hallucination

I’ve been studying Representation Engineering (mentioned in a previous post) and recently spent some time on hallucination.

The current status quo of hallucination spotting is empirical: once you see it, you call it a hallucination. As time goes by, you might form an overall impression of how often a model hallucinates, even though you are not sure your prompts are controlled. And in a different scenario, if a prompt is altered, will the model hallucinate the same way as before, or totally differently?

We don’t have universal answers to these questions.

Plus, what is the ultimate goal of hallucination evaluation? Just to say that one model is superior to the others? Isn’t it possible that model A hallucinates more in area X, while model B hallucinates more in area Y?

I stumbled upon this leaderboard (plus an associated model) from Vectara:

  • Vectara’s Hughes Hallucination Evaluation Model (HHEM) leaderboard on HuggingFace.
    • The methodology is explained in the blog post “Cut the Bull…. Detecting Hallucinations in Large Language Models” (RIP, Simon.)
    • Vectara trained a model to detect hallucinations in LLM outputs, using open-source datasets from factual consistency research on summarization models. They fed the same prompt to multiple SOTA LLMs at temperature 0, asking each to summarize the facts presented in open-source documents (the CNN/Daily Mail corpus).
    • Determining hallucinations is impossible to do for any ad hoc question since it’s not known precisely what data every LLM is trained on. In addition, having a model that can determine whether any response was hallucinated without a reference source requires solving the hallucination problem and presumably training a model as large or larger than these evaluated LLMs.
    • “Arguably the best approach for reducing hallucinations in LLM responses is to ground the responses in an existing knowledge source…”
    • “Thus if we can measure how accurate an LLM is at summarizing data, i.e., acting as a reader model, we can estimate how accurate these systems are when provided with accurate search results.”
    • vectara/hallucination_evaluation_model · Hugging Face
    • When evaluating, consider accuracy, hallucination rate, average summary length, and answer rate (a minimal scoring sketch follows after this list).
    • The summarization prompt used: “You are a chat bot answering questions using data. You must stick to the answers provided solely by the text in the passage provided. You are asked the question ‘Provide a concise summary of the following passage, covering the core pieces of information described.’ <PASSAGE>’”
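
To make this concrete, here is a minimal sketch of scoring (passage, summary) pairs with the HHEM model and turning the scores into a hallucination rate. I’m assuming the cross-encoder-style interface shown on the model card; the 0.5 threshold and the toy pairs are my own illustrative choices, not Vectara’s official evaluation setup.

```python
# Hedged sketch: score (passage, summary) pairs for factual consistency
# with Vectara's HHEM model, then turn the scores into a hallucination rate.
# Assumes the cross-encoder interface from the model card; the 0.5 threshold
# and the example pairs below are illustrative only.
from sentence_transformers import CrossEncoder

model = CrossEncoder("vectara/hallucination_evaluation_model")

# Each pair is (source passage, model-generated summary).
pairs = [
    ("A man walks into a bar and buys a drink.",
     "A bloke swigs alcohol at a pub."),
    ("The capital of France is Paris.",
     "The capital of France is Lyon."),
]

# Higher score = more factually consistent with the source (closer to 1.0).
scores = model.predict(pairs)

# Treat a summary as hallucinated if its consistency score falls below
# an (arbitrary, illustrative) threshold.
THRESHOLD = 0.5
hallucinated = [s < THRESHOLD for s in scores]

hallucination_rate = sum(hallucinated) / len(hallucinated)
accuracy = 1.0 - hallucination_rate
print(f"scores: {scores}")
print(f"hallucination rate: {hallucination_rate:.2%}, accuracy: {accuracy:.2%}")
```

In the actual leaderboard setup, the summaries would be generated by each evaluated LLM at temperature 0 from the CNN/Daily Mail documents, and the answer rate would track how often a model produced a summary at all rather than refusing.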