DistiLLM 01
DistiLLM is a casual LLM-related reading curation
I don’t like how social media sometimes equates ‘sharing’ with actually ‘reading’ a good article. Occasionally, a shared piece turns out to be a time-waster after you invest 15 minutes deciphering thousands of words. To fulfill two objectives: 1) challenging myself to read 52 papers this year as a New Year’s resolution, and 2) sparing you time amid information overload, I present you with this:
An Untitled LLM-related Reading Curation: DistiLLM
Alternatively, interpret it as:
Hacker News posts mostly filtered by the keyword “LLM” that don’t actually suck.
I hope this self-learning process could benefit some strangers somehow.
What happened in Feb 2024?
Let’s get started with some name-dropping:
- OpenAI’s Sora swept the Internet in the past few weeks. Besides feeling impressed and gradually losing trust in video content, we shouldn’t say anything conclusive while so few technical details are available.
- Related: Pika, probably the most popular text-to-video generative model/platform before Sora, released a lip sync demo and a sound effect demo. Honestly, it’s not that impressive to me.
- Related: Sora Videos, a collection of Sora-generated videos paired with their prompts.
- Google’s Gemini 1.5 supports over 1M context length.
Discussion on Hacker News highlights some skepticism about how Google manages to obtain a 1M context window.
- Technical report. Tedious, like a product manual.
- The killer app of Gemini Pro 1.5 is video. Truly multi-modal.
- Right after Gemini 1.5, Google released Gemma: Introducing new state-of-the-art open models. A technical report is available to the public as well.
ℹ️ Performance-wise, Gemma is slightly better than Mistral 7B v0.2 Instruct in terms of safety, but pretty close to or worse than it in instruction following.
- Mistral AI launches Mistral-Next, not officially, but you can access it via LMSys (Direct Chat - Model - `mistral-next`).
- This was later proven to be Mistral Large.
- Aya: Accelerating Multilingual AI Through Open Science. Cohere’s Aya is a crowd-sourced LLM that expands the horizon of multilingual support, covering more than 100 languages.
- LangChain: Announcing the General Availability of LangSmith and Our Series A Led By Sequoia Capital. The value of LangSmith is a unified DevOps platform for developing LLM apps from prototype to production. (P.S. Their UI improved hugely in the past year. Somehow I confuse their color palette with Cohere’s; their design languages look similar.)
This might be my read of the month:
- Representation Engineering Mistral-7B an Acid Trip.
- Paper discussed: Representation Engineering: A Top-Down Approach to AI Transparency, GitHub
- “While a long-term goal of mechanistic interpretability is to understand networks well enough to improve their safety, we find that many aspects of this goal can be addressed today through RepE. In particular, we develop improved baselines for reading and controlling representations and demonstrate that these RepE techniques can provide traction on a wide variety of safety-relevant problems, including truthfulness, honesty, hallucination, utility estimation, knowledge editing, jailbreaking, memorization, tracking emotional states, and avoiding power-seeking tendencies." RepE can help understand the model and improve safety.
- “In addition to demonstrating the broad potential of RepE, we also find that advances to RepE methods can lead to significant gains in specific areas, such as honesty. By increasing model honesty in a fully unsupervised manner, we achieve state-of-the-art results on TruthfulQA, improving over zero-shot accuracy by 18.1 percentage points and outperforming all prior methods. We also show how RepE techniques can be used across diverse scenarios to detect and control whether a model is lying. We hope that this work will accelerate progress in AI transparency by demonstrating the potential of a representational view. As AI systems become increasingly capable and complex, achieving better transparency will be crucial for enhancing their safety, trustworthiness, and accountability, enabling these technologies to benefit society while minimizing the associated risks." Advances in RepE methods can yield significant gains in specific areas such as honesty.
- Thought: Lots of issues could potentially be resolved by “honesty” and “truthfulness”; they are a great weapon for tackling hallucinations.
- A truthful model avoids asserting false statements. An honest model asserts what it thinks is true.
- Definition of RepE: … is a top-down approach to transparency research that treats representations as the fundamental unit of analysis, with the goal of understanding and controlling representations of high-level cognitive phenomena in neural networks.
- Two main areas of RepE: Reading and Control.
- Representation Reading seeks to locate emergent representations for high-level concepts and functions within a network.
- Extract various concepts (truthfulness, utility, probability, morality, and emotion), as well as functions which denote processes (lying and power-seeking).
- Pipeline: 1) Design stimulus and task; 2) Collect neural activity; 3) Construct a linear model.
- Representation Control seeks to modify or control the internal representations of concepts and functions.
- Representation Engineering (RepE): calculating a “control vector” that can be read from or added to model activations during inference to interpret or control the model’s behavior, without prompt engineering or finetuning.
- A control vector is a vector (a list of vectors, one per layer) that you can apply to model activations during inference to control the model’s behavior without additional prompting. All a control vector does is modify the value of `hidden_state` in a desired way. Consider it as a simplified, condensed version of “prompting”.
- Is it hard to achieve? See the article for a complete solution script. Plus an example notebook.
- “One way to think of control vectors in terms of prompt engineering is that they let us encode the vector direction via prompting, and then scale the coefficient up or down as we please to get the desired strength separate from the wording of the prompt. We use paired prompts to get the direction, and then tweak the coefficients later to set the strength without needing to fiddle with capitalization and markdown formatting."
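To make this concrete, here is a minimal sketch of the recipe in plain transformers/PyTorch (not the article’s exact code): derive a per-layer direction from the difference of mean hidden states on a pair of contrasting prompts, then add a scaled copy of that direction to the layer’s output during generation. The model name, layer index, and coefficient below are illustrative assumptions.

```python
# Minimal control-vector sketch; model, layer, and coefficient are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "mistralai/Mistral-7B-Instruct-v0.1"  # assumption: any HF causal LM with .model.layers
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16, device_map="auto")

LAYER = 15  # an arbitrary middle decoder layer

def mean_hidden(prompt: str) -> torch.Tensor:
    """Mean hidden state of `prompt` at the output of decoder layer LAYER."""
    ids = tok(prompt, return_tensors="pt").to(model.device)
    out = model(**ids, output_hidden_states=True)
    return out.hidden_states[LAYER + 1].mean(dim=1).squeeze(0)  # index 0 is the embeddings

# Paired prompts define the direction ("honest" minus "dishonest").
pos = mean_hidden("Pretend you are an extremely honest assistant.")
neg = mean_hidden("Pretend you are an extremely deceptive assistant.")
control = pos - neg
control = control / control.norm()  # unit direction; strength comes from the coefficient

coeff = 4.0  # the knob you scale up or down, separate from the wording of the prompts

def add_control(module, inputs, output):
    # Decoder layers return a tuple whose first element is the hidden states.
    hidden = output[0] + coeff * control.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.model.layers[LAYER].register_forward_hook(add_control)
ids = tok("Tell me about your day.", return_tensors="pt").to(model.device)
print(tok.decode(model.generate(**ids, max_new_tokens=60)[0], skip_special_tokens=True))
handle.remove()  # remove the hook to restore default behavior
```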
- Related paper: Anthropic’s work on monosemantic features (Towards Monosemanticity1: Decomposing Language Models With Dictionary Learning)
- Mechanistic interpretability seeks to understand neural networks by breaking them into components, but the most natural computational unit of a neural network, the neuron, is not understandable by humans, simply because many neurons are polysemantic, meaning they “respond to mixtures of seemingly unrelated inputs."
- Most features are interpretable and explain a non-trivial portion of the MLP layer. A small number of features activate only in highly specific contexts, and features are much more interpretable than neurons. Features refine and split in runs with more learned sparse features, which suggests that dictionary learning can be understood as a process of feature splitting that reflects something deep about the geometry of superposition.
- Simon Willison: “The examples Theia provides, using control vectors to make Mistral 7B more or less honest, trippy, lazy, creative and more, are very convincing."
- HN discussion.
- MultiModal RAG for Advanced Video Processing with LlamaIndex & LanceDB. LlamaIndex dissects a multimodal (video, specifically) RAG as the following steps: video downloading, video processing, building of the index and vector store, retrieving relevant images and context, reasoning and response generation.
- `eugeneyan/open-llm`, a list of open LLMs available for commercial use 📋.
- DS/ML bookclub’s new reading Build a Large Language Model (From Scratch), work-in-progress.
- Introduction to Matryoshka Embedding2.
- An embedding is a fixed-length numerical representation of a more complex object. Matryoshka embedding: Kusupati et al. (2022) were inspired to create embedding models whose embeddings could reasonably be shrunk without suffering too much on performance. (Trained so that small, truncated embeddings are still useful.)
- How Matryoshka Embedding models are trained:
- Normally, a training step for an embedding model involves producing embeddings for training batches, then using some loss function to create a loss value (quality). The optimizer will adjust the model weights throughout training to reduce the loss value.
- For Matryoshka Representation Learning (MRL), one difference is that the loss function measures not only the quality of the full-size embeddings, but also the quality of the embeddings at various smaller dimensionalities.
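Sketched below is a toy version of that training twist under a simple in-batch contrastive setup (the dimensions, temperature, and loss choice are assumptions, not the paper’s exact recipe): the same loss is summed over several prefix lengths of the embedding, so that even the first 64 dimensions are pushed to be useful on their own.

```python
# Toy Matryoshka-style loss: one contrastive loss term per truncation dimension.
import torch
import torch.nn.functional as F

def matryoshka_loss(emb_a, emb_b, labels, dims=(64, 128, 256, 768)):
    """Sum an in-batch contrastive loss over several prefix lengths of the embeddings."""
    total = 0.0
    for d in dims:
        a = F.normalize(emb_a[:, :d], dim=-1)
        b = F.normalize(emb_b[:, :d], dim=-1)
        logits = a @ b.T / 0.05            # cosine similarities with temperature 0.05
        total = total + F.cross_entropy(logits, labels)
    return total

emb_a, emb_b = torch.randn(8, 768), torch.randn(8, 768)   # stand-ins for model outputs
print(matryoshka_loss(emb_a, emb_b, torch.arange(8)))      # diagonal pairs are positives
```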
- How to use Matryoshka Embedding models?
- Difference: after receiving the embeddings, we can optionally truncate them to a smaller dimensionality.
- Faster downstream tasks processing does not guarantee faster embedding generation.
- Relevant paper: Matryoshka Representation Learning
- Demo: Adaptive Retrieval with Matryoshka Embeddings.
- ”…instead of a fixed 768 dimension embedding vector you can trade size for quality and drop that size all the way down to 64, while still maintaining strong semantically relevant results."
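In practice, using such a model mostly looks like using any other embedding model, with one extra step: slice off the prefix you want and re-normalize. A minimal sketch, assuming a Matryoshka-trained sentence-transformers model (the model name here is just an example):

```python
# Truncating a Matryoshka-style embedding and re-normalizing it for cosine similarity.
import numpy as np
from sentence_transformers import SentenceTransformer

# Assumption: any Matryoshka-trained embedding model works the same way.
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)
full = model.encode(["What is a Matryoshka embedding?"])       # e.g. shape (1, 768)

dim = 64                                                       # keep only the first 64 dims
small = full[:, :dim]
small = small / np.linalg.norm(small, axis=1, keepdims=True)   # re-normalize after truncation
print(full.shape, small.shape)
```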
- Fine-Tuning Large Language Models by SingleStore
- Challenges include: tailored outputs (proper fine-tuning and training needed), missing context (no domain-specific details), specialized vocabulary.
- Techniques:
- Full model fine-tuning: LLM as a blank state -> retrain all layers on the target data.
- Feature-based fine-tuning: Only specific layers or components of the LLM are retrained.
- Parameter-efficient fine-tuning: Techniques like LoRA (Low-Rank Adaptation) use far fewer trainable parameters for fine-tuning.
- RLHF fine-tuning: Use human feedback instead of labeled data for training.
- Thinking about High-Quality Human Data by Lilian Weng
- Synthetic Data for Finetuning: Distillation and Self-Improvement by Eugene Yan
- How to generate and use synthetic data for finetuning? Sounds like a lucrative and attractive idea, doesn’t it?
- Deep dive into two approaches to generate synthetic data: distillation from a stronger model or self-improvement on the model’s own output.
- Distillation transfers knowledge and reasoning skills from a stronger teacher to a weaker but more efficient student.
- Self-improvement has the model learn from its responses via an iterative loop.
- Paper: [2402.04315] Training Language Models to Generate Text with Citations via Fine-grained Rewards
- What is a Mixtral of Experts?
- Mixtral is a mixture-of-experts (MoE) network. The idea is to divide a complex problem into simpler parts, each handled by a specialized expert. The experts are typically individual models or neural networks good at solving specific tasks or operating on particular data types. The gating network then decides how to weight the output of each expert for a given input, effectively choosing which experts are most relevant or blending their outputs in a meaningful way.
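As a rough illustration of the routing idea (a didactic toy, not Mixtral’s actual implementation; all sizes are arbitrary), a top-2 gate scores the experts per token, keeps the two best, and blends their outputs with softmaxed weights:

```python
# Toy top-2 mixture-of-experts layer: a gate picks and weights two experts per token.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model=64, d_ff=256, n_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.gate = nn.Linear(d_model, n_experts)   # the gating / router network
        self.top_k = top_k

    def forward(self, x):                            # x: (tokens, d_model)
        logits = self.gate(x)                        # (tokens, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)         # blend only the chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                # tokens routed to expert e at rank k
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(1) * expert(x[mask])
        return out

print(MoELayer()(torch.randn(10, 64)).shape)   # torch.Size([10, 64])
```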
- More about Mistral:
- One of the founding members of OpenAI, high-profile ML engineer Andrej Karpathy, announced his second departure from OpenAI and immediately dropped a video tutorial, Let’s build the GPT Tokenizer, on his YouTube channel. Some of his previous videos are classics, too:
- Fine-Grained Human Feedback by Databricks
- Related paper: [2306.01693] Fine-Grained Human Feedback Gives Better Rewards for Language Model Training
- Fine-Grained RLHF (reinforcement learning from human feedback), a framework that enables training and learning from reward functions that are fine-grained in two ways: density and diversity.
- Density: by providing a reward after every segment (e.g., a sentence) is generated.
- Diversity: by incorporating multiple reward models associated with different feedback types (e.g., factual incorrectness, irrelevance, and information incompleteness).
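As a rough sketch of what “dense and diverse” rewards mean in code (the reward functions below are made-up stand-ins, not the paper’s trained reward models): each sentence gets its own reward, computed as a weighted sum over several feedback types.

```python
# Fine-grained reward sketch: one reward per sentence (density), combined across
# several reward types (diversity). The reward functions are illustrative stand-ins.
def fine_grained_rewards(sentences, reward_fns, weights):
    """Return one scalar reward per sentence, a weighted sum over reward types."""
    return [sum(w * fn(s) for fn, w in zip(reward_fns, weights)) for s in sentences]

reward_fns = [
    lambda s: 0.0 if "unsure" in s else 1.0,        # stand-in for a factuality reward
    lambda s: 1.0 if len(s.split()) > 3 else 0.5,   # stand-in for an informativeness reward
]
print(fine_grained_rewards(["The sky is blue.", "I am unsure."], reward_fns, [0.7, 0.3]))
```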
- [2401.11817] Hallucination is Inevitable: An Innate Limitation of Large Language Models
- In this (quite theoretical) paper, the authors define hallucination as inconsistency between a computable LLM and a computable ground-truth function, and show it to be inevitable in a formal-world setup. The real world is likely even harder, given the additional factors the formal setup does not consider.
- Classification of Hallucination: intrinsic-extrinsic dichotomy
- Intrinsic hallucination: when LLM outputs contradict the provided input.
- Extrinsic hallucination: when LLM outputs cannot be verified by the information in the input.
- Causes of Hallucination:
- issues in data (poor quality, misinformation, bias, outdated, also possibly long-tailed and hard-to-recall knowledge),
- training (architectural and strategic deficiencies, e.g. exposure bias or attention diluted as sequence length increases),
- and inference stages (sampling randomness, insufficient context attention, and softmax bottleneck).
- Mamba Explained and Mamba: The Easy Way
- Two introductions to Mamba, an alternative class of models to Transformers. Mamba is one of the State Space Models (SSMs), with similar performance but faster inference.
- Segment Anything
- RAG-related: [Querying a network of knowledge with llama-index-networks](https://www.llamaindex.ai/blog/querying-a-network-of-knowledge-with-llama-index-networks-d784b4c3006f)
- GGUF, the long way around
- Introduces what GGUF is by walking through how a model comes into being.
- What is a model?
- Input: training data corpuses aggregated from human-generated natural language content.
- Algorithm:
- Convert data into embeddings
- Positionally encoding the embeddings to provide information about where the words are in relation to each other in the sequence
- Creating multi-headed self-attention for each word in relation to every other word in the sequence, based on an initialized combination of weights
- Normalize layers via softmax
- Run the resulting matrix through a feedforward neural network
- Project the output into the correct vector space for the desired task
- Calculate loss and then update model parameters
- Output: the statistical likelihood that any given word completes a phrase.
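To make the attention and softmax steps above a little more concrete, here is a tiny single-head self-attention sketch (no masking, no multi-head split; shapes and values are purely illustrative):

```python
# Single-head self-attention over 5 token embeddings (illustrative only).
import torch
import torch.nn.functional as F

d = 16
x = torch.randn(5, d)                                  # 5 position-encoded token embeddings
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))     # initialized weight matrices
q, k, v = x @ Wq, x @ Wk, x @ Wv
attn = F.softmax(q @ k.T / d ** 0.5, dim=-1)           # how much each token attends to the others
out = attn @ v                                         # contextualized representations
print(out.shape)                                       # torch.Size([5, 16])
```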
- Originally, people used pickle to write objects created by PyTorch into files. But this caused ML model supply-chain security issues once practitioners started creating and uploading pickled model artifacts to model hubs.
- Safetensors, developed by HuggingFace, is more efficient and ergonomic: not bound to Python as closely as Pickle, available across all languages. It works with tensors specifically as a datatype. It’s also written in Rust with more safety, faster reading/writing.
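A minimal sketch of the safetensors workflow (the tensor names and file name are made up): a dict of tensors plus a JSON header go into the file, and nothing executable is pickled along the way.

```python
# Writing and reading tensors with safetensors; names are illustrative.
import torch
from safetensors.torch import save_file, load_file

weights = {"embedding.weight": torch.randn(100, 64), "lm_head.weight": torch.randn(64, 100)}
save_file(weights, "model.safetensors")     # tensor data + JSON header, no pickled code
restored = load_file("model.safetensors")
print(restored["embedding.weight"].shape)   # torch.Size([100, 64])
```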
- GGML (GPT-Generated Model Language) was initially both a library and a complementary format created specifically for on-edge inference for Whisper. You can also perform fine-tuning with it, but generally it’s used to take models trained with PyTorch in GPU Linux-based environments and convert them to GGML to run on Apple Silicon.
- The resulting GGML file compresses all the following files into one: a magic number with an optional version number, model-specific hyperparameters, an embedded vocabulary, and a list of tensors.
- Why GGML?
- Using 16-bit floating point representations of model weights instead of torch’s default 32-bit, leads to 50% less memory at compute and inference time without significant loss in model accuracy.
- Using C enables more efficient memory allocation than Python.
- Optimized for Apple Silicon chips.
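The 50% figure is easy to sanity-check: casting a float32 tensor to float16 halves its raw storage.

```python
# Half precision halves raw tensor storage (4 bytes per value -> 2 bytes per value).
import torch

w32 = torch.randn(1024, 1024, dtype=torch.float32)
w16 = w32.to(torch.float16)
print(w32.element_size() * w32.nelement())   # 4194304 bytes
print(w16.element_size() * w16.nelement())   # 2097152 bytes
```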
- Why not GGML?
- Everything, data and metadata included, was written into one file, so whenever a model added hyperparameters, it would break backward compatibility. Plus no model architecture metadata is in the file, and each architecture required its own conversion script.
- GGUF (GPT-Generated Unified Format) has the same type of layout as GGML, with metadata and tensor data in a single file, but is in addition designed to be backward compatible: hyperparameters are stored in key-value lookup tables instead of an untyped list, so new fields can be added without breaking older models.
- Predictive Human Preference: From Model Ranking to Model Routing by Chip Huyen
- A pretty solid long read comparing the models available as of Feb 2024. The underlying data is just a battle log kept in an arena.
- Always ranked by human preference.
- Not an Elo score, but the Bradley-Terry algorithm (a toy fit is sketched after this list).
- “Better” models don’t always win, sometimes there are prompts for which other models can outperform GPT-4.
- Possible use case in model routing3, which will eventually reduce costs and latency while improving response quality.
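For reference, here is a toy Bradley-Terry fit on made-up pairwise win counts; arena-style leaderboards estimate model strengths this way rather than with sequential Elo updates. The data and the simple iterative MLE scheme below are illustrative.

```python
# Toy Bradley-Terry fit: wins[i][j] = times model i beat model j (made-up numbers).
import numpy as np

wins = np.array([[0, 8, 9],
                 [2, 0, 6],
                 [1, 4, 0]], dtype=float)
n = wins + wins.T                        # games played between each pair
p = np.ones(len(wins))                   # strength parameters, initialized to 1

for _ in range(100):                     # simple iterative maximum-likelihood updates
    for i in range(len(p)):
        denom = sum(n[i, j] / (p[i] + p[j]) for j in range(len(p)) if j != i)
        p[i] = wins[i].sum() / denom
    p /= p.sum()                         # normalize; only ratios are identifiable

print(np.round(p, 3))                    # higher = stronger under the Bradley-Terry model
```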
- Fine-tuning a large language model on Kaggle Notebooks (or even on your own computer) for solving real-world tasks
- What we learned in 6 months of working on a CodeGen dev tool GPT Pilot
A taste of other interesting articles before ending this month’s contents
These are probably not related to LLM, but personally I find them interesting to share:
- Two discussions on HN:
- Observable 2.0. JS on the frontend, any language on the backend. Might be the future of interactive data visualization.
- Don’t Mock Machine Learning Models in Unit Tests
- A beginner’s guide to making beautiful slides for your talks
- `numpy-100`: 100 numpy exercises with solutions
- A bird’s eye view of Polars
- A love letter to Apache Echarts
- latent-scope, a scientific instrument for investigating latent spaces
- Aggregating Measures of Uncertainty
- Modern Polars, a side-by-side comparison of the Polars and Pandas libraries.
- Python for Data Analysis, 3E by Wes McKinney
- The Querynomicon: An Introduction to SQL for Wary Data Scientists
- A piece of data-driven storytelling about the author’s first marathon in Hong Kong.
- Visual Anagrams: Generating Multi-View Optical Illusions with Diffusion Models
1. “Monosemanticity” is a neural network design in which each neuron is “dedicated to a single, specific concept.” This one-to-one design increases the model’s interpretability. ↩︎
2. Matryoshka refers to the nested Russian dolls. ↩︎
3. An example of model routing: if we know in advance that for a given prompt users will prefer model A’s response over model B’s, and model A is usually cheaper/faster, then we can route the prompt to model A. ↩︎