Life has a funny way of keeping us on our toes, doesn’t it? These past few months have been a whirlwind of errands filling my to-do lists. Despite the chaos, I did manage to carve out a bit of downtime to put together this blog post. But let’s face it: long-form posts won’t ever be my long-term commitment. Honestly, my goal is not (or maybe never was) to keep a rundown of what’s happening in and around the industry. Rather, I want this space to be a reflection of my own thoughts and encounters.

Expect shorter, bite-sized tidbits I stumble upon in my day-to-day life. After all, I don’t wanna carry more burden with me :)

What happened in March 2024?

  • March 5th.
    • You can now train a 70b language model at home, a quite detailed walkthrough of how the current LLM training procedure is put together.
      • HN discussion
      • A normal desktop computer setup, combining FSDP and QLoRA, can train a 70b language model.
      • QLoRA combines two advanced techniques in modern neural networks: quantization and LoRA.
        • Quantization stores the model weights in 4 or even fewer bits.
        • Once quantized, the model cannot be trained any further using regular approaches: rounding to discrete levels is piecewise constant, so gradient descent observes zero gradients almost everywhere.
        • In this sense, quantized models are good for inference but no good for training.
        • LoRA (“Low-Rank Adaptation of Large Language Models”) doesn’t train the whole large language model at all; instead it adds small “adaptors”. Use a quantized base model, which is left completely unchanged by training, and add trainable LoRA adaptors that are not quantized.
        • Batching trains multiple sequences together to save time; the batch size available depends on the memory left over after loading the model weights.
      • FSDP (Fully Sharded Data Parallel)
        • One potential solution to the issue above (not enough memory) is to use more GPUs, e.g., by placing a few layers of the model on each card. When training, run the first few layers on the 1st GPU, then the next few on the 2nd, and so on. The transformers setting device_map='auto' does exactly this. Downside: only one GPU is active at a time, while all the others wait for their “turn”. Note this as IDEA A.
        • Distributed Data Parallel (DDP) is the good old standard approach to training models across multiple GPUs: copy the full model to each GPU to multiply training throughput. But DDP doesn’t work when the full model can’t fit on a single GPU. Note this as IDEA B.
        • FSDP = `device_map='auto'` + DDP
        • FSDP is a library that “shards” a large model by splitting its parameters across multiple GPUs, allowing all the GPUs to be used simultaneously. When a layer of the neural network is computed on a particular GPU during training, all the required shards are copied there; once the computation is done, the copied parts are deleted from that GPU. For efficiency, the next layer’s shards can be copied over while the current layer is still busy computing.
      • In a later post, Answer.AI - Enabling 70B Finetuning on Consumer GPUs, the team demonstrated how to add FSDP and QLoRA support to quantization libraries and training frameworks.
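The two QLoRA ingredients above (quantization killing gradients, LoRA adaptors working around it) can be sketched in a few lines of plain Python. This is a toy illustration only: real QLoRA uses block-wise NF4 quantization via bitsandbytes and wires adaptors into every linear layer via PEFT, none of which is shown here.

```python
# Toy sketch of the two QLoRA ingredients, in plain Python.

def quantize_4bit(w, scale):
    """Snap each float weight to one of 15 integer levels in [-7, 7]."""
    return [max(-7, min(7, round(v / scale))) for v in w]

w = [0.1, -0.8, 0.33, 2.1, -1.7, 0.05]   # pretend these are model weights
scale = max(abs(v) for v in w) / 7
q = quantize_4bit(w, scale)

# Rounding is piecewise constant: a tiny nudge to the float weights leaves
# every 4-bit code unchanged, so gradient descent "through" the quantizer
# sees zero gradient almost everywhere.
q_nudged = quantize_4bit([v + 1e-6 for v in w], scale)
assert q_nudged == q

# LoRA sidesteps this: freeze the (quantized) base weight W and add a
# trainable low-rank update B @ A (here rank 1 on a 2x2 weight, so we
# train 4 numbers instead of... well, 4 -- but for a d x d layer it is
# 2*r*d numbers instead of d*d).
def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

W = [[0.5, -0.2], [0.1, 0.9]]            # frozen base weight, never updated
B = [[0.0], [0.0]]                       # trainable, zero-initialized
A = [[0.01, -0.02]]                      # trainable

delta = matmul(B, A)                     # the rank-1 update
W_eff = [[w_ij + d_ij for w_ij, d_ij in zip(wr, dr)]
         for wr, dr in zip(W, delta)]

# With B zero-initialized, training starts exactly at the base model.
assert W_eff == W
```

The zero-init on B is the standard LoRA trick: the effective weight starts identical to the base model, and only the small A/B matrices receive gradients during finetuning.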
  • March 31st.
    • A Builder’s Guide to Evals for LLM-based Applications
      • Classification/Extraction: ROC, PR, class distributions
      • Summarization: Consistency via NLI, relevance via reward model, length checks
        • 4 key dimensions of evaluating abstractive summaries:
          • Fluency: Are sentences in the summary well-formed and easy to read? We want to avoid grammatical errors, random capitalization, etc. (not a problem for modern LLMs)
          • Coherence: Does the summary as a whole make sense? It should be well-structured and logically organized, not just a jumble of information. (a lesser concern for modern LLMs)
          • Consistency: Does the summary accurately reflect the content of the source document? We want to ensure there’s no new or contradictory information added.
          • Relevance: Does the summary focus on the most important aspects of the source document? It should include key points and exclude less relevant details.
        • We can reframe the latter two as binary classification and reuse the metrics from classification above.
        • To measure factual consistency, we can finetune a natural language inference (NLI¹) model as a learned metric.
        • Treat the source document as the premise and the generated summary as the hypothesis. If the summary contradicts the source, the summary is factually inconsistent aka a hallucination.
        • For relevance, we can develop a learned metric with the same paradigm, or train a reward model on human preferences.
        • Opinion summarization: generate a summary that captures the key aspects and associated sentiments from a set of opinions, such as customer feedback, social media, or product reviews. We adapt the consistency and relevance metrics into:
          • Sentiment consistency: For each key aspect, does the summary accurately reflect the overall sentiment expressed?
          • Aspect relevance: Does the summary cover the main topics discussed?
      • Translation: Quality measure via chrF, BLEURT, COMET, COMETKiwi
      • Copyright: Exact regurgitation, near-extract reproduction
      • Toxicity: Proportion of toxic generations on regular and toxic prompts
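Of the translation metrics above, chrF is simple enough to sketch from scratch: it is an F-score over character n-grams. Below is a simplified plain-Python version; the real metric (as implemented in sacreBLEU) also handles multiple references, word n-grams (chrF++), and other details, so treat this as a sketch of the idea rather than a drop-in replacement.

```python
from collections import Counter

def char_ngrams(text: str, n: int) -> Counter:
    """Character n-grams with whitespace removed (chrF ignores spaces)."""
    s = "".join(text.split())
    return Counter(s[i : i + n] for i in range(len(s) - n + 1))

def chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF: average character n-gram F_beta over n = 1..max_n.

    beta = 2 weights recall twice as heavily as precision, as in
    the standard chrF2 variant.
    """
    scores = []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # strings shorter than n contribute nothing
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precision = overlap / sum(hyp.values())
        recall = overlap / sum(ref.values())
        if precision + recall == 0:
            scores.append(0.0)
            continue
        f = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
        scores.append(f)
    return sum(scores) / len(scores) if scores else 0.0
```

A perfect match scores 1.0, fully disjoint strings score 0.0, and partial character overlap lands in between, which is what makes chrF more forgiving of morphological variation than word-level metrics like BLEU.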

Some other stuff


  1. Given a premise and a hypothesis, the task is to predict whether the hypothesis is entailed by (logically flows from), neutral to, or contradicts the premise. ↩︎