The Attention Is All You Need paper proposed the Transformer architecture as an improvement over the dominant encoder-decoder models of the time (both recurrent and convolutional). Those models used an attention mechanism to connect the encoder and decoder, but the Transformer architecture flipped the script, putting the attention mechanism at the center. An early application of the Transformer architecture was BERT, which used the Transformer as an encoder. Later models such as BART used encoder and decoder Transformer components in a sequence-to-sequence setup. Since then, there has been an explosion of variants around this basic model, accompanied by a steady breaking of benchmarks at tasks where older recurrent and convolutional sequence-to-sequence models reigned supreme.
A second major breakthrough was the emergence of decoder-only Transformer models for text generation. Early models were less than encouraging, but as researchers trained ever larger models on ever larger datasets, their text generation capabilities improved to the point where they became viable candidates for use as pre-trained, general-purpose inference models. These models are also based on Transformers, but are generally differentiated by calling them Large Language Models (LLMs) or Foundation Models (FMs).
From a user's point of view, once you get past the slightly larger compute requirements, the first category (BERT-like Transformer models) is actually easier to fine-tune for custom tasks than its predecessors, thanks to tooling from libraries such as HuggingFace Transformers and SentenceTransformers. The second category (LLMs) was, at least initially, the domain of compute- and data-rich organizations, which would create these models and make them available to others over an HTTP API as inference-only models, often for a fee. Because of the massive number of parameters and volume of training data, these models were generalized enough to do inference on diverse tasks in diverse domains without additional fine-tuning. Of course, because they were generative models, their outputs were not deterministic, prompting cautions such as the On the Dangers of Stochastic Parrots paper, as well as patterns to mitigate this, like Retrieval Augmented Generation (RAG) and Chain of Thought (CoT) prompting. More recently, fine-tuning has become practical for this class of models with the advent of Parameter Efficient Fine Tuning (PEFT) techniques. And with the advent of multimodal inputs and reasoning capabilities, they are now more than just Large Language Models.
Anyway, the point of this (probably incomplete) history lesson is that the Transformers in Action book by Nicole Koenigstein, which I am reviewing here, primarily covers Transformers in the second category, except for the first two chapters, which cover the basics of the Transformer architecture. If you are more interested in the first category, I would recommend Transformers for Natural Language Processing by Denis Rothman, which I have reviewed on Amazon previously.
Back to the review. The book is organized into three parts: Part 1 consists of Chapters 1 and 2, Part 2 of Chapters 3-5, and Part 3 of Chapters 6-10.
In Part 1, Chapter 1 describes the Transformer architecture at a high level, how it incorporates ideas from earlier neural models, and how it is different from them. It covers the idea of in-context learning (zero-shot and few-shot), the distinguishing feature of Transformer-based LLMs. Chapter 2 does a deep dive into the Transformer architecture and its components, covering ideas such as the stacked encoder-decoder, Add and Norm (LayerNorm) layers, the Query-Key-Value attention mechanism, and the position-wise Feed Forward Network (FFN).
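To make the Query-Key-Value mechanism concrete, here is a minimal NumPy sketch of scaled dot-product attention. This is my own toy illustration, not code from the book; the function and variable names are mine.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max before exponentiating, for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # similarity of each query to each key
    weights = softmax(scores)          # each row is a distribution over keys
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))  # 4 positions, d_k = 8
out, w = scaled_dot_product_attention(Q, K, V)
```

Each row of `w` sums to 1, so the output for each query position is a convex combination of the value vectors.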
In Part 2, Chapter 3 moves the discussion to decoder-only Transformers, i.e. Large Language Models, the central theme of this book. It describes variants of the Transformer architecture: encoder-only models such as BERT versus decoder-only autoregressive models that predict the next token. It touches on Causal Attention and the KV Cache as necessary ingredients for this type of model, covers the use of encoder-only models as embedding models and how that relates to RAG, and mentions Mixture of Experts (MoE) as a promising architectural variant of decoder-only models.
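The causal masking that distinguishes decoder-only models can be sketched in a few lines. This toy version (names mine, and it skips the learned Q/K/V projections, using the input directly) forces each position's attention weights over future positions to zero.

```python
import numpy as np

def causal_self_attention(X):
    """Toy self-attention where position i can only attend to positions <= i."""
    d_k = X.shape[-1]
    scores = X @ X.T / np.sqrt(d_k)
    # Mask out the strict upper triangle: -inf becomes 0 after softmax
    scores[np.triu(np.ones(scores.shape, dtype=bool), k=1)] = -np.inf
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)
    return weights @ X, weights

rng = np.random.default_rng(1)
out, w = causal_self_attention(rng.normal(size=(5, 16)))
```

The strict upper triangle of `w` is all zeros: no token attends to the future, which is what makes next-token prediction (and KV caching) possible.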
Chapter 4 covers the parameters that control the behavior of LLMs, such as top-k and top-p sampling and temperature, as well as prompting styles: zero-shot, few-shot, CoT (Chain of Thought), Contrastive CoT where both right and wrong reasoning traces are provided, Chain of Verification (CoVe) where the model reflects on and verifies its output, Tree of Thought (ToT) which introduces intermediate steps in problem-solving traces, and Thread of Thought (ThoT) which partitions the problem into sub-problems and combines the threads from the sub-solutions into the final generation.
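As a rough illustration of how temperature, top-k and top-p interact, here is a toy next-token sampler in NumPy. This is a sketch under my own naming, not the book's code: temperature rescales the logits, top-k keeps the k most likely tokens, and top-p keeps the smallest set of tokens whose cumulative probability reaches p.

```python
import numpy as np

def sample_filtered(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    """Sample a token id after temperature scaling and top-k / top-p filtering."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]           # tokens by descending probability
    keep = np.ones(len(probs), dtype=bool)
    if top_k is not None:                     # keep only the k most likely tokens
        keep[order[top_k:]] = False
    if top_p is not None:                     # keep smallest set with mass >= top_p
        cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
        keep[order[cutoff:]] = False
    probs = np.where(keep, probs, 0.0)
    probs /= probs.sum()                      # renormalize over surviving tokens
    return rng.choice(len(probs), p=probs)

logits = [2.0, 1.0, 0.1, -1.0]
tok = sample_filtered(logits, temperature=0.7, top_k=2, rng=np.random.default_rng(42))
```

With `top_k=2` only the two highest-scoring tokens can ever be sampled, and `top_k=1` degenerates to greedy decoding.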
Chapter 5 covers Preference Alignment and RAG. The first part, Preference Alignment, is aimed at fine-tuning the behavior of an LLM toward a particular domain or behavior. It covers Reinforcement Learning from Human Feedback (RLHF) as a Markov Decision Process (MDP) and the use of Proximal Policy Optimization (PPO). It then describes DPO (Direct Preference Optimization), which does not need an explicit reward model, and GRPO (Group Relative Policy Optimization), which removes the need for an explicit value function. Both DPO and GRPO are preceded by SFT (Supervised Fine Tuning) to align a Transformer to a domain. The second part covers RAG, which is more familiar to most people using LLMs -- the discussion formalizes the structure of a RAG pipeline (retriever, generator and refinement layer) and describes some popular RAG variants, namely Agentic RAG, Corrective RAG, Self-RAG and Fusion RAG.
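The retriever-generator structure of a RAG pipeline can be sketched in a few lines. In this toy version (all names mine), the retriever is a simple word-overlap ranker standing in for a real embedding-similarity search, and `generate` is a placeholder for an actual LLM call.

```python
def retrieve(query, docs, k=2):
    """Rank documents by word overlap with the query (a stand-in for
    embedding-based similarity search in a real retriever)."""
    q = set(query.lower().split())
    return sorted(docs, key=lambda d: len(q & set(d.lower().split())), reverse=True)[:k]

def rag_answer(query, docs, generate):
    """Retriever -> generator: stuff the top-k passages into the prompt."""
    context = "\n".join(retrieve(query, docs))
    return generate(f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")

docs = [
    "The Transformer architecture relies on self-attention.",
    "BERT is an encoder-only Transformer model.",
    "PPO is a reinforcement learning algorithm.",
]
# Echo generator so we can inspect the assembled prompt
answer = rag_answer("what is an encoder-only transformer", docs, generate=lambda p: p)
```

Variants like Corrective RAG or Self-RAG add a refinement layer around this same skeleton, e.g. grading or re-retrieving before the final generation.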
In Part 3, Chapter 6 discusses Multimodal models, how they are different from text-only LLMs, and how they work by projecting text and non-text input into a shared embedding space. It differentiates between Converter-based alignment, where all modalities are projected onto the same space, and Perception-based alignment, where modality-specific encoders produce each embedding and the LLM uses an attention mechanism to combine them.
Chapter 7 discusses Small Language Models (SLMs), which are decoder-only Transformer models with 8-13 billion parameters. These are larger than encoder-only or encoder-decoder style models, but smaller than other decoder-only models. Such models are usually better at general-purpose inference than (smaller) encoder-only or encoder-decoder models, but not as good as their full-sized counterparts. SLMs focus on specialization and efficiency, and can be deployed on edge devices or as specialized components co-existing with LLMs in RAG pipelines. They can be used to generate data to train their larger counterparts using Weak to Strong Learning, serve as Approximate Gradient Proxies, and function as Auxiliary Reward Models for RLHF. They can be deployed as specialized tools or agents in Agentic Workflows, e.g. classifiers for sentiment analysis, compliance checks and intent detection, guard models, and coding models. They also work well in privacy-conscious domains, where you don't want your requests going out to a third-party model provider. Finally, they are more practical to fine-tune for your specific use case than a full-sized LLM.
Chapter 8 discusses training and evaluating Large Language Models, and suggests the use of Ray Tune for Hyperparameter Tuning and the Weights and Biases (W&B) platform for logging and determining GPU utilization. It details various PEFT techniques such as LoRA (Low Rank Adaptation), DoRA (Weight Decomposed LoRA), Quantization, QLoRA (Quantized LoRA), QA-LoRA (Quantization-Aware LoRA), and LQ-LoRA (Low Rank plus Quantized Matrix Decomposition LoRA). Unfortunately, the author has not included many examples of these techniques in the book, possibly because they were considered out of scope for the book's average reader. However, I had expected some coverage of evaluation techniques, which I did not find -- evaluation is a real problem for teams building RAG or other inference-only pipelines, and it is complex because outputs are non-deterministic. Perhaps this is an oversight that can be addressed in a future edition of the book.
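The core LoRA idea is compact enough to sketch. In this toy implementation (mine, not the book's), the pretrained weight W is frozen and a low-rank product B·A is added to it, so only r·(d_in + d_out) parameters need to be trained instead of d_in·d_out.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update:
    y = x W^T + (alpha / r) * x A^T B^T, with only A and B trained."""
    def __init__(self, W, r=4, alpha=8, rng=None):
        rng = rng or np.random.default_rng(0)
        d_out, d_in = W.shape
        self.W = W                                       # frozen pretrained weight
        self.A = rng.normal(scale=0.01, size=(r, d_in))  # trainable, random init
        self.B = np.zeros((d_out, r))                    # trainable, zero init
        self.scale = alpha / r

    def __call__(self, x):
        # B = 0 at init, so the adapted layer starts out identical to the frozen one
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

rng = np.random.default_rng(7)
W = rng.normal(size=(32, 64))      # pretend this is a pretrained layer
layer = LoRALinear(W, r=4)
x = rng.normal(size=(2, 64))
```

Here the adapter adds 4*64 + 32*4 = 384 trainable parameters against 2048 in the frozen weight; at transformer scale, this gap is what makes PEFT practical on modest hardware.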
Chapter 9 covers deployment issues associated with LLMs, namely around optimization and scaling. It describes Model Optimization techniques such as pruning (removing neurons or edges) and distillation (from larger models to more efficient smaller ones for specific tasks), and Memory Optimization techniques such as various types of sharding (tensor, pipeline, optimizer and hybrid). The chapter also describes Inference Optimization techniques such as KV Caching, Paged Attention, vLLM and Operator Fusion; GPU optimizations such as Tiling and Flash Attention; and extensions to support long contexts such as Rotary Embeddings (RoPE), iRoPE which alternates between RoPE and NoPE (No Positional Embeddings), block-sparse and linear attention, and sliding-window and chunked attention.
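KV Caching is worth a toy illustration (names are mine): keys and values for past tokens are stored once and appended to at each decode step, so generating token t only computes one new attention row instead of re-attending over the whole prefix from scratch.

```python
import numpy as np

def attend(q, K, V):
    """One query vector attending over all cached keys/values."""
    s = q @ K.T / np.sqrt(q.shape[-1])
    w = np.exp(s - s.max())
    w /= w.sum()
    return w @ V

class KVCache:
    """Append-only store: each decode step adds one key/value row instead of
    recomputing the key/value projections for the whole prefix."""
    def __init__(self, d):
        self.K = np.empty((0, d))
        self.V = np.empty((0, d))

    def step(self, k, v, q):
        self.K = np.vstack([self.K, k])
        self.V = np.vstack([self.V, v])
        return attend(q, self.K, self.V)

rng = np.random.default_rng(3)
steps = rng.normal(size=(6, 3, 8))   # six decode steps of (k, v, q), d = 8
cache = KVCache(8)
outs = [cache.step(k, v, q) for k, v, q in steps]
```

The last cached step produces exactly the same output as recomputing attention over all six keys and values; Paged Attention then manages the memory layout of exactly this growing K/V store.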
Finally, Chapter 10 covers techniques to create Responsible and Ethical LLM-based applications. It outlines some possible reasons for LLM bias based on the geographical distribution of training data, approaches to flag and filter hateful or toxic generations using pre-trained BERT-class models such as RoBERTa-Toxicity and HateBERT, the use of custom logging on W&B for interpretability analysis, using perturbation models in the Captum tool to determine feature attribution, and explanations using Local Interpretable Model-Agnostic Explanations (LIME). It also describes some rule-based techniques to ensure responsible behavior of LLMs, such as adding disclaimer text, penalizing tokens if they match a blacklist, and using rule- and LLM-driven input and output guards like those provided by llm-guard. It also suggests using the safe/unsafe classifiers in Purple Llama to address lifecycle vulnerabilities and prevent jailbreaks.
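The blacklist-penalty idea can be sketched as a simple logit adjustment applied before sampling. This is my own toy version, not llm-guard's actual mechanism, and the names are illustrative.

```python
import numpy as np

def penalize_blacklist(logits, vocab, blacklist, penalty=10.0):
    """Subtract a fixed penalty from the logits of blacklisted tokens,
    making them very unlikely (but not impossible) to be sampled."""
    logits = np.asarray(logits, dtype=float).copy()
    for i, tok in enumerate(vocab):
        if tok.lower() in blacklist:
            logits[i] -= penalty
    return logits

vocab = ["badword", "hello", "world"]
adjusted = penalize_blacklist([3.0, 2.0, 1.0], vocab, {"badword"})
```

A hard guard would instead set the logit to -inf; the soft penalty shown here is gentler and keeps the distribution well-formed even if every candidate token matches.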
As with the earlier Transformers book, what I found most useful about this book was its coverage. While I feel fortunate to have actually lived through these transformative times rather than read about them in a book, the pace of breakthroughs in the state of the art is hard to keep up with unless you are actively doing the research yourself (and maybe not even then). As a result, you end up knowing about the few things you have used, considered using, or found interesting, but are woefully ignorant about a lot of the other things in the field. Books like this not only fill out your knowledge gaps, they also give you new ideas based on things you just learned.
In addition, this book describes many useful techniques to improve your LLM pipelines. Many of us, me included, have built traditional and neural (pre-Transformer and Transformer-based) ML pipelines, and have been building RAG pipelines over the past couple of years. But we may not be familiar with all the latest prompting techniques, or we may not have fine-tuned an SLM because of the compute requirements. Books like this show us how to do it, and thereby make us more productive and more effective users of LLMs.