2024 IEEE Information Theory Workshop
Tutorials

Theory on Training Dynamics of Transformers

Presenter: Yingbin Liang

Transformers, as foundation models, have recently revolutionized many machine learning (ML) applications, such as natural language processing, computer vision, and robotics. Alongside their tremendous empirical success, theoretical studies have emerged to explain why transformers can be trained to achieve such strong performance. This tutorial aims to provide an overview of these recent theoretical investigations, which characterize the training dynamics of transformer-based ML models. Additionally, the tutorial will explain the primary techniques and tools employed in such analyses, which leverage various information-theoretic concepts and tools alongside learning theory, stochastic optimization, dynamical systems, and probability.

Specifically, the tutorial will begin with an introduction to basic transformer models and then delve into several ML problems where transformers have found extensive application, such as in-context learning, next-token prediction, and masked vision pretraining. For each learning problem, the tutorial will go over the problem formulation, the main theoretical techniques for characterizing the training process, the convergence guarantees, the optimality of the attention models at convergence, the implications for the learning problem, and the insights and guidelines they offer for practical solutions. Finally, the tutorial will discuss future directions and open problems in this actively evolving field.
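To fix notation for readers new to these models, below is a minimal NumPy sketch of the single-layer softmax self-attention map that such training-dynamics analyses typically start from; the dimensions, initialization, and parameter names (W_q, W_k, W_v) are illustrative choices of ours, not taken from any particular paper covered in the tutorial.

```python
# Minimal sketch of single-layer softmax self-attention in NumPy.
# All dimensions, names, and the initialization scale are illustrative.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, W_q, W_k, W_v):
    """X: (T, d) token embeddings -> (T, d) attended outputs."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v       # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[1])    # scaled dot products, shape (T, T)
    A = softmax(scores, axis=-1)              # attention weights; rows sum to 1
    return A @ V

rng = np.random.default_rng(0)
T, d = 8, 16                                   # sequence length, embedding dimension
X = rng.normal(size=(T, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
print(attention(X, W_q, W_k, W_v).shape)       # (8, 16)
```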


Unraveling Contagion Origins: Mathematical Analysis, Network Algorithms and Applications to Epidemics and Infodemics

Presenter: Chee Wei TAN

The rapid spread of infectious diseases and online rumors share striking similarities in their speed, scale, and contagion patterns. Recent outbreaks, such as the COVID-19 pandemic, have underscored the devastating impact of simultaneous epidemics and misinformation crises. Recognizing this, the World Health Organization has emphasized the need for proactive measures to combat future pandemic threats, such as Disease X. Networks play a pivotal role in both contagions, as influential individuals can rapidly propagate fear and misinformation. The ability to trace contagion origins, whether for epidemics or rumors, is critical for effective containment strategies. This tutorial offers a thorough examination of contagion source detection, merging mathematical theories, computational algorithms, and graph signal processing techniques. Participants will delve into network centrality, contagion dynamics, and graph signal processing to promptly pinpoint sources, monitor spread, and forecast outbreaks. The focus will be on revealing connections between graph algorithms and large-scale optimization in detecting contagion sources, based on “Contagion Source Detection in Epidemic and Infodemic Outbreaks: Mathematical Analysis and Network Algorithms” published in Foundations and Trends® in Networking in 2023. The tutorial thus allows attendees to gain insights into complex network dynamics and advanced mathematical techniques to address the spread of future pandemics and misinformation effectively.
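To make the centrality viewpoint concrete, here is a toy Python sketch (using networkx; the example graph and function names are ours, not from the monograph) of rumor-centrality source estimation on a tree of infected nodes. Rumor centrality is the maximum-likelihood source estimator of Shah and Zaman on regular trees under the standard susceptible-infected spreading model, and is one of the centrality notions the monograph builds on.

```python
# Toy sketch of rumor-centrality source estimation on a tree of infected
# nodes: R(v) = n! / (product of all subtree sizes when the tree is rooted at v).
import math
import networkx as nx

def rumor_centrality(tree: nx.Graph, root) -> float:
    """Rumor centrality of `root` in a tree of infected nodes."""
    sizes = {}

    def subtree_size(u, parent):
        s = 1
        for w in tree.neighbors(u):
            if w != parent:
                s += subtree_size(w, u)
        sizes[u] = s
        return s

    subtree_size(root, None)                 # sizes[root] equals n
    n = tree.number_of_nodes()
    return math.factorial(n) / math.prod(sizes[u] for u in tree.nodes)

# Usage: the source estimate is the infected node maximizing rumor centrality.
infected = nx.balanced_tree(r=2, h=3)        # illustrative infected subgraph
source_hat = max(infected.nodes, key=lambda v: rumor_centrality(infected, v))
print("estimated source:", source_hat)
```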


Converse Techniques for Identification: A Cutting-Edge Example of Goal-Oriented Communication

Presenters: Christian Deppe, Wafa Labidi, Johannes Rosenberger, and Shun Watanabe

There is a surge of interest in models that go beyond Shannon’s classic transmission framework, celebrated for its channel capacity formula. One such exciting principle is message identification via channels, as introduced by Ahlswede and Dueck. Unlike Shannon’s traditional model, where the receiver aims to estimate which message was sent from a set of M messages, message identification concentrates solely on determining whether a specific message m was transmitted. The encoder can function deterministically or through randomization, the latter yielding significant performance gains for message identification.
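For readers new to the model, the standard formalization (in our notation) is as follows; note that the decision regions may overlap, which is what makes the double-exponential growth discussed below possible.

```latex
% An (n, M, \lambda_1, \lambda_2) identification code for a channel W^n is a
% family \{(Q_i, D_i)\}_{i=1}^{M}, where Q_i is a (possibly randomized) input
% distribution on X^n and D_i \subseteq Y^n; the D_i need not be disjoint.
% Errors of the first kind (missing message i) and of the second kind
% (confusing message j for message i) must satisfy
\[
\max_{i}\; \sum_{x^n} Q_i(x^n)\, W^n\!\left(D_i^{\mathrm{c}} \mid x^n\right) \le \lambda_1,
\qquad
\max_{i \neq j}\; \sum_{x^n} Q_j(x^n)\, W^n\!\left(D_i \mid x^n\right) \le \lambda_2 .
\]
```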

The ability to perform identification is tightly coupled with the ability to generate common randomness, which is important, e.g., in cryptography and for countering jamming, as well as in rate-distortion-perception theory, a recent extension of rate-distortion theory aimed at evaluating neural networks for human perception. Further, strong connections exist to soft covering and channel resolvability, which are also important in source coding and cryptography.

When the encoder can randomize, Shannon’s model allows for the transmission of M = 2^(nC) messages with n uses of a channel, e.g., a discrete memoryless channel (DMC) with capacity C, whereas Ahlswede and Dueck’s model enables the identification of M = 2^(2^(nC)) (doubly exponential) messages over a DMC. In their groundbreaking paper, Ahlswede and Dueck established the achievability part and introduced a "soft" converse bound. Subsequent works have further refined this, leading to a strong converse bound applicable under specific conditions. Watanabe’s recent contributions notably expand the applicability of the converse bound.
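In the compact form usually cited (our notation; achievability due to Ahlswede and Dueck, and the strong converse commonly attributed to Han and Verdú), the contrast reads:

```latex
% Largest code sizes over n uses of a DMC with Shannon capacity C:
\[
\text{transmission:}\;\; \frac{1}{n}\log M_{\mathrm{tx}}(n) \longrightarrow C,
\qquad
\text{identification:}\;\; \frac{1}{n}\log\log M_{\mathrm{id}}(n,\lambda_1,\lambda_2) \longrightarrow C,
\]
% the identification statement holding for fixed error probabilities
% \lambda_1, \lambda_2 > 0 with \lambda_1 + \lambda_2 < 1.
```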

Moreover, unlike in classical Shannon transmission, feedback can increase the identification capacity, since the main task in identification is common randomness generation. It is also noteworthy that noise can further enhance the identification capacity via the feedback link. Ahlswede and Dueck’s pioneering work furnished proofs of the identification capacities with feedback under either stochastic or deterministic encoding, akin to Wolfowitz’s proof for transmission with feedback. Attempting to apply Watanabe’s technique to derive these capacity formulas has unearthed numerous open questions, likely to captivate the interest of many researchers.

This tutorial aims to achieve multiple objectives: understanding the formalism and proof techniques outlined in the aforementioned works, analyzing several converses, tracing their evolution from earlier converses, and emphasizing the key similarities and differences underlying the improvements. Additionally, we delve into various models within the realm of identification via channels, seeking to establish connections between distinct proof techniques. Furthermore, we explore the converse proof for message identification with feedback, pioneered by Ahlswede and Zhang. By elucidating how their approaches were inspired by preceding proofs, we provide a comprehensive overview.

This tutorial endeavors to offer insights into diverse converse techniques for message identification, with a focus on the seminal works and recent publications by the instructors. Moreover, we elucidate how these converse techniques can be applied and extended to various other models.


Characterizing the Generalization Error of Machine Learning Algorithms Via Information Measures

Presenters: Gholamali Aminian, Yuheng Bu, Iñaki Esnaola, and Samir M. Perlaza

This tutorial explores methodologies originating in information theory for understanding the generalization behavior of machine learning algorithms, i.e., how such algorithms perform on unseen data. While a broad set of techniques for studying generalization is briefly discussed, the focus is on the notion of generalization error: the expectation, with respect to models and datasets, of the difference between the training empirical risk and the population risk. The main highlight of this tutorial is that this metric, which has been at the center of attention in machine learning for several decades, can be characterized for all machine learning algorithms via closed-form expressions involving information measures. The Gibbs algorithm is taken as a running example for introducing the main insights gained from these expressions. We will also discuss practical insights from such analysis for understanding over-parameterization. This information-theoretic approach is versatile: it can also characterize the generalization error of certain transfer learning and semi-supervised learning algorithms. We believe that this analysis can guide the choice among different learning algorithms and advance the understanding of generalization in practice.
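As a pointer to the kind of result the tutorial covers, the definition above and one representative information-theoretic bound (Xu and Raginsky, 2017) can be written as follows; the notation is ours.

```latex
% S = (Z_1, ..., Z_n): training dataset; W ~ P_{W|S}: output of the learning
% algorithm; \ell: loss function. Expected generalization error (difference
% between empirical and population risks, in expectation over models and data):
\[
\overline{\mathrm{gen}}
= \mathbb{E}_{W,S}\!\left[\frac{1}{n}\sum_{i=1}^{n}\ell(W,Z_i)
  \;-\; \mathbb{E}_{Z}\,\ell(W,Z)\right].
\]
% If \ell(w, Z) is \sigma-sub-Gaussian for every hypothesis w, then
\[
\left|\overline{\mathrm{gen}}\right| \;\le\; \sqrt{\frac{2\sigma^{2}\, I(W;S)}{n}},
\]
% where I(W;S) is the mutual information between the learned model and the data.
```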