Multimodal Learning

February 28, 2023 - 2-minute read - Category: Intro - Tags: Deep learning


This post covers the tenth lecture in the course: “Multimodal Learning.”

Humans learn through multiple modalities, and combining modalities is also relevant to a variety of economic applications. This lecture focuses primarily on vision-language models.

Lecture Video

Watch the video

Lecture notes

References Cited in Lecture 10: Multimodal Learning

Goh, Gabriel, Nick Cammarata, Chelsea Voss, Shan Carter, Michael Petrov, Ludwig Schubert, Alec Radford, and Chris Olah. “Multimodal neurons in artificial neural networks.” Distill 6, no. 3 (2021): e30.

Radford, Alec, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry et al. “Learning transferable visual models from natural language supervision.” In International Conference on Machine Learning. PMLR, 2021.

Li, Liunian Harold, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. “VisualBERT: A simple and performant baseline for vision and language.” arXiv preprint arXiv:1908.03557 (2019).

Wang, Zirui, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, and Yuan Cao. “SimVLM: Simple visual language model pretraining with weak supervision.” arXiv preprint arXiv:2108.10904 (2021).

Tsimpoukelli, Maria, Jacob L. Menick, Serkan Cabi, S. M. Eslami, Oriol Vinyals, and Felix Hill. “Multimodal few-shot learning with frozen language models.” Advances in Neural Information Processing Systems 34 (2021): 200-212.

Mokady, Ron, Amir Hertz, and Amit H. Bermano. “ClipCap: CLIP prefix for image captioning.” arXiv preprint arXiv:2111.09734 (2021).

Alayrac, Jean-Baptiste, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc et al. “Flamingo: a visual language model for few-shot learning.” arXiv preprint arXiv:2204.14198 (2022).

Nagrani, Arsha, Shan Yang, Anurag Arnab, Aren Jansen, Cordelia Schmid, and Chen Sun. “Attention bottlenecks for multimodal fusion.” Advances in Neural Information Processing Systems 34 (2021): 14200-14213. See also the accompanying blog post.

Rust, Phillip, Jonas F. Lotz, Emanuele Bugliarello, Elizabeth Salesky, Miryam de Lhoneux, and Desmond Elliott. “Language Modelling with Pixels.” arXiv preprint arXiv:2207.06991 (2022).

Other Resources

Generalized Vision Language Models (a highly informative blog post overview)

OpenAI blog post about CLIP

Twitter thread by Christopher Manning
