Supervision: Anaïs Betsabeh Haget

Project type: Semester project (master) or Master thesis

Available

Background

Multimodal deep learning is a trending field in which multiple channels of information in a dataset, i.e. "modalities", are processed and learned from jointly. Different methods have been employed to integrate the various data views: early fusion, where the modalities are fed to a single model; parallel processing of each modality independently before combining the results; or a mixture of these strategies.
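To make these fusion strategies concrete, here is a minimal, illustrative PyTorch sketch contrasting early fusion with independent per-modality processing followed by combination. Module names and dimensions are placeholders and not taken from the project codebase.

```python
import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    """Concatenate modality features first, then feed a single model."""
    def __init__(self, dim_a, dim_b, hidden, n_classes):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_a + dim_b, hidden), nn.ReLU(), nn.Linear(hidden, n_classes)
        )

    def forward(self, x_a, x_b):
        return self.net(torch.cat([x_a, x_b], dim=-1))

class LateFusion(nn.Module):
    """Process each modality independently, then combine the outputs."""
    def __init__(self, dim_a, dim_b, hidden, n_classes):
        super().__init__()
        self.enc_a = nn.Sequential(nn.Linear(dim_a, hidden), nn.ReLU())
        self.enc_b = nn.Sequential(nn.Linear(dim_b, hidden), nn.ReLU())
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, x_a, x_b):
        return self.head(torch.cat([self.enc_a(x_a), self.enc_b(x_b)], dim=-1))

# Example: two toy modalities (e.g. tabular + embedded text), batch of 8
x_a, x_b = torch.randn(8, 16), torch.randn(8, 32)
print(EarlyFusion(16, 32, 64, 3)(x_a, x_b).shape)  # torch.Size([8, 3])
print(LateFusion(16, 32, 64, 3)(x_a, x_b).shape)   # torch.Size([8, 3])
```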

The project builds upon a data-agnostic architecture based on the Perceiver model and relying on a novel sequential integration of modalities. The main objective is to explore, improve, and propose various alterations of this innovative implementation.
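For intuition only, the sketch below shows one possible way a Perceiver-style latent array could integrate modalities sequentially via cross-attention. It is a simplified assumption of the idea, not the project's actual implementation; all names and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class SequentialPerceiverFusion(nn.Module):
    """Toy Perceiver-style fusion: a learned latent array cross-attends to
    each modality in turn, so information is integrated sequentially."""
    def __init__(self, latent_len=32, latent_dim=128, input_dim=128, n_heads=4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(latent_len, latent_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(latent_dim, n_heads,
                                                kdim=input_dim, vdim=input_dim,
                                                batch_first=True)
        self.self_attn = nn.MultiheadAttention(latent_dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(latent_dim)  # single shared LayerNorm for brevity

    def forward(self, modalities):
        # modalities: list of tensors, each of shape (batch, seq_len_m, input_dim)
        batch = modalities[0].shape[0]
        z = self.latents.unsqueeze(0).expand(batch, -1, -1)
        for x in modalities:                        # sequential integration
            attn_out, _ = self.cross_attn(z, x, x)  # latents attend to one modality
            z = self.norm(z + attn_out)
            self_out, _ = self.self_attn(z, z, z)   # latent self-refinement
            z = self.norm(z + self_out)
        return z                                    # fused latent representation

# Example: three modalities with different sequence lengths but a shared feature dim
mods = [torch.randn(4, n, 128) for n in (10, 50, 7)]
fused = SequentialPerceiverFusion()(mods)
print(fused.shape)  # torch.Size([4, 32, 128])
```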

Project

The project will be closely supervised. While it is structured around planned strategies and their corresponding theory, it is designed to remain flexible enough to accommodate the student's own, possibly out-of-the-box, ideas.

Familiarization with the architecture, its code, and its datasets will take place while adapting some of its layers with more computationally efficient or better-performing modules. In parallel, various pre-training approaches can be researched and assessed.

Different data encoding tactics, which have proven to drastically affect model performance, can be implemented and evaluated.

Alterations combining standard modality fusion methods with the new sequential approach could be exciting contributions.
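As one hypothetical example of a data encoding tactic, the snippet below sketches Fourier-feature position encodings in the spirit of the Perceiver paper (Jaegle et al., 2021); the encodings actually used in the project codebase may differ.

```python
import torch

def fourier_encode(pos, num_bands=6, max_freq=10.0):
    """Map scalar positions in [-1, 1] to Fourier features
    [pos, sin(f1*pi*pos), cos(f1*pi*pos), ..., sin(fk*pi*pos), cos(fk*pi*pos)],
    similar in spirit to the Perceiver's position encoding."""
    freqs = torch.linspace(1.0, max_freq / 2, num_bands)   # (num_bands,)
    scaled = pos.unsqueeze(-1) * freqs * torch.pi          # (..., num_bands)
    return torch.cat([pos.unsqueeze(-1), scaled.sin(), scaled.cos()], dim=-1)

# Example: encode 100 evenly spaced positions of a 1-D sequence
pos = torch.linspace(-1.0, 1.0, steps=100)
enc = fourier_encode(pos)
print(enc.shape)  # torch.Size([100, 13]) -> 1 raw + 6 sin + 6 cos features
```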

Expected output

The student will develop an extensive understanding of deep learning and attention-based architectures in general, and of multimodality in particular, including their limitations and creative ways to overcome them. Analysis of model performance will be equally important.

Further information about the methodology, specific datasets to try, and references will be discussed upon meeting the student. The project can lead to a publication.

Required Skills

Understanding of machine learning; experience with the PyTorch framework (or a comparable deep learning tool). Prior familiarity with PyTorch Lightning is a plus.

Supervisor

Anaïs Haget, a PhD student working on multimodal deep learning. For more information, please reach out at anais.haget@epfl.ch.

References

  • Jaegle, A., Gimeno, F., Brock, A., Vinyals, O., Zisserman, A., & Carreira, J. (2021). Perceiver: General perception with iterative attention. In International Conference on Machine Learning (pp. 4651-4664). PMLR.
  • Liang, P. P., Zadeh, A., & Morency, L. P. (2023). Foundations & Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions. ACM Computing Surveys.
  • Liang, P. P., Lyu, Y., Fan, X., Wu, Z., Cheng, Y., Wu, J., ... & Morency, L. P. (2021). MultiBench: Multiscale benchmarks for multimodal representation learning. arXiv preprint arXiv:2107.07502.
  • Swamy, V., Satayeva, M., Frej, J., Bossy, T., Vogels, T., Jaggi, M., ... & Hartley, M. A. (2024). MultiModN—Multimodal, Multi-Task, Interpretable Modular Networks. Advances in Neural Information Processing Systems, 36.