
Call for Collaboration: Multimodal AI – combining text, images, video, audio and structured as well as graph-based data


An exciting frontier in current AI research and application is Multimodal AI: building systems that combine multiple modalities such as text, images, video, audio and structured as well as graph-based data. Combining modalities gives a system a deeper understanding of the task at hand, so that it can reach higher accuracy or solve more complex problems than was previously possible.

Two research groups at the recently founded Centre for Artificial Intelligence (CAI) at the Zurich University of Applied Sciences (ZHAW), the Computer Vision, Perception and Cognition (CVPC) group and the Natural Language Processing (NLP) group, would like to join forces in the domain of multimodal AI. Both teams have many years of experience in their respective domains, evidenced by successful applied research projects with industrial and academic partners, as well as realized products and scientific publications.

Call for collaboration

We are looking for project partners and research collaborations that provide real-world challenges to solve with multimodal deep learning approaches, that is, approaches that digest data from multiple sources such as text, audio, images, video and tabular data. In particular, we would like to partner with a company that has a suitable use case, and potentially with other researchers, to apply for joint third-party funding to find and implement a solution for the company’s case.

Application areas

Example industrial use cases include, but are not limited to, the medical domain (e.g., combining medical imaging with textual or tabular data on patient histories, treatments, drugs etc.), news, TV and journalism (combining newspaper articles, news videos, press images), insurance (case data with text forms and images), engineering (descriptions, part lists, drawings, videos) and domain-specific visual question answering or web search tasks. Other emerging applications are in the areas of conversational AI, image and video search using language, autonomous robots and drones, and multimodal assistants. All these tasks require systems that can interact with the world using all available modalities.


In recent years, deep-learning-based AI solutions have reached unprecedented and often better-than-human performance in many natural language processing (NLP) and computer vision (CV) tasks using highly specialized neural network architectures trained on single-modality data (either text or images/video). Examples include machine translation and text generation, language understanding and question answering (e.g., SuperGLUE, SQuAD), as well as image classification, object detection (e.g., ImageNet) and segmentation (e.g., PASCAL VOC, COCO) tasks.

While the achievements on these single-modality learning tasks are impressive, the next important step towards more powerful AI in real-world use cases is creating systems with broader cognitive abilities, which digest inputs and fuse knowledge from multiple modalities. Ultimately, such systems could accumulate world knowledge, or a “common sense” understanding of the world.

Current tasks and architectures in multimodal AI research mainly combine image and text data. Prominent examples include image description or “caption” generation, text-to-image generation systems (e.g., OpenAI’s DALL-E [1] and GLIDE [2]) and visual question answering systems (see, e.g., METER [3]).

A related research avenue is to develop more general deep learning architectures that are not optimized for a single modality, but can process inputs from different modalities without making domain-specific assumptions (inductive bias). Examples are DeepMind’s Perceiver networks [4] for classification tasks, and Perceiver IO [5], which can produce arbitrary outputs in addition to digesting arbitrary inputs.
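To illustrate the idea behind such modality-agnostic architectures, here is a minimal sketch of a single cross-attention step in which a small, fixed-size set of latent vectors attends to an input sequence of any length. It is a simplification for illustration only and not the published Perceiver implementation: the learned query, key and value projections are left out, the data is random, and all names (`cross_attend`, `latents`, `text_tokens`, `image_patches`) are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(latents, inputs):
    """Let N latent vectors (N x D) attend to M input vectors (M x D).

    The result is always N x D, no matter how long the input is.
    Assumption: learned projections are omitted to keep the sketch short.
    """
    dim = len(latents[0])
    inputs_t = [list(col) for col in zip(*inputs)]        # D x M
    scores = matmul(latents, inputs_t)                    # N x M
    weights = [softmax([s / math.sqrt(dim) for s in row]) for row in scores]
    return matmul(weights, inputs)                        # N x D


rng = random.Random(0)


def random_matrix(rows, cols):
    """Stand-in for embedded input data (purely synthetic)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


latents = random_matrix(3, 4)          # 3 latent vectors of width 4
text_tokens = random_matrix(7, 4)      # e.g. 7 embedded text tokens
image_patches = random_matrix(50, 4)   # e.g. 50 embedded image patches

text_summary = cross_attend(latents, text_tokens)
image_summary = cross_attend(latents, image_patches)
print(len(text_summary), len(text_summary[0]))
print(len(image_summary), len(image_summary[0]))
```

Both calls return a matrix of the same size although the inputs differ in length, which is why the same downstream layers could, in principle, be reused for different modalities.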



In case you would like to know more or want to discuss how your use case would fit into this applied research, do not hesitate to contact us.

Prof. Dr. Thilo Stadelmann
Director & head CVPC group @ CAI

Prof. Dr. Mark Cieliebak
Head NLP group @ CAI


[1] DALL-E:
[2] GLIDE:
[3] METER:
[4] Perceiver:
[5] Perceiver IO: