Image Recognition
Multimodal perception is a fundamental component of intelligence and a prerequisite for artificial general intelligence, both for acquiring knowledge and for grounding language in the real world. In this work, the authors advance the transition from LLMs to MLLMs by integrating vision with large language models (LLMs). They introduce KOSMOS-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, follow instructions (i.e., zero-shot learning), and learn in context (i.e., few-shot learning) for both language and multimodal tasks.

Figure 1: KOSMOS-1 is a Multimodal Large Language Model (MLLM) that can perceive multimodal input, follow instructions, and learn in context for both language and multimodal tasks.

As shown in Figure 1, the general-purpose interface is a language model built on the Transformer architecture, and perception modules are docked with the language model to feed it non-text modalities.
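To make the "perception modules docked with a language model" idea concrete, here is a minimal, self-contained sketch (an illustrative assumption, not the actual KOSMOS-1 implementation): text tokens are embedded by a lookup-style function, while an image is run through a separate perception module that maps it into the same embedding space, so the whole input becomes one uniform sequence a Transformer decoder could process. All function names (`embed_text`, `perceive_image`, `build_input_sequence`) and the toy embedding logic are hypothetical.

```python
from typing import List, Union

EMBED_DIM = 4  # toy embedding size for illustration


def embed_text(token_id: int) -> List[float]:
    """Toy text embedding: a deterministic vector derived from the token id."""
    return [float((token_id * (i + 1)) % 7) for i in range(EMBED_DIM)]


def perceive_image(pixels: List[float]) -> List[List[float]]:
    """Toy perception module: maps an 'image' to two patch embeddings
    in the same space as the text embeddings."""
    mean = sum(pixels) / len(pixels)
    return [[mean] * EMBED_DIM, [mean / 2] * EMBED_DIM]


def build_input_sequence(
    segments: List[Union[int, List[float]]],
) -> List[List[float]]:
    """Interleave text token ids and raw images into a single embedding
    sequence, which a Transformer decoder would then process uniformly."""
    sequence: List[List[float]] = []
    for seg in segments:
        if isinstance(seg, int):   # text token: embed via the lookup
            sequence.append(embed_text(seg))
        else:                      # image: dock in the perception module
            sequence.extend(perceive_image(seg))
    return sequence


# Three text tokens, one image (yielding two patch embeddings), one more token.
seq = build_input_sequence([5, 9, 2, [0.1, 0.5, 0.9, 0.3], 7])
print(len(seq))  # → 6
```

The key design point this sketch illustrates is that the language model itself never distinguishes modalities: once the perception module has projected an image into the shared embedding space, every position in the sequence looks the same to the Transformer.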