Image Recognition
As the race over artificial intelligence (AI) chatbots heats up, Microsoft has unveiled Kosmos-1, a new AI model that can respond to visual cues or images in addition to text prompts. The Multimodal Large Language Model (MLLM) can handle a range of new tasks, including image captioning, visual question answering, and more. "A great convergence of language, multimodal perception, action, and world modeling is an important step towards artificial general intelligence," Microsoft's AI researchers said. "In this work, we present Kosmos-1, a multimodal large language model (MLLM) that can understand general modalities, learn in context, and follow instructions." As ZDNet reports, the paper suggests that going beyond ChatGPT-like capabilities to artificial general intelligence (AGI) requires multimodal perception: acquiring knowledge and grounding it in the real world.