Microsoft unveils AI model that understands image content, solves visual puzzles


On Monday, researchers from Microsoft introduced Kosmos-1, a multimodal model that can reportedly analyze images for content, solve visual puzzles, perform visual text recognition, pass visual IQ tests, and understand natural language instructions.

The researchers' paper describes how text and images are fed to the model as a single sequence marked with special tokens: “For example, ‘<s> document </s>’ is a text input, and ‘<s> paragraph <image> Image Embedding </image> paragraph </s>’ is an interleaved image-text input. ... An embedding module is used to encode both text tokens and other input modalities into vectors. For input tokens, we use a lookup table to map them into embeddings.”

To test Kosmos-1 on a visual IQ puzzle, the researchers fed it filled-out versions of the test, one at a time, each completed with a different candidate option, and asked whether the answer was correct.
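
The two mechanisms described above, embedding an interleaved image-text sequence through a lookup table and checking puzzle options one at a time, can be sketched in plain Python. This is a toy illustration under stated assumptions, not Kosmos-1's actual code: the vocabulary, the 4-dimensional vectors, and the names `encode_image`, `embed_sequence`, `pick_option`, and `score_yes` are all hypothetical stand-ins, and choosing the highest-scoring option is an assumed decision rule.

```python
import random

EMBED_DIM = 4  # toy size; real models use far larger vectors

# Hypothetical toy vocabulary, including the special boundary tokens.
VOCAB = ["<s>", "</s>", "<image>", "</image>", "a", "photo", "of", "cat"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}

rng = random.Random(0)
# Lookup table: one vector per token id (randomly initialised here, not learned).
EMBEDDING_TABLE = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in VOCAB]


def embed_token(token):
    """Map a text token to its vector via the lookup table."""
    return EMBEDDING_TABLE[TOKEN_TO_ID[token]]


def encode_image(pixels):
    """Stand-in image encoder (assumption): returns two identical vectors
    filled with the mean pixel value. A real system would use a vision model."""
    mean = sum(pixels) / len(pixels)
    return [[mean] * EMBED_DIM for _ in range(2)]


def embed_sequence(segments):
    """Embed an interleaved input.

    Each segment is ("text", [tokens]) or ("image", [pixel values]).
    The result is wrapped in <s> ... </s>, and each image's vectors are
    wrapped in <image> ... </image>.
    """
    vectors = [embed_token("<s>")]
    for kind, payload in segments:
        if kind == "text":
            vectors.extend(embed_token(tok) for tok in payload)
        elif kind == "image":
            vectors.append(embed_token("<image>"))
            vectors.extend(encode_image(payload))
            vectors.append(embed_token("</image>"))
        else:
            raise ValueError(f"unknown segment kind: {kind!r}")
    vectors.append(embed_token("</s>"))
    return vectors


def pick_option(filled_puzzles, score_yes):
    """Score each filled-out puzzle separately and return the index of the
    one the scorer rates most likely to be correct (first one wins ties)."""
    scores = [score_yes(puzzle) for puzzle in filled_puzzles]
    return max(range(len(scores)), key=scores.__getitem__)


sequence = embed_sequence([
    ("text", ["a", "photo", "of"]),
    ("image", [0.2, 0.4, 0.6]),
    ("text", ["cat"]),
])
print(len(sequence))  # 10 vectors: <s>, 3 tokens, <image>, 2 image vectors, </image>, 1 token, </s>
```

In this sketch, `score_yes` would be whatever function returns the model's confidence that a completed puzzle is correct. Passing it in as a parameter keeps the option-by-option loop independent of any particular model.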