Image Recognition
On Monday, researchers from Microsoft introduced Kosmos-1, a multimodal model that can reportedly analyze images for content, solve visual puzzles, perform visual text recognition, pass visual IQ tests, and understand natural language instructions.
For example, “ document ” is a text input, and " paragraph ” is an interleaved image-text input.
... An embedding module is used to encode both text tokens and other input modalities into vectors.
For input tokens, we use a lookup table to map them into embeddings.
To test Kosmos-1, the researchers fed it a filled-out test, one at a time, with each option completed, and asked if the answer was correct.