Course Introduction
Feature representation of different modalities is the main focus of current cross-modal information retrieval research. Existing models typically project texts and images into the same embedding space. In this talk, we will introduce some basic ideas of text and image modeling and show how cross-modal relations can be built with deep learning models. In particular, we will discuss a joint model that uses metric learning to minimize the distance between representations of the same content from different modalities. We will also introduce some recent research developments in image captioning and visual question answering (VQA).
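As a taste of the joint-model idea described above, here is a minimal PyTorch sketch (illustrative only, not the course's actual implementation): two linear layers project pre-extracted image and text features into a shared embedding space, and a bidirectional triplet margin loss drives matching image-text pairs to be more similar than mismatched ones. The feature dimensions, margin value, and random inputs are all assumptions made for the example.

import torch
import torch.nn as nn
import torch.nn.functional as F


class JointEmbedding(nn.Module):
    """Project image and text features into one shared embedding space."""

    def __init__(self, img_dim=2048, txt_dim=300, emb_dim=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, emb_dim)  # e.g. pooled CNN features -> shared space
        self.txt_proj = nn.Linear(txt_dim, emb_dim)  # e.g. averaged word vectors -> shared space

    def forward(self, img_feat, txt_feat):
        # L2-normalize so that dot products equal cosine similarities.
        v = F.normalize(self.img_proj(img_feat), dim=-1)
        t = F.normalize(self.txt_proj(txt_feat), dim=-1)
        return v, t


def triplet_margin_loss(v, t, margin=0.2):
    """Pull matching image-text pairs together, push mismatched pairs apart."""
    sim = v @ t.T        # similarity of every image with every text in the batch
    pos = sim.diag()     # diagonal entries are the matching (positive) pairs
    # Hinge cost in both directions: every negative must score at least
    # `margin` below the positive pair sharing its image (rows) or text (cols).
    cost_img = (margin + sim - pos.unsqueeze(1)).clamp(min=0)
    cost_txt = (margin + sim - pos.unsqueeze(0)).clamp(min=0)
    off_diag = ~torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    return (cost_img[off_diag].mean() + cost_txt[off_diag].mean()) / 2


# Toy usage: random tensors stand in for real CNN / word-vector features.
model = JointEmbedding()
img_feat = torch.randn(8, 2048)
txt_feat = torch.randn(8, 300)
v, t = model(img_feat, txt_feat)
loss = triplet_margin_loss(v, t)
loss.backward()

After training with such a loss, retrieval across modalities reduces to a nearest-neighbor search in the shared space, which is the basis for the semantic search systems discussed later in the course.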
【Workshop Outline】
1. The semantic gap
2. Image modeling and CNNs
3. Text modeling and word embeddings
4. Joint models
5. Automatic image annotation
6. Text generation
7. Visual question answering
Learning Objectives
Gain an understanding of frontier research in deep learning: learn how to jointly model image and text data with deep learning, and how to build cross-modal semantic search and visual question answering systems.
Target Audience