Methods and systems for associating semantically related items of a plurality of item types using a joint embedding space are disclosed. The disclosed methods and systems scale to large, web-scale training data sets. According to an embodiment, a method for associating semantically related items of a plurality of item types includes: embedding training items of a plurality of item types in a joint embedding space configured in a memory coupled to at least one processor; learning one or more mappings into the joint embedding space for each of the item types to create a trained joint embedding space and one or more learned mappings; and associating one or more embedded training items with a first item based upon a distance in the trained joint embedding space from the first item to each of the associated embedded training items. Exemplary item types that may be embedded in the joint embedding space include images, annotations, audio, and video.
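The association step described above can be illustrated with a minimal sketch. It assumes linear mappings that have already been learned and a Euclidean distance in the joint space; the abstract specifies neither, so both are assumptions made only for illustration. The names `embed`, `distance`, `associate`, `IMAGE_MAPPING`, and `ANNOTATION_MAPPING`, as well as the feature values and labels, are hypothetical.

```python
import math

def embed(mapping, features):
    """Apply a linear mapping (a list of rows) to a feature vector."""
    return [sum(w * x for w, x in zip(row, features)) for row in mapping]

def distance(a, b):
    """Euclidean distance between two points in the joint embedding space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def associate(query_point, embedded_items, k=1):
    """Return the names of the k embedded training items closest to the query."""
    ranked = sorted(embedded_items, key=lambda item: distance(query_point, item[1]))
    return [name for name, _ in ranked[:k]]

# Hypothetical, already-learned mappings (one per item type) into a 2-d joint space.
IMAGE_MAPPING = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # 3-d image features -> 2-d
ANNOTATION_MAPPING = [[1.0, 0.0], [0.0, 1.0]]        # 2-d annotation features -> 2-d

# Embed training items of the annotation type (made-up labels and features).
annotations = {"dolphin": [1.0, 0.0], "car": [0.0, 1.0]}
embedded_annotations = [
    (name, embed(ANNOTATION_MAPPING, feats)) for name, feats in annotations.items()
]

# Embed a first item of the image type and associate it with the nearest annotation.
query = embed(IMAGE_MAPPING, [0.9, 0.1, 0.5])
nearest = associate(query, embedded_annotations, k=1)
print(nearest)
```

Because every item type is mapped into the same space, a single distance function can compare an image to an annotation directly; the learning of the mappings themselves is omitted from this sketch.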