References:
1. Peng, Y., Huang, X., and Zhao, Y. "An overview of cross-media retrieval: Concepts, methodologies, benchmarks, and challenges", IEEE Trans. Circuits Syst. Video Technol., 28(9), pp. 2372-2385 (2018).
2. Baltrusaitis, T., Ahuja, C., and Morency, L.-P. "Multimodal machine learning: A survey and taxonomy", IEEE Trans. Pattern Anal. Mach. Intell., 41(2), pp. 423-443 (2019).
3. Hotelling, H. "Relations between two sets of variates", Biometrika, 28(3-4), pp. 321-377 (1936).
4. Rasiwasia, N., Costa Pereira, J., Coviello, E., et al. "A new approach to cross-modal multimedia retrieval", 18th ACM Int. Conf. on Multimedia, pp. 251-260 (2010).
5. Wang, K., He, R., Wang, L., et al. "Joint feature selection and subspace learning for cross-modal retrieval", IEEE Trans. Pattern Anal. Mach. Intell., 38(10), pp. 2010-2023 (2016).
6. Zhai, X., Peng, Y., and Xiao, J. "Heterogeneous metric learning with joint graph regularization for cross-media retrieval", AAAI Conf. on Artif. Intell., 27(1), pp. 1198-1204 (2013).
7. Ngiam, J., Khosla, A., Kim, M., et al. "Multimodal deep learning", 28th Int. Conf. on Mach. Learn., pp. 689-696 (2011).
8. Feng, F., Wang, X., and Li, R. "Cross-modal retrieval with correspondence autoencoder", 22nd ACM Int. Conf. on Multimedia, pp. 7-16 (2014).
9. Srivastava, N. and Salakhutdinov, R. "Learning representations for multimodal data with deep belief nets", Int. Conf. on Mach. Learn. Workshop (2012).
10. Karpathy, A., Joulin, A., and Fei-Fei, L. "Deep fragment embeddings for bidirectional image sentence mapping", 27th Int. Conf. on Neural Inform. Process. Syst., pp. 1889-1897 (2014).
11. Lee, K.-H., Chen, X., Hua, G., et al. "Stacked cross attention for image-text matching", European Conf. on Comput. Vis., pp. 201-216 (2018).
12. Wang, L., Li, Y., and Lazebnik, S. "Learning deep structure-preserving image-text embeddings", IEEE Conf. on Comput. Vis. and Pattern Recogn., pp. 5005-5013 (2016).
13. Chen, H., Ding, G., Liu, X., et al. "IMRAM: Iterative matching with recurrent attention memory for cross-modal image-text retrieval", IEEE Conf. on Comput. Vis. and Pattern Recogn., pp. 12652-12660 (2020).
14. Hermans, A., Beyer, L., and Leibe, B. "In defense of the triplet loss for person re-identification", arXiv:1703.07737 (2017).
15. Andrew, G., Arora, R., Bilmes, J., et al. "Deep canonical correlation analysis", 30th Int. Conf. on Mach. Learn., pp. 1247-1255 (2013).
16. Zhai, X., Peng, Y., and Xiao, J. "Learning cross-media joint representation with sparse and semi-supervised regularization", IEEE Trans. Circuits Syst. Video Technol., 24(6), pp. 965-978 (2014).
17. Peng, Y., Zhai, X., Zhao, Y., et al. "Semi-supervised cross-media feature learning with unified patch graph regularization", IEEE Trans. Circuits Syst. Video Technol., 26(3), pp. 583-596 (2016).
18. Frome, A., Corrado, G.S., Shlens, J., et al. "DeViSE: a deep visual-semantic embedding model", 26th Int. Conf. on Neural Inform. Process. Syst., pp. 2121-2129 (2013).
19. Socher, R., Karpathy, A., Le, Q.V., et al. "Grounded compositional semantics for finding and describing images with sentences", Trans. of Assoc. for Comput. Ling., 2, pp. 207-218 (2014).
20. Vendrov, I., Kiros, R., Fidler, S., et al. "Order-embeddings of images and language", arXiv:1511.06361 (2015).
21. Chen, H., Ding, G., Lin, Z., et al. "Cross-modal image-text retrieval with semantic consistency", 27th ACM Int. Conf. on Multimedia, pp. 1749-1757 (2019).
22. Misraa, A.K., Kale, A., Aggarwal, P., et al. "Multi-modal retrieval using graph neural networks", arXiv:2010.01666 (2020).
23. Wang, L., Li, Y., Huang, J., et al. "Learning two-branch neural networks for image-text matching tasks", IEEE Trans. Pattern Anal. Mach. Intell., 41(2), pp. 394-407 (2019).
24. Xu, X., Wang, T., Yang, Y., et al. "Cross-modal attention with semantic consistence for image-text matching", IEEE Trans. Neural Networks Learn. Syst., 31(12), pp. 5412-5425 (2020).
25. Silberer, C. and Lapata, M. "Learning grounded meaning representations with autoencoders", 52nd Annu. Mtg. of the Assoc. for Comput. Ling., pp. 721-732 (2014).
26. Salakhutdinov, R. and Larochelle, H. "Efficient learning of deep Boltzmann machines", 13th Int. Conf. on Artif. Intell. and Stats., pp. 693-700 (2010).
27. Hinton, G.E., Osindero, S., and Teh, Y.-W. "A fast learning algorithm for deep belief nets", Neural Comput., 18(7), pp. 1527-1554 (2006).
28. Peng, Y., Qi, J., Huang, X., et al. "CCL: Crossmodal correlation learning with multigrained fusion by hierarchical network", IEEE Trans. Multimedia, 20(2), pp. 405-420 (2018).
29. Zhang, J., Peng, Y., and Yuan, M. "SCH-GAN: semisupervised cross-modal hashing by generative adversarial network", IEEE Trans. Cybern., 50(2), pp. 489- 502 (2020).
30. Zhang, J., Peng, Y., and Yuan, M. "Unsupervised generative adversarial cross-modal hashing", AAAI Conf. on Artif. Intell., 32(1) (2018).
31. Wang, B., Yang, Y., Xu, X., et al. "Adversarial crossmodal retrieval", 25th ACM Int. Conf. on Multimedia, pp. 154-162 (2017).
32. Schroff, F., Kalenichenko, D., and Philbin, J. "FaceNet: a unified embedding for face recognition and clustering", 2015 IEEE Conf. on Comp. Vis. Patt. Recog., pp. 815-823 (2015).
33. Zhou, M., Niu, Z., Wang, L., et al. "Ladder loss for coherent visualsemantic embedding", AAAI Conf. on Artif. Intell., pp. 13050-13057 (2020).
35. Zhai, X., Peng, Y., and Xiao, J. "Effective heterogeneous similarity measure with nearest neighbors for cross-media retrieval", Int. Conf. on Multimedia Modeling, pp. 312-322 (2012).
35. Zhai, X., Peng, Y., and Xiao, J. "Effective heterogeneous similarity measure with nearest neighbors for cross-media retrieval", Int. Conf. on Adv. Multim. Model., pp. 312-322 (2012).