Cross-media retrieval via fusing multi-modality and multi-grained data

Document Type: Article

Authors

- School of Computer Science and Technology, Shandong University of Finance and Economics, Jinan, 250014, Shandong, China
- Shandong Provincial Key Laboratory of Digital Media Technology, Shandong University of Finance and Economics, Jinan, 250014, Shandong, China

Abstract

Traditional cross-media retrieval methods mainly focus on coarse-grained data that reflect global characteristics, while ignoring fine-grained descriptions of local details. Moreover, traditional methods cannot accurately describe the correlations between the anchor and irrelevant data. To address these problems, this paper proposes a dual-framework method that fuses coarse-grained and fine-grained features and introduces a multi-margin triplet loss: 1) Framework I, a multi-grained data fusion framework based on a Deep Belief Network, and 2) Framework II, a multi-modality data fusion framework based on the multi-margin triplet loss function. In Framework I, coarse-grained and fine-grained features are fused by a joint Restricted Boltzmann Machine and then fed into Framework II. In Framework II, we propose a novel multi-margin triplet loss: data that differ from the anchor in modality or semantic category are pushed away from it by margins of different sizes. Experimental results show that the proposed method outperforms competing cross-media retrieval methods on several datasets, and ablation experiments verify that both the multi-grained fusion strategy and the multi-margin triplet loss function are effective.
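To make the loss concrete, the sketch below is a minimal, hypothetical PyTorch rendering of a multi-margin triplet loss of the kind described above: each irrelevant sample is pushed away from the anchor by its own margin, with a larger margin assumed for samples that differ from the anchor in both modality and semantic category. The function name, the margin values, and the rule assigning margins are illustrative assumptions, not the paper's actual formulation.

    import torch
    import torch.nn.functional as F

    def multi_margin_triplet_loss(anchor, positive, negatives, margins):
        # anchor:    (D,)   embedding of the query sample
        # positive:  (D,)   embedding of a semantically matching sample
        # negatives: (K, D) embeddings of K irrelevant samples
        # margins:   (K,)   one margin per negative; assumed larger when the
        #                   negative differs from the anchor in both modality
        #                   and semantic category, smaller otherwise
        d_ap = F.pairwise_distance(anchor.unsqueeze(0), positive.unsqueeze(0))           # shape (1,)
        d_an = F.pairwise_distance(anchor.unsqueeze(0).expand_as(negatives), negatives)  # shape (K,)
        # hinge: each negative must sit at least its own margin farther
        # from the anchor than the positive does
        return F.relu(d_ap - d_an + margins).mean()

    # toy usage with random 64-d embeddings and two placeholder margin levels
    a = torch.randn(64, requires_grad=True)
    p = torch.randn(64, requires_grad=True)
    negs = torch.randn(3, 64, requires_grad=True)
    loss = multi_margin_triplet_loss(a, p, negs, torch.tensor([0.2, 0.5, 0.5]))
    loss.backward()

In an actual training loop the embeddings would come from the modality-specific networks and the gradient would update their weights; the margin levels 0.2 and 0.5 are placeholders for whatever values the margin-assignment rule produces.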

Volume 30, Issue 5
Transactions on Computer Science & Engineering and Electrical Engineering (D)
September and October 2023
Pages 1645-1669