W.S. Rogoza, Ye.A. Borysenko
Èlektron. model. 2025, 47(5):23-39
https://doi.org/10.15407/emodel.47.05.023
ABSTRACT
The problem of detecting pairs of similar text documents in large document collections, which users of search engines commonly face, is considered. The weaknesses of traditional approaches to this problem are identified, chiefly their heavy consumption of computing resources (memory and computing time). This cost is especially burdensome for mass-market computers with limited resources when the number of document pairs to be checked for semantic identity reaches billions. It is noted that in such cases, comparing documents by content, which requires loading documents pairwise from disk into RAM, can take tens of hours of machine time, which may be unacceptable for a researcher.
A new approach to determining the semantic similarity of documents is proposed. It consists of two stages: 1) approximate determination of the semantic similarity of documents using simplified methods and 2) approximate evaluation of the lexical similarity of documents using the MinHash signature method. The two-stage solution achieves a compromise between the accuracy of the analysis and resource costs. The theoretical justification of the proposed approach and the experimental data presented confirm its effectiveness and suggest that it can serve as a basis for developing multi-stage methods for identifying semantically similar text documents.
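The second stage can be illustrated with a minimal sketch of MinHash signatures. It is not the authors' implementation: the word-level shingling, the shingle length `k`, the number of hash functions `num_hashes`, and the use of salted BLAKE2b as the hash family are all assumptions made for illustration. The idea is that the fraction of positions where two signatures agree estimates the Jaccard similarity of the documents' shingle sets, so only short signatures need to be compared rather than full texts.

```python
import hashlib


def shingles(text: str, k: int = 3) -> set[str]:
    """Return the set of k-word shingles of a text (assumed tokenization: lowercase, split on whitespace)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed: int, item: str) -> int:
    """Deterministic 64-bit hash of an item; the seed selects one function of the hash family."""
    digest = hashlib.blake2b(
        item.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items: set[str], num_hashes: int = 128) -> list[int]:
    """For each hash function, keep the minimum hash value over all items."""
    if not items:
        raise ValueError("cannot build a signature for an empty set")
    return [min(_hash(seed, s) for s in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of signature positions that agree; estimates Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def jaccard(a: set[str], b: set[str]) -> float:
    """Exact Jaccard similarity, used here only to check the estimate."""
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    doc_a = shingles("the quick brown fox jumps over the lazy dog near the river bank", k=2)
    doc_b = shingles("the quick brown fox jumps over the lazy cat near the river bank", k=2)
    sig_a = minhash_signature(doc_a)
    sig_b = minhash_signature(doc_b)
    print("exact:", jaccard(doc_a, doc_b))
    print("estimate:", estimate_jaccard(sig_a, sig_b))
```

Each signature has a fixed length regardless of document size, so it can be computed once per document and kept in RAM. This is the kind of saving that makes a cheap lexical pass feasible when the number of document pairs is very large.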
KEYWORDS
information search, unstructured data, semantic and lexical similarity of documents.