DETERMINATION OF THE CONTENT SIMILARITY OF TEXT DOCUMENTS USING A TWO-STAGE ANALYSIS METHOD

W.S. Rogoza, Ye.A. Borysenko

Èlektron. model. 2025, 47(5):23-39

https://doi.org/10.15407/emodel.47.05.023

ABSTRACT

The problem of detecting similar pairs of text documents in large document collections, which users of search engines routinely face, is considered. The weaknesses of traditional approaches to this problem are identified, chiefly their heavy consumption of computer resources (memory and computing time). This cost becomes especially critical on mass-market computers with limited resources when the number of document pairs to be analyzed for semantic identity reaches billions. In such cases, comparing documents by content, which requires loading pairs of documents from disk into RAM, can take tens of hours of machine time, which may be unacceptable for a researcher.
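The scale mentioned above can be illustrated with a short calculation. This is only a sketch: the collection size and the per-pair comparison time are hypothetical values chosen for illustration, not figures from the paper.

```python
def pair_count(n_docs: int) -> int:
    """Number of unordered document pairs in a collection of n_docs documents."""
    return n_docs * (n_docs - 1) // 2


def hours_for_all_pairs(n_docs: int, seconds_per_pair: float) -> float:
    """Total time in hours to compare every pair at a fixed cost per pair."""
    return pair_count(n_docs) * seconds_per_pair / 3600.0


# Hypothetical inputs (assumptions, not taken from the paper):
N_DOCS = 100_000          # size of the document collection
SECONDS_PER_PAIR = 20e-6  # assumed average cost of one pairwise comparison

pairs = pair_count(N_DOCS)                            # 4,999,950,000 pairs
hours = hours_for_all_pairs(N_DOCS, SECONDS_PER_PAIR)  # about 28 hours
print(f"{pairs:,} pairs, about {hours:.0f} hours")
```

Because the number of pairs grows quadratically with the collection size, doubling the collection roughly quadruples the comparison time under these assumptions.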

A new approach to determining the semantic similarity of documents is proposed. It consists of two stages: 1) approximate determination of the semantic similarity of documents using simplified methods and 2) approximate evaluation of the lexical similarity of documents using the MinHash signature method. The two-stage solution achieves a compromise between the accuracy of the analysis and resource costs. The theoretical justification of the proposed approach and the experimental data presented confirm its effectiveness and give reason to believe that it can serve as a basis for developing multi-stage methods for identifying semantically similar text documents.
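The second stage relies on MinHash signatures, which can be sketched as follows. This is a generic textbook illustration, not the authors' implementation: the word-shingle size, the number of hash functions, the BLAKE2b base hash, and all function names are assumptions made for the example. The first stage (the simplified semantic filter) is not specified in the abstract and is not shown.

```python
import hashlib
import random

PRIME = (1 << 61) - 1  # large Mersenne prime used as the hash modulus


def shingles(text: str, k: int = 3) -> set:
    """Split text into a set of overlapping k-word shingles (k is an assumption)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def stable_hash(s: str) -> int:
    """Deterministic 64-bit hash of a string (Python's built-in hash() is salted)."""
    digest = hashlib.blake2b(s.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def make_hash_params(num_hashes: int, seed: int = 0) -> list:
    """Random (a, b) coefficients for the hash family h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def minhash_signature(items: set, params: list) -> list:
    """For each hash function, keep the minimum hash value over all items."""
    hashed = [stable_hash(x) % PRIME for x in items]
    return [min((a * h + b) % PRIME for h in hashed) for a, b in params]


def estimate_jaccard(sig1: list, sig2: list) -> float:
    """Fraction of matching signature positions approximates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)


def exact_jaccard(a: set, b: set) -> float:
    """Exact Jaccard similarity, shown only for comparison with the estimate."""
    return len(a & b) / len(a | b)


params = make_hash_params(num_hashes=128)
doc_a = "the quick brown fox jumps over the lazy dog near the quiet river bank"
doc_b = "the quick brown fox jumps over the lazy dog near the noisy river bank"
sig_a = minhash_signature(shingles(doc_a), params)
sig_b = minhash_signature(shingles(doc_b), params)
print(estimate_jaccard(sig_a, sig_b), exact_jaccard(shingles(doc_a), shingles(doc_b)))
```

The resource saving comes from comparing short fixed-length signatures instead of full documents: once the signatures are computed, each pairwise comparison touches only `num_hashes` integers per document.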

KEYWORDS

information search, unstructured data, semantic and lexical similarity of documents.

