Real-time Method of Accurate Unique IPs Counting Across High Number of Distinct Dimensions and distinct Time Frames for Big Data Systems

A.V. Valialkin
VertaMedia Company
(224 West 35th St., Suite 1102-5, New York, NY 10001, USA,
This email address is being protected from spambots. You need JavaScript enabled to view it.),
O.I. Konashevych, post-graduate
Pukhov Institute for Modelling in Energy Engineering
(15, General Naumov St., Kyiv, 03164, Ukraine,
This email address is being protected from spambots. You need JavaScript enabled to view it.)

Èlektron. model. 2018, 38(3):63-74
https://doi.org/10.15407/emodel.38.03.063

ABSTRACT

The article describes a method which allows counting unique IP addresses within 10 bln of system events per day across high number of distinct dimensions (tuples). Log-based and probability-based methods showed unsatisfactory results. The proposed method allows avoiding excessive resource usage (RAM, CPU and persistent storage) as it appeared in a raw logs method and a probability method of counting. The method also avoids high statistic error for low cardinality as it appeared in a probability method. The main idea is to count unique IP addresses in distinct tuples in real time using RAM for short data interval processing, then flushing it to persistent
storage, using merge algorithms to process and store unique IP counts in ordinary database from 5 minute, hourly, daily, weekly and monthly interval files.

KEYWORDS

probability method, statistics, information technologies, queueing theory, big data, statistical process control.

REFERENCES

1. Erdös, P. (1959), available at: http://cms.math.ca/10.4153/CJM- 1959-003-9 (accessed April 4, 2016).
2. Harchol-Balter, M. (2013), Performance Modeling and Design of Computer Systems. Queueing Theory in Action, Cambridge University Press, New York, USA.
3. Cox, D.R. and Isham, V.I. (1980), Point Processes, Chapman & Hall, London, UK. 
4. Durrett, R. (2010), Probability: Theory and Examples (4th ed.) Cambridge University Press, Camridge, USA.
5. Available at: https://github.com/clarkduvall/hyperloglog (accessed October 21, 2015).
6. Flajolet, P., Fusy, E., Gandouet, O. and Meunier, F. (2007), HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm, Proceedings of the 2007 International Conference on the Analysis of Algorithm(AOFA ’07), available at: http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf (accessed April 4, 2016).
7. Knuth, D. (1998), “Section 5.2.4: Sorting by Merging”, The Art of Computer Programming 3 (2nd ed.), Addison Wesley, USA Sorting and Searching.
8. Dean, J. and Ghemawat, S. (2004), MapReduce: Simplified Data Processing on Large Clusters, available at: http://static.googleusercontent.com/media/research.google.com/es/us/archive/mapreduce-osdi04.pdf (accessed April 4, 2016).
9. Shewhart, W. (1931), Economic Control of Quality of Manufactured Product, D.Van Nostrand Company, New York, USA. ISBN 0-87389-076-0.

Full text: PDF (in Russian)