On the Machine Learning Models Inversion Attack Detector

Junzhe Song, Dmitry Namiot
This article is devoted to the detection of adversarial attacks on machine learning models. In the most general case, adversarial attacks are special data changes at one of the stages of the machine learning pipeline, which are designed to either prevent the operation of the machine learning system, or vice versa, to achieve the desired result for the attacker. But there is also a form of attack aimed at extracting non-public information from machine learning models. These include model inversion attacks. These types of attacks pose a threat to the use of machine learning as a service (MLaaS). Machine learning models accumulate a lot of redundant information during training, and the possibility of exposing this data while using the model can come as an unpleasant surprise. Keywords: machine learning, adversarial attacks, model inversion 1. Introduction Machine learning systems (and, at least now, it is a synonym for artificial intelligence systems) depend on data. This tautological statement leads, in fact, to quite serious consequences. Changing the data then, generally speaking, changes the performance of the model. Purposeful data changes are attacks on machine learning models [1]. But the models themselves can be directly affected during attacks. For example, weights can change on the fly, malicious code can be loaded into weights, etc. Adversarial attacks, which are possible for any discriminant machine learning models, pose a great threat to machine learning systems, since they do not guarantee the results and quality of the system. And such guarantees are, for example, mandatory for the use of a machine learning (artificial intelligence) system in critical areas such as avionics, automatic driving, special applications, etc. [2 , 3 ]. An attack directly on the model also carries additional risks of extracting private information stored in machine learning models [4]. 
Detecting (that is, establishing the fact of) attacks aimed at extracting information is the subject of this work. The remainder of the article is structured as follows. Section 2 focuses on attacks on intellectual property. Section 3 presents the developed algorithm for detecting extraction attacks. Section 4 concludes the article.
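To make the threat concrete before turning to detection, the idea of a model inversion attack can be sketched in a few lines. The sketch below is illustrative and not taken from the article: it uses a hypothetical toy softmax classifier queried only through its confidence scores (the MLaaS setting), and recovers a representative input for a chosen class by gradient ascent on the class confidence, in the spirit of Fredrikson-style inversion.

```python
import numpy as np

# Hypothetical target model (toy dimensions: 2 classes, 4 features).
# The attacker is assumed to see only the confidence scores returned
# by target_confidences(), as in an MLaaS query interface.
rng = np.random.default_rng(0)
W = rng.normal(size=(2, 4))
b = np.zeros(2)

def target_confidences(x):
    """Black-box-style query: softmax class confidences for input x."""
    z = W @ x + b
    e = np.exp(z - z.max())
    return e / e.sum()

def invert_class(c, steps=200, lr=0.5, eps=1e-4):
    """Model inversion sketch: gradient ascent on the confidence of
    class c, using numerical gradients so that only queries to the
    model's output are needed."""
    x = np.zeros(4)
    for _ in range(steps):
        grad = np.zeros_like(x)
        for i in range(len(x)):
            d = np.zeros_like(x)
            d[i] = eps
            # central-difference estimate of d confidence / d x_i
            grad[i] = (target_confidences(x + d)[c]
                       - target_confidences(x - d)[c]) / (2 * eps)
        x += lr * grad
    return x

recovered = invert_class(0)
# The recovered input drives the confidence of class 0 close to 1,
# i.e. it is a "representative" input the model associates with that class.
print(target_confidences(recovered)[0])
```

In a real attack the recovered input can leak training-data characteristics (the classic example is reconstructing a recognizable face from a face-recognition model); the detector discussed later targets exactly this kind of repeated, optimization-driven querying.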