Secure Multiparty Computation for Privacy Preserving Data Mining
The increasing use of data mining tools in both the public and private sectors raises concerns regarding the potentially sensitive nature of much of the data being mined. The utility to be gained from widespread data mining seems to come into direct conflict with an individual’s need and right to privacy. Privacy preserving data mining solutions aim at achieving the somewhat paradoxical property of enabling a data mining algorithm to use data without ever actually “seeing” it. Thus, the benefits of data mining can be enjoyed, without compromising the privacy of concerned individuals.
This problem is actually a special case of a long-studied problem in cryptography: secure multiparty computation. This problem deals with a setting where a set of n parties with private inputs x1,…,xn wish to jointly compute a function f of their inputs. Loosely speaking, this joint computation should have the property that the parties learn the correct output y=f(x1,…,xn) and nothing else, and this should hold even if some of the parties maliciously attempt to obtain more information. In the above hospital database example, the input xi represents the ith hospital’s confidential patient database, and the function f is a data mining algorithm that is run on the union of all of the xi’s.
In this short chapter, we will provide a succinct overview of the paradigms and techniques of secure multiparty computation, and how they can be applied to the problem of privacy preserving data mining. Our main focus will be on how security is formally defined, why this definitional approach is adopted, and what issues should be considered when defining security for privacy preserving data mining problems. We will also present a brief survey of known results for secure multiparty computation, along with a discussion of the ramifications of these results on privacy preserving data mining. Finally, we will discuss how we see the current state of knowledge with respect to privacy preserving data mining, while highlighting future directions of research for this question. Due to space constraints, the treatment in this chapter is both brief and informal. For more details, we refer the reader to (Goldreich, 2003) for a survey on cryptography and cryptographic protocols (i.e., secure multiparty computation).
SECURITY DEFINITIONS FOR SECURE COMPUTATION
The aim of a secure multiparty computation task is for the participating parties to securely compute some function of their distributed and private inputs. A key question that arises here is what does it mean for a computation to be secure? One way of approaching this question is to provide a list of security properties that should be preserved. The first such property that often comes to mind is that of privacy or confidentiality. A naïve attempt at formalizing privacy would be to require that each party learns nothing about the other parties’ inputs, even if it behaves maliciously. However, such a definition is usually unattainable because the defined output of the computation typically reveals some information on other parties’ inputs. (For example, a decision tree computed on two distributed databases reveals some information about both databases.) Therefore, the privacy requirement is usually formalized by saying that the only information learned by the parties in the computation (again, even by those who behave maliciously) is that specified by the function output. Although privacy is a primary security property, it rarely suffices. Another important property is that of correctness; this states that the parties’ output is really that defined by the function (if correctness is not guaranteed, then a malicious party may be able to receive the specified decision tree while the honest party receives a tree that is modified to provide misleading information). A central question that arises in this process of defining security properties is: when is our list of properties complete? This question is, of course, application-dependent and this essentially means that for every new problem, the process of deciding which security properties are required must be re-evaluated. We stress that coming up with the right list of properties is often very difficult, and it can take many years until we are convinced that a definition truly captures the security requirements that are needed. Furthermore, an incomplete of properties may easily lead to real security failures.
The Ideal/Real Model Paradigm
Due to these difficulties, the standard definitions of secure computation today follow an alternative approach called the ideal/real model paradigm. This has been the dominant paradigm in the investigation of secure computation in the last fifteen years; we refer the reader to (Canetti, 2000) for the formal definition and references therein for related definitional work. Loosely speaking, this paradigm defines the security of a real protocol by comparing it to an ideal computing scenario in which the parties interact with an external trusted and incorruptible party. In this ideal execution, the parties all send their inputs to the trusted party (via ideally secure communication lines). The trusted party then computes the function on these inputs and sends each party its specified output. Such a computation embodies the goal of secure computation. Specifically, it is clear that the privacy property as defined above holds, because the only message that a party receives (even one who behaves maliciously) is the output. Likewise, since the trusted party is incorruptible, correctness is also guaranteed to hold. In addition to the fact that the above and other properties are preserved in an ideal execution, the simplicity of the ideal model provides an intuitively convincing security guarantee. For example, notice that the only message that a party sends in an ideal execution is its input, and so the only power that a corrupted party has is to choose its input (something which is typically legitimate anyway).
So far, we have defined an ideal execution in an ideal world. However, in the real world, there is no external party that can be trusted by all parties. Rather, the parties run some protocol amongst themselves without any outside help. Nevertheless, we would like a real protocol to somehow “emulate” an ideal execution. That is, we say that a real protocol that is run by the parties (in a world where no trusted party exists) is secure, if no adversary can do more harm in a real execution than in an execution that takes place in the ideal world. Stated differently, for any adversary carrying out a successful attack on a real protocol, there exists an adversary that successfully carries out the same attack in the ideal world. This suffices because, as we have seen, no successful attacks can be carried out in an ideal execution. Thus, no successful attacks can be carried out on the real protocol, implying that it is secure. See below for a diagram of the real and ideal models.
We stress the fact that this definition imposes universal quantification over the real adversary. That is, security is guaranteed for every adversary carrying out any feasible attack (within the parameters defined for the adversary, as discussed next). This is a very important component of the security definition.
Defining the Model
The above informal description of the ideal/real model paradigm expresses the intuition that a real execution should behave just like an ideal execution. In order to obtain a complete and formal definition, it is crucial that both the ideal and real models are fully defined. Among other things, this involves defining the real network model and the adversary’s power, including any assumptions on its behavior. We stress that a secure protocol only provides real-world security guarantees if the mathematical definition of the real computation and adversarial models accurately reflects the real network and adversarial threats that exist.
We will briefly discuss a number of parameters that are considered when defining the network model and the adversary (this list is far from comprehensive). Two central considerations that arise when defining the network model relate to the communication channels and whether or not any trusted setup phase is assumed. It is typically assumed that all parties are connected via point-to-point authenticated channels. This assumption that the channels are authenticated (meaning that the adversary cannot modify messages sent between honest parties) can be implemented assuming a public-key infrastructure for digital signatures. Thus, this is actually an assumption that there is a trusted setup phase for setting up the infrastructure. Other parameters to consider are whether the communication over the network is synchronous or asynchronous, and whether or not messages that are sent between honest parties are guaranteed to arrive. Finally, the question of what (if any) other protocols are running simultaneously in the network must also be addressed. This issue is referred to as protocol composition and is currently a very active research subject in the cryptographic community.
When defining the adversary, a number of possibilities arise. We now describe some of these:
1. Complexity: Given that the widely accepted notion of efficient or feasible computation is probabilistic polynomial-time, the natural choice is to limit an adversary to this complexity. However, there are also protocols that are secure against unbounded adversaries.
2. Number of corrupted parties: In a general multiparty setting, we assume that the adversary controls some subset of the participating parties; these parties are called corrupted. The allowed size of this subset must also be defined (typical choices are assuming that less than one third or one half are corrupted, and not assuming any limitation on the number of corrupted parties).
3. Corruption strategy: This parameter relates to whether or not the adversary is static (meaning that the set of corrupted parties is fixed ahead of time), or adaptive (meaning that the adversary can “break into” parties during the protocol execution).
4. Allowed adversarial behavior: In our discussion above, we implicitly referred to malicious adversaries who are allowed to arbitrarily deviate from the protocol specification. However, the adversary’s behavior is sometimes restricted. For example, a semi-honest adversary is assumed to follow the protocol but may attempt to learn secret information from the messages that it receives.
The above very partial list of parameters for defining the adversary begs the question: how do we decide which adversarial model to take? A conservative approach is to take the most powerful adversary possible. However, being overly conservative comes at a price. For example, it is impossible to obtain security for unbounded adversaries in the case that half or more of the parties are corrupted. Furthermore, it is often the case that more efficient protocols can be constructed for weaker adversaries (specifically, highly efficient protocols for many tasks are known for the semi-honest adversarial model, but this is not the case for the malicious model). In general, a good approach is to consider malicious polynomial-time adversaries who may adaptively corrupt any number of the participants. However, in some cases, the semi-honest adversarial model is reasonable. For example, in the medical database example provided in the Introduction, the hospitals are not believed to be malicious; rather, the law prevents them from revealing confidential patient data. In such a case, the protection provided by semi-honest adversarial modeling is sufficient. We stress, however, that in many cases the semi-honest model is not realistic, and malicious adversaries must be considered.
Two central guiding principles in defining security are that: (a) the definition must accurately, and conservatively, model the real-world network setting and adversarial threat (this can differ per application and setting), and (b) all aspects of the model must be fully and explicitly defined. These two conditions are necessary for us to be confident that a given mathematical definition of security truly implies that protocols executed in the real world will withstand all adversarial attacks.
RESULTS AND APPLICATIONS
In this section, we will describe some general results regarding the feasibility of achieving secure multiparty computation. We will then discuss a few of the known results for the specific problem of privacy preserving data mining.
The Feasibility of Secure Multiparty Computation
The aforementioned security definition provides very strong guarantees. An adversary attacking a protocol that is secure is essentially limited to choosing its input (because this is all that it can do in the ideal model). An important question is, can this definition actually be achieved, and if yes under what conditions? A fundamental result of the theory of cryptography states that under certain parameters and assumptions, any efficient multiparty functionality can be securely computed. This result is comprised of a number of different theorems, depending on the model and the number of corrupted parties. We will describe the basic results for the stand-alone model (where only a single protocol execution is considered), and the computational setting (where the adversary is limited to polynomial-time). The basic results for the information-theoretic setting are proven in (Ben-Or, Goldwasser and Wigderson, 1988) and (Chaum, Crepeau and Damgard, 1988).
The first basic theorem states that in the case that a majority of the parties are honest (i.e., not corrupted), any multiparty functionality can be securely computed in the presence of malicious, static adversaries (Yao, 1986; Goldreich, Micali and Widgderson, 1986). Extensions to the case of adaptive adversaries can be found in (Beaver and Haber, 1992; Canetti, Feige, Goldreich and Naor, 1996).
The second basic theorem relates to the case that any number of parties may be corrupted, and so there is no honest majority. In this case, it is impossible to construct protocols that meet the definition as described above. The reason for this is that the definition implies that all parties receive output together; however, this cannot be achieved without an honest majority (Cleve, 1986). We therefore explicitly relax the security definition and allow the adversary to prevent the honest party from receiving its output, even in the ideal model; this relaxed definition is called security with abort. As above, it has been shown that even when any number of parties may be corrupted, any multiparty functionality can be securely computed in the presence of malicious, static adversaries (Yao, 1986; Goldreich, Micali and Widgderson, 1986).
The above results all assume computational hardness assumptions that follow, for example, from the assumption that factoring large numbers is hard. A full proof of the results of (Goldreich, Micali and Widgderson, 1986), for both the honest majority and no honest majority cases, can be found in (Goldreich, 2004).
As we have mentioned, the above results all refer to the stand-alone model of computation, where it is assumed that the secure protocol being analyzed is run once in isolation. Feasibility results have also been shown for the case of protocol composition where many different protocols run concurrently; for example, see (Canetti, 2001) and (Canetti, Lindell, Ostrovsky and Sahai, 2002). A brief survey on known results for the setting of composition can be found in (Lindell, 2003).
The importance of the above results to the question of privacy preserving data mining is that they demonstrate that under an appropriate choice of parameters and assumptions, any privacy preserving data mining problem can be solved, in principle. Therefore, the challenge now is to construct protocols that are efficient enough for practical use; this is especially a challenge because input-size is assumed to be huge. We note that achieving this goal may involve finding new definitions that are more relaxed than those described in the previous section, yet still accurately model the real security concerns. This is a very non-trivial research task due to the subtle nature of security and security definitions.
Secure Protocols for Privacy Preserving Data Mining
The first paper to take the classic cryptographic approach to privacy
preserving data mining was (Lindell and Pinkas, 2002). The paper presents an efficient protocol for
the problem of distributed decision tree learning; specifically, how to
securely compute an ID3 decision tree from two private databases. The model
considered in the paper was that of semi-honest adversaries only. This
approach was adopted in a relatively large number of works that demonstrate
semi-honest protocols for a wide variety of data mining algorithms; see, for
Other work on the problem of privacy preserving data mining has followed what is often called the “data perturbation” approach, as introduced by (Agrawal & Srikant, 2000). We remark that the development of rigorous security definitions that appropriately model security in settings considered by this approach seems to be a very difficult task. The question of whether or not it is possible to provide a definition that captures a realistic (and explicitly stated) adversarial threat model, and can be met, is an interesting open question. We remark that naïve definitions of security have been shown to be completely insecure; see (Dinur and Nissim, 2003) for just one example.
The history of cryptography shows very clearly that when protocols are not proven secure, or when the adversarial models are not explicitly defined, real attacks are very often discovered. Furthermore, the task of coming up with mathematical definitions that accurately model real adversarial threats is a very difficult task. Indeed, slight modifications to existing definitions can render them completely useless; see (Canetti, 2000) for some discussions on this issue. In this short article, we have described the real/ideal model paradigm for defining security. This definitional approach has been the fruit of many years of cryptographic research, and protocols that meet this definition provide very powerful security guarantees.
We stress that the success of privacy preserving data mining may depend on the ability to find new definitions that provide both the rigorous security guarantees that are necessary and can be met by highly efficient protocols. This is perhaps the ultimate challenge of this new and exciting field of research.
Agrawal, R., and Srikant, R. (2000). Privacy Preserving Data Mining. ACM
SIGMOD International Conference on Management of Data, SIGMOD’00,
Beaver, D. & Haber, S. (1992).
Cryptographic Protocols Provably Secure Against Dynamic Adversaries. Advances
in Cryptology, EUROCRYPT’92.
Ben-Or, M., Goldwasser, S.
& Wigderson, A. (1988). Completeness
Theorems for Non-Cryptographic Fault-Tolerant Distributed Computation. The 20th Annual ACM Symposium on Theory of Computing,
Canetti, R. (2000). Security and Composition of Multiparty Cryptographic Protocols. Journal of Cryptology, 13 (1), 143-202.
Canetti, R. (2001). Universally Composable Security: A New Paradigm for Cryptographic
Protocols. The 42rd Annual IEEE Symposium on the
Foundations of Computer Science, FOCS’01.
Canetti, R., Lindell, Y., Ostrovsky,
R. & Sahai, A. (2002). Universally
Composable Two-Party and Multi-Party Computation.
The 34th Annual ACM Symposium on Theory of Computing,
Canetti, R., Feige, U., Goldreich,
O. & Naor, M. (1996). Adaptively Secure
Multi-Party Computation. The 28th Annual ACM
Symposium on Theory of Computing, STOC’96.
Chaum, D., Crepeau, C. & Damgard,
Cleve, R. (1986). Limits on the Security of Coin
Flips when Half the Processors are Faulty. The 18th Annual
ACM Symposium on Theory of Computing, STOC’86.
Goldreich, O. (2003). Cryptography and Cryptographic Protocols. Distributed Computing, 16 (2), 177-199.
Goldreich, O. (2004). Foundations
of Cryptography, Volume 2 – Basic Applications.
Goldreich, O., Micali,
S. & Wigderson A. (1987). How
to Play any Mental Game – A Completeness Theorem for Protocols with Honest
Majority. The 19th Annual ACM Symposium
on the Theory of Computing, STOC’87.
Lindell, Y. (2003). Composition of Secure Multi-Party Protocols – A Comprehensive Study. Springer-Verlag.
Lindell, Y. & Pinkas, B. (2002). Privacy Preserving Data Mining. Journal of Cryptology,
15 (3), 177-206. (An extended abstract appeared in Advances in Cryptology, CRYPTO’00.