Secure Multiparty Computation for Privacy Preserving Data Mining

Yehuda Lindell

INTRODUCTION

The increasing use of data mining tools in both the public and private
sectors raises concerns regarding the potentially sensitive nature of much of the
data being mined. The utility to be gained from widespread data mining seems to
come into direct conflict with an individual’s need and right to privacy. Privacy preserving data mining solutions aim at achieving the
somewhat paradoxical property of enabling a data mining algorithm to use data *without*
ever actually “seeing” it. Thus, the benefits of data mining can be
enjoyed without compromising the privacy of the individuals concerned.

A classical example of where privacy preserving data mining solutions can be of great importance is in the field of medical research. Consider the case that a number of different hospitals wish to jointly mine their patient data, for the purpose of medical research. Furthermore, let us assume that privacy policy and law prevents these hospitals from ever pooling their data or revealing it to each other due to the confidentiality of patient records. In such a case, classical data mining solutions cannot be used, with the possible result being a slowing of the advancement of medical knowledge. Fortunately, privacy preserving data mining solutions enable the hospitals to compute the desired data mining algorithm on the union of their databases, without ever pooling or revealing their data. Indeed, the only information (provably) learned by the different hospitals is the output of the data mining algorithm.

This problem is actually a special case of a long-studied problem in
cryptography: *secure multiparty computation*. This problem deals with a
setting where a set of *n* parties with private inputs *x*_{1}*,…,x _{n}*
wish to jointly compute a function *f*(*x*_{1}*,…,x _{n}*) of their inputs, while preserving certain security properties such as privacy and correctness.
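To make the setting concrete, the following toy sketch shows *n* parties learning the sum of their private inputs via additive secret sharing, so that no individual share reveals anything about any single input. This is an illustration for the semi-honest model only; the modulus `Q` and function names are hypothetical choices, not part of any specific protocol in the literature.

```python
import secrets

Q = 2**61 - 1  # modulus for share arithmetic (hypothetical parameter choice)

def share(x, n):
    """Split a private input x into n additive shares that sum to x mod Q.
    Any n-1 shares are uniformly random and reveal nothing about x."""
    shares = [secrets.randbelow(Q) for _ in range(n - 1)]
    shares.append((x - sum(shares)) % Q)
    return shares

def secure_sum(inputs):
    """Each party shares its input among all parties; party j locally adds
    the shares it holds and publishes only that partial sum, from which
    everyone reconstructs the total (and learns nothing more)."""
    n = len(inputs)
    all_shares = [share(x, n) for x in inputs]
    partial = [sum(all_shares[i][j] for i in range(n)) % Q for j in range(n)]
    return sum(partial) % Q

print(secure_sum([10, 20, 30]))  # 60
```

Note that each published partial sum is itself a random-looking value; only the final combination equals the sum of the inputs.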

In this short chapter, we will provide a succinct overview of the paradigms and techniques of secure multiparty computation, and how they can be applied to the problem of privacy preserving data mining. Our main focus will be on how security is formally defined, why this definitional approach is adopted, and what issues should be considered when defining security for privacy preserving data mining problems. We will also present a brief survey of known results for secure multiparty computation, along with a discussion of the ramifications of these results on privacy preserving data mining. Finally, we will discuss how we see the current state of knowledge with respect to privacy preserving data mining, while highlighting future directions of research for this question. Due to space constraints, the treatment in this chapter is both brief and informal. For more details, we refer the reader to (Goldreich, 2003) for a survey on cryptography and cryptographic protocols (i.e., secure multiparty computation).

SECURITY DEFINITIONS FOR SECURE COMPUTATION

The aim of a secure multiparty computation task is for the
participating parties to *securely* compute some function of their
distributed and private inputs. A key question that arises here is what does it
mean for a computation to be *secure*? One way of approaching this
question is to provide a list of *security properties* that should be
preserved. The first such property that often comes to mind is that of *privacy*
or *confidentiality*. A naïve attempt at formalizing privacy would be to
require that each party learns nothing about the other parties’ inputs, even if
it behaves maliciously. However, such a definition is usually unattainable
because the defined output of the computation typically reveals some
information on other parties’ inputs. (For example, a decision tree computed on
two distributed databases reveals some information about both databases.)
Therefore, the privacy requirement is usually formalized by saying that the
only information learned by the parties in the computation (again, even by
those who behave maliciously) is that specified by the function output.
Although privacy is a primary security property, it rarely suffices. Another
important property is that of *correctness*; this states that the parties’
output is really that defined by the function (if correctness is not
guaranteed, then a malicious party may be able to receive the specified
decision tree while the honest party receives a tree that is modified to
provide misleading information). A central question that arises in this process
of defining security properties is: *when is our list of properties complete*?
This question is, of course, application-dependent and this essentially means
that for every new problem, the process of deciding which security properties
are required must be re-evaluated. We stress that coming up with the right list
of properties is often very difficult, and it can take many years until we are
convinced that a definition truly captures the security requirements that are
needed. Furthermore, an incomplete list of properties may easily lead to real
security failures.

**The Ideal/Real Model Paradigm**

Due to these difficulties, the standard definitions of secure
computation today follow an alternative approach called the *ideal/real model
paradigm*. This has been the dominant paradigm in the investigation of
secure computation in the last fifteen years; we refer the reader to (Canetti, 2000) for the formal definition and references
therein for related definitional work. Loosely speaking, this paradigm defines
the security of a real protocol by comparing it to an *ideal computing
scenario* in which the parties interact with an external trusted and
incorruptible party. In this ideal execution, the parties all send their inputs
to the trusted party (via ideally secure communication lines). The trusted
party then computes the function on these inputs and sends each party its
specified output. Such a computation embodies the goal of secure computation. Specifically,
it is clear that the privacy property as defined above holds, because the only
message that a party receives (even one who behaves maliciously) is the output.
Likewise, since the trusted party is incorruptible, correctness is also
guaranteed to hold. In addition to the fact that the above and other properties
are preserved in an ideal execution, the simplicity of the ideal model provides
an intuitively convincing security guarantee. For example, notice that the only
message that a party sends in an ideal execution is its input, and so the only
power that a corrupted party has is to choose its input (something which is typically
legitimate anyway).
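The ideal model is simple enough to be phrased in a few lines of code. The sketch below is illustrative only; the `union_size` function and the hospital inputs are hypothetical examples, not part of any actual protocol.

```python
def ideal_execution(f, inputs):
    """Ideal model: each party sends its input to an incorruptible trusted
    party over an ideally secure channel; the trusted party computes f and
    hands every party its specified output. The only message a (possibly
    corrupted) party ever sends is its chosen input."""
    return f(*inputs)

# Hypothetical example: two hospitals learn only the size of the union of
# their patient-ID sets, and nothing else about each other's database.
def union_size(db1, db2):
    return len(set(db1) | set(db2))

print(ideal_execution(union_size, [[101, 102, 103], [103, 104]]))  # 4
```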

So far, we have defined an ideal execution in an ideal world. However, in
the *real world*, there is no external party that can be trusted by all
parties. Rather, the parties run some protocol amongst themselves without any outside
help. Nevertheless, we would like a real protocol to somehow “emulate” an ideal
execution. That is, we say that a real protocol that is run by the parties (in a world where no trusted party exists) is **secure** if no adversary can do more harm in a real execution than in an ideal execution. More exactly, for every adversary attacking the real protocol, there exists an adversary attacking the ideal execution such that the outcomes of the two executions are essentially the same. Since an adversary can do almost no harm in an ideal execution, the same is then guaranteed for the real protocol.

We stress the
fact that this definition imposes *universal quantification* over the real
adversary. That is, security is guaranteed for *every* adversary carrying
out *any feasible attack* (within the parameters defined for the
adversary, as discussed next). This is a very important component of the
security definition.
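Loosely, in the standard simulation-based notation (following Canetti, 2000), the requirement can be written as follows, where *π* is the protocol, *𝒜* the real adversary, and *𝒮* the ideal adversary (the "simulator"):

```latex
\pi \text{ securely computes } f \quad \text{if} \quad
\forall\, \mathcal{A}\ (\text{PPT})\ \ \exists\, \mathcal{S}\ (\text{PPT}):
\qquad
\bigl\{\mathrm{IDEAL}_{f,\mathcal{S}}(\bar{x})\bigr\}
\;\stackrel{c}{\equiv}\;
\bigl\{\mathrm{REAL}_{\pi,\mathcal{A}}(\bar{x})\bigr\}
```

Here \(\bar{x} = (x_1,\dots,x_n)\) denotes the vector of parties' inputs, the distributions are over the outputs of all parties and the adversary, and \(\stackrel{c}{\equiv}\) denotes computational indistinguishability. The universal quantifier over \(\mathcal{A}\) is exactly the point stressed above.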

**Defining the Model**

The above informal description of the ideal/real model paradigm expresses the intuition that a real execution should behave just like an ideal execution. In order to obtain a complete and formal definition, it is crucial that both the ideal and real models are fully defined. Among other things, this involves defining the real network model and the adversary’s power, including any assumptions on its behavior. We stress that a secure protocol only provides real-world security guarantees if the mathematical definition of the real computation and adversarial models accurately reflects the real network and adversarial threats that exist.

We will briefly discuss a number of parameters that are considered when
defining the network model and the adversary (this list is far from
comprehensive). Two central considerations that arise when defining the network
model relate to the communication channels and whether or not any trusted setup
phase is assumed. It is typically assumed that all parties are connected via
point-to-point *authenticated* channels. This assumption that the channels
are authenticated (meaning that the adversary cannot modify messages sent
between honest parties) can be implemented assuming a public-key infrastructure
for digital signatures. Thus, this is actually an assumption that there is a
trusted setup phase for setting up the infrastructure. Other parameters to
consider are whether the communication over the network is synchronous or
asynchronous, and whether or not messages that are sent between honest parties
are guaranteed to arrive. Finally, the question of what (if any) other
protocols are running simultaneously in the network must also be addressed.
This issue is referred to as *protocol composition* and is currently a
very active research subject in the cryptographic community.
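For intuition about the authenticated-channels assumption, the sketch below uses a shared-key MAC as a symmetric stand-in for the public-key signatures mentioned above (the key, messages, and function names are hypothetical): a message modified in transit by the adversary fails verification and is rejected.

```python
import hmac
import hashlib

def send(key, message):
    """Sender attaches a MAC tag computed over the message."""
    tag = hmac.new(key, message, hashlib.sha256).digest()
    return message, tag

def receive(key, message, tag):
    """Receiver accepts only if the tag verifies, so the adversary cannot
    modify messages sent between honest parties without detection."""
    expected = hmac.new(key, message, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("message rejected: authentication failed")
    return message

# Hypothetical shared key, established in a trusted setup phase.
key = b"shared-key-from-trusted-setup"
msg, tag = send(key, b"protocol round 1")
assert receive(key, msg, tag) == b"protocol round 1"
```

Note that a MAC only provides authenticity, not secrecy; in the models above, secrecy of honest parties' inputs is the job of the protocol itself, not of the channels.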

When defining the adversary, a number of possibilities arise. We now describe some of these:

1. **Complexity:** Given that the widely accepted notion of efficient or feasible computation is *probabilistic polynomial-time*, the natural choice is to limit an adversary to this complexity. However, there are also protocols that are secure even against computationally *unbounded* adversaries.

2. **Number of corrupted parties:** In a general multiparty setting, we assume that the adversary controls some subset of the participating parties; these parties are called corrupted. The allowed size of this subset must also be defined (typical choices are assuming that less than one third or one half of the parties are corrupted, or placing no limitation on the number of corrupted parties).

3. **Corruption strategy:** This parameter relates to whether the adversary is *static* (meaning that the set of corrupted parties is fixed ahead of time) or *adaptive* (meaning that the adversary can “break into” parties during the protocol execution).

4. **Allowed adversarial behavior:** In our discussion above, we implicitly referred to *malicious* adversaries, who are allowed to arbitrarily deviate from the protocol specification. However, the adversary’s behavior is sometimes restricted. For example, a *semi-honest* adversary is assumed to follow the protocol but may attempt to learn secret information from the messages that it receives.

The above very partial list of parameters for defining the adversary begs the question: how do we decide which adversarial model to take? A conservative approach is to take the most powerful adversary possible. However, being overly conservative comes at a price. For example, it is impossible to obtain security for unbounded adversaries in the case that half or more of the parties are corrupted. Furthermore, it is often the case that more efficient protocols can be constructed for weaker adversaries (specifically, highly efficient protocols for many tasks are known for the semi-honest adversarial model, but this is not the case for the malicious model). In general, a good approach is to consider malicious polynomial-time adversaries who may adaptively corrupt any number of the participants. However, in some cases, the semi-honest adversarial model is reasonable. For example, in the medical database example provided in the Introduction, the hospitals are not believed to be malicious; rather, the law prevents them from revealing confidential patient data. In such a case, the protection provided by semi-honest adversarial modeling is sufficient. We stress, however, that in many cases the semi-honest model is not realistic, and malicious adversaries must be considered.

**Conclusions**

Two central guiding principles in defining security are that: **(a)** the definition must accurately, and
conservatively, model the real-world network setting and adversarial threat
(this can differ per application and setting), and **(b)** all aspects of
the model must be fully and explicitly defined. These two conditions are
necessary for us to be confident that a given mathematical definition of security
truly implies that protocols executed in the real world will withstand all
adversarial attacks.

RESULTS AND APPLICATIONS

In this section, we will describe some general results regarding the feasibility of achieving secure multiparty computation. We will then discuss a few of the known results for the specific problem of privacy preserving data mining.

**The Feasibility of Secure Multiparty Computation**

The aforementioned security definition provides very strong guarantees.
An adversary attacking a protocol that is secure is essentially limited to
choosing its input (because this is all that it can do in the ideal model). An
important question is, can this definition actually be achieved, and if yes
under what conditions? A fundamental result of the theory of cryptography
states that under certain parameters and assumptions, *any* efficient multiparty
functionality can be securely computed. This result is comprised of a number of
different theorems, depending on the model and the number of corrupted parties.
We will describe the basic results for the stand-alone model (where only a
single protocol execution is considered), and the computational setting (where
the adversary is limited to polynomial-time). The basic results for the
information-theoretic setting are proven in (Ben-Or, Goldwasser
and Wigderson, 1988) and (Chaum,
Crepeau and Damgard, 1988).

The first basic theorem states that in the case that a majority of the
parties are honest (i.e., not corrupted), *any* *multiparty
functionality* can be securely computed in the presence of malicious, static
adversaries (Yao, 1986; Goldreich, Micali and Wigderson, 1987). Extensions to the case of adaptive adversaries can be found in (Beaver
and Haber, 1992; Canetti, Feige, Goldreich and Naor, 1996).
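The honest-majority protocols of (Ben-Or, Goldwasser and Wigderson, 1988) are built on polynomial (Shamir-style) secret sharing. The following minimal sketch, with an illustrative prime and parameters, shows only the sharing and reconstruction steps, not the secure computation on shared values: any *t* or fewer shares reveal nothing, while any *t*+1 shares determine the secret.

```python
import secrets

P = 2**61 - 1  # prime modulus (hypothetical choice for illustration)

def shamir_share(secret, t, n):
    """Evaluate a random degree-t polynomial with constant term `secret`
    at the points 1..n; share i is the pair (i, f(i))."""
    coeffs = [secret] + [secrets.randbelow(P) for _ in range(t)]
    def f(x):
        y = 0
        for c in reversed(coeffs):  # Horner evaluation mod P
            y = (y * x + c) % P
        return y
    return [(i, f(i)) for i in range(1, n + 1)]

def reconstruct(shares):
    """Lagrange interpolation at x = 0 recovers the secret from any
    t+1 shares (inverses computed via Fermat's little theorem)."""
    secret = 0
    for xi, yi in shares:
        num, den = 1, 1
        for xj, _ in shares:
            if xj != xi:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        secret = (secret + yi * num * pow(den, P - 2, P)) % P
    return secret

shares = shamir_share(42, t=2, n=5)
print(reconstruct(shares[:3]))  # any 3 of the 5 shares suffice: 42
```

The honest-majority threshold in the theorem corresponds to choosing *t* below the relevant fraction of *n*, so that the corrupted parties' shares alone never determine any secret.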

The second basic theorem relates to the case that any number of parties
may be corrupted, and so there is no honest majority. In this case, it is
impossible to construct protocols that meet the definition as described above.
The reason for this is that the definition implies that all parties receive output together; however, this cannot
be achieved without an honest majority (Cleve, 1986). We therefore explicitly relax
the security definition and allow the adversary to prevent the honest party
from receiving its output, even in the ideal model; this relaxed definition is
called *security with abort*. As above, it has been shown that even when
any number of parties may be corrupted, *any* *multiparty functionality*
can be securely computed in the presence of malicious, static adversaries (Yao, 1986; Goldreich, Micali and Wigderson, 1987).

The above results all assume computational hardness assumptions that
follow, for example, from the assumption that factoring large numbers is hard.
A full proof of the results of (Goldreich, Micali and Wigderson, 1987), for
both the honest majority and no honest majority cases, can be found in (Goldreich, 2004).

As we have mentioned, the above results all refer to the *stand-alone
model* of computation, where it is assumed that the secure protocol being
analyzed is run once in isolation. Feasibility results have also been shown for
the case of protocol composition where many different protocols run
concurrently; for example, see (Canetti, 2001) and (Canetti, Lindell, Ostrovsky and Sahai, 2002). A brief
survey on known results for the setting of composition can be found in (Lindell, 2003).

The importance of the above results to the question of privacy preserving
data mining is that they demonstrate that under an appropriate choice of
parameters and assumptions, *any* privacy preserving data mining problem
can be solved, *in principle*. Therefore, the challenge now is to
construct protocols that are efficient enough for practical use; this is especially challenging because the inputs in data mining applications are typically huge. We note that
achieving this goal may involve finding new definitions that are more relaxed
than those described in the previous section, yet still accurately model the
real security concerns. This is a very non-trivial research task due to the
subtle nature of security and security definitions.

**Secure Protocols for Privacy Preserving Data Mining**

The first paper to take the classic cryptographic approach to privacy
preserving data mining was (Lindell and Pinkas, 2002). The paper presents an efficient protocol for
the problem of distributed decision tree learning; specifically, how to
securely compute an ID3 decision tree from two private databases. The model
considered in the paper was that of *semi-honest* adversaries only. This
approach was adopted in a relatively large number of works that demonstrate
semi-honest protocols for a wide variety of data mining algorithms. However, in many realistic settings the semi-honest adversarial model does *not*
suffice. Therefore, the malicious model must also be considered (to date,
almost all work has considered the semi-honest model only).

Other work on the problem of privacy preserving data mining has followed what is often called the “data perturbation” approach, as introduced by (Agrawal & Srikant, 2000). We remark that the development of rigorous security definitions that appropriately model security in settings considered by this approach seems to be a very difficult task. The question of whether or not it is possible to provide a definition that captures a realistic (and explicitly stated) adversarial threat model, and can be met, is an interesting open question. We remark that naïve definitions of security have been shown to be completely insecure; see (Dinur and Nissim, 2003) for just one example.
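As one simple, classical instance of perturbation, shown purely for illustration (this is Warner-style randomized response, not the specific scheme of Agrawal & Srikant, 2000), each respondent flips its private bit with a fixed probability, and aggregate frequencies are then estimated from the noisy reports. The parameter choices below are hypothetical.

```python
import random

def randomized_response(bit, p=0.75):
    """Report the true bit with probability p, and its flip otherwise,
    so no individual report pins down that respondent's true value."""
    return bit if random.random() < p else 1 - bit

def estimate_frequency(reports, p=0.75):
    """Unbiased estimate of the true fraction of 1s: since
    E[report] = (2p-1)*f + (1-p), invert this affine relation."""
    observed = sum(reports) / len(reports)
    return (observed - (1 - p)) / (2 * p - 1)

random.seed(0)
true_bits = [1] * 300 + [0] * 700
reports = [randomized_response(b) for b in true_bits]
print(round(estimate_frequency(reports), 2))  # close to the true 0.30
```

The definitional difficulty noted above is visible even here: quantifying exactly what an adversary learns from the perturbed reports, over repeated or correlated queries, is far less clear-cut than in the ideal/real paradigm.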

CONCLUSIONS

The history of cryptography shows very clearly that when protocols are
not proven secure, or when the adversarial models are not explicitly defined,
real attacks are very often discovered. Furthermore, the task of coming up with
*mathematical definitions* that accurately model *real adversarial
threats* is a very difficult task. Indeed, slight modifications to existing
definitions can render them completely useless; see (Canetti,
2000) for some discussions on this issue. In this short article, we have
described the real/ideal model paradigm for defining security. This
definitional approach has been the fruit of many years of cryptographic
research, and protocols that meet this definition provide very powerful
security guarantees.

We stress that the success of privacy preserving data mining may depend on the ability to find new definitions that provide both the rigorous security guarantees that are necessary and can be met by highly efficient protocols. This is perhaps the ultimate challenge of this new and exciting field of research.

REFERENCES

Agrawal, R. & Srikant, R. (2000). Privacy Preserving Data Mining. *ACM SIGMOD International Conference on Management of Data,* SIGMOD’00.

Beaver, D. & Haber, S. (1992).
Cryptographic Protocols Provably Secure Against Dynamic Adversaries. *Advances
in Cryptology, *EUROCRYPT’92.

Ben-Or, M., Goldwasser, S.
& Wigderson, A. (1988). Completeness
Theorems for Non-Cryptographic Fault-Tolerant Distributed Computation. The *20 ^{th} Annual ACM Symposium on Theory of Computing,*
STOC’88.

Canetti, R. (2000). Security
and Composition of Multiparty Cryptographic Protocols. *Journal of
Cryptology*, 13 (1), 143-202.

Canetti, R. (2001). Universally Composable Security: A New Paradigm for Cryptographic Protocols. The *42 ^{nd} Annual IEEE Symposium on the Foundations of Computer Science,* FOCS’01.

Canetti, R., Lindell, Y., Ostrovsky,
R. & Sahai, A. (2002). Universally
Composable Two-Party and Multi-Party Computation.
The *34 ^{th} Annual ACM Symposium on Theory of Computing,*
STOC’02.

Canetti, R., Feige, U., Goldreich,
O. & Naor, M. (1996). Adaptively Secure
Multi-Party Computation. The *28 ^{th} Annual ACM
Symposium on Theory of Computing,* STOC’96.

Chaum, D., Crepeau, C. & Damgard, I. (1988). Multiparty Unconditionally Secure Protocols. The *20 ^{th} Annual ACM Symposium on Theory of Computing,* STOC’88.

Cleve, R. (1986). Limits on the Security of Coin
Flips when Half the Processors are Faulty. The *18 ^{th} Annual
ACM Symposium on Theory of Computing,* STOC’86.

Clifton, C., Kantarcioglu, M., Vaidya, J., Lin, X. & Zhu, M. (2002). Tools for Privacy Preserving Distributed Data Mining. *SIGKDD Explorations,* 4 (2), 28-34.

Dinur, I. & Nissim, K. (2003). Revealing Information while Preserving Privacy. The *ACM Symposium on Principles of Database Systems,* PODS’03.

Goldreich, O. (2003). Cryptography
and Cryptographic Protocols. *Distributed Computing*, 16 (2),
177-199.

Goldreich, O. (2004). *Foundations of Cryptography, Volume 2 – Basic Applications*. Cambridge University Press.

Goldreich, O., Micali, S. & Wigderson, A. (1987). How
to Play any Mental Game – A Completeness Theorem for Protocols with Honest
Majority. The *19 ^{th} Annual ACM Symposium
on the Theory of Computing,* STOC’87.

Lindell, Y. (2003). *Composition
of Secure Multi-Party Protocols – A Comprehensive Study.* Springer-Verlag.

Lindell, Y. & Pinkas, B. (2002). Privacy Preserving Data Mining. *Journal of Cryptology*,
15 (3), 177-206. (An extended abstract appeared in *Advances in Cryptology,* CRYPTO’00.)

Yao, A. (1986). How to Generate and Exchange Secrets. The *27 ^{th} Annual IEEE Symposium on the Foundations of Computer Science,* FOCS’86.