@COMMENT This file was generated by bib2html.pl <http://www.cs.cmu.edu/~pfr/misc_software/index.html#bib2html> version 0.91
@COMMENT written by Patrick Riley <http://www.cs.cmu.edu/~pfr>
@COMMENT This file came from Gal A. Kaminka's publication pages at
@COMMENT http://www.cs.biu.ac.il/~galk/Publications/
@PhdThesis{kalech-phd, 
  author = 	 {Meir Kalech}, 
  title = 	 {Diagnosing Coordination Faults in Multi-Agent Systems}, 
  school = 	 {{B}ar {I}lan {U}niversity}, 
  year = 	 {2007}, 
  OPTkey = 	 {}, 
  OPTtype = 	 {}, 
  OPTaddress = 	 {}, 
  OPTmonth = 	 {}, 
  OPTnote = 	 {}, 
  abstract = { 
With increasing deployment of distributed applications in complex, 
dynamic settings, there is an increasing need to also be able to 
respond to failures that occur in their coordinated operation, in 
order to facilitate recovery and reestablish collaboration. We refer 
to this type of diagnosis as social diagnosis, since it focuses on 
finding causes for failures to maintain designer--specified 
relationships between sub-systems. Naive implementations of social 
diagnosis processes can require significant computation and 
communications overhead, which prohibit them from being effective as 
the number of sub-systems is scaled up, or the number of failures 
to diagnose increases. We thus seek to examine in depth the 
communication and computation overhead of diagnosis. 
We choose to use a model-based diagnosis (MBD) approach that 
relies on a model of the diagnosed system, which simulates the 
behavior of the system given the operational context (typically, the 
system's inputs). The resulting simulated behavior (typically, 
outputs) are compared to the actual behavior to detect discrepancies 
indicating failures. The model can then be used to pinpoint possible 
failing components within the system. 
MBD is increasingly being applied in distributed and multi-agent 
systems. While successfully addressing key challenges, MBD has been 
difficult to apply to diagnosing coordination failures. This is 
because many such failures take place at the boundaries between the 
agent and their environment, including other agents. For instance, a 
robot may send a message that another robot, due to an intermittent 
radio failure, did not receive. As a result, the two agents come to 
disagree on an action to be taken. 
A model of both the agents as well as the coordination between the 
agents is defined in advance, which enables to use a general 
reliable and robust diagnosis method which we believe is applicable 
to many multi-agent systems. 
In the first stage of the thesis we present a formalized model of 
social diagnosis in terms of model-based diagnosis approach. The 
multi-agent systems of interest to us are composed of several 
agents, which (by design) are to satisfy certain concurrence and 
mutual--exclusion coordination constraints. 
A fault in the coordination of a multi-agent system may be the 
result of a fault in one of the components or other agent 
components. Given a team model and an observation of the agents' 
components, the goal of the diagnosis is to determine a minimal set 
of abnormal components of agents whose selection may explain the 
inconsistency of the system. 
Based on this formalism, we take a first step towards distributed model-based diagnosis of coordination (inter--agent) failures. Modeling the coordination as a constraint 
graph brings to bear solution methods from distributed constraint 
satisfaction literature, as solutions to the constraint graph form 
the basis for diagnoses. 
We evaluated the use of these algorithms in comprehensive 
experiments with a team of physical and simulated Sony Aibo robots. 
We examined the computational requirements of the algorithms (i.e., 
their run--time and bandwidth usage), and the correctness of the 
diagnoses produced. We find that in general, a trade--off exists 
between computational costs and the ability to produce correct 
diagnosis. 
In the second stage of the thesis we present a design space of 
diagnosis algorithms, distinguishing several phases in the social 
diagnosis process, and providing alternative algorithms for each 
phase. To this aim we focus on the diagnosis of disagreement between 
situated agents, since the control process of such agents and faults 
is relatively simple to model, and we can therefore focus on the 
core communications and computational requirements of the diagnosis. 
We distinguish two phases of diagnosis: (i) selection of the 
diagnosing agents; and (ii) diagnosis of the global team state (by 
the selected agents). We provide alternative algorithms for these 
phases, and empirically evaluate the communications and run--time. 
The results show that centralizing the disambiguation process is a 
key factor in improving communications, but is not a determining 
factor in run--time. On the other hand, explicit reasoning about the 
different sub-systems is a key factor in determining run--time. 
Based on this conclusion we address two principles to achieve the 
reduction of the communication and the computation in large-scale 
teams. First, instead of sending all the information, send only the 
information that is relevant to the diagnosis. Second, diagnosing a 
limited number of agents that represent all others. These principles 
yield a novel diagnosis method which significantly reduces the 
runtime, while keeping communications overhead to a minimum.}, 
  wwwnote = {}, 
  OPTannote = 	 {} 
} 

