
Parallel and distributed systems possess inherent redundancy that can be
exploited to achieve fault tolerance. Our research concerns
the design and evaluation of ultra-reliable and highly available parallel
and distributed systems. Specific problems studied include on-line fault
diagnosis, group membership, recovery techniques, fault-tolerant routing and
multicast, clock synchronization, and reconfiguration in multicomputer systems.
Evaluation is both analytical and experimental. Testbeds include a UNIX
workstation cluster, a Windows NT cluster, and several commercial parallel
computers.