A Timeout-Based Message Ordering Protocol for a Lightweight
Software Implementation of TMR Systems
Paul D. Ezhilchelvan†, Francisco V. Brasileiro‡, and Neil A. Speirs† ‡Departamento de Sistemas e Computação Universidade Federal da Paraíba - UFPb/Campus II Av. Aprígio Veloso s/n, Campina Grande - Pb, 58108-970, Brazil Abstract
Replicated processing with majority voting is a well known method for achieving reliability and availability - two key attributes of dependability. Triple Modular Redundant (TMR) processing is the most commonly used version of that method. Replicated processing requires that the replicas reach agreement on the order in which input requests are to be processed. Almost all synchronous and deterministic ordering protocols published in the literature are time-based in the sense that they require replicas’ clocks to be kept synchronised within some known bound. We present a protocol for TMR systems that is based on timeouts and does not require the clocks be kept in bounded synchronism. The protocol design exploits features that are so characteristic of a TMR system. Consequently, the worst-case input ordering delay turns out to be the smallest possible; i.e., no symmetric protocol that works only with unsynchronised clocks, can provide a smaller worst-case delay. Keywords and Phrases
Byzantine failures, fault tolerance, Triple Modular Redundancy (TMR), process replication, agreement, message ordering, physical and logical clocks. 1. Introduction
We consider the task of designing and implementing a system that continues to provide services in the presence of a bounded number of processor failures. One of the problems is that a faulty processor might fail in an arbitrary or Byzantine [Pease80] manner. N modular redundant (NMR) processing is one of the most effective ways to mask the effects of Byzantine failures [Powell88]. Considering that non-faulty processors can generate unforgeable signatures, the basic idea here is to use in place of a single processor a set of N, N ≥ 3, processors to mask the failures of at most (N - 1)/2 processors. A Triple modular redundant (TMR) system with N = 3 is the most practical version of an NMR system. The three processors of a TMR system communicate with each other only by message passing and are assumed to fail independent of each other. These processors execute every given task in parallel and the results produced by them are subject to a majority vote. If the voted result is regarded as the output from the TMR system, the system functions correctly provided (i) at least two of its constituent processors are non-faulty, (ii) all non-faulty ones produce identical results after executing any given task, and (iii) the voting performed is correct. Process replication within a TMR system assumes the well understood state machine model, for which precise requirements for supporting replicated processing are known [Sch90]. One of the requirements is that non-faulty processors of a TMR system process messages that are input to the system, in an identical order so that their results are identical. This means that non-faulty processors must agree on the relative order in which each input message is to be processed. In this paper, we develop a message ordering protocol for the authenticated Byzantine fault model – perhaps the most well-known sub-class of the Byzantine fault model – in which a faulty processor cannot undetectably forge a non-faulty processor’s signature; the protocol also assumes the following synchronous environment: (i) communication delays between non-faulty processors are bounded by a known constant; (ii) processing and scheduling delays within a non-faulty processor can also be bounded; and, (iii) each non-faulty processor has a local read-only physical clock whose running rate with respect to the passage of real by a small and known bound. Message ordering in the presence of failures requires ability to detect late and absent messages. In the time-based approach [Lampo84], this capability is usually achieved by synchronising non-faulty processors’ clocks within some known bound, e. Since non-faulty processors need not have clocks with identical running rates, they should periodically execute a synchronisation protocol (e.g., [Vasant88, Veris97]) to adjust their clock readings by appropriate amounts. This synchronisation involves (i) periodic exchange of messages, consuming network bandwidth, (ii) using data abstractions [Dolev84] to adjust the readings of a physical clock (a read-only object), and (iii) implementing amortisation techniques [Schmu90]. An alternative to using synchronised clocks (and therefore to having to keep the clock differences within a known bound) is to use timeouts and to employ the (unsynchronised) physical clocks only for measuring timeouts. We will adopt this timeout- based approach for designing the TMR message ordering protocol. In asynchronous systems, where the bounds on transmission, processing and scheduling delays cannot be known with certainty, timeouts are commonly employed for dealing with failures [Chandra96, Fetzer97] so that non-faulty processors can order messages identically in a fault-tolerant manner. However, for synchronous systems, the time-based approach is generally favoured. Practical systems, such as MARS [Kopetz89] and Air Traffic Control System [Cristian96], have taken the time-based approach. [Echtle87] and AMp [Veris91] are some of the early exceptions to this time-based trend in the synchronous context. AMp is a timeout-based, fault-tolerant message ordering protocol, designed and implemented for commercial applications. It provides the same message ordering guarantees as our protocol in 1Real time is measured in an assumed Newtonian time frame that cannot be directly observed. a general n-processor system but assumes a benign fault model where processors either crash or occasionally omit to produce responses. Our assumption of authenticated Byzantine faults is a weaker fault model and, as argued in [Pease80], any further weakening of this fault model would make the desired form of message ordering impossible to achieve in a three-processor One of the designers of AMp later analysed time vs. timeout-based approaches which led to the following conclusion that has both theoretical and practical significance [Veris96]: though the synchronous, timeout-based protocols cannot be perfect substitutes for their time- based counterparts in all circumstances, they can however provide attractive alternatives in a number of application settings. This conclusion is based on two observations: timeout-based protocols are less precise in preserving the temporal ordering of events (the order in which events occur according to Newtonian time); however, the precision they offer is sufficient in many settings where external events are being interpreted to the system by slow functioning interface devices such as valves or by cyclic processes operating with large periodicity; for example, a cyclic process that observes sensor changes every 10 ms, will interpret and report all changes that occurred in the last 10 ms period to have occurred at the ‘same time’; this means that, as far as the computing and ordering processes are concerned, two external events occur either at the same time or at least 10 ms apart, and that any protocol with precision smaller than 10 ms will be equally useful. Secondly, timeout-based protocols are faster under certain ‘favourable’ conditions. In the case of our protocol, these favourable conditions (derived in section 5) turn out to be: all processors are non-faulty; and, the maximum difference (λ) within which non-faulty processors receive a given input from the environment and the actual maximum transmission delay (da) that currently holds within the system, are such that λ=< 2(e- da). λ is typically defined as tightness [Veris00] and will be small when processors receive inputs (from the system environment) via a broadcast medium. We therefore believe the protocol presented here enhances the tool-kit available for a TMR system builder. Further, our implementation experience [Speirs93, Brasil95] suggests that building a TMR system using our protocol is easier as clocks need not be synchronised. The rest of the paper is organised as follows. Section 2 describes the structure of the TMR system, the basic assumptions and the message ordering requirements. Section 3 develops and presents the protocol. Section 4 describes the design efforts involved in keeping the ordering delays as small as possible. Section 5 presents proofs of correctness and also proves that no (symmetric) protocol that works only with unsynchronised clocks can guarantee smaller worst-case ordering delays. Section 6 concludes the paper. 2. System Description and Assumptions
We will suppose that the TMR system is made up of processors named Pi, Pj and Pk. These processors are uniquely ordered and the ordering is known to them. Each processor is connected directly to the other two processors of the system by internal links. Also, it is connected to the ‘outside world’ or the system environment from which input messages are received and output messages are sent to. Figure 1 shows a TMR system whose processors are connected to the environment via a bus. The unit (shown as a black square in the figure) that connects a processor to the bus is called the network attachment controller (NAC). Assumption 1: Within a TMR system, at least 2 out of 3 processors are non-faulty and never
Assumption 2: The internal links that connect processors never fail. (Alternatively, a link
failure can be thought of as the failure of one of the two processors connected by that link.) Processors are reliably connected to the system environment. We further assume that a NAC does not allow the attached processor to use the bus continuously for a long time, and thereby it contains the effects of the “babbling idiot” syndrome: the faulty processor may generate messages randomly and try to transmit them all onto the bus, thus preventing other processors Assumption 3: Each processor can sign the messages it sends, and authenticate the signed
messages it receives [Rivest78, Tsudik92] such that: A non-faulty processor’s signature for a given message is unique and cannot Any attempt to alter the contents of a non-faulty processor’s signed message is detected by any other non-faulty processor. In this paper we adopt the style of writing real time values in Greek letters and clock time values in italicised lower case Roman letters; the term ‘clock’ will always refer to a Assumption 4: If a non-faulty Pi prepares and transmits m at real time τi, every non-faulty
destination Pj will receive m at real time τj, τi ≤ τj < (τi + δ), where δ, δ > 0, is known. δ
bounds the message queuing delays at the sending and receiving processors, and the propagation delay from the sender to receivers. A violation of this assumption between a pair of processors means that one of the processors is faulty. Assumption 5: A non-faulty Pi’s clock measures an interval x in a real time interval
x(1 + ρi), where |ρi| ≤ ρ and ρ is a known positive constant.
Assumption 6: Processing and scheduling delays are bounded and known for a non-faulty
processor; more precisely, a non-faulty processor: i) performs a local computation (e.g., processing a received message) within a bounded and known amount of time; and, ii) schedules a computational task within a bounded and known amount of time. Remarks: The last three assumptions make the system context a synchronous one. In
asynchronous models, however, these assumptions can be violated infinitely often even within a non-faulty processor and consequently the order protocol will terminate only when such violations cease for a sufficiently long time (see [Chandra96, Cristian99]); an exception is the timed asynchronous model of [Cristian99] which does not permit violation of assumption 5 within a non-crashed processor. In our case, a violation of any of these three assumptions within a processor would render the processor faulty; consequently, the order protocol is always guaranteed to terminate in a known and bounded amount of time, provided the number of faulty processors is restricted to one (assumption 1). 2.1 Input Message Ordering
Non-faulty processors can receive the input messages in different order. Referring to figure 1, Pi can receive µ1 followed by µ2 and Pk in the reverse order. So, when a processor receives an input message from the system environment, it must first decide the processing order for that message. For that, it forms an internal message m that contains the received input message in the data field m.µ and sends m to all other processors in the system using the order protocol that guarantees the following two conditions: Validity: If a non-faulty Pi forms and sends m, all non-faulty processors (including Pi) will
decide on an order for m, within a known and bounded real time interval ∆;
Unanimity: If a non-faulty Pi decides on an order for a given m, then every non-faulty Pj
These conditions ensure that an input µ supplied to a non-faulty P gets identically ordered in the form of P’s internal message m, m.µ = µ, by all non-faulty processors of the TMR system within ∆. It is possible that the faulty processor generates and orders an internal message for a fictitious input or for an old input message that has already been processed by non-faulty processors. So, to ensure that no duplicate or fictitious input gets processed, an ordered internal m, m.µ = µ, can initiate the (ordered) processing of input µ only after the µ contained in it, has been verified to be valid, i.e. not a fictitious or duplicate input. We refer the reader to [Shri92] for details on how a non-faulty processor (i) derives an ordered stream of inputs from an ordered stream of internal messages, and (ii) generates a voted output from the results it computes. This paper will focus only on the (timeout-based) ordering of internal messages. Note that since the TMR system can have at most one faulty processor, an input µ must be supplied to at least two processors within the system; we will assume that every µ is sent to all three processors in the TMR system. 3. The protocol
The protocol has three aspects to it: (i) message counters maintained by processors, (ii) message diffusion that enables non-faulty processors to receive each other’s messages, and (iii) timeliness checks to assess the timeliness of a received message.
3.1. Message Counter and Diffusion
Each processor within the TMR system, say Pi, maintains a counter called the message counter, denoted as MCi, which holds an integer value and is initialised to INIT_VAL (usually 1) when the system is first started. For every given input µ received from the environment, Pi forms an internal m in the following manner: the data field m.µ is set to µ, the originator field m.O to i; the timestamp field m.TS to Mci; further, MCi is incremented by 1. Incrementing of MCi soon after timestamping m ensures that any message Pi later forms gets a timestamp larger than m.TS. An internal message m formed is accepted with its copy being entered into a message list called the acceptedi and is then sent to other processors using the send(m) primitive. An invocation of this primitive discards m if m.S contains two signatures, and performs the following actions if m.S contains no or one signature: it generates the signature of Pi for m, appends the generated signature to any signature that may already be in m.S and transmits the resulting m to processors in the system that have not signed m. Pi receives internal messages from other processors by executing receive(m) primitive which blocks until it can return an authentic m that is received via an internal link and has the authentic signatures of one or two distinct processors other than Pi. Whenever Pi receives m, it checks whether m is timely. (Procedures for checking the timeliness of a received message are described in the next sub-section.) If m is found untimely, it is discarded; otherwise, the following three actions are performed: (i) MCi is set to the maximum of {MCi, m.TS + 1}, (ii) m is accepted by entering a copy of m into acceptedi; and, (iii) m is ‘diffused’ by executing the primitive send(m). Thus, any accepted message that is received with one signature is ‘diffused’ to the processor that appears not to have ‘seen’ the message. As there can be at most one faulty processor in the TMR system, diffusion of single-signed messages ensures that if m of a faulty processor enters acceptedi of a non-faulty Pi, then the other non-faulty processor receives m at least once. From the description above, it is obvious that: (i) a message sent or received by a non- faulty processor will have either one or two authentic signatures; and (ii) every sent message, but no received message, carries the host processor’s signature. For a given message m, path(m) is defined as the ordered sequence of processors that have signed m. Thus, if m is a double-signed message that is formed by Pj and diffused by Pk, then path(m) = Pj:Pk. The first processor in path(m) is called the originator of m and the last processor the immediate sender of m. Note that the originator and the immediate sender of m are one and the same if m is single-signed. Two paths are said to intersect if they contain one or more processors commonly in them.
3.2. Timeliness Checks
These checks enable a processor to determine the timeliness of a received m. Before presenting them, we will define a clock time interval d such that by measuring d in its local clock a non-faulty processor is guaranteed to measure a real time interval of at least δ duration, i.e., d ≥ δ/(1 - ρ); d is known to, and identical for all non-faulty processors of the system. We will assume, for simplicity, that a processor takes zero time to execute any instruction of the protocol and the send(m) and receive(m) primitives. (Realising this assumption in practice will require an increase in the value of d, which is possible as the protocol does not impose any upper bound on the value of d.) Fixing the processor Pi in the TMR system to be non-faulty, we will describe the checks by which Pi determines the timeliness of a received message. Suppose that Pi receives a message m at its local clock time ti. There can arise one of three possible situations depending on the value of MCi at ti: MCi < m.TS or MCi = m.TS or MCi > m.TS. If MCi < m.TS or MCi = m.TS then m is a ‘future’ or a ‘present’ message respectively and is considered by Pi as timely; if, on the other hand, MCi is already larger than m.TS when m is being received, then m is a ‘past’ message, and its timeliness should be judged based on how much time has elapsed since MCi first became larger than m.TS. So, timeliness checks are needed only for messages received with past timestamps. Let us suppose that the m received at ti is a past Let m’, be the message whose acceptance by Pi caused MCi to become larger than m.TS for the first time. That is, just before Pi accepted m’, MCi ≤ m.TS. Note that Pi could have either received m’ from another processor or itself formed and sent m’; and also that m’.TS m.TS because acceptance of m’ has caused MCi to exceed m.TS. Let Pi accept m’ at its clock time ti, ti < ti. For m to be considered timely, (ti - ti) must be less than a certain bound, called the timeliness bound, whose length depends on path(m’) and path(m). Table 1 summarises these timeliness bounds for various combinations of path(m’) and path(m). For example, if m’ is a single signed message from Pk and m is a single signed message from Pj, then the entry B2 (row B, column 2) indicates that (ti - ti) must be less than 2d for Pi to consider m timely. We will denote an entry of table 1 by treating the table as a matrix: Table1[Pr, Pc] denotes the value in the row corresponding to the path Pr and in the column corresponding to the path Pc. Pk:Pj (E) d Table 1: Pi’s timeliness bounds for a past m given that MCi first exceeded m.TS due to accepting m. 3.3. Protocol Description
To perform the timeliness checks on received messages, non-faulty Pi maintains path counters, denoted as PCi[p], for every path p through which Pi can receive a message. Thus, Pi maintains four path counters, one for each p ∈ {Pk, Pj, Pj:Pk, Pk:Pj}. These counters have integer values which are initialised to (INIT_VAL - 1) when the system is formed. A path counter PCi[p] is set to T, to indicate that any m, m.TST and path(m) = p, to be received thereafter, must not be considered timely. A primitive update(path: p, timestamp: T) is used which, upon being invoked, sets PCi[p] = maximum of {PCi[p], T}. The instruction “schedule op at t” schedules the operation op at (local) clock time t, if t is a future time when the instruction is executed. Whenever Pi forms and sends m’ or receives a timely m’, executions of the update primitive are scheduled to set PCi[p] = maximum of {PCi[p], m’.TS}, for every p ∈ {Pk, Pj, Pj:Pk, Pk:Pj} after the elapse of time intervals indicated in Table1[path(m’), p]. With the path counters being updated at appropriate timing instances, the timeliness of a received m is checked simply by referring to PCi[path(m)]: m is timely only if m.TS > PCi[path(m)] when m was received. We define PCi,min = minimum of {PCi[p], ∀p | p ∈ {Pk, Pj, Pj:Pk, Pk:Pj}}. Any m that is received after PCi,min has become larger than or equal to m.TS is untimely irrespective of path(m) and cannot enter acceptedi. So, after PCi,min ≥ T, every m, m.TST, that is already in acceptedi is said to have become ‘stable’ and can be safely ordered. Stable messages with a given timestamp T, are grouped together into a list called stablei(T) for ordering. The entries of stablei(T) are first stripped of the signatures they have, and then duplicate entries are Two messages m and m’ are said to be spurious if m.O = m’.O, m.TS = m’.TS and m.µ ≠ m’.µ. Spurious messages are generated when the faulty processor gives the same timestamp to distinct messages m and m’ it forms. Spurious messages in stablei(T) are discarded. The stablei(T) will now contain at most one message originating from a given processor. If it contains more than one message, they are ordered according to their source processors. Messages of stablei(T) are ordered only after the contents of all stablei(T’), T < T, have been ordered. This is ensured by using the stability counter SC which is initialised to (INIT_VAL - 1). The protocol for Pi is presented in figure 2. do /* Broadcast process: handles external messages */
get µ from outside the system;
m.µ = µ; m.TS = MCi; m.O = i;
MCi = MCi + 1;
t = clocki;
for every p ∈ {Pk, Pj, Pj:Pk, Pk:Pj} do
schedule update(PCi[p], m.TS) at t + Table1[path(m), p]; end for
acceptedi = acceptedi ∪ {m};
end do /* Broadcast process */
do /* Diffuse process: handles internal messages */
if (m.TS ≤ PCi[path(m)]) then discard(m)
MCi = maximum of {MCi, m.TS + 1};
t = clocki;
for every p ∈ {Pk, Pj, Pj:Pk, Pk:Pj} do
schedule update(PCi[p], m.TS) at t + Table1[path(m), p]; end for
acceptedi = acceptedi ∪ {m};
end do /* Diffuse process */
do /* Order process: identifies and orders stable messages */
PCi,min = minimum of {PCi[p], ∀=p | p ∈={Pk, Pj, Pj:Pk, Pk:Pj}};
while SCi ≤ PCi,min do
SCi = SCi + 1;
stablei = {m | m ∈ acceptedi and m.TS = SCi};
acceptedi = acceptedi - stablei; /* remove stable messages */
if stablei ≠ ∅ then
distinct = ∅;
for every m in stablei do
m.S = ∅; /* strip off signatures from m */if m ∉ distinct then insert m in distinct end if end for
stablei = distinct;
spurious = {m, m’|m=∈=stablei ∧=m’∈=stablei ∧=m’.O=m.O ∧==m’µ≠m.µ};
stablei = stablei - spurious; /* remove spurious messages */
if stablei≠∅ then ∀=m|m=∈=stablei deliver m ordered by m.O end if
end while
end do /* Order process */
Figure 2. The Timeout-based Protocol for Message Ordering. 4. Estimating Timeliness Bounds
Central to protocol correctness is the accurate estimation of safe timeliness bounds which also need to be small if the protocol is to order messages fast. This section presents the arguments for, and the rationale behind the derivation of the smallest safe bounds. It is done in three parts. First, the basic condition that should be met for the protocol to be correct is stated. We then analyse how meeting this condition is made difficult by various possible failure modes of the faulty processor; this analysis leads to the identification of the safety requirement. The third part shows in two stages how the timeliness bounds used in the protocol are derived to be safe. First, the bounds are derived in the most straightforward manner; next, the contexts where bounds can be safely reduced are identified and the bounds reduced. Note that the smaller the timeliness bounds used, the earlier the accepted messages are stabilised and ordered, resulting in smaller ∆ (the upper bound on message ordering 4.1. The Basic Condition
The following condition must be met for the protocol to be correct: Unanimous acceptance: m enters the accepted list of a non-faulty processor if and only if m
or a message equivalent to m enters the accepted list of every other non-faulty processor in A message equivalent to m is denoted as equiv(m) and is defined as any other message that differs from m only in the signature field; that is, equiv(m).µ = m.µ, equiv(m).O = m.O, equiv(m).TS = m.TS and equiv(m).Sm.S. Since a non-faulty processor does not accept an incoming message containing its own signature, it accepts at most one equiv(m) after having accepted m. Note that equiv(m) and m become identical once their signatures are stripped off (i.e., once the signature fields are set to Φ). It can be noted from the protocol description that there exists a local time when a non- faulty processor stops accepting messages of certain timestamp value. When that time comes for a given timestamp value T, an already accepted message with timestamp T becomes stable. If the basic condition holds, then any two non-faulty processors, say, Pi and Pk, construct an identical stable(T) of signature-stripped and non-spurious messages for every given T, T INIT_VAL. Given that processor ordering is unique and known to non-faulty processors, the order determined by them for a (signature-stripped) m will be identical. 4.2. Effects of Failures
A faulty processor, say Pj, can fail in the following ways to prevent the above basic F1 (impersonating a non-faulty processor): Pj generates a signed message on behalf of a non-faulty processor and attempts to deceive the other non-faulty processor into accepting the forged message. For example, Pj fabricates a message with Pk’s (forged) signature and diffuses it to Pi, with path(m) = Pk:Pj. If Pi were to accept m, it will not diffuse m to Pk as the accepted message is already double-signed. However, by assumption A3, such attempts by faulty Pj are detectable and hence failures of this type are not a threat to meeting the basic F2 (delayed sending of own messages): Pj arbitrarily delays the sending of a message m it generates. A failure of this type can lead to one of the following outcomes: (i) both the non- faulty processors find m untimely and hence both discard m; (ii) both find m timely and enter a copy into their respective accepted list; or (iii) one non-faulty processor, say Pi, finds it timely and the other does not. Note that the first two outcomes do not pose any problem since both the non-faulty processors are unanimous in accepting or not accepting m. The last outcome lacks this unanimity and is depicted in figure 3(a). The situation of m being entered only in acceptedi and not in acceptedk, has to be dealt with if identical ordering is to be Figure 3. (a) Pi accepts m but not Pk. (b) diffusion failure. F3 (two-facing while sending own messages): Pj sends a properly signed m to say, Pi, and sends to Pk either (a) an inauthentic version of m, or (b) nothing, or (c) a different, authentic message m’ (that is never sent to Pi). The effect of a two-facing failure is the same as the last outcome of the previous category: a 1-signed message enters the accepted list of one non- faulty processor and neither that message nor its equivalent ever enters the accepted list of the F4 (failure during diffusion): Pj fails while diffusing a message m which it received from, say, Pi as shown in figure 3(b). It can fail by altering the contents of m (call this failure type F4.1) or by delaying the diffusion by an arbitrary amount of time (call this type F4.2). Assumption 3(b) reduces the impact of F4.1 by enabling Pk to detect Pj’s tampered message as inauthentic and discard the diffused message. F4.2 can result in Pk not receiving the diffused message at all, or receiving it but finding it untimely. Thus, both F4.1 and F4.2 can prevent equiv(m) from entering acceptedk. To deal with the problematic outcomes caused by failures of F2, F3, and F4 types, the timeliness checks are required to be safe: safety requirement: a non-faulty processor finds a received m timely if the immediate sender
Let us suppose that the timeliness checks meet the safety requirement. Consider the third outcome caused by an F2 type failure (see figure 3(a)). Pi will diffuse m after it has entered it into acceptedi. By the safety property of timeliness checks, Pk is guaranteed to find the Pi’s diffused message timely; thus, equiv(m) is guaranteed to enter acceptedk. This argument also shows that an F3 type failure is not a threat when timeliness checks are safe. Let us next consider figure 3(b), which depicts an F4 type failure. Since Pi, the originator as well as immediate sender of m, is non-faulty, the safety property guarantees that Pk finds m timely and that m enters acceptedk as well. Thus, if the timeliness checks are safe, the above- mentioned basic condition is met and therefore identical ordering is achieved despite the arbitrary failures of the faulty processor. It can be observed from the protocol description that the value of a non-faulty processor’s MC (message counter) is always one more than the largest timestamp of messages accepted so far; also, that the value of any of its PC (path counter) can never exceed the largest timestamp of the accepted messages. So, a non-faulty processor’s MC will be at least one more than any of its PC. This means that when a received m is a future or present message, it is guaranteed to be found timely. That is, the safety requirement is trivially met if the received message is a present or future one. So, in what follows, we will show that the timeliness bounds used are safe when a non-faulty Pi finds a received m to be a past message. 4.3. Derivation of Timeliness Bounds
We will present our derivations in the context in which the bounds of Table 1 are presented: Pi is non-faulty and accepts m’ at its clock time ti and receives m, m.TSm’.TS, at ti, ti > ti’; just before ti’, MCi ≤ m.TS, and at ti’ MCi > m.TS. To keep the description simple, we assume the following. (i) The minimum message transmission delay (as measured by a non-faulty clock) is zero. (ii) The clocks of all non-faulty processors have identical running rates. (This assumption is removed in subsection 4.3.3.) (iii) Unless stated explicitly, time is measured according to the clock of Pi. So, when we say an event happened at (time) t, it means that the event happened at time t according to Pi’s clock. (iv) Finally, the subscript i is dropped from ti and ti’, where the context is obvious. Lemma 4.1. Given that Pi accepts m’ at t’, any non-faulty Pk will have MCk > m’.TS before
Proof: If Pk has signed m’, then MCk > m’.TS when it signed m’. So, MCk > m’.TS becomes true before t’ or at t’ if m’ has experienced zero transmission delay in reaching Pi from Pk. If Pk has not signed m’, it will receive the diffused m’ from Pi before t’ + d. Upon the reception of m’, if MCk is not already greater than m’.TS then m’ is either a present or a future message, and therefore must be accepted by Pk which will then set MCk > m’.TS. Say, non-faulty Pk forms and sends m, m.TSm’.TS. By lemma 4.1, MCk becomes larger than m’.TS before t’ + d. So, m should have been sent by Pk before t’ + d. Even if m experiences the maximum delay of just less than d time, Pi should receive Pk’s single-signed m before t’ + 2d. It suggests that if Pi receives a single-signed message m at t such that (t - t’) < 2d, it must consider m timely. This scenario is depicted in figure 4(a) where time progresses from left to right and the labels “>m’.TS” and “=m.TS+1” indicate the earliest instances when a non-faulty message counter exceeds m’.TS and when it becomes equal to m.TS + 1, respectively; “<nd” labels an interval of length less than nd, for some integer n ≥ 1. On the time-line of Pk (in figure 4(a)), the timing instance labelled “=m.TS+1” cannot be on the right hand side of that labelled “>m’.TS” since m.TSm’.TS. So, Pi can receive a timely m from Pk at t, (t - t’) < 2d. Figure 4. m.TSm’.TS. (a) Non-faulty Pk forms and sends m. (b) Pk diffuses Pj’s m. Suppose that m originates from a faulty Pj and that Pk receives it within 2d time after MCk > m’.TS has become true. (See figure 4(b).) Just like in figure 4(a) where Pi accepts Pk’s m because m arrives within 2d time after MCi > m’.TS, Pk must now consider Pj’s m timely. (Note that a non-faulty processor cannot know whether another processor is non-faulty or faulty.) When Pk diffuses m, Pi must find the diffused message timely for the safety property to be met. From figure 4(b), Pi can receive the diffused m at t, t < t’ + 4d. So, Pi’s timeliness check for accepting a double signed m is (t - t’) < 4d. 4.3.1. Motivation for reducing the bounds
The timeliness bounds of 2d for 1-signed messages and 4d for 2-signed messages are safe for the protocol. Reducing these bounds further (without compromising safety) can lead to smaller ∆. To illustrate this, let us evaluate ∆ for the following scenario: Pi receives and accepts 1-signed m’ from Pk at time t’. Suppose that there is no m, m.TSm’.TS, left in the system for ordering. Non-faulty Pi deduces that it will never accept a 1-signed m and a 2-signed m at t’ + 2d and t’ + 4d respectively. So, if the 1-signed m’ from Pk had taken maximum transmission time to reach Pi, the delay (in real-time) for Pi to order m’ would be δ=+ 4d with ρ having been assumed to be zero; i.e., ∆ ≥ =δ +4 d. Suppose that we are able to show that it is enough for Pi to wait until t’ + 3d to deduce that no 1- signed or 2-signed m, m.TSm’.TS, will ever be accepted; this smaller waiting period reduces the ordering delay for m’ to δ=+ 3d. Suppose also that we are able to show that the ordering delay for all types of m’ that Pi can possibly receive and accept, can be reduced as well, then ∆ itself can be reduced, resulting in a faster ordering protocol. Below, we will explain the intuition behind our attempts at reducing timeliness bounds. Given that Pi has received and accepted m’ at t’, let the originator of m’ be Pk and m’ be 1-signed or 2-signed. Pi can now expect any of Pk’s 1-signed m, m.TS < m’.TS, to be received before t’+ d, if Pk is non-fa Pk sends the messages it forms, in the increasing order of message timestamps; another non-faulty Pi must receive the earlier one within at most d after it has received the later one. So, the conclusion is: given that m’ is accepted at t’, the timeliness bound for a 1-signed m such that m.TSm’.TS and the originator of m’ ≡ originator of m, is d. Note that this bound is d smaller than the generic bound we have earlier derived for a 1-signed m. It can be observed that we have achieved the above reduction (i) by considering the slowest possible behaviour of a message m from another non-faulty processor, with a hypothesis that Pi has already accepted m’, m’.TSm.TS; and, (ii) for the case where path(m) intersects path(m). It turns out that whenever (ii) holds, reduction is possible. Intuitively, the timing behaviour of a potential m from another non-faulty processor can be reasoned out more precisely, if the potential m and the accepted m’ (with m’.TSm.TS) had been handled by the same processor. Because there are only three processors in the TMR system, it is guaranteed that path(m) intersects path(m) whenever m is double signed and m’ is not formed by Pi. This means that when the accepted m’ is not locally formed, Pi needs to wait only for a smaller period (smaller by at least d) in order to be sure that no 2-signed m will any longer be accepted; so, even if Pi needs to wait until t’+ 2d for 1-signed messages, it can start ordering m’ by t’+ 3d. There are fourteen combinations of path(m) and path(m) for which reduction is feasible. They correspond to those entries in table 1 where the bounds are neither 2d nor 4d, and to the entries D3 and E4 which have 2d for a double-signed m. In the next sub-section, we explain the rationale behind the reduction by considering three representative path combinations; for each, we also indicate the dual cases for which the explanation can be readily applied - thus covering eight path combinations in total. For space reasons, we omit the explanations for the remaining six combinations which are not dissimilar to the combinations considered. Note that this omission does not leave the paper incomplete, as Section 5 proves that the protocol is correct, i.e., all reductions applied are correct. (A reader who is not interested in the reduction rationale but only in the overall correctness, can proceed straight to section 5.) 4.3.2. Rationale for reducing the bounds
We will make a default assumption that the immediate sender of m (taken mostly as Pk) is non-faulty and name a particular path combination after its corresponding entry in table 1. Case B1: path(m) = path(m) = Pk. Since both m and m’ originate from non-faulty Pk, m.TS and m’.TS cannot be equal. By hypothesis, m.TS cannot be larger than m’.TS; so, only m.TS < m’.TS is possible. So, Pk must have sent m first and then m’. (See figure 5(a).) Since 2Note that the safety requirement is void if the immediate sender of m is faulty. message transmission can take as small as zero time and as large as just less than d time, (t - t’) < d. Similar argument applies also for the case D1, where path(m) = Pj:Pk and path(m) = Pk. Here, the non-faulty Pk does not form m’ but receives it from Pj and diffuses it after setting its MCj > m’.TS. So, it must have sent m before diffusing m’ and therefore (t - t’) < d. Replacing Pk by Pj as the immediate sender of m (and therefore considering Pj as non-faulty) in these two cases leads to the bound of d for cases C2 (path(m) = path(m) = Pj) and E2 (path(m’) = Pk:Pj and path(m) = Pj). Case E1: path(m’) = Pk:Pj and path(m) = Pk. Both m and m’ originate from non-faulty Pk which must have sent m first and then m’ since m.TS < m’.TS. But Pi receives m’ first and then m. This can happen, as illustrated in figure 5(b), when Pk sends m and m’ close to each other and the transmission delay of m’ from Pk to Pj and then from Pj to Pi is less than the transmission delay of m from Pk to Pi. Since message delays between non-faulty processors are bounded by d, Pi must receive m before t’ + d. Hence the timeliness bound for this case is d. It also applies for the case D2 which is the dual of E1 with Pj being the non-faulty Case B4: path(m’) = Pk and path(m) = Pk:Pj. Observe that Pj here is the non-faulty immediate sender of m and Pk can be faulty or non-faulty. Since Pi accepts m’ (by hypothesis), it will diffuse the message to Pj. Pj also diffuses m to Pi. So, there are two cases to consider: Pj receives the single-signed m from Pk either (i) before or (ii) after it receives the diffused m’ from Pi. The subcase (i) is shown in figure 6(a). Pj can receive the diffused m’ from Pi at any time before t’ + d. Even if it diffuses m just before receiving m’, Pi will receive the diffused m before t’ + d + d; that is, (t - t’) < 2d. Figure 6(b) illustrates the second subcase where Pj receives the single-signed m from Pk after it has received the double-signed m’ from Pi. Since Pj diffuses m to Pi, it must have found the single-signed m timely. Suppose that it also finds the double-signed m’ timely. This means that MCj > m’.TS becomes true by the time Pj accepts the double-signed m’. When it receives the single-signed m after having accepted the double-signed m’, it is in the same situation as Pi in case E1 where path(m’) = Pk:Pj, path(m) = Pk and the timeliness bound is d. So, for the non-faulty Pj to find the single-signed m timely, it must have received m within d time after MCj > m’.TS became true; that is, the time elapsed between receiving m’ and m must be less than d. From figure 6(b), (t - t’) < 3d is possible for this subcase. We now have to show that Pj does find the double-signed m’ timely, which is done as lemma A1 in appendix A. The proof is based on the observation that the double-signed messages have larger timeliness bounds than the single-signed ones; so, it is not possible for a non-faulty processor to reject a double-signed message as untimely and then accept later a single-signed message of smaller or equal timestamp as timely. Choosing the largest of the bounds estimated for the two subcases, 3d becomes the timeliness bound for B4; this bound also applies for C3 where Pk is substituted for Pj as the non-faulty Figure 6. Case B4. (a) Pj receives m before m’. (b) Pj receives m after m’. 4.3.3. Adjusting for non-identical clock rates
We now focus on our earlier assumption that non-faulty processors’ clocks have identical running rates. When the readings of non-faulty clocks can drift apart with the passing of real time, the differences accumulated over a period of time must be compensated. The way we do this compensation is by increasing the value of d by the maximum difference which the non- faulty clocks could develop during the ‘life-time’ of a non-faulty processor’s message within the ordering protocol, i.e., during the maximum real-time within which a non-faulty processor’s message gets ordered by every non-faulty processor in the system. Appendix B presents proofs to show that δ/(1 - 5ρ) is the smallest and safe lower bound required on d to compensate for the non-identical running rates of non-faulty clocks. Below, we present the ‘worst-case’ scenario which gives rise to this bound on d. Suppose that non-faulty Pk’s clock is running fast with ρk = -ρ, and non-faulty Pi’s clock is running slow with ρi = ρ. Let Pi form and send m’ at local clock time ti’, and let m’ be transmitted in zero time; let Pk accept m’ at its clock time tk’. Since m’ is assumed to be transmitted in zero time, the clocks of Pi and Pk read ti’ and tk’ respectively at the same real Let faulty Pj behave in the most disruptive manner in sending its m, m.TS m.TS: it forms m, delays its sending for a while, and sends only to Pi - the processor with the slow clock. Let Pi receive the single-signed m from Pj at the latest time for accepting such m: just before its clock time ti’ + 2d. So, Pi accepts m and diffuses double-signed m to Pk. For Pk to accept the 2-signed m from Pi, its clock, despite running fast, should not be reading past tk’ + 3d (see entry B3 of table 1, interchanging the suffixes k and i) when it receives that 2-signed m, even if the transmission from Pi to Pk is to take just less than δ=- the maximum permitted time (see assumption 4). So, d must be large enough to ensure that τ + 3d(1 + ρk) ≥ τ + 2d(1 + ρi) + δ, i.e., 3d(1 - ρ) ≥ 2d(1 + ρ) + δ; that is d ≥ δ/(1 - 5ρ). 5. Protocol Correctness and Relative Performance
Lemma 1: The protocol guarantees the validity condition with ∆== 4d(1 + ρ), provided
d ≥ δ/(1 - 5ρ).
Proof: Consider an execution of the protocol in which Pi and Pk are non-faulty. Let Pi form
and send a message m at real-time τi. Pk receives m before τi + δ (by assumption 4). Lemmas
in the appendix B prove that the timeliness bounds of table 1 are safe when d ≥ δ/(1 - 5ρ). So, Pk will find the received m timely and accept it. Note that when a non-faulty processor accepts a message m, none of its path counters stays below m.TS after some finite time. So, the message m accepted by Pi and Pk, will be taken up for ordering and be ordered because a faulty processor cannot form and send another message that could make Pi and Pk consider m as spurious (due to assumption 3). This shows that the validity condition is met within a finite time after τi. The schedule instructions in the Broadcast process indicate that PCi,min reaches m.TS within 4d clock time after τi. So, Pi orders m no later than τi + 4d(1 + ρi). The schedule instructions in the Diffuse process indicate that PCk,min reaches m.TS within 3d clock time after Pk receives m which is before τi + δ. So, Pk orders m no later than τi + δ=+ 3d(1 + ρk). Thus the message m sent by Pi gets ordered by Pi and Pk, no later than τi + maximum of {4d(1 + ρi), δ + 3d(1 + ρk)}. By assumption 5, |ρi| ≤ ρ and |ρk| ≤ ρ. So, ∆== maximum of {4d(1 + ρ), δ + 3d(1 + ρ)} = 4d(1 + ρ). Let Stablei(T), for some T, TINIT_VAL, denote the set stablei which non-faulty Pi first
constructs during an execution of the protocol when SCi = T. The code for the Order process
indicates that (a1) Pi strips off signatures of every m in Stablei(T) and removes duplicates;
and then, (a2) it removes spurious messages from Stablei(T). Let SigFreei(T) and SpuFreei(T)
denote the resulting Stablei(T) after a1 and a2 are carried out, respectively:
Definition 1: SigFreei(T) = {m | m ∈=Stablei(T)=∧=m.S = ∅}.
Definition 2: SpuFreei(T) = {m | m ∈=SigFreei(T)=∧=(¬∃m’ ∈=SigFreei(T):=m.O = m’.O)}.
Lemma 2: Consider an execution of the protocol in which Pi and Pk are non-faulty. Suppose
that: m ∈=Stablei(T)
equiv(m) ∈=Stablek(T), and m ∈=Stablek(T) equiv(m) ∈=Stablei(T). Then, SpuFreei(T) = SpuFreek(T).
Proof: In reducing Stablei(T) to SigFreei(T), Pi empties the m.S field of every m in Stablei(T)
and then discards duplicates. m and equiv(m) are distinguished only by their signature fields, and setting this field to empty will make them identical. (See the definition of equiv(m) in section 4.1.) If Stablei(T) contains m and equiv(m), the duplicate removal ensures that only one of the identical copies is retained in SigFreei(T). So, by the hypothesis of the lemma, SigFreei(T) = SigFreek(T). Contrary to lemma, assume that SpuFreei(T) ≠ SpuFreek(T); also assume that, without loss of generality, there is an m such that m ∈=SpuFreei(T) and m ∉=SpuFreek(T). By definition 2, SpuFreei(T) ⊆ SigFreei(T). We have established that SigFreei(T) = SigFreek(T). So, m ∈=SigFreek(T). But m ∉=SpuFreek(T); so, by definition 2, there must have been an m’ in SigFreek(T) such that m.O = m’.O. Since SigFreei(T) = SigFreek(T), m’ ∈=SigFreei(T). By definition 2, m cannot be in SpuFreei(T). Lemma 3: The protocol guarantees the unanimity condition, provided d ≥ δ/(1 - 5ρ).
Proof: Consider an execution of the protocol in which Pi and Pk are non-faulty. Say, Pi
accepts some m’. If m’ is signed by Pk, then equiv(m’) is already accepted by Pk. If m’ is not
signed by Pk, then Pi will sign and diffuse m’ to Pk. Lemmas in the appendix B prove that
the timeliness bounds of table 1 are safe when d ≥ δ/(1 - 5ρ). So, Pk will accept the diffused
message due to the safety property of the timeliness bounds. Thus, for every m’ accepted by Pi, Pk accepts equiv(m’). Observe (from figure 2) that a non-faulty processor accepts no m, m.TST, once its stability counter (SC) becomes equal to T, TINIT_VAL. So, for every m’ accepted by Pi, Pk should accept equiv(m’) before SCk becomes equal to m’.TS. Therefore, for any TINIT_VAL, when Pi and Pk form Stablei(T) and Stablek(T) respectively: m ∈=Stablei(T) equiv(m) ∈=Stablek(T). By similar arguments, we can show: m” ∈=Stablek(T) equiv(m”) ∈=Stablei(T). By lemma 2, SpuFreei(T) = SpuFreek(T) = SpuFree(T) (say) for every TINIT_VAL. By definition 2, SpuFree(T) will contain at most one message originating from a given processor. Since processor ordering is unique and known, Pi and Pk will order the entries of SpuFree(T) identically. Further, entries of SpuFree(T) are ordered only after the entries of all SpuFree(T’), T’ < T, have been ordered. So, Pi orders m’ before m if and only if Pk orders m’ before m. Thus the protocol satisfies the unanimity condition. Theorem 1: The protocol guarantees unanimity and validity conditions with ∆== 4d(1 + ρ),
provided d ≥ δ/(1 - 5ρ).
Proof: Follows from lemmas 1 and 3. Ordering Delays with Synchronised Clocks
Let non-faulty processors’ clocks be synchronised within e. According to the classical time-based ordering protocol of [Cristian85], when a non-faulty Pi forms and broadcasts a message m, it sets m.TS to the current reading of its synchronised clock; any non-faulty processor can stabilise an accepted m only at its clock time m.TS + 2(d + e). Therefore, a message m whose broadcast has been initiated at real time τ is ordered by its non-faulty originator at τ + 2(d + e)(1 + ρm.O) and by any non-faulty Pj, jm.O, at some real time within the interval [τ + 2(d + e)(1 + ρj) - e(1 + ρj), τ + 2(d + e)(1 + ρj) + e(1 + ρj)], depending on the actual difference between the readings of Pj’s and m.O’s clocks (which is bounded by e) at the time m is ordered by Pj. So, the upper bound on time-based ordering delays is (2d + 3e)(1 + ρ), or 2d if e and ρ are negligibly small compared to d; if only ρ is negligibly small compared to d, then the upper bound becomes 2(d + e) + e. We will now establish the favourable conditions listed in section 1. Let λ and da be as defined there, and all processors be non-faulty. Suppose also that an input µ is first received by a processor within the TMR system at real-time τ, and that ρ is negligibly small. By τ + λ, every processor forms and broadcasts a message m with m.µ = µ. By τ + λ + 2da, every processor, say Pi, accepts two double-signed messages m’ and m” such that path(m’) ≠ path(m”) and m’.µ = m”.µ = µ; the one with a lower timestamp, say m, becomes stable at Pi by τ + λ + 2da + 2d: if Pi accepts m’ at τ’, τ’ ≤ τ + λ + 2da, say, with path(m’) = Pj:Pk, then by the entry D3 of Table 1, it will not accept a double-signed m with m.TS ≤ m’.TS and path(m) = Pj:Pk, after τ' + 2d ≤ τ + λ + 2da+ 2d; further, since Pi accepts m” with path(m”) = Pk:Pj by time τ + λ + 2da, it will not accept any double-signed m with path(m) = Pk:Pj and m.TS ≤ m”.TS > m’.TS, after τ + λ + 2da + 2d. Thus, every processor orders input µ by τ + λ + 2da + 2d; with the time based protocol, no correct processor can order µ before=τ + 2(d + e). Thus, our protocol is guaranteed to be faster when λ=< 2(e- da). 5.1. Optimal Upper Bound
We next show that the ordering bound of our protocol is the smallest achievable when (i) clocks are not synchronised, (ii) ordering is symmetric, and (iii) a processor cannot deduce temporal order between concurrent messages it receives. Each of these premises is defined Unsynchronised Clocks: Non-faulty processors’ clocks are not synchronised, where clock is a device which a processor uses for observing time. To state formally, let ci(τ) denote the reading of Pi’s clock at real time τ. Clocks of non-faulty Pi and Pj are said to be unsynchronised during an interval ι if |ci(τ) - cj(τ)| is arbitrary for every τ=in ι. Symmetric Ordering: Our protocol, like [Cristian85], is symmetric in the sense that the correct processors run the same program (except for process identities and signatures). In contrast, in an asymmetric protocol (e.g. [Echtle87]), correct processors can execute different code and hence play different roles: one processor, termed the sequencer, decides and disseminates the message ordering for other processors to accept what it has decided. In failure-free executions, asymmetric ordering will be the fastest if the non-sequencer processors can confirm the sequencer’s correct behaviour without delaying message ordering; such delay-free confirmation is possible only when the sequencer is guaranteed to fail in a benign manner (e.g. by crash). However, when (authenticated) Byzantine failures are permitted, the non-sequencer processors must exchange messages to confirm that the sequencer had behaved correctly so far; that is, message diffusion must precede message ordering. Further, when the sequencer fails, message ordering is delayed until the failure is detected and a new sequencer is elected. Though not proved here, it appears that asymmetric ordering cannot offer a better worst-case ordering delay. Non-deductibility of Temporal Order: Based on the definition of ‘happened before’ in [Lampo78], we define two messages to be concurrent iff neither one can be said to have happened before the other. Temporal order on messages is an order that is based on the Newtonian time instants at which messages were generated: for any two messages in the system, one message is before another in the temporal order if the first one was generated earlier in a Newtonian time-frame (see [Veris00] for a formal definition). We assume that no processor can deduce temporal order between concurrent messages it receives For simplicity we will assume ρ = 0 and consider a system in which communication delays between two non-faulty processors can be at most δm. Let us define δm- such that δm- < δm and (δm - δm-) is infinitely small. Theorem 2: Any symmetric ordering protocol that works only with unsynchronised clocks,
will have executions in which the ordering delay can be 3δm + δm-.
Proof: By contradiction. Assume that there is such a protocol which guarantees that ordering delays are always smaller than 3δm + δm-. Consider two distinct executions of this protocol during real-time intervals ι1 and ι2 respectively. By hypothesis, non-faulty processors’ clocks remain unsynchronised throughout each interval In the first execution (see figure 7(a)), Pi fails only by not sending its messages to Pj and not receiving Pj’s messages. Pi sends mi at its clock time ti. Let mi take zero time to reach Pk. Suppose that Pk’s clock reads tk when Pk receives mi. (Since mi takes zero time, when Pk’s clock reads tk, Pi’s clock reads ti.) Let Pk accept and diffuse mi to Pj and the diffused message take δm time to be received by Pj. Just before Pj receives the diffused mi, i.e. when Pk’s clock reads tk + δm-, suppose that Pj‘s clock reads tj and that Pj forms and sends mj which takes δm time to be received by Pk. So, Pk’s clock reads (tk + δm- + δm) when Pk receives mj. Note that Pj sends mj before it receives the diffused mi from Pk, and therefore neither mi nor mj happened before [Lampo78] the other. Since Pk cannot deduce that mi 3 According to [Lampo78], two messages need not be produced at the same newtonian time, for them to be deemed concurrent. 4 By this, we exclude a class of protocols which permit clock synchronisation messages to be piggybacked onto the order protocol messages and thus achieve clock synchronisation during the order protocol execution. originated before mj in real-time, we will assume (without loss of generality) that mj is ordered before mi by all non-faulty processors. Figure 7. Execution Scenarios. (a) First execution. (b) Second execution. In the second execution (see figure 7(b)), Pj fails only by not sending its messages to Pi and not receiving Pi’s messages. Pi sends mi at its clock time ti. mi takes δm time to reach Pk and is not received by Pj. Suppose that Pk’s clock reads tk when Pk receives mi, i.e., when Pi’s clock reads ti + δm. (Note: this is possible with unsynchronised clocks whose readings can differ by an arbitrary amount; to prove the impossibility we have chosen the difference to be a convenient amount.) Assume that the mi diffused by Pk takes δm time to be received by Pj. When Pk’s clock reads tk + δm-, suppose that Pj‘s clock reads tj and that Pj forms and sends mj only to Pk which takes δm time to be received. That is, Pk’s clock reads (tk + δm- + δm) when Pk receives mj. For Pk, this execution is indistinguishable from the first must order mj before mi. Since ordering delays are always smaller than 3δm + δm-, non-faulty Pi must order its own mi before ti + 3δm + δm-. Say Pk’s diffused mj takes δm time to be received by Pi. That is, Pi can receive mj (for the first time), only when its clock reads ti + 3δm + δm-. So, Pi can accept mj only at or after ti + 3δm + δm-. Hence Pi cannot order mj before ti + 3δm + δm-, and therefore before mi. This violates the unanimity condition. By theorem 2, the upper bound ∆ on ordering delays must be at least (3δm + δm-), i.e. ∆ ≥ (3δm + δm-). Since δm is not known directly, but only its upper bound δ (see assumption 4), ∆ ≥ 4δ. Theorem 1 establishes ∆ of our protocol to be 4d(1 + ρ) with d recommended to be d ≥ δ/(1 - 5ρ). When d is chosen to be δ/(1 - 5ρ) and when ρ = 0, we have ∆ = 4δ. Thus, if we ignore the (small) increments on d made to compensate for non-zero ρ, our protocol is The compensation needed to account for non-zero ρ varies from protocol to protocol and is influenced very much by the design and the structure of a given protocol. We do not therefore claim that the compensation needed for our protocol is the smallest required by any 5 This is true even if the sender of a message m timestamps m with the local send time. protocol. Instead, we show in Appendix B that the compensation factor of (1/(1 - 5ρ)) is the smallest needed for our protocol. 6. Concluding Remarks
Replicated processing with majority voting is a well-known technique for masking the effects of Byzantine faults in processors. Message ordering is a central requirement for implementing this concept in a multi-client environment. We have presented a timeout-based message ordering protocol for TMR systems. It is a two-round protocol, as it has to be [Dolev82], since there can be at most one faulty processor in the TMR system. We have also presented in detail the design efforts in keeping the timeliness bounds small; consequently the protocol provides an upper bound on ordering delays, which is the smallest achievable when non-faulty processor’s clocks are not synchronised within a known and finite bound. We conclude the paper by exposing two drawbacks in generalising the protocol to a system of N processors, where at most π out of N processors can fail. We restrict the context of generalisation to the one that preserves the minimality of the message cost of the ordering protocol. (This restriction rules out generalisations into protocols that guarantee early ordering or smaller latencies at the expense of increased message cost.) A generalised protocol (with minimum message cost) must however allow (π=+ 1) rounds of messages exchange. This means that the number of path counters which a processor has to maintain increases exponentially with π and N: one counter for every possible path through which a message with at most (π + 1) signatures can be received. Secondly, the generalisation must first assume that a processor must allow a timeliness bound of (2d)i, 1 ≤ i ≤ π + 1, for any i-signed message it receives. This follows from the arguments presented (in Section 4.3) for the TMR protocol before the bounds were reduced. Thus, when a non-faulty P first receives a timely message m from another processor Q, it has to wait for 2d(π + 1) time before m can be ordered. This means that the ordering delay bound is ∆ = 2d(π + 1) + δ. Reducing the bound to 2d(π + 1) as we did for the TMR protocol in Section 4.3.2, is possible only in the following circumstance: N and π must be such that for any m and any (π + 1)-signed m’ that P may accept, m’.TS m.TS, path(m) must intersect with path(m’). To guarantee that a particular m and all possible (π + 1)-signed m’ have intersecting paths, we must have (π + 1) ≥ N – 1. Note that the ordering problem is vacuous for π > N – 2, and hence the realistic upper bound on π is N - 2. So, the ordering bound can be reduced to 2d(π + 1), only when π = N – 2. With π > 1, the reduction is not possible for an NMR system where N has to be at least 2π + 1. In other words, π = 1 is the only case for which ordering delays can be reduced by δ and replicated processing with majority voting is feasible. That is, the design of the protocol presented exploits features that are so characteristic of the TMR system in providing the smallest upper bound on message ordering delays.
This work has been supported in part by grants from CNPq/Brazil. References
F. V. Brasileiro, P. D. Ezhilchelvan and N. A. Speirs, “TMR processing without explicit clock synchronisation,” Proceedings of the 14th Symposium on Reliable Distributed Systems, Bad Neuenahr, Germany, pp. 186-195, September 1995. [Chandra96] T. D. Chandra and S. Toueg, "Unreliable Failure Detectors for Reliable Distributed Systems", Journal of the ACM, 43(2), pp. 225 - 267, March 1996. [Cristian85] F. Cristian, H. Aghili, R. Strong and D. Dolev, “Atomic Broadcast: from Simple Message Diffusion to Byzantine Agreement,” Digest of Papers, FTCS-15, Ann Arbor, USA, pp. 200-206, June 1985. [Cristian96] F. Cristian, B. Dancey, and J Dehn, "Fault-Tolerance in Air Traffic Control Systems", ACM Transactions on Computer Systems, 14(3), pp. 265-286, August 1996. [Cristian99] F. Cristian and C. Fetzer, "The Timed Asynchronous Distributed System Model", IEEE Transactions on Parallel and Distributed Systems, 10(6), June 1999. D. Dolev and H. R. Strong, “Requirements for Agreement in a Distributed System,” Proceedings of the 2nd Symposium on Distributed Databases, pp. 115-129, September 1982. Dolev, J. Halpern, and H. R. Strong, “On the Possibility and Impossibility of Achieving Clock Synchronisation,” Proceedings of the 16th Annual ACM STOC, pp. 504-511, April 1984. K. Echtle, “Fault Masking and Sequence Agreement by a Voting Protocol with low Message Number,” Proceedings of the 6th Symposium on Reliability in Distributed Software and Database Systems, pp. 149-160, March 1987. C. Fetzer and F. Cristian, "Fail Awareness: An Approach to Construct Fail-safe Applications", Digest of Papers, FTCS-27, Seattle, June 1997. [Kopetz89] H. Kopetz, A. Damm, C. Koza, M. Mulazzani, W. Schwabl, C. Senft, and R. Zainlinger, "Distributed Fault-Tolerant Real-Time Systems: The Mars Approach", IEEE Micro, pp. 25-41, 1989. [Lampo78] L. Lamport, “Time, Clocks, and Ordering of Events in a Distributed System,” Communications of the ACM, 21(7), pp. 558-565, July 1978. L. Lamport, “Using Time Instead of Timeout for Fault-Tolerant Distributed Systems,” ACM Transactions on Programming Languages and Systems, 6(2), pp. 254-280, April 1984. M. Pease, R. Shostak, and L. Lamport, “Reaching Agreement in the Presence of Faults,” Journal of the ACM, 27(2), pp. 228-234, April 1980. [Powell88] D. Powell, P. Verissimo, G. Bonn, F. Waeselynck, and D. Seaton, “The Delta-4 Approach to Dependability in Open Distributed Computing Systems,” Digest of Papers, FTCS-18, Tokyo, pp. 246-251, June 1988. R. L. Rivest, A. Shamir, and L. Adleman, “A Method for Obtaining Digital Signatures and Public Key Cryptosystems,” Communications of the ACM, 31(2), pp. 120-126, February 1978. F. Schmuck and F. Cristian, “Continuous Clock Amortization Need Not Affect the Precision of a Clock Synchronisation Algorithm,” Proceedings of the 9th ACM Symposium on Principles of Distributed Computing, pp. 133-141, August 1990. F. B. Schneider, “Implementing Fault Tolerant Services Using the State Machine Approach: a Tutorial,” ACM Computing Surveys, 22(4), pp. 299-319, December 1990. S. K. Shrivastava, P. D. Ezhilchelvan, N. A. Speirs, S. Tao, and A. Tully, “Principle Features of the Voltan Family of Reliable System Architectures for Distributed Systems,” IEEE Transactions on Computers, 41(5), pp. 542-549, May 1992. N.A. Speirs, S. Tao, F.V. Brasileiro, P.D. Ezhilchelvan, and S.K Shrivastava, “The Design and Implementation of Voltan Fault-Tolerant Systems for Distributed Systems, Transputer Communications, 1(2), pp. 1-17, November 1993. G Tsudik, “Message Authentication With One-Way Hash Functions”, ACM Computer Communications Review, 22(5), 1992. N. Vasanthavada and P. N. Marinos, “Synchronisation of Fault-Tolerant Clocks in the Presence of Malicious Failures,” IEEE Transactions on Computers, Vol. C-37(4), pp. 440-448, April 1988. P. Verissimo, L. Rodrigues, and J. Rufino, “The Atomic Multicast Protocol (AMp),” In Delta-4: A Generic Architecture for Dependable Distributed Computing, D. Powell (Ed.) ESPRIT Research Papers, Springer-Verlag, pp. 267-294, 1991. P. Verissimo, “Causal Delivery Protocols in Real-Time Systems: a Generic Model,” Journal of Real Time Systems, 10(1), pp. 45-73, 1996. P. Verissimo, L. Rodrigues, and A. Casimoro, “Cesium Spray: A Precise and Accurate Global Clock Service of Large Scale Systems,” Journal of Real Time Systems, 11(3), 1997. P. Verissimo and M. Raynal, “Time in Distributed System Models and Algorithms,” in ‘Advances in Distributed Systems’, S. Krakowiak and S. K. Shrivastava (Eds.) LNCS 1752, Springer-Verlag, pp. 1-32, 2000. Appendix
A. Relationship Between Timeliness Checks of Messages
Lemma A1: Let t1 and t2 be the local clock times when a non-faulty Pi receives a single-
signed m1 and a double-signed m2, respectively; also, let m1.TSm2.TS and t1 > t2. If Pi
finds m1 timely then it must also find m2 timely.
Figure A1. m1.TSm2.TSm’.TS. Proof: We will measure time according to Pi’s local clock and prove the lemma by
contradiction. Suppose that Pi finds m1 timely and m2 late. Let m2 be late by y, y > 0, time
units, i.e., if Pi had received m2 at any time before (t2 - y) then it would have found m2
timely. (See figure A1.) That m2 received at t2 was found late implies that there exists a
message m’, m’.TS m2.TS, which was accepted by Pi at (t2 - y - tb2), where tb2 is the
timeliness bound indicated by the entry of table 1 whose row corresponds to the path(m’) and column to the path(m2). Let tb1 be the timeliness bound indicated by the entry of table 1 whose row corresponds to the path(m’) and column to the path(m1). Since m1.TSm2.TSm’.TS, m1 must be received by Pi before (t2 - y - tb2 + tb1) for it to be considered timely. For any given path(m’), i.e. in any given row of table 1, the timeliness bound for a single-signed message is smaller than that for a double-signed-message. That is, tb1 < tb2. This means, (t2 - y - tb2 + tb1) < t2; by given, t2 < t1. So, t1 > (t2 - y - tb2 + tb1). This means that m1 is received by Pi after (t2 - y - tb2 + tb1) and cannot be found timely by Pi. B. Safe Timeliness Bounds
We next show that the timeliness bounds satisfy the safety requirement of section 4.2 when d ≥ δ/(1 - 5ρ). This is done in two stages: lemma B1 proves that a non-faulty processor always finds another non-faulty processor’s 1-signed message timely; and lemma B3 shows that when a non-faulty processor diffuses a 2-signed message to another non-faulty processor, the latter finds the diffused message timely. As in the main paper, we adopt the style of writing real time values in Greek and clock time values in italicised lower case Roman letters. The term ‘clock’ always refers to a processor’s physical clock. We assume the following notations: STARTi(p,≥T) denotes the smallest real time instance when PCi[p] for path p becomes larger than or equal to T, TINIT_VAL. That is, just before
real time STARTi(p,≥T), PCi[p] is less than T. ENDi(≤T) denotes the largest real time
instance when MCi is less than or equal to T. That is, just after real time ENDi(≤T), MCi is
larger than T and Pi will not form and send any m, m.TT. We also retain the notation ρi,
|ρi| ≤ ρ, to denote the rate with which the clock of a processor Pi drifts from real time.
Lemma B0: Say a non-faulty Pi accepts m at real time τi. If m is not signed by a non-faulty
Pk, Pk then receives m or equiv(m) at real time τk such that τk < τi + δ.
Proof: If m is Pi’s own message, Pi is required to send m to both Pj and Pk. If m is signed by
Pj, Pi is required to diffuse double-signed equiv(m) to Pk. By assumption 4, Pk then receives
m or equiv(m) before τi + δ.
Lemma B1: When a non-faulty Pk forms and sends m, another non-faulty Pi finds m timely,
provided d ≥ δ/(1 - ρ).
Proof: By lemma B0, Pi receives m or equiv(m) within δ real-time after Pk sends m. Say,
m.TS = T. If Pi has not accepted any message with timestamp larger than or equal to T until it
receives m, then MCi ≤ T when it receives m and Pi finds m timely. Let us suppose that MCi
is already larger than T when Pi receives m. Since Pi has accepted one or more messages with
timestamp larger than or equal to T, there must exist a message, say m’, m’.TST, whose
acceptance causes Pi to set PCi[Pk] ≥ T at real time STARTi(Pk,T). If we show that
STARTi(Pk,T) - ENDk(T) ≥ δ for every possible path(m’), then the lemma is proved. We
present the proof by categorising the values of path(m’) in two cases. For each case, we construct the proof in the following manner. We define τi to be the real time when Pi accepts m’ and α = STARTi(Pk,T) - τi. (Figure B1 shows these values along the real-time axis for Pi.) We derive two inequalities – inequality (1) involving τi and ENDk(≤T) and inequality (2) involving τi and STARTi(Pk,T). We then relate these two inequalities through the common element τi, to show what is required. Figure B1. Pi accepting m’ at τi sets PCi(Pk) ≥=T at STARTi(Pk,T). Case I. path(m’) = Pk or Pk:Pj or Pj:Pk. Just before sending m’, m’.TST, Pk sets MCk to m’.TS + 1. Therefore, ENDk(T) ≤ τi
For all the considered values of path(m’), α = d(1 + ρi). (See entries B1, E1 and D1 of table STARTi(Pk,T) = τi + d(1 + ρi)
STARTi(Pk,T) ≥ ENDk(≤T) + d(1 + ρi). Since d ≥ δ/(1 - ρ) implies d(1 + ρi) ≥ δ, we have STARTi(Pk,T) - ENDk(T) ≥ δ.
Case II. path(m’) = Pj or Pi. Let Pk receive m’ (when path(m’) = Pi) or equiv(m’) (when path(m’) = Pj) from Pi, at real time τk. By lemma B0, τk < τi + δ. Just before τk, we have either MCk ≤=T or MCk > T. In the first case ENDk(≤T) = τk, and in the latter case ENDk(≤T) < τk. So, ENDk(T) - δ < τi
From entries A1 and C1 of table 1, α = 2d(1 + ρi). So, STARTi(Pk,T) = τi + 2d(1 + ρi)
STARTi(Pk,T) > ENDk(≤T) - δ=+=2d(1 + ρi).= When d ≥ δ/(1 - ρ), 2d(1 + ρi) ≥ 2δ, thus STARTi(Pk,T) - ENDk(T) ≥ δ.
Lemma B2: STARTi(Pj:Pk,T) ≥ STARTi(Pj,T) for a non-faulty Pi and TINIT_VAL.
Proof: Let m’, m’.TST be the message whose acceptance at time τ causes Pi to set
PCi[Pj:Pk] ≥ T at STARTi(Pj:Pk,T). From table 1, τ ≥ STARTi(Pj:Pk,T) – 2d(1 + ρi); also,
from the protocol in figure 2, Pi will schedule an update of PCi[Pj] to at least m’.TS at latest
by τ +2d(1 + ρi), thus STARTi(Pk,T) ≤ τ + 2d(1 + ρi), and since d > 0,
STARTi(Pj:Pk,T) ≥ STARTi(Pk,T).
Lemma B3: When a non-faulty Pi diffuses a double-signed m to non-faulty Pk, Pk finds m
timely, provided d ≥ δ/(1 - 5ρ).
Proof: As in lemma B1, we will let m.TS = T and suppose that MCk is already larger than T
when Pk receives m. Since Pk has accepted message(s) with timestamp larger than or equal to
T, there must exist a message, say m’, m’.TS ≥=T, whose acceptance causes Pk to set
PCk[Pj:Pi] ≥=T at real time STARTk(Pj:Pi,T). Let τk be the real time when Pk accepts m’
and α = STARTk(Pj:Pi,T) - τk. Figure B2 shows these values along the real-time axis for Pk.
STARTk(Pj:Pi,≥T) k[Pj:Pi] < T) PC Figure B2. Pk accepting m’ at τk sets PCk(Pj:Pi) ≥=T at STARTk(Pj:Pi,T). Note that Pi must have diffused m to Pk before STARTi(Pj,T), and that Pk will find m timely if it receives m before STARTk(Pj:Pi,T). We need only to show that
STARTk(Pj:Pi,T) - STARTi(Pj,T) ≥ δ for every possible path(m’). We show this by
considering three cases regarding the originator of m’, m’.O. In all cases, there is a real time τi when Pi either sends m’ or receives equiv(m’); in the same way we did to prove lemma B1, for each case we construct two inequalities: inequality (3) involving τi and STARTk(Pj:Pi,T), and inequality 4 involving τi and STARTi(Pj,T). These equalities are then related through τi to show what is required. We first present the timeliness bounds that a non-faulty Pk uses to check the timeliness of a message m it receives. They are presented in the same way that we did before for a non-faulty Pi, and are obtained simply by interchanging the suffixes k and i in table 1. Table B1 shows the timeliness bounds used by Pk. Pi:Pj (E) d Table B1: Pk’s timeliness bounds for a past m given that MCk first exceeded m.TS due to accepting m. Case I. m’.O = Pk. From entry A3 of table B1, α = 4d(1 + ρk) and Pk must have formed and sent m’ at real time STARTk(Pj:Pi,T) - 4d(1 + ρk). Let Pi receive m’ at τi, by lemma B0, τi < STARTk(Pj:Pi,T) - 4d(1 + ρk) + δ=
From lemma B1, Pi accepts m’; so from entry B2 of table 1, we have STARTi(Pj,T) ≤ τi + 2d(1 + ρi)
STARTi(Pj,T) < STARTk(Pj:Pi,T) - 4d(1 + ρk) + δ + 2d(1 + ρi), and STARTk(Pj:Pi,T) - STARTi(Pj,T) > 2d(1 + 2ρk - ρi) - δ STARTk(Pj:Pi,T) - STARTi(Pj,T) > δ,
whenever d ≥ δ/(1 + 2ρk - ρi), which is always true since d ≥ δ/(1 - 5ρ) > δ/(1 + 2ρk - ρi). Case II. m’.O = Pi. Note that the m’ that Pk accepts can be single or double-signed. Let Pi form and send m’ (if m’ is single-signed) or the single-signed equiv(m’) (if m’ is double-signed) at real time τi. Whether m’ is single or double-signed, the entries B3 and E3 of table B1 indicate that α = 3d(1 + ρk). Since τi ≤ τk, τi STARTk(Pj:Pi,T) - 3d(1 + ρk)
STARTi(Pj,T) ≤ τi + 2d(1 + ρi)
STARTi(Pj,T) ≤ STARTk(Pj:Pi,T) - 3d(1 + ρk) + 2d(1 + ρi), and STARTk(Pj:Pi,T) - STARTi(Pj,T) ≥ d(1 + 3ρk - 2ρi) STARTk(Pj:Pi,T) - STARTi(Pj,T) ≥ δδ,
whenever d ≥ δ/(1 + 3ρk - 2ρi), which is true since d ≥=δ/(1 - 5ρ) ≥=δ/(1 + 3ρk - 2ρi). Case III: m’.O = Pj. Say path(m’) = Pj. From entry C3 of table B1, α = 3d(1 + ρk). Let Pi receive the diffused equiv(m’) from Pk at real time τi; from lemma B0, τi < STARTk(Pj:Pi,T) - 3d(1 + ρk) + δ= =
If Pi accepts equiv(m’) at τi, then from entry D2 of table 1, τi ≥ STARTi(Pj,T) - d(1 + ρi); otherwise, τi > STARTi(Pj:Pk,T) and from lemma B2, τi > STARTi(Pj,T). Since d > 0, in STARTi(Pj,T) ≤ τi + d(1 + ρi)
STARTi(Pj,T) < STARTk(Pj:Pi,T) - 3d(1 + ρk) + δ=+ d(1 + ρi), and STARTk(Pj:Pi,T) - STARTi(Pj,T) > 2d(1 + 3ρk/2 - ρi/2) - δ, STARTk(Pj:Pi,T) - STARTi(Pj,T) > δδ,
whenever d ≥ δ/(1 + 3ρk/2 - ρi/2), which is true since d ≥=δ/(1 - 5ρ) > δ/(1 + 3ρk/2 - ρi/2). Say path(m’) = Pj:Pi. α = 2d(1 + ρk) from entry D3 of table B1. Let Pi diffuse m’ to Pk at τi STARTk(Pj:Pi,T) - 2d(1 + ρk)
STARTi(Pj,T) ≤ τi + d(1 + ρi)
STARTi(Pj,T) ≤ STARTk(Pj:Pi,T) - 2d(1 + ρk) + d(1 + ρi), and STARTk(Pj:Pi,T) - STARTi(Pj,T) ≥ d(1 + 2ρk - ρi), STARTk(Pj:Pi,T) - STARTi(Pj,T) ≥ δδ,
whenever d ≥ δ/(1 + 2ρk - ρi), which is true since d ≥=δ/(1 - 5ρ) > δ/(1 + 2ρk - ρi).


Microsoft word - 02144802.doc

GRANT PROGRESS REPORT SUMMARY 01602: Longitudinal Study Investigating the Progression and Pathogenesis of Atypical Hyperadrenocorticism in Scottish Terriers Principal Investigator: Research Institution: Virginia-Maryland Regional College of Veterinary Medicine Grant Amount: Start Date: 1/1/2012 End Date: 12/31/2013 Progress Report: Mid-Year 2 Report Due: Receiv

How To Treat Digital DermatitisDigital Dermatitis (DD), also known as Hairy Foot Warts, Strawberry Foot Rot, Mortellaro’s Disease and Rasperry Heel, has become the most prevalent infectious hoof disorder on many Canadian dairy farms. Dr. Paul Greenough, in his new book ‘Bovine Laminitis and Lameness’ (see sidebar on next page) recom-mends the following treatment for animals infected wit

Copyright © 2018 Medical Abstracts