# Radix-2 Division Algorithms with an Over-Redundant Digit Set 

Jo Ebergen, Member, IEEE and Navaneeth Jamadagni, Student Member, IEEE


#### Abstract

This paper presents a derivation of four radix-2 division algorithms by digit recurrence. Each division algorithm selects a quotient digit from the over-redundant digit set $\{-2,-1,0,1,2\}$, and the selection of each quotient digit depends only on the two most-significant digits of the partial remainder in a redundant representation. Two algorithms use a two's complement representation for the partial remainder and carry-save additions, and the other two algorithms use a binary signed-digit representation for the partial remainder and carry-free additions. Three algorithms are novel. The fourth algorithm has been presented before. Results from the synthesized netlists show that two of our fastest algorithms achieve an improvement of 10 percent in latency per iteration over a standard radix-2 SRT algorithm at the cost of 36 percent more power and 50 percent more area.


Index Terms-Signed-digit arithmetic, two's complement arithmetic, carry-save division, carry-free division

## 1 Introduction

DIVISION is one of the most complex and the slowest arithmetic operations performed in microprocessors. Although division occurs less frequently than other arithmetic operations, having an efficient divider is necessary for a good system performance [1]. There are several division algorithms available to implement in hardware. The digitrecurrence SRT division algorithm is the most frequently implemented algorithm in general purpose processors. A standard radix-2 SRT algorithm retires a quotient digit from the set $\{-1,0,1\}$. Typically, the selection of a quotient digit relies on the four most significant bits of the partial remainder in a redundant representation. The logic that selects a quotient digit is called the quotient selection logic and it usually appears in the critical path of a divider design. Therefore, simplifying the quotient selection logic potentially leads to a faster divider design which is the main motivation for this work.

### 1.1 Division Preliminaries

A division algorithm must compute an approximation to $Q=R / D$, where $Q$ is the quotient, $D$ is the divisor and $R$ is the dividend and the initial partial-remainder. We assume that

$$
\begin{equation*}
R, D \in[1,2) \tag{1}
\end{equation*}
$$

For binary representations of $R$ and $D$, performing the appropriate shift operations before the start of a division algorithm can satisfy these assumptions.

- J. Ebergen is with Oracle Laboratories, Redwood Shores, CA 94404. E-mail: jo.ebergen@oracle.com.
- N. Jamadagni is with Oracle Laboratories, Redwood Shores, CA 94404 and Portland State University, Portland, OR 97201.
E-mail: navaja@cecs.pdx.edu.
Manuscript received 17 Nov. 2013; revised 24 Sept. 2014; accepted 1 Oct. 2014. Date of publication 13 Nov. 2014; date of current version 12 Aug. 2015. Recommended for acceptance by P. R. Schaumont.
For information on obtaining reprints of this article, please send e-mail to: reprints@ieee.org, and reference the Digital Object Identifier below.
Digital Object Identifier no. 10.1109/TC.2014.2366738

In general, digit-recurrence division algorithms can be described by a recurrence relation

$$
\begin{equation*}
r_{n+1}=2 * r_{n}-q_{n} * D \tag{2}
\end{equation*}
$$

where, $n$ represents the iteration index and $r_{n}$ is the remainder after the $n$th iteration with initially $r_{0}=R / 2$, and $q_{n}$ is the $n$th quotient digit selected from the set $\{-1,0,1\}$. In each iteration, the algorithm doubles the remainder, then selects a quotient digit $q_{n}$, and subtracts $q_{n} * D$ from $r_{n}$. Alternatively, if we start with a different initialization $r_{0}=R$, then we can use the recurrence relation

$$
\begin{equation*}
r_{n+1}=2 *\left(r_{n}-q_{n} * D\right) \tag{3}
\end{equation*}
$$

Here, each repetition step starts with selecting a quotient digit $q_{n}$, then subtracting $q_{n} * D$, and finally doubling the result. The rest of this paper assumes the latter recurrence relation (3) for describing all the algorithms.

Additionally, we require that the error interval of the computed quotient be less than one unit of least precision ( $u l p$ ), where $u l p=2^{-L}$ for some $L>0$. In other words, if $q$ is the computed quotient and the error, $\epsilon$, is given by $\epsilon=q-R / D$, then we require that $\epsilon \in(-u l p / 2, u l p / 2)$. Alternatively, the error interval may include one of the bounds, but not both bounds. For IEEE single-precision format, $L=23$, and for IEEE double-precision format, $L=52$.

## 2 Related Work

The idea of using an over-redundant quotient-digit set, $\{-2,-1,0,1,2\}$, to simplify the quotient selection logic was first presented for a radix-4 SRT algorithm in [2]. In [3], the authors develop a radix-2 SRT algorithm using a signed-digit representation for the partial remainder and an over-redundant quotient-digit set. The quotient selection logic in [3] inspects only the two most-significant digits of the partial remainder in a signed-digit representation to determine an appropriate quotient digit. In [4], Burgess presents a radix-2 Svoboda division algorithm that also inspects only the two most-significant digits of
the partial remainder in a signed-digit representation to retire a quotient digit. This algorithm, however, requires pre-scaling of both divisor and dividend for divisors in the range $[1.5,2)$.

We derive four radix-2 division algorithms that use an over-redundant quotient-digit set. We denote the four algorithms as A1, A2, B1 and B2. The three algorithms, A1, A2 and B2, are novel and the fourth algorithm, B1, is the same as the one presented in [3]. All four algorithms inspect only the two most-significant bits of the partial remainder to select a quotient digit. Two algorithms use a two's complement representation for the partial-remainder and carrysave additions. The other two algorithms use a signed-digit representation for the partial-remainder and carry-free additions. The major difference between our algorithms and the SRT algorithms in [3], [5], [6] and [7] is the range invariant for the partial remainder. We also use an alternative method to analyze the algorithms presented in this paper and in [3], [5], [6] and [7] by using invariants and highlighting the differences between the algorithms. The analysis method used in this paper is similar to the one described in [8]. This paper extends the algorithm of [8] in several ways: we present three new algorithms and we compare synchronous implementations for all the algorithms discussed in this paper in terms of latency, power, and area. In [9], the authors focus on an asynchronous implementation of the algorithm in [8].

### 2.1 An Alternative Analysis Method

A conventional method for analyzing a digit-recurrence division algorithm uses a Robertson diagram or a P-D diagram [5], [6]. Both these diagrams show a non-redundant value for the partial remainder, $r$, even when $r$ is in a redundant form. In an SRT algorithm, $r$, can be represented in a two's complement carry-save form such that $r=r_{s}+r_{c}$ or in a signed-digit carry-free form such that $r=r_{c}-r_{s}$, where $r_{s}$ is called the sum or the parity bits and $r_{c}$ is called the carry or the majority bits. While the properties of division algorithms can be analyzed at the digit-level [10], when we seek a hardware implementation, the actual encoding of the bits determine the complexity of the circuit. Therefore, we believe that a diagram that clearly shows the redundant representation of the partial remainder will help us design an efficient quotient selection logic. In this paper we present one such diagram and use this diagram to derive and analyze our division algorithms. For clarity, we also show the P-D diagrams for our algorithms.

### 2.2 An Alternative Range Invariant for the Partial Remainder

The range invariant for the partial remainder depends on the choice of a recurrence relationship and a division algorithm. The SRT algorithms that use the recurrence relation in (2) and a two's complement representation for the partial remainder, have a range invariant of

$$
\begin{equation*}
r=r_{s}+r_{c} \in[-D, D) \tag{4}
\end{equation*}
$$

The SRT algorithms that use the recurrence relation in (3) and a two's complement representation for the partial remainder, have a range invariant of

$$
\begin{equation*}
r=r_{s}+r_{c} \in[-2 D, 2 D) \tag{5}
\end{equation*}
$$

When we use a signed-digit representation for the partial remainder, the range invariant for the partial remainder excludes the lower bound, that is, $r \in(-D, D)$ for recurrence relation in (2) and $r \in(-2 D, 2 D)$ for recurrence relation in (3). Because we use the recurrence relation in (3) for the algorithms presented in this paper, we consider invariant (5) for SRT algorithms.

Our algorithms, A1, A2, B1 and B2, have a different range invariant. The algorithms that use a two's complement carry-save representation for the partial remainder, $r=r_{s}+r_{c}$, have a range invariant of

$$
\begin{equation*}
r_{s}, r_{c} \in[-2,2) \quad \text { and } \quad r=r_{s}+r_{c} \in[-4,4) \tag{6}
\end{equation*}
$$

The algorithms that use a signed-digit carry-free representation for the partial remainder, $r=r_{c}-r_{s}$ have a range invariant of

$$
\begin{equation*}
r_{s}, r_{c} \in[0,4) \quad \text { and } \quad r=r_{c}-r_{s} \in(-4,4) . \tag{7}
\end{equation*}
$$

In Section 4, we prove the correctness of our algorithms that use range invariants (6) and (7). In Section 7, we show that algorithms, A1 and B1, also maintain the range invariant (5) for the partial remainder in addition to maintaining invariant (6) or (7).

## 3 Radix-2 SRT Algorithm

A standard radix-2 SRT algorithm uses a two's complement representation for the partial remainder, $r$, and carrysave additions (subtractions). The result of a carry-save addition is two numbers, $r_{s}$ and $r_{c}$, whose sum is the actual value. Therefore, $r=r_{s}+r_{c}$, where $r_{s}$ represents the sum or parity bits and $r_{c}$ represents the carry or majority bits. The standard SRT algorithm has four non-fractional bits and carry-save addition is done modulo $2^{4}$. More information on SRT algorithms can be found in [5], [6] and [7]. The selection of the quotient digit is based on the values of the four most-significant digits of the remainder in carry-save form $r_{s}, r_{c}$. Let $\operatorname{cpa4}\left(r_{s}, r_{c}\right)$ denote the result of a carry-propagate addition of only the four most significant digits of $r_{s}$ and $r_{c}$. The algorithm selects a quotient digit $q_{n}$ according to the following conditions

$$
\begin{array}{ll}
q_{n}=0 & \text { if } \operatorname{cpa} 4\left(r_{s}, r_{c}\right)=-1 \\
q_{n}=+1 & \text { if } \operatorname{cpa} 4\left(r_{s}, r_{c}\right)>-1 \\
q_{n}=-1 & \text { if } \operatorname{cpa} 4\left(r_{s}, r_{c}\right)<-1
\end{array}
$$

Figs. 1a and 1 b show the quotient selection function in a conventional P-D diagram and in the $r_{s}, r_{c}$ plane respectively. In Fig. 1a, the $x$-axis represents the value of the divisor and the $y$-axis represents the value of the partial remainder. In Fig. 1b, the $x$-axis represents the value of the sum bits and is labeled with both the absolute value of $r_{s}$ (top) as well as the four non-fractional bits of the two's complement representation (bottom). The $y$-axis represents the value of the carry bits and is labeled with both the absolute value of $r_{c}$ (right) as well as the four non-fractional bits of the two's complement representation (left). A point in this figure has coordinates $\left(r_{s}, r_{c}\right)$.


Fig. 1. Standard radix-2 SRT algorithm: Figure (a) shows the P-D plot for a standard radix-2 SRT algorithm. The $x$-axis represents the value of the divisor and $y$-axis represents the value of the partial remainder. Figure (b) illustrates the quotient selection areas for a standard radix-2 SRT division in the $\left(r_{s}, r_{c}\right)$ plane using only the four most-significant bits. In the ( $r_{s}, r_{c}$ ) plane, each point has coordinates ( $r_{s}, r_{c}$ ). Each diagonal line $r_{s}+r_{c}=r$ modulo $2^{4}$ represents a set of points with different $r_{s}$ and $r_{c}$ values but the same remainder value. Addition is modulo $2^{4}$, so diagonal bands wrap around the square. For radix-2 SRT division, the remainder $r_{s}+r_{c}$ remains within the range $[-2 D, 2 D)$. In the figure $D_{\max }=2-u l p$.

Each diagonal line $r_{s}+r_{c}=r$ modulo $2^{4}$ represents a set of points with the same remainder value. The area labeled 2 X is the area where $\operatorname{cpa} 4\left(r_{s}, r_{c}\right)=-1$. For every remainder in this area, the SRT algorithm selects the quotient digit 0 and performs a doubling. The area labeled SUB1\&2X is the area where $\operatorname{cpa} 4\left(r_{s}, r_{c}\right)>-1$. For every remainder in this area, the SRT algorithm selects quotient digit 1 and performs a subtraction with $D$ followed by a doubling. The area labeled ADD1\&2X is the area where $\operatorname{cpa4} 4\left(r_{s}, r_{c}\right)<-1$. For every remainder in this area, the SRT algorithm selects quotient digit -1 and performs an addition with $D$ followed by a doubling. Because addition is calculated modulo $2^{4}$, the diagonal bands wrap around the square.

Because the SRT algorithm satisfies the invariant $r_{s}+r_{c} \in[-2 D, 2 D)$, only the shaded areas are accessible. There are large inaccessible areas. In fact, at least half the area is inaccessible. These large inaccessible areas suggest that there may be more efficient quotient selections that utilize fewer digits. In the following sections we derive such quotient selection algorithms.

## 4 The Invariants and Termination

We use recurrence relations and invariants to prove the correctness of our division algorithms and calculate the error in the computed quotient. In this section we make explicit which invariants we use.

The formula

$$
\begin{equation*}
Q * D=R \tag{8}
\end{equation*}
$$

expresses the desired relation between $Q, D$, and $R$, where $Q$ is the exact quotient. In our algorithms, we use lowercase variables $q$ and $r$, where $q$ represents the quotient calculated 'thus far,' and $r$ represents the remainder calculated 'thus far.' The invariant for all the variables is as follows:

$$
\begin{equation*}
q * D+2^{-n} * r=R . \tag{9}
\end{equation*}
$$

In this invariant, $n$ is the iteration index and $q_{n} * 2^{n}$ is added to $q$ in the $n$th iteration, where $q_{n}$ is the quotient digit selected in $n$th iteration.

We look for a number of program statements for the program variables $q, r$, and $c$ that establish or maintain invariant (9). Once we have these program statements, we can then combine the statements in various ways to obtain a division algorithm.

The initialization $\mathrm{q}=0$; $\mathrm{r}=\mathrm{R}$; $\mathrm{n}=0$ establishes invariant (9). Any quotient digit from an over-redundant digit-set $\{-2,-1,0,1,2\}$ and the recurrence equation (3) will maintain the invariant (9). The challenge is to choose a quotient digit that will maintain the range invariant (6) if the partial remainder is in a two's complement representation or the range invariant (7) if the partial remainder is in a signeddigit representation.

TABLE 1
Labels to Denote the Choice of a Quotient Digit and the Statements Executed

| Label | Quotient Digit, $q_{n}$ | Statements executed |
| :---: | :---: | :---: |
| ADD2 \& 2X | -2 | $\begin{gathered} \mathrm{r}=2 *(\mathrm{r}+2 \mathrm{D}) ; \\ \mathrm{q}=\mathrm{q}-2 * 2^{-\mathrm{n}} ; \mathrm{n}=\mathrm{n}+1 \end{gathered}$ |
| ADD1 \& 2X | -1 | $\begin{gathered} \mathrm{r}=2 *(\mathrm{r}+\mathrm{D}) ; \mathrm{q}=\mathrm{q}-1 * 2^{-\mathrm{n}} ; \\ \mathrm{n}=\mathrm{n}+1 \end{gathered}$ |
| 2X | 0 | $\begin{gathered} r=2 * r ; q=q-0 * 2^{-n} ; \\ n=n+1 \end{gathered}$ |
| SUB1 \& 2X | +1 | $\begin{gathered} \mathrm{r}=2 *(\mathrm{r}-\mathrm{D}) ; \mathrm{q}=\mathrm{q}+1 * 2^{-\mathrm{n}} ; \\ \mathrm{n}=\mathrm{n}+1 \end{gathered}$ |
| SUB2 \& 2X | +2 | $\begin{gathered} \mathrm{r}=2 *(\mathrm{r}-2 \mathrm{D}) ; \\ \mathrm{q}=\mathrm{q}+2 * 2^{-\mathrm{n}} ; \mathrm{n}=\mathrm{n}+1 \end{gathered}$ |



Fig. 2. The area for partial remainders $\left(r_{s}, r_{c}\right)$ where the value of the remainder $r$ is given by $r=r_{s}+r_{c}$. The partial remainders $r_{s}$ and $r_{c}$ are in a two's complement representation. The points $\left(r_{s}, r_{c}\right)$ on a diagonal line, like $r_{s}+r_{c}=3$, can have different values for $r_{s}$ and $r_{c}$ but have the same remainder value. The center square, Q0 $\cup Q 1 \cup Q 2 \cup Q 3$, satisfies range invariant (6).

We use labels to denote the choice of a quotient digit and the statements executed to update the partial remainder (according to (3)), quotient, and the iteration index. Table 1 lists the labels corresponding to the choice of a quotient digit and the statements executed.

Now we need to make sure that the error in the computed quotient is small enough. Using invariant (9) we can express the error, $\epsilon$, in the computed quotient as

$$
\begin{equation*}
\epsilon=R / D-q=2^{-n} * r / D . \tag{10}
\end{equation*}
$$

Consequently, by requiring that $2^{-n} * r / D \in[-u l p / 2, u l p / 2)$, we require that the computed quotient $q$ has an error interval of length less than $u l p$. For a given $u l p$, the condition $2^{-n} * r / D \in[-u l p / 2, u l p / 2)$ translates into a condition which determines the value of $n$, and a condition for the range of the remainder $r$, which determines the value of $r / D$.

For example, if we consider the range invariant (5), the condition $2^{-n} * r / D \in[-u l p / 2, u l p / 2)$ translates into the condition $2^{-n} * 2 \leq u l p / 2$, where $u l p=2^{-L}$. Thus, the termination condition becomes $n \geq L+2$. The range invariant may also exclude the lower bound, that is, $(-2 D, 2 D)$ instead of $[-2 D, 2 D)$

If we consider the range invariants (6) or (7), the termination condition $2^{-n} * r / D \in[-u l p / 2$, ulp/2) translates into the condition $2^{-n} *(4 / D) \leq u l p / 2$, where $u l p=2^{-L}$ and $D \in[1,2)$. In this case the termination condition becomes $n \geq L+3$. Consequently a division algorithm using the range invariants (6) or (7) requires one more iteration to obtain the same accuracy than a division algorithm using the range invariant (5).

## 5 Two's Complement Implementation and Carry-Save Addition

For the derivation of our first set of algorithms we analyze what happens with doublings and additions in the $\left(r_{s}, r_{c}\right)$


Fig. 3. The effect of doubling all points in Q1 or Q3.
plane, when numbers are represented in two's complement and additions are implemented by means of carry-save additions. For clarity, we also provide the P-D diagrams for our algorithms. A two's complement representation of $m$ non-fractional bits can represent numbers in the range $\left[-2^{m-1}, 2^{m-1}\right)$. Note that the lower bound is inclusive while the upper bound is exclusive. Adding and subtracting numbers in two's complement arithmetic can be done by means of carry-save addition modulo $2^{m}$.

We observe that in order to represent the initial value of $D \in[1,2)$ and partial remainders $r_{s}$ and $r_{c}$ in the range $[-2,2)$ we only need two non-fractional bits rather than four in the SRT algorithm of Fig. 1b. Let us look at the effect of the addition and doubling operations on all points that satisfy range invariant (6). This is the bold centersquare, $(Q 0 \cup Q 1 \cup Q 2 \cup Q 3)$, in Fig. 2. To illustrate the effect of an addition and a doubling operation we use the $\left(r_{s}, r_{c}\right)$ plane and consider three non-fractional bits. An addition or a doubling operation can yield a point inside the center square or outside the center square. For the points that are outside the center square we introduce an extra operation that returns the point to the center square while maintaining invariant (9).

### 5.1 Doublings and Translations

Fig. 3 illustrates the effect of doubling any point in the small square Q1. A doubling expands the smaller square Q1 into the larger square $\mathbf{C} 1$. Notice that in square $\mathbf{C} 1, r \in[-4,4)$ but $r_{s} \in[-4,0)$ and $r_{c} \in[0,4)$ which violates invariant (6). To maintain invariant (6), we need to bring back the points in the square C 1 to the center square. To bring back the points in the square C 1 to the center square, we perform a translation over $(2,-2)$, as illustrated in Fig. 4. Translation over $(2,-2)$ is implemented by adding 2 to $r_{s}$ and subtracting 2 from $r_{c}$. Note that the translation keeps the value of $r_{s}+r_{c}$ unchanged. In fact, any translation of a point $\left(r_{s}, r_{c}\right)$ over distance $(t,-t)$ for any number $t$ maintains the value of $r_{s}+r_{c}$. Note that because translations involve addition and subtraction with a constant, the translations can be implemented by a simple recoding of $r_{s}$ and $r_{c}$.


Fig. 4. The effect of translating C1 over (2, -2 ).

Any doubling of square Q1 followed by a translation over $(2,-2)$ in effect expands the square Q1 to the center square. Similarly, doubling of square Q3 followed by a translation over $(-2,2)$ expands the square Q3 to the center square. In both cases, the doubling followed by the translation maintain invariant (9) and range invariant (6).

How do we implement these doublings and translations? Doublings can be implemented by left shifting the partial remainders $r_{s}$ and $r_{c}$ by one position. The translations over $(2,-2)$ and $(-2,2)$ can be implemented by a simple recoding of the most-significant bits of $r_{s}$ and $r_{c}$ as follows.

$$
\begin{array}{lll}
10 & \rightarrow & 11 \\
11 & \rightarrow & 00 \\
01 & \rightarrow & 00 \\
00 & \rightarrow & 11
\end{array}
$$

Notice that the second-most significant bit in each case changes and the most significant bit is a copy of the secondmost significant bit.

If all operations start and end in the center square, we can apply some simplifications to the doubling and translation implementations. First, because the two mostsignificant bits of $r_{s}$ and $r_{c}$ are always the same in the center square, we may omit the most significant bit. Second, if we omit the most significant bit, a doubling followed by a translation of a point $\left(r_{s}, r_{c}\right)$ in the center square simply becomes a left shift by one followed by an inversion of the most significant bit of both $r_{s}$ and $r_{c}$. Because of the extra inversion of the most significant bit, we refer to a doubling followed by a translation as a $2 \mathrm{X}^{*}$ operation.

Here is an example of a doubling followed by a translation operation $\left(2 \mathrm{X}^{*}\right)$. Consider a point $\left(r_{s}, r_{c}\right)$ with three non-fractional bits (000.u,110.v), where $u$ and $v$ are some bit-sequences. Doubling the point (000.u, 110.v) yields a point ( $00 ? . u^{\prime}, 10 ? . v^{\prime}$ ), where $u^{\prime}$ and $v^{\prime}$ are $u$ and $v$ left shifted by 1 position respectively, and ? represents a bit value of either 1 or 0 corresponding to the most-significant bit of $u$ or $v$. Translation of the point $\left(00 ? . u^{\prime}, 10 ? . v^{\prime}\right)$ yields a point (11?. $\left.u^{\prime}, 11 ? . v^{\prime}\right)$ in square Q2 of Fig. 2.


Fig. 5. The effects of carry-save additions and subtractions with $D$. The division algorithms can perform carry-save additions or subtractions in the shaded squares.

### 5.2 Carry-Save Addition

Let us look at what happens to a point in the center square when a subtraction with $D$ occurs. We assume that $r_{s}$ represents the sum bits and $r_{c}$ represents the carry bits after a carry-save addition. Fig. 5 shows the ( $r_{s}, r_{c}$ ) plane, where we have partitioned the center square in small squares. What happens when we subtract $D$ from a point $\left(r_{s}, r_{c}\right)$ in each of the small squares? Here is the calculation for a point in square S1 where we consider only the three most significant bits of each number. In two's complement representation $D=001 . x$, for some bit vector $x$, thus $-D$ is represented by the bit-wise complement of $D$ plus ulp, i.e., $-D=110 . y+u l p$, where $y$ is the bit-wise complement of $x$. For square S1 we get the following calculation. Note that we shift the majority bits one position to the left to correspond with 'carrying' this bit to the next more significant bit position,

$$
\begin{array}{rl}
\mathrm{r}_{\mathrm{s}} & 111 \\
\mathrm{r}_{\mathrm{c}} & 001 \\
-\mathrm{D} & 110 \cdot \mathrm{y}+\mathrm{ulp} \\
--------- \\
\text { sum } & 000 \\
\text { carry } & 11 ?
\end{array}
$$

For square S 2 we get the following calculation:

| $r_{s}$ | 000 |
| ---: | :--- |
| $r_{c}$ | 001 |
| -D | $110 . \mathrm{y}+\mathrm{ulp}$ |
| --------- |  |
| sum | 111 |
| carry | $00 ?$ |

Consequently, subtracting $D$ from a point in square S1 yields a point in squares S10 $\cup$ S14. Subtracting $D$ from a point in the square $S 2$ yields a point in squares $S 1 \cup S 5$. Because carry-save addition is symmetrical in $r_{s}$ and $r_{c}$, subtracting $D$ from a point in square S 11 also yields a point in squares S10US14.

TABLE 2
The Effect of Subtracting or Adding $D$

| square | -D | square | +D |
| :---: | :---: | :---: | :---: |
| S0 | T2UT3 | S0 | T0 ${ }^{\text {P1 }}$ |
| S1 | S10US14 | S4 | S1US5 |
| S2 | S1US5 | S5 | T0UT1 |
| S3 | T0UT1 | S8 | S10 S $^{\text {d }} 4$ |
| S5 | T2 UT3 | S9 | S11US15 |
| S6 | S0US4 | S10 | T0UT1 |
| S7 | S1US5 | S12 | T2UT3 |
| S10 | T2UT3 | S13 | S10 S14 $^{\text {d }}$ |
| S11 | S10US14 | S14 | S1US5 |
| S15 | T2UT3 | S15 | T0UT1 |

Similarly, addition of $D$ to a point in square S 4 yields a point in squares S1US5, and addition of $D$ to any point in square S8 or S13 yields a point in squares S10 US14.

Table 2 gives a summary of adding and subtracting $D$ from most small squares. Observe that a subtraction of $D$ from points in the squares $S 1, S 2, S 6, S 7$, and $S 11$ always ends in the squares Q1 or Q3 of Fig. 2. This means that any such point can subsequently undergo a doubling and a translation (i.e., a $2 \mathrm{X}^{*}$ operation) and land in the center square.

Square S3 is different. Subtraction of $D$ from points in square S3 yields a point in squares T0UT1. Translating a point in T0UT1 yields a point in S2US6, where a point must undergo another subtraction before a doubling. Instead, let us calculate what happens when we subtract $2 D$, instead of $D$, from any point in S3. First, recall that in a two's complement representation with 3 non-fractional bits $D=001 . b x$ for some bit $b$ and bit vector $x$. Thus $2 D=01 b \cdot x 0$, and $-2 D$ is represented by the bit-wise complement of $2 D$ plus ulp, that is, $-2 D=10 d . y+u l p$, where $d$ is the bit complement of $b$ and $y$ is the bit-wise complement of $x 0$,

| $r_{s}$ | 001 |
| :---: | :--- |
| $r_{\mathrm{c}}$ | 001 |
| -2 D | $110 . \mathrm{y}+\mathrm{ulp}$ |
| --- | ----- |
| sum | $10 ?$ |
| carry | $01 ?$ |

As a consequence, subtracting $2 D$ from any point in square S3 yields a point in square T4 of Fig. 5.

We can translate each point in T4 over $(2,-2)$ and the final result lands in square Q1 of Fig. 2. Subsequently, for each point in Q1 we can perform a doubling and translation (ie a $2 \mathrm{X}^{*}$ operation) and obtain a point in the center square again.

Similar remarks can be made for the additions of $D$ or $2 D$ to points in squares S4, S8, S9, S12, S13 and S14 of Fig. 5. Addition of $D$ to points in squares S4, S8, S9, S13, and S14 always yields a point inside squares Q1 or Q3 of Fig. 2. This means that any such point can undergo a doubling and a translation and again land in the center square.

Addition of $D$ to any point in square S12 yields a point in square T2 $\cup T 3$. Translating a point in square T2 $\cup T 3$ yields a point in S9 $\cup S 13$ where a point must undergo another addition before a doubling. Adding $2 D$, however, to any point in square S12 yields a point in square T5. Furthermore, T5 can be translated over $(-2,2)$ to obtain square Q3 in the
center square of Fig. 2. Each point in Q3 can be doubled and translated to obtain a point in the center square again.

In summary, for each small square in the center square we have found a sequence of operations that maintain invariants (9) and (6). Each sequence of operations ends with a doubling and a translation, which may be preceded by a subtraction or addition of $D$ or $2 D$. For example, for a point in small square S12, the operations are an addition of 2D, followed by a translation, then a doubling and a translation. Some squares even have two possible sequences of operations that maintain invariant (9) and (6) (keeps a point in the center square of Fig. 2). For example, for points in square S4 the sequence of operations may be a doubling followed by a translation $\left(2 \mathrm{X}^{*}\right)$ or an addition of $D$ followed by a doubling and a translation. Squares S1, S11, and S14 also have two possible sequences of operations.

If we confine ourselves to the points in the center square, each of these operations is easy to implement. Because we deal with the points only in the center square, we can omit the most-significant bit and consider the remaining two non-fractional bits. The operations are implemented as follows.

- Each addition is simply a carry-save addition in two's complement arithmetic.
- Each doubling is a left shift of both $r_{s}$ and $r_{c}$ by one position.
- Each translation is an inversion of the most-significant bit for $r_{s}$ and $r_{c}$.
Note that a translation followed by a doubling and then another translation is the same as a doubling followed by a translation, because each doubling throws away the most significant bit.


## 6 Putting It All Together

With the analysis of the previous section, we can put together various division algorithms. For each division algorithm we specify what sequence of operations must be performed on the points $\left(r_{s}, r_{c}\right)$ in each of the small squares, S0 to S15, in Fig. 5. There are five sequences of operations to choose from which are described as follows:

- $2 \mathrm{X}^{*}$ : A doubling followed by a translation. The selected quotient digit is 0 .
- SUB1\&2X*: A subtraction of $D$ followed by a doubling and then a translation. The selected quotient digit is +1 .
- SUB2\& $2 X^{*}$ : A subtraction of $2 D$ followed by a doubling and then a translation. The selected quotient digit is +2 .
- ADD1\&2 $X^{*}$ : An addition of $D$ followed by a doubling and then a translation. The selected quotient digit is -1 .
- ADD2\&2 $X^{*}$ : An addition of $2 D$ followed by a doubling and then a translation. The selected quotient digit is -2 .
Recall that a translation (over $(2,-2)$ or $(-2,2)$ ) is implemented by an inversion of the most-significant bit.

Fig. 6 illustrates two possible choices for a division algorithm. Other algorithms can be derived by making different choices for the squares S1, S4, S11, and S14. Algorithms A1


Fig. 6. Algorithms A1 and A2: Figures (a) and (c) show the quotient selection function for algorithms A1 and A2 in the ( $r_{s}, r_{c}$ ) plane respectively. Figures (a) and (b) show P-D plots for algorithms A1 and A2 respectively.
and A2, however, are the most symmetric choices. The selection of a quotient digit relies on only the two most-significant bits of $r_{s}$ and $r_{c}$. Algorithm A2 has a simpler selection logic than Algorithm A1, which can lead to a faster divider implementation.

Thus far, we have shown that Algorithms A1 and A2 maintain the range invariant (6) for the partial remainder. In the next section, we show that Algorithm A1 also maintains the range invariant (5) for the partial remainder. As discussed in Section 4, because algorithm A1 maintains the range invariant (6), algorithm A1 must execute at least $L+3$ iterations. Because algorithm A2 maintains the range invariant (5), algorithm A2 must execute at least $L+2$ iterations.

## 7 A Different Range Invariant

In this section, we prove that Algorithm A1 maintains the range invariant (5):

This means that Algorithm A1 needs one fewer iteration than Algorithm A2 to satisfy the accuracy requirement for the computed quotient.

First, we observe that the range invariant (5) holds after initialization $r_{s}=R ; r_{c}=0$ for $R, D \in[1,2)$.

Second, we observe that each of the operations ADD2\&2 $X^{*}$, ADD $1 \& 2 X^{*}$, SUB $1 \& 2 X^{*}$, and SUB2\& $2 X^{*}$ maintains range invariant (5). Here is a proof for the sequence of operations SUB1\&2X* and SUB2\&2X*. Assume that the invariant (5) holds before each of those sequences of operations. Hence, in the regions of Fig. 6a where Algorithm A1 executes the sequence of operations SUB $1 \& 2 X^{*}$ we have

$$
r_{s}+r_{c} \in[0,2 D)
$$

After subtracting $D$ from $r_{s}+r_{c}$ and doubling $r_{s}$ and $r_{c}$, we have

$$
r_{s}+r_{c} \in[-2 D, 2 D),
$$

which is our range invariant (5).


Fig. 7. The area for the partial remainders $\left(r_{s}, r_{c}\right)$, where the value of the partial remainder $r$ is given by $r=r_{c}-r_{s}$. All points $\left(r_{s}, r_{c}\right)$ on a diagonal line, like $r_{c}-r_{s}=-3$, represent the same remainder value. The bottomleft square consisting of Q0, Q1, Q2, and Q3 squares satisfies the range invariant (7).

In the regions of Fig. 6a where Algorithm A1 executes the operations SUB2\&2X*, we have before the operations

$$
r_{s}+r_{c} \in[2,2 D)
$$

After subtracting $2 D$ from $r_{s}+r_{c}$ and doubling $r_{s}$ and $r_{c}$, we have

$$
r_{s}+r_{c} \in[2(2-2 D), 0) .
$$

Furthermore, we have $2(2-2 D)=(4-2 D)-2 D>-2 D$, because $4-2 D>0$ for $D \in[1,2)$. Consequently, after the operations SUB2\&2X*, we have $r_{s}+r_{c} \in[-2 D, 2 D)$.

In a similar way, we can prove that the operations ADD $1 \& 2 X^{*}$ and ADD2 $\& 2 X^{*}$ maintain invariant (5). Finally we prove that also the operations $2 X^{*}$ maintain invariant (5). Observe that if Algorithm A1 executes the operation $2 X^{*}$ then $r_{s}, r_{c}$ are in one of the diagonal squares, which means that $r_{s}+r_{c} \in[-1,1)$. Because $D \in[1,2)$, we have $r_{s}+r_{c} \in$ $[-D, D)$. Consequently, after the doubling operation we have $r_{s}+r_{c} \in[-2 D, 2 D)$, our range invariant (5).

## 8 BSD Representations and Carry-Free Additions

Instead of using a two's complement representation for the partial remainder, we can use a binary signed digit (BSD) representation. In a BSD representation, the partial remainder $r$ is represented by a vector of signed bits from the set $\{-1,0,1\}$. Instead of using a single vector of signed bits, we use two vectors of unsigned bits $r_{s}$ and $r_{c}$, such that $r=r_{c}-r_{s}$. Please see [6] for more details on BSD representation and carry-free addition.

For the BSD representation of the remainder, we use the range invariant (7) The partial remainder $r=r_{c}-r_{s}$ is in the range $(-4,4)$. This range is similar to the range $[-4,4)$ for the remainder $r$ when using a two's complement


Fig. 8. Translation of square Q 5 over $(-4,-4)$.
representation. For the BSD representation, however, both bounds are excluded.

The bottom-left bold square, Q0טQ1 $\cup$ Q2 $\cup Q 3$, in Fig. 7 consists of all numbers $\left(r_{s}, r_{c}\right)$ satisfying range invariant (7).

We call vector $r_{c}$ the carry and vector $r_{s}$ the sum, where the sum vector has a negative weight.

To see what happens when we perform doublings and additions on points satisfying range invariant (7), we first look at a larger square in Fig. 7, where each binary vector is represented with one extra digit at the most-significant position. We are interested in a series of operations that takes a point in the bottom-left bold square and returns a point in the bottom-left bold square. As with the two's complement representations, the operations must end with a doubling and may be preceded by an addition or subtraction. Each sequence of operations must maintain invariant (9).

### 8.1 Doublings and Translations

The effects of a doubling is straightforward. Doubling a point in square Q0 of Fig. 7 returns a point in the bottomleft square, $Q 0 \cup Q 1 \cup Q 2 \cup Q 3$. Doubling of a point in square Q2 returns a point in the top-right square Q5. Each point in Q5 can be translated back to the bottom-left square by a translation over $(-4,-4)$, as illustrated in Fig. 8. Any translation over $(-t,-t)$ leaves the value of $r=r_{c}-r_{s}$ invariant and thus each translation maintains invariant (9)). Furthermore, a translation over $(-4,-4)$ for any point in Q5 is easy to implement by inverting the most-significant bit (bit position with weight $2^{3}$ ).

### 8.2 Additions

Squares Q1 and Q3 in Fig. 7 require more operations than just doublings and translations. We analyze the effect of additions to the points in each of the small squares, S 0 to S15, in Fig. 9.

We implement the addition $r+z$ of a remainder $r=r_{c}-r_{s}$ and a choice $z \in\{-2 D,-D, D, 2 D\}$ by means of a carry-free addition [6]. As shown in [6], we have


Fig. 9. The effect of carry-free additions and subtractions with $D$ or $2 D$. The division algorithms can perform carry-free additions or subtraction in the shaded squares.

$$
r_{c}-r_{s}+z=B S D c a r r y-B S D s u m,
$$

where BSDcarry and BSDsum are the result of a carry-save addition of $\neg r_{s}, r_{c}$, and $z$, with an inversion of the parity result:

$$
\begin{aligned}
B S D c a r r y & =2 * \text { majority }\left(\neg r_{s}, r_{c}, z\right) \\
B S D s u m & =\neg \operatorname{parity}\left(\neg r_{s}, r_{c}, z\right) .
\end{aligned}
$$

The majority and parity functions, forming the carry-free addition, can be done modulo $2^{m}$ if there are $m$ non-fractional bits. In our case, $m=3$. This implementation of carryfree addition applies only if we take a two's complement representation for $z$ [6]. Thus $-D$ and $-2 D$ can be represented by $-D=110 . y+u l p$ and $-2 D=10 b . y+u l p$ respectively, for some bit vector $y$ and bit $b$.

For example, for subtracting $D$ from a point in square S1 we get the following carry-free addition:

| $r_{s}$ | 001 |
| ---: | :--- |
| $r_{c}$ | 011 |
| -D | $110 \cdot \mathrm{y}+\mathrm{ulp}$ |
| --------- |  |
| sum | 100 |
| carry | $10 ?$ |

These values for BSDsum and BSDcarry correspond to a point $\left(r_{s}, r_{c}\right)$ in squares T 7 or T10. For adding $D$ to a point in square S13 we get the following carry-free addition:

$$
\begin{array}{rc}
r_{s} & 001 \\
r_{c} & 000 \\
+\mathrm{D} & 001 . \mathrm{y} \\
--- & --- \\
\text { BSDsum } & 000 \\
\text { BSDcarry } & 00 ?
\end{array}
$$

These values for BSDsum and BSDcarry correspond to a point in squares S8 or S12.

TABLE 3 The Effect of Adding or Subtracting $D$

| square | -D | square | +D |
| :---: | :---: | :---: | :---: |
| S0 | T0○T4 | S3 | S10US14 |
| S1 | T7UT10 | S6 | S10US14 |
| S2 | T3 $\cup$ T6 | S7 | S8US12 |
| S3 | T9UT12 | S9 | S10US14 |
| S4 | T7UT10 | S10 | S2US6 |
| S5 | T8UT10 | S11 | S3US7 |
| S6 | T9UT12 | S12 | S1US5 |
| S8 | T3 T6 $^{\text {a }}$ | S13 | S8US12 |
| S9 | T9UT12 | S14 | S3US7 |
| S12 | T9 T12 $^{\text {a }}$ | S15 | S10US14 |

Table 3 shows the results of carry-free addition in squares S0 to S15 in Fig. 9. All subtractions of $D$ from squares S1, S2, S4, S5, and S8 yield points in small squares T3, T6, T7, T8, T10, or T11 which can be translated, doubled, and translated again to return a point in one of the S 0 to S 15 squares.

Subtracting $D$ from any point in square S 0 yields a point in square T 0 or T 4 . Translating a point in the squares T 0 or T 4 yields a point in squares S1 or S5 where the point must undergo another subtraction. Therefore, instead of subtracting $D$, we subtract $2 D$ from a point in S 0 using carry-free addition

$$
\begin{array}{cc}
\text { rs } & 000 \\
\text { rc } & 011 \\
-2 \mathrm{D} & 10 ? . \mathrm{y}+\mathrm{ulp} \\
-------- \\
\text { BSDsum } & 11 ? \\
\text { BSDcarry } & 11 ?
\end{array}
$$

The resulting values for BSDsum and BSDcarry correspond a point in squares $\mathrm{T} 1, \mathrm{~T} 3, \mathrm{~T} 5$, or T 6 . Points in these squares can be translated, doubled and translated again to return a point in one of the $S$ squares.

All additions of $D$ to points in squares $\mathrm{S} 7, \mathrm{~S} 10, \mathrm{~S} 11, \mathrm{~S} 13$, and S14 yield points in small squares S2, S3, S6, S7, S8, or S12. Points in these squares can all be doubled and translated to return a point in one of the S 0 to S 15 squares.

Adding $D$ to a point in square S 15 yields a point in squares S10 or S14 where the point must undergo another addition rather than a doubling. Adding $2 D$, however, to any point in square S 15 returns a point in squares $\mathrm{S} 8, \mathrm{~S} 9$, S12, or S13, which can be doubled and translated to remain in one of the S0 to S15 squares.

With the analysis of the doublings, translations, and additions, we can put together a number of division algorithms based on the BSD representation for the partial remainder and carry-free additions.

Before giving the five possible choices for sequences of operations, we consider a few simplifications. Because each set of operations in our division algorithm takes a point in one of the S0 to S15 squares and returns a point in one of the S0 to S15 squares, we omit the most-significant bit, which is always 0 , and use only two non-fractional bits. The omission of the most-significant bit simplifies the implementation of the translation to an operation that is implemented automatically.

The five choices for the sequences of operations are: $2 \mathrm{X}, \mathrm{SUB} 1 \& 2 \mathrm{X}, \mathrm{SUB} 2 \& 2 \mathrm{X}, \mathrm{ADD} 1 \& 2 \mathrm{X}$ and ADD2\&2X. The


Fig. 10. Two division algorithms, B 1 and B 2 , based on BSD representations and carry-free additions. Figures (a) and (c) illustrate the quotient selection function of algorithms B 1 and B 2 in $\left(r_{s}, r_{c}\right)$ plane. Figures $(\mathrm{b})$ and (d) show the quotient selection function of algorithm B1 and B2 using P-D plots respectively. Algorithm B1 is the same algorithm as presented in [3].
quotient digit selected and the statements executed for each of these operation is listed in Table 1. Furthermore, each of these operations maintain invariant (9) and range invariant (7).

We can compose a division algorithm by choosing one sequence of operations for each small square S0 through S15. For the squares $\mathrm{S} 2, \mathrm{~S} 7, \mathrm{~S} 8$, and S 13 there are two choices for selecting a quotient digit. For the squares S 2 and S 8 the two choices are as follows: selecting a quotient digit 0 and performing a 2 X operation or selecting a 1 and performing a SUB1\&2X operation on the partial remainder. For the squares S7 and S13 the two choices are as follows: selecting a quotient digit 0 and performing a $2 X$ operation or selecting a -1 and performing a ADD1\&2X operation on the partial remainder. For all other squares there is only one choice for selecting a quotient digit. The two most symmetric algorithms appear in Fig. 10. We characterize each algorithm by the choice of sequence of operations that is performed for each point in a small square. The operations are similar to the operations for a two's complement representation and carry-save additions.

Algorithms B1 and B2 in Fig. 10 satisfy the range invariant (7). Algorithm B1, also satisfies the range invariant $r=r_{c}-r_{s} \in(-2 D, 2 D)$. The proof that Algorithm B1 satisfies the range invariant $r \in(-2 D, 2 D)$ is essentially the same as in Section 7. Therefore, algorithms B1 and B2 require at least $L+2$ and $L+3$ iterations, respectively, to terminate with an error $\epsilon \in(-u l p / 2, u l p / 2)$.

Algorithm B1 in Fig. 10a is the same algorithm as presented in [3]. Note that in [3], the authors use the recurrence relation in (2) and hence the authors use the range invariant $r \in(-D, D)$ for the partial remainder. Because we have considered the recurrence relation in (3) throughout this paper, the range invariant for the partial remainder in [3] translates to $r \in(-2 D, 2 D)$.

## 9 Implementation Results

In this section we compare the radix-2 division algorithms presented in this paper for latency per iteration, power and area. While the implementation of an algorithm can be

TABLE 4
Comparison of the Five Radix-2 Algorithms Discussed in This Paper

| Algorithm | Partial remainder <br> representation | Partial <br> remainder <br> range | Type of <br> Adder | Quotient <br> digit set | Latency <br> per <br> iteration in $p s$ | Avg. power <br> per iteration <br> in $\mathbf{m W}$ | Area in $\mu m^{2}$ |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| SRT | Two's Complement | $r \in[-2 D, 2 D)$ | Carry-save | $\{-1,0,1\}$ | 250 | 7.60 | 7,900 |
| A1 | Two's Complement | $r \in[-2 D, 2 D)$ | Carry-save | $\{-2,-1,0,1,2\}$ | 240 | 10.08 | 11,376 |
| A2 | Two's Complement | $r \in[-4,4)$ | Carry-save | $\{-2,-1,0,1,2\}$ | 225 | 10.35 | 11,676 |
| B1 | Signed-Digit | $r \in(-2 D, 2 D)$ | Carry-free | $\{-2,-1,0,1,2\}$ | 240 | 10.08 | 11376 |
| B2 | Signed-Digit | $r \in(-4,4)$ | Carry-free | $\{-2,-1,0,1,2\}$ | 225 | 10.35 | 11,676 |

further optimized for low-latency or low-power [11], [12] and [13], we focus on comparing the algorithms in the same design environment. For comparison, we synthesized the behavioral verilog code for all the algorithms using Synopsys Design Compiler and a production quality TSMC 40 nm standard cell library. To estimate the latency per iteration and area, we used Synopsys Design Compiler's static timing analysis engine and for power estimates we used an internal proprietary tool. Note that the latency per iteration determines the clock period for the divider. We assumed 53-bit division for all the algorithms. Table 4 shows the comparison. From Table 4, Algorithms A1 and B1 offer a very small improvement of 4 percent in latency per iteration compared to a standard radix-2 SRT algorithm at the cost of 32 percent more power and 44 percent more area. Algorithms A2 and B2 offer an improvement of 10 percent in latency per iteration compared to the SRT algorithm at the cost of 36 percent more power and almost 50 percent more area. Algorithms A1, A2, B1 and B2 consume more power and area than the standard radix-2 SRT algorithm because of the following reasons: First, algorithms A1, A2, B1 and B2 must perform one of five alternatives every iteration. In comparison, the SRT algorithm must perform one of three alternatives every iteration. More alternatives directly translates to additional hardware required to update the partial remainder and quotient, which results in more power and area consumption. Second, on-the-fly conversion of a quotient digit from the set $\{-2,-1,0,1,2\}$ to $\{0,1\}$ is more complex than on-the-fly conversion of a quotient digit from the set $\{-1,0,1\}$ to $\{0,1\}$.

## 10 Differences

The division algorithms for the two's complement representation and the division algorithms for the BSD representations look very similar, but there are some important differences between the two. The first difference is that the range invariants for the algorithms are different. For the two's complement implementation, one range invariant for the remainder is $r \in[-2 D, 2 D)$. The other range invariant is $r_{s} \in[-2,2)$ and $r_{c} \in[-2,2)$, implying $r=r_{s}+r_{c} \in[-4,4)$. For the BSD implementation, one range invariant for the remainder is $r \in(-2 D, 2 D)$. The other range invariant is $r_{s} \in[0,4)$ and $r_{c} \in[0,4)$, implying $r=r_{c}-r_{s} \in(-4,4)$. Note the exclusion of both bounds in the last interval. For rounding purposes and for determining whether the computed quotient is exact, a range invariant for $r$ that excludes both bounds is much preferable. A range that excludes the bounds can potentially save an extra clock cycle [14].

## 11 Concluding Remarks

In this paper we presented the derivation of four division algorithms, three of which are new. All four algorithms choose a quotient digit from the set $\{-2,-1,0,1,2\}$ and the selection of a quotient digit relies on only the two mostsignificant bits of the partial-remainder in a redundant representation. We also presented an alternative analysis method that looks at the effects of doubling and carry-save or carry-free additions in the $\left(r_{s}, r_{c}\right)$ plane and uses invariants to prove the correctness of the division algorithms. Our method, along with the P-D diagrams, can be applied to derive higher-radix division algorithms with efficient quo-tient-selection functions. Because the quotient selection function for higher-radix division algorithms depend on the value of the divisor [6], we expect that there will be several $\left(r_{s}, r_{c}\right)$ planes, each describing the quotient selection function for a particular range of the divisor.

From Table 4, the algorithms derived in this paper offer an improvement of at most 10 percent in latency per iteration which could help achieve a faster timing-closure for speed-oriented designs. The 10 percent improvement in speed comes at the cost of 50 percent more area and 36 percent more power compared to the SRT algorithm. Because a division instruction is a less frequently executed instruction, we believe that the speed of a divider has more impact on the overall system-performance than a divider's power or area consumption on the overall system's power or area. Therefore, the 10 percent improvement in speed may be worth paying for extra power and area.

Finally, the conversion of a quotient digit from a redundant digit set $\{-2,-1,0,1,2$,$\} to the non-redundant digit set$ $\{0,1\}$ can be done by means of various on-the-fly conversions [5], [6] and [3].

## AckNOWLEDGMENTS

We thank Dr. Ivan Sutherland and Dr. Robert Daasch for their valuable comments on the previous version of this paper. We also thank our reviewers for their suggestions.

## References

[1] S. Oberman and M. Flynn, "Design issues in division and other floating-point operations," IEEE Trans. Comput., vol. 46, no. 2, pp. 154-161, Feb. 1997.
[2] P. Montuschi and L. Ciminiera, "Over-redundant digit sets and the design of digit-by-digit division units," IEEE Trans. Comput., vol. 43, no. 3, pp. 269-277, Mar. 1994.
[3] H. Srinivas, K. Parhi, and L. Montalvo, "Radix 2 division with over-redundant quotient selection," IEEE Trans. Comput., vol. 46, no. 1, pp. 85-92, Jan. 1997.
[4] N. Burgess, "A fast division algorithm for VLSI," in Proc. IEEE Int. Conf. Comput. Des.: VLSI Comput. Processors, Oct. 1991, pp. 560-563.
[5] B. Parhami, Computer Arithmetic: Algorithms and Hardware Designs, London, U.K.: Oxford Univ. Press, 2000.
[6] M. Ercegovac and T. Lang, Digital Arithmetic, San Mateo, CA, USA: Morgan Kaufmann, 2003.
[7] P. Kornerup, "Digit selection for SRT division and square root," IEEE Trans. Comput., vol. 54, no. 3, pp. 294-303, Mar. 2005.
[8] J. Ebergen, I. Sutherland, and A. Chakraborty, "New division algorithms by digit recurrence," in Proc. 38th Conf. Asilomar Signals, Syst. Comput., vol. 2, Nov. 2004, pp. 1849-1855.
[9] N. Jamadagni and J. Ebergen, "An asynchronous divider implementation," in Proc. 18th IEEE Int. Symp. Asynchronous Circuits Syst, May 2012, pp. 97-104.
[10] P. Kornerup, "Digit-set conversions: Generalizations and applications," IEEE Trans. Comput., vol. 43, no. 5, pp. 622-629, May 1994.
[11] D. Harris, S. Oberman, and M. Horowitz, "SRT Division architectures and implementations," in Proc. 13th IEEE Symp. Comput. Arithmetic, Jul. 1997, p. 18.
[12] T. N. Pham and E. E. J. Swartzlander, "Design of radix-4 SRT dividers in 65 nanometer CMOS technology," in Proc. Int. Conf. Appl.-Specific Syst., Arch. Processors, Sep. 2006, pp. 105-108.
[13] W. Liu and A. Nannarelli, "Power efficient division and square root unit," IEEE Trans. Comput., vol. 61, no. 8, pp. 1059-1070, Aug. 2012.
[14] J. Prabhu and G. Zyner, " 167 MHz radix- 8 divide and square root using overlapped radix-2 stages," in Proc. 12th Symp. Comput. Arithmetic, Jul. 1995, pp. 155-162.
[15] M. Kuhlmann and K. K. Parhi, "A novel low-power shared division and square-root architecture using the GST algorithm," VLSI Des., vol. 12, no. 3, pp. 365-376, 2001.
[16] A. Avizienis, "Signed-digit number representations for fast parallel arithmetic," IRE Trans. Electron. Comput., vol. EC-10, no. 3, pp. 389-400, Sep. 1961.
[17] C. Tung, "A division algorithm for signed-digit arithmetic," IEEE Trans. Comput., vol. C-17, no. 9, pp. 887-889, Sep. 1968.
[18] K. D. Tocher, "Techniques of multiplication and division for automatic binary computers," Quarterly J. Mechanics Appl. Math., vol. 11, no. 3, pp. 364-384, 1958.
[19] M. Ercegovac and T. Lang, "On-the-fly conversion of redundant into conventional representations," IEEE Trans. Comput., vol. C-36, no. 7, pp. 895-897, Jul. 1987.


Jo Ebergen received the PhD degree in mathematics and computing science from the Eindhoven University of Technology. He is a senior consulting hardware engineer in the VLSI Research Group at Oracle Labs. His research interests include asynchronous circuit design, VLSI design, performance analysis of digital systems, and formal verification. He is a member of the IEEE and a distinguished engineer of the ACM.


Navaneeth Jamadagni received the BE degree from Visvesvaraya Technological University, Belgaum, India in 2004, and the MS degree from Portland State University, Portland, Oregon in 2010. He is a working towards the PhD degree at Portland State University. He is an intern at Oracle Labs, Redwood Shores, CA, and a graduate research assistant at the Asynchronous Research Center at Portland State University. His research interests include high-speed digital circuit design, asynchronous circuit design, and computer arithmetic. He is a student member of the IEEE Computer Society.
$\triangleright$ For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/publications/dlib.

