
Computer Arithmetic

The most common forms of computer arithmetic are integer arithmetic and floating point arithmetic. These two systems are briefly discussed below.

Integer Arithmetic:

The result of any integer arithmetic operation is always an integer, and the range of integers that can be represented on a given computer is finite. The result of an integer division is given as the quotient alone: the remainder, being a fractional quantity that cannot be represented under the integer representation, is truncated.
E.g.:
$ 23 \div 11 = 2$

$ 10 \div 13 = 0 $
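The truncation of integer division is easy to check directly; a minimal sketch in Python (note that Python's `//` operator floors the quotient, which agrees with truncation for positive operands):

```python
# Integer division discards the fractional part of the quotient.
# Python's // floors; for positive operands this matches truncation.
print(23 // 11)   # 2
print(10 // 13)   # 0
```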

Remark:
(1) Simple rules like         $ \displaystyle{ \frac{\alpha+\beta}{\gamma}=\frac{\alpha}{\gamma}+\frac{\beta}{\gamma}}$ , where $ \alpha, \beta, \gamma $ are integers, may not hold under computer integer arithmetic due to the truncation of the remainder.

E.g.: $ \alpha = 6 \ , \qquad \beta=9 \ , \qquad \gamma=5 $

$\displaystyle \frac{\alpha+\beta}{\gamma} = \frac{6+9}{5} = \frac{15}{5} = 3$

but

$\displaystyle \frac{\alpha}{\gamma} + \frac{\beta}{\gamma} = \frac{6}{5} + \frac{9}{5} = 1+1 = 2$
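The failure of this rule is easy to reproduce; a short Python sketch with the same values:

```python
alpha, beta, gamma = 6, 9, 5

# (alpha + beta) / gamma, computed in integer arithmetic
combined = (alpha + beta) // gamma          # 15 // 5 = 3

# alpha/gamma + beta/gamma, each quotient truncated separately
separate = alpha // gamma + beta // gamma   # 1 + 1 = 2

print(combined, separate)   # 3 2
```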

(2) An integer operation may produce a result beyond the range that the computer can handle. When the result is larger than the maximum limit, it is referred to as overflow, and when it is smaller than the minimum limit, it is referred to as underflow.
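Python's own integers are arbitrary-precision and never overflow, so the sketch below simulates a 32-bit two's-complement register to illustrate the wrap-around that overflow produces on fixed-width hardware (the helper name is my own):

```python
INT32_MAX = 2**31 - 1      # largest representable 32-bit signed integer
INT32_MIN = -2**31         # smallest representable 32-bit signed integer

def wrap_int32(n):
    """Return the value a 32-bit two's-complement register would hold for n."""
    n &= 0xFFFFFFFF                          # keep only the low 32 bits
    return n - 2**32 if n > INT32_MAX else n

# Adding 1 to the maximum value overflows and wraps around to the minimum.
print(wrap_int32(INT32_MAX + 1))   # -2147483648
```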

Floating Point Arithmetic:
In floating point arithmetic, all numbers are stored and processed in normalized exponential form. The process of addition under floating point arithmetic is discussed first.

Addition under Floating Point Arithmetic:

Let $ x $ and $ y $ be the two numbers to be added and $ z $ be the result. The normalized floating point representations of $ x,y,$ and $ z $ are $ M_{x}\times10^{E_{x}}$ , $ M_{y}\times10^{E_{y}}$ , $ M_{z}\times10^{E_{z}}$ respectively. The rules for carrying out the addition are as follows:

(a) Set $ E_{z}= $ maximum $ ( E_{x},E_{y})$. Say $ E_{x}> E_{y} $ ; then $ E_{z}= E_{x}$.

(b) Right shift $ M_{y}$ by $ E_{x}- E_{y}$ places, so that the exponents of $ M_{x}$ and $ M_{y}$ are the same, and call the shifted mantissa $ \tilde{M}_{y}$.

(c) Set $ \tilde{M}_{z}= M_{x}+\tilde{M}_{y}$.

(d) Normalize $ \tilde{M}_{z}\times10^{E_{z}}$ and let $ M_{z}\times10^{E_{z}}$ be its normalized representation.

(e) Set $ z= M_{z}\times10^{E_{z}}$
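Steps (a)-(e) can be sketched in Python, representing each number as a (mantissa, exponent) pair with $ 0.1 \le \vert M\vert < 1 $; the function name and the pair representation are illustrative, and Python's binary doubles stand in for the decimal mantissas:

```python
def fp_add(x, y):
    """Add two normalized decimal floating point numbers.

    Each argument is a (mantissa, exponent) pair with 0.1 <= |mantissa| < 1,
    representing mantissa * 10**exponent.
    """
    (mx, ex), (my, ey) = x, y
    # (a) take the larger exponent; swap so that x carries it
    if ex < ey:
        (mx, ex), (my, ey) = (my, ey), (mx, ex)
    ez = ex
    # (b) right shift the mantissa of the smaller number by ex - ey places
    my_shifted = my / 10 ** (ex - ey)
    # (c) add the aligned mantissas
    mz = mx + my_shifted
    # (d) normalize so that 0.1 <= |mz| < 1, adjusting the exponent
    while abs(mz) >= 1:
        mz /= 10
        ez += 1
    while mz != 0 and abs(mz) < 0.1:
        mz *= 10
        ez -= 1
    # (e) the result is mz * 10**ez
    return mz, ez
```

For instance, `fp_add((0.692745, 5), (0.853516, 2))` returns a mantissa close to $ 0.693598516 $ with exponent 5, up to binary round-off.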

E.g.: Add the numbers $ 0.692745E5 $ and $ 0.853516E2 $.

$ M_{x}= 0.692745 , \qquad \qquad \qquad E_{x}= 5 $

$ M_{y}=0.853516 , \qquad \qquad \qquad E_{y}= 2 $

a) $ E_{z}= $ maximum $ (5,2) = 5 $.

b) On right shifting $ M_{y}$ by 3 places we get $ \tilde{M}_{y}= 0.000853516 $.

c) $ \tilde{M}_{z}= M_{x}+\tilde{M}_{y}$

                 $ = 0.692745+0.000853516 $

                 $ = 0.693598516 $

d) $ \tilde{M}_{z}= 0.693598516 $ is already in normalized form, so $ M_{z}=\tilde{M}_{z}$,

i.e , $ E_{z}=E_{x}.$

e) $ z=M_{z}\times10^{E_{z}}= 0.693598516\times10^{5}.$

Remark:     Subtraction is nothing but the addition of numbers with opposite signs.

Multiplication Under Floating Point Arithmetic:

If $ x=M_{x}\times10^{E_{x}}$ and $ y=M_{y}\times10^{E_{y}}$ are two real numbers in normalized form, then their product is

$\displaystyle z = x\times y = (M_{x}\times M_{y})\times10^{E_{x}+E_{y}} = \tilde{M}_{z}\times10 ^{\tilde{E}_{z}} = M_{z}\times10^{E_{z}}\ {\rm (after \ normalization)}$

E.g.: Say $ x= 0.8102E5 $ and $ y=0.2E-2$ ; then

$\displaystyle z = x\times y = (0.8102\times0.2)\times10^{5+(-2)} = 0.16204\times10^{3}.$

Since $ z=\tilde{M_{z}}\times10^{\tilde{E}_{z}}=0.16204\times10^{3}$ is already in normalized form, $ M_{z}= 0.16204 $ and $ E_{z}=3$.
$ \therefore z=x\times y = 0.16204\times 10^{3}$.
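A companion sketch for multiplication in the same illustrative (mantissa, exponent) representation, again with Python doubles standing in for decimal mantissas:

```python
def fp_mul(x, y):
    """Multiply two normalized floating point numbers given as
    (mantissa, exponent) pairs with 0.1 <= |mantissa| < 1."""
    (mx, ex), (my, ey) = x, y
    mz = mx * my        # multiply the mantissas
    ez = ex + ey        # add the exponents
    # A product of normalized mantissas lies in [0.01, 1), so at most
    # one left shift is needed to renormalize.
    if mz != 0 and abs(mz) < 0.1:
        mz *= 10
        ez -= 1
    return mz, ez

# The example from the text: x = 0.8102E5, y = 0.2E-2
m, e = fp_mul((0.8102, 5), (0.2, -2))
print(m, e)   # mantissa close to 0.16204, exponent 3
```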
Remark:

(1) In every floating point operation the raw result $ \tilde{M_{z}}\times10^{\tilde{E_{z}}}$ is normalized to $ M_{z}\times10^{E_{z}}$ . During this process the mantissa $ M $ may be truncated due to the limitation on the number of bits available for its representation on a computer.

(2) Floating point arithmetic is prone to the following errors:

a) Errors due to inexact representation of a decimal number in binary form. For example, $ (0.1)_{10}= (0.0\overline{0011})_{2} = (0.000110011001\ldots)_{2}$. Since the binary equivalent of $ (0.1)_{10}$ is a repeating fraction, it has to be truncated at some point.

b) Errors due to round-off.

c) Subtractive cancellation: it is possible that some mantissa positions are unspecified. These unspecified positions may be arbitrarily filled by the computer. This may lead to a serious loss of significance when two nearly equal numbers are subtracted.


For example, if $ x=0.500002 $ and $ y=0.500000$ then $ x-y= 0.000002= 0.200000\times10^{-5}$ has only one significant digit. However, the mantissa has provision to store more significant digits, which may get arbitrarily filled as they are unspecified. Further, if the operands are themselves approximate representations due to this non-specification problem, the overall loss of significance becomes more serious.
d) Basic laws of arithmetic, such as the associative and distributive laws, may not be satisfied, i.e.

$ x +( y+z)\neq (x+y)+ z $

$ x \times ( y \times z) \neq ( x\times y)\times z $

$ x \times ( y + z) \neq ( x\times y ) + ( x\times z) $
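These effects are easy to observe in standard double precision; a short Python sketch:

```python
# (a) 0.1 has no exact binary representation, so decimal arithmetic
#     on doubles is inexact:
print(0.1 + 0.2)                 # 0.30000000000000004, not 0.3

# (c) subtractive cancellation: the difference of nearly equal numbers
#     retains very few significant digits
x, y = 0.500002, 0.500000
print(x - y)                     # close to 2e-06, but not exactly

# (d) floating point addition is not associative:
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))   # False
```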
(3) Numerical computation involves a series of computations consisting of basic arithmetic operations. There may be a round-off or truncation error at every step of the computation, and these errors accumulate as the number of operations in the process grows. There can be situations where even a single operation magnifies the round-off error to a level that completely ruins the result.

A computation process in which the cumulative effect of all input errors is grossly magnified is said to be numerically unstable. It is important to understand the conditions under which a process is likely to be sensitive to input errors and become unstable. Investigations into how small changes in the input parameters influence the output are termed sensitivity analysis.

(4) The effect of round-off and truncation errors on the final numerical result may be reduced by:

a) Increasing the number of significant figures of the computer, either through hardware or through software manipulations. For instance, one may use double precision for floating point arithmetic operations.

b) Minimizing the number of arithmetic operations. Here one may rearrange a formula to reduce the number of arithmetic operations. For example, the polynomial $ a_{n}x^{n}+a_{n-1}x^{n-1}+\cdots+ a_{0}$ may be rearranged as
$ (\cdots(( a_{n}x+a_{n-1})x+a_{n-2})x\cdots+a_{0})$,
which requires fewer arithmetic operations.

c) Replacing a formula like $ \displaystyle{\frac{x^{2}- y^{2}}{x-y}}$ by $ x+y$ to avoid subtractive cancellation.

d) While finding the sum of a set of numbers, arranging the numbers in ascending order of absolute value, i.e. when $ \vert a\vert> \vert b\vert> \vert c\vert $, $ (c-b)+a $ is better than $ (a-b)+c$ .
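Remedies (b) and (d) can be sketched in Python; `horner` is an illustrative name for the nested evaluation shown above:

```python
def horner(coeffs, x):
    """Evaluate a_n*x**n + ... + a_0 given coeffs = [a_n, ..., a_0].

    Uses n multiplications and n additions instead of the roughly
    n**2/2 multiplications of naive term-by-term evaluation.
    """
    result = 0.0
    for a in coeffs:
        result = result * x + a
    return result

# (b) fewer operations: 2x^2 + 3x + 4 at x = 2
print(horner([2, 3, 4], 2.0))    # 18.0

# (d) summing in ascending order of absolute value preserves small terms:
tiny = [1e-16] * 10
ascending = sum(tiny) + 1.0      # accumulate the small terms first
descending = 1.0
for t in tiny:                   # each tiny term is lost against 1.0
    descending += t
print(ascending > descending)    # True
```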

(5) It may not be possible to simultaneously reduce both the truncation and round-off error effects on the final result of a numerical computation. For instance, in an iterative procedure, when one tries to reduce the round-off error by increasing the step size, this may lead to a higher truncation error, and vice versa. Hence proper care has to be taken to balance the two errors.

