## Mercurial > dropbear

### comparison tommath.src @ 190:d8254fc979e9 libtommath-orig LTM_0.35


Initial import of libtommath 0.35

author | Matt Johnston <matt@ucc.asn.au> |
---|---|
date | Fri, 06 May 2005 08:59:30 +0000 |
parents | d29b64170cf0 |
children | |


142:d29b64170cf0 | 190:d8254fc979e9 |
---|---|

47 \def\gap{\vspace{0.5ex}} | 47 \def\gap{\vspace{0.5ex}} |

48 \makeindex | 48 \makeindex |

49 \begin{document} | 49 \begin{document} |

50 \frontmatter | 50 \frontmatter |

51 \pagestyle{empty} | 51 \pagestyle{empty} |

52 \title{Implementing Multiple Precision Arithmetic \\ ~ \\ Draft Edition } | 52 \title{Multi--Precision Math} |

53 \author{\mbox{ | 53 \author{\mbox{ |

54 %\begin{small} | 54 %\begin{small} |

55 \begin{tabular}{c} | 55 \begin{tabular}{c} |

56 Tom St Denis \\ | 56 Tom St Denis \\ |

57 Algonquin College \\ | 57 Algonquin College \\ |

64 \end{tabular} | 64 \end{tabular} |

65 %\end{small} | 65 %\end{small} |

66 } | 66 } |

67 } | 67 } |

68 \maketitle | 68 \maketitle |

69 This text has been placed in the public domain. This text corresponds to the v0.30 release of the | 69 This text has been placed in the public domain. This text corresponds to the v0.35 release of the |

70 LibTomMath project. | 70 LibTomMath project. |

71 | 71 |

72 \begin{alltt} | 72 \begin{alltt} |

73 Tom St Denis | 73 Tom St Denis |

74 111 Banning Rd | 74 111 Banning Rd |

83 This text is formatted to the international B5 paper size of 176mm wide by 250mm tall using the \LaTeX{} | 83 This text is formatted to the international B5 paper size of 176mm wide by 250mm tall using the \LaTeX{} |

84 {\em book} macro package and the Perl {\em booker} package. | 84 {\em book} macro package and the Perl {\em booker} package. |

85 | 85 |

86 \tableofcontents | 86 \tableofcontents |

87 \listoffigures | 87 \listoffigures |

88 \chapter*{Prefaces to the Draft Edition} | 88 \chapter*{Prefaces} |

89 I started this text in April 2003 to complement my LibTomMath library. That is, explain how to implement the functions | 89 When I tell people about my LibTom projects and that I release them as public domain they are often puzzled. |

90 contained in LibTomMath. The goal is to have a textbook that any Computer Science student can use when implementing their | 90 They ask why I did it and especially why I continue to work on them for free. The best I can explain it is ``Because I can.'' |

91 own multiple precision arithmetic. The plan I wanted to follow was to flesh out all the | 91 Which seems odd and perhaps too terse for adult conversation. I often qualify it with ``I am able, I am willing.'' which |

92 ideas and concepts I had floating around in my head and then work on it afterwards refining a little bit at a time. Chance | 92 perhaps explains it better. I am the first to admit there is not anything that special with what I have done. Perhaps |

93 would have it that I ended up with my summer off from Algonquin College and I was given four months solid to work on the | 93 others can see that too and then we would have a society to be proud of. My LibTom projects are what I am doing to give |

94 text. | 94 back to society in the form of tools and knowledge that can help others in their endeavours. |

95 | 95 |

96 Choosing to not waste any time I dove right into the project even before my spring semester was finished. I wrote a bit | 96 I started writing this book because it was the most logical task to further my goal of open academia. The LibTomMath source |

97 off and on at first. The moment my exams were finished I jumped into long 12 to 16 hour days. The result after only | 97 code itself was written to be easy to follow and learn from. There are times, however, where pure C source code does not |

98 a couple of months was a ten chapter, three hundred page draft that I quickly had distributed to anyone who wanted | 98 explain the algorithms properly. Hence this book. The book literally starts with the foundation of the library and works |

99 to read it. I had Jean-Luc Cooke print copies for me and I brought them to Crypto'03 in Santa Barbara. So far I have | 99 itself outwards to the more complicated algorithms. The use of both pseudo--code and verbatim source code provides a duality |

100 managed to grab a certain level of attention; having people from around the world ask me for copies of the text was certainly | 100 of ``theory'' and ``practice'' that the computer science students of the world shall appreciate. I never deviate too far |

101 rewarding. | 101 from relatively straightforward algebra and I hope that this book can be a valuable learning asset. |

102 | 102 |

103 Now we are past December 2003. By this time I had pictured that I would have at least finished my second draft of the text. | 103 This book and indeed much of the LibTom projects would not exist in their current form if it was not for a plethora |

104 Currently I am far off from this goal. I've done partial re-writes of chapters one, two and three but they are not even | 104 of kind people donating their time, resources and kind words to help support my work. Writing a text of significant |

105 finished yet. I haven't given up on the project, only had some setbacks. First O'Reilly declined to publish the text then | 105 length (along with the source code) is a tiresome and lengthy process. Currently the LibTom project is four years old, |

106 Addison-Wesley and Greg tried another which I don't know the name of. However, at this point I want to focus my energy | 106 comprises literally thousands of users and over 100,000 lines of source code, TeX and other material. People like Mads and Greg |

107 on finishing the book, not securing a contract. | 107 were there at the beginning to encourage me to work well. It is amazing how timely validation from others can boost morale to |

108 | 108 continue the project. Definitely my parents were there for me by providing room and board during the many months of work in 2003. |

109 So why am I writing this text? It seems like a lot of work right? Most certainly it is a lot of work writing a textbook. | 109 |

110 Even the simplest introductory material has to be lined with references and figures. A lot of the text has to be re-written | 110 To my many friends whom I have met through the years I thank you for the good times and the words of encouragement. I hope I |

111 from point form to prose form to ensure an easier read. Why am I doing all this work for free then? Simple. My philosophy | 111 honour your kind gestures with this project. |

112 is quite simply ``Open Source. Open Academia. Open Minds'' which means that to achieve a goal of open minds, that is, | 112 |

113 people willing to accept new ideas and explore the unknown, you have to make available material they can access freely | 113 Open Source. Open Academia. Open Minds. |

114 without hindrance. | |

115 | |

116 I've been writing free software since I was about sixteen but only recently have I hit upon software that people have come | |

117 to depend upon. I started LibTomCrypt in December 2001 and now several major companies use it as integral portions of their | |

118 software. Several educational institutions use it as a matter of course and many freelance developers use it as | |

119 part of their projects. To further my contributions I started the LibTomMath project in December 2002 aimed at providing | |

120 multiple precision arithmetic routines that students could learn from. That is, write routines that are not only easy | |

121 to understand and follow but provide quite impressive performance considering they are all in standard portable ISO C. | |

122 | |

123 The second leg of my philosophy is ``Open Academia'' which is where this textbook comes in. In the end, when all is | |

124 said and done the text will be useable by educational institutions as a reference on multiple precision arithmetic. | |

125 | |

126 At this time I feel I should share a little information about myself. The most common question I was asked at | |

127 Crypto'03, perhaps just out of professional courtesy, was which school I either taught at or attended. The unfortunate | |

128 truth is that I neither teach at nor attend a school of academic reputation. I'm currently at Algonquin College which | |

129 is what I'd like to call a ``somewhat academic but mostly vocational'' college. In other words, job training. | |

130 | |

131 I'm a 21 year old computer science student mostly self-taught in the areas I am aware of (which includes a half-dozen | |

132 computer science fields, a few fields of mathematics and some English). I look forward to teaching someday but I am | |

133 still far off from that goal. | |

134 | |

135 Now it would be improper for me to not introduce the rest of the texts co-authors. While they are only contributing | |

136 corrections and editorial feedback their support has been tremendously helpful in presenting the concepts laid out | |

137 in the text so far. Greg has always been there for me. He has tracked my LibTom projects since their inception and even | |

138 sent cheques to help pay tuition from time to time. His background has provided a wonderful source to bounce ideas off | |

139 of and improve the quality of my writing. Mads is another fellow who has just ``been there''. I don't even recall what | |

140 his interest in the LibTom projects is but I'm definitely glad he has been around. His ability to catch logical errors | |

141 in my written English has saved me on several occasions to say the least. | |

142 | |

143 What to expect next? Well this is still a rough draft. I've only had the chance to update a few chapters. However, I've | |

144 been getting the feeling that people are starting to use my text and I owe them some updated material. My current tentative | |

145 plan is to edit one chapter every two weeks starting January 4th. It seems insane but my lower course load at college | |

146 should provide ample time. By Crypto'04 I plan to have a 2nd draft of the text polished and ready to hand out to as many | |

147 people who will take it. | |

148 | 114 |

149 \begin{flushright} Tom St Denis \end{flushright} | 115 \begin{flushright} Tom St Denis \end{flushright} |

150 | 116 |

151 \newpage | 117 \newpage |

152 I found the opportunity to work with Tom appealing for several reasons, not only could I broaden my own horizons, but also | 118 I found the opportunity to work with Tom appealing for several reasons, not only could I broaden my own horizons, but also |

935 akin to how the \textit{realloc} function from the standard C library works. Since the newly allocated digits are | 901 akin to how the \textit{realloc} function from the standard C library works. Since the newly allocated digits are |

936 assumed to contain undefined values they are initially set to zero. | 902 assumed to contain undefined values they are initially set to zero. |

937 | 903 |

938 EXAM,bn_mp_grow.c | 904 EXAM,bn_mp_grow.c |

939 | 905 |

940 A quick optimization is to first determine if a memory re-allocation is required at all. The if statement (line @23,[email protected]) checks | 906 A quick optimization is to first determine if a memory re-allocation is required at all. The if statement (line @24,[email protected]) checks |

941 if the \textbf{alloc} member of the mp\_int is smaller than the requested digit count. If the count is not larger than \textbf{alloc} | 907 if the \textbf{alloc} member of the mp\_int is smaller than the requested digit count. If the count is not larger than \textbf{alloc} |

942 the function skips the re-allocation part thus saving time. | 908 the function skips the re-allocation part thus saving time. |

943 | 909 |

944 When a re-allocation is performed it is turned into an optimal request to save time in the future. The requested digit count is | 910 When a re-allocation is performed it is turned into an optimal request to save time in the future. The requested digit count is |

945 padded upwards to 2nd multiple of \textbf{MP\_PREC} larger than \textbf{alloc} (line @25, [email protected]). The XREALLOC function is used | 911 padded upwards to 2nd multiple of \textbf{MP\_PREC} larger than \textbf{alloc} (line @25, [email protected]). The XREALLOC function is used |
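The growth strategy discussed in this hunk can be sketched in plain C. This is a simplified illustration, not the library's actual bn_mp_grow.c: the mp_int is cut down to the members under discussion, and MP_PREC here is a placeholder granularity rather than the library's configured value.

```c
#include <assert.h>
#include <stdlib.h>

#define MP_PREC 8      /* allocation granularity; placeholder value for this sketch */
#define MP_OKAY 0
#define MP_MEM  (-2)

typedef unsigned long mp_digit;
typedef struct { int used, alloc, sign; mp_digit *dp; } mp_int;

/* Simplified mp_grow: reallocate only when alloc is too small, pad the
 * request up past the next multiple of MP_PREC, then zero the new digits. */
int mp_grow(mp_int *a, int size)
{
   if (a->alloc < size) {
      int i, oldalloc = a->alloc;
      mp_digit *tmp;

      /* pad upward so future small growth requests cost nothing */
      size += (MP_PREC * 2) - (size % MP_PREC);

      tmp = (mp_digit *)realloc(a->dp, sizeof(mp_digit) * (size_t)size);
      if (tmp == NULL) {
         return MP_MEM;              /* on failure the old array is left intact */
      }
      a->dp = tmp;
      a->alloc = size;

      /* newly allocated digits hold undefined values; clear them */
      for (i = oldalloc; i < a->alloc; i++) {
         a->dp[i] = 0;
      }
   }
   return MP_OKAY;                   /* cheap early exit when alloc suffices */
}
```

When the request already fits within **alloc** the function returns immediately, which is exactly the cheap test the text describes.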

1308 \section{Sign Manipulation} | 1274 \section{Sign Manipulation} |

1309 \subsection{Absolute Value} | 1275 \subsection{Absolute Value} |

1310 With the mp\_int representation of an integer, calculating the absolute value is trivial. The mp\_abs algorithm will compute | 1276 With the mp\_int representation of an integer, calculating the absolute value is trivial. The mp\_abs algorithm will compute |

1311 the absolute value of an mp\_int. | 1277 the absolute value of an mp\_int. |

1312 | 1278 |

1313 \newpage\begin{figure}[here] | 1279 \begin{figure}[here] |

1314 \begin{center} | 1280 \begin{center} |

1315 \begin{tabular}{l} | 1281 \begin{tabular}{l} |

1316 \hline Algorithm \textbf{mp\_abs}. \\ | 1282 \hline Algorithm \textbf{mp\_abs}. \\ |

1317 \textbf{Input}. An mp\_int $a$ \\ | 1283 \textbf{Input}. An mp\_int $a$ \\ |

1318 \textbf{Output}. Computes $b = \vert a \vert$ \\ | 1284 \textbf{Output}. Computes $b = \vert a \vert$ \\ |

1332 algorithm where the check in mp\_copy that determines if the source and destination are equal proves useful. This allows, | 1298 algorithm where the check in mp\_copy that determines if the source and destination are equal proves useful. This allows, |

1333 for instance, the developer to pass the same mp\_int as the source and destination to this function without additional | 1299 for instance, the developer to pass the same mp\_int as the source and destination to this function without additional |

1334 logic to handle it. | 1300 logic to handle it. |

1335 | 1301 |

1336 EXAM,bn_mp_abs.c | 1302 EXAM,bn_mp_abs.c |

1303 | |

1304 This fairly trivial algorithm first eliminates non--required duplications (line @27,a != [email protected]) and then sets the | |

1305 \textbf{sign} flag to \textbf{MP\_ZPOS}. | |

1337 | 1306 |
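The duplication-avoiding behaviour described here can be sketched as follows. This is not the library's bn_mp_abs.c; the fixed-capacity digit array, the CAP constant and the 0/1 sign values are simplifications assumed for this sketch only.

```c
#include <assert.h>
#include <string.h>

#define MP_ZPOS 0      /* assumed positive sign value */
#define MP_NEG  1      /* assumed negative sign value */
#define MP_OKAY 0
#define CAP 16         /* fixed digit capacity; a sketch simplification */

typedef unsigned long mp_digit;
typedef struct { int used, sign; mp_digit dp[CAP]; } mp_int;

/* Simplified mp_abs: duplicate the digits only when source and destination
 * differ, then force the sign positive. */
int mp_abs(mp_int *a, mp_int *b)
{
   if (a != b) {                     /* skip the non-required duplication */
      memcpy(b->dp, a->dp, sizeof(mp_digit) * (size_t)a->used);
      b->used = a->used;
   }
   b->sign = MP_ZPOS;                /* |a| is never negative */
   return MP_OKAY;
}
```

Passing the same mp_int as both source and destination works because the copy step is skipped entirely.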

1338 \subsection{Integer Negation} | 1307 \subsection{Integer Negation} |

1339 With the mp\_int representation of an integer, calculating the negation is also trivial. The mp\_neg algorithm will compute | 1308 With the mp\_int representation of an integer, calculating the negation is also trivial. The mp\_neg algorithm will compute |

1340 the negative of an mp\_int input. | 1309 the negative of an mp\_int input. |

1341 | 1310 |

1366 $a$ had no digits then it must be positive by definition. Had step three been omitted then the algorithm would return | 1335 $a$ had no digits then it must be positive by definition. Had step three been omitted then the algorithm would return |

1367 zero as negative. | 1336 zero as negative. |

1368 | 1337 |

1369 EXAM,bn_mp_neg.c | 1338 EXAM,bn_mp_neg.c |

1370 | 1339 |

1340 Like mp\_abs() this function avoids non--required duplications (line @21,a != [email protected]) and then sets the sign. We | |

1341 have to make sure that only non--zero values get a \textbf{sign} of \textbf{MP\_NEG}. If the mp\_int is zero | |

1342 then the \textbf{sign} is hard--coded to \textbf{MP\_ZPOS}. | |

1343 | |
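The zero-handling rule for negation can be sketched the same way. Again this is an illustrative reduction, not bn_mp_neg.c itself; the fixed-capacity array and the 0/1 sign constants are assumptions of the sketch.

```c
#include <assert.h>
#include <string.h>

#define MP_ZPOS 0
#define MP_NEG  1
#define MP_OKAY 0
#define CAP 16         /* fixed digit capacity; a sketch simplification */

typedef unsigned long mp_digit;
typedef struct { int used, sign; mp_digit dp[CAP]; } mp_int;

/* Simplified mp_neg: copy when needed, flip the sign, but make sure a
 * zero result always carries MP_ZPOS. */
int mp_neg(mp_int *a, mp_int *b)
{
   if (a != b) {                     /* avoid the non-required duplication */
      memcpy(b->dp, a->dp, sizeof(mp_digit) * (size_t)a->used);
      b->used = a->used;
   }
   if (b->used != 0) {
      b->sign = (a->sign == MP_ZPOS) ? MP_NEG : MP_ZPOS;
   } else {
      b->sign = MP_ZPOS;             /* zero is hard-coded positive */
   }
   return MP_OKAY;
}
```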

1371 \section{Small Constants} | 1344 \section{Small Constants} |

1372 \subsection{Setting Small Constants} | 1345 \subsection{Setting Small Constants} |

1373 Often an mp\_int must be set to a relatively small value such as $1$ or $2$. For these cases the mp\_set algorithm is useful. | 1346 Often an mp\_int must be set to a relatively small value such as $1$ or $2$. For these cases the mp\_set algorithm is useful. |

1374 | 1347 |

1375 \begin{figure}[here] | 1348 \newpage\begin{figure}[here] |

1376 \begin{center} | 1349 \begin{center} |

1377 \begin{tabular}{l} | 1350 \begin{tabular}{l} |

1378 \hline Algorithm \textbf{mp\_set}. \\ | 1351 \hline Algorithm \textbf{mp\_set}. \\ |

1379 \textbf{Input}. An mp\_int $a$ and a digit $b$ \\ | 1352 \textbf{Input}. An mp\_int $a$ and a digit $b$ \\ |

1380 \textbf{Output}. Make $a$ equivalent to $b$ \\ | 1353 \textbf{Output}. Make $a$ equivalent to $b$ \\ |

1395 This algorithm sets an mp\_int to a small single digit value. Step number 1 ensures that the integer is reset to the default state. The | 1368 This algorithm sets an mp\_int to a small single digit value. Step number 1 ensures that the integer is reset to the default state. The |

1396 single digit is set (\textit{modulo $\beta$}) and the \textbf{used} count is adjusted accordingly. | 1369 single digit is set (\textit{modulo $\beta$}) and the \textbf{used} count is adjusted accordingly. |

1397 | 1370 |

1398 EXAM,bn_mp_set.c | 1371 EXAM,bn_mp_set.c |

1399 | 1372 |

1400 Line @21,[email protected] calls mp\_zero() to clear the mp\_int and reset the sign. Line @22,[email protected] copies the digit | 1373 First we zero (line @21,[email protected]) the mp\_int to make sure that the other members are initialized for a |

1401 into the least significant location. Note the usage of a new constant \textbf{MP\_MASK}. This constant is used to quickly | 1374 small positive constant. mp\_zero() ensures that the \textbf{sign} is positive and the \textbf{used} count |

1402 reduce an integer modulo $\beta$. Since $\beta$ is of the form $2^k$ for any suitable $k$ it suffices to perform a binary AND with | 1375 is zero. Next we set the digit and reduce it modulo $\beta$ (line @22,[email protected]). After this step we have to |

1403 $MP\_MASK = 2^k - 1$ to perform the reduction. Finally line @23,a->[email protected] will set the \textbf{used} member with respect to the | 1376 check if the resulting digit is zero or not. If it is not then we set the \textbf{used} count to one, otherwise |

1404 digit actually set. This function will always make the integer positive. | 1377 to zero. |

1378 | |

1379 We can quickly reduce modulo $\beta$ since it is of the form $2^k$ and a quick binary AND operation with | |

1380 $2^k - 1$ will perform the same operation. | |

1405 | 1381 |

1406 One important limitation of this function is that it will only set one digit. The size of a digit is not fixed, meaning source that uses | 1382 One important limitation of this function is that it will only set one digit. The size of a digit is not fixed, meaning source that uses |

1407 this function should take that into account. Only trivially small constants can be set using this function. | 1383 this function should take that into account. Only trivially small constants can be set using this function. |

1408 | 1384 |
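The zero-then-set sequence, including the AND-based reduction modulo $\beta = 2^{28}$ named in the text, can be sketched as below. The fixed-capacity mp_int is a simplification for the sketch; only the modulo trick and the **used**-count adjustment mirror the discussion.

```c
#include <assert.h>
#include <string.h>

#define MP_ZPOS   0
#define DIGIT_BIT 28                 /* beta = 2^28, the default named in the text */
#define MP_MASK   ((((mp_digit)1) << DIGIT_BIT) - 1)
#define CAP 16                       /* fixed digit capacity; a sketch simplification */

typedef unsigned long mp_digit;
typedef struct { int used, sign; mp_digit dp[CAP]; } mp_int;

/* mp_zero resets the other members before the digit is stored */
void mp_zero(mp_int *a)
{
   a->used = 0;
   a->sign = MP_ZPOS;
   memset(a->dp, 0, sizeof(a->dp));
}

/* Simplified mp_set: zero, store the digit reduced modulo beta, and set
 * the used count to one only when the stored digit is non-zero. */
void mp_set(mp_int *a, mp_digit b)
{
   mp_zero(a);
   a->dp[0] = b & MP_MASK;           /* AND with 2^k - 1 reduces modulo beta = 2^k */
   a->used = (a->dp[0] != 0) ? 1 : 0;
}
```

Note that a value of exactly $\beta$ reduces to zero, so the **used** count correctly stays zero.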

1409 \subsection{Setting Large Constants} | 1385 \subsection{Setting Large Constants} |

1501 By step three both inputs must have the same number of digits so it's safe to start from either $a.used - 1$ or $b.used - 1$ and count down to | 1477 By step three both inputs must have the same number of digits so it's safe to start from either $a.used - 1$ or $b.used - 1$ and count down to |

1502 the zero'th digit. If after all of the digits have been compared, no difference is found, the algorithm returns \textbf{MP\_EQ}. | 1478 the zero'th digit. If after all of the digits have been compared, no difference is found, the algorithm returns \textbf{MP\_EQ}. |

1503 | 1479 |

1504 EXAM,bn_mp_cmp_mag.c | 1480 EXAM,bn_mp_cmp_mag.c |

1505 | 1481 |

1506 The two if statements on lines @24,[email protected] and @28,[email protected] compare the number of digits in the two inputs. These two are performed before all of the digits | 1482 The two if statements (lines @24,[email protected] and @28,[email protected]) compare the number of digits in the two inputs. These two are |

1507 are compared since it is a very cheap test to perform and can potentially save considerable time. The implementation given is also not valid | 1483 performed before all of the digits are compared since it is a very cheap test to perform and can potentially save |

1508 without those two statements. $b.alloc$ may be smaller than $a.used$, meaning that undefined values will be read from $b$ past the end of the | 1484 considerable time. The implementation given is also not valid without those two statements. $b.alloc$ may be |

1509 array of digits. | 1485 smaller than $a.used$, meaning that undefined values will be read from $b$ past the end of the array of digits. |
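The digit-count-first ordering of the tests can be sketched as follows. This is an illustration rather than bn_mp_cmp_mag.c; the fixed-size struct is assumed for the sketch, and the point is that the length tests run before any digit of $b$ is read.

```c
#include <assert.h>

#define MP_LT (-1)
#define MP_EQ 0
#define MP_GT 1
#define CAP 16         /* fixed digit capacity; a sketch simplification */

typedef unsigned long mp_digit;
typedef struct { int used; mp_digit dp[CAP]; } mp_int;

/* Simplified mp_cmp_mag: the cheap digit-count tests come first, then the
 * digits are compared from the most significant position downward. */
int mp_cmp_mag(const mp_int *a, const mp_int *b)
{
   int n;

   /* comparing used counts is cheap and avoids reading past b's digits */
   if (a->used > b->used) return MP_GT;
   if (a->used < b->used) return MP_LT;

   /* equal length: walk from a.used - 1 down to the zeroth digit */
   for (n = a->used - 1; n >= 0; n--) {
      if (a->dp[n] > b->dp[n]) return MP_GT;
      if (a->dp[n] < b->dp[n]) return MP_LT;
   }
   return MP_EQ;
}
```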

1486 | |

1487 | |

1510 | 1488 |

1511 \subsection{Signed Comparisons} | 1489 \subsection{Signed Comparisons} |

1512 Comparing with sign considerations is also fairly critical in several routines (\textit{division for example}). Based on an unsigned magnitude | 1490 Comparing with sign considerations is also fairly critical in several routines (\textit{division for example}). Based on an unsigned magnitude |

1513 comparison a trivial signed comparison algorithm can be written. | 1491 comparison a trivial signed comparison algorithm can be written. |

1514 | 1492 |

1537 three the unsigned comparison flips the order of the arguments since they are both negative. For instance, if $-a > -b$ then | 1515 three the unsigned comparison flips the order of the arguments since they are both negative. For instance, if $-a > -b$ then |

1538 $\vert a \vert < \vert b \vert$. Step number four will compare the two when they are both positive. | 1516 $\vert a \vert < \vert b \vert$. Step number four will compare the two when they are both positive. |

1539 | 1517 |

1540 EXAM,bn_mp_cmp.c | 1518 EXAM,bn_mp_cmp.c |

1541 | 1519 |

1542 The two if statements on lines @22,[email protected] and @26,[email protected] perform the initial sign comparison. If the signs are not equal then whichever | 1520 The two if statements (lines @22,[email protected] and @26,[email protected]) perform the initial sign comparison. If the signs are not equal then whichever |

1543 has the positive sign is larger. At line @30,[email protected], the inputs are compared based on magnitudes. If the signs were both negative then | 1521 has the positive sign is larger. The inputs are compared (line @30,[email protected]) based on magnitudes. If the signs were both |

1544 the unsigned comparison is performed in the opposite direction (\textit{line @31,[email protected]}). Otherwise, the signs are assumed to | 1522 negative then the unsigned comparison is performed in the opposite direction (line @31,[email protected]). Otherwise, the signs are assumed to |

1545 be both positive and a forward direction unsigned comparison is performed. | 1523 be both positive and a forward direction unsigned comparison is performed. |
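The sign-then-magnitude dispatch, including the flipped argument order for two negatives, can be sketched like this. The struct layout and the 0/1 sign constants are sketch assumptions, not the library's definitions.

```c
#include <assert.h>

#define MP_LT (-1)
#define MP_EQ 0
#define MP_GT 1
#define MP_ZPOS 0
#define MP_NEG  1
#define CAP 16         /* fixed digit capacity; a sketch simplification */

typedef unsigned long mp_digit;
typedef struct { int used, sign; mp_digit dp[CAP]; } mp_int;

/* unsigned magnitude comparison, as covered earlier */
int mp_cmp_mag(const mp_int *a, const mp_int *b)
{
   int n;
   if (a->used > b->used) return MP_GT;
   if (a->used < b->used) return MP_LT;
   for (n = a->used - 1; n >= 0; n--) {
      if (a->dp[n] > b->dp[n]) return MP_GT;
      if (a->dp[n] < b->dp[n]) return MP_LT;
   }
   return MP_EQ;
}

/* Simplified mp_cmp: signs first, then magnitudes; for two negatives the
 * argument order is flipped since -a > -b implies |a| < |b|. */
int mp_cmp(const mp_int *a, const mp_int *b)
{
   if (a->sign == MP_NEG && b->sign == MP_ZPOS) return MP_LT;
   if (a->sign == MP_ZPOS && b->sign == MP_NEG) return MP_GT;
   if (a->sign == MP_NEG) return mp_cmp_mag(b, a);   /* both negative: flip */
   return mp_cmp_mag(a, b);                          /* both positive */
}
```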

1546 | 1524 |

1547 \section*{Exercises} | 1525 \section*{Exercises} |

1548 \begin{tabular}{cl} | 1526 \begin{tabular}{cl} |

1549 $\left [ 2 \right ]$ & Modify algorithm mp\_set\_int to accept as input a variable length array of bits. \\ | 1527 $\left [ 2 \right ]$ & Modify algorithm mp\_set\_int to accept as input a variable length array of bits. \\ |

1662 The final carry is stored in $c_{max}$ and digits above $max$ up to $oldused$ are zeroed which completes the addition. | 1640 The final carry is stored in $c_{max}$ and digits above $max$ up to $oldused$ are zeroed which completes the addition. |

1663 | 1641 |

1664 | 1642 |

1665 EXAM,bn_s_mp_add.c | 1643 EXAM,bn_s_mp_add.c |

1666 | 1644 |

1667 Lines @27,[email protected] to @35,}@ perform the initial sorting of the inputs and determine the $min$ and $max$ variables. Note that $x$ is a pointer to a | 1645 We first sort (lines @27,[email protected] to @35,}@) the inputs based on magnitude and determine the $min$ and $max$ variables. |

1668 mp\_int assigned to the largest input, in effect it is a local alias. Lines @37,[email protected] to @42,}@ ensure that the destination is grown to | 1646 Note that $x$ is a pointer to an mp\_int assigned to the largest input, in effect it is a local alias. Next we |

1669 accommodate the result of the addition. | 1647 grow the destination (@37,[email protected] to @42,}@) to ensure that it can accommodate the result of the addition. |

1670 | 1648 |

1671 Similar to the implementation of mp\_copy this function uses the braced code and local aliases coding style. The three aliases that are on | 1649 Similar to the implementation of mp\_copy this function uses the braced code and local aliases coding style. The three aliases that are on |

1672 lines @56,[email protected], @59,[email protected] and @62,[email protected] represent the two inputs and destination variables respectively. These aliases are used to ensure the | 1650 lines @56,[email protected], @59,[email protected] and @62,[email protected] represent the two inputs and destination variables respectively. These aliases are used to ensure the |

1673 compiler does not have to dereference $a$, $b$ or $c$ (respectively) to access the digits of the respective mp\_int. | 1651 compiler does not have to dereference $a$, $b$ or $c$ (respectively) to access the digits of the respective mp\_int. |

1674 | 1652 |

1675 The initial carry $u$ is cleared on line @65,u = [email protected], note that $u$ is of type mp\_digit which ensures type compatibility within the | 1653 The initial carry $u$ will be cleared (line @65,u = [email protected]), note that $u$ is of type mp\_digit which ensures type |

1676 implementation. The initial addition loop begins on line @66,[email protected] and ends on line @75,}@. Similarly the conditional addition loop | 1654 compatibility within the implementation. The initial addition (line @66,[email protected] to @75,}@) adds digits from |

1677 begins on line @81,[email protected] and ends on line @90,}@. The addition is finished with the final carry being stored in $tmpc$ on line @94,[email protected] | 1655 both inputs until the smallest input runs out of digits. Similarly the conditional addition loop |

1678 Note the ``++'' operator on the same line. After line @94,[email protected] $tmpc$ will point to the $c.used$'th digit of the mp\_int $c$. This is useful | 1656 (line @81,[email protected] to @90,}@) adds the remaining digits from the larger of the two inputs. The addition is finished |

1679 for the next loop on lines @97,[email protected] to @99,}@ which sets any old upper digits to zero. | 1657 with the final carry being stored in $tmpc$ (line @94,[email protected]). Note the ``++'' operator within the same expression. |

1658 After line @94,[email protected], $tmpc$ will point to the $c.used$'th digit of the mp\_int $c$. This is useful | |

1659 for the next loop (line @97,[email protected] to @99,}@) which sets any old upper digits to zero. | |
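The two-loop structure described above (unconditional loop over the shared digits, conditional loop over the alias $x$, then the final carry and zeroing of old upper digits) can be sketched as below. This is not bn_s_mp_add.c: the fixed-capacity array replaces the mp_grow step, and the trailing trim is a simplification of mp_clamp.

```c
#include <assert.h>

#define MP_OKAY 0
#define DIGIT_BIT 28
#define MP_MASK ((((mp_digit)1) << DIGIT_BIT) - 1)
#define CAP 16         /* fixed capacity; the real code grows the destination instead */

typedef unsigned long mp_digit;
typedef struct { int used; mp_digit dp[CAP]; } mp_int;

/* Simplified s_mp_add: loop over min digits, then over the remainder of
 * the larger input x, then store the final carry. */
int s_mp_add(const mp_int *a, const mp_int *b, mp_int *c)
{
   const mp_int *x;
   int i, min, max, oldused = c->used;
   mp_digit u;

   /* sort: x aliases whichever input has more digits */
   if (a->used > b->used) { min = b->used; max = a->used; x = a; }
   else                   { min = a->used; max = b->used; x = b; }

   u = 0;                                  /* clear the initial carry */
   for (i = 0; i < min; i++) {             /* digits present in both inputs */
      c->dp[i] = a->dp[i] + b->dp[i] + u;
      u = c->dp[i] >> DIGIT_BIT;           /* carry is the part above beta */
      c->dp[i] &= MP_MASK;
   }
   for (; i < max; i++) {                  /* conditional loop over x only */
      c->dp[i] = x->dp[i] + u;
      u = c->dp[i] >> DIGIT_BIT;
      c->dp[i] &= MP_MASK;
   }
   c->dp[i++] = u;                         /* store the final carry */
   c->used = i;

   for (; i < oldused; i++) c->dp[i] = 0;  /* zero any old upper digits */
   while (c->used > 0 && c->dp[c->used - 1] == 0) c->used--;   /* trim */
   return MP_OKAY;
}
```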

1680 | 1660 |

1681 \subsection{Low Level Subtraction} | 1661 \subsection{Low Level Subtraction} |

1682 The low level unsigned subtraction algorithm is very similar to the low level unsigned addition algorithm. The principle difference is that the | 1662 The low level unsigned subtraction algorithm is very similar to the low level unsigned addition algorithm. The principle difference is that the |

1683 unsigned subtraction algorithm requires the result to be positive. That is when computing $a - b$ the condition $\vert a \vert \ge \vert b\vert$ must | 1663 unsigned subtraction algorithm requires the result to be positive. That is when computing $a - b$ the condition $\vert a \vert \ge \vert b\vert$ must |

1684 be met for this algorithm to function properly. Keep in mind this low level algorithm is not meant to be used in higher level algorithms directly. | 1664 be met for this algorithm to function properly. Keep in mind this low level algorithm is not meant to be used in higher level algorithms directly. |

1690 the range $0 \le x < 2\beta$ for the algorithms to work correctly. However, it is allowable that a mp\_digit represent a larger range of values. For | 1670 the range $0 \le x < 2\beta$ for the algorithms to work correctly. However, it is allowable that a mp\_digit represent a larger range of values. For |

1691 this algorithm we will assume that the variable $\gamma$ represents the number of bits available in a | 1671 this algorithm we will assume that the variable $\gamma$ represents the number of bits available in a |

1692 mp\_digit (\textit{this implies $2^{\gamma} > \beta$}). | 1672 mp\_digit (\textit{this implies $2^{\gamma} > \beta$}). |

1693 | 1673 |

1694 For example, the default for LibTomMath is to use a ``unsigned long'' for the mp\_digit ``type'' while $\beta = 2^{28}$. In ISO C an ``unsigned long'' | 1674 For example, the default for LibTomMath is to use a ``unsigned long'' for the mp\_digit ``type'' while $\beta = 2^{28}$. In ISO C an ``unsigned long'' |

1695 data type must be able to represent $0 \le x < 2^{32}$ meaning that in this case $\gamma = 32$. | 1675 data type must be able to represent $0 \le x < 2^{32}$ meaning that in this case $\gamma \ge 32$. |

1696 | 1676 |

1697 \newpage\begin{figure}[!here] | 1677 \newpage\begin{figure}[!here] |

1698 \begin{center} | 1678 \begin{center} |

1699 \begin{small} | 1679 \begin{small} |

1700 \begin{tabular}{l} | 1680 \begin{tabular}{l} |

1757 If $b$ has a smaller magnitude than $a$ then step 9 will force the carry and copy operation to propagate through the larger input $a$ into $c$. Step | 1737 If $b$ has a smaller magnitude than $a$ then step 9 will force the carry and copy operation to propagate through the larger input $a$ into $c$. Step |

1758 10 will ensure that any leading digits of $c$ above the $max$'th position are zeroed. | 1738 10 will ensure that any leading digits of $c$ above the $max$'th position are zeroed. |

1759 | 1739 |

1760 EXAM,bn_s_mp_sub.c | 1740 EXAM,bn_s_mp_sub.c |

1761 | 1741 |

1762 Lines @24,[email protected] and @25,[email protected] perform the initial hardcoded sorting of the inputs. In reality the $min$ and $max$ variables are only aliases and are only | 1742 Like low level addition we ``sort'' the inputs, except in this case the sorting is hardcoded |

1763 used to make the source code easier to read. Again the pointer alias optimization is used within this algorithm. Lines @42,[email protected], @43,[email protected] and @44,[email protected] initialize the aliases for | 1743 (lines @24,[email protected] and @25,[email protected]). In reality the $min$ and $max$ variables are only aliases and are only |

1764 $a$, $b$ and $c$ respectively. | 1744 used to make the source code easier to read. Again the pointer alias optimization is used |

1765 | 1745 within this algorithm. The aliases $tmpa$, $tmpb$ and $tmpc$ are initialized |

1766 The first subtraction loop occurs on lines @47,u = [email protected] through @61,}@. The theory behind the subtraction loop is exactly the same as that for | 1746 (lines @42,[email protected], @43,[email protected] and @44,[email protected]) for $a$, $b$ and $c$ respectively. |

1767 the addition loop. As remarked earlier there is an implementation reason for using the ``awkward'' method of extracting the carry | 1747 |

1768 (\textit{see line @57, >>@}). The traditional method for extracting the carry would be to shift by $lg(\beta)$ positions and logically AND | 1748 The first subtraction loop (lines @47,u = [email protected] through @61,}@) subtracts digits from both inputs until the smaller of |

1769 the least significant bit. The AND operation is required because all of the bits above the $\lg(\beta)$'th bit will be set to one after a carry | 1749 the two inputs has been exhausted. As remarked earlier there is an implementation reason for using the ``awkward'' |

1770 occurs from subtraction. This carry extraction requires two relatively cheap operations to extract the carry. The other method is to simply | 1750 method of extracting the carry (line @57, >>@). The traditional method for extracting the carry would be to shift |

1771 shift the most significant bit to the least significant bit thus extracting the carry with a single cheap operation. This optimization only works on | 1751 by $lg(\beta)$ positions and logically AND the least significant bit. The AND operation is required because all of |

1772 twos complement machines which is a safe assumption to make. | 1752 the bits above the $\lg(\beta)$'th bit will be set to one after a carry occurs from subtraction. This carry |

1773 | 1753 extraction requires two relatively cheap operations to extract the carry. The other method is to simply shift the |

1774 If $a$ has a larger magnitude than $b$ an additional loop (\textit{see lines @64,[email protected] through @73,}@}) is required to propagate the carry through | 1754 most significant bit to the least significant bit thus extracting the carry with a single cheap operation. This |

1775 $a$ and copy the result to $c$. | 1755 optimization only works on twos complement machines which is a safe assumption to make. |

1756 | |

1757 If $a$ has a larger magnitude than $b$ an additional loop (lines @64,[email protected] through @73,}@) is required to propagate | |

1758 the carry through $a$ and copy the result to $c$. | |

1776 | 1759 |

1777 \subsection{High Level Addition} | 1760 \subsection{High Level Addition} |

1778 Now that both lower level addition and subtraction algorithms have been established an effective high level signed addition algorithm can be | 1761 Now that both lower level addition and subtraction algorithms have been established an effective high level signed addition algorithm can be |

1779 established. This high level addition algorithm will be what other algorithms and developers will use to perform addition of mp\_int data | 1762 established. This high level addition algorithm will be what other algorithms and developers will use to perform addition of mp\_int data |

1780 types. | 1763 types. |

2096 \newpage | 2079 \newpage |

2097 FIGU,sliding_window,Sliding Window Movement | 2080 FIGU,sliding_window,Sliding Window Movement |

2098 | 2081 |

2099 EXAM,bn_mp_lshd.c | 2082 EXAM,bn_mp_lshd.c |

2100 | 2083 |

2101 The if statement on line @24,[email protected] ensures that the $b$ variable is greater than zero. The \textbf{used} count is incremented by $b$ before | 2084 The if statement (line @24,[email protected]) ensures that the $b$ variable is greater than zero since we do not interpret negative |

2102 the copy loop begins. This eliminates the need for an additional variable in the for loop. The variable $top$ on line @42,[email protected] is an alias | 2085 shift counts properly. The \textbf{used} count is incremented by $b$ before the copy loop begins. This eliminates |

2103 for the leading digit while $bottom$ on line @45,[email protected] is an alias for the trailing edge. The aliases form a window of exactly $b$ digits | 2086 the need for an additional variable in the for loop. The variable $top$ (line @42,[email protected]) is an alias |

2104 over the input. | 2087 for the leading digit while $bottom$ (line @45,[email protected]) is an alias for the trailing edge. The aliases form a |

2088 window of exactly $b$ digits over the input. | |

2105 | 2089 |

2106 \subsection{Division by $x$} | 2090 \subsection{Division by $x$} |

2107 | 2091 |

2108 Division by powers of $x$ is easily achieved by shifting the digits right and removing any that will end up to the right of the zero'th digit. | 2092 Division by powers of $x$ is easily achieved by shifting the digits right and removing any that will end up to the right of the zero'th digit. |

2109 | 2093 |

2149 | 2133 |

2150 Once the window copy is complete the upper digits must be zeroed and the \textbf{used} count decremented. | 2134 Once the window copy is complete the upper digits must be zeroed and the \textbf{used} count decremented. |

2151 | 2135 |

2152 EXAM,bn_mp_rshd.c | 2136 EXAM,bn_mp_rshd.c |

2153 | 2137 |

2154 The only noteworthy element of this routine is the lack of a return type. | 2138 The only noteworthy element of this routine is the lack of a return type since it cannot fail. Like mp\_lshd() we |

2155 | 2139 form a sliding window except we copy in the other direction. After the window copy (line @59,for (;@) we then zero |

2156 -- Will update later to give it a return type...Tom | 2140 the upper digits of the input to make sure the result is correct. |

2157 | 2141 |

2158 \section{Powers of Two} | 2142 \section{Powers of Two} |

2159 | 2143 |

2160 Now that algorithms for moving single bits as well as whole digits exist algorithms for moving the ``in between'' distances are required. For | 2144 Now that algorithms for moving single bits as well as whole digits exist algorithms for moving the ``in between'' distances are required. For |

2161 example, to quickly multiply by $2^k$ for any $k$ without using a full multiplier algorithm would prove useful. Instead of performing single | 2145 example, to quickly multiply by $2^k$ for any $k$ without using a full multiplier algorithm would prove useful. Instead of performing single |

2212 This algorithm is loosely measured as a $O(2n)$ algorithm which means that if the input is $n$-digits that it takes $2n$ ``time'' to | 2196 This algorithm is loosely measured as a $O(2n)$ algorithm which means that if the input is $n$-digits that it takes $2n$ ``time'' to |

2213 complete. It is possible to optimize this algorithm down to a $O(n)$ algorithm at a cost of making the algorithm slightly harder to follow. | 2197 complete. It is possible to optimize this algorithm down to a $O(n)$ algorithm at a cost of making the algorithm slightly harder to follow. |

2214 | 2198 |

2215 EXAM,bn_mp_mul_2d.c | 2199 EXAM,bn_mp_mul_2d.c |

2216 | 2200 |

2217 Notes to be revised when code is updated. -- Tom | 2201 The shifting is performed in--place which means the first step (line @24,a != [email protected]) is to copy the input to the |

2202 destination. We avoid calling mp\_copy() by making sure the mp\_ints are different. The destination then | |

2203 has to be grown (line @31,[email protected]) to accommodate the result. | |

2204 | |

2205 If the shift count $b$ is larger than $lg(\beta)$ then a call to mp\_lshd() is used to handle all of the multiples | |

2206 of $lg(\beta)$. This leaves a remaining shift of $lg(\beta) - 1$ or fewer bits. Inside the actual shift | |

2207 loop (lines @45,[email protected] to @76,}@) we make use of pre--computed values $shift$ and $mask$. These are used to | |

2208 extract the carry bit(s) to pass into the next iteration of the loop. The $r$ and $rr$ variables form a | |

2209 chain between consecutive iterations to propagate the carry. | |

2218 | 2210 |

2219 \subsection{Division by Power of Two} | 2211 \subsection{Division by Power of Two} |

2220 | 2212 |

2221 \newpage\begin{figure}[!here] | 2213 \newpage\begin{figure}[!here] |

2222 \begin{small} | 2214 \begin{small} |

2261 The implementation of algorithm mp\_div\_2d is slightly different than the algorithm specifies. The remainder $d$ may be optionally | 2253 The implementation of algorithm mp\_div\_2d is slightly different than the algorithm specifies. The remainder $d$ may be optionally |

2262 ignored by passing \textbf{NULL} as the pointer to the mp\_int variable. The temporary mp\_int variable $t$ is used to hold the | 2254 ignored by passing \textbf{NULL} as the pointer to the mp\_int variable. The temporary mp\_int variable $t$ is used to hold the |

2263 result of the remainder operation until the end. This allows $d$ and $a$ to represent the same mp\_int without modifying $a$ before | 2255 result of the remainder operation until the end. This allows $d$ and $a$ to represent the same mp\_int without modifying $a$ before |

2264 the quotient is obtained. | 2256 the quotient is obtained. |

2265 | 2257 |

2266 The remainder of the source code is essentially the same as the source code for mp\_mul\_2d. (-- Fix this paragraph up later, Tom). | 2258 The remainder of the source code is essentially the same as the source code for mp\_mul\_2d. The only significant difference is |

2259 the direction of the shifts. | |

2267 | 2260 |

2268 \subsection{Remainder of Division by Power of Two} | 2261 \subsection{Remainder of Division by Power of Two} |

2269 | 2262 |

2270 The last algorithm in the series of polynomial basis power of two algorithms is calculating the remainder of division by $2^b$. This | 2263 The last algorithm in the series of polynomial basis power of two algorithms is calculating the remainder of division by $2^b$. This |

2271 algorithm benefits from the fact that in twos complement arithmetic $a \mbox{ (mod }2^b\mbox{)}$ is the same as $a$ AND $2^b - 1$. | 2264 algorithm benefits from the fact that in twos complement arithmetic $a \mbox{ (mod }2^b\mbox{)}$ is the same as $a$ AND $2^b - 1$. |

2304 result is set to zero. If $b$ is greater than the number of bits in $a$ then it simply copies $a$ to $c$ and returns. Otherwise, $a$ | 2297 result is set to zero. If $b$ is greater than the number of bits in $a$ then it simply copies $a$ to $c$ and returns. Otherwise, $a$ |

2305 is copied to $b$, leading digits are removed and the remaining leading digit is trimmed to the exact bit count. | 2298 is copied to $b$, leading digits are removed and the remaining leading digit is trimmed to the exact bit count. |

2306 | 2299 |

2307 EXAM,bn_mp_mod_2d.c | 2300 EXAM,bn_mp_mod_2d.c |

2308 | 2301 |

2309 -- Add comments later, Tom. | 2302 We first avoid cases of $b \le 0$ by simply mp\_zero()'ing the destination in such cases. Next if $2^b$ is larger |

2303 than the input we just mp\_copy() the input and return right away. After this point we know we must actually | |

2304 perform some work to produce the remainder. | |

2305 | |

2306 Recalling that reducing modulo $2^k$ and a binary ``and'' with $2^k - 1$ are numerically equivalent we can quickly reduce | |

2307 the number. First we zero any digits above the last digit in $2^b$ (line @41,[email protected]). Next we reduce the | |

2308 leading digit of both (line @45,&[email protected]) and then mp\_clamp(). | |

2310 | 2309 |

2311 \section*{Exercises} | 2310 \section*{Exercises} |

2312 \begin{tabular}{cl} | 2311 \begin{tabular}{cl} |

2313 $\left [ 3 \right ] $ & Devise an algorithm that performs $a \cdot 2^b$ for generic values of $b$ \\ | 2312 $\left [ 3 \right ] $ & Devise an algorithm that performs $a \cdot 2^b$ for generic values of $b$ \\ |

2314 & in $O(n)$ time. \\ | 2313 & in $O(n)$ time. \\ |

2462 digit since that digit is assumed to be zero at this point. However, if $ix + pb \ge digs$ the carry is not set as it would make the result | 2461 digit since that digit is assumed to be zero at this point. However, if $ix + pb \ge digs$ the carry is not set as it would make the result |

2463 exceed the precision requested. | 2462 exceed the precision requested. |

2464 | 2463 |

2465 EXAM,bn_s_mp_mul_digs.c | 2464 EXAM,bn_s_mp_mul_digs.c |

2466 | 2465 |

2467 Lines @31,[email protected] to @35,}@ determine if the Comba method can be used first. The conditions for using the Comba routine are that min$(a.used, b.used) < \delta$ and | 2466 First we determine (line @30,[email protected]) if the Comba method can be used first since it's faster. The conditions for |

2468 the number of digits of output is less than \textbf{MP\_WARRAY}. This new constant is used to control | 2467 using the Comba routine are that min$(a.used, b.used) < \delta$ and the number of digits of output is less than |

2469 the stack usage in the Comba routines. By default it is set to $\delta$ but can be reduced when memory is at a premium. | 2468 \textbf{MP\_WARRAY}. This new constant is used to control the stack usage in the Comba routines. By default it is |

2470 | 2469 set to $\delta$ but can be reduced when memory is at a premium. |

2471 Of particular importance is the calculation of the $ix+iy$'th column on lines @64,[email protected], @65,[email protected] and @66,[email protected] Note how all of the | 2470 |

2472 variables are cast to the type \textbf{mp\_word}, which is also the type of variable $\hat r$. That is to ensure that double precision operations | 2471 If we cannot use the Comba method we proceed to set up the baseline routine. We allocate the destination mp\_int |

2473 are used instead of single precision. The multiplication on line @65,) * (@ makes use of a specific GCC optimizer behaviour. On the outset it looks like | 2472 $t$ (line @36,[email protected]) to the exact size of the output to avoid further re--allocations. At this point we now |

2474 the compiler will have to use a double precision multiplication to produce the result required. Such an operation would be horribly slow on most | 2473 begin the $O(n^2)$ loop. |

2475 processors and drag this to a crawl. However, GCC is smart enough to realize that double wide output single precision multipliers can be used. For | 2474 |

2476 example, the instruction ``MUL'' on the x86 processor can multiply two 32-bit values and produce a 64-bit result. | 2475 This implementation of multiplication has the caveat that it can be trimmed to only produce a variable number of |

2476 digits as output. In each iteration of the outer loop the $pb$ variable is set (line @48,[email protected]) to the maximum | |

2477 number of inner loop iterations. | |

2478 | |

2479 Inside the inner loop we calculate $\hat r$ as the mp\_word product of the two mp\_digits and the addition of the | |

2480 carry from the previous iteration. A particularly important observation is that most modern optimizing | |

2481 C compilers (GCC for instance) can recognize that a $N \times N \rightarrow 2N$ multiplication is all that | |

2482 is required for the product. In x86 terms for example, this means using the MUL instruction. | |

2483 | |

2484 Each digit of the product is stored in turn (line @68,[email protected]) and the carry propagated (line @71,>>@) to the | |

2485 next iteration. | |

2477 | 2486 |

2478 \subsection{Faster Multiplication by the ``Comba'' Method} | 2487 \subsection{Faster Multiplication by the ``Comba'' Method} |

2479 MARK,COMBA | 2488 MARK,COMBA |

2480 | 2489 |

2481 One of the huge drawbacks of the ``baseline'' algorithms is that at the $O(n^2)$ level the carry must be computed and propagated upwards. This | 2490 One of the huge drawbacks of the ``baseline'' algorithms is that at the $O(n^2)$ level the carry must be |

2482 makes the nested loop very sequential and hard to unroll and implement in parallel. The ``Comba'' \cite{COMBA} method is named after little known | 2491 computed and propagated upwards. This makes the nested loop very sequential and hard to unroll and implement |

2483 (\textit{in cryptographic venues}) Paul G. Comba who described a method of implementing fast multipliers that do not require nested | 2492 in parallel. The ``Comba'' \cite{COMBA} method is named after little known (\textit{in cryptographic venues}) Paul G. |

2484 carry fixup operations. As an interesting aside it seems that Paul Barrett describes a similar technique in | 2493 Comba who described a method of implementing fast multipliers that do not require nested carry fixup operations. As an |

2485 his 1986 paper \cite{BARRETT} written five years before. | 2494 interesting aside it seems that Paul Barrett describes a similar technique in his 1986 paper \cite{BARRETT} written |

2486 | 2495 five years before. |

2487 At the heart of the Comba technique is once again the long-hand algorithm. Except in this case a slight twist is placed on how | 2496 |

2488 the columns of the result are produced. In the standard long-hand algorithm rows of products are produced then added together to form the | 2497 At the heart of the Comba technique is once again the long-hand algorithm. Except in this case a slight |

2489 final result. In the baseline algorithm the columns are added together after each iteration to get the result instantaneously. | 2498 twist is placed on how the columns of the result are produced. In the standard long-hand algorithm rows of products |

2490 | 2499 are produced then added together to form the final result. In the baseline algorithm the columns are added together |

2491 In the Comba algorithm the columns of the result are produced entirely independently of each other. That is at the $O(n^2)$ level a | 2500 after each iteration to get the result instantaneously. |

2492 simple multiplication and addition step is performed. The carries of the columns are propagated after the nested loop to reduce the amount | 2501 |

2493 of work required. Succinctly the first step of the algorithm is to compute the product vector $\vec x$ as follows. | 2502 In the Comba algorithm the columns of the result are produced entirely independently of each other. That is at |

2503 the $O(n^2)$ level a simple multiplication and addition step is performed. The carries of the columns are propagated | |

2504 after the nested loop to reduce the amount of work required. Succinctly the first step of the algorithm is to compute | |

2505 the product vector $\vec x$ as follows. | |

2494 | 2506 |

2495 \begin{equation} | 2507 \begin{equation} |

2496 \vec x_n = \sum_{i+j = n} a_ib_j, \forall n \in \lbrace 0, 1, 2, \ldots, i + j \rbrace | 2508 \vec x_n = \sum_{i+j = n} a_ib_j, \forall n \in \lbrace 0, 1, 2, \ldots, i + j \rbrace |

2497 \end{equation} | 2509 \end{equation} |

2498 | 2510 |

2582 \begin{tabular}{l} | 2594 \begin{tabular}{l} |

2583 \hline Algorithm \textbf{fast\_s\_mp\_mul\_digs}. \\ | 2595 \hline Algorithm \textbf{fast\_s\_mp\_mul\_digs}. \\ |

2584 \textbf{Input}. mp\_int $a$, mp\_int $b$ and an integer $digs$ \\ | 2596 \textbf{Input}. mp\_int $a$, mp\_int $b$ and an integer $digs$ \\ |

2585 \textbf{Output}. $c \leftarrow \vert a \vert \cdot \vert b \vert \mbox{ (mod }\beta^{digs}\mbox{)}$. \\ | 2597 \textbf{Output}. $c \leftarrow \vert a \vert \cdot \vert b \vert \mbox{ (mod }\beta^{digs}\mbox{)}$. \\ |

2586 \hline \\ | 2598 \hline \\ |

2587 Place an array of \textbf{MP\_WARRAY} double precision digits named $\hat W$ on the stack. \\ | 2599 Place an array of \textbf{MP\_WARRAY} single precision digits named $W$ on the stack. \\ |

2588 1. If $c.alloc < digs$ then grow $c$ to $digs$ digits. (\textit{mp\_grow}) \\ | 2600 1. If $c.alloc < digs$ then grow $c$ to $digs$ digits. (\textit{mp\_grow}) \\ |

2589 2. If step 1 failed return(\textit{MP\_MEM}).\\ | 2601 2. If step 1 failed return(\textit{MP\_MEM}).\\ |

2590 \\ | 2602 \\ |

2591 Zero the temporary array $\hat W$. \\ | 2603 3. $pa \leftarrow \mbox{MIN}(digs, a.used + b.used)$ \\ |

2592 3. for $n$ from $0$ to $digs - 1$ do \\ | |

2593 \hspace{3mm}3.1 $\hat W_n \leftarrow 0$ \\ | |

2594 \\ | 2604 \\ |

2595 Compute the columns. \\ | 2605 4. $\_ \hat W \leftarrow 0$ \\ |

2596 4. for $ix$ from $0$ to $a.used - 1$ do \\ | 2606 5. for $ix$ from 0 to $pa - 1$ do \\ |

2597 \hspace{3mm}4.1 $pb \leftarrow \mbox{min}(b.used, digs - ix)$ \\ | 2607 \hspace{3mm}5.1 $ty \leftarrow \mbox{MIN}(b.used - 1, ix)$ \\ |

2598 \hspace{3mm}4.2 If $pb < 1$ then goto step 5. \\ | 2608 \hspace{3mm}5.2 $tx \leftarrow ix - ty$ \\ |

2599 \hspace{3mm}4.3 for $iy$ from $0$ to $pb - 1$ do \\ | 2609 \hspace{3mm}5.3 $iy \leftarrow \mbox{MIN}(a.used - tx, ty + 1)$ \\ |

2600 \hspace{6mm}4.3.1 $\hat W_{ix+iy} \leftarrow \hat W_{ix+iy} + a_{ix}b_{iy}$ \\ | 2610 \hspace{3mm}5.4 for $iz$ from 0 to $iy - 1$ do \\ |

2611 \hspace{6mm}5.4.1 $\_ \hat W \leftarrow \_ \hat W + a_{tx+iy}b_{ty-iy}$ \\ | |

2612 \hspace{3mm}5.5 $W_{ix} \leftarrow \_ \hat W (\mbox{mod }\beta)$\\ | |

2613 \hspace{3mm}5.6 $\_ \hat W \leftarrow \lfloor \_ \hat W / \beta \rfloor$ \\ | |

2614 6. $W_{pa} \leftarrow \_ \hat W (\mbox{mod }\beta)$ \\ | |

2601 \\ | 2615 \\ |

2602 Propagate the carries upwards. \\ | 2616 7. $oldused \leftarrow c.used$ \\ |

2603 5. $oldused \leftarrow c.used$ \\ | 2617 8. $c.used \leftarrow digs$ \\ |

2604 6. $c.used \leftarrow digs$ \\ | 2618 9. for $ix$ from $0$ to $pa$ do \\ |

2605 7. If $digs > 1$ then do \\ | 2619 \hspace{3mm}9.1 $c_{ix} \leftarrow W_{ix}$ \\ |

2606 \hspace{3mm}7.1. for $ix$ from $1$ to $digs - 1$ do \\ | 2620 10. for $ix$ from $pa + 1$ to $oldused - 1$ do \\ |

2607 \hspace{6mm}7.1.1 $\hat W_{ix} \leftarrow \hat W_{ix} + \lfloor \hat W_{ix-1} / \beta \rfloor$ \\ | 2621 \hspace{3mm}10.1 $c_{ix} \leftarrow 0$ \\ |

2608 \hspace{6mm}7.1.2 $c_{ix - 1} \leftarrow \hat W_{ix - 1} \mbox{ (mod }\beta\mbox{)}$ \\ | |

2609 8. else do \\ | |

2610 \hspace{3mm}8.1 $ix \leftarrow 0$ \\ | |

2611 9. $c_{ix} \leftarrow \hat W_{ix} \mbox{ (mod }\beta\mbox{)}$ \\ | |

2612 \\ | 2622 \\ |

2613 Zero excess digits. \\ | 2623 11. Clamp $c$. \\ |

2614 10. If $digs < oldused$ then do \\ | 2624 12. Return MP\_OKAY. \\ |

2615 \hspace{3mm}10.1 for $n$ from $digs$ to $oldused - 1$ do \\ | |

2616 \hspace{6mm}10.1.1 $c_n \leftarrow 0$ \\ | |

2617 11. Clamp excessive digits of $c$. (\textit{mp\_clamp}) \\ | |

2618 12. Return(\textit{MP\_OKAY}). \\ | |

2619 \hline | 2625 \hline |

2620 \end{tabular} | 2626 \end{tabular} |

2621 \end{center} | 2627 \end{center} |

2622 \end{small} | 2628 \end{small} |

2623 \caption{Algorithm fast\_s\_mp\_mul\_digs} | 2629 \caption{Algorithm fast\_s\_mp\_mul\_digs} |

2624 \label{fig:COMBAMULT} | 2630 \label{fig:COMBAMULT} |

2625 \end{figure} | 2631 \end{figure} |

2626 | 2632 |

2627 \textbf{Algorithm fast\_s\_mp\_mul\_digs.} | 2633 \textbf{Algorithm fast\_s\_mp\_mul\_digs.} |

2628 This algorithm performs the unsigned multiplication of $a$ and $b$ using the Comba method limited to $digs$ digits of precision. The algorithm | 2634 This algorithm performs the unsigned multiplication of $a$ and $b$ using the Comba method limited to $digs$ digits of precision. |

2629 essentially performs the same calculation as algorithm s\_mp\_mul\_digs, just much faster. | 2635 |

2630 | 2636 The outer loop of this algorithm is more complicated than that of the baseline multiplier. This is because on the inside of the |

2631 The array $\hat W$ is meant to be on the stack when the algorithm is used. The size of the array does not change which is ideal. Note also that | 2637 loop we want to produce one column per pass. This allows the accumulator $\_ \hat W$ to be placed in CPU registers and |

2632 unlike algorithm s\_mp\_mul\_digs no temporary mp\_int is required since the result is calculated directly in $\hat W$. | 2638 reduce the memory bandwidth to two \textbf{mp\_digit} reads per iteration. |

2633 | 2639 |

2634 The $O(n^2)$ loop on step four is where the Comba method's advantages begin to show through in comparison to the baseline algorithm. The lack of | 2640 The $ty$ variable is set to the minimum count of $ix$ or the number of digits in $b$. That way if $a$ has more digits than |

2635 a carry variable or propagation in this loop allows the loop to be performed with only single precision multiplication and additions. Now that each | 2641 $b$ this will be limited to $b.used - 1$. The $tx$ variable is set to the distance past $b.used$ the variable |

2636 iteration of the inner loop can be performed independent of the others the inner loop can be performed with a high level of parallelism. | 2642 $ix$ is. This is used for the immediately subsequent statement where we find $iy$. |

2643 | |

2644 The variable $iy$ is the minimum digits we can read from either $a$ or $b$ before running out. Computing one column at a time | |

2645 means we have to scan one integer upwards and the other downwards. $a$ starts at $tx$ and $b$ starts at $ty$. In each | |

2646 pass we are producing the $ix$'th output column and we note that $tx + ty = ix$. As we move $tx$ upwards we have to | |

2647 move $ty$ downards so the equality remains valid. The $iy$ variable is the number of iterations until | |

2648 $tx \ge a.used$ or $ty < 0$ occurs. | |

2649 | |

2650 After every inner pass we store the lower half of the accumulator into $W_{ix}$ and then propagate the carry of the accumulator | |

2651 into the next round by dividing $\_ \hat W$ by $\beta$. | |

2637 | 2652 |

2638 To measure the benefits of the Comba method over the baseline method consider the number of operations that are required. If the | 2653 To measure the benefits of the Comba method over the baseline method consider the number of operations that are required. If the |

2639 cost in terms of time of a multiply and addition is $p$ and the cost of a carry propagation is $q$ then a baseline multiplication would require | 2654 cost in terms of time of a multiply and addition is $p$ and the cost of a carry propagation is $q$ then a baseline multiplication would require |

2640 $O \left ((p + q)n^2 \right )$ time to multiply two $n$-digit numbers. The Comba method requires only $O(pn^2 + qn)$ time, however in practice, | 2655 $O \left ((p + q)n^2 \right )$ time to multiply two $n$-digit numbers. The Comba method requires only $O(pn^2 + qn)$ time, however in practice, |

2641 the speed increase is actually much more. With $O(n)$ space the algorithm can be reduced to $O(pn + qn)$ time by implementing the $n$ multiply | 2656 the speed increase is actually much more. With $O(n)$ space the algorithm can be reduced to $O(pn + qn)$ time by implementing the $n$ multiply |

2642 and addition operations in the nested loop in parallel. | 2657 and addition operations in the nested loop in parallel. |

2643 | 2658 |

2644 EXAM,bn_fast_s_mp_mul_digs.c | 2659 EXAM,bn_fast_s_mp_mul_digs.c |

2645 | 2660 |

2646 The memset on line @47,[email protected] clears the initial $\hat W$ array to zero in a single step. Like the slower baseline multiplication | 2661 As per the pseudo--code we first calculate $pa$ (line @47,[email protected]) as the number of digits to output. Next we begin the outer loop |

2647 implementation a series of aliases (\textit{lines @67, [email protected], @70, [email protected] and @75,[email protected]}) are used to simplify the inner $O(n^2)$ loop. | 2662 to produce the individual columns of the product. We use the two aliases $tmpx$ and $tmpy$ (lines @61,[email protected], @62,[email protected]) to point |

2648 In this case a new alias $\_\hat W$ has been added which refers to the double precision columns offset by $ix$ in each pass. | 2663 inside the two multiplicands quickly. |

2649 | 2664 |

2650 The inner loop on lines @83,[email protected], @84,[email protected] and @85,}@ is where the algorithm will spend the majority of the time, which is why it has been | 2665 The inner loop (lines @70,[email protected] to @72,}@) of this implementation is where the tradeoff comes into play. Originally this Comba |

2651 stripped to the bones of any extra baggage\footnote{Hence the pointer aliases.}. On x86 processors the multiplication and additions amount to at the | 2666 implementation was ``row--major'' which means it adds to each of the columns in each pass. After the outer loop it would then fix |

2652 very least five instructions (\textit{two loads, two additions, one multiply}) while on the ARMv4 processors they amount to only three | 2667 the carries. This was very fast except it had an annoying drawback. You had to read an mp\_word and two mp\_digits and write |

2653 (\textit{one load, one store, one multiply-add}). For both of the x86 and ARMv4 processors the GCC compiler performs a good job at unrolling the loop | 2668 one mp\_word per iteration. On processors such as the Athlon XP and P4 this did not matter much since the cache bandwidth |

2654 and scheduling the instructions so there are very few dependency stalls. | 2669 is very high and it can keep the ALU fed with data. It did, however, matter on older and embedded CPUs where cache is often |

2655 | 2670 slower and also often doesn't exist. This new algorithm only performs two reads per iteration under the assumption that the |

2656 In theory the difference between the baseline and comba algorithms is a mere $O(qn)$ time difference. However, in the $O(n^2)$ nested loop of the | 2671 compiler has aliased $\_ \hat W$ to a CPU register. |

2657 baseline method there are dependency stalls as the algorithm must wait for the multiplier to finish before propagating the carry to the next | 2672 |

2658 digit. As a result fewer of the often multiple execution units\footnote{The AMD Athlon has three execution units and the Intel P4 has four.} can | 2673 After the inner loop we store the current accumulator in $W$ and shift $\_ \hat W$ (lines @75,W[ix]@, @78,>>@) to forward it as |

2659 be simultaneously used. | 2674 a carry for the next pass. After the outer loop we use the final carry (line @82,W[ix]@) as the last digit of the product. |

2660 | 2675 |

2661 \subsection{Polynomial Basis Multiplication} | 2676 \subsection{Polynomial Basis Multiplication} |

2662 To break the $O(n^2)$ barrier in multiplication requires a completely different look at integer multiplication. In the following algorithms | 2677 To break the $O(n^2)$ barrier in multiplication requires a completely different look at integer multiplication. In the following algorithms |

2663 the use of polynomial basis representation for two integers $a$ and $b$ as $f(x) = \sum_{i=0}^{n} a_i x^i$ and | 2678 the use of polynomial basis representation for two integers $a$ and $b$ as $f(x) = \sum_{i=0}^{n} a_i x^i$ and |

2664 $g(x) = \sum_{i=0}^{n} b_i x^i$ respectively, is required. In this system both $f(x)$ and $g(x)$ have $n + 1$ terms and are of the $n$'th degree. | 2679 $g(x) = \sum_{i=0}^{n} b_i x^i$ respectively, is required. In this system both $f(x)$ and $g(x)$ have $n + 1$ terms and are of the $n$'th degree. |

2974 Once the coefficients have been isolated, the polynomial $W(x) = \sum_{i=0}^{2n} w_i x^i$ is known. By substituting $\beta^{k}$ for $x$, the integer | 2989 Once the coefficients have been isolated, the polynomial $W(x) = \sum_{i=0}^{2n} w_i x^i$ is known. By substituting $\beta^{k}$ for $x$, the integer |

2975 result $a \cdot b$ is produced. | 2990 result $a \cdot b$ is produced. |

2976 | 2991 |

2977 EXAM,bn_mp_toom_mul.c | 2992 EXAM,bn_mp_toom_mul.c |

2978 | 2993 |

2979 -- Comments to be added during editing phase. | 2994 The first obvious thing to note is that this algorithm is complicated. The complexity is worth it if you are multiplying very |

2995 large numbers. For example, a 10,000 digit multiplication takes approximately 99,282,205 fewer single precision multiplications with | |

2996 Toom--Cook than a Comba or baseline approach (this is a savings of more than 99$\%$). For most ``crypto'' sized numbers this | |

2997 algorithm is not practical as Karatsuba has a much lower cutoff point. | |

2998 | |

2999 First we split $a$ and $b$ into three roughly equal portions. This has been accomplished (lines @40,[email protected] to @69,[email protected]) with | |

3000 combinations of mp\_rshd() and mp\_mod\_2d() function calls. At this point $a = a2 \cdot \beta^2 + a1 \cdot \beta + a0$ and similarly | |

3001 for $b$. | |

3002 | |

3003 Next we compute the five points $w0, w1, w2, w3$ and $w4$. Recall that $w0$ and $w4$ can be computed directly from the portions so | |

3004 we get those out of the way first (lines @72,[email protected] and @77,[email protected]). Next we compute $w1, w2$ and $w3$ using Horner's method. | |

3005 | |

3006 After this point we solve for the actual values of $w1, w2$ and $w3$ by reducing the $5 \times 5$ system which is relatively | |

3007 straightforward. | |

2980 | 3008 |

2981 \subsection{Signed Multiplication} | 3009 \subsection{Signed Multiplication} |

2982 Now that algorithms to handle multiplications of every useful dimension have been developed, a rather simple finishing touch is required. So far all | 3010 Now that algorithms to handle multiplications of every useful dimension have been developed, a rather simple finishing touch is required. So far all |

2983 of the multiplication algorithms have been unsigned multiplications which leaves only a signed multiplication algorithm to be established. | 3011 of the multiplication algorithms have been unsigned multiplications which leaves only a signed multiplication algorithm to be established. |

2984 | 3012 |

2985 \newpage\begin{figure}[!here] | 3013 \begin{figure}[!here] |

2986 \begin{small} | 3014 \begin{small} |

2987 \begin{center} | 3015 \begin{center} |

2988 \begin{tabular}{l} | 3016 \begin{tabular}{l} |

2989 \hline Algorithm \textbf{mp\_mul}. \\ | 3017 \hline Algorithm \textbf{mp\_mul}. \\ |

2990 \textbf{Input}. mp\_int $a$ and mp\_int $b$ \\ | 3018 \textbf{Input}. mp\_int $a$ and mp\_int $b$ \\ |

3063 | 3091 |

3064 \subsection{The Baseline Squaring Algorithm} | 3092 \subsection{The Baseline Squaring Algorithm} |

3065 The baseline squaring algorithm is meant to be a catch-all squaring algorithm. It will handle any of the input sizes that the faster routines | 3093 The baseline squaring algorithm is meant to be a catch-all squaring algorithm. It will handle any of the input sizes that the faster routines |

3066 will not handle. | 3094 will not handle. |

3067 | 3095 |

3068 \newpage\begin{figure}[!here] | 3096 \begin{figure}[!here] |

3069 \begin{small} | 3097 \begin{small} |

3070 \begin{center} | 3098 \begin{center} |

3071 \begin{tabular}{l} | 3099 \begin{tabular}{l} |

3072 \hline Algorithm \textbf{s\_mp\_sqr}. \\ | 3100 \hline Algorithm \textbf{s\_mp\_sqr}. \\ |

3073 \textbf{Input}. mp\_int $a$ \\ | 3101 \textbf{Input}. mp\_int $a$ \\ |

3119 Similar to algorithm s\_mp\_mul\_digs, after every pass of the inner loop, the destination is correctly set to the sum of all of the partial | 3147 Similar to algorithm s\_mp\_mul\_digs, after every pass of the inner loop, the destination is correctly set to the sum of all of the partial |

3120 results calculated so far. This involves expensive carry propagation which will be eliminated in the next algorithm. | 3148 results calculated so far. This involves expensive carry propagation which will be eliminated in the next algorithm. |

3121 | 3149 |

3122 EXAM,bn_s_mp_sqr.c | 3150 EXAM,bn_s_mp_sqr.c |

3123 | 3151 |

3124 Inside the outer loop (\textit{see line @32,[email protected]}) the square term is calculated on line @35,r [email protected] Line @42,>>@ extracts the carry from the square | 3152 Inside the outer loop (line @32,[email protected]) the square term is calculated on line @35,r [email protected] The carry (line @42,>>@) has been |

3125 term. Aliases for $a_{ix}$ and $t_{ix+iy}$ are initialized on lines @45,[email protected] and @48,[email protected] respectively. The doubling is performed using two | 3153 extracted from the mp\_word accumulator using a right shift. Aliases for $a_{ix}$ and $t_{ix+iy}$ are initialized |

3126 additions (\textit{see line @57,r + [email protected]}) since it is usually faster than shifting, if not at least as fast. | 3154 (lines @45,[email protected] and @48,[email protected]) to simplify the inner loop. The doubling is performed using two |

3155 additions (line @57,r + [email protected]) since it is usually faster than shifting, if not at least as fast. | |

3156 | |

3157 The important observation is that the inner loop does not begin at $iy = 0$ like for multiplication. As such the inner loops | |

3158 get progressively shorter as the algorithm proceeds. This is what leads to the savings compared to using a multiplication to | |

3159 square a number. | |

3127 | 3160 |

3128 \subsection{Faster Squaring by the ``Comba'' Method} | 3161 \subsection{Faster Squaring by the ``Comba'' Method} |

3129 A major drawback to the baseline method is the requirement for single precision shifting inside the $O(n^2)$ nested loop. Squaring has an additional | 3162 A major drawback to the baseline method is the requirement for single precision shifting inside the $O(n^2)$ nested loop. Squaring has an additional |

3130 drawback that it must double the product inside the inner loop as well. As for multiplication, the Comba technique can be used to eliminate these | 3163 drawback that it must double the product inside the inner loop as well. As for multiplication, the Comba technique can be used to eliminate these |

3131 performance hazards. | 3164 performance hazards. |

3133 The first obvious solution is to make an array of mp\_words which will hold all of the columns. This will indeed eliminate all of the carry | 3166 The first obvious solution is to make an array of mp\_words which will hold all of the columns. This will indeed eliminate all of the carry |

3134 propagation operations from the inner loop. However, the inner product must still be doubled $O(n^2)$ times. The solution stems from the simple fact | 3167 propagation operations from the inner loop. However, the inner product must still be doubled $O(n^2)$ times. The solution stems from the simple fact |

3135 that $2a + 2b + 2c = 2(a + b + c)$. That is the sum of all of the double products is equal to double the sum of all the products. For example, | 3168 that $2a + 2b + 2c = 2(a + b + c)$. That is the sum of all of the double products is equal to double the sum of all the products. For example, |

3136 $ab + ba + ac + ca = 2ab + 2ac = 2(ab + ac)$. | 3169 $ab + ba + ac + ca = 2ab + 2ac = 2(ab + ac)$. |

3137 | 3170 |

3138 However, we cannot simply double all of the columns, since the squares appear only once per row. The most practical solution is to have two mp\_word | 3171 However, we cannot simply double all of the columns, since the squares appear only once per row. The most practical solution is to have two |

3139 arrays. One array will hold the squares and the other array will hold the double products. With both arrays the doubling and carry propagation can be | 3172 mp\_word arrays. One array will hold the squares and the other array will hold the double products. With both arrays the doubling and |

3140 moved to a $O(n)$ work level outside the $O(n^2)$ level. | 3173 carry propagation can be moved to a $O(n)$ work level outside the $O(n^2)$ level. In this case, we have an even simpler solution in mind. |

3141 | 3174 |

3142 \newpage\begin{figure}[!here] | 3175 \newpage\begin{figure}[!here] |

3143 \begin{small} | 3176 \begin{small} |

3144 \begin{center} | 3177 \begin{center} |

3145 \begin{tabular}{l} | 3178 \begin{tabular}{l} |

3146 \hline Algorithm \textbf{fast\_s\_mp\_sqr}. \\ | 3179 \hline Algorithm \textbf{fast\_s\_mp\_sqr}. \\ |

3147 \textbf{Input}. mp\_int $a$ \\ | 3180 \textbf{Input}. mp\_int $a$ \\ |

3148 \textbf{Output}. $b \leftarrow a^2$ \\ | 3181 \textbf{Output}. $b \leftarrow a^2$ \\ |

3149 \hline \\ | 3182 \hline \\ |

3150 Place two arrays of \textbf{MP\_WARRAY} mp\_words named $\hat W$ and $\hat {X}$ on the stack. \\ | 3183 Place an array of \textbf{MP\_WARRAY} mp\_digits named $W$ on the stack. \\ |

3151 1. If $b.alloc < 2a.used + 1$ then grow $b$ to $2a.used + 1$ digits. (\textit{mp\_grow}). \\ | 3184 1. If $b.alloc < 2a.used + 1$ then grow $b$ to $2a.used + 1$ digits. (\textit{mp\_grow}). \\ |

3152 2. If step 1 failed return(\textit{MP\_MEM}). \\ | 3185 2. If step 1 failed return(\textit{MP\_MEM}). \\ |

3153 3. for $ix$ from $0$ to $2a.used + 1$ do \\ | |

3154 \hspace{3mm}3.1 $\hat W_{ix} \leftarrow 0$ \\ | |

3155 \hspace{3mm}3.2 $\hat {X}_{ix} \leftarrow 0$ \\ | |

3156 4. for $ix$ from $0$ to $a.used - 1$ do \\ | |

3157 \hspace{3mm}Compute the square.\\ | |

3158 \hspace{3mm}4.1 $\hat {X}_{ix+ix} \leftarrow \left ( a_{ix} \right )^2$ \\ | |

3159 \\ | 3186 \\ |

3160 \hspace{3mm}Compute the double products.\\ | 3187 3. $pa \leftarrow 2 \cdot a.used$ \\ |

3161 \hspace{3mm}4.2 for $iy$ from $ix + 1$ to $a.used - 1$ do \\ | 3188 4. $\hat W1 \leftarrow 0$ \\ |

3162 \hspace{6mm}4.2.1 $\hat W_{ix+iy} \leftarrow \hat W_{ix+iy} + a_{ix}a_{iy}$ \\ | 3189 5. for $ix$ from $0$ to $pa - 1$ do \\ |

3163 5. $oldused \leftarrow b.used$ \\ | 3190 \hspace{3mm}5.1 $\_ \hat W \leftarrow 0$ \\ |

3164 6. $b.used \leftarrow 2a.used + 1$ \\ | 3191 \hspace{3mm}5.2 $ty \leftarrow \mbox{MIN}(a.used - 1, ix)$ \\ |

3192 \hspace{3mm}5.3 $tx \leftarrow ix - ty$ \\ | |

3193 \hspace{3mm}5.4 $iy \leftarrow \mbox{MIN}(a.used - tx, ty + 1)$ \\ | |

3194 \hspace{3mm}5.5 $iy \leftarrow \mbox{MIN}(iy, \lfloor \left (ty - tx + 1 \right )/2 \rfloor)$ \\ | |

3195 \hspace{3mm}5.6 for $iz$ from $0$ to $iy - 1$ do \\ | |

3196 \hspace{6mm}5.6.1 $\_ \hat W \leftarrow \_ \hat W + a_{tx + iz}a_{ty - iz}$ \\ | |

3197 \hspace{3mm}5.7 $\_ \hat W \leftarrow 2 \cdot \_ \hat W + \hat W1$ \\ | |

3198 \hspace{3mm}5.8 if $ix$ is even then \\ | |

3199 \hspace{6mm}5.8.1 $\_ \hat W \leftarrow \_ \hat W + \left ( a_{\lfloor ix/2 \rfloor}\right )^2$ \\ | |

3200 \hspace{3mm}5.9 $W_{ix} \leftarrow \_ \hat W (\mbox{mod }\beta)$ \\ | |

3201 \hspace{3mm}5.10 $\hat W1 \leftarrow \lfloor \_ \hat W / \beta \rfloor$ \\ | |

3165 \\ | 3202 \\ |

3166 Double the products and propagate the carries simultaneously. \\ | 3203 6. $oldused \leftarrow b.used$ \\ |

3167 7. $\hat W_0 \leftarrow 2 \hat W_0 + \hat {X}_0$ \\ | 3204 7. $b.used \leftarrow 2 \cdot a.used$ \\ |

3168 8. for $ix$ from $1$ to $2a.used$ do \\ | 3205 8. for $ix$ from $0$ to $pa - 1$ do \\ |

3169 \hspace{3mm}8.1 $\hat W_{ix} \leftarrow 2 \hat W_{ix} + \hat {X}_{ix}$ \\ | 3206 \hspace{3mm}8.1 $b_{ix} \leftarrow W_{ix}$ \\ |

3170 \hspace{3mm}8.2 $\hat W_{ix} \leftarrow \hat W_{ix} + \lfloor \hat W_{ix - 1} / \beta \rfloor$ \\ | 3207 9. for $ix$ from $pa$ to $oldused - 1$ do \\ |

3171 \hspace{3mm}8.3 $b_{ix-1} \leftarrow W_{ix-1} \mbox{ (mod }\beta\mbox{)}$ \\ | 3208 \hspace{3mm}9.1 $b_{ix} \leftarrow 0$ \\ |

3172 9. $b_{2a.used} \leftarrow \hat W_{2a.used} \mbox{ (mod }\beta\mbox{)}$ \\ | 3209 10. Clamp excess digits from $b$. (\textit{mp\_clamp}) \\ |

3173 10. if $2a.used + 1 < oldused$ then do \\ | 3210 11. Return(\textit{MP\_OKAY}). \\ |

3174 \hspace{3mm}10.1 for $ix$ from $2a.used + 1$ to $oldused$ do \\ | |

3175 \hspace{6mm}10.1.1 $b_{ix} \leftarrow 0$ \\ | |

3176 11. Clamp excess digits from $b$. (\textit{mp\_clamp}) \\ | |

3177 12. Return(\textit{MP\_OKAY}). \\ | |

3178 \hline | 3211 \hline |

3179 \end{tabular} | 3212 \end{tabular} |

3180 \end{center} | 3213 \end{center} |

3181 \end{small} | 3214 \end{small} |

3182 \caption{Algorithm fast\_s\_mp\_sqr} | 3215 \caption{Algorithm fast\_s\_mp\_sqr} |

3183 \end{figure} | 3216 \end{figure} |

3184 | 3217 |

3185 \textbf{Algorithm fast\_s\_mp\_sqr.} | 3218 \textbf{Algorithm fast\_s\_mp\_sqr.} |

3186 This algorithm computes the square of an input using the Comba technique. It is designed to be a replacement for algorithm s\_mp\_sqr when | 3219 This algorithm computes the square of an input using the Comba technique. It is designed to be a replacement for algorithm |

3187 the number of input digits is less than \textbf{MP\_WARRAY} and less than $\delta \over 2$. | 3220 s\_mp\_sqr when the number of input digits is less than \textbf{MP\_WARRAY} and less than $\delta \over 2$. |

3188 | 3221 This algorithm is very similar to the Comba multiplier except for a few key differences, which we shall note. |

3189 This routine requires two arrays of mp\_words to be placed on the stack. The first array $\hat W$ will hold the double products and the second | 3222 |

3190 array $\hat X$ will hold the squares. Though only at most $MP\_WARRAY \over 2$ words of $\hat X$ are used, it has proven faster on most | 3223 First, we have an accumulator and carry variables $\_ \hat W$ and $\hat W1$ respectively. This is because the inner loop |

3191 processors to simply make it a full size array. | 3224 products are to be doubled. If we had added the previous carry in we would be doubling too much. Next we perform an |

3192 | 3225 additional MIN condition on $iy$ (step 5.5) to prevent overlapping digits. For example, $a_3 \cdot a_5$ is equal to |

3193 The loop on step 3 will zero the two arrays to prepare them for the squaring step. Step 4.1 computes the squares of the product. Note how | 3226 $a_5 \cdot a_3$. In the multiplication case both of these products would be accumulated, but since the sum of the products |

3194 it simply assigns the value into the $\hat X$ array. The nested loop on step 4.2 computes the doubles of the products. This loop | 3227 is doubled just outside the inner loop, each distinct product must be added only once. This is also a good thing since we perform |

3195 computes the sum of the products for each column. They are not doubled until later. | 3228 fewer multiplications and the routine ends up being faster. |

3196 | 3229 |

3197 After the squaring loop, the products stored in $\hat W$ must be doubled and the carries propagated forwards. It makes sense to do both | 3230 Finally, the last difference is the addition of the ``square'' term outside the inner loop (step 5.8). We add in the square |

3198 operations at the same time. The expression $\hat W_{ix} \leftarrow 2 \hat W_{ix} + \hat {X}_{ix}$ computes the sum of the double product and the | 3231 only to even outputs and it is the square of the term at the $\lfloor ix / 2 \rfloor$ position. |

3199 squares in place. | |

3200 | 3232 |

3201 EXAM,bn_fast_s_mp_sqr.c | 3233 EXAM,bn_fast_s_mp_sqr.c |

3202 | 3234 |

3203 -- Write something deep and insightful later, Tom. | 3235 This implementation is essentially a copy of Comba multiplication with the appropriate changes added to make it faster for |

3236 the special case of squaring. | |

3204 | 3237 |

3205 \subsection{Polynomial Basis Squaring} | 3238 \subsection{Polynomial Basis Squaring} |

3206 The same algorithm that performs optimal polynomial basis multiplication can be used to perform polynomial basis squaring. The minor exception | 3239 The same algorithm that performs optimal polynomial basis multiplication can be used to perform polynomial basis squaring. The minor exception |

3207 is that $\zeta_y = f(y)g(y)$ is actually equivalent to $\zeta_y = f(y)^2$ since $f(y) = g(y)$. Instead of performing $2n + 1$ | 3240 is that $\zeta_y = f(y)g(y)$ is actually equivalent to $\zeta_y = f(y)^2$ since $f(y) = g(y)$. Instead of performing $2n + 1$ |

3208 multiplications to find the $\zeta$ relations, squaring operations are performed instead. | 3241 multiplications to find the $\zeta$ relations, squaring operations are performed instead. |

3310 | 3343 |

3311 By inlining the copy and shift operations the cutoff point for Karatsuba multiplication can be lowered. On the Athlon the cutoff point | 3344 By inlining the copy and shift operations the cutoff point for Karatsuba multiplication can be lowered. On the Athlon the cutoff point |

3312 is exactly at the point where Comba squaring can no longer be used (\textit{128 digits}). On slower processors such as the Intel P4 | 3345 is exactly at the point where Comba squaring can no longer be used (\textit{128 digits}). On slower processors such as the Intel P4 |

3313 it is actually below the Comba limit (\textit{at 110 digits}). | 3346 it is actually below the Comba limit (\textit{at 110 digits}). |

3314 | 3347 |

3315 This routine uses the same error trap coding style as mp\_karatsuba\_sqr. As the temporary variables are initialized errors are redirected to | 3348 This routine uses the same error trap coding style as mp\_karatsuba\_sqr. As the temporary variables are initialized errors are |

3316 the error trap higher up. If the algorithm completes without error the error code is set to \textbf{MP\_OKAY} and mp\_clears are executed normally. | 3349 redirected to the error trap higher up. If the algorithm completes without error the error code is set to \textbf{MP\_OKAY} and |

3317 | 3350 mp\_clears are executed normally. |

3318 \textit{Last paragraph sucks. re-write! -- Tom} | |

3319 | 3351 |

3320 \subsection{Toom-Cook Squaring} | 3352 \subsection{Toom-Cook Squaring} |

3321 The Toom-Cook squaring algorithm mp\_toom\_sqr is heavily based on the algorithm mp\_toom\_mul with the exception that squarings are used | 3353 The Toom-Cook squaring algorithm mp\_toom\_sqr is heavily based on the algorithm mp\_toom\_mul with the exception that squarings are used |

3322 instead of multiplication to find the five relations. The reader is encouraged to read the description of the latter algorithm and try to | 3354 instead of multiplication to find the five relations. The reader is encouraged to read the description of the latter algorithm and try to |

3323 derive their own Toom-Cook squaring algorithm. | 3355 derive their own Toom-Cook squaring algorithm. |

3324 | 3356 |

3325 \subsection{High Level Squaring} | 3357 \subsection{High Level Squaring} |

3326 \newpage\begin{figure}[!here] | 3358 \newpage\begin{figure}[!here] |

3327 \begin{small} | 3359 \begin{small} |

3360 \section*{Exercises} | 3392 \section*{Exercises} |

3361 \begin{tabular}{cl} | 3393 \begin{tabular}{cl} |

3362 $\left [ 3 \right ] $ & Devise an efficient algorithm for selection of the radix point to handle inputs \\ | 3394 $\left [ 3 \right ] $ & Devise an efficient algorithm for selection of the radix point to handle inputs \\ |

3363 & that have different number of digits in Karatsuba multiplication. \\ | 3395 & that have different number of digits in Karatsuba multiplication. \\ |

3364 & \\ | 3396 & \\ |

3365 $\left [ 3 \right ] $ & In ~SQUARE~ the fact that every column of a squaring is made up \\ | 3397 $\left [ 2 \right ] $ & In ~SQUARE~ the fact that every column of a squaring is made up \\ |

3366 & of double products and at most one square is stated. Prove this statement. \\ | 3398 & of double products and at most one square is stated. Prove this statement. \\ |

3367 & \\ | 3399 & \\ |

3368 $\left [ 2 \right ] $ & In the Comba squaring algorithm half of the $\hat X$ variables are not used. \\ | |

3369 & Revise algorithm fast\_s\_mp\_sqr to shrink the $\hat X$ array. \\ | |

3370 & \\ | |

3371 $\left [ 3 \right ] $ & Prove the equation for Karatsuba squaring. \\ | 3400 $\left [ 3 \right ] $ & Prove the equation for Karatsuba squaring. \\ |

3372 & \\ | 3401 & \\ |

3373 $\left [ 1 \right ] $ & Prove that Karatsuba squaring requires $O \left (n^{lg(3)} \right )$ time. \\ | 3402 $\left [ 1 \right ] $ & Prove that Karatsuba squaring requires $O \left (n^{lg(3)} \right )$ time. \\ |

3374 & \\ | 3403 & \\ |

3375 $\left [ 2 \right ] $ & Determine the minimal ratio between addition and multiplication clock cycles \\ | 3404 $\left [ 2 \right ] $ & Determine the minimal ratio between addition and multiplication clock cycles \\ |

3376 & required for equation $6.7$ to be true. \\ | 3405 & required for equation $6.7$ to be true. \\ |

3406 & \\ | |

3407 $\left [ 3 \right ] $ & Implement a threaded version of Comba multiplication (and squaring) where you \\ | |

3408 & compute subsets of the columns in each thread. Determine a cutoff point where \\ | |

3409 & it is effective and add the logic to mp\_mul() and mp\_sqr(). \\ | |

3410 &\\ | |

3411 $\left [ 4 \right ] $ & Same as the previous but also modify the Karatsuba and Toom-Cook. You must \\ | |

3412 & increase the throughput of mp\_exptmod() for random odd moduli in the range \\ | |

3413 & $512 \ldots 4096$ bits significantly ($> 2x$) to complete this challenge. \\ | |

3377 & \\ | 3414 & \\ |

3378 \end{tabular} | 3415 \end{tabular} |

3379 | 3416 |

3380 \chapter{Modular Reduction} | 3417 \chapter{Modular Reduction} |

3381 MARK,REDUCTION | 3418 MARK,REDUCTION |

3392 other forms of residues. | 3429 other forms of residues. |

3393 | 3430 |

3394 Modular reductions are normally used to create either finite groups, rings or fields. The most common usage for performance driven modular reductions | 3431 Modular reductions are normally used to create either finite groups, rings or fields. The most common usage for performance driven modular reductions |

3395 is in modular exponentiation algorithms. That is to compute $d = a^b \mbox{ (mod }c\mbox{)}$ as fast as possible. This operation is used in the | 3432 is in modular exponentiation algorithms. That is to compute $d = a^b \mbox{ (mod }c\mbox{)}$ as fast as possible. This operation is used in the |

3396 RSA and Diffie-Hellman public key algorithms, for example. Modular multiplication and squaring also appears as a fundamental operation in | 3433 RSA and Diffie-Hellman public key algorithms, for example. Modular multiplication and squaring also appears as a fundamental operation in |

3397 Elliptic Curve cryptographic algorithms. As will be discussed in the subsequent chapter there exist fast algorithms for computing modular | 3434 elliptic curve cryptographic algorithms. As will be discussed in the subsequent chapter there exist fast algorithms for computing modular |

3398 exponentiations without having to perform (\textit{in this example}) $b - 1$ multiplications. These algorithms will produce partial results in the | 3435 exponentiations without having to perform (\textit{in this example}) $b - 1$ multiplications. These algorithms will produce partial results in the |

3399 range $0 \le x < c^2$ which can be taken advantage of to create several efficient algorithms. They have also been used to create redundancy check | 3436 range $0 \le x < c^2$ which can be taken advantage of to create several efficient algorithms. They have also been used to create redundancy check |

3400 algorithms known as CRCs, error correction codes such as Reed-Solomon, and to solve a variety of number theoretic problems. | 3437 algorithms known as CRCs, error correction codes such as Reed-Solomon, and to solve a variety of number theoretic problems. |

3401 | 3438 |

3402 \section{The Barrett Reduction} | 3439 \section{The Barrett Reduction} |

3608 | 3645 |

3609 \subsection{The Barrett Setup Algorithm} | 3646 \subsection{The Barrett Setup Algorithm} |

3610 In order to use algorithm mp\_reduce the value of $\mu$ must be calculated in advance. Ideally this value should be computed once and stored for | 3647 In order to use algorithm mp\_reduce the value of $\mu$ must be calculated in advance. Ideally this value should be computed once and stored for |

3611 future use so that the Barrett algorithm can be used without delay. | 3648 future use so that the Barrett algorithm can be used without delay. |

3612 | 3649 |

3613 \begin{figure}[!here] | 3650 \newpage\begin{figure}[!here] |

3614 \begin{small} | 3651 \begin{small} |

3615 \begin{center} | 3652 \begin{center} |

3616 \begin{tabular}{l} | 3653 \begin{tabular}{l} |

3617 \hline Algorithm \textbf{mp\_reduce\_setup}. \\ | 3654 \hline Algorithm \textbf{mp\_reduce\_setup}. \\ |

3618 \textbf{Input}. mp\_int $a$ ($a > 1$) \\ | 3655 \textbf{Input}. mp\_int $a$ ($a > 1$) \\ |

5816 \section{Jacobi Symbol Computation} | 5853 \section{Jacobi Symbol Computation} |

5817 To explain the Jacobi Symbol we shall first discuss the Legendre function\footnote{Also known as the Legendre symbol.} from which the Jacobi symbol is | 5854 To explain the Jacobi Symbol we shall first discuss the Legendre function\footnote{Also known as the Legendre symbol.} from which the Jacobi symbol is |

5818 defined. The Legendre function computes whether or not an integer $a$ is a quadratic residue modulo an odd prime $p$. Numerically it is | 5855 defined. The Legendre function computes whether or not an integer $a$ is a quadratic residue modulo an odd prime $p$. Numerically it is |

5819 equivalent to equation \ref{eqn:legendre}. | 5856 equivalent to equation \ref{eqn:legendre}. |

5820 | 5857 |

5858 \textit{-- Tom, don't be an ass, cite your source here...!} | |

5859 | |

5821 \begin{equation} | 5860 \begin{equation} |

5822 a^{(p-1)/2} \equiv \begin{array}{rl} | 5861 a^{(p-1)/2} \equiv \begin{array}{rl} |

5823 -1 & \mbox{if }a\mbox{ is a quadratic non-residue.} \\ | 5862 -1 & \mbox{if }a\mbox{ is a quadratic non-residue.} \\ |

5824 0 & \mbox{if }p\mbox{ divides }a\mbox{.} \\ | 5863 0 & \mbox{if }p\mbox{ divides }a\mbox{.} \\ |

5825 1 & \mbox{if }a\mbox{ is a quadratic residue}. | 5864 1 & \mbox{if }a\mbox{ is a quadratic residue}. |