20210718, 08:59  #34 
Jul 2021
3×11 Posts 
Never heard of of this kind of representation, what is it about? How can I use it or where to read about it?
I just slice input raw bytes of input large number into 4bit fixedsize words and that's it. Afterward I convert these 4bit words into complex numbers and do all the FFT transformations. At the end do a carry propagation. 
20210718, 09:04  #35 
P90 years forever!
Aug 2002
Yeehaw, FL
2^{2}×23×83 Posts 

20210718, 09:16  #36  
Jul 2021
100001_{2} Posts 
Quote:
Signed balanced representation could probably save 2 bits, i.e. result will be below (K  1 + K  1 + Log2(N)) bits in absolute value. Thus if Prime95 for quite large number (e.g. 2^28 bits) has words of 17 bits in size (total 2^28/17=2^24 words) then it means final coefficients will be (17  1 + 17  1 + 24)=56 bits which is more than 53bits mantissa of double64, to hold such value. Also mantissa not only should hold value of this size but also cope with all rounding errors happened after multiplications of millions of numbers. Also very interesting, if I map all numbers to signed balanced range, how then I map back final result? To me it looks strange that after mapping final result can unmap to correct values. Is this kind of mapping and unmapping fits well with operation of multiplication? Last fiddled with by moytrage on 20210718 at 09:17 

20210718, 09:23  #37  
Jun 2003
19×271 Posts 
Quote:
Quote:
Example: Let [1, 9, 12, 7] be your number in base16 (0x19C7 = 6599). Starting from lowest digit, we check: 7 > 8? No. Retain asis. 12 > 8? Yes. Use 1216 = 4. add 1 to next digit. (9+1) > 8? Yes. Use 1016 = 6. add 1 to next digit. (1+1) > 8? No. Retain 2. So the new representation is [2, 6, 4, 7]. The advantage of balanced representation is 2fold. One: the magnitudes of the digits are 1 bit less. Two: the intermmediate products are both positive and negative, hence their sum will undergo lot of cancellation and the final sumofproduct will be much smaller. Hence you can stick in much larger number of bits per limb. 

20210718, 09:40  #38  
Jun 2003
19·271 Posts 
Quote:


20210718, 09:59  #39 
Jul 2021
3×11 Posts 
If I map two numbers to signed form, then multiply them, then convert multiply result back to unsigned form  what will be the gurantee that these sequence of operations will give same result as if I just multiply two unsigned values and get unsigned result?
Is it trivially obvious that these two will give same results? Last fiddled with by moytrage on 20210718 at 10:00 
20210718, 10:33  #40 
Jul 2009
Germany
2·313 Posts 

20210718, 11:30  #41 
Jul 2021
3×11 Posts 

20210718, 12:52  #42 
Tribal Bullet
Oct 2004
3·1,181 Posts 
As a general rule, when performing a convolution of size 2^N of words with size B bits, the convolution results require 2*B+N bits to represent exactly. When using double precision floating point you get a 53bit floating point mantissa to fit these results. Suppose your numbers to multiply are of size 100M digits ~ 330M bits (<2^29). This can be represented as 2^17 x 12 bit words or 2^16 x 13bit words, or some other combination. If using the latter combination then a convolution will produce results with 42bit words, which easily fits in a 53bit floating point mantissa with extra bits to round each element correctly to an integer. So you can afford to be much more aggressive choosing B than you are concluding.
Now this is an approximation to the real sizes of things; when doing general multiplies the number of words in the product is twice as large so the FFT needs to be of size 2^(N+1). Using balanced representation gives two bonuses: the sign bit of the floating point representation gives an extra bit of mantissa to fit the results, but much more importantly the convolution results now form a distribution centered about 0, which allows choosing a B that is large enough to risk a convolution result that fails to be correctly rounded, provided the probability that such a result appears is deemed small enough to risk. This is one of the reasons Prime95 can use 19bit words with very large transforms, despite the error analysis above indicating it would be too dangerous. The other thing to conclude from this is that an NTT modulus around 2^64 is nearly the same size as a floating point mantissa of size 2^53, so if you are using much smaller B in your FFT then you need to look over the analysis you did to choose the sizes, both kinds of transforms should admit nearly the same size arrays of words. Now it's possible to use the CRT and construct very large convolution results out of comparatively very small NTTs, but the time and memory for postprocessing to apply the CRT grows basically quadratically increasing numbers of CRT moduli. My experience over a long time trying to make NTTs competitive with FFTs is that nothing you do with NTTs can overcome the latency of 64bit integer multiplies compared to the pipeline throughput of double precision floating point multiplies. For a while I thought GPU computing would be an exception, since consumergrade GPUs have severely limited double precision throughput that could make NTTs competitive, but an NTT on GPU would require 3132 bit moduli and the CRT and there are very few such moduli that have a root of 2 of order large enough to make the requisite size NTT. See here for my last attempt. Last fiddled with by jasonp on 20210720 at 17:43 Reason: fix convolution size 
20210718, 13:07  #43 
Jul 2021
3·11 Posts 

20210718, 20:21  #44  
∂^{2}ω=0
Sep 2002
República de California
2^{2}·5·11·53 Posts 
Quote:
That paper does not delve into the details of just why balanceddigit representation is as good as it is in reducing roundoff errors. In fact it's due both to said representation and use of a fast algorithm such as FFT for computing the resulting convolution. Start with a balanceddigit vector input and compute the convolution using both a matrixmultiply DFT and FFT. In exact arithmetic, the outputs are the same. In practice, the DFT approach gives much higher roundoff errors. Attached is the relevant snip from a paper the above poster jasonp and I have some familiarity with (this section was written by me, so any errors in it are on me, not him): 

Thread Tools  
Similar Threads  
Thread  Thread Starter  Forum  Replies  Last Post 
PRP on gpu is faster that on cpu  indomit  Information & Answers  4  20201007 10:50 
faster than LL?  paulunderwood  Miscellaneous Math  13  20160802 00:05 
My CPU is getting faster and faster ;)  lidocorc  Software  2  20081108 09:26 
Faster way to do LLT?  1260  Miscellaneous Math  23  20050904 07:12 
Faster than LL?  clowns789  Miscellaneous Math  3  20040527 23:39 