1994-08-19 10:40:01 +01:00
|
|
|
/* s_cbrtf.c -- float version of s_cbrt.c.
|
|
|
|
* Conversion to float by Ian Lance Taylor, Cygnus Support, ian@cygnus.com.
|
2005-12-13 20:17:23 +00:00
|
|
|
* Debugged and optimized by Bruce D. Evans.
|
1994-08-19 10:40:01 +01:00
|
|
|
*/
|
|
|
|
|
|
|
|
/*
|
|
|
|
* ====================================================
|
|
|
|
* Copyright (C) 1993 by Sun Microsystems, Inc. All rights reserved.
|
|
|
|
*
|
|
|
|
* Developed at SunPro, a Sun Microsystems, Inc. business.
|
|
|
|
* Permission to use, copy, modify, and distribute this
|
1995-05-30 06:51:47 +01:00
|
|
|
* software is freely granted, provided that this notice
|
1994-08-19 10:40:01 +01:00
|
|
|
* is preserved.
|
|
|
|
* ====================================================
|
|
|
|
*/
|
|
|
|
|
2008-02-22 02:30:36 +00:00
|
|
|
#include <sys/cdefs.h>
|
|
|
|
__FBSDID("$FreeBSD$");
|
1994-08-19 10:40:01 +01:00
|
|
|
|
|
|
|
#include "math.h"
|
|
|
|
#include "math_private.h"
|
|
|
|
|
|
|
|
/* cbrtf(x)
|
|
|
|
* Return cube root of x
|
|
|
|
*/
|
1995-05-30 06:51:47 +01:00
|
|
|
static const unsigned
|
2005-12-11 19:51:30 +00:00
|
|
|
B1 = 709958130, /* B1 = (127-127.0/3-0.03306235651)*2**23 */
|
|
|
|
B2 = 642849266; /* B2 = (127-127.0/3-24/3-0.03306235651)*2**23 */
|
1994-08-19 10:40:01 +01:00
|
|
|
|
2002-05-28 19:15:04 +01:00
|
|
|
float
|
|
|
|
cbrtf(float x)
|
1994-08-19 10:40:01 +01:00
|
|
|
{
|
Use double precision internally to optimize cbrtf(), and change the
algorithm for the second step significantly to also get a perfectly
rounded result in round-to-nearest mode. The resulting optimization
is about 25% on Athlon64's and 30% on Athlon XP's (about 25 cycles
out of 100 on the former).
Using extra precision, we don't need to do anything special to avoid
large rounding errors in the third step (Newton's method), so we can
regroup terms to avoid a division, increase clarity, and increase
opportunities for parallelism. Rearrangement for parallelism loses
the increase in clarity. We end up with the same number of operations
but with a division reduced to a multiplication.
Using specifically double precision, there is enough extra precision
for the third step to give enough precision for perfect rounding to
float precision provided the previous steps are accurate to 16 bits.
(They were accurate to 12 bits, which was almost minimal for imperfect
rounding in the old version but would be more than enough for imperfect
rounding in this version (9 bits would be enough now).) I couldn't
find any significant time optimizations from optimizing the previous
steps, so I decided to optimize for accuracy instead. The second step
needed a division although a previous commit optimized it to use a
polynomial approximation for its main detail, and this division dominated
the time for the second step. Use the same Newton's method for the
second step as for the third step since this is insignificantly slower
than the division plus the polynomial (now that Newton's method only
needs 1 division), significantly more accurate, and simpler. Single
precision would be precise enough for the second step, but doesn't
have enough exponent range to handle denormals without the special
grouping of terms (as in previous versions) that requires another
division, so we use double precision for both the second and third
steps.
2006-01-05 07:57:31 +00:00
|
|
|
double r,T;
|
|
|
|
float t;
|
1994-08-19 10:40:01 +01:00
|
|
|
int32_t hx;
|
|
|
|
u_int32_t sign;
|
|
|
|
u_int32_t high;
|
|
|
|
|
|
|
|
GET_FLOAT_WORD(hx,x);
|
|
|
|
sign=hx&0x80000000; /* sign= sign(x) */
|
|
|
|
hx ^=sign;
|
|
|
|
if(hx>=0x7f800000) return(x+x); /* cbrt(NaN,INF) is itself */
|
|
|
|
|
|
|
|
/* rough cbrt to 5 bits */
|
2007-05-29 08:13:07 +01:00
|
|
|
if(hx<0x00800000) { /* zero or subnormal? */
|
|
|
|
if(hx==0)
|
|
|
|
return(x); /* cbrt(+-0) is itself */
|
2005-12-13 18:22:00 +00:00
|
|
|
SET_FLOAT_WORD(t,0x4b800000); /* set t= 2**24 */
|
|
|
|
t*=x;
|
|
|
|
GET_FLOAT_WORD(high,t);
|
2005-12-13 20:17:23 +00:00
|
|
|
SET_FLOAT_WORD(t,sign|((high&0x7fffffff)/3+B2));
|
2005-12-13 18:22:00 +00:00
|
|
|
} else
|
2005-12-13 20:17:23 +00:00
|
|
|
SET_FLOAT_WORD(t,sign|(hx/3+B1));
|
1994-08-19 10:40:01 +01:00
|
|
|
|
2006-01-05 09:18:48 +00:00
|
|
|
/*
|
|
|
|
* First step Newton iteration (solving t*t-x/t == 0) to 16 bits. In
|
|
|
|
* double precision so that its terms can be arranged for efficiency
|
|
|
|
* without causing overflow or underflow.
|
|
|
|
*/
|
Use double precision internally to optimize cbrtf(), and change the
algorithm for the second step significantly to also get a perfectly
rounded result in round-to-nearest mode. The resulting optimization
is about 25% on Athlon64's and 30% on Athlon XP's (about 25 cycles
out of 100 on the former).
Using extra precision, we don't need to do anything special to avoid
large rounding errors in the third step (Newton's method), so we can
regroup terms to avoid a division, increase clarity, and increase
opportunities for parallelism. Rearrangement for parallelism loses
the increase in clarity. We end up with the same number of operations
but with a division reduced to a multiplication.
Using specifically double precision, there is enough extra precision
for the third step to give enough precision for perfect rounding to
float precision provided the previous steps are accurate to 16 bits.
(They were accurate to 12 bits, which was almost minimal for imperfect
rounding in the old version but would be more than enough for imperfect
rounding in this version (9 bits would be enough now).) I couldn't
find any significant time optimizations from optimizing the previous
steps, so I decided to optimize for accuracy instead. The second step
needed a division although a previous commit optimized it to use a
polynomial approximation for its main detail, and this division dominated
the time for the second step. Use the same Newton's method for the
second step as for the third step since this is insignificantly slower
than the division plus the polynomial (now that Newton's method only
needs 1 division), significantly more accurate, and simpler. Single
precision would be precise enough for the second step, but doesn't
have enough exponent range to handle denormals without the special
grouping of terms (as in previous versions) that requires another
division, so we use double precision for both the second and third
steps.
2006-01-05 07:57:31 +00:00
|
|
|
T=t;
|
|
|
|
r=T*T*T;
|
2006-01-05 09:18:48 +00:00
|
|
|
T=T*((double)x+x+r)/(x+r+r);
|
Fixed all 502518670 errors of more than 1 ulp for cbrtf() on amd64.
The maximum error was 3.56 ulps.
The bug was another translation error. The double precision version
has a comment saying "new cbrt to 23 bits, may be implemented in
precision". This means exactly what it says -- that the 23 bit second
approximation for the double precision cbrt() may be implemented in
single (i.e., float) precision. It doesn't mean what the translation
assumed -- that this approximation, when implemented in float precision,
is good enough for the the final approximation in float precision.
First, float precision needs a 24 bit approximation. The "23 bit"
approximation is actually good to 24 bits on float precision args, but
only if it is evaluated in double precision. Second, the algorithm
requires a cleanup step to ensure its error bound.
In float precision, any reasonable algorithm works for the cleanup
step. Use the same algorithm as for double precision, although this
is much more than enough and is a significant pessimization, and don't
optimize or simplify anything using double precision to implement the
float case, so that the whole double precision algorithm can be verified
in float precision. A maximum error of 0.667 ulps is claimed for cbrt()
and the max for cbrtf() using the same algorithm shouldn't be different,
but the actual max for cbrtf() on amd64 is now 0.9834 ulps. (On i386
-O1 the max is 0.5006 (down from < 0.7) due to extra precision.)
2005-12-11 13:22:01 +00:00
|
|
|
|
2006-01-05 09:18:48 +00:00
|
|
|
/*
|
|
|
|
* Second step Newton iteration to 47 bits. In double precision for
|
|
|
|
* efficiency and accuracy.
|
|
|
|
*/
|
Use double precision internally to optimize cbrtf(), and change the
algorithm for the second step significantly to also get a perfectly
rounded result in round-to-nearest mode. The resulting optimization
is about 25% on Athlon64's and 30% on Athlon XP's (about 25 cycles
out of 100 on the former).
Using extra precision, we don't need to do anything special to avoid
large rounding errors in the third step (Newton's method), so we can
regroup terms to avoid a division, increase clarity, and increase
opportunities for parallelism. Rearrangement for parallelism loses
the increase in clarity. We end up with the same number of operations
but with a division reduced to a multiplication.
Using specifically double precision, there is enough extra precision
for the third step to give enough precision for perfect rounding to
float precision provided the previous steps are accurate to 16 bits.
(They were accurate to 12 bits, which was almost minimal for imperfect
rounding in the old version but would be more than enough for imperfect
rounding in this version (9 bits would be enough now).) I couldn't
find any significant time optimizations from optimizing the previous
steps, so I decided to optimize for accuracy instead. The second step
needed a division although a previous commit optimized it to use a
polynomial approximation for its main detail, and this division dominated
the time for the second step. Use the same Newton's method for the
second step as for the third step since this is insignificantly slower
than the division plus the polynomial (now that Newton's method only
needs 1 division), significantly more accurate, and simpler. Single
precision would be precise enough for the second step, but doesn't
have enough exponent range to handle denormals without the special
grouping of terms (as in previous versions) that requires another
division, so we use double precision for both the second and third
steps.
2006-01-05 07:57:31 +00:00
|
|
|
r=T*T*T;
|
2006-01-05 09:18:48 +00:00
|
|
|
T=T*((double)x+x+r)/(x+r+r);
|
Fixed all 502518670 errors of more than 1 ulp for cbrtf() on amd64.
The maximum error was 3.56 ulps.
The bug was another translation error. The double precision version
has a comment saying "new cbrt to 23 bits, may be implemented in
precision". This means exactly what it says -- that the 23 bit second
approximation for the double precision cbrt() may be implemented in
single (i.e., float) precision. It doesn't mean what the translation
assumed -- that this approximation, when implemented in float precision,
is good enough for the the final approximation in float precision.
First, float precision needs a 24 bit approximation. The "23 bit"
approximation is actually good to 24 bits on float precision args, but
only if it is evaluated in double precision. Second, the algorithm
requires a cleanup step to ensure its error bound.
In float precision, any reasonable algorithm works for the cleanup
step. Use the same algorithm as for double precision, although this
is much more than enough and is a significant pessimization, and don't
optimize or simplify anything using double precision to implement the
float case, so that the whole double precision algorithm can be verified
in float precision. A maximum error of 0.667 ulps is claimed for cbrt()
and the max for cbrtf() using the same algorithm shouldn't be different,
but the actual max for cbrtf() on amd64 is now 0.9834 ulps. (On i386
-O1 the max is 0.5006 (down from < 0.7) due to extra precision.)
2005-12-11 13:22:01 +00:00
|
|
|
|
Use double precision internally to optimize cbrtf(), and change the
algorithm for the second step significantly to also get a perfectly
rounded result in round-to-nearest mode. The resulting optimization
is about 25% on Athlon64's and 30% on Athlon XP's (about 25 cycles
out of 100 on the former).
Using extra precision, we don't need to do anything special to avoid
large rounding errors in the third step (Newton's method), so we can
regroup terms to avoid a division, increase clarity, and increase
opportunities for parallelism. Rearrangement for parallelism loses
the increase in clarity. We end up with the same number of operations
but with a division reduced to a multiplication.
Using specifically double precision, there is enough extra precision
for the third step to give enough precision for perfect rounding to
float precision provided the previous steps are accurate to 16 bits.
(They were accurate to 12 bits, which was almost minimal for imperfect
rounding in the old version but would be more than enough for imperfect
rounding in this version (9 bits would be enough now).) I couldn't
find any significant time optimizations from optimizing the previous
steps, so I decided to optimize for accuracy instead. The second step
needed a division although a previous commit optimized it to use a
polynomial approximation for its main detail, and this division dominated
the time for the second step. Use the same Newton's method for the
second step as for the third step since this is insignificantly slower
than the division plus the polynomial (now that Newton's method only
needs 1 division), significantly more accurate, and simpler. Single
precision would be precise enough for the second step, but doesn't
have enough exponent range to handle denormals without the special
grouping of terms (as in previous versions) that requires another
division, so we use double precision for both the second and third
steps.
2006-01-05 07:57:31 +00:00
|
|
|
/* rounding to 24 bits is perfect in round-to-nearest mode */
|
|
|
|
return(T);
|
1994-08-19 10:40:01 +01:00
|
|
|
}
|