Avoid reevaluating arguments in _round, _constrain, _sign.#529
Conversation
|
Nice. Would you have time to compare |
|
I could try it. The point of the template function is the make sure that each expression is only evaluated once. So for example if you have _constrain(complicated_expr, low, high), you don't want to do if |
|
Okay, replacing with builtins gives another 600ns speedup on a baseline of 9500ns for loopFOC (for position control mode in foc_current torque control). These implementations should be C++11 compatible. Let me know if it is acceptable and I will amend the commit. Note that I have no way of testing that this works across different compilers so if you have a test suite to hand please feel welcome to run it. template<typename T>
constexpr inline int _sign(T val) {
return __builtin_signbit(val);
}
#ifndef _round
template<typename T>
constexpr inline std::enable_if<std::is_same<T, double>::value, long>::type _round(T x) {
return __builtin_round(x);
}
template<typename T>
constexpr inline std::enable_if<std::is_same<T, float>::value, long>::type _round(T x) {
return __builtin_roundf(x);
}
#endif
// Use enable_if to select the fastest implementation according to the amt type.
// Using __builtin_fXf is measurably faster than using the ternary approach.
template<typename T, typename L, typename H>
constexpr inline std::enable_if<std::is_integral<T>::value, T>::type _constrain(T amt, L low, H high) {
return (amt < low) ? low : (amt > high) ? high : amt;
}
template<typename T, typename L, typename H>
constexpr inline std::enable_if<std::is_same<T, float>::value, T>::type _constrain(T amt, L low, H high) {
return __builtin_fmaxf(low, __builtin_fminf(high, amt));
}
template<typename T, typename L, typename H>
constexpr inline std::enable_if<std::is_same<T, double>::value, T>::type _constrain(T amt, L low, H high) {
return __builtin_fmax(low, __builtin_fmin(high, amt));
} |
|
Wow, thank you so much for trying it - like I thought, even better :-) Personally I’m very tempted to do it like that… what do you think? Although very cool I think I’d leave out the template magic that selects the fastest one. But also up for discussion. |
I don't think we can, because if anyone is compiling without -ffast-math (which is probably a lot of people because it's not on by default), the __builtin_fmax/_fmin/_round calls would add double promotion and software double rounding. (Will await your agreement or further feedback on this point before continuing.) |
0dfd803 to
920d0ae
Compare
|
Looks like the AVR compiler indeed does not have type_traits, and thus we can't use enable_if for everything. My idea right now is to keep the enable_if code but to use the non-enable_if code for AVR. |
_round, _constrain and _sign currently reevaluate their arguments, which wastes CPU cycles. In torque/foc_current mode, this fix saves about 300ns/foc loop (about 2% of total CPU usage). In other modes, especially estimated_current, it saves more.
|
Not sure what the deal is with the arduino_due_x compile -- seems to be a problem with min and max which should not have changed due to my code. Let me know if anything jumps out at you. |
|
Personally I'd rather not take that first step onto the slippery slope toward the horror that is "modern C++". The builtins are a tantalizing treat for an optimization nut like myself, but SimpleFOC is getting too bloated as it is, so I'd rather push back toward minimalism and readability for people to learn how FOC works rather than pure focus on performance. My first question is, why is the compiler not optimizing it in the first place? The only moderately-complex expressions I see on a find-in-files of _constrain are the lag estimate The setPwm functions are wasteful anyway, doing two constrains when the voltage limit should (#536) always be below voltage_power_supply anyway. So I'd refactor to something like this: Still a nested ternary, but easier for a dumb compiler to see the repetition, and not that many wasted comparisons in any case. The lag estimate is such a sprawl that it should be split onto two lines anyway, and then can do like this to avoid the issue. Or pull it out into a separate function since it's the same in estimated_current and dc_current. The similar inductance calculations in foc_current mode could also be split onto two lines like that, but I've never understood why those should be there in the first place. The effect of phase lag is that some of the should-be q current ends up in the d axis, but since the current PIDs are already adjusting the voltage to correct for it, modifying d voltage afterward should overcorrect and cause field weakening instead. There are some instances of two arithmetic operations in some hardware-specific driver code like this: But if we want to accommodate braindead compilers then we should be doing like this anyway: |
|
You could definitely adjust all the callsites to pass variables only. IMO
it violates the principle of least surprise that something that looks like
a function reevaluated its arguments. It’s ugly to have to work around this
by extracting things into variables. Functions implemented as macros are
universally limited this way in C and C++.
I also dont think that having library functions implemented properly can
rightly be called “bloated.” What makes software hard to understand is the
complexity of the interface. This new constrain actually decreases the
interface’s complexity. Because now something that looks like a function
behaves like one. (This is the osterhaut a philosophy of software design
way of looking at things.)
…On Wed, May 27, 2026 at 10:27 AM dekutree64 ***@***.***> wrote:
*dekutree64* left a comment (simplefoc/Arduino-FOC#529)
<#529 (comment)>
Personally I'd rather not take that first step onto the slippery slope
toward the horror that is "modern C++". The builtins are a tantalizing
treat for an optimization nut like myself, but SimpleFOC is getting too
bloated as it is, so I'd rather push back toward minimalism and readability
for people to learn how FOC works rather than pure focus on performance.
My first question is, why is the compiler not optimizing it in the first
place? The only moderately-complex expressions I see on a find-in-files of
_constrain are the lag estimate _constrain(
-current_sp*shaft_velocity*pole_pairs*axis_inductance.q, -voltage_limit,
voltage_limit) and stepper setPwm functions
_constrain(abs(Ualpha)/voltage_power_supply,0.0f,1.0f). The rest have at
most one arithmetic operation, so if a compiler can't see that opportunity,
this would only be a drop in the bucket of micro-optimization to be done.
The setPwm functions are wasteful anyway, doing two constrains when the
voltage limit *should* (#536
<#536>) always be below
voltage_power_supply anyway. So I'd refactor to something like this:
void StepperDriver4PWM::setPwm(float Ualpha, float Ubeta) {
float duty_cycle1A(0.0f),duty_cycle1B(0.0f),duty_cycle2A(0.0f),duty_cycle2B(0.0f);
float limit = min(voltage_limit, voltage_power_supply); // Sanity check. voltage_limit should always be less than voltage_power_supply
float duty1 = min(abs(Ualpha), limit)/voltage_power_supply;
float duty2 = min(abs(Ubeta), limit)/voltage_power_supply;
if(Ualpha > 0) duty_cycle1B = duty1; else duty_cycle1A = duty1;
if(Ubeta > 0) duty_cycle2B = duty2; else duty_cycle2A = duty2;
_writeDutyCycle4PWM(duty_cycle1A, duty_cycle1B, duty_cycle2A, duty_cycle2B, params);
}
Still a nested ternary, but easier for a dumb compiler to see the
repetition, and not that many wasted comparisons in any case.
The lag estimate is such a sprawl that it should be split onto two lines
anyway, and then can do like this to avoid the issue.
if(_isset(axis_inductance.q)) {
voltage.d = -current_sp*shaft_velocity*pole_pairs*axis_inductance.q;
voltage.d = _constrain( voltage.d, -voltage_limit, voltage_limit) + feed_forward_voltage.d
} else
voltage.d = feed_forward_voltage.d;
Or pull it out into a separate function since it's the same in
estimated_current and dc_current.
The similar inductance calculations in foc_current mode could also be
split onto two lines like that, but I've never understood why those should
be there in the first place. The effect of phase lag is that some of the
should-be q current ends up in the d axis, but since the current PIDs are
already adjusting the voltage to correct for it, modifying d voltage
afterward should overcorrect and cause field weakening instead.
There are some instances of two arithmetic operations in some
hardware-specific driver code like this:
int pwm_h = _constrain(val-dead_time/2,0,255);
int pwm_l = _constrain(val+dead_time/2,0,255);
But if we want to accommodate braindead compilers then we should be doing
like this anyway:
dead_time /= 2;
int pwm_h = _constrain(val-dead_time,0,255);
int pwm_l = _constrain(val+dead_time,0,255);
—
Reply to this email directly, view it on GitHub
<#529?email_source=notifications&email_token=AAGDGTAZIPYZE2RMBARH3VL444JPNA5CNFSNUABFM5UWIORPF5TWS5BNNB2WEL2JONZXKZKDN5WW2ZLOOQXTINJVGY2DQOJRHE4KM4TFMFZW63VGMF2XI2DPOKSWK5TFNZ2KYZTPN52GK4S7MNWGSY3L#issuecomment-4556489198>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAGDGTH33BQ5GPUUVVXZS63444JPNAVCNFSM6AAAAACYGKAYAOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHM2DKNJWGQ4DSMJZHA>
.
Triage notifications, keep track of coding agent tasks and review pull
requests on the go with GitHub Mobile for iOS
<https://github.com/notifications/mobile/ios/AAGDGTF4MEHP73LN4QHNBC3444JPNA5CNFSNUABFM5UWIORPF5TWS5BNNB2WEL2JONZXKZKDN5WW2ZLOOQXTINJVGY2DQOJRHE4KM4TFMFZW63VGMF2XI2DPOKSWK5TFNZ2KUZTPN52GK4S7NFXXG>
and Android
<https://github.com/notifications/mobile/android/AAGDGTAFV5KKJPQUK4XOVH3444JPNA5CNFSNUABFM5UWIORPF5TWS5BNNB2WEL2JONZXKZKDN5WW2ZLOOQXTINJVGY2DQOJRHE4KM4TFMFZW63VGMF2XI2DPOKSWK5TFNZ2K4ZTPN52GK4S7MFXGI4TPNFSA>.
Download it today!
You are receiving this because you authored the thread.Message ID:
***@***.***>
|
|
Yeah, if it won't even optimize a single operation in the first macro argument then we should probably just do it your way. It is proper, but nonetheless costly in terms of comprehensibility and maintenance. I've been writing C++ for 27 years and even I don't know what |
|
It is definitely unfortunate that we have to do it this way. More recent
versions of C++ have a simpler way of saying this but in Arduino we can
only rely on C++11. In modern C++ I’d just replace this with std::clamp.
What that template does is evaluate the first argument (is_same::value —
are these two things the same type) and then if so it becomes the type of
the second argument. (If the first argument returns false the value
inference fails and the template is dropped from overload consideration. )
That being said, I understand not everyone needs this to perform the way I
need it to. If you guys don’t want to take up this change it’s fine with
me. I’ll keep it in my fork. :)
…On Wed, May 27, 2026 at 1:25 PM dekutree64 ***@***.***> wrote:
*dekutree64* left a comment (simplefoc/Arduino-FOC#529)
<#529 (comment)>
Yeah, if it won't even optimize a single operation in the first macro
argument then we should probably just do it your way. It is proper, but
nonetheless costly in terms of comprehensibility and maintenance. I've been
writing C++ for 27 years and even I don't know what typename
std::enable_if<std::is_same<T, float>::value, long>::type does, so I hate
to subject newbie programmers to it on a function that's used frequently
throughout the code and thus very frustrating if you don't understand how
it works. And if there's ever an error involving those functions it will be
far more difficult to track down through the sprawl of template error
messages. And since there are so few occurrences in the code where it
really matters, I'd rather kick the can down the road rather than paying
the price upfront and continuously from now on.
—
Reply to this email directly, view it on GitHub
<#529?email_source=notifications&email_token=AAGDGTEQYDQO5GCVK365J2T4446MLA5CNFSNUABFM5UWIORPF5TWS5BNNB2WEL2JONZXKZKDN5WW2ZLOOQXTINJVG44TINZZGM4KM4TFMFZW63VGMF2XI2DPOKSWK5TFNZ2KYZTPN52GK4S7MNWGSY3L#issuecomment-4557947938>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAGDGTFT6CKZ5USSHBMHDL34446MLAVCNFSM6AAAAACYGKAYAOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHM2DKNJXHE2DOOJTHA>
.
Triage notifications, keep track of coding agent tasks and review pull
requests on the go with GitHub Mobile for iOS
<https://github.com/notifications/mobile/ios/AAGDGTHMNHURBMSWNUXZWJ34446MLA5CNFSNUABFM5UWIORPF5TWS5BNNB2WEL2JONZXKZKDN5WW2ZLOOQXTINJVG44TINZZGM4KM4TFMFZW63VGMF2XI2DPOKSWK5TFNZ2KUZTPN52GK4S7NFXXG>
and Android
<https://github.com/notifications/mobile/android/AAGDGTFDITAMHMTRZ4UOPCL4446MLA5CNFSNUABFM5UWIORPF5TWS5BNNB2WEL2JONZXKZKDN5WW2ZLOOQXTINJVG44TINZZGM4KM4TFMFZW63VGMF2XI2DPOKSWK5TFNZ2K4ZTPN52GK4S7MFXGI4TPNFSA>.
Download it today!
You are receiving this because you authored the thread.Message ID:
***@***.***>
|
Description
_round, _constrain and _sign currently reevaluate their arguments, which wastes CPU cycles. In torque/foc_current mode, this fix saves about 300ns/foc loop (about 2% of total FOC CPU usage). In other modes, especially estimated_current, it saves more.
Type of change
How Has This Been Tested?
I tested this on on my STM32G474 FOC bench setup.
Test Configuration/Setup: