Skip to content

Avoid reevaluating arguments in _round, _constrain, _sign.#529

Open
jaguilar wants to merge 1 commit into
simplefoc:devfrom
jaguilar:opt
Open

Avoid reevaluating arguments in _round, _constrain, _sign.#529
jaguilar wants to merge 1 commit into
simplefoc:devfrom
jaguilar:opt

Conversation

@jaguilar
Copy link
Copy Markdown

Description

_round, _constrain and _sign currently reevaluate their arguments, which wastes CPU cycles. In torque/foc_current mode, this fix saves about 300ns/foc loop (about 2% of total FOC CPU usage). In other modes, especially estimated_current, it saves more.

Type of change

  • Optimization

How Has This Been Tested?

I tested this on on my STM32G474 FOC bench setup.

Test Configuration/Setup:

  • Hardware: STM32G474, custom 3BLDC driver
  • IDE: VSCode & arm-none-eabi-gcc
  • MCU package version (stm32duino/arduino-esp32/..)

@runger1101001
Copy link
Copy Markdown
Member

Nice.
You're saying the ternary operator isn't optimised as well as the simpler structure in the template function?

Would you have time to compare __builtin_round() with the self-implemented one? That might be another possible option?

@jaguilar
Copy link
Copy Markdown
Author

I could try it.

The point of the template function is the make sure that each expression is only evaluated once. So for example if you have _constrain(complicated_expr, low, high), you don't want to do if (complicated_expr < low) ? low : (complicated_expr > high) ? high : complicated_expr, since in the non-clamped case you have to reevaluate the expression three times.

@jaguilar
Copy link
Copy Markdown
Author

jaguilar commented Apr 29, 2026

Okay, replacing with builtins gives another 600ns speedup on a baseline of 9500ns for loopFOC (for position control mode in foc_current torque control). These implementations should be C++11 compatible. Let me know if it is acceptable and I will amend the commit.

Note that I have no way of testing that this works across different compilers so if you have a test suite to hand please feel welcome to run it.

template<typename T>
constexpr inline int _sign(T val) {
  return __builtin_signbit(val);
}

#ifndef _round
template<typename T>
constexpr inline std::enable_if<std::is_same<T, double>::value, long>::type _round(T x) {
  return __builtin_round(x);
}
template<typename T>
constexpr inline std::enable_if<std::is_same<T, float>::value, long>::type _round(T x) {
  return __builtin_roundf(x);
}
#endif

// Use enable_if to select the fastest implementation according to the amt type.
// Using __builtin_fXf is measurably faster than using the ternary approach.
template<typename T, typename L, typename H>
constexpr inline std::enable_if<std::is_integral<T>::value, T>::type _constrain(T amt, L low, H high) {
  return (amt < low) ? low : (amt > high) ? high : amt;
}
template<typename T, typename L, typename H>
constexpr inline std::enable_if<std::is_same<T, float>::value, T>::type _constrain(T amt, L low, H high) {
  return __builtin_fmaxf(low, __builtin_fminf(high, amt));
}
template<typename T, typename L, typename H>
constexpr inline std::enable_if<std::is_same<T, double>::value, T>::type _constrain(T amt, L low, H high) {
  return __builtin_fmax(low, __builtin_fmin(high, amt));
}

@runger1101001
Copy link
Copy Markdown
Member

runger1101001 commented Apr 29, 2026

Wow, thank you so much for trying it - like I thought, even better :-)

Personally I’m very tempted to do it like that… what do you think?
The built ins are GCC specific. But I’m not aware of any Arduino platform that does not use gcc… so I think we’re fine for compatibility in our ecosystem and it’s only people porting to CubeIDE and Keil compilers that might have troubles…

Although very cool I think I’d leave out the template magic that selects the fastest one. But also up for discussion.

@jaguilar
Copy link
Copy Markdown
Author

jaguilar commented Apr 29, 2026

Although very cool I think I’d leave out the template magic that selects the fastest one. But also up for discussion.

I don't think we can, because if anyone is compiling without -ffast-math (which is probably a lot of people because it's not on by default), the __builtin_fmax/_fmin/_round calls would add double promotion and software double rounding. (Will await your agreement or further feedback on this point before continuing.)

@jaguilar jaguilar force-pushed the opt branch 2 times, most recently from 0dfd803 to 920d0ae Compare April 30, 2026 14:17
@jaguilar
Copy link
Copy Markdown
Author

Looks like the AVR compiler indeed does not have type_traits, and thus we can't use enable_if for everything. My idea right now is to keep the enable_if code but to use the non-enable_if code for AVR.

_round, _constrain and _sign currently reevaluate their arguments, which
wastes CPU cycles. In torque/foc_current mode, this fix saves about
300ns/foc loop (about 2% of total CPU usage). In other modes, especially
estimated_current, it saves more.
@jaguilar
Copy link
Copy Markdown
Author

Not sure what the deal is with the arduino_due_x compile -- seems to be a problem with min and max which should not have changed due to my code. Let me know if anything jumps out at you.

@dekutree64
Copy link
Copy Markdown
Contributor

Personally I'd rather not take that first step onto the slippery slope toward the horror that is "modern C++". The builtins are a tantalizing treat for an optimization nut like myself, but SimpleFOC is getting too bloated as it is, so I'd rather push back toward minimalism and readability for people to learn how FOC works rather than pure focus on performance.

My first question is, why is the compiler not optimizing it in the first place? The only moderately-complex expressions I see on a find-in-files of _constrain are the lag estimate _constrain( -current_sp*shaft_velocity*pole_pairs*axis_inductance.q, -voltage_limit, voltage_limit) and stepper setPwm functions _constrain(abs(Ualpha)/voltage_power_supply,0.0f,1.0f). The rest have at most one arithmetic operation, so if a compiler can't see that opportunity, this would only be a drop in the bucket of micro-optimization to be done.

The setPwm functions are wasteful anyway, doing two constrains when the voltage limit should (#536) always be below voltage_power_supply anyway. So I'd refactor to something like this:

void StepperDriver4PWM::setPwm(float Ualpha, float Ubeta) {
  float duty_cycle1A(0.0f),duty_cycle1B(0.0f),duty_cycle2A(0.0f),duty_cycle2B(0.0f);
  float limit = min(voltage_limit, voltage_power_supply); // Sanity check. voltage_limit should always be less than voltage_power_supply
  float duty1 = min(abs(Ualpha), limit)/voltage_power_supply;
  float duty2 = min(abs(Ubeta), limit)/voltage_power_supply;
  if(Ualpha > 0) duty_cycle1B = duty1; else duty_cycle1A = duty1;
  if(Ubeta > 0) duty_cycle2B = duty2; else duty_cycle2A = duty2;
  _writeDutyCycle4PWM(duty_cycle1A, duty_cycle1B, duty_cycle2A, duty_cycle2B, params);
}

Still a nested ternary, but easier for a dumb compiler to see the repetition, and not that many wasted comparisons in any case.

The lag estimate is such a sprawl that it should be split onto two lines anyway, and then can do like this to avoid the issue.

      if(_isset(axis_inductance.q)) {
        voltage.d = -current_sp*shaft_velocity*pole_pairs*axis_inductance.q;
        voltage.d = _constrain( voltage.d, -voltage_limit, voltage_limit) + feed_forward_voltage.d
      } else
        voltage.d = feed_forward_voltage.d;

Or pull it out into a separate function since it's the same in estimated_current and dc_current.

The similar inductance calculations in foc_current mode could also be split onto two lines like that, but I've never understood why those should be there in the first place. The effect of phase lag is that some of the should-be q current ends up in the d axis, but since the current PIDs are already adjusting the voltage to correct for it, modifying d voltage afterward should overcorrect and cause field weakening instead.

There are some instances of two arithmetic operations in some hardware-specific driver code like this:

  int pwm_h = _constrain(val-dead_time/2,0,255);
  int pwm_l = _constrain(val+dead_time/2,0,255);

But if we want to accommodate braindead compilers then we should be doing like this anyway:

  dead_time /= 2;
  int pwm_h = _constrain(val-dead_time,0,255);
  int pwm_l = _constrain(val+dead_time,0,255);

@jaguilar
Copy link
Copy Markdown
Author

jaguilar commented May 27, 2026 via email

@dekutree64
Copy link
Copy Markdown
Contributor

Yeah, if it won't even optimize a single operation in the first macro argument then we should probably just do it your way. It is proper, but nonetheless costly in terms of comprehensibility and maintenance. I've been writing C++ for 27 years and even I don't know what typename std::enable_if<std::is_same<T, float>::value, long>::type does, so I hate to subject newbie programmers to it on a function that's used frequently throughout the code and thus very frustrating if you don't understand how it works. And if there's ever an error involving those functions it will be far more difficult to track down through the sprawl of template error messages. And since there are so few occurrences in the code where it really matters, I'd rather kick the can down the road rather than paying the price upfront and continuously from now on.

@jaguilar
Copy link
Copy Markdown
Author

jaguilar commented May 27, 2026 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants