Debugging a Smoothieware bug
I actually had a specific bug in mind when I started looking into ARM Cortex-M tracing features. I use Smoothieware to drive my CNC3020, which I'm making a 3D printer option for. When I started increasing the printing speed, a few problems appeared.
Because the CNC3020 has uses leadscrews for motion, high speeds and 16x microstepping result in step speeds of around 30 ksteps/s. Because all parts are metal, it also doesn't give way to sudden jerks in motion, instead causing lost steps on the motors. Generally this is handled by the motion control software, which controls acceleration and deceleration. Unfortunately smoothie seems to have one or more bugs which cause jerks in between motion commands.
Below is a video of the typical case. Visually everything seems ok. If there is a few lost steps, it is only 0.1 mm or so, and the error becomes visible only when it builds up over many layers. However, if you listen to the audio (or look at the audio visualization), there is clearly a glitch there.
Hardware connections and trace config
To enable tracing on the LPC1769 processor, I needed access to the SWDIO, SWCLK and SWO pins. These were only available on SMD pads, but soldering a few wires resolved that. In addition, I wanted to see the step and direction signals of the motors, so I soldered those also and connected everything to a flat cable connector which fits my logic analyzer.
I chose to use OpenOCD to do the trace config, so that I didn't have to edit the smoothie firmware for every test. The example script worked after just removing the STM32 specific parts.
The Smoothieboard runs at 100 MHz, while my logic analyzer captures at 24 MSps. The SWO clock frequency has to be a factor to both of these frequencies to get a clean capture. This means that 4 MHz is the highest usable frequency. So I set TPI_ACPR to 24, giving a divider of 25 (100 / 25 = 4).
For first tests, I decided to use just ITM capture. However, as soon as the motors were moving, I started getting a lot of overflow packets:
The stepping interrupts were simply occurring too fast. ITM takes 3 bytes to report each interrupt entry/exit, which at 4 MHz baudrate means 6 microseconds, or 12 microseconds for complete interrupt. Smoothie has interrupts occurring a lot more often than that.
Getting a faster logic analyzer would allow running the trace output at 100 MHz. However, it is also possible to just slow down the CPU frequency. As long as the whole CPU is run at a slower speed, everything should happen with the same timings as at full speed. So I lowered CPU clocks to 4 MHz with a simple command in the OpenOCD script, and set TPI_ACPR to 0 to keep the trace output at 4 MHz also.
Let the bug-hunt begin
I had naturally already looked on the smoothie bugtracker to see if there was anything relevant. At the point where I had finished the software side for tracing, someone had posted a similar looking issue #617. It had an easily reproducible test case attached - great!
So I enabled ITM tracing and took a capture in the PulseView GUI. And yes, there were plenty of gaps like this:
To fully capture the motion, I had selected the capture length as 100 Msamples. However, sigrok decodes the signal from beginning to end, instead of just a selectable region, so it took over a minute to decode the whole trace. I'm not going to wait a minute! So let's use sigrok-cli on the command line to cut up the relevant part of the file:
sigrok-cli -i capture.sr -O binary | tail -c +98400000 | head -c 1000000 > foo sigrok-cli -I binary:samplerate=300000000 -i foo -o cut.sr
First line converts the data into binary file, one byte per sample. Looking at the image above, the relevant part began at about 8200 milliseconds. My samplerate was 12 MSps, so that gives sample number 98400000, which goes into the tail command. Then the pipe uses head to take the first 1 million samples after that, and everything into a temporary file.
The second line converts the binary file back to sigrok format so that it can be opened in the PulseView GUI. At the same time, I reset the samplerate to 300 MHz. This allows me to compare the timings directly to what they would be when the CPU is running at 100 MHz instead of being slowed down for tracing.
From the trace we can see the IRQ1 step interrupt occurring and setting the step signal. The shorter IRQ after it is IRQ2 "unstep" interrupt, which just clears the signal. Where the large gap is, the PendSV interrupt is continuously running.
It turns out that Smoothie always makes a pause of some length between two motion control blocks. The PendSV is calculating the step rates for the new block, and no steps are issued in the meantime. Also found out from the project authors, sometimes one of the axes finishes later due to small inaccuracies in acceleration, in which case all the other axes wait for it.
So this seemed terribly ugly to me. I jumped ahead implementing the "great solution" or "terrible kludge" of keeping the same step rate until the new block is ready, and then compensating for the steps taken in the meantime. This wasn't welcomed very well by the Smoothieware authors, but who cares, I can always use it for myself.
However, what I didn't consider was the length of the pause. In the capture above, it is less than 100 microseconds. In the issue report, the longest pause reported is 350 microseconds. Even at 30 ksteps/s, this pause lasts only 10 steps or less than 1 complete step. Doesn't sound long enough to cause the trouble I was seeing. Back to tracing..
Catching the random bug
I knew from past experience that the pauses I was getting were somewhat random, but when they occurred they occurred near the same spots of the gcode. So I started printing a gcode file that had had problems before. When the problems started occurring, I ran the OpenOCD script to slow down the CPU and start tracing.
Even with 100 Msample long traces, it took a few trials before I caught the gap in the trace. I would've never caught it in a debugger, but now I had it:
So this time it was 2000 microseconds long and step rate was 10ksteps/s before and after, so the gap was 20 microsteps or a bit over 1 full step. Possibly enough to cause lost steps, and if it does make 2 millisecond gaps, it probably makes 3-4 millisecond gaps also.
But zooming in it looked different from the prior issue. All axes had finished moving, and also PendSV wasn't occurring until near the end of the trace. The code in Stepper.cpp was only waiting for x/y/z axes, all of which I had on the graph.
What was delaying it? Was it a totally separate issue?
Well, ETM trace should tell us. I reconfigured for ETM capture, and a few trials later I had a trace of a similar gap. Near the beginning of the gap, I found this:
So it was passing that "all motors stopped" if condition at Stepper.cpp:217, and it was calling Block::release(). However, the block wasn't triggering the "block complete" event like it should have.
Something else was holding the block. Knowing a bit about Smoothie's architecture, it was quite easy to guess at this point that it was the Extruder module. It was a bit of a "d'oh" moment when I realized I had not connected the extruder step signal to the logic analyzer, but, such things happen. Fortunately it is simple to confirm the reason from the trace also:
Above bit was found near the end of the gap. There the Extruder gets notified that the extruder stepper has completed the movement. It calls the Block::release(), the event gets triggered and the wheels start rolling from there on.
The extruder speed is connected to the main stepper speeds. Overall all acceleration values for the motors are computed beforehand, and the slight rounding errors will only accumulate over the length of the move, resulting in large timing differences at the end of the move.
Fixing the bug
Having found the problem, it is only a matter of deciding how to fix it. I guess the most reasonable way is to modify the acceleration calculation so that the differences between axes will tend to grow smaller rather than larger.
I haven't yet made the fix, but atleast now I know what the problem is, and also have a way to check that the eventual fix works as planned.
– Petteri Aimonen on 7.3.2015