HoG’s apple?

I may have spotted one last low-hanging fruit in the orchard, and it may be gold but there’s a chance it’s just brass.

Currently, motion and sampling processing run in series on a single thread.  Critically, the sampling code waits for the next data ready interrupt.  That’s wasted time the motion processor could put to good use, but it can’t because the code is blocked waiting for that interrupt.  When 10 sets of samples have been gathered, the sampling code hands back to the motion code; the motion code takes 3ms, and during that time nobody is waiting for the data ready interrupt, so valid samples are lost.

Running sampling and motion processing on separate Python threads doesn’t work because they run on a single CPU with the Python interpreter’s scheduler running the show.  Throw in the GIL, and they’re effectively still running serially.

Now many many years ago, I wrote a kernel driver for an SDLC device.  It had an ISR – an interrupt service routine – which got called at higher priority when a new packet arrived; it did a tiny bit of processing and exited, allowing the main thread to process the new data the ISR had caught and cached.

That was in the kernel, but I think something very similar is possible using the GPIO library in user-land.  It’s possible to register a callback that is made when the data ready interrupt occurs.  In the meantime, motion processing runs freely.  The interrupt handler / callback runs on a separate OS (not Python) thread, so hopefully the GIL won’t spoil things.

So here’s how it should work:

  • A data_ready callback gets registered to read the sensors – it’s triggered by the rising edge of the hardware data ready interrupt.
  • Once installed, the callback is made every 1ms as each new set of data becomes ready, and that set of sensor data is read.
  • The callback normally just caches the data but every 10 samples, it copies the batch into some memory the motion thread is watching, and kicks it into life by sending a unix signal (SIGUSR1).
  • The motion thread sits waiting for that signal (signal.pause()) – just like the “threading” code does now.
  • Once received it has 10ms to process the data before the next batch comes through from the callback – that’s plenty of time.
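The steps above can be sketched roughly as follows.  This is only a sketch of the plan, not the real code: `read_sensors()` is a stand-in for the actual I2C batch read, and the pin number is illustrative.  The `GPIO.add_event_detect()` registration at the bottom is the real RPi.GPIO call, shown commented out since it needs hardware.

```python
import os
import signal

SAMPLES_PER_BATCH = 10

samples = []       # cache filled by the callback between hand-overs
shared_batch = []  # memory the motion thread watches

def read_sensors():
    # Stand-in for the real I2C read of one set of sensor data.
    return (0.0,) * 6   # ax, ay, az, gx, gy, gz placeholders

def data_ready_callback(channel):
    # Called by the GPIO library's own (C-level) thread on each rising
    # edge of the data ready pin, roughly every 1ms.
    samples.append(read_sensors())
    if len(samples) == SAMPLES_PER_BATCH:
        # Every 10 samples, copy the batch over and kick the motion
        # thread awake with SIGUSR1.
        shared_batch[:] = samples
        del samples[:]
        os.kill(os.getpid(), signal.SIGUSR1)

# Registration against real hardware would look like:
# GPIO.add_event_detect(DATA_READY_PIN, GPIO.RISING,
#                       callback=data_ready_callback)
```

The motion thread then just installs a SIGUSR1 handler and loops on signal.pause(), copying shared_batch out when woken.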

The subtle difference is that the waiting is happening in the kernel scheduler rather than in the Python scheduler, meaning the Python motion code can run in between each new data ready interrupt callback.

From looking at the GPIO code, the callback thread is not a python thread and should not be affected by the GIL nor the overhead of python threading.  Which means the motion processing can happen at python level while sensor sampling happens on a ‘C’ level thread.

It’s definitely worth a go.

If it works, it also lets me climb nearer to the moral high ground: I’ll be able to revert to the standard RPi.GPIO library rather than my version, which performance-tuned the wait_for_edge() call.  The callback function doesn’t have the same inefficiencies.  One of my guiding principles for this project was to keep it pure, so it feels good to return to the one true path to purity and enlightenment.  Fingers crossed!



Despicable me!

After discussion on the PyPy dev e-mail alias, it seems that to get PyPy performance, I need to change GPIO and RPIO from using the CPython ‘C’ API to using CFFI to call the GPIO / RPIO ‘C’ code from PyPy (or CPython, come to that).  It’s the CPython ‘C’ API that’s the performance hog for anything but CPython.  But it’s not entirely clear to me how to use CFFI on the GPIO / RPIO ‘C’ code.  I don’t think it’s tricky, but I’m ignorant, so there’s a lot of learning to do.  I think some googling’s needed to find some examples.
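From what I’ve gathered so far, the basic CFFI pattern is: declare the C signatures with cdef(), open the shared library with dlopen(), and call straight through.  A minimal sketch, using the C library’s abs() as a stand-in for the GPIO ‘C’ functions (the real GPIO cdef / dlopen lines would follow the same pattern):

```python
from cffi import FFI

ffi = FFI()
ffi.cdef("int abs(int);")   # declare the C function's signature
C = ffi.dlopen(None)        # None loads the standard C library

print(C.abs(-42))           # → 42
```

The same three steps should apply to the GPIO ‘C’ code once it’s built as a shared library, with its functions listed in the cdef() string.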

There’s a plan B though: kitty uses an OS FIFO to read data from the picamera on a separate thread while the camera thread is still filming.  This works well.  I could do the same thing, moving motion processing to a separate thread (though it’ll probably need to be a process due to the GIL) and feed it sensor data over a FIFO.  The motion process just waits on a select() for the next batch of sensor data, and processes it (taking ~ 3ms) while the next batch of sensor data is collected and averaged (taking ~ 10ms).  No data is lost, and timing is wholly driven by the MPU sampling rate.  I’m pretty certain this could work.
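The motion side of plan B might look something like this.  It’s a sketch under assumptions: process_motion and the 3-double batch format are illustrative stand-ins, and the FIFO itself would be created elsewhere with os.mkfifo().

```python
import os
import select
import struct

DOUBLES_PER_BATCH = 3                 # e.g. averaged ax, ay, az
BATCH_BYTES = DOUBLES_PER_BATCH * 8   # 8 bytes per double

processed = []

def process_motion(batch):
    # Stand-in for the ~3ms motion processing pass.
    processed.append(batch)

def motion_loop(fd):
    while True:
        # Block in the kernel scheduler, not the Python one, until the
        # sampling side writes the next batch to the FIFO.
        select.select([fd], [], [])
        data = os.read(fd, BATCH_BYTES)
        if not data:                  # writer closed its end
            break
        process_motion(struct.unpack("%dd" % DOUBLES_PER_BATCH, data))
```

Because the wait happens in select() down in the kernel, the sampling process carries on collecting the next batch while this one is processed.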

But that led me to realize there’s a dirty, no, absolutely filthy and perhaps despicable hack I can do.  The motion processing consistently takes just over 3ms.  If I add that time to the 10ms taken to sample the sensors 10 times, then assuming the sensor readings in those 10 samples are pretty consistent (I’m assuming this already to some extent by averaging them), including the 3ms of motion processing will make the velocity integration more accurate when there is acceleration.  Essentially it’s interpolating the 3 missing samples lost during motion processing based upon the average of the 10 that were collected successfully.  I could do the same for missed data samples too.
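The hack boils down to one line of integration: the batch’s average acceleration is integrated over the full 13ms (10ms of samples plus the 3ms of motion processing), which is what interpolates the missed samples.  A sketch with illustrative names and constants:

```python
SAMPLE_PERIOD = 0.010   # 10 samples at the 1kHz sampling rate
MOTION_PERIOD = 0.003   # measured motion processing time

def integrate_batch(velocity, accel_avg):
    # Integrate the batch's average acceleration over the sampling
    # period *plus* the motion processing time, effectively
    # interpolating the ~3 samples lost while motion processing ran.
    return velocity + accel_avg * (SAMPLE_PERIOD + MOTION_PERIOD)
```

The same trick extends to missed data samples by widening the period by however much time the gap covered.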

So I tried it out, and after a few minor tweaks, it worked.  I wanted to get a video to back up all these boring words, but by then, her main LiPo was running low, and she got all wibbly wobbly at that point.  I’ll try later when it’s up to full charge again.

So for now, dirty will do nicely, though I do intend to try the FIFO method too, initially with threads to get the code working, and then I’ll move over to using processes which is a little bit trickier.

I2C error cause diagnosed?

There are two threads in the code, one reading the sensors and one doing the motion processing.  The sensor thread runs full speed, and the motion thread gets kicked at 71Hz to process the batch of data recorded since last time.

The sensor thread sits waiting for the data ready interrupt and then immediately reads the data over I2C.  But what if the motion thread cuts in in between?  That’s not as unlikely as it sounds, because of the way the GIL works.  If it happens, the reading of the sensors is deferred more than 1ms after the data ready interrupt, which means duff data could be read or there may be an I2C error.  That could well be the reason for the very long pulse shown in the ‘scope output.

I’d considered adding thread locks to the code when I introduced the threads, but decided it was unnecessary because the threads are completely independent except for the point where the sensor thread copies a batch of data from its own variables into a shared structure, and kicks the motion thread which copies it out from the shared structure.  Because the motion thread is woken at 71Hz, but its processing takes only a few milliseconds, locking isn’t necessary.

But if my speculation is right, I do need some locking to prevent the motion thread being run in that tiny gap between interrupt and reading the sensors*.  So that’s the next step.
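Something along these lines is what I have in mind – a sketch only, where wait_for_edge and the read / process functions are stand-ins passed in for illustration.  Both threads take the same lock, so the motion thread can’t be scheduled into the window between the data ready edge and the I2C read:

```python
import threading

i2c_lock = threading.Lock()
batches = []

def read_batch(wait_for_edge, read_sensors):
    # Sensor thread: hold the lock from the edge wait through the I2C
    # read, so the motion thread can't run in that gap.
    with i2c_lock:
        wait_for_edge()
        batches.append(read_sensors())

def process_batch(process_motion):
    # Motion thread: takes the same lock around its ~3ms of work.
    with i2c_lock:
        if batches:
            process_motion(batches.pop(0))
```

This doesn’t stop the OS itself preempting between interrupt and read (see the footnote), but it does stop the Python-level motion code from doing so.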

I couldn’t resist flying her after all this testing, even without the locking, to see the effects of the other changes, and she just had two consistent flights.  The first only had an I2C error in warm-up, and the flight was perfect.  The second also had an I2C error mid-flight, at which point she started drifting slowly right, which backs up my theory.

I’m really quite excited at what effect the locking will show!

*There is still a risk the OS will get in the way at the point between the interrupt and reading the data, and this is where the necessity of a real-time OS steps in. At last I understand why DIY quadders are so dogmatic about needing Arduinos or a real-time OS but never being able to explain why in any solid way. My apologies for doubting your dogma.

Oh silly scope!

Here’s a capture from my ‘scope of the data ready interrupt pin.  I’ve configured it so that it goes high when there is data ready, and drops back low when the data is read by the code:

Data ready interrupt

The display is 20ms wide, so there should be 20 pulses shown if everything were perfect.  The narrow pulses are perfect.  But there are a couple of interesting bits:

  1. There are several missing pulses, most obviously between 14 and 18ms – that suggests the sensor is not consistently setting the interrupt pin high at the configured 1kHz sampling rate.
  2. There is one wide pulse between 6 and 9ms – that indicates an extended period between the sensor raising the interrupt pin high and the code reading the data.  That could be because Linux / Raspbian / Python is not real time, or because the motion processing thread was running, taking longer than 1ms, and preventing the sensors from being read due to the Python GIL blocking threads.  That’s to be expected, and there’s not a lot I can do without switching to another Python implementation – Cython or PyPy rather than CPython.

There’s nothing I can do about the gaps – that’s a sensor problem.  The code already attempts to compensate, though, by timing the gaps between reads and using that time for integration; the result is then averaged when passed periodically to the motion processing code.  Still, it’s a worry that gaps this large exist.
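The compensation might look something like this (a sketch – the per-read dt measurement and the names are assumptions): each reading carries the real elapsed time since the previous one, so a missed interrupt simply widens its dt, and the batch average is time-weighted accordingly.

```python
def average_batch(samples):
    # samples: list of (reading, dt) pairs, where dt is the measured
    # time since the previous read - a missed interrupt shows up as a
    # dt of ~2ms rather than the nominal 1ms.
    total_dt = sum(dt for _, dt in samples)
    weighted = sum(reading * dt for reading, dt in samples)
    # Time-weighted average over the batch, plus the real span it covers.
    return weighted / total_dt, total_dt
```

The motion code then integrates the average over total_dt, so a gap costs accuracy only if the reading changed during it.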

It’s reassuring that all the pulses shown are at 1ms intervals, so there are no bogus spikes causing the I2C errors during flight.  The flight logs show only one I2C error, which happened during warm-up; to some extent that has lesser impact, as that period is used to load up the Butterworth filters and, at the end, extract the take-off platform slope and the resultant gravity distribution and calibration across the quad axes.

After all the tinkering I’ve been up to, I think it’s worth a flight just to see if anything is better as a result.

Releasing the GIL in Python ‘C’ libraries

I found there was a way to release the GIL inside Python ‘C’ libraries while they block, using the standard CPython macro pair:

        Py_BEGIN_ALLOW_THREADS
        // do something synchronous here like epoll_wait();
        Py_END_ALLOW_THREADS

My variant of (and the mainline) GPIO library does this when it blocks – specifically when it’s waiting for the hardware interrupt pulse with GPIO.edge_detect_wait() (mine) and GPIO.wait_for_edge() (standard version).  I wondered whether releasing the GIL at this point was actually bad: once released, the Python interpreter lets another thread run for another 100 bytecode instructions before switching again, and I postulated this could possibly cause missed sensor reads.

So I updated my library code, and ran the thread+signal variant of the code – once more 800Hz, just like the serialized version.  Oh well, it was worth a try.

For now then, I’ve definitely run out of ideas about threading and have moved on to vibration damping; the point of 1kHz sensor reads is integrating out noise / vibration – if the vibration is physically damped, then the sensor read frequency becomes a lot less important.  More on that tomorrow, probably.

GIL + Linux scheduling

On a whim, I disabled the motion processing code, so only the sensor reading code was running – it achieved 930Hz.  That’s still not the 1kHz of data that’s available, but a huge improvement on the 800Hz that was my best with motion processing included.

That pretty much confirms the GIL is spoiling the multithreading.

WRT 930Hz vs 1kHz, the likely cause is Linux scheduling – only an RTOS or a microcontroller could fix that, though I did have a try at increasing the niceness of the code – sadly, os.setpriority() is only supported on Python 3 and I’m using 2.
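For what it’s worth, os.nice() does exist on Python 2 as well and adjusts the same scheduler niceness, so that might be worth a try – though raising priority (a negative increment) needs root:

```python
import os

# A zero increment leaves the niceness unchanged and just returns the
# current value; a negative increment (higher priority) requires root,
# e.g. os.nice(-10).
current = os.nice(0)
print(current)
```

Whether a higher priority actually claws back the missing 70Hz is another question – the scheduler still isn’t real time.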

For the moment then, I’ll call it quits on threads / processing speed, and carry on elsewhere.

GIL wins, five – love.

Just finished the change which keeps motion processing in the main thread and collects / integrates data in the separate thread.  Data is copied between the two, and a signal is sent to the main thread when new data is ready.  No explicit locking is in use.

The result?  800Hz – i.e. exactly the same as the completely serial data integration + motion processing code.  And it’s all the GIL’s fault.

Oh well, at least I tried.