I’ve just managed to get my Quadcopter code to run under PyPy rather than CPython – that means the code is compiled on the fly as it runs (Just In Time or JIT) rather than interpreted line by line. Sadly, this took the performance down to 58% rather than the 95% I’d achieved with CPython 🙁
However, the PyPy code in the standard Raspbian distribution is very out of date (version 2.2.1 compared to the current 2.6) so there’s more investigation to be done.
In passing, I also updated the Raspbian distribution (sudo apt-get dist-upgrade) installed on Phoebe, and amazingly, that has taken me to about 98%!
Time to go see if that’s made a real difference…
I’m pretty sure that my interpreted CPython code is as efficient as possible. So if I want to capture all samples (for accurate integration) rather than the 95% I currently get, I need to make the motion processing code take less than 1ms, and the most obvious currently-available solution (until the RPi A2 is launched) is to move from the interpreted CPython to the compiled Cython or PyPy.
I considered this a long time ago when trying to speed up the code, but in the end, I didn’t need to make the move as various tweaks to the code* improved the performance by a factor of 5 or 6.
But now is the time to make that leap of faith. I’ll update you on progress.
*Primary performance enhancements:
- customized GPIO library to optimize rising edge performance for the data ready hardware interrupt
- run sampling at 1kHz but only run motion processing for each batch of 10 (averaged) samples.
- minimized calls to time.time() to just time stamping each batch of 10 samples – another irony that calling time.time() is the most time consuming call in this code.
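Stripped of the real sensor and PID details, the shape of that batching loop might be sketched like this – edge_detect_wait() is the custom GPIO call mentioned above, and the sensor read and motion step here are hypothetical stand-ins:

```python
import time

def wait_for_data_ready():
    # Stand-in for the custom GPIO edge_detect_wait() hardware-interrupt
    # wait; here it just paces the loop at roughly 1kHz.
    time.sleep(0.001)

def read_sample():
    # Stand-in for the 14-byte I2C sensor read.
    return (1.0, 2.0, 3.0)

def process_motion(batch_time, samples):
    # Stand-in for the PID motion processing: average each axis.
    n = len(samples)
    return [sum(axis) / n for axis in zip(*samples)]

BATCH = 10
samples = []
results = []
for _ in range(30):                   # 30 samples -> 3 motion updates
    wait_for_data_ready()             # block on the data-ready interrupt
    samples.append(read_sample())
    if len(samples) == BATCH:
        now = time.time()             # one time stamp per batch of 10
        results.append(process_motion(now, samples))
        samples = []
print(results)
```

The point is the shape: time.time() appears once per batch, not once per sample, and motion processing only runs every tenth read.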
After some code tweaks, I’m now getting every single piece of data – yes, that’s right, I’m reading data at 1kHz!
Primary problem was integrating samples rather than averaging them. I’d found before that calling time.time() was actually a very time consuming call. By just adding samples instead of integrating them, I only need to call time.time() in each motion processing loop rather than per sample. It’s that which has taken my sampling rate to 1kHz – or to put it another way, the python code can now read the sensors faster than the sensors can churn out the data, so my code sits there twiddling its thumbs waiting for the data.
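A toy illustration of that arithmetic, assuming evenly spaced samples: integrating per sample needs a time stamp on every read, but summing first and scaling by one dt at the end gives the same answer with a single time.time() call per batch.

```python
# Toy numbers: 10 gyro readings at a nominal 1ms spacing.
samples = [0.5, 0.6, 0.4, 0.5, 0.5, 0.7, 0.3, 0.5, 0.6, 0.4]
dt = 0.001

# Per-sample integration: one (pretend) time stamp per read.
integrated = 0.0
for s in samples:
    integrated += s * dt

# Sum first, scale once: one time stamp per batch of 10.
batch = sum(samples) * dt

print(integrated, batch)
```

Both come out identical for evenly spaced samples, which is why the per-sample time.time() call could be dropped.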
I’ve not tracked down the cause for the occasional I2C miss, so I’ve also moved the try / except wrapper from around the I2C read to around the (interrupt + I2C read). That forces the code to wait for the next sample before trying to read it again. That’s the cause for the spike in the middle.
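In sketch form the change looks like this – the names are illustrative stand-ins for the real edge wait and I2C read – with the try wrapping both calls, a failed read falls through to the next data-ready pulse rather than immediately re-reading:

```python
class I2CError(IOError):
    # Stand-in for the I2C read failure being caught.
    pass

# Scripted reads: the first attempt fails, the next two succeed.
reads = [I2CError(), (1, 2, 3), (4, 5, 6)]

def edge_detect_wait():
    # Stand-in for the hardware data-ready interrupt wait.
    pass

def i2c_read():
    r = reads.pop(0)
    if isinstance(r, I2CError):
        raise r
    return r

samples = []
while len(samples) < 2:
    try:
        # try now wraps interrupt + read: on a miss, the loop waits
        # for the next data-ready pulse instead of retrying at once.
        edge_detect_wait()
        samples.append(i2c_read())
    except I2CError:
        continue
print(samples)
```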
Combining the above with my recent super-duper fast GPIO edge_detect_wait() update listening for a single 50us pulse per call, once more, I can climb back up on my trusty steed and blow raspberries at the RTOS dogma.
Code’s up on GitHub.
If you haven’t already, please read the previous BYOAQ-BAT articles first – just search for the BYOAQ-BAT tag.
This is another short article about getting the code to run as fast as possible. I’ve taken numerous steps to ensure the code itself is fast including:
- i2c 1 x 14 byte read rather than 14 x 1 byte reads
- time.time() called once per cycle at the point the next sample is read – there’s an irony that calling time.time() takes a lot of time!
- calculations of values used multiple times are done once – see the various matrices for an example
- apply calibration / scaling once per motion not once per sample
- separating data collection and motion processing into separate threads (although to be honest this is buggered by the GIL)
- data logging to RAM disk and copied to SD card on flight completion
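On the first point, the single 14-byte read covers accelerometer (6 bytes), temperature (2) and gyro (6) in one transaction – on the MPU6050 those registers are contiguous from ACCEL_XOUT_H (0x3B), so one smbus read_i2c_block_data() call gets the lot. Unpacking looks roughly like this; the raw bytes below are a fabricated stand-in for what the bus call would return:

```python
import struct

# Fourteen raw bytes as one bus.read_i2c_block_data(0x68, 0x3B, 14)
# call would return: ax, ay, az, temperature, gx, gy, gz - all
# big-endian signed 16-bit values.
raw = [0x10, 0x00, 0x00, 0x20, 0x40, 0x00,   # accel x, y, z
       0x0B, 0xB8,                           # temperature
       0x00, 0x64, 0xFF, 0x9C, 0x00, 0x00]   # gyro x, y, z

ax, ay, az, temp, gx, gy, gz = struct.unpack('>7h', bytearray(raw))
print(ax, ay, az, temp, gx, gy, gz)
```

One bus transaction instead of fourteen saves both the per-call Python overhead and thirteen rounds of I2C addressing on the wire.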
but without the best from the OS, they are pointless.
We’ve already overclocked the CPU and given as much memory as possible to the CPU. Two last steps.
Get GPIO.tgz from GitHub if you don’t have it already. Do the following:
tar xvf GPIO.tgz
sudo apt-get remove python-rpi.gpio
sudo python setup.py install
That installs the GPIO replacement which is fast enough to catch the 50μs hardware interrupts from the sensors to say there is new data ready.
Then to read the data as fast as possible, you’ll need to edit /boot/config.txt adding:
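The post doesn’t show the exact line at this point; on recent firmware the device-tree parameter would be something like the following – the 400kHz value is my assumption, anything comfortably above the sensor’s 112kbps data rate should do:

```
# /boot/config.txt - raise the I2C bus speed above its 100kbps default
dtparam=i2c_arm_baudrate=400000
```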
Together I’m able to get about 700 loops per second, each reading 14 bytes of data (a data rate of 78.4kbps), and process them. The sensor updates this data at 1kHz or 112kbps, which is why the baudrate needs to be increased from its default of 100kbps.
Overclocking! To some extent, I’d been resistant to overclocking, partly because of the risk to Phoebe, but also because, to me, it spoils her purity!
Anyway, I’ve just enabled moderate overclocking to 900MHz from 700, and bought myself another 100 cycles per second, taking it up to 900Hz sample rate. Definitely time for a test flight, and despite what the forecast says for today (18mph wind), the view from the window says still and sunny!
I’m using the “thread + signal” code, and the first flight was much the same as with the last single-threaded code and to be honest, exactly what I’d expected: some drift, which Phoebe spotted and stopped eventually, but still ending with a collision into a brick wall.
On the plus side, her vertical speed now seems to be under control, which is great; next step is a few more flights tweaking the
- motion PID gains to see if I can get quicker braking / return home for drift
- dlpf from 21Hz / 8.5ms lag to 44Hz / 4.9ms lag in sensor readings.
Don’t think I’ll be able to squeeze that in today though as we’re having friends for lunch with a nice Chianti 😉
P.S. Sorry for the flood of posts so far this year – as you might have guessed, I’m on a bit of a roll at the moment!
I found there was a way to release the GIL inside Python ‘C’ libraries:
Py_BEGIN_ALLOW_THREADS
// do something synchronous here like epoll_wait();
Py_END_ALLOW_THREADS
My variant of (and the mainline) GPIO library does this when it blocks – specifically when it’s waiting for the hardware interrupt pulse with GPIO.edge_detect_wait() (mine) and GPIO.wait_for_edge() (standard version). I wondered whether actually releasing the GIL at this point was bad: once released, the Python interpreter processes another 100 bytecode instructions before releasing it again, and I postulated this could possibly cause missed sensor reads.
So I updated my library code, and ran the thread+signal variant of the code – once more 800Hz, just like the serialized version. Oh well, it was worth a try.
For now then I’ve definitely run out of ideas about threading and have moved onto vibration damping; the point of 1kHz sensor reads is integrating out noise / vibration – if the vibration is physically damped, then the sensor read frequency becomes a lot less important. More on that tomorrow, probably.
On a whim, I disabled the motion processing code, so only the sensor reading code was running – it achieved 930Hz – still not the 1kHz data that’s available, but a huge improvement on the 800Hz as my best with motion processing included.
That pretty much confirms the GIL is spoiling the multithreading.
WRT 930Hz vs 1kHz, the likely cause is Linux scheduling – only an RTOS or microcontroller could fix that, though I did have a try at increasing the niceness of the code – sadly os.setpriority() is only supported in Python 3 and I’m using 2.
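For what it’s worth, os.nice() does exist on Python 2 as well as 3 – it adds its argument to the process’s niceness rather than setting an absolute value, and raising priority (a negative increment) needs root. A hedged sketch of that alternative:

```python
import os

# os.nice(n) adds n to the current niceness and returns the new value.
# A negative increment (higher priority) requires root, so fall back
# gracefully when running unprivileged.
try:
    niceness = os.nice(-10)    # ask for higher scheduling priority
except OSError:
    niceness = os.nice(0)      # unprivileged: just read the current value
print(niceness)
```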
For the moment then, I’ll call it quits on threads / processing speed, and carry on elsewhere.
Just finished the change which keeps motion processing in the main thread, and collects / integrates data in the separate thread. Data is copied between the two, and a signal sent to the main thread when new data is ready. No explicit locking is in use.
The result?….800Hz – i.e. exactly the same as the completely serial data integration + motion processing code. And it’s all GIL’s fault.
Oh well, at least I tried.
The target here is to collect every piece of data made available from the sensors at 1kHz (1000Hz).
Here’s what I’ve found:
If data collection and motion processing run serially in one thread, then data collection happens at 800Hz. Top shows that python is taking 60% of the CPU power; together that suggests there is ‘space’ to achieve the 1kHz data collection rate (just!).
Moving the motion processing to a separate thread using the low level thread and lock functions reduced performance to 500Hz – extremely disappointing given the amount of time the recode took me (fiddly bits about which thread owned and initialized variables, and synchronizing the threads at startup, shutdown, and batch data handover).
Moving the motion processing to a separate thread using the higher level threading and Event function increased performance to 700Hz – better than 500Hz, but still slower than the 800Hz for the single threaded code.
I’m sure that when I initially hacked the data collection to a separate thread as my first quick trial, performance was increased to 900Hz, though I’m not sure how I did this, nor how I dealt with inter-thread transfer of the sensor data from the data collection to the motion processing thread.
Another thing to try is to move data collection to the new thread as with the initial hack, and have it signal the main thread (running motion processing) when a new batch of data is ready. Signalling can’t be done the other way round: all signals go to the main thread, so the main thread has to do the motion processing because that’s the thread that can be woken by a signal. The motion (main) thread blocks in signal.pause() pending the periodic SIGUSR1 raised by the data collection thread. I’ll give it a go as I have nothing better to do, but it’s another full rework of the threading code, which is very fiddly and frustrating.
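Sketched minimally in modern Python 3 (the 1kHz sensor loop replaced by a hypothetical batch generator, and pthread_kill used so the signal is directed at the main thread), that handover might look like this:

```python
import signal
import threading
import time

# Shared batch buffer: the collector thread writes it, the main thread
# reads it; 'seq' lets the main thread spot a fresh batch and ignore
# duplicate wake-ups.
batch = {'seq': 0, 'data': None}
done = threading.Event()

def wake(signum, frame):
    # The handler itself does nothing: SIGUSR1's only job is to
    # interrupt signal.pause() in the main thread.
    pass

signal.signal(signal.SIGUSR1, wake)
main_thread = threading.main_thread()

def collector(num_batches):
    # Hypothetical stand-in for the sensor loop: sum 10 'samples'
    # into a batch, publish it, then wake the main thread.
    for seq in range(1, num_batches + 1):
        batch['data'] = [seq] * 10            # pretend: summed samples
        batch['seq'] = seq
        signal.pthread_kill(main_thread.ident, signal.SIGUSR1)
        time.sleep(0.01)
    # Signals aren't queued, so a wake-up can be lost while the main
    # thread is mid-processing: keep nudging until it has drained all.
    while not done.is_set():
        signal.pthread_kill(main_thread.ident, signal.SIGUSR1)
        time.sleep(0.005)

results = []
t = threading.Thread(target=collector, args=(5,))
t.start()

seen = 0
while seen < 5:
    signal.pause()              # block until the collector's SIGUSR1
    if batch['seq'] > seen:     # fresh batch: do the 'motion processing'
        seen = batch['seq']
        results.append(sum(batch['data']))

done.set()
t.join()
print(results)
```

This is a sketch, not the flight code: no explicit locking, just the publish-then-signal pattern described above.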
I’ll report when it’s done.
With my recent GPIO changes raising the data acquisition rate from 450Hz to 800Hz, suddenly it seemed it might be possible to capture data from the MPU6050 at the rate it provides it @ 1kHz.
The gating factor is the motion processing which takes over 1ms and therefore when run serially with the data acquisition causes missed data reads; the obvious solution was to run the motion processing in parallel, pushing it out to a separate thread; there’s more than enough oomph in the Raspberry Pi processor to run both threads in ‘parallel’ without one detectably affecting the other.
The idea was to have one thread dedicated to data collection / integration. Periodically, it’d send a batch of integrated data to the motion processing thread. Just a crude lock controls the transfer of this data: the collection thread owns it most of the time; the motion thread is given it when there’s a new batch of data to process. Once the motion thread has copied over the batch of data, the lock goes back to the data collection thread, and the motion processing thread performs its magic on its own private copy.
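A minimal sketch of that crude-lock handover, with hypothetical stand-ins for the collection and motion work:

```python
import threading
import time

lock = threading.Lock()     # the crude handover lock from the text
shared_batch = []           # the batch in transit between the threads
results = []

def collector(num_batches):
    # Stand-in for the data collection / integration thread: build each
    # batch under the lock, then pause so the motion thread can copy it.
    global shared_batch
    for i in range(num_batches):
        with lock:
            shared_batch = [i] * 10        # pretend: 10 integrated samples
        time.sleep(0.005)

def motion(num_batches):
    # Stand-in for the motion processing thread: hold the lock just long
    # enough to copy the batch, then process the private copy lock-free.
    seen = -1
    while seen < num_batches - 1:
        with lock:
            private = list(shared_batch)
        if private and private[0] != seen:
            seen = private[0]
            results.append(sum(private))   # the 'motion processing'
        time.sleep(0.001)

c = threading.Thread(target=collector, args=(3,))
m = threading.Thread(target=motion, args=(3,))
c.start(); m.start()
c.join(); m.join()
print(results)
```

The design point is that the lock only guards the copy, so motion processing itself never blocks the collector.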
The only gotcha (which is potentially a blocking gotcha) is the Python Global Interpreter Lock (GIL) which makes the use of python threads virtually useless. I was vaguely aware of this so I polled the Raspberry Pi forum, which confirmed my concern.
But having Yorkshire roots (stubborn and overly self-confident), I did a quick and dirty hack, and the speed for data collection went up to 900Hz, so it seemed worthwhile doing this properly. So I’ve spent the last couple of days carefully rewriting and testing the code. Sadly, the ‘clean’ version dropped the data collection speed to just over 500Hz, which was a huge disappointment.
Which leaves me in a quandary: did I just dream the 900Hz? And WTF did I do with the hacked version that showed such potential?
I’ll keep trying for a few days more before I give up – the weather is lousy so I have the time to tinker. Wish me luck!