Use Fast DCTs instead of direct DCTs. This should improve speed performance significatively. Also the code is highly parallelizable, so Juan will have a lot to say there
The obvious choice would be FFTW, but it is GPL and the cost of a commercial license (or 'non-free license' as they euphemistically call it) was above 7000 USD the last time I asked. So FFTW is discarded, unless you are going to release all of your code under GPL. Even then I'd ask the FFTW guys first because they probably don't want you to use their code in a PI module.
I'm looking for a good DCT implementation released under a BSD or MIT license, to integrate it with PCL. I've found
this one which looks good; other suggestions welcome, of course.
Right now I'm using naive DCT implementations, O(n^2). Once Juan (or another more skilled than me) implements fast DCT/DST transforms, I may change the code to use it.
As happens with the DFT, The DCT is separable. This means that the DCT of a 2D function is equivalent to performing the 1D DCT of each row followed by the 1D DCTs of the resulting columns. This reduces the complexity to O(2N) with respect to the naive implementation, which is O(N^2) as you know.