So I've got some benchmarks today. Let's look at the cold numbers and discuss the results for each tool in some detail.
At the risk of being repetitive, these are the conditions:
- Machine: DELL Precision M6500 laptop, Intel Core i7 Q820 (4-core w/HT = 8 logical processors), 8 GB DDR3 1300.
- OS: Fedora 14 Linux x86_64 (unoptimized, almost default settings and services running).
- Desktop: KDE 4.6.5
- PixInsight version: 1.7.0.702 Starbuck 64-bit
Best four runs out of eight tries. Times in seconds, copied from PI's console.
*******************************************************************************
ATrousWaveletTransform
6270x4096 F32
4-layer transform with bias and noise reduction in the first 3 layers.
8 threads: 6.60, 6.64, 6.65, 6.65
4 threads: 7.25, 7.29, 7.31, 7.30
2 threads: 8.89, 8.92, 8.92, 8.96
1 thread : 9.32, 9.32, 9.34, 9.36
The results for this tool show poor scaling as the number of threads increases. The reason is rather simple: the ATWT process works on a per-layer basis. Although wavelet transforms are extremely fast and scale very well in PixInsight, this tool spends most of its processing time iterating through wavelet coefficients on each layer. The tasks of layer biasing, noise thresholding and noise reduction cannot be implemented as a single loop, and repeated loops scale quite poorly. I'll try to increase the amount of work done by each thread, which should improve scalability, but don't expect miracles for this tool.
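To put a number on "poor scaling", the timings above can be turned into speedup and efficiency figures. This is only a minimal sketch in plain JavaScript (not PJSR): the `scaling` helper and the `atwtBestTimes` table are names made up for this illustration, and efficiency is computed per logical thread, which is pessimistic for the 8-thread row because the machine has only 4 physical cores.

```javascript
// Best time in seconds for each thread count, copied from the
// ATrousWaveletTransform results above.
const atwtBestTimes = { 1: 9.32, 2: 8.89, 4: 7.25, 8: 6.60 };

// Hypothetical helper (not a PixInsight API): speedup relative to the
// single-thread run, and efficiency = speedup / number of threads.
function scaling(bestTimes) {
  const t1 = bestTimes[1];
  return Object.keys(bestTimes)
    .map(Number)
    .sort((a, b) => a - b)
    .map((n) => ({
      threads: n,
      speedup: t1 / bestTimes[n],
      efficiency: t1 / bestTimes[n] / n,
    }));
}

for (const row of scaling(atwtBestTimes)) {
  console.log(
    `${row.threads} threads: speedup ${row.speedup.toFixed(2)}x, ` +
    `efficiency ${(row.efficiency * 100).toFixed(0)}%`
  );
}
```

With these numbers the 8-thread run is only about 1.41 times faster than the single-thread run.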
*******************************************************************************
HDRWaveletTransform
6270x4096 F32
4 layers B3 Spline
8 threads: 12.12, 12.13, 12.14, 12.14
4 threads: 14.27, 14.29, 14.32, 14.33
2 threads: 19.35, 19.36, 19.48, 19.49
1 thread : 20.41, 20.41, 20.43, 20.48
Conscious that this tool is one of our flagships, I invested significant time optimizing it last spring, and the results are rather good considering that this is another difficult tool to parallelize. Again, this tool works layer by layer on wavelet coefficients, so there's not much more I can do to improve it.
*******************************************************************************
CurvesTransformation
2985x1950 RGB F32
8 threads: 3.18, 3.20, 3.17, 3.16
4 threads: 4.64, 4.66, 4.71, 4.51
2 threads: 7.05, 7.03, 7.02, 7.04
1 thread : 10.41, 10.47, 10.41, 10.40
This is an excellent example of a tool that performs an embarrassingly parallel task as a whole, and hence scales very well, as you can see from the numbers above. Note also that hyperthreading is particularly efficient for this task, at least with the processor and OS used (Core i7 Q820 and Linux).
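As an illustration of what "embarrassingly parallel" means here, the sketch below applies a per-pixel function over independent chunks of an image buffer. It is plain JavaScript, not PixInsight code: `applyInChunks` is a hypothetical helper, the square-root `curve` is just a stand-in for a real curve, and the chunks are processed one after another rather than on real threads. The point is only that no chunk depends on any other, so the result is the same however the work is divided.

```javascript
// Apply `curve` to every pixel, splitting the buffer into `nChunks`
// independent ranges. Chunks run sequentially here; in a threaded
// implementation each range could be handed to its own thread, because
// no chunk reads or writes data that belongs to another chunk.
function applyInChunks(pixels, curve, nChunks) {
  const out = new Float32Array(pixels.length);
  const chunkSize = Math.ceil(pixels.length / nChunks);
  for (let c = 0; c < nChunks; ++c) {
    const begin = c * chunkSize;
    const end = Math.min(begin + chunkSize, pixels.length);
    for (let i = begin; i < end; ++i)
      out[i] = curve(pixels[i]);
  }
  return out;
}

// Hypothetical stand-in for a curve: a simple midtone lift.
const curve = (x) => Math.sqrt(x);

// Small synthetic ramp in [0,1] standing in for image data.
const rampPixels = Float32Array.from({ length: 1000 }, (_, i) => i / 999);

const oneChunk = applyInChunks(rampPixels, curve, 1);
const eightChunks = applyInChunks(rampPixels, curve, 8);
console.log(oneChunk.every((v, i) => v === eightChunks[i]));
```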
*******************************************************************************
PixelMath
6270x4096 F32
Expression: a = alpha*Pi(); (1 + 1/Tan( a/2 ) * Tan( a*($T - 1/2) ))/2
Symbols: a, alpha=0.5
8 threads: 5.45, 5.45, 5.48, 5.58
4 threads: 7.67, 7.84, 7.79, 7.91
2 threads: 11.86, 11.91, 12.03, 12.10
1 thread : 13.80, 14.10, 14.15, 14.34
The PixelMath expression corresponds to a sigmoid contrast manipulation function, namely:
$$f(x) = \frac{1}{2}\left(1 + \frac{\tan\left(a\left(x - \frac{1}{2}\right)\right)}{\tan\left(\frac{a}{2}\right)}\right), \qquad a = \alpha\pi$$
where the alpha parameter (0 < alpha <= 1) controls the function's strength (more aggressive for larger alpha values; as alpha approaches 0 the function tends to the identity). Again, I spent considerable time optimizing PixelMath (a couple of years ago), as it is one of the most important tools in PixInsight. This tool operates on a pixel-by-pixel basis, so it can also be considered an embarrassingly parallel task at a high level.
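For readers who want to see what the expression does to individual pixel values, here is a direct transcription in plain JavaScript. The function name `contrastCurve` is made up for this sketch; the formula itself is the PixelMath expression above, with `x` playing the role of `$T`.

```javascript
// f(x) = (1 + tan(a*(x - 1/2)) / tan(a/2)) / 2, with a = alpha*pi,
// transcribed from the PixelMath expression above.
// `contrastCurve` is a hypothetical name used only in this sketch.
function contrastCurve(x, alpha) {
  const a = alpha * Math.PI;
  return (1 + Math.tan(a * (x - 0.5)) / Math.tan(a / 2)) / 2;
}

// Evaluate a few sample pixel values with alpha = 0.5, as in the benchmark.
for (const x of [0, 0.25, 0.5, 0.75, 1])
  console.log(x.toFixed(2), "->", contrastCurve(x, 0.5).toFixed(4));
```

The function leaves 0, 0.5 and 1 where they are (up to rounding), so black, mid-gray and white are preserved while the values in between are redistributed.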
*******************************************************************************
Fast Fourier Transform
Now let's test fast Fourier transforms in PixInsight. As most of you probably know, the FFT is a fundamental 'brick' in the image processing building, so its implementation is absolutely crucial for overall performance. However, testing Fourier transforms with the standard FourierTransform and InverseFourierTransform tools is problematic: the FFT routine is so fast that the execution times of these tools are strongly dominated by the creation and display of new image windows (to render the FT components). This chiefly explains Yuriy's results.
The best way (the only way, in fact) to test FFT routines in PixInsight is by means of a little script:
#define NITER 10

function benchmarkFFT()
{
   // 4096x4096 complex image initialized from a Gaussian
   var C = Image.newComplexImage();
   C.assign( Matrix.gaussianFilterBySize( 4096 ).toImage() );

   // Time NITER consecutive FFTs
   var t1 = Date.now();
   for ( var i = 0; i < NITER; ++i )
      C.FFT( true/*centered*/ );
   var t2 = Date.now();

   var secs = (t2 - t1)/1000;
   console.writeln( format( "%.3f s", secs ) );
}

#undef NITER

console.show();
benchmarkFFT();
The above script performs ten FFTs for a 4096x4096 single-channel, 32-bit floating point image and shows the time spent in seconds on the console. These are my results:
8 threads: 6.316
4 threads: 8.809
2 threads: 14.506
1 thread : 17.031
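Since each figure above is the total for ten transforms, it can help to break it down per transform. The following is a small sketch in plain JavaScript; `perTransformStats` and the `fftTotals` table are names invented for this illustration, and "megapixels per second" is just image size divided by time, not an official PixInsight metric.

```javascript
const FFT_ITERATIONS = 10;           // NITER in the script above
const FFT_PIXELS = 4096 * 4096;      // single-channel 4096x4096 image

// Total seconds for ten FFTs per thread count, copied from the results above.
const fftTotals = { 8: 6.316, 4: 8.809, 2: 14.506, 1: 17.031 };

// Hypothetical helper: time for one transform and the resulting throughput.
function perTransformStats(totalSeconds) {
  const seconds = totalSeconds / FFT_ITERATIONS;
  return { seconds, megapixelsPerSecond: FFT_PIXELS / seconds / 1e6 };
}

for (const threads of [8, 4, 2, 1]) {
  const s = perTransformStats(fftTotals[threads]);
  console.log(
    `${threads} threads: ${s.seconds.toFixed(3)} s per FFT, ` +
    `${s.megapixelsPerSecond.toFixed(1)} Mpx/s`
  );
}
```

With 8 threads this works out to roughly 0.63 s per 4096x4096 transform.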
*******************************************************************************
Separable Convolution
If the FFT is a fundamental brick, separable convolution is a keystone. Again, these routines are so fast that testing them through tools and their GUI interaction is not accurate. I have written another simple script to carry out these tests:
#define NITER 10

function benchmarkSeparableConvolution()
{
   // 4096x4096 test image and a 25-pixel Gaussian filter
   var I = Matrix.gaussianFilterBySize( 4096 ).toImage();
   var H = Matrix.gaussianFilterBySize( 25 );
   // The central row of H is used as both one-dimensional filter vectors
   var v = H.rowVector( H.rows >> 1 );

   // Time NITER consecutive separable convolutions
   var t1 = Date.now();
   for ( var i = 0; i < NITER; ++i )
      I.convolveSeparable( v, v );
   var t2 = Date.now();

   var secs = (t2 - t1)/1000;
   console.writeln( format( "%.3f s", secs ) );
}

#undef NITER

console.show();
benchmarkSeparableConvolution();
This script performs 10 separable convolutions of a 4096x4096 monochrome image with a 25-pixel Gaussian filter (two one-dimensional filter vectors of 25 components each). These are the results:
8 threads: 7.478
4 threads: 10.651
2 threads: 16.654
1 thread : 18.293
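For anyone unfamiliar with the technique, a separable convolution replaces one 2-D filter pass with two 1-D passes, one along rows and one along columns. The sketch below is plain JavaScript and is not PixInsight's implementation: the function names are made up, the image is a flat row-major array, and clamping out-of-range samples to the nearest edge pixel is an assumption chosen only to keep the example short.

```javascript
// One 1-D convolution pass over a row-major width x height image.
// `horizontal` selects the direction. Border handling is an assumption of
// this sketch: out-of-range samples are clamped to the nearest edge pixel.
function convolve1D(src, width, height, k, horizontal) {
  const r = k.length >> 1;
  const dst = new Float64Array(src.length);
  for (let y = 0; y < height; ++y)
    for (let x = 0; x < width; ++x) {
      let sum = 0;
      for (let j = -r; j <= r; ++j) {
        // Convolution reads the sample at (position - j) for kernel tap j.
        const xx = horizontal ? Math.min(width - 1, Math.max(0, x - j)) : x;
        const yy = horizontal ? y : Math.min(height - 1, Math.max(0, y - j));
        sum += src[yy * width + xx] * k[j + r];
      }
      dst[y * width + x] = sum;
    }
  return dst;
}

// Hypothetical helper: row pass followed by column pass.
function separableConvolve(src, width, height, kRow, kCol) {
  const rowsDone = convolve1D(src, width, height, kRow, true);
  return convolve1D(rowsDone, width, height, kCol, false);
}

// Tiny demonstration: a single bright pixel in a 5x5 image, filtered with
// a 3-tap kernel, spreads into the outer product of the kernel with itself.
const impulse = new Float64Array(25);
impulse[2 * 5 + 2] = 1;
const taps = [0.25, 0.5, 0.25];
const blurred = separableConvolve(impulse, 5, 5, taps, taps);
console.log(blurred[2 * 5 + 2], blurred[2 * 5 + 1], blurred[1 * 5 + 1]);
```

The payoff is in the operation count: with a 25-element vector, each pixel costs 25 + 25 = 50 multiply-adds in separable form, against 25 x 25 = 625 for an equivalent full 2-D filter.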
*******************************************************************************
Conclusion
Scalability optimization is a very difficult and delicate task, and it is evident that there is plenty of room for improvement in many PixInsight tools, especially in the most complex ones where parallelization is not trivial. Although this is an important topic, there are far more important tasks to be done to improve PixInsight. Flexibility and the quality of the results are much more important to me than speed, especially when speed gains involve simplified implementations and solutions. Always keep in mind that my main goal is to provide you with the most powerful, accurate and versatile image processing platform, not with the fastest implementation.
That said, I'll try to improve the scalability of some critical tools as soon as possible, but please bear in mind that this is not a high priority for me. For example, documentation and video tutorials are much more important now than squeezing out the last bit of performance. Another example: I would rather invest my time in integrating Georg's GradientsMergeMosaic with my StarAlignment than in achieving the fastest tool for a specific task.