Author Topic: PixInsight MultiCores CPU Scalability....  (Read 14220 times)

Offline Sean

  • PixInsight Addict
  • ***
  • Posts: 144
    • My Personal Website
Re: PixInsight MultiCores CPU Scalability....
« Reply #15 on: 2011 August 18 19:44:01 »
This is an interesting and nicely geeky thread. If I can find a way to temporarily disable hyperthreading under OS X Lion (apparently not very easy), I will definitely run some PI tests. I know, Juan: Apple wants it their way!

Sean

Offline pfile

  • PTeam Member
  • PixInsight Jedi Grand Master
  • ********
  • Posts: 4729
Re: PixInsight MultiCores CPU Scalability....
« Reply #16 on: 2011 August 18 19:57:38 »
Quote from: Sean on 2011 August 18 19:44:01
This is an interesting and nicely geeky thread. If I can find a way to temporarily disable hyperthreading under OS X Lion (apparently not very easy), I will definitely run some PI tests. I know, Juan: Apple wants it their way!

Sean

Do you have the developer tools (Xcode) installed? They come with a PreferencePane that should be able to turn off cores and turn off hyperthreading. I can't test this (or rather am afraid to) because, well, let's just say I should turn those things off in the BIOS if I want to :)

There's some evidence around the web that this PrefPane does not work if you are booting a 64-bit kernel. Not sure...

Offline sigurd

  • Newcomer
  • Posts: 35
    • The Lambda Conspiracy
Re: PixInsight MultiCores CPU Scalability....
« Reply #17 on: 2011 August 18 21:03:48 »
sysctl machdep.cpu.thread_count

will tell you how many threads you have. I don't know if it will allow setting the count, as my chips don't support HT. The GUI PrefPane and CHUD were removed in Xcode 4, I think.

-esy
« Last Edit: 2011 August 18 21:19:40 by sigurd »
“My punctuality is well known. When The Revolution takes place, I'll be late, and I'll be shot as a traitor.”

Offline pfile

  • PTeam Member
  • PixInsight Jedi Grand Master
  • ********
  • Posts: 4729
Re: PixInsight MultiCores CPU Scalability....
« Reply #18 on: 2011 August 18 21:22:24 »
Quote from: sigurd on 2011 August 18 21:03:48
sysctl machdep.cpu.thread_count

will tell you how many threads you have. I don't know if it will allow setting the count, as my chips don't support HT. The GUI PrefPane and CHUD were removed in Xcode 4, I think.

-esy

Yeah, you're right; I still have Xcode 3 on this machine.

I don't want to try setting the number of threads right now; maybe when I'm done working here...

Offline Sean

  • PixInsight Addict
  • ***
  • Posts: 144
    • My Personal Website
Re: PixInsight MultiCores CPU Scalability....
« Reply #19 on: 2011 August 18 21:26:28 »
Yes, Xcode 4 is installed here. I downloaded and tried the PrefPane from Xcode 3, but not surprisingly it wouldn't install.

Sean

Offline Andres.Pozo

  • PTeam Member
  • PixInsight Padawan
  • ****
  • Posts: 927
Re: PixInsight MultiCores CPU Scalability....
« Reply #20 on: 2011 August 19 01:04:56 »
In the MaskedStretch script the bottleneck is the hard disk. The script spends most of its time applying HistogramTransformation, and when that process is applied with a mask it always writes the undo data (the original image) to disk. Although HistogramTransformation itself is parallelized, that does not help here, since writing the undo data takes longer than the computation.

I would recommend using an SSD for PixInsight's swap directory.

Offline georg.viehoever

  • PTeam Member
  • PixInsight Jedi Master
  • ******
  • Posts: 2132
Re: PixInsight MultiCores CPU Scalability....
« Reply #21 on: 2011 August 19 02:49:56 »
Just a few quick remarks:

- Parallelization or speedup is not the goal in itself. Absolute speed is what matters; the means by which you get it does not really matter.

- This whole multicore/HT business is really a clever piece of marketing hiding a technical problem. In reality, CPU manufacturers would have to say: "Guys, in the past we managed to give you faster CPUs that did not require anything special in software to get more speed. We are sorry to say that you now need specially written software to get any benefit from our new chips." Instead, the marketing people managed to turn this into "In the past we gave you only 1 core. Now we give you n cores. And n is always better than 1."

- The main idea behind HT is that you can get something like 0-20% more performance for just 5% additional hardware. It also got programmers thinking in parallel. The next step was multicore, where AMD/Intel simply replicated whole CPUs (40-80% more speed for 100% more hardware). For the next AMD generation (Bulldozer) it will get more difficult to say how many cores there really are: according to the literature, two cores share one FPU http://img145.imageshack.us/img145/4813/bully.jpg . If you look closer, you find something like 4 integer units in those two cores. So are the two cores in reality 1 or 4 units? Difficult to say. It's going to be interesting to see how AMD marketing is going to sell this.

- Parallelization is really hard work, and 12 cores rarely give you a 12x speedup (see the sketch below). And when you finally manage to parallelize something well, that often merely moves the bottleneck of the workflow somewhere else. Input like Yuriy's "Image Integration with the default parameters is THE CHAMPION of performance degradation" is most helpful for seeing where it really pays to invest the work.
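
As a rough illustration of why N cores rarely give an N-fold speedup, Amdahl's law puts an upper bound on it: if a fraction p of the work can run in parallel, N cores give at most 1/((1 - p) + p/N). A minimal sketch in PixInsight's JavaScript (the function name is just illustrative, this is not PixInsight code):

Code:
// Amdahl's law: best-case speedup on N cores when a fraction p of the work is
// parallelizable and the rest stays serial.
function amdahlSpeedup( p, N )
{
   return 1/((1 - p) + p/N);
}

// Even with 90% of the work parallelized, 12 cores give at most ~5.7x:
console.writeln( format( "%.2f x", amdahlSpeedup( 0.90, 12 ) ) );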

Georg
Georg (6 inch Newton, unmodified Canon EOS40D+80D, unguided EQ5 mount)

Offline Juan Conejero

  • PTeam Member
  • PixInsight Jedi Grand Master
  • ********
  • Posts: 7111
    • http://pixinsight.com/
Re: PixInsight MultiCores CPU Scalability....
« Reply #22 on: 2011 August 19 08:40:11 »
So I've got some benchmarks today. Let's look at the cold numbers and discuss the results for each tool in some detail.

At the risk of being repetitive, these are the conditions:

- Machine: DELL Precision M6500 laptop, Intel Core i7 Q820 (4-core w/HT = 8 logical processors), 8 GB DDR3 1300.
- OS: Fedora 14 Linux x86_64 (unoptimized, almost default settings and services running).
- Desktop: KDE 4.6.5
- PixInsight version: 1.7.0.702 Starbuck 64-bit

Best four runs out of eight tries. Times in seconds, copied from PI's console.

*******************************************************************************
ATrousWaveletTransform
6270x4096 F32
4-layer transform with bias and noise reduction in the first 3 layers.

8 threads:  6.60,  6.64,  6.65,  6.65
4 threads:  7.25,  7.29,  7.31,  7.30
2 threads:  8.89,  8.92,  8.92,  8.96
1 thread :  9.32,  9.32,  9.34,  9.36

The results for this tool show poor scaling as the number of threads increases. The reason is rather simple: the ATWT process works on a per-layer basis. Although wavelet transforms are extremely fast and scale very well in PixInsight, this tool spends most of its processing time iterating through the wavelet coefficients of each layer. The tasks of layer biasing, noise thresholding and noise reduction cannot be implemented as a single loop, and repeated loops scale quite poorly. I'll try to increase the amount of work done by each thread, which will improve scalability, but don't expect miracles for this tool.
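
To make the 'repeated loops' pattern concrete, here is a minimal sketch (simplified operations, illustrative names; this is not the actual ATWT code):

Code:
// Sketch only: the per-layer work is a sequence of short passes over the same
// coefficient array. Each pass does little arithmetic per memory access and
// has to fan out to the worker threads and join them again, which is why
// additional threads help relatively little here.
function processLayer( c, bias, threshold )
{
   for ( var i = 0; i < c.length; ++i )      // pass 1: layer biasing
      c[i] *= bias;

   for ( var i = 0; i < c.length; ++i )      // pass 2: noise thresholding
      if ( Math.abs( c[i] ) < threshold )
         c[i] = 0;

   for ( var i = 0; i < c.length; ++i )      // pass 3: noise reduction
      c[i] *= 0.75;                          // placeholder for the real step
}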

*******************************************************************************
HDRWaveletTransform
6270x4096 F32
4 layers B3 Spline

8 threads: 12.12, 12.13, 12.14, 12.14
4 threads: 14.27, 14.29, 14.32, 14.33
2 threads: 19.35, 19.36, 19.48, 19.49
1 thread : 20.41, 20.41, 20.43, 20.48

Conscious that this tool is one of our flagships, I invested significant time optimizing it during past Spring and the results are rather good, considering that this is another difficult tool for parallelization. Again, this tool works layer-by-layer operating with wavelet coefficients so there's not much more I can do to improve it.

*******************************************************************************
CurvesTransformation
2985x1950 RGB F32

8 threads:  3.18,  3.20,  3.17,  3.16
4 threads:  4.64,  4.66,  4.71,  4.51
2 threads:  7.05,  7.03,  7.02,  7.04
1 thread : 10.41, 10.47, 10.41, 10.40

This is an excellent example of a tool whose task is embarrassingly parallel as a whole, and hence it scales very well, as you can see from the numbers above. Note also that hyperthreading is particularly efficient for this task, at least with the processor and OS used here (Core i7 Q820 and Linux).
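
To illustrate what 'embarrassingly parallel' means here, a minimal sketch (illustrative names, not PixInsight's actual implementation): each output pixel depends only on the corresponding input pixel, so the image splits into row ranges that can be processed completely independently, one range per thread in the C++ core.

Code:
// Sketch only: a per-pixel curve applied to an independent range of rows.
// Because no pixel depends on any other, each range can be handed to its own
// thread with no communication and almost no synchronization.
function applyCurveToRows( pixels, width, firstRow, rowCount, curve )
{
   var begin = firstRow*width;
   var end = (firstRow + rowCount)*width;
   for ( var i = begin; i < end; ++i )
      pixels[i] = curve( pixels[i] );
}

// Example: a simple gamma curve on a tiny 3x2 single-channel "image".
var img = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6];
applyCurveToRows( img, 3, 0, 2, function( x ) { return Math.pow( x, 0.5 ); } );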

*******************************************************************************
PixelMath
6270x4096 F32
Expression: a = alpha*Pi(); (1 + 1/Tan( a/2 ) * Tan( a*($T - 1/2) ))/2
Symbols: a, alpha=0.5

8 threads:  5.45,  5.45,  5.48,  5.58
4 threads:  7.67,  7.84,  7.79,  7.91
2 threads: 11.86, 11.91, 12.03, 12.10
1 thread : 13.80, 14.10, 14.15, 14.34

The PixelMath expression corresponds to a sigmoid contrast manipulation function, namely:

a = alpha*Pi
f(x) = (1 + Tan( a*(x - 1/2) )/Tan( a/2 ))/2

where the alpha parameter (0 < alpha <= 1) controls the function's strength (more aggressive for smaller alpha values). Again, I spent considerable time optimizing PixelMath (a couple of years ago), as it is one of the most important tools in PixInsight. This tool operates on a pixel-by-pixel basis, so it can also be considered an embarrassingly parallel task at a high level.
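
For reference, here is the same sigmoid function written as an ordinary routine (a sketch, not PixelMath's internal implementation); evaluating it independently for every pixel is what makes the task embarrassingly parallel.

Code:
// The sigmoid contrast function from the PixelMath expression above, evaluated
// for a single normalized sample x in [0,1], with a = alpha*pi.
function sigmoidContrast( x, alpha )
{
   var a = alpha*Math.PI;
   return (1 + Math.tan( a*(x - 0.5) )/Math.tan( a/2 ))/2;
}

console.writeln( format( "%.6f", sigmoidContrast( 0.25, 0.5 ) ) );  // ~0.292893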

*******************************************************************************
Fast Fourier Transform

Now let's test fast Fourier transforms in PixInsight. As most of you probably know, the FFT is a fundamental 'brick' in the image processing building, so its implementation is absolutely crucial in terms of overall performance. However, there is a problem with testing Fourier transforms through the standard FourierTransform and InverseFourierTransform tools: the FFT routine is so fast that the execution times of these tools are strongly dominated by the creation and display of new image windows (to render the FT components). This largely explains Yuriy's results.

The best way (the only way, in fact) to test FFT routines in PixInsight is by means of a little script:

Code:
#define NITER 10

function benchmarkFFT()
{
   var C = Image.newComplexImage();
   C.assign( Matrix.gaussianFilterBySize( 4096 ).toImage() );

   var t1 = Date.now();

   for ( var i = 0; i < NITER; ++i )
      C.FFT( true/*centered*/ );

   var t2 = Date.now();

   var secs = (t2 - t1)/1000;

   console.writeln( format( "%.3f s", secs ) );
}

#undef NITER

console.show();
benchmarkFFT();


The above script performs ten FFTs for a 4096x4096 single-channel, 32-bit floating point image and shows the time spent in seconds on the console. These are my results:

8 threads:  6.316
4 threads:  8.809
2 threads: 14.506
1 thread : 17.031

*******************************************************************************
Separable Convolution

If the FFT is a fundamental brick, separable convolution is a keystone. Again, these routines are so fast that testing them through tools, with their GUI interaction, is not accurate. I have written another simple script to carry out these tests:

Code:
#define NITER 10

function benchmarkSeparableConvolution()
{
   var I = Matrix.gaussianFilterBySize( 4096 ).toImage();
   var H = Matrix.gaussianFilterBySize( 25 );
   var v = H.rowVector( H.rows >> 1 );

   var t1 = Date.now();

   for ( var i = 0; i < NITER; ++i )
      I.convolveSeparable( v, v );

   var t2 = Date.now();

   var secs = (t2 - t1)/1000;

   console.writeln( format( "%.3f s", secs ) );
}

#undef NITER

console.show();
benchmarkSeparableConvolution();

This script performs 10 separable convolutions of a 4096x4096 monochrome image with a 25-pixel Gaussian filter (two one-dimensional filter vectors of 25 components each). These are the results:

8 threads:  7.478
4 threads: 10.651
2 threads: 16.654
1 thread : 18.293
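
As an aside on why separable convolution is so much cheaper than direct 2D convolution: a separable k x k kernel is applied as a 1D pass along the rows followed by a 1D pass along the columns, which costs 2k multiplications per pixel instead of k*k (50 instead of 625 for the 25-pixel Gaussian above). A minimal sketch of the row pass (illustrative code, not PixInsight's implementation):

Code:
// Sketch only: the row pass of a separable convolution with a 1D kernel h.
// A second, identical pass along the columns completes the 2D convolution.
// Each row is independent, so this pass parallelizes cleanly over rows, and
// the column pass over columns.
function convolveRows( src, dst, width, height, h )
{
   var r = (h.length - 1) >> 1;                 // filter radius
   for ( var y = 0; y < height; ++y )
      for ( var x = 0; x < width; ++x )
      {
         var s = 0;
         for ( var k = -r; k <= r; ++k )
         {
            var xx = Math.min( Math.max( x + k, 0 ), width - 1 );  // clamp at edges
            s += h[k + r]*src[y*width + xx];
         }
         dst[y*width + x] = s;
      }
}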

*******************************************************************************
Conclusion

Scalability optimization is a very difficult and delicate task, and it is evident that there is plenty of room for improvement in many PixInsight tools, especially in the most complex ones where parallelization is not trivial. Although this is an important topic, there are more important tasks to be done to improve PixInsight. Flexibility and the quality of the results matter much more to me than speed, especially when speed gains require simplified implementations and solutions. Always keep in mind that my main goal is to provide you with the most powerful, accurate and versatile image processing platform, not with the fastest implementation.

That said, I'll try to improve the scalability of some critical tools as soon as possible, but please understand that this is not a high priority for me. For example, documentation and video tutorials are much more important right now than squeezing out the last bit of performance. Another example: I prefer to invest my time in integrating Georg's GradientsMergeMosaic with my StarAlignment rather than in achieving the fastest tool for a specific task.
Juan Conejero
PixInsight Development Team
http://pixinsight.com/

Offline Yuriy Toropin

  • PixInsight Addict
  • ***
  • Posts: 209
Re: PixInsight MultiCores CPU Scalability....
« Reply #23 on: 2011 August 19 14:38:17 »
Juan, thanks a lot for the very interesting details, insights and considerations!
As usual, it helps to understand the PI way of life better! :)

Considering my initial "bad" results with ImageIntegration -
Georg, you were right: Buffer Size was the key. As soon as I changed it from the default 16 MB to 768 MB, there was no degradation as the number of cores increased.
Updated scalability results can be found below; the new runs are drawn with thick lines, while the initial ones are in thin lines.
Juan, with this in mind, it's still strange how you got such good results for ImageIntegration with the "default" parameters of the process, especially the default Buffer Size.
Maybe it makes sense to change the default value of the Buffer Size parameter to something more appropriate?

Additionally, I've remembered another trick: a RAM drive!!!
While it's not a standard part of Windows 7, there is a wide selection of applications available that create a "virtual drive" in RAM.
Some RAM disk software benchmarks can be found on this page. I've installed the freeware version of DATARAM RAMDisk for evaluation, in an attempt to partially remove the bottleneck related to I/O operations.

Scalability tests with the PI swap files located on a 4 GB NTFS RAM disk have been re-run for ImageIntegration (with the buffer size at 768 MB), MaskedStretch, HDRW and LocalHistogramEqualization.
The results are on the chart below (thick lines). The RAM disk somewhat reduces the I/O bottleneck, cutting another 10-20% off the time spent by the processes.

This time all processes behave nicely: the time in multi-core mode was always shorter than in single-core mode. Nice, but not great...

Juan, while you have commented extensively on the topic and stated your view of the priority of "speed" (sorry, I can't agree with it; speed IS extremely important and could be another strong argument for newcomers to choose PI), don't you think that the weak speed improvement going from one core to two cores looks... strange? Could there be some general cause for this that would be easy to fix (as soon as it is identified? :) )?

Another (minor) peculiarity is the observation that ImageIntegration and HDRW require more time with 12 cores than with 9.
MaskedStretch's behavior is even more interesting :)

Anyway, here is the refreshed scalability graph, with the RAM disk in use:




Guys, just FYI, here is a comparison of I/O operations using the RAM disk, the SSD and the HDD in my computer. The RAM disk is 15 to 100+ times faster than the SSD in these tests.



PS: ...I love the almost perfect scalability of LocalHistogramEqualization using the RAM disk!
« Last Edit: 2011 August 19 14:50:43 by Yuriy Toropin »

Offline Nocturnal

  • PixInsight Jedi Council Member
  • *******
  • Posts: 2727
    • http://www.carpephoton.com
Re: PixInsight MultiCores CPU Scalability....
« Reply #24 on: 2011 August 21 08:48:56 »
Thanks for the Dataram tip. Handy for all kinds of things. Speeds up DeepSkyStacker pretty nicely too. 4 GB should be enough for most applications. I would add a larger second temp directory to PI but then it'll start using both. At least I don't think you can configure a primary and an 'overflow' temp dir.
Best,

    Sander
---
Edge HD 1100
QHY-8 for imaging, IMG0H mono for guiding, video cameras for occultations
ASI224, QHY5L-IIc
HyperStar3
WO-M110ED+FR-III/TRF-2008
Takahashi EM-400
PIxInsight, DeepSkyStacker, PHD, Nebulosity

Offline Juan Conejero

  • PTeam Member
  • PixInsight Jedi Grand Master
  • ********
  • Posts: 7111
    • http://pixinsight.com/
Re: PixInsight MultiCores CPU Scalability....
« Reply #25 on: 2011 August 21 12:44:06 »
I've got good news. The 'champion of poor performance' (try an experiment: be a software developer and read that sentence; isn't it a good endurance test? :)) has got superpowers.

Basically, I designed ImageIntegration to favor integration of large image sets (hundreds, even thousands of images) while minimizing memory usage. This had a severe performance penalty in terms of poor scalability, as Yuriy has kindly shown us. Now I have rewritten the tool to fix this performance problem (in just three days; not bad for being on vacation, huh?). The new version allows you to decide whether you want to optimize for memory consumption (the old behavior) or for execution performance (the new behavior). You can also balance between both paradigms.

Just one benchmark. The conditions are the same as in my previous benchmarks (see my previous post in this thread), so I won't repeat them.

ImageIntegration, 19 2048x2048 monochrome raw CCD images, all default parameters except rejection algorithm. As always, times are in seconds and I provide the best execution time out of eight tries.

Old version of the ImageIntegration module (you know, 'the champion of...')

Winsorized Sigma Clipping
8 threads: 22.41
4 threads: 20.39
2 threads: 22.36
1 thread : 24.52

New version of the ImageIntegration module (call it 'the atom ant')

Winsorized Sigma Clipping
8 threads: 11.34
4 threads: 14.34
2 threads: 19.24
1 thread : 21.12



More information very soon, and an update, of course ;)
Juan Conejero
PixInsight Development Team
http://pixinsight.com/