Author Topic: PixInsight MultiCores CPU Scalability....  (Read 14219 times)

Offline Yuriy Toropin

  • PixInsight Addict
  • ***
  • Posts: 209
PixInsight MultiCores CPU Scalability....
« on: 2011 August 18 12:25:22 »
I've been working on some CPU-intensive projects in PixInsight recently:
processing shots from 16 or even 31 MP cameras, mosaics built from 10 MP color shots, etc.

My previous rig, built around an Intel Core 2 Quad Q9650 (4 cores, no hyperthreading, 3 GHz, 8 GB of memory under Win 7 64-bit), felt "slow" on such tasks, so I decided to upgrade to a CPU that would bring more massive multiprocessing without bankrupting me...

A couple of weeks ago I upgraded to a system built around an Intel i7 970 (6 cores with hyperthreading == 12 threads or "cores" available simultaneously, turbo up to 3.9 GHz), with 12 GB DDR3 RAM, Win 7 64-bit.

First, without any magic, I measured PixInsight (PI hereafter) performance on the new system; I had grabbed some "test" results from the Q9650 in advance for comparison.
The new i7 970 system was faster by... only ~20-25% :surprised: ... 12 cores vs 4 cores, higher clock speed, more memory running at higher speed... :sad:
The advantage was mainly due to the higher CPU clock speed of the i7 970.

Second, I decided to run some tests to understand how PI's performance changes with more CPU cores available for processing.
For this I measured the time needed to perform 6 typical processing actions, using the same standard image set(s) as input.
I controlled the number of cores available to PI via Edit > Preferences > Parallel Processing and Threads > "Maximum number of processors used".
All other "Enable..." options on that preferences sheet were set to ON, and priority was set to "Time Critical".

Time spent by each process was measured with 1, 2, 3, 4, 5, 6, then 9 and 12 cores (threads) enabled.

Actual CPU core load was monitored via the Windows 7 Resource Monitor. Typically, PI used as many cores as it was allowed.

Results can be found in the graph below.
It shows the ratio of the time spent on each process with several CPU cores utilized to the time taken by 1 core.
A value below 1 means that PI spent less time and computed faster using several cores; a value above 1 means that it spent MORE time in multi-core runs compared to the baseline single-core run...

The red line represents the "ideal scalability (1/N)" curve: the ideal case in which calculations and other operations are distributed equally among all available cores, i.e., perfect parallelism. It's unachievable in practice because there are some read/write operations, etc., that are not parallelized, but caching, etc., should help even here, right? ;)
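For reference, this kind of falloff from the ideal 1/N curve is usually modeled with Amdahl's law. A minimal sketch (the serial fractions below are illustrative assumptions, not measurements from PI) shows how even a small non-parallelized portion flattens the curve:

```python
# Amdahl's law: with serial fraction s, the best possible speedup on
# n cores is 1 / (s + (1 - s) / n). The quantity plotted in the graph,
# T(n)/T(1), is the reciprocal of that speedup.
def time_ratio(serial_fraction, n_cores):
    return serial_fraction + (1.0 - serial_fraction) / n_cores

# Hypothetical serial fractions, just to illustrate the shape:
for s in (0.0, 0.1, 0.3):
    ratios = [round(time_ratio(s, n), 3) for n in (1, 2, 4, 6, 12)]
    print(f"serial fraction {s:.0%}: T(n)/T(1) for n=1,2,4,6,12 -> {ratios}")
```

With even a 10% serial fraction, 12 cores can at best cut the time to 0.175 of the single-core run (a ~5.7x speedup, not 12x).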

The results are disappointing and strange, to say the least...  :-\

Almost no advantage going from one core to two cores...

There is a strange process (ImageIntegration) that spends more time as the number of used cores grows...
There are some other strange processes (MaskedStretch, FFT, HDRW) that, after gaining an advantage from an increasing core count, suddenly start to lose it, and adding new cores leads to an increase in execution time...

At the same time, there are some "normal" processes (StarAlignment, LocalHistogramEqualization) that, as expected and as desired, consistently benefit from more and more cores.

That's it.  O:)

I don't know if this is something easily fixable via settings and preferences (please give me some tips) or a real issue with multi-core scalability.
Anybody can run his/her own experiments, measuring the time PI spends on their favorite processes with different numbers of cores enabled in PI.

For me it's a real issue, because the investment in the new rig hasn't brought the expected gain in performance.

Speed is important, because faster execution lets you spend less time processing, or lets you try more combinations of process parameters, more processing scenarios, etc.
Parallel processing optimization is important for (astronomical) image processing software.

I believe that the skilled PI team will be able to solve this puzzle and deliver at least 5-7x acceleration for a 12-thread CPU vs a single core...


Offline georg.viehoever

  • PTeam Member
  • PixInsight Jedi Master
  • ******
  • Posts: 2132
Re: PixInsight MultiCores CPU Scalability....
« Reply #1 on: 2011 August 18 12:39:42 »
The times you measured, are they "elapsed time" (i.e. time spent by you waiting), or is it something like CPU time?
Did you try this on Linux? Windows is not very good with multiple CPUs...

All I can say for now is that writing software that scales well is very difficult, and sometimes impossible. It is not necessarily the CPU that is the performance bottleneck. Sometimes it is memory bandwidth; sometimes it may be disk speed (maybe that is why ImageIntegration is not terribly good).

Georg


Georg (6 inch Newton, unmodified Canon EOS40D+80D, unguided EQ5 mount)

Offline Nocturnal

  • PixInsight Jedi Council Member
  • *******
  • Posts: 2727
    • http://www.carpephoton.com
Re: PixInsight MultiCores CPU Scalability....
« Reply #2 on: 2011 August 18 12:40:26 »
Hold on, let me grab some popcorn! You could probably pop some on that 970@3.9G :)

Seriously, sorry you're not getting the expected performance from the upgrade. It looks like some processes are perhaps I/O-bound rather than CPU-bound, or not multithreaded, or a combination of both.
Best,

    Sander
---
Edge HD 1100
QHY-8 for imaging, IMG0H mono for guiding, video cameras for occultations
ASI224, QHY5L-IIc
HyperStar3
WO-M110ED+FR-III/TRF-2008
Takahashi EM-400
PixInsight, DeepSkyStacker, PHD, Nebulosity

Offline Yuriy Toropin

  • PixInsight Addict
  • ***
  • Posts: 209
Re: PixInsight MultiCores CPU Scalability....
« Reply #3 on: 2011 August 18 12:46:56 »
George,
>> The times you measured, are they "elapsed time" (i.e. time spent by you waiting), or is it something like CPU time?

It's time reported by PI in the console for the process.

Nocturnal,
No need for popcorn; I'm disappointed, but you're right, the 970 @ 3.9 GHz does make some difference, which is good news!

I understand what you're both saying about real bottlenecks, challenges with scalability, etc.

My only point is to highlight the issue; that way the chances that it will be considered, and that some effort will be invested in it, go up, right?
All of us will benefit then :)

BTW, is this a Win 7 problem, and does Linux handle it better? Any tips that would give even a 20% benefit in speed would be helpful...

Offline Nocturnal

  • PixInsight Jedi Council Member
  • *******
  • Posts: 2727
    • http://www.carpephoton.com
Re: PixInsight MultiCores CPU Scalability....
« Reply #4 on: 2011 August 18 13:04:41 »
Hi Yuriy,

with popcorn I meant "this is going to take a while and be interesting". It's what people eat when they go to the movies here in the US :)

I'm sure you'll agree that the only way to find out why things take as long as they do is to profile the action. That will show whether CPUs are stalled waiting for I/O, for example, or whether CPUs are busy but only in short bursts, repeatedly held up to synchronize.

For example, the debayer module I wrote is multi-thread capable, but the operation is quite simple, so it's possible that while the debayer step goes faster and faster with more cores, you reach a point of diminishing returns where single-threaded actions (going from image to image, for example, and reading/writing them to/from disk) are constant and dominant.

Perhaps you can construct a single 'worst case' test case with image files and .psmx file with the sequence of processing steps for Juan to take a look at.
Best,

    Sander

Offline georg.viehoever

  • PTeam Member
  • PixInsight Jedi Master
  • ******
  • Posts: 2132
Re: PixInsight MultiCores CPU Scalability....
« Reply #5 on: 2011 August 18 13:18:26 »
...
Perhaps you can construct a single 'worst case' test case with image files and .psmx file with the sequence of processing steps for Juan to take a look at.

That would be my recommendation too. Find out which process hurts most in your current processing pipeline, create a typical test case, and give it to the head of our "PixInsight Jedi Council".

Georg

Offline Yuriy Toropin

  • PixInsight Addict
  • ***
  • Posts: 209
Re: PixInsight MultiCores CPU Scalability....
« Reply #6 on: 2011 August 18 13:24:09 »
with popcorn I meant "this is going to take a while and be interesting". It's what people eat when they go to the movies here in the US :)
:) just a side note:
globalization and the internet have spread certain "symbols" across many net cultures.
The reference to "popcorn" is, IMHO, universal slang with the same meaning on many internet forums all over the globe.
BTW, real popcorn is now sold in almost any cinema in almost any country; it's not an exclusively US goodie anymore ;)
***
Georg, it's simple: ImageIntegration with the default parameters is THE CHAMPION of performance degradation as more cores become available...

Offline georg.viehoever

  • PTeam Member
  • PixInsight Jedi Master
  • ******
  • Posts: 2132
Re: PixInsight MultiCores CPU Scalability....
« Reply #7 on: 2011 August 18 13:30:05 »
Given the amount of RAM you have, I would probably first increase the Buffer Size parameter in ImageIntegration. My first experiment would be to increase it to RamSize/numberOfImages/4 (dividing by four to leave your OS sufficient memory for file caches, etc.).
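Plugging in Yuriy's numbers as a sanity check (my own arithmetic in a quick sketch; the MiB units and helper name are assumptions, not ImageIntegration's actual internals):

```python
# Rough per-image buffer budget following the RamSize/numberOfImages/4
# rule of thumb: three quarters of RAM stays free for the OS, file
# caches, and other processes.
def buffer_size_mib(ram_gib, n_images, reserve_factor=4):
    ram_mib = ram_gib * 1024
    return ram_mib / n_images / reserve_factor

# Yuriy's 12 GB rig with, say, a 19-image stack:
print(f"{buffer_size_mib(12, 19):.0f} MiB per image buffer")
```

For 12 GB and 19 images this works out to roughly 160 MiB per buffer, far above typical defaults.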

Georg
 

Offline Juan Conejero

  • PTeam Member
  • PixInsight Jedi Grand Master
  • ********
  • Posts: 7111
    • http://pixinsight.com/
Re: PixInsight MultiCores CPU Scalability....
« Reply #8 on: 2011 August 18 13:50:40 »
Hi Yuriy,

Thank you for the report. Indeed you have helped me find a weakness in the ImageIntegration tool. However, my tests don't confirm your results:

- Machine: DELL Precision M6500 laptop, Intel Core i7 Q820 (4-core w/HT = 8 logical processors), 8 GB DDR3 1300.
- OS: Fedora 14 Linux (unoptimized, almost default settings and services running).
- Desktop: KDE 4.6.5
- PixInsight version: 1.7.0.702 Starbuck

Results for ImageIntegration (module version 1.5.0.100):

- 19 2048x2048 FITS images previously registered with StarAlignment.
- All images already in ImageIntegration's cache.
- All parameters by default except rejection algorithm.

Results for the best four runs out of 8, times in seconds (from PI's console):

Min/Max Clipping
8 threads: 18.264, 18.346, 18.385, 18.413
4 threads: 15.369, 14.878, 15.199, 15.640
2 threads: 17.067, 17.424, 17.865, 19.546

Winsorized Sigma Clipping
8 threads: 22.711, 22.646, 22.407, 22.632
4 threads: 20.525, 20.389, 19.800, 20.507
2 threads: 22.363, 22.557, 25.083, 22.564

Linear Fit Clipping
8 threads: 30.356, 29.863, 29.971, 30.266
4 threads: 31.642, 31.378, 31.709, 32.738
2 threads: 39.167, 40.419, 39.492, 38.316

As you can see, on my machine ImageIntegration doesn't scale as badly (inversely) as in your test. There is a slight improvement for min/max and WSC when the number of threads equals the number of physical cores, and the linear fit algorithm scales slowly but uniformly. The differences are very small, though.
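Expressed as ratios against the 2-thread runs (the same style as the graph in the first post; just a quick sketch over the best-of-four times listed above):

```python
# Times in seconds from the runs above, keyed by thread count.
runs = {
    "min_max":    {2: [17.067, 17.424, 17.865, 19.546],
                   4: [15.369, 14.878, 15.199, 15.640],
                   8: [18.264, 18.346, 18.385, 18.413]},
    "winsorized": {2: [22.363, 22.557, 25.083, 22.564],
                   4: [20.525, 20.389, 19.800, 20.507],
                   8: [22.711, 22.646, 22.407, 22.632]},
    "linear_fit": {2: [39.167, 40.419, 39.492, 38.316],
                   4: [31.642, 31.378, 31.709, 32.738],
                   8: [30.356, 29.863, 29.971, 30.266]},
}

def scaling_ratios(by_threads, baseline=2):
    # Ratio of best time at n threads to best time at the baseline count;
    # values below 1 mean the extra threads actually helped.
    base = min(by_threads[baseline])
    return {n: min(times) / base for n, times in by_threads.items()}

for name, by_threads in runs.items():
    r = scaling_ratios(by_threads)
    print(f"{name}: 4 threads -> {r[4]:.2f}x, 8 threads -> {r[8]:.2f}x")
```

The ratios confirm the pattern: min/max and Winsorized improve at 4 threads but regress above 1.0 at 8, while linear fit keeps improving, though far from 1/N.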

These numbers show that there's room for improvement in the ImageIntegration tool. My implementation should perform much better, and I'll do my best to improve it for the next release.

As for other items in your test, just a few comments:

- In the StarAlignment tool, just a few parts are parallelizable. These are mainly the RANSAC point matching routine and the pixel interpolation routines. Most of the rest basically runs on one processor. To really improve this tool, a high-level parallelization must be implemented (just as I've done in the ImageCalibration tool). I'll try to do this for the next release as well.

- I'm surprised with your results for the HDRWT and FFT tools. Let me make some tests before saying something conclusive.

- MaskedStretch is a script, and the JavaScript interpreter runs on a single processor. It calls low-level routines that have been parallelized, however, so your results (inverse scaling) are also quite surprising. I also need to carry out some detailed tests here.

- Keep in mind that many tools are complex and composed of many subtasks that cannot be parallelized. For example, if you try relatively simple tools that perform massive pixel crunching, such as CurvesTransformation, HistogramTransformation, ACDNR and UnsharpMask, among others, you should get results much closer to ideal scaling.

Again, thank you for showing me where I am not doing things well. I'll try to work more and better. This is the only way to improve, and PixInsight is here for this reason.
Juan Conejero
PixInsight Development Team
http://pixinsight.com/

Offline Sean

  • PixInsight Addict
  • ***
  • Posts: 144
    • My Personal Website
Re: PixInsight MultiCores CPU Scalability....
« Reply #9 on: 2011 August 18 13:52:57 »
Hi Yuriy,

This surprises me.

I don't know if you saw my recent post about my upgrade from a 2.4GHz Macbook Pro (Dual Core, no hyperthread), 6GB of RAM to an iMac 3.4GHz i7, 4 cores, 8 threads, only 4GB of RAM. Details are here: http://pixinsight.com/forum/index.php?topic=3330.0

I measured actual execution times from the PI console and saw an almost 10x speedup in image calibration, registration, and integration. In both cases I tested with PI 1.7, 64-bit. Granted, this is on a completely different platform, but you should be seeing a bigger performance increase than you are. If you like, I could run some further tests with different numbers of cores, although it might not be that meaningful on the iMac.

I would be very curious to see the results of any further profiling that you can do.

Sean


Offline sigurd

  • Newcomer
  • Posts: 35
    • The Lambda Conspiracy
Re: PixInsight MultiCores CPU Scalability....
« Reply #10 on: 2011 August 18 15:40:17 »
Just as an aside, hyperthreading per se doesn't really help. As Juan demonstrated, optimal processor utilization and performance should be found at the point where threads = cores. More threads without cores to back them up simply increase context-switching overhead (and steal CPU cycles from the processes you actually care about). This mostly impacts highly CPU-bound processing (of which PixInsight is a perfect example).

-esy
”My punctuality is well known. When The Revolution takes place, I'll be late, and I'll be shot as a traitor.”

Offline Sean

  • PixInsight Addict
  • ***
  • Posts: 144
    • My Personal Website
Re: PixInsight MultiCores CPU Scalability....
« Reply #11 on: 2011 August 18 17:23:14 »
It depends on the application, doesn't it? Here's one recent reference I found showing a 25% improvement with HT on a 4-core i7 processor:

"I have an i7 920 (4 physical, 8 virtual cores with HT) on Windows 7 64-bit. I ran a Monte Carlo simulation on a single thread/core, then 4 (without HT) and then 8 (with HT). I do not remember the exact times now but I was able to perform about 5 times more simulations in the same time with HT, compared to when using a single core.
That is with HT it was like effectively having an extra physical core, or 25% more power. "

I realize it's very app-specific, but doesn't hyperthreading help when there's a core stall due to a processor cache miss, branch misprediction, etc.?

Offline sigurd

  • Newcomer
  • Posts: 35
    • The Lambda Conspiracy
Re: PixInsight MultiCores CPU Scalability....
« Reply #12 on: 2011 August 18 18:22:16 »
Certainly it can help. And, as you say, it depends mightily on the way the application is written. I'm just trying to set expectations a little. What you are describing is exactly where hyperthreading can help: when there's a stalled core (cache miss, bad branch prediction, etc.). Generally one could reasonably expect 25-30% with a "normal" workload. However, in a heavily loaded situation where there is no lack of CPU demand, it can actually slow things down a bit. I think this is exactly what Juan's numbers showed. If I were guessing (and I am :)) I'd say the linear fit clipping is rather less processor-intensive than sigma or min/max clipping. Note that even in the case of the linear fit we see only an 8% improvement with a putative doubling of the core count. Conversely, in the two scenarios where it impedes efficiency, it is an 18% and a 10% hit, respectively.

It simply isn't reasonable to assume that hyperthreading improves processing in the linear fashion that actual cores can with a properly decomposed and scheduled algorithm.

-esy
« Last Edit: 2011 August 18 19:17:08 by sigurd »
”My punctuality is well known. When The Revolution takes place, I'll be late, and I'll be shot as a traitor.”

Offline pfile

  • PTeam Member
  • PixInsight Jedi Grand Master
  • ********
  • Posts: 4729
Re: PixInsight MultiCores CPU Scalability....
« Reply #13 on: 2011 August 18 18:37:23 »
where hyperthreading can help: when there's a stalled core (cache miss, bad branch prediction, etc.). Generally one could reasonably expect 25-30% with a "normal" workload. However, in a very loaded situation where there is no lack of CPU demand, it can actually slow things down a bit...

It simply isn't a reasonable surmise that hyperthreading improves processing in the linear fashion actual cores can with a properly decomposed and scheduled algorithm.

-esy

yep...  a bad branch prediction can cost 10s-100s of cycles and i/o can take 1000s to 100,000s depending on what happens. it's not magic, it's just an ultra-fast context switch so that another thread in the same process can run immediately on the same core. there's only one copy of the compute resources; what's duplicated is all the state of the processor core. it's a way to increase the utilization of silicon that would otherwise go idle while stalled on i/o. yeah, the scheduler could swap the whole process out while it's blocked on i/o but that's a very heavyweight procedure. a hyperthread context switch is fast.

this stuff all comes out of the university of washington's simultaneous multithreading project (SMT). intel really picked it up and ran with it. check out their papers here: http://www.cs.washington.edu/research/smt/

i guess though, with respect to ImageIntegration: is it not "embarrassingly parallel"? that is, the image can be decomposed into a number of rectangles equal to the number of processors, and regardless of the rejection algorithm the entire process should scale reasonably linearly with the number of cores... or is it just way too i/o bound?
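a rough sketch of that decomposition idea (generic NumPy/threading code, not PixInsight's actual implementation; the toy min/max rejection just stands in for a real rejection algorithm):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def integrate_strip(strip):
    # strip has shape (n_images, strip_height, width). Toy min/max
    # rejection: average each pixel stack after dropping its lowest
    # and highest value.
    total = strip.sum(axis=0) - strip.min(axis=0) - strip.max(axis=0)
    return total / (strip.shape[0] - 2)

def integrate(stack, n_workers=4):
    # Split the image height into one strip per worker. Each strip is
    # independent, so the work distributes with no synchronization;
    # NumPy releases the GIL in these ops, so threads can overlap.
    strips = np.array_split(stack, n_workers, axis=1)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = list(pool.map(integrate_strip, strips))
    return np.concatenate(parts, axis=0)

rng = np.random.default_rng(0)
stack = rng.random((19, 256, 256), dtype=np.float32)  # 19 registered frames
print(integrate(stack).shape)  # (256, 256)
```

the pixel math scales this way; the part that doesn't is getting 19 frames off the disk in the first place, which may well be the dominant cost.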



Offline Nocturnal

  • PixInsight Jedi Council Member
  • *******
  • Posts: 2727
    • http://www.carpephoton.com
Re: PixInsight MultiCores CPU Scalability....
« Reply #14 on: 2011 August 18 18:43:41 »
I guess it would be interesting to see a proposal from Juan that would include a few processes to run, with expected scalability based on the algorithm and resource use. We could then try to replicate the results. Sometimes the way PI uses resources is a bit surprising. I was running 1000 deconvolution iterations on a relatively small image while watching CPU utilization. It peaked around 17%. 4 HT cores were less than half occupied; the other 4 were idle. So it appears that some processes don't have 'infinite' multithreading but rather divide the job into a fixed number of parts that run in different threads. Just speculating, of course.
Best,

    Sander