Author Topic: PixInsight Benchmark  (Read 64249 times)

Offline pja

  • Newcomer
  • Posts: 23
    • Flickr
Re: PixInsight Benchmark
« Reply #90 on: 2015 February 02 10:55:34 »
George: Thanks for the reply. I should have read a few pages close to the last page. :) It is clear to me that for the processes I am interested in, I just need to focus on the CPU benchmark.

Puck
Orion XT8 F6, 8" F3.9, SV80ST-25SV on Atlas + QSI683wsg,  http://www.flickr.com/photos/95673005@N06/, http://www.astrobin.com/users/puckja/

Offline slang

  • Member
  • *
  • Posts: 60
Re: PixInsight Benchmark
« Reply #91 on: 2015 February 18 03:59:13 »
Hey.

Given my endless pursuit of more performance, I have upgraded my machine. The upgrade and results are interesting, so I thought I would post them here.

Machine specs are basically the same as below:
Code: [Select]
RAM:    24Gbyte
DRIVES: 3 x WD Raptor 150Gbyte, 4 entries per drive in PI General Preferences File Locations
MOBO:   Gigabyte EX58-UD4P

The big difference is that I splashed out (US$80) on a second-hand Xeon X5650, which dropped into my mobo without any hassle or changes. (Well, I had to reset the CMOS to remove the previous overclocking parameters, and the manual didn't list the CPU as supported, but others have done the same upgrade successfully.) That gives a cool 6 cores and 12 threads (vs. 4/8 on the previous i950). Although the clock speed drops from 3.07GHz to 2.66GHz, the extra cores, internal chip optimisations (more cache) and a faster memory controller make it fly even more.

Code: [Select]
Execution Times
Total time ............. 01:06.83
CPU time ............... 01:02.50
Swap time .............. 00:04.29
Swap transfer rate ..... 3862.641 MiB/s

Performance Indices
Total performance ......  7038
CPU performance ........  6056
Swap performance ....... 21393

So, a relatively cheap upgrade. Perhaps in a few years the X5670 will be similarly priced, and I'll do the same again. It appears to be stable, which helps.

Oh, the X5650 uses less power than the previous i950, so the same heatsink works OK. Details: http://www.pixinsight.com/benchmark/benchmark-report.php?sn=3U5RI278XBQL9K45YZOPGW3UG56E9PU5

Cheers -

Hiya.

This is indeed a fascinating topic, and very timely for me. I recently acquired a hand-me-down PC with a quad-core 3.07GHz CPU, 24GByte of RAM and a stack of WD Raptor drives. Very lucky and very grateful am I. I'm (obviously, as there is no greater need) dedicating this machine to PI image processing.

So, I'm following these threads and trying to understand the CPU, memory and disk trade-offs. It seems that a really good idea (if you have a bucketload of RAM) is to keep the files in RAM, or at least the temp/scratch files. I use Linux (Ubuntu 12.04), and Linux has some 'knobs' to tweak in this regard.

I haven't looked at setting up a RAM disk - that seems a bit coarse, though entirely do-able; I wanted something a bit smarter than a RAM disk.
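For completeness, a plain RAM disk would just be a tmpfs mount added to PI's swap file locations. A minimal sketch (the size and mount point are only examples, not something I've actually set up):
Code: [Select]
# Minimal tmpfs 'RAM disk' - contents vanish on unmount or reboot.
sudo mkdir -p /mnt/pi-ramdisk
sudo mount -t tmpfs -o size=16G,mode=1777 tmpfs /mnt/pi-ramdisk
# Then add /mnt/pi-ramdisk to the swap storage directories in PI's preferences.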

What I have done is investigate some kernel tuning parameters in /etc/sysctl.conf. There is a lot of information on this around the net, but http://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/ is a good description.

WARNING: This can be dangerous - do your own research, your mileage may vary, and no-one else will be accountable for any negative results.

I have set these as follows;
vm.dirty_background_ratio = 80
vm.dirty_ratio = 80
vm.dirty_writeback_centisecs = 1500

Descriptions;
vm.dirty_background_ratio is the percentage of system memory that can be filled with "dirty" pages - memory pages that still need to be written to disk - before the background flusher processes kick in and start writing them out
vm.dirty_ratio is the absolute maximum percentage of system memory that can be filled with dirty pages before everything must get committed to disk (processes writing data will block at this point)
vm.dirty_writeback_centisecs is how often the kernel's flusher threads wake up to write dirty data out to disk, in hundredths of a second (1500 = every 15 seconds)

So, what this actually means is that the kernel will allow up to 80% of RAM (in my case ~19GByte) to fill up with dirty disk cache. When it gets full, it WILL be written to disk, and the flusher threads wake up every 15 seconds to write out whatever is pending.
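For anyone wanting to try this, the values can be applied and checked without a reboot; the commands below are plain sysctl usage, nothing PI-specific:
Code: [Select]
# Load everything from /etc/sysctl.conf ...
sudo sysctl -p
# ... or set the values for the current boot only:
sudo sysctl -w vm.dirty_background_ratio=80 vm.dirty_ratio=80 vm.dirty_writeback_centisecs=1500
# Check what the kernel is actually using:
sysctl vm.dirty_background_ratio vm.dirty_ratio vm.dirty_writeback_centisecs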

Obviously, there is a huge risk with this - a power outage could cause a large loss of data. This isn't an issue (generally) for PI temp files, but these settings are system-wide, so if your system is doing other stuff... caveat emptor, as my economics teacher used to say.

Anyway. I don't profess to be an expert at all; it has been 15 years since I last attempted to tune a kernel to this degree. But the results are quite stunning in my case, and the PI benchmark is exceptionally useful for understanding the benefits of system tuning.

Now I have 3 x 146GByte WD Raptor drives (10,000 rpm), each dedicated to temp storage, formatted ext4 with journalling, atime, diratime and something else turned off, so they're pretty quick by themselves. The benchmark results are:

Standard system:
Execution Times
Total time ............. 01:46.11
CPU time ............... 01:24.50
Swap time .............. 00:21.57
Swap transfer rate ..... 768.381 MiB/s

Performance Indices
Total performance ...... 4433
CPU performance ........ 4480
Swap performance ....... 4256


With kernel tuning (as above):

Execution Times
Total time ............. 01:27.97
CPU time ............... 01:23.02
Swap time .............. 00:04.90
Swap transfer rate ..... 3385.359 MiB/s

Performance Indices
Total performance ......  5347
CPU performance ........  4559
Swap performance ....... 18750


So, that's a massive improvement in performance just by letting the system use much of the RAM as disk cache. (From memory, if applications require the RAM, they take precedence over this disk caching...)

Some notes:
* Interestingly, despite putting these in /etc/sysctl.conf, they do not persist across a reboot - I need to issue sysctl -p for them to take effect. Must investigate that.
* Whilst this does start to look like tuning to beat a benchmark, I have tested this config in some real-world tests. Running BatchPreProcessing on 25 x .cr2 files (12.2MP), including integration (but using master bias/dark etc.), showed a massive improvement. Monitoring actual disk i/o at the time showed that there was NO disk i/o during this process. This is a typical set of my inputs, so I call this close to 'real world' for me.
* When the cache does get full, or the system does decide to actually flush the stuff, you may be in for a wait - that's the trade-off/risk here.
* This config seems to beat using SSDs in my case, although not that many desktops would support 24GByte of RAM, so maybe this is academic/theoretical?
* ext4 without journalling, mounted with defaults,data=writeback,noatime,nodiratime, also seems to make a pretty decent difference (a sketch of what I mean is below this list)
* The results can be a little inconsistent, sometimes the benchmark will result in ZERO actual disk i/o, other times there will be a little bit here and there
* Oh, I commissioned an old UPS I had lying around - should give between 30 seconds and 5 minutes of protection ;-)
* Also acquired some caching SATA/SAS controllers which should help with speed when something actually needs to hit a disk platter, but haven't bothered yet - why would I, given these results?
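To expand on the ext4 bullet above, this is roughly what the setup looks like; the device name and mount point are placeholders, and it's only sensible for scratch data you can afford to lose:
Code: [Select]
# Sketch of an /etc/fstab line for one of the temp-only drives:
/dev/sdb1  /mnt/pi-temp1  ext4  defaults,data=writeback,noatime,nodiratime  0  0

# Removing the ext4 journal is a separate, offline step on an unmounted filesystem
# (some kernel versions then ignore or reject the data= option, so test the combination):
sudo tune2fs -O ^has_journal /dev/sdb1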

Is anyone else doing anything similar? How dangerous is this? Am I mad? (OK, don't answer the last one...)

Cheers -
--
Mounts: Orion Atlas 10 eq-g, Explore Scientific G11-PMC8
Scopes: GSO RC8, Astrophysics CCDT67, ES FCD100-80, TSFLAT2
Guiding: ST80/QHY OAG/QHY5L-II-M
Cameras: Canon EOS 450D (IR Mod), QHY8L, QHY163m/QHYFW2-US/Astronomik LRGBHaSiiOii

Offline cs_pixinsight

  • PixInsight Addict
  • ***
  • Posts: 156
Re: PixInsight Benchmark
« Reply #92 on: 2015 February 18 11:09:20 »
slang, I have the same motherboard but only have 12GB of ram (6 X 2GB).  I've been toying with a memory upgrade, but have passed on the higher density memory (8GB sticks) thinking this old MOBO wouldn't support it.  I'm wondering how you reached the 24GB limit of this board (6 X 4GB, 3 X 8GB) and which memory you are currently using?

Thanks, Craig

Offline geomcd1949

  • Newcomer
  • Posts: 16
Re: PixInsight Benchmark
« Reply #93 on: 2015 February 18 12:57:57 »
I don't do astrophotography, but I enjoy building computers in retirement. I was recently asked to build a machine specifically for PixInsight use, with instructions to deliver "the most bang for the buck." I'm sharing what I came up with in the hope that it might aid others in building machines, and also so that, through discussion, any errors in my reasoning might be pointed out.

1. Operating system: Linux. Because almost all the fastest benchmarked machines use Linux. Also, it's free.

2. CPU: More cores are better (faster) than fewer. Four are good, six are better, eight are great, and 12 are fantastic. A six-core is optimal, on the bang-for-the-buck theory, with eight- and 12-core processors showing diminishing returns.

3. RAM: More is better, because the more memory, the more ramdisks that can be used, while still leaving a sufficient amount of RAM for the CPU to use. My configuration is two 8GB ramdisks with 16 GB left over.

4. Hard drives: The OS and PixInsight are placed on an SSD [Samsung 840 EVO 500GB], and less expensive HDDs are used to store data.

5. Power supply: Modular. The "System Build" page on pcpartpicker.com gives the estimated wattage needed to power a given system's components. I used a Platinum-certified unit for efficiency. Because the PSU uses only the necessary cables, the inside of the case is less cluttered, which allows for better airflow.

6. Cabling: There is disagreement over whether SATA 3Gb/s cables can carry data as fast as SATA 6Gb/s cables. To be safe, I used 6Gb/s cables, and the shorter, the better. Round cables allow greater airflow than flat cables. I also believe it is important that the cables and their connection points be clean, and that care be taken to ensure the cables are securely connected and can't work loose.

7. Cooling: I used an aftermarket heatsink with a 140mm fan to cool the CPU. There are three 140mm intake fans and three similar exhaust fans, plus the power-supply fan, which blows outward. Heat is the enemy of electrical efficiency. With all six cores running at 100%, no temperature was higher than 125 °F (52 °C). I noticed that when the computer was moved from a 72 °F room to a 64 °F room, the benchmark score increased about 1%.

8. Overclocking: I didn't do it, only because I don't know how to. I see that many benchmarked machines are overclocked, and the resulting benchmarking scores are significantly higher.

To illustrate the increase in benchmark speeds when using a six-core CPU instead of four-core models, here are results from similarly-configured machines:

Intel i7-3770 [4 cores] with 16GB RAM: 7056
Intel i7-3770 [4 cores] with 32GB RAM: 7360
Intel i7-4930K [6 cores] with 32GB RAM: 10341

Offline georg.viehoever

  • PTeam Member
  • PixInsight Jedi Master
  • ******
  • Posts: 2132
Re: PixInsight Benchmark
« Reply #94 on: 2015 February 18 13:37:46 »
The key reason why the 4930K performs better is not the additional cores, but rather the four memory channels (4 instead of 2) and the larger CPU cache (12MB instead of 8MB). Try deactivating the two additional cores, and you'll see what I mean. Modern CPUs starve for memory bandwidth in operations such as image processing.

Georg
Georg (6 inch Newton, unmodified Canon EOS40D+80D, unguided EQ5 mount)

Offline slang

  • Member
  • *
  • Posts: 60
Re: PixInsight Benchmark
« Reply #95 on: 2015 February 18 16:50:38 »
Hiya.

My mobo has 6 x 4GByte of 1333MHz RAM. That was how I inherited it, and it seems to work well. The Xeon family (well, the X5650) has a faster memory controller than my previous i950, so it is possible that part of the speed jump was memory-speed related. 24GByte is the max for this mobo, and I'm not sure whether any 8GByte modules would work at all - I suspect it supports only 4GByte modules.

Cheers -

slang, I have the same motherboard but only have 12GB of ram (6 X 2GB).  I've been toying with a memory upgrade, but have passed on the higher density memory (8GB sticks) thinking this old MOBO wouldn't support it.  I'm wondering how you reached the 24GB limit of this board (6 X 4GB, 3 X 8GB) and which memory you are currently using?

Thanks, Craig
--
Mounts: Orion Atlas 10 eq-g, Explore Scientific G11-PMC8
Scopes: GSO RC8, Astrophysics CCDT67, ES FCD100-80, TSFLAT2
Guiding: ST80/QHY OAG/QHY5L-II-M
Cameras: Canon EOS 450D (IR Mod), QHY8L, QHY163m/QHYFW2-US/Astronomik LRGBHaSiiOii

Offline vicent_peris

  • PTeam Member
  • PixInsight Padawan
  • ****
  • Posts: 988
    • http://www.astrofoto.es/
Re: PixInsight Benchmark
« Reply #96 on: 2015 March 01 05:40:40 »
Hi all,

Since I bought my new workstation I've been researching a lot into how to optimize multi-node computers, because at this moment PixInsight is not optimized for this kind of infrastructure (Non-Uniform Memory Access, a.k.a. NUMA). The main problem arises with tools that have highly parallelized code and, at the same time, complex data structures. In these tools, like TGV, performance is terribly degraded under NUMA (which is activated by default in the BIOS, of course). By deactivating NUMA in the BIOS and in the kernel (adding the kernel parameter numa=off), the CPU performance index of the benchmark rises from 12400 to 13800 and TGV runs 250% faster.
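In case it helps anyone reproduce this, these are the two pieces involved (a sketch; it assumes a GRUB-based distro, and the topology output will obviously differ between machines):
Code: [Select]
# See what the kernel currently reports (NUMA nodes, memory per node, node distances):
numactl --hardware

# To boot with NUMA disabled, add numa=off to GRUB_CMDLINE_LINUX in /etc/default/grub,
#   GRUB_CMDLINE_LINUX="... numa=off"
# then regenerate the GRUB configuration and reboot:
sudo update-grub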

But I suspect I can further optimize the workstation, because the core usage shown in KSysGuard is only around 60 - 70% while running TGV. Moreover, I'm interested in using the NUMA architecture to be able to run several BPP jobs at the same time, dedicating specific cores and memory to each single PixInsight instance.

I've tried using the numactl command and activating NUMA support in the kernel. While I can control the execution of simple applications, it doesn't work with PixInsight. The main problem seems to be that the threads created by the application change PID continuously. If you list the /proc/$PID/ directory you'll see the generated threads; they change at a rate faster than once per second (which is also not good, I suspect). Moreover, numactl does not work with PIDs, only with commands, so there is no way to control the execution of these threads.

I also tried controlling the execution of a single thread with taskset, which works fine, but taskset does not manage memory access. OTOH, taskset works fine on my i7 laptop simply by using it to launch PixInsight, whereas on the workstation it only works by executing taskset on the specific thread PID.
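For reference, these are the kinds of taskset invocations I mean (the core range and the PID are just examples):
Code: [Select]
# Launch a command pinned to cores 0-5:
taskset -c 0-5 <command>

# Query or change the affinity of an already running process or thread by its PID/TID:
taskset -cp 12345        # show the current CPU affinity of PID 12345
taskset -cp 0-5 12345    # pin PID 12345 to cores 0-5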

Is there anything I'm missing? Any mysterious additional kernel parameter?


Thanks,
Vicent.

Offline georg.viehoever

  • PTeam Member
  • PixInsight Jedi Master
  • ******
  • Posts: 2132
Re: PixInsight Benchmark
« Reply #97 on: 2015 March 01 06:43:53 »
Hi Vicent,

It is interesting that you are looking into this. I have also had a few looks into PI performance - so far with inconclusive results. Here are some findings from looking at PI and other programs:

- A problem is that PI has built its own thread management tools (not OpenMP, not TBB, not PThreads, ...), and the "interesting" parts are hidden behind the API wall (for instance Thread::Start()), with no source available. That makes it difficult to understand what is really going on.
- In multi-core scenarios, in particular with NUMA, memory access patterns are often more important than the use of cores. For instance, the way that memory blocks are allocated to different CPU sockets is quite important. Unfortunately, you have very limited influence on this from outside of the code, and again some of the important bits are hidden behind the API. numactl --interleave=... might produce some effect (a sketch follows below this list).
- Optimizing for NUMA is a difficult topic. Optimizations that work nicely on one workstation cause slowdowns on others, or cause unexpected slowdowns in some scenarios. For instance, using thread binding sometimes helps, but as soon as there is a second load on the system, everything slows down dramatically because the OS cannot move threads to less loaded cores. I strongly recommend avoiding these tricks unless they show considerable benefit across a number of scenarios and workstations.
- If you optimize for performance using the PI benchmark: it spends almost 50% of its CPU time in Bicubic*Interpolation::operator(). So whatever you measure is heavily biased toward these operations.
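As a concrete example of the interleave experiment mentioned above (just a starting point, not a guaranteed win; <command> stands in for however you launch PixInsight):
Code: [Select]
# Spread the process's memory allocations round-robin across all NUMA nodes:
numactl --interleave=all <command>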

Georg
Georg (6 inch Newton, unmodified Canon EOS40D+80D, unguided EQ5 mount)

Offline vicent_peris

  • PTeam Member
  • PixInsight Padawan
  • ****
  • Posts: 988
    • http://www.astrofoto.es/
Re: PixInsight Benchmark
« Reply #98 on: 2015 March 01 07:29:48 »
Thanks Georg,

I think the scenario of running several PixInsight instances is easier than trying to optimize the performance of a single instance from outside the code (which will be useless in the end, IMO). What I'm looking for is a way to run maybe 8 - 10 PixInsight instances at the same time, each running different processes and each one using its local node memory.

Anyway, you actually get some performance increase when you run several instances even if they all run on every core; this should be because the scheduler fills the gaps left by latencies in the individual threads. :-P


Best regards,
Vicent.

Offline georg.viehoever

  • PTeam Member
  • PixInsight Jedi Master
  • ******
  • Posts: 2132
Re: PixInsight Benchmark
« Reply #99 on: 2015 March 01 10:26:32 »
If you want to run several instances of PI concurrently, and want to see the highest possible aggregate performance, here is what you most likely need to do:

- limit use of cores to one core/PI instance (in the Global Preferences dialog)
- run one instance of PI per core
- use numactl to bind each instance to its private core
- use numactl to make sure that memory is allocated to the socket that owns the core.

If your set of processes contains a lot of I/O (disk/network), it may be beneficial to run more than one instance per core, and not to do core binding.
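A rough sketch of the first recipe, assuming a two-socket machine where cores 0-3 belong to node 0; the core/node numbers and the PixInsight launch command are placeholders, so check numactl --hardware for your actual layout:
Code: [Select]
# One PixInsight instance per core, with CPU and memory bound to the socket owning that core.
# Remember to also limit each instance to one thread in the Global Preferences dialog.
node=0
for core in 0 1 2 3; do
    numactl --physcpubind=$core --membind=$node PixInsight &
done
wait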

It's really difficult to find the optimal configuration for software that has such a high diversity of operations as PI does.

Georg
Georg (6 inch Newton, unmodified Canon EOS40D+80D, unguided EQ5 mount)

Offline vicent_peris

  • PTeam Member
  • PixInsight Padawan
  • ****
  • Posts: 988
    • http://www.astrofoto.es/
Re: PixInsight Benchmark
« Reply #100 on: 2015 March 01 10:44:54 »
- use numactl to make sure that memory is allocated to the socket that owns the core.

How can I do this??

V

Offline georg.viehoever

  • PTeam Member
  • PixInsight Jedi Master
  • ******
  • Posts: 2132
Re: PixInsight Benchmark
« Reply #101 on: 2015 March 01 11:50:08 »
- use numactl to make sure that memory is allocated to the socket that owns the core.

How can I do this??

V
numactl man page:
Code: [Select]
--membind=nodes, -m nodes
    Only allocate memory from nodes. Allocation will fail when there is not enough memory available on these nodes. nodes may be specified as noted above.
Georg (6 inch Newton, unmodified Canon EOS40D+80D, unguided EQ5 mount)

Offline vicent_peris

  • PTeam Member
  • PixInsight Padawan
  • ****
  • Posts: 988
    • http://www.astrofoto.es/
Re: PixInsight Benchmark
« Reply #102 on: 2015 March 01 11:55:05 »
- use numactl to make sure that memory is allocated to the socket that owns the core.

How can I do this??

V
numactl man page:
Code: [Select]
--membind=nodes, -m nodes
    Only allocate memory from nodes. Allocation will fail when there is not enough memory available on these nodes. nodes may be specified as noted above.


Sorry, I meant: how can I check that numactl is actually binding the instance to the local memory?

Offline georg.viehoever

  • PTeam Member
  • PixInsight Jedi Master
  • ******
  • Posts: 2132
Re: PixInsight Benchmark
« Reply #103 on: 2015 March 01 12:24:41 »
...
Sorry, I mean how can I check that numactl is effectively binding the instance to the local memory?
Frankly, I don't know how to do this from the "outside". I guess it would be possible to find this information somewhere in the /proc filesystem. I once did this from C, from within a process, and numactl worked as advertised.

Georg
Georg (6 inch Newton, unmodified Canon EOS40D+80D, unguided EQ5 mount)

Offline georg.viehoever

  • PTeam Member
  • PixInsight Jedi Master
  • ******
  • Posts: 2132
Re: PixInsight Benchmark
« Reply #104 on: 2015 March 01 12:29:46 »
It appears that "cat /proc/<PID>/numa_maps" gives the information you are seeking. See "man numa". Note that I don't have a way to test this right now.
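Something like the following should show the placement (a sketch; <PID> is a placeholder for the PixInsight process id, and the N0=/N1= fields are pages per NUMA node):
Code: [Select]
# Per-mapping NUMA placement of a running process:
cat /proc/<PID>/numa_maps

# Rough per-node page totals summed over all mappings:
grep -o 'N[0-9]*=[0-9]*' /proc/<PID>/numa_maps | awk -F'[N=]' '{p[$2]+=$3} END {for (n in p) print "node " n ": " p[n] " pages"}'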
Georg
Georg (6 inch Newton, unmodified Canon EOS40D+80D, unguided EQ5 mount)