fixed 2nd CPU not recognized/utilized

drizf,
There is no solution yet. Astro-Not will be writing some test code in C++ to validate the use of all cores and CPUs soon. I think that will be the next step in trying to figure this out.

Thanks.

Thank you very much for your reply. I hope this problem can be solved as soon as possible. Thank you again for the efforts of the PixInsight team!
 
I was unable to reproduce this with a quick-and-dirty C++ program. I'll need feedback from someone on the PixInsight team regarding which multi-threading library PixInsight uses.
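(For reference, here is a minimal sketch of the kind of quick multi-core test described above. The original program was not posted, so this is only an illustration, not the actual test code: it spawns one busy thread per logical processor the C++ runtime reports and lets you watch overall CPU usage in Task Manager. Note that on a Windows machine with more than 64 logical processors, both hardware_concurrency() and the default process affinity may be limited to a single processor group, which is the behavior discussed later in this thread.)

```cpp
// Hedged sketch of a "quick and dirty" multi-core test (not the original program,
// which was not posted): spawn one busy-loop thread per logical processor the
// C++ runtime reports, then watch overall CPU usage in Task Manager for ~30 s.
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

int main()
{
    unsigned n = std::thread::hardware_concurrency(); // logical processors visible to this process
    std::printf("Spawning %u busy threads\n", n);

    std::atomic<bool> stop{false};
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < n; ++i)
        workers.emplace_back([&stop]
        {
            volatile double x = 1.0;
            while (!stop.load(std::memory_order_relaxed)) // burn CPU until told to stop
                x = x * 1.0000001 + 1.0;
        });

    std::this_thread::sleep_for(std::chrono::seconds(30)); // observe CPU usage now
    stop = true;
    for (auto& t : workers)
        t.join();
    return 0;
}
```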
 

Thread support in the PixInsight core application is based on the QThread class. To acquire the total number of logical processors available on the host machine we use the GetLogicalProcessorInformation Win32 API function on Windows. I'll review this function's documentation since I see some changes have been made in recent Windows versions. However, after a quick read I haven't seen anything that would invalidate our current implementation.
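(Editor's illustration, not PixInsight's implementation: the sketch below shows how GetLogicalProcessorInformation can be used to count logical processors, compared against GetActiveProcessorCount(ALL_PROCESSOR_GROUPS). Per Microsoft's documentation, GetLogicalProcessorInformation only reports processors in the calling thread's processor group, so on a system with more than 64 logical processors the two counts can differ, which is relevant to the reports later in this thread.)

```cpp
// Minimal sketch (not PixInsight's code): count logical processors with
// GetLogicalProcessorInformation and compare with GetActiveProcessorCount, which
// spans all processor groups on machines with more than 64 logical processors.
#include <windows.h>
#include <cstdio>
#include <vector>

static int CountViaLogicalProcessorInformation()
{
    DWORD bytes = 0;
    GetLogicalProcessorInformation(nullptr, &bytes); // first call: query required buffer size
    std::vector<SYSTEM_LOGICAL_PROCESSOR_INFORMATION> info(bytes / sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION));
    if (!GetLogicalProcessorInformation(info.data(), &bytes))
        return -1;

    int logical = 0;
    for (const auto& item : info)
        if (item.Relationship == RelationProcessorCore)
            for (ULONG_PTR mask = item.ProcessorMask; mask != 0; mask >>= 1)
                logical += int(mask & 1); // count logical CPUs in this physical core
    return logical;
}

int main()
{
    // GetLogicalProcessorInformation only sees the calling thread's processor
    // group, so on a >64-logical-CPU Windows system it can return at most 64,
    // while GetActiveProcessorCount(ALL_PROCESSOR_GROUPS) is machine-wide.
    std::printf("GetLogicalProcessorInformation : %d\n", CountViaLogicalProcessorInformation());
    std::printf("GetActiveProcessorCount(ALL)   : %lu\n",
                (unsigned long)GetActiveProcessorCount(ALL_PROCESSOR_GROUPS));
    return 0;
}
```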

Unfortunately, testing the behavior and issues you are reporting is something we cannot afford at present. We are involved in too many ongoing development projects and do not have access to the required hardware resources.
 
Did you ever get this resolved?

I just got an HP Z8-G4 with 2x Xeon Platinum 8273CL (28 cores each, 2.2GHz base clock) and 320GB of RAM, but PI does not seem to be fully utilizing both CPUs. When PI loads it reports 112 threads, but when running WBPP on 253x QHY268C subs, PI seems to be using only one CPU. The Z8 is running the latest 1.8.9-2 build 1601. By comparison, the prior Z840 with 2x Xeon E5-2680v3 (12 cores each, 2.5GHz base clock) and 256GB of RAM would use both CPUs under ~equivalent loads.

Here is what the Z8-G4's CPU usage looks like during an intense part of WBPP (debayering). Notice that CPU0 is pegged at 100%, while CPU1 is only lightly exercised. Also, at this intense point the CPU clock is only 2.6GHz, far slower than its 3.7GHz turbo max. I am running several other things as well: many Chrome tabs, several Excel sessions, and JRiver playing music.

[Attached screenshot: Task Manager showing CPU0 at 100% and CPU1 lightly loaded during WBPP debayering]
 
We resolved this CPU detection problem several versions ago. Our code base correctly detects all processor cores on multi-CPU systems on all supported platforms, including Windows. What you are reporting represents the way Windows is assigning threads to running tasks, which is out of our control and scope. We have no further control besides detecting logical processors and creating and running the appropriate number of threads for each task. Use Preferences settings to define the optimal number of processors PixInsight uses for each task and set of working conditions. This can only be determined through benchmarks.
 
I have been using 2x Xeons in a Z840 for more than 4 years, and PI always seemed to use everything (e.g. pegging both CPUs at 100% for long stretches); both machines run Win10 Pro.

I was expecting this new Z8 to be ~2x the performance of my old Z840, with 56 cores vs. 24, a higher turbo clock, 2933 RAM vs. 2133, and 6-channel RAM instead of 4-channel. Passmark's benchmarks have the Platinum 8273CL at 3.3x the E5-2680v3. I even set up a RamDisk w/ 100x folders for swap vs. 24x folders on an HP EX950 NVMe. I have set the number of read/write threads to 100 to match the 100 swap folders.

When running WBPP on the exact same OSC data set, the new Z8 was only 12% faster, as if it were fighting with one arm tied behind its back.

As an aside, I find WBPP's processing of OSC surprisingly slow compared to a similar amount of mono data; it takes ~4-5x as long. In this case, it took 2'04" for 253x QHY268C subs, outputting separate R, G, B. WBPP also sometimes seems not to calibrate OSC, while it always works with mono. This could be user error, as this is my 1st time stacking OSC, having always done mono for 4+ years.

Also, after I 1st got the Z8, Windows Update wanted to migrate it to Win11P. I updated to Win11, but the PI Benchmark score was only ~60% of the score under Win10, so I reverted to Win10P. This seems very strange.

Do you know of any Windows settings that I can change?

Any other suggestions?
 
I'd start by getting rid of the ramdisk. That could be costing you performance. There's no advantage to using it, and it reduces available RAM, which definitely can be a problem. Also, no reason to have more than a few PI swap folders set up.
 
I am not coming close to running out of RAM, so I doubt it is the constraint. Also, I like the idea of 6x RAM x 2 controllers per CPU x 2 CPUs for the theoretical ability to address 24 channels of RAM at the same time vs. using a single PCIe 3.0 SSD. I have only allocated 56GB of RAM to the RamDisk, leaving me more RAM than I had on the Z840, with extra RAM coming to fill out all 24 slots w/ 16GB each. I am not a computer expert, but I have been a PC power user for decades.

I played around with adding more swap folders, and it seemed to clearly improve PI Benchmark scores up to at least 50-60 folders, and "maybe" improve slightly beyond that (there seems to be other variation). With a total of 112 threads (assuming Windows allows PI to use them), I would want roughly a separate folder for each thread to write to. My son-in-law is an Engineering PhD who runs very CPU-intensive projects at national research labs using HP Zs and SuperMicro machines, and he said that only a single core (thread?) can write to a specific folder at the same time, hence the need to scale the number of folders to the cores (threads?). Previously, I was matching the number of folders to the number of cores, which seemed to be a common CN suggestion. PI would never be able to use 100% of the Z8's throughput anyway, as there are other things going on.
 
The number of cases of PI crashing with out-of-memory errors when the usual monitors didn't show memory being fully used suggests that it can use a lot more than it appears to at times. Almost all the time PI spends on actual tasks involves CPU and memory; very little time is spent moving things to and from disk.

I also wouldn't bother with the PI benchmark tool. I don't think it realistically represents actual performance.
 
I found the solution to my problem by upgrading the OS to Windows 11 Pro.

On Sunday, I ran several real-world tests with my Z8 using the exact same data set and the exact same other programs open.

My Z8 has a mismatch in the RAM, but it is consistent by CPU, and there are no POST warnings or warnings in HP Performance Advisor.

CPU0: Hynix 12x16GB 2Rx8 PC4-2933Y-RE2-12; total 192GB
CPU1: Samsung 12x16GB 1Rx4 PC4-2933Y-RC2-12; total 192GB

I did not know if my issue was due to OS settings, BIOS settings, or RAM mismatches between the CPUs.

The only things running before executing the WBPP process were File Explorer, Task Manager, and PixInsight. The PixInsight session had nothing else going on; I just opened PI and ran WBPP, and WBPP's cache was purged. Before running WBPP, total RAM usage was ~11-12GB and CPU usage was ~1%, with both CPUs being very similar. I have set up a RamDisk using up to 56GB with 56 swap folders for PI's swap. What I was testing was WBPP stacking 227x QHY268M subs that had already been calibrated/registered. While WBPP was running, I monitored Task Manager's measurement of load; the CPU load and RAM usage varied depending on what WBPP was doing at the time.

Identical RAM (96GB per CPU; 192GB total) and Win10P

To test if the problems were due to RAM mismatch, I removed all the 2Rx8 RAM from CPU0 and moved 6x16GB 1Rx4 from CPU1 to CPU0, leaving both CPUs with 96GB of RAM in the 1st RAM bank (as HP advises).

There was basically no change. Again, the load was almost all on CPU0 during the intense parts of WBPP, with CPU1 minimal at ~10% (which is higher than when WBPP was not running) and CPU0 pegged at 100%. Total RAM usage peaked at ~112GB, so it is using both CPUs' RAM banks. Total execution time was 25:26.

Execution Z8 1Rx4 192GB Split WBPP 227 Mono Integration.png

Mismatched RAM (96GB 1Rx4 + 96GB 2Rx8 on CPU0; 96GB 1Rx4 on CPU1; 288GB total) and Win10P

I powered off the Z8 and then put 6 DIMMs of 2Rx8 RAM into CPU0's 2nd RAM bank, mixing ranks within CPU0 but keeping each RAM bank consistent. I left CPU1 at 96GB of 1Rx4. Upon reboot I got POST warnings about the RAM, it only showed 96GB (288GB was installed in 18 of the 24 slots), and Windows would not load (spinning wheel). So it seems the Z8 was only seeing CPU1's RAM, and without CPU0's RAM the OS could not boot. Definitive conclusion: RAM cannot be mixed within a CPU, even if each bank holds identical RAM.

Mismatched RAM (192GB 2Rx8 on CPU0; 192GB 1Rx4 on CPU1; 384GB total) and Win10P

So I put the RAM back to my starting configuration of 384GB, with CPU0 having 2Rx8 and CPU1 having 1Rx4, and powered up. In this configuration, with the exact same other programs open, the results were almost identical: total time was only 2sec faster and peak RAM was 114GB (up 2GB).

Execution Z8 384GB WBPP 227 Mono Integration.png

Mismatched RAM (192GB 2Rx8 on CPU0; 192GB 1Rx4 on CPU1; 384GB total) and Win11P

I then changed my OS to Windows 11 Pro and reran the same WBPP under the same conditions. With Win11, both CPUs are used about equally and processing time is ~80% that of Win10 in this real-world case. Initial RAM usage was ~1-2GB more than under Win10P, and total RAM peaked at 115GB.

Execution Z8 384GB WBPP 227 Mono Win11P.png

Conclusion:

PI's Benchmark is not representative of heavier real-world loads, and can give contrary conclusions. PI's Benchmark on my dual-CPU, high-core Xeon showed Win11P was much slower than Win10P, but in real-world loads, Win11P's processing time was 80% that of Win10P.

Windows 11 Pro's improved Scheduler definitely benefits multi-socket, high-core CPU machines.

With nothing else running and PI having nothing else going on, WBPP stacking 227x QHY268M subs peaks at roughly 112-115GB of RAM when the machine is not CPU or RAM limited. Add to this any other RAM consumed by other apps running, browser tabs open, or a PI project with anything in it. Also, if the subs are larger (e.g. IMX455) or more subs are stacked, RAM usage will scale higher. I also find that processing OSC subs takes MUCH more resources than mono, though I have not tracked how much of this is CPU cores, RAM, or just more time.

With my typical workstation load (5-6 large Excel models, 100 Chrome tabs open, JRiver playing music with digital-to-analog conversion, digital format conversion (e.g. SACD or bit-rates) and real-time parametric EQ, DSS, Word, Affinity Photo, and large PI projects), I would see my RAM usage peak well above 200GB. If the PC has less RAM, PI processing will just take longer, as it has to break the process down into smaller matrices and more loops. My typical use case definitely benefits from my Z8-G4's or Z840's capabilities. My new Z8-G4 definitely has a performance advantage over my old Z840 (2.33x more cores, higher turbo clock, faster RAM, 6-channel RAM, and 50% more RAM slots), but the overall improvement of ~42% (estimated as 1/(0.8*0.88), mixing 2 different comparisons) was less than expected. I expect there will be more improvement on more intensive cases, such as more QHY600 subs.
 
Here is what Task Manager showed during the intense part of WBPP's integration of the 227x subs, using Windows 11 Pro. It is showing CPU usage by NUMA node, with both clearly pegged at 100%. It also seemed that the peak CPU clock was higher: I believe I saw ~3.5GHz at one spot, while under Win10P it peaked at ~3.1GHz. I expect that spreading a balanced load across the two CPUs gave better thermal conditions, allowing the higher clock.
Task Mgr Z8 WBPP Win11P.png
 
PI's Benchmark is not representative of heavier real-world loads, and can give contrary conclusions. PI's Benchmark on my dual-CPU, high-core Xeon showed Win11P was much slower than Win10P, but in real-world loads, Win11P's processing time was 80% that of Win10P.

I agree. The benchmark script is focused mainly on compute performance, and I/O performance is kind of an afterthought. WBPP is extremely disk intensive, so if your system has high I/O bandwidth and can sustain multiple I/O threads at full performance, it's going to do much better with WBPP.
 
Did all the folks with high-core-count systems under Windows 10/11 (i.e. those with more than 64 logical cores) try turning off hyperthreading to halve the number of logical cores, to get them under the 64-logical-core limit?

Above 64 logical cores, PI needs to add about a dozen extra lines of code to allocate work on a NUMA node other than the one where the main PI program is running. Without this extra coding, when you have over 64 logical cores, only the NUMA node running PI will get work assigned to it (until PI adds the correct processor-affinity code to distribute work across NUMA nodes on systems having 65 or more logical cores; a sketch of this kind of affinity code follows at the end of this post).

As Juan stated, the code to detect the total number of logical cores is correct, except when your system has over 64 logical cores, in which case it seems to report only half the number of logical cores. Turning off hyperthreading doesn't fix this issue, which might lead to an under-allocation of work. Again, PI would have to add about 10 extra lines of code to correctly count the number of logical cores on dual-processor systems with a high core count.

Hope that helps: try halving your logical cores by turning off hyperthreading and see if your processes speed up by about 80%, especially in WBPP. It worked perfectly for me, dropping from 72 logical cores (dual Xeon E5-2699v3) to 36 logical cores (each being a physical core). The benchmark code still counts only half the available logical cores (reporting 18, not 36, for my setup with HT off), but work does get distributed to all cores; possibly less than it can handle, because it undercounted the number of logical processors by 50%.
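(Editor's note: to illustrate what the "extra lines of code" described above might look like, here is a hedged sketch, not PixInsight code; the PinCurrentThreadToGroup helper is made up for the example. It round-robins worker threads across all Windows processor groups with SetThreadGroupAffinity; without some such assignment, a Windows 10 process on a machine with more than 64 logical processors stays in a single group.)

```cpp
// Hedged sketch (not PixInsight code): spread worker threads across all Windows
// processor groups with SetThreadGroupAffinity. On Windows 10, a process on a
// machine with more than 64 logical processors must do something like this
// explicitly; Windows 11 spans groups by default.
#include <windows.h>
#include <cstdio>
#include <thread>
#include <vector>

// Pin the calling thread to one processor group, allowing every logical CPU in it.
static void PinCurrentThreadToGroup(WORD group)
{
    DWORD cpusInGroup = GetActiveProcessorCount(group);
    GROUP_AFFINITY affinity = {};
    affinity.Group = group;
    affinity.Mask = (cpusInGroup >= 64) ? ~KAFFINITY(0)
                                        : ((KAFFINITY(1) << cpusInGroup) - 1);
    SetThreadGroupAffinity(GetCurrentThread(), &affinity, nullptr);
}

int main()
{
    const WORD groupCount = GetActiveProcessorGroupCount();
    std::vector<std::thread> workers;
    for (int i = 0; i < 8; ++i) // 8 demo workers; a real task would size this to the workload
        workers.emplace_back([i, groupCount]
        {
            WORD group = WORD(i % groupCount); // round-robin threads over groups
            PinCurrentThreadToGroup(group);
            std::printf("worker %d running in processor group %u\n", i, unsigned(group));
            // ... per-thread work would go here ...
        });
    for (auto& t : workers)
        t.join();
    return 0;
}
```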
 
I think you must distinguish between Windows 10 and Windows 11, as the scheduler in Windows 11 was changed to use all processor groups automatically. So what you see on your machine is most likely due to the OS: Windows 10 gives a process access to only a single processor group by default.
Starting with Windows 11 and Windows Server 2022, it is no longer the case that applications are constrained by default to a single processor group. Instead, processes and their threads have processor affinities that by default span all processors in the system, across multiple groups on machines with more than 64 processors.
More details here: https://learn.microsoft.com/en-us/windows/win32/procthread/processor-groups

PI most likely reports the correct number of logical processors, but as Windows 10 only uses a single processor group by default, the system gets overwhelmed with twice as many threads as it can handle (72 cores in total but only 36 available to the scheduler), which makes it slow. You would do better to limit this in PI's global preferences, where you can specify the number of processors it can use, instead of modifying your whole system.

Regarding the Benchmark script (we actually discussed this earlier!): it uses the wrong method to query the system for this number, and it only uses it for display purposes and nothing else, so please just ignore it. PI determines the number on startup and logs it to the console, and that is the number it uses. Just check what it reports there after enabling hyperthreading again.
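(Editor's illustration of the Windows 10 vs. Windows 11 default described above, assuming only documented Win32 behavior: the sketch below lists the processor groups the current process is assigned to. Per the Microsoft page linked above, a process on Windows 11 / Server 2022 should span all groups by default, while a freshly started process on Windows 10 is typically assigned a single group.)

```cpp
// Hedged check (illustrative only): list the processor groups the current
// process is assigned to, and the total number of active groups on the machine.
#include <windows.h>
#include <cstdio>
#include <vector>

int main()
{
    USHORT count = 0;
    GetProcessGroupAffinity(GetCurrentProcess(), &count, nullptr); // query required array size
    std::vector<USHORT> groups(count);
    if (GetProcessGroupAffinity(GetCurrentProcess(), &count, groups.data()))
    {
        std::printf("Process spans %u processor group(s):", unsigned(count));
        for (USHORT g : groups)
            std::printf(" %u", unsigned(g));
        std::printf("\nActive groups on this machine: %u\n",
                    unsigned(GetActiveProcessorGroupCount()));
    }
    return 0;
}
```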
 
Hi Stephen,

The >64-processor test was indeed run on Windows 10 Pro, as half the Xeons out there aren't certified for Windows 11 Pro (for reasons not shared by MS). Also, yes: work being distributed only to the CPU running the parent program when there are over 64 logical processing units, unless processor affinity is correctly specified, is a very well-known Windows 10 Pro behavior. It was the same in the last 3 versions of Windows; only Windows 11 Pro has changed that default, though I am unsure if anyone here with >64 cores on Windows 11 Pro has tested the actual behaviour and confirmed it is correct. It's even the same with popular video encoding programs I use: turn off Hyper-Threading and work flows across all cores; leave it on and exceed the 64-core limit, requiring processor affinity to be nominated, and work goes only to the cores of the CPU hosting the parent process.

To your next point: even when one lowers the number of logical processors used in PI via global preferences to well under 64, PI doesn't assign work across both CPUs; one has to add the extra processor-affinity assignment logic on such machines to get the work to flow to the CPU other than the one PI started on. Once the system itself is below 64 logical cores, work flows across both CPUs and all their cores seamlessly.

But many thanks for being the first to actually confirm that "Regarding the Benchmark script ... uses the wrong method to query the system for this number and only uses it for display purposes and nothing else. So please, just ignore it. PI determines the number on startup and logs it to the console which is the number it uses. Just check what it reports there after enabling hyperthreading again."

That is the first official word I can recall that the number used is the correct one, detected at startup, and that the reported number is cosmetic in effect and does not thereby limit or throttle the benchmark by spawning way too few threads, even though the displayed number is only half the real one.

When hyperthreading is on, it reports 36 logical processors (not 72), so it's incorrect by 50%; when hyperthreading is off, it reports 18 logical processors (not 36), so wrong by 50% again. So I wonder: if it has the correct number, why does it show the wrong number in both the end-of-run report and the logged benchmark? That makes the claim that it is using the correct number seem doubtful!

[Attached screenshot: benchmark report showing the reported logical processor count]
 
Juan's comment that this is "exclusively a 'cosmetic' issue without any repercussion on the accuracy of measured benchmark indexes" is very reassuring to know!
 
I'm having this issue with a 64-core Threadripper (128 logical); it's only utilizing half of the CPU. After reading this thread, there doesn't seem to be a solution to this for Win 11. So there's nothing we can really do to get PI to use all logical processors in Windows?
 