I found the solution to my problem by upgrading the OS to Windows 11 Pro.
On Sunday, I ran several real-world tests with my Z8, using the exact same data set and with the exact same other programs open.
My Z8 has a mismatch in the RAM, but it is consistent within each CPU, and there are no POST warnings or warnings in HP Performance Advisor.
CPU0: Hynix 12x16GB 2Rx8 PC4-2933Y-RE2-12 ; total 192GB
CPU1: Samsung 12x16GB 1Rx4 PC4-2933Y-RC2-12 ; total 192GB
I did not know if my issue was due to OS settings, BIOS settings, or RAM mismatches between the CPUs.
The only things running before I started WBPP were File Explorer, Task Manager, and PixInsight. The PixInsight session had nothing else going on: I just opened PI and ran WBPP, and WBPP's cache was purged. Before running WBPP, total RAM usage was ~11-12GB and CPU usage was ~1%, with both CPUs being very similar. I have set up a RamDisk using up to 56GB and 56 swap folders for PI's swap. What I was testing was WBPP stacking 227x QHY268M subs that had already been calibrated/registered. While WBPP was running, I monitored load in Task Manager, and the CPU load and RAM usage varied depending on what WBPP was doing at the time.
Identical RAM (96GB per CPU; 192GB total) and Win10P
To test whether the problems were due to the RAM mismatch, I removed all the 2Rx8 RAM from CPU0 and moved 6x16GB 1Rx4 from CPU1 to CPU0, leaving both CPUs with 96GB of RAM in the 1st RAM bank (as HP advises).
There was basically no change. Again, the load was basically all on CPU0 during the intense parts of WBPP, with CPU0 pegged at 100% and CPU1 minimal at ~10% (which was higher than when it was not running WBPP). Total RAM usage peaked at ~112GB, so it is using both CPUs' RAM banks. Total execution time was 25:26.
Mismatch RAM (96GB 1Rx4 CPU0 + 96GB 2Rx8 CPU0; 96GB 1Rx4 CPU1; 288GB total) and Win10P
I powered off the Z8 and then put 6 DIMMs of 2Rx8 RAM into CPU0's 2nd RAM bank, mixing ranks within CPU0 but keeping each RAM bank consistent. I left CPU1 at 96GB of 1Rx4. Upon reboot, I got POST warnings about the RAM, and it only showed 96GB (288GB was installed in 18 of the 24 slots), and Windows would not load (spinning wheel). So, it seems that the Z8 was only seeing CPU1's RAM, and without CPU0's RAM the OS could not boot.
Definitive: RAM ranks cannot be mixed within a CPU on this machine, even if each bank has identical RAM.
Mismatch RAM (192GB 2Rx8 CPU0; 192GB 1Rx4 CPU1; 384GB total) and Win10P
So, I powered down and put the RAM back to my starting configuration of 384GB, with CPU0 having 2Rx8 and CPU1 having 1Rx4. In this configuration, with the exact same other programs open, the results were almost identical, with total time only 2sec faster and peak RAM at 114GB (up 2GB).
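As a quick sanity check on the RAM totals in the three configurations above, here is a minimal sketch that multiplies out the DIMM counts. It assumes every module is 16GB and that the per-CPU DIMM counts are 6, 12, or 12 as implied by the 96GB and 192GB figures; the labels and the `total_gb` helper are just illustrative names.

```python
DIMM_GB = 16  # assumption: every module in these tests is 16GB

# (label, DIMMs on CPU0, DIMMs on CPU1), counts inferred from the GB totals
configs = [
    ("identical 1Rx4 on both CPUs", 6, 6),
    ("mixed ranks on CPU0", 12, 6),
    ("2Rx8 on CPU0, 1Rx4 on CPU1", 12, 12),
]

def total_gb(cpu0_dimms, cpu1_dimms, dimm_gb=DIMM_GB):
    """Installed capacity in GB for a given number of DIMMs per CPU."""
    return (cpu0_dimms + cpu1_dimms) * dimm_gb

totals = {label: total_gb(c0, c1) for label, c0, c1 in configs}
for label, gb in totals.items():
    print(f"{label}: {gb}GB")
```

The middle case is the one that failed POST: 18 populated slots of 16GB come to 288GB installed, of which only CPU1's 96GB was visible.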
Mismatch RAM (192GB 2Rx8 CPU0; 192GB 1Rx4 CPU1; 384GB total) and Win11P
I then changed my OS to Windows 11 Pro and reran the same WBPP under the same conditions. With Win11, both CPUs are used about the same and process time is ~80% that of Win10 on this real-world case. Initial RAM usage was ~1-2GB more than Win10P and total RAM peaked at 115GB.
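To put the ~80% figure in wall-clock terms, here is a rough sketch that converts the Win10P time into seconds and scales it. It assumes the Win10P baseline is the 384GB run (25:26 minus 2 sec) and treats 0.80 as exact, so the result is only an estimate, not a measured Win11 time. The helper names are made up for illustration.

```python
def mmss_to_seconds(text):
    """Convert an 'MM:SS' string to total seconds."""
    minutes, seconds = text.split(":")
    return int(minutes) * 60 + int(seconds)

def seconds_to_mmss(total):
    """Convert seconds (rounded to the nearest second) to 'M:SS'."""
    minutes, seconds = divmod(round(total), 60)
    return f"{minutes}:{seconds:02d}"

# Assumption: baseline is the 384GB Win10P run, 2 sec faster than 25:26
win10_seconds = mmss_to_seconds("25:26") - 2
win11_estimate = 0.80 * win10_seconds  # ~80% is approximate

print(seconds_to_mmss(win11_estimate))
```

That works out to a bit over 20 minutes, or roughly 5 minutes saved per run on this data set.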
Conclusion:
PI's Benchmark is not representative of heavier real-world loads, and can give contrary conclusions. PI's Benchmark on my dual-CPU, high-core Xeon showed Win11P was much slower than Win10P, but in real-world loads, Win11P's processing time was 80% that of Win10P.
Windows 11 Pro's improved Scheduler definitely benefits multi-socket, high-core CPU machines.
With nothing else running and PI having nothing else going on, WBPP stacking 227x QHY268M subs peaks at about 112-115GB of total RAM usage when the machine is not CPU or RAM limited. Add to this any other RAM consumed by other apps running, browser tabs open, or the PI project having anything in it. Also, if the subs are larger (e.g. IMX455), or if stacking more subs, RAM usage will scale higher. I also find that processing OSC subs takes MUCH more resources than mono, though I have not tracked how much of this is CPU cores, RAM, or just more time.
With my typical workstation load (5-6 large Excel models, 100 Chrome tabs open, JRiver playing music (digital-to-analog conversion, converting digital formats (e.g. SACD or bit-rates), and real-time parametric EQ), DSS, Word, Affinity Photo, and large PI projects), I would see my RAM usage peak well above 200GB. If the PC has less RAM, PI processing will just take longer, as it has to break down the process into smaller matrices and more loops. My typical use case definitely benefits from my Z8-G4 or Z840 capabilities. My new Z8-G4 definitely has a performance improvement over my old Z840 (2.33x more cores, higher turbo clock, faster RAM, 6-channel RAM, and 50% more RAM slots), but the performance improvement of 42% (estimated as 1/(0.8*0.88), mixing 2 different comparisons) was less than expected. I expect there will be more improvement on more intensive cases, such as more QHY600 subs.
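For anyone checking the 42% figure, here is the arithmetic as a small sketch. It assumes the two time ratios (0.80 and 0.88) can simply be multiplied, which is itself an approximation since they come from 2 different comparisons; the variable names are only illustrative.

```python
# Time ratios (new time / old time); smaller is faster
win11_vs_win10 = 0.80  # from the WBPP test above
other_ratio = 0.88     # the second comparison used in the estimate

# Assumption: the two ratios are independent and can be chained
combined_time_ratio = win11_vs_win10 * other_ratio

# Speedup is the inverse of the time ratio
speedup = 1 / combined_time_ratio
improvement_pct = (speedup - 1) * 100

print(f"combined time ratio: {combined_time_ratio:.3f}")
print(f"speedup: {speedup:.2f}x (~{improvement_pct:.0f}% improvement)")
```

Note that this is a throughput improvement: the same job takes about 70% of the old time, which is the same thing as about 42% more work per unit time.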