CCX PARDISO Low Multi-Core Usage

prop_design · September 2021

ok,

here are the latest results. the original topic of this post is now solved, thanks to the v2.18 ccx_dynamic.exe. you can see it scales better with core increase and it's also the fastest option.

the multi-core improvement for pardiso is impressive. the previous version only had a 14% improvement. the current version has a 31.6% improvement.

spooles fails to run this model. it just bails without any explanation. i've read that it needs more memory than the other solvers. this model was meant to test the memory limits. so it must be too big for spooles. i did get spooles to run on a very small model.

thanks for all the help.

update; the latest pdf file is now in a post below. the picture attached to this post updates one from an earlier post.

3rav · September 2021

Hi,
I have a few questions about these results:
1. Which MKL libraries does PARDISO require on your computer?
2. Was pastix run with the flag PASTIX_MIXED_PRECISION=1 (and OPENBLAS_NUM_THREADS=1)?

I am a bit surprised by the difference between pastix (static) and pastix (dynamic). The only difference between the two builds is that the makefile uses -DPARDISO and adds a link to mkl_rt.lib (only PARDISO is dynamic).

Sergio · September 2021

Nice results, I will try this new 2.18 version. As you say, I have noticed in the past that CCX with Pardiso doesn't use much CPU while it is running, and that beyond 6-8 cores there is no significant improvement in performance. Also, what sometimes makes my models fail is not the CPU but the RAM: even with 32 GB, a few of my models don't fit in memory and can't be solved. So I keep looking for the holy grail of CCX versions, one with "out of core" capabilities.

prop_design · September 2021

hi 3rav,

i don't know anything about pastix yet. it only ran because that was the default for the new exe files. so whatever its default settings are is what was used. i'll look into it more and get back with you.

i have to do more testing to know exactly what files are needed. there is a post above that shows what i used so far. the previous version worked via env variables to the files. the current version won't run unless i copy the files to the folder with the ccx_dynamic.exe file. that is really weird. i don't know how that is happening.
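as a rough aid for that kind of debugging, here is a minimal python sketch that lists which folders on PATH contain files matching a pattern. the pattern `mkl_*.dll` is only an assumed example, not the confirmed list of required files, and `find_on_path` is a made-up helper name. windows normally looks in the exe's own folder before PATH, which would fit with copying the files next to ccx_dynamic.exe working.

```python
import fnmatch
import os


def find_on_path(pattern, path=None):
    """List files matching `pattern` in each directory of a PATH-style string.

    `pattern` is a shell-style wildcard (for example "mkl_*.dll", an assumed
    example only). `path` defaults to the PATH environment variable.
    """
    if path is None:
        path = os.environ.get("PATH", "")
    hits = []
    for directory in path.split(os.pathsep):
        if not directory or not os.path.isdir(directory):
            continue  # skip empty or stale PATH entries
        try:
            names = sorted(os.listdir(directory))
        except OSError:
            continue  # unreadable directory, ignore it
        for name in names:
            # compare case-insensitively, as windows file names are
            if fnmatch.fnmatch(name.lower(), pattern.lower()):
                hits.append(os.path.join(directory, name))
    return hits


if __name__ == "__main__":
    for hit in find_on_path("mkl_*.dll"):
        print(hit)
```

if nothing is printed, the files are not reachable through PATH, and the only way the exe can find them is from its own folder.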

i think pardiso may be using avx-512 now. it asked for that file and wouldn't run until i copied it to the folder. the previous version didn't even use avx2 and was hardly doing anything in parallel. now it does a little bit more in parallel, though not nearly as much as pastix. the scaling is much better now and is on par with the pastix scaling. before, it wasn't scaling well with cores.

the cpu throttles down as more cores are used, so it could be that pardiso runs faster because it isn't doing as much in multi-core. i definitely see higher clock speeds with pardiso than pastix. the cpu varies anywhere between roughly 2ghz and 4.2ghz, though it doesn't hit those numbers exactly.

overall, i'm happy with how pardiso is running. it uses less memory, and it releases and grabs memory as needed while solving. pastix seems to take all the memory and hold it the whole time, so they definitely have different characteristics. i always doubted the fabulous claims about pastix in the ccx manual, and what i'm seeing is what i expected: pardiso is just as good or better. there is no massive performance leap with pastix. however, those with other processors and/or a gpu solve will get more out of it, i'm sure. but for a laptop, pardiso seems the best option.

i also wonder about all the pardiso iparm settings. somewhere those have to be set, but i don't see where that is. i think if those were tinkered with pardiso would probably do more in multi-core. however, that may end up being a bad thing for a laptop. hard to say without being able to spend a lot of time with it. it's good enough as is though.

i'll get back with you with better answers to your questions. the pdf file attached in my last post has more details than the picture. i can provide the spreadsheet and the actual model if anyone wants it. the model was for a very difficult mesh and the size is huge for my computer. so i didn't post it for that reason.

anthony

PS; about static vs dynamic. yeah i was surprised by that too. for some reason the dynamic version of pastix seemed to be slightly faster than the static version. have no idea why that is.

prop_design · September 2021

hi sergio,

yeah, memory is a big problem for mecway and ccx. ansys was a lot better in that regard. the model i have been testing was meant to push the memory limits. it's about the biggest i can solve without going out of core. however, when i ran it as a modal analysis it did seem to go out of core, which supposedly wasn't possible. after about 3 hours i killed it. my hard drive isn't fast enough for out of core. i'm not entirely sure if out of core is supported or not. supposedly you had to compile it yourself with a patch to get it to work. but like i said, the 2.17 version seemed to go out of core. in my case though, it's not really worth it. if you have a fast ssd then it may be a better option.

MikeMcMullen · September 2021

My experience with out of core and an SSD is that the bandwidth required is within the capabilities of a SATA SSD. Apparently there is a lot of overhead in the operating system and CalculiX or Mecway with the writing. What seems to work, though, is freeing up seldom-used RAM in favor of RAM use that is more intensive. I don't have 2.18 for Windows yet. It may be better. I run an AMD 3700X, so AVX-512 is not available.

prop_design · September 2021

hi mike,

in updating the tests for 3rav, i just found that avx-512 is speeding up pardiso 2.18 a lot. so having those extensions could be a factor. also, having more real cores will be good for pastix. so i'm guessing you will have better performance with pastix than pardiso. in my case they are close with pardiso maybe a little better.

3rav,

i updated the tests with the changes you wanted. i also added two more env vars, which helped speed up pastix. i attached updated results. in testing what exactly is needed to run each version, i found that avx-512 is being used by the latest version of pardiso and it helps speed it up a lot. the 2.17 version of pardiso was using the sequential file and not using avx2 or avx-512. so there is a big improvement in pardiso between 2.17 and 2.18.

there is some variability in the results. i think both are running pretty close to each other right now. pastix seems to favor 3 cores and pardiso 4 cores. there is an oddity with pastix. before the env vars you wanted were added, pastix was using a little over 1 core more than specified. now it uses the specified number of cores, but there is mystery usage that can be as high as 4 more cores. in either case, i think this has something to do with it crapping out at 8 cores. if i watch it run, it can't hit 100% because it has to back off for the mystery usage. pardiso doesn't have this behavior. it only uses what is specified.

below are updated results. i'll post again with the requirements you asked about.

update: i compared a nonlinear prestress modal analysis, with 3 cores, using the same model. pardiso was much faster:

pardiso 10min 11sec
pastix 13min
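as a quick check on what "much faster" means here, a small python sketch that turns those two times into a percentage and a speedup factor. it assumes the pastix time is exactly 13min 0sec, since no seconds were recorded, and `to_seconds` is just an illustrative helper.

```python
def to_seconds(minutes, seconds=0):
    """Convert a min/sec wall-clock time to seconds."""
    return minutes * 60 + seconds


pardiso_s = to_seconds(10, 11)  # 10min 11sec
pastix_s = to_seconds(13)       # 13min, seconds assumed to be 0

# share of the pastix run time that pardiso saved
time_saved_pct = 100 * (pastix_s - pardiso_s) / pastix_s
# how many times faster pardiso finished
speedup = pastix_s / pardiso_s

print(f"pardiso: {pardiso_s} s, pastix: {pastix_s} s")
print(f"pardiso took {time_saved_pct:.1f}% less time ({speedup:.2f}x speedup)")
```

with these numbers pardiso needed about 21.7% less wall time than pastix, or roughly a 1.28x speedup, for this one model at 3 cores.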

prop_design · September 2021

so here is what i found as far as requirements. i attached them in a text file. let me know if you have questions.

Also added the following environment system variables. These help with speeding up pardiso and pastix:

MKL_INTERFACE_LAYER=LP64
MKL_THREADING_LAYER=INTEL
OMP_NUM_THREADS=(Set to desired number of cores)
OPENBLAS_NUM_THREADS=1

All four of the env vars were mentioned by 3rav.
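If you would rather not set these system-wide, the same four variables can be set per run. Below is a minimal Python sketch of that idea. The helper names `solver_env` and `run_ccx` are made up for illustration, and the `-i jobname` call is an assumption about how the exe is launched, so adjust it to however you normally start ccx.

```python
import os
import subprocess


def solver_env(num_threads, base_env=None):
    """Return a copy of the environment with the four variables from this post set."""
    env = dict(os.environ if base_env is None else base_env)
    env.update({
        "MKL_INTERFACE_LAYER": "LP64",
        "MKL_THREADING_LAYER": "INTEL",
        "OMP_NUM_THREADS": str(num_threads),  # desired number of cores
        "OPENBLAS_NUM_THREADS": "1",
    })
    return env


def run_ccx(job_name, num_threads, exe="ccx_dynamic.exe"):
    """Launch the solver with the per-run environment.

    Assumption: the exe takes "-i <jobname>" (input deck name without .inp).
    """
    return subprocess.run([exe, "-i", job_name],
                          env=solver_env(num_threads), check=True)


if __name__ == "__main__":
    # Only print the settings here; call run_ccx("myjob", 4) to actually solve.
    env = solver_env(4)
    for name in ("MKL_INTERFACE_LAYER", "MKL_THREADING_LAYER",
                 "OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS"):
        print(f"{name}={env[name]}")
```

Setting the variables per process makes it easy to rerun the same model with different core counts without touching the system settings.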

MikeMcMullen · September 2021

I had a successful run with Mecway internal (Spooles I think) with about 4.3 million nodes. It worked OK if you have a day and a night for a run. The screen was sluggish when viewing results too. MKL Pardiso will run non-linear on 2 million nodes in about an hour.
