I can't say for sure, however the thread linked below talks about how to get pardiso to run with the ccx_dynamic.exe file from version 2.18. also, updating your intel mkl might help. intel does still make improvements to pardiso, though some releases don't have any changes to it.
http://mecway.com/forum/discussion/1039/ccx-pardiso-low-multi-core-usage/p1
some info from the above thread: to get pardiso or spooles to run with the ccx_dynamic.exe file from v2.18, you have to set the SOLVER keyword to the solver type you want.
SOLVER=type
type options: SPOOLES, PASTIX, PARDISO. if this change isn't made, pastix is now used by default. i'm hoping future versions of mecway will make it easier to specify which solver you want to use.
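for reference, the keyword goes on the procedure card of the step in the ccx input deck. a minimal sketch, assuming a plain static step (adjust the card to your analysis type):

*STEP
*STATIC, SOLVER=PARDISO
** boundary conditions and loads go here
*END STEP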
the files I copied to the ccx_dynamic.exe folder are:
libiomp5md.dll (you have to install at least one intel oneapi compiler to find this file)
mkl_avx2.1.dll
mkl_avx512.1.dll
mkl_core.1.dll
mkl_def.1.dll
mkl_intel_thread.1.dll
mkl_rt.1.dll
version 2.17 didn't use mkl_avx2.1.dll or mkl_avx512.1.dll; instead, it used mkl_sequential.1.dll.
for some reason, you can no longer use a path environment variable to point to the mkl files. i had to physically copy all of the above files into the same folder as the ccx_dynamic.exe file. in previous releases of ccx, path variables to the mkl worked.
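as a sketch of what that copy step can look like from a command prompt (the oneapi redist paths below are assumptions; they vary by oneapi version and install location, and C:\ccx stands in for wherever ccx_dynamic.exe lives):

rem copy the mkl dlls next to ccx_dynamic.exe (hypothetical paths)
for %f in (mkl_avx2.1.dll mkl_avx512.1.dll mkl_core.1.dll mkl_def.1.dll mkl_intel_thread.1.dll mkl_rt.1.dll) do copy "C:\Program Files (x86)\Intel\oneAPI\mkl\latest\redist\intel64\%f" "C:\ccx\"
rem libiomp5md.dll ships with the compiler redist, not the mkl redist
copy "C:\Program Files (x86)\Intel\oneAPI\compiler\latest\windows\redist\intel64_win\compiler\libiomp5md.dll" "C:\ccx\"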
I copied the mkl_avx2.1.dll file just in case it was needed; however, it doesn't seem to be. it looks like the mkl_avx512.1.dll file is the one used in my case, and this offered some speedup. you might have to experiment to see which file you need for your cpu.
below are the environment variables i specified, based on input from 3rav. they seemed to help speed up pastix and pardiso:
MKL_INTERFACE_LAYER=LP64
MKL_THREADING_LAYER=INTEL
OMP_NUM_THREADS=(set to desired number of cores)
OPENBLAS_NUM_THREADS=1
PASTIX_MIXED_PRECISION=1
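if you'd rather set these from a command prompt than through the system properties dialog, something like this works (a sketch: setx writes to the user environment for future processes, so restart mecway afterwards; replace 4 with your desired core count):

setx MKL_INTERFACE_LAYER LP64
setx MKL_THREADING_LAYER INTEL
setx OMP_NUM_THREADS 4
setx OPENBLAS_NUM_THREADS 1
setx PASTIX_MIXED_PRECISION 1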
v2.18, with all of the above settings, did help pardiso run better than the 2.17 release. however, it still has long periods of single-core use. pastix uses more cores more of the time. the run characteristics are different for each solver, and one may work better than the other depending on your hardware. in my case, pardiso was slightly faster than pastix for most of the tests I ran. spooles wasn't an option for me, due to the memory requirements of that solver.
your answer was pretty accurate. i recently updated the solver folder with the dlls, and i got an error from AVX2, so i got confirmation that it is working.
I also added some environment variables that seem to work together; see the description in the attached image.
glad you figured it out. i just noticed an error in my previous post: i meant to say that with the v2.17 release, mkl_sequential.1.dll was being used by pardiso instead of the avx2 or avx512 dll files. i just updated my previous post with the correction. i'm getting some speedup now since it's using avx512. i'm not sure how it decides which dll files to use. i think 3rav is the one who has been compiling these for windows use; he might know more. all i can confirm is that v2.18 is running better than v2.17 was. to test which dll files are used, I moved them in and out of the folder with the ccx_dynamic.exe file.
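a less manual way to check, assuming you run it while a solve is in progress, is to ask windows which processes have a given dll loaded:

tasklist /m mkl_avx512.1.dll
tasklist /m mkl_avx2.1.dll

if ccx shows up in the output for one of them, that dll is the one in use.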
@prop_design Thanks a lot for the help. My hardware: 16.00 GB DDR3-1600 RAM (attachment: Bolt Assembly Test.zip).
- 1st run time, March 2020 (CalculiX 2.16 PARDISO): 5 min 17 sec
- Current run time, October 2021 (CalculiX 2.18 PASTIX): 1 min 2 sec
Thank you very much to all the people who have worked to adjust the optimum setup parameters and variables, find the required dlls, ... and to those who provided benchmark files for users to set up and detect misconfigurations. Spectacular change!!
Just to clarify an important point, since the message history of this thread is unclear: there are two versions of the test file.
File - Bolt Assembly Test
File - Bolt Mount Test Update 2
in the file Bolt Assembly Test (Calculix 2.18):
Pardiso - solver time: 60.19 s
Pastix - solver time: 36.87 s
in the file Bolt Assembly Test Update 2 (Calculix 2.18):
Pardiso - solver time: 254.95 s [4 m 14 s]
Pastix - solver time: 169.54 s [2 m 49 s]
It is important to always use the same benchmark for effective comparisons of hardware versus time to solution. I spent a lot of time looking for solutions, thinking there was a problem with my hardware, because I was looking at the Bolt Assembly Test 1 results while running Bolt Mount Test Update 2.
thanks for pointing that out. yeah, i don't even have the original file on my computer anymore. when i made the updates, i asked victor if he could remove the old version, but he didn't want to; he didn't want to cause confusion. unfortunately, that's what happened anyway.
i made those files for a different thread. the op on this thread started randomly using it as a benchmark. the intent of the file was to show a way to model bolted joints.
it's fine if people want to use it as a benchmark, but any number of models could serve the same purpose. in fact, it would be better to run a benchmark with a known solution. this file is arbitrary.
when I run benchmarks, i use other files that i made.
i tested the file you requested. unfortunately, the results were really bad. it's using a ton of hard drive throughout the solve, which slows it way down on my computer. also, it says pardiso is not linked. in any event, below are the results. the model is an update of one used in a different thread. the updated model uses the latest netgen, which allows me to get the mesh i have been wanting. previous versions of netgen didn't work as well, and i had to play a lot of tricks to get any mesh at all. this model is meant to test model size. it uses all of my available memory. it's a nonlinear solve for stress. there is no contact. it takes 10 iterations to converge.
as before, pardiso uses the least physical resources and ends up being faster than pastix. pastix uses a lot of resources and ends up being slower. i think my result is specific to a budget laptop. depending on hardware, pastix could end up faster for some people.
as mentioned previously in this thread, i don't use the bolt assembly model to benchmark, because the model has changed a lot over time. i could download the old models off this thread and run them, but that really has no value for me. i could also run the current model, but that wouldn't match the posts here.
you can use any model you want to benchmark. just be consistent with your tests, meaning use the same model for all the comparisons you do.
lastly, i had no part in creating this thread or using this model for benchmarking. i've tried to make that clear many times. i made the model to show how to model a bolt for a different thread on this forum.
Hi, I am new here and to Mecway. Nice software, and I like the multithreading. At the moment I use CCX Pardiso 2.16, and in contrast to others on this forum, setting the maximum number of cores gives the best result. Maybe the quad-channel memory controller helps with that?
My results with 10 cores:
Bolt assembly test: 49 s (55 s with 8 cores, 68 s with 6 cores)
Bolt assembly test Update 2: 196 s
Hardware:
X299 platform
i9-10900X 3.7 GHz, OC 4.5 GHz all-core
4x8 GB DDR4 3600 MHz
Below is a link to the latest version of this example. As the OP of this thread found, it also serves as a pretty good performance benchmark. The mesh size will max out 16 GB of physical memory, so if you have less than that, it will most likely crash your system. If you have more memory, you can increase the mesh density. The latest version is 'Update 3'.
We have come to accept that there is a flattening at around 8 cores, and we set up computers in this range. I think there is always a tradeoff between cores and "share time" between cores. This will be problem specific, so you may find success with a particular test case but later find inefficiency in others.
PS- If someone has this problem beat, I'm listening!
JohnM, you are right. Around 6-8 cores we get a horizontal asymptote (a flattening of the cores-vs-time plot). My experience running the bolt benchmark on 24 cores was that the computer just kept running while the minutes passed by, so there is clearly something wrong. I never expected to get better performance, but I wanted to try it out. From one of your studies posted here, I knew performance would drop off above 6 cores. I hope that in the near future mathematicians will improve the PASTIX algorithm. I will go ahead with more models and let you know. I set OMP_NUM_THREADS=8 despite having 24 cores (32 threads) available, but I will change this configuration to find out whether any other kind of model or analysis type is willing to use more cores. I will also try the environment variable set posted above (PASTIX_ORDERING = 0, PASTIX_SCHEDULER = 0). Let's see what happens!
I have a suspicion that this partly depends on the ratio between memory bandwidth and core speed. Most problems require at least the solution to come out of cache, and usually most of the calculations, and L3 is usually shared. Looking through the literature, it looks like most systems, including high-end workstations, max out at 8 cores, sometimes 6, on Abaqus as well as CalculiX. There is a similar limit on the number of CPUs. It would be interesting to see if some of the new processors with a lot of L3 cache do better than their brothers with about the same core speed. Anyone out there running a 5800X3D or 7800X3D? Page file on a fast PCIe 5 SSD with a 7800X3D?
Comments
on my laptop it took 05:17 using the MKL pardiso on 4 cores and 8 threads, according to the description included with Mecway 14.
Hardware:
i7-8565U (boosts up to around 3.2-3.5 GHz)
16 GB DDR4 2400 MHz
Maybe someone knows a way to speed it up a bit?
But I remain very disappointed with the performance.
Legion Y540 [notebook]
Intel Core i7-9750H @ 2.60 GHz (turbo 4.5 GHz), 12 MB cache
32 GB DDR4 2667 MHz
ADATA 8200 M.2 SSD - 3500/3000 MB/s
RTX 2060
Windows 11
Mecway 14.0 [Windows compatibility mode, run as admin]
Solver 2.18 - Pastix [ccx_dynamic.exe, Windows compatibility mode, run as admin]
Environment variables:
OMP_NUM_THREADS = [ 6 ; 8; 10; 12]
OPENBLAS_NUM_THREADS =1
PASTIX_MIXED_PRECISION = 1
PASTIX_ORDERING = 0
PASTIX_SCHEDULER = 0
________________________________________________
OMP_NUM_THREADS = [ 6 ] - 170s - 2:50
OMP_NUM_THREADS = [ 8 ] - 168s - 2:48
OMP_NUM_THREADS = [10] - 158s - 2:38
OMP_NUM_THREADS = [12] - 155s - 2:35
Solver 2.17 - Pardiso MKL [ccx_MKL.exe, Windows compatibility mode, run as admin]
Environment variables:
MKL_NUM_THREADS = 12
all dlls and libs copied into the solver folder
________________________________________________
OMP_NUM_THREADS = [ 8 ] - 268s - 4:28
OMP_NUM_THREADS = [10] - 303s - 5:03
OMP_NUM_THREADS = [12] - 301s - 5:01
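for anyone repeating a thread-count sweep like the one above, a small batch file saves the manual editing of variables. a sketch, where 'job' stands in for your .inp jobname and the ccx path is an assumption:

@echo off
rem run the same job once per thread count; each ccx run inherits the updated variable
for %%n in (6 8 10 12) do (
    set OMP_NUM_THREADS=%%n
    echo === %%n threads ===
    ccx_dynamic.exe -i job
)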
I was a little curious why many people managed to solve this in less than 1:40 on simpler computers, some in less than 1 minute, so I got confused about what could be wrong. Any suggestions?
https://calculix.discourse.group/t/calculix-and-pastix-solver-windows-version/130/42?u=rafal.brzegowy
CPU: Intel Core i7-6700 @ 3.40 GHz (4 cores, 8 threads)
MEMORY: 64 GB DDR4
Run time: 40.40 seconds
MEMORY: 16GB 3200MHz
calculix_2.19\ccx_static.exe
MKL_INTERFACE_LAYER=LP64
MKL_THREADING_LAYER=INTEL
OMP_NUM_THREADS = [8]
OPENBLAS_NUM_THREADS =[1]
PASTIX_MIXED_PRECISION = [1]
Pastix -- Bolt Assembly Test.liml ------------- 21 sec
Pastix -- Bolt Assembly Test Update 2.liml ----- 1 min 19 sec
Edited: No OneAPI installation seems to be required; I just copied and pasted the .dlls from my previous computer. ccx_dynamic.exe does not respond.
glut64.dll
mkl_avx2.1.dll
mkl_avx512.1.dll
mkl_core.1.dll
mkl_def.1.dll
mkl_intel_thread.1.dll
mkl_rt.1.dll
mkl_sequential.1.dll
How do I get the list you show?
Those numbers started to appear automatically with each update of the OneAPI toolkit.
EDITED: I managed to make ccx_dynamic.exe v2.18 work.
I added the folder with all the dlls to the system PATH variable and changed the name of mkl_intel_thread.1.dll to mkl_intel_thread.dll.
Pardiso is now working, but slower than pastix.
Pastix improved.
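in command-prompt terms, that change is something like the following (a sketch; C:\mkl_dlls is a stand-in for wherever your dlls live, and this PATH edit only lasts for the session that launches mecway/ccx):

ren C:\mkl_dlls\mkl_intel_thread.1.dll mkl_intel_thread.dll
set PATH=C:\mkl_dlls;%PATH%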
AMD Ryzen 9 5900HX 8 cores.
MEMORY: 16GB 3200MHz
calculix_2.18\ccx_dynamic.exe
MKL_INTERFACE_LAYER=LP64
MKL_THREADING_LAYER=INTEL
OMP_NUM_THREADS = [12]
OPENBLAS_NUM_THREADS =[1]
PASTIX_MIXED_PRECISION = [1]
Pastix -Bolt Assembly Test Update 2.liml-----50 seconds
RAM: 2x8GB DDR3L
SSD hard disk.
Spooles solver: 8min 1s
PaStiX solver, OMP_NUM_THREADS=1: 2min 22s
PaStiX solver, OMP_NUM_THREADS=2: 2min 17s
PaStiX solver, OMP_NUM_THREADS=3: 1min 54s
PaStiX solver, OMP_NUM_THREADS=4: 1min 31s
PaStiX solver, OMP_NUM_THREADS=5: 1min 25s
After also adding the environment variables below, the time is down to 1min 13s. It feels good to be able to halve the computing time this easily.
MKL_INTERFACE_LAYER=LP64
MKL_THREADING_LAYER=INTEL
OPENBLAS_NUM_THREADS=1
PASTIX_MIXED_PRECISION=1
A further reduction of 10 seconds is achieved by not having the window with the solver details open while it's running.
To those others who are confused about where to edit environment variables, as I was until 30 minutes ago: you can do it as explained at this link: https://phoenixnap.com/kb/windows-set-environment-variable
Please find my contribution to the benchmark test, run on my computer (Intel Core i7 M 640, 2.8 GHz base speed, 8 GB RAM, solid-state drive):
Filename: Bolt Assembly Test Update 2.liml
Mecway 14 (Calculix Pardiso 2.17) - 1 CPU: 38 min 27 s (Total CalculiX Time: 2283.807720)
Mecway 15 Beta (Calculix Pardiso 2.19) - 1 CPU: 41 min 19 s (Total CalculiX Time: 2461.041211)
Regards,
Ivan.
CPU: AMD Ryzen 9 5900X, 12-core processor, 3.7 GHz.
RAM: 64 GB
Mecway 15, CalculiX 2.19, ccx_static.exe, running PaStiX. The same environment variables as in my comment above.
Runtime with 1 CPU core: 1 min 8 s.
Runtime with 5 CPU cores: 23 seconds
https://mecway.com/forum/discussion/737/bolt-assembly-example#latest
i9-14900 (24 cores / 32 threads, up to 6 GHz), DDR5-5700, 128 GB RAM
time elapsed: 31 seconds with 8 threads
time elapsed: 34 seconds with 12 threads
Bolt benchmark, CCX 2.20 PASTIX ccx_static, same hardware:
time elapsed: 19 seconds with 12 threads
Bolt benchmark, CCX 2.19 PASTIX ccx_static, same hardware:
time elapsed: 21 seconds with 12 threads
PS: Setting the environment variables to use all 24 cores / 32 threads makes the calculation much slower (never ending).
i9 12900K, 32 GB RAM
time elapsed: 29.65 seconds with 12 threads
PASTIX_MIXED_PRECISION = 1 can speed up the analysis, but some issues have been reported with it.