interesting comparison 3rav. victor actually suspected this from viewing their website. the examples they used to show a speed-up were not relevant to what we use it for. nice to have confirmation though.
it doesn't look like anyone has been able to compile it on windows using parsec. parsec seems to take a serial code and automatically turn it into a parallel code. but not entirely sure.
Oh. I was curious where mmartin got the brand-new build he refers to in his benchmark run above, and whether it is available to install on Windows. Knowing where to get it would be very useful.
one thing i wonder is if they are using the parallel versions of METIS (ParMETIS) and SCOTCH (PT-SCOTCH). hopefully they are, but I don't know if that's the case.
I finally found the latest version of the solver being discussed (3rav) and tried it on the benchmark bolt assy on two different machines with the following results:
@3rav, we'd like to try a modification to CCX that I discussed with Guido, but we are not proficient at compiling a Windows executable. Can you provide a recipe here?
Comments
I ran the calculation on Mecway 13.0 with ccx 2.16 (PARDISO) using 8 cores.
Hardware configuration:
CPU: AMD Ryzen 9 3900x
RAM: Crucial UDIMM 16GB ECC 2666MHz
SSD: Samsung SSD 970 EVO Plus 500GB
Run time: 1:09
As antte mentioned, I also set MKL_DEBUG_CPU_TYPE=5 as an environment variable.
I'm trying to compile pardiso, one core → 12min ;(
I can only see that ↑
thanks in advance
you do have to sign up to get it. i don't know what 'EX studio' is
anthony
I ran the calculation on Mecway 13.1 with ccx 2.16 pardiso as follows:
Hardware configuration:
CPU: AMD Ryzen 9 3950X (16 cores / 32 logical processors), runs at about 4.2GHz
RAM: Crucial Ballistix UDIMM 32GB clocked at 3200MHz
SSD: Samsung NVMe 1TB
Run time: 1:01
I have not set MKL_DEBUG_CPU_TYPE=5 as an environment variable at this time.
CPU: i7 6700, 3.40GHz, 4 cores, 8 threads
MEMORY: 64GB DDR4
Running Time: 0 minutes + 52 seconds. (MW+calculix 2.17 PASTIX !!!!! )
It seems to work, but I hit a bug running a modal analysis.
however, the windows version isn't working completely. mainly it's missing parsec, which would make it work even better.
i'd hold off for a while, until there is a fully functional windows version
+-------------------------------------------------+
+ PaStiX : Parallel Sparse matriX package +
+-------------------------------------------------+
Version: 6.0.1
Schedulers:
sequential: Enabled
thread static: Started
thread dynamic: Disabled
PaRSEC: Disabled
StarPU: Disabled
Number of MPI processes: 1
Number of threads per process: 8
Number of GPUs: 0
MPI communication support: Disabled
Distribution level: 2D( 128)
Blocking size (min/max): 1024 / 2048
Matrix type: General
Arithmetic: Float
Format: CSC
N: 12339
nnz: 457155
+-------------------------------------------------+
Ordering step :
Ordering method is: Metis
Time to compute ordering: 0.0592
+-------------------------------------------------+
Symbolic factorization step:
Symbol factorization using: Fax Direct
Number of nonzeroes in L structure: 1750149
Fill-in of L: 3.828349
Time to compute symbol matrix: 0.0040
+-------------------------------------------------+
Reordering step:
Split level: 0
Stoping criteria: -1
Time for reordering: 0.0074
+-------------------------------------------------+
Analyse step:
Number of non-zeroes in blocked L: 3500298
Fill-in: 7.656698
Number of operations in full-rank LU : 670.40 MFlops
Prediction:
Model: AMD 6180 MKL
Time to factorize: 0.0569
Time for analyze: 0.0010
+-------------------------------------------------+
Factorization step:
Factorization used: LU
Time to initialize internal csc: 0.0361
Time to initialize coeftab: 0.0050
Time to factorize: 0.0276 (23.71 GFlop/s)
Number of operations: 670.40 MFlops
Number of static pivots: 0
RHS only consists of 0.0
________________________________________
CSC Conversion Time: 0.003112
Init Time: 0.075961
Factorize Time: 0.068795
Solve Time: 0.000013
Clean up Time: 0.000000
---------------------------------
Sum: 0.147881
Total PaStiX Time: 0.147881
CCX without PaStiX Time: 2.520762
Share of PaStiX Time: 0.055414
Total Time: 2.668644
Reusability: 0 : 1
________________________________________
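As a cross-check on the timing summary in the log above, here is a minimal Python sketch that parses the "Time" lines and recomputes the PaStiX share of the total run. The timings are copied from the log; the `parse_times` helper is hypothetical and written only for this illustration, it is not part of ccx or PaStiX.

```python
import re

# Timing lines copied from the ccx/PaStiX log above.
LOG = """\
CSC Conversion Time: 0.003112
Init Time: 0.075961
Factorize Time: 0.068795
Solve Time: 0.000013
Clean up Time: 0.000000
Total PaStiX Time: 0.147881
CCX without PaStiX Time: 2.520762
Total Time: 2.668644
"""

def parse_times(text):
    """Return {label: seconds} for every 'Label Time: value' line (hypothetical helper)."""
    pattern = re.compile(r"^(.*?Time):[ \t]+([0-9.]+)[ \t]*$", re.MULTILINE)
    return {label.strip(): float(value) for label, value in pattern.findall(text)}

times = parse_times(LOG)

# The five solver phases should add up to the reported PaStiX total.
phases = ["CSC Conversion Time", "Init Time", "Factorize Time",
          "Solve Time", "Clean up Time"]
phase_sum = sum(times[p] for p in phases)

# Fraction of the whole run spent inside PaStiX.
share = times["Total PaStiX Time"] / times["Total Time"]

print(f"sum of phases: {phase_sum:.6f} s")
print(f"PaStiX share:  {share:.2%}")
```

For this small model (N = 12339) the solver accounts for only about 5.5% of the run, so solver tuning matters much more on larger jobs such as the benchmark bolt assembly.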
Important, for the best possible performance on PaStiX, set:
set OPENBLAS_NUM_THREADS=1
use only physical cores (set OMP_NUM_THREADS=<number of physical cores>)
Please try with these settings.
I typically see about 30-35% speedup over PARDISO.
I tried OPENBLAS_NUM_THREADS=1 and 4; both seemed to run slower. I deleted the variable and it ran the fastest.
I also noticed that the OMP_NUM_THREADS setting makes only a few seconds' difference, and it seems to be best when I set it to 1.
Thoughts?
set PASTIX_MIXED_PRECISION=1
PASTIX + all DLLS from PARDISO
OMP_NUM_THREADS=6 (8 ran 20% slower)
OPENBLAS_NUM_THREADS=1 (helped by 8%)
PASTIX_MIXED_PRECISION=1 (helped by 8%)
Now over 40% speedup over PARDISO.
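The environment settings discussed in the posts above can be applied per run instead of system-wide. This is a minimal Python sketch, assuming the solver executable is on the PATH under the name `ccx` and accepts the job name as its only argument; both are assumptions for illustration, as are the helper names `pastix_env` and `run_ccx`. Only the three variable names come from this thread.

```python
import os
import subprocess

def pastix_env(physical_cores, mixed_precision=False, base=None):
    """Build an environment with the PaStiX-related settings from this thread.

    physical_cores: number of physical (not logical) cores to use.
    base: starting environment; defaults to a copy of the current one.
    """
    env = dict(os.environ if base is None else base)
    env["OPENBLAS_NUM_THREADS"] = "1"              # keep OpenBLAS single-threaded
    env["OMP_NUM_THREADS"] = str(physical_cores)   # one thread per physical core
    if mixed_precision:
        env["PASTIX_MIXED_PRECISION"] = "1"
    return env

def run_ccx(job, physical_cores, ccx_exe="ccx", mixed_precision=False):
    """Launch the solver with the tuned environment (executable name is an assumption)."""
    env = pastix_env(physical_cores, mixed_precision=mixed_precision)
    return subprocess.run([ccx_exe, job], env=env, check=True)

# Show only the variables this sketch sets, starting from an empty environment.
demo = pastix_env(6, mixed_precision=True, base={})
for name, value in sorted(demo.items()):
    print(f"{name}={value}")
```

Because the results reported here disagree (one poster ran fastest with no OpenBLAS setting at all, another gained 8% with it), it is worth timing each combination on your own machine rather than treating any one set of values as optimal.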
CPU: i7 6700, 3.40GHz, 4 cores, 8 threads.
MEMORY: 64GB DDR4
Running Time: 0 minutes + 44 seconds. (MW+calculix 2.17 PASTIX+PARDISO=SCOTCH+STATIC+OPENBLAS_NUM_THREADS=1)
PASTIX+PARDISO, OPENBLAS_NUM_THREADS=1, PASTIX_MIXED_PRECISION=1
MW+ccx2.17
Intel Xeon W-2123, 3.6GHz, 4 cores, 8 threads
Memory: 16GB
Run time: 0 min 43 sec
and
AMD Ryzen 3950, 4.2GHz, 16 cores, 32 threads
Memory 32GB, 3200DDR4
Run time: 0 min 30 sec
I like it!
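"Percent speedup" can mean either less run time or more work per unit time, and the two give different numbers. Here is a minimal Python sketch using run times reported in this thread. It assumes the 1:01 PARDISO run and the 0:30 PaStiX run were on the same Ryzen 3950X machine; note that they also used different ccx versions (2.16 and 2.17), so this is a rough comparison only. The function names are made up for this illustration.

```python
def to_seconds(mmss):
    """Convert an 'M:SS' run time string to whole seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)

def time_reduction(before_s, after_s):
    """Fraction of the original run time that was saved."""
    return (before_s - after_s) / before_s

def throughput_gain(before_s, after_s):
    """Fractional increase in runs completed per unit time."""
    return before_s / after_s - 1.0

# Run times quoted in the posts above (before, after).
runs = {
    "Ryzen 3950X, PARDISO -> PaStiX with settings": ("1:01", "0:30"),
    "i7 6700, PaStiX -> PaStiX with settings": ("0:52", "0:44"),
}

for label, (before, after) in runs.items():
    b, a = to_seconds(before), to_seconds(after)
    print(f"{label}: {time_reduction(b, a):.1%} less time, "
          f"{throughput_gain(b, a):.1%} more throughput")
```

When quoting a figure such as "40% speedup", it helps to say which of the two definitions is meant.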
1. Install the compiler toolchain and make:
$ pacman -Sy mingw-w64-x86_64-toolchain
$ pacman -Sy make
2. Install packages needed for CalculiX:
$ pacman -Sy mingw-w64-x86_64-openblas
$ pacman -Sy mingw-w64-x86_64-spooles
$ pacman -Sy mingw-w64-x86_64-arpack
3. Make a simple modification to the files ccx_2.17.c and CalculiXstep.c, like this:
and ccx_2.17.c (shown after modification):
4. Modify the Makefile to this form: