Mecway PC benchmarks

Comments

  • mingw64/mingw-w64-x86_64-libwinpthread-git 9.0.0.6215.788f57701-1 (mingw-w64-x86_64-toolchain) [installed]
  • Hi there,

    On my laptop it took 05:17 using MKL Pardiso on 4 cores and 8 threads, following the description included with Mecway 14.

    Hardware:
    i7-8565U (boosts up to around 3.2-3.5 GHz)
    16 GB DDR4 2400 MHz

    Maybe someone knows a way to speed it up a bit?
  • If you are not doing modal analysis, try the PaStiX solver - there is a thread on this.
  • With MKL, also apply option 3 here: https://mecway.com/forum/discussion/1012/improving-performance-of-ccx-solver - copying over extra DLLs enables CPU-specific optimizations, which speeds it up over the basic set described with the source code.
  • Today I managed to perform my CalculiX performance tests
    [but I remain very disappointed with the performance]

    Legion Y540 - [Notebook]
    Intel Core i7-9750H @ 2.60 GHz - 4.5 GHz boost, 12 MB cache
    32 GB DDR4 2667 MHz
    ADATA SSD M.2 8200 - 3500/3000 MB/s
    RTX 2060
    [Windows 11]

    Mecway 14.0 - [Windows compatibility mode, run as admin]

    Solver 2.18 - PaStiX [ccx_dynamic.exe] - [Windows compatibility mode, run as admin]
    Environment variables (a batch sketch for scripting this sweep follows at the end of this comment):
    OMP_NUM_THREADS = [ 6 ; 8; 10; 12]
    OPENBLAS_NUM_THREADS = 1
    PASTIX_MIXED_PRECISION = 1
    PASTIX_ORDERING = 0
    PASTIX_SCHEDULER = 0
    ________________________________________________
    OMP_NUM_THREADS = [ 6 ] - 170s - 2:50
    OMP_NUM_THREADS = [ 8 ] - 168s - 2:48
    OMP_NUM_THREADS = [10] - 158s - 2:38
    OMP_NUM_THREADS = [12] - 155s - 2:35

    Solver 2.17 - Pardiso MKL [ccx_MKL.exe] - [Windows compatibility mode, run as admin]
    Environment variables:
    MKL_NUM_THREADS = 12
    All DLLs and libs copied into the solver folder.
    ________________________________________________

    OMP_NUM_THREADS = [ 8 ] - 268s - 4:28
    OMP_NUM_THREADS = [10] - 303s - 5:03
    OMP_NUM_THREADS = [12] - 301s - 5:01

    I was a little curious why many managed to solve this with simpler computers in under 1:40, some in under a minute.

    So I'm confused about what could be wrong. Any suggestions?
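
    For repeated sweeps like the one above, the runs can be scripted instead of relaunched by hand. A minimal Windows batch sketch, assuming PaStiX via ccx_dynamic.exe and a placeholder job name "benchmark" (benchmark.inp in the same folder):

    @echo off
    setlocal EnableDelayedExpansion
    rem Run one solve per thread-count setting and print timestamps.
    set OPENBLAS_NUM_THREADS=1
    set PASTIX_MIXED_PRECISION=1
    for %%N in (6 8 10 12) do (
        set OMP_NUM_THREADS=%%N
        echo Started with %%N threads at !time!
        ccx_dynamic.exe -i benchmark
        echo Finished with %%N threads at !time!
    )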
  • edited October 2021
    I can't say for sure; however, the thread linked below talks about how to get Pardiso to run with the ccx_dynamic.exe file from version 2.18. Also, updating your Intel MKL might help. Intel does still make improvements to Pardiso, though some releases don't change it at all.

    http://mecway.com/forum/discussion/1039/ccx-pardiso-low-multi-core-usage/p1

    Some info from the above thread: to get Pardiso or Spooles to run with the ccx_dynamic.exe file from v2.18, you have to modify the keyword for your solver type.

    SOLVER=type

    type options: SPOOLES, PASTIX, PARDISO. If this change isn't made, PaStiX is now used by default. I'm hoping future versions of Mecway will make it easier to specify which solver you want to use.
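
    For example, a minimal sketch of where the parameter goes in the CalculiX .inp file (the *STATIC card here is illustrative - put SOLVER on whichever procedure card your analysis already uses):

    *STEP
    *STATIC, SOLVER=PARDISO
    ...
    *END STEP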

    The files I copied to the ccx_dynamic.exe folder are:

    libiomp5md.dll (you have to install at least one intel oneapi compiler to find this file)
    mkl_avx2.1.dll
    mkl_avx512.1.dll
    mkl_core.1.dll
    mkl_def.1.dll
    mkl_intel_thread.1.dll
    mkl_rt.1.dll

    Version 2.17 didn't use mkl_avx2.1.dll or mkl_avx512.1.dll. Instead, it used mkl_sequential.1.dll.

    For some reason, you can no longer use a PATH environment variable to point to the MKL files. I had to physically copy all of the above files into the same folder as the ccx_dynamic.exe file. In previous releases of ccx, PATH variables pointing to the MKL worked.

    I copied the mkl_avx2.1.dll file just in case it was used; however, it doesn't seem to be. It looks like the mkl_avx512.1.dll file is used in my case. This offered some speedup. You might have to experiment to see which file your CPU needs.

    Below are the environment variables I specified based on input from 3rav. They seemed to help speed up both PaStiX and Pardiso (a session-only batch sketch follows the list):

    MKL_INTERFACE_LAYER=LP64
    MKL_THREADING_LAYER=INTEL
    OMP_NUM_THREADS=(Set to desired number of cores)
    OPENBLAS_NUM_THREADS=1
    PASTIX_MIXED_PRECISION=1
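
    If you'd rather not set these system-wide, a minimal batch sketch that sets them for one session only and then launches the solver (the job name "model" is a placeholder, and 8 threads is just an example):

    @echo off
    rem These variables only affect programs started from this script.
    set MKL_INTERFACE_LAYER=LP64
    set MKL_THREADING_LAYER=INTEL
    set OMP_NUM_THREADS=8
    set OPENBLAS_NUM_THREADS=1
    set PASTIX_MIXED_PRECISION=1
    ccx_dynamic.exe -i model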

    v2.18, with all of the above settings, did help Pardiso run better than the 2.17 release. However, it still has long periods of single-core use. PaStiX uses more cores more of the time. The run characteristics are different for each solver, so one may work better than the other depending on your hardware. In my case, Pardiso was slightly faster than PaStiX for most of the tests I ran. Spooles wasn't an option for me, due to the memory requirements of that solver.

    anthony
  • Sorry for the delay in responding.

    Your answer was pretty accurate. I recently updated the solver folder with the DLLs, and I got an error from AVX2, so I got confirmation that it is being used.

    I also added some environment variables that seem to work well together; see the description in the attached image.

    @prop_design
    Thanks a lot for the help.
  • edited October 2021
    Glad you figured it out. I just noticed an error in my previous post: I meant to say that with the v2.17 release, mkl_sequential.1.dll was being used by Pardiso instead of the AVX2 or AVX512 DLL files. I just updated my previous post with the correction. I'm getting some speedup now since it's using AVX512. I'm not sure how it decides which DLL files to use. I think 3rav is the one who has been compiling these for Windows use; he might know more. All I can confirm is that v2.18 is running better than v2.17 was. To test which DLL files are used, I moved them in and out of the folder with the ccx_dynamic.exe file.
  • i7-4700MQ @ 2.4 GHz
    RAM: 16 GB DDR3-1600
    Bolt Assembly Test.zip

    - 1st run, March 2020 (CalculiX 2.16 PARDISO): 5 min 17 s
    - Current run, October 2021 (CalculiX 2.18 PASTIX): 1 min 2 s

    Thank you very much to all the people who have worked to tune the optimum setup parameters and variables, find the required DLLs, etc., and to those who provided benchmark files for users to set up and detect misconfigurations.
    Spectacular change!
  • edited October 2021
    Just to clarify an important point, since the message history of this thread is unclear: there are two test files,

    File - Bolt Assembly Test
    File - Bolt Mount Test Update 2

    in the file
    Bolt Assembly Test
    Calculix 2.18
    Pardiso - CalculiX Time: 60.19s - Time Solver
    Pastix - CalculiX Time: 36.87s - Time Solver

    in the file
    Bolt Assembly Test Update 2
    Calculix 2.18
    Pardiso - CalculiX Time: 254.95s - [4m 14s] - Time solver
    Pastix - CalculiX Time: 169.54s - [2m 49s] - Time solver

    It is important to always use the same benchmark for effective comparisons of hardware versus time to solution. I spent a lot of time looking for solutions, thinking there was a problem with my hardware, because I was looking at the Bolt Assembly Test 1 results while running Bolt Mount Test Update 2.

    My hardware:
    Legion Y540 - [Notebook]
    Intel Core i7-9750H @ 2.60 GHz - 4.5 GHz boost, 12 MB cache
    32 GB DDR4 2667 MHz
    ADATA SSD M.2 8200 - 3500/3000 MB/s
    RTX 2060
    [Windows 11]
  • edited October 2021
    Thanks for pointing that out. Yeah, I don't even have the original file on my computer anymore. When I made the updates, I asked Victor if he could remove the old version, but he didn't want to, so as not to cause confusion. Unfortunately, that's what happened anyway.

    I made those files for a different thread. The OP of this thread started using it as a benchmark. The intent of the file was to show a way to model bolted joints.

    It's fine if people want to use it as a benchmark, but any number of models could serve the same purpose. In fact, it would be better to run a benchmark with a known solution; this file is arbitrary.

    When I run benchmarks, I use other files that I made.
  • Hi 3rav,

    I tested the file you requested. Unfortunately, the results were really bad. It's using a ton of hard drive throughout the solve, which slows it way down on my computer. Also, it says Pardiso is not linked. In any event, below are the results. The model is an update of one used in a different thread; using the latest Netgen allows me to get the mesh I have been wanting. Previous versions of Netgen didn't work as well, and I had to play a lot of tricks to get any mesh at all. This model is meant to test model size. It uses all of my available memory. It's a nonlinear solve for stress, there is no contact, and it takes 10 iterations to converge.

    As before, Pardiso uses the least physical resources and ends up being faster than PaStiX. PaStiX uses a lot of resources and ends up being slower. I think my result is strictly for a budget laptop; depending on hardware, PaStiX may end up faster for some people.

    As mentioned previously in this thread, I don't use the bolt assembly model to benchmark, because the model has changed a lot over time. I could download the old models off this thread and run them, but that really has no value for me. I could also run the current model, but that wouldn't match the posts here.

    You can use any model you want to benchmark; just be consistent with your tests. Meaning, use the same model for all the comparisons you do.

    Lastly, I had no part in creating this thread or using this model for benchmarking. I've tried to make that clear many times. I made the model to show how to model a bolt for a different thread on this forum.


  • Hi, I am new here and to Mecway. Nice software, and I like the multithreading. At the moment I use CCX Pardiso 2.16, and in contrast to others on this forum, setting the maximum number of cores gives the best result. Maybe the quad-channel memory controller helps with that?

    My results with 10 cores:

    Bolt assembly test: 49 s (55 s with 8 cores, 68 s with 6 cores)
    Bolt assembly test Update 2: 196 s

    Hardware:
    X299 platform
    i9-10900X 3.7 GHz, OC 4.5 GHz all-core
    4x8 GB DDR4 3600 MHz
  • My results on the bolt assembly using CCX 2.19 STATIC:

    CPU: i7-6700 @ 3.40 GHz, 4 cores x 2 threads
    Memory: 64 GB DDR4
    Run time: 40.40 seconds
  • edited March 2022
    AMD Ryzen 9 5900HX, 8 cores, laptop.
    Memory: 16 GB 3200 MHz

    calculix_2.19\ccx_static.exe

    MKL_INTERFACE_LAYER=LP64
    MKL_THREADING_LAYER=INTEL
    OMP_NUM_THREADS = [8]
    OPENBLAS_NUM_THREADS =[1]
    PASTIX_MIXED_PRECISION = [1]

    Pastix -- Bolt Assembly Test.liml ------------- 21 s
    Pastix -- Bolt Assembly Test Update 2.liml ----- 1 min 19 s

    Edited: No OneAPI installation seems to be required. I just copied and pasted the DLLs from my previous computer. ccx_dynamic.exe does not respond.

    glut64.dll
    mkl_avx2.1.dll
    mkl_avx512.1.dll
    mkl_core.1.dll
    mkl_def.1.dll
    mkl_intel_thread.1.dll
    mkl_rt.1.dll
    mkl_sequential.1.dll
    My *.dll files all seem to be "not quite" your list (for example, mkl_rt.dll not *.1, avx not avx2).

    How do I get the list you show?
  • edited March 2022
    @JohnM

    Those numbers started to appear automatically with each update of the OneAPI toolkit.

    EDITED: I managed to make ccx_dynamic.exe v2.18 work.
    I added the path to all the libraries to the system PATH variable and changed the name of mkl_intel_thread.1.dll to mkl_intel_thread.dll.

    Pardiso is now working, but slower than PaStiX.
    PaStiX improved.

    AMD Ryzen 9 5900HX, 8 cores.
    Memory: 16 GB 3200 MHz
    calculix_2.18\ccx_dynamic.exe

    MKL_INTERFACE_LAYER=LP64
    MKL_THREADING_LAYER=INTEL
    OMP_NUM_THREADS = [12]
    OPENBLAS_NUM_THREADS =[1]
    PASTIX_MIXED_PRECISION = [1]

    Pastix - Bolt Assembly Test Update 2.liml ----- 50 seconds
  • edited May 2022
    CPU: Intel i7-4700MQ @ 2.4 GHz
    RAM: 2x8 GB DDR3L
    SSD storage.


    Spooles solver: 8min 1s
    PaStiX solver, OMP_NUM_THREADS=1: 2min 22s
    PaStiX solver, OMP_NUM_THREADS=2: 2min 17s
    PaStiX solver, OMP_NUM_THREADS=3: 1min 54s
    PaStiX solver, OMP_NUM_THREADS=4: 1min 31s
    PaStiX solver, OMP_NUM_THREADS=5: 1min 25s

    After also adding the environment variables below, the time is down to 1 min 13 s. It feels good to be able to halve the computing time this easily.
    MKL_INTERFACE_LAYER=LP64
    MKL_THREADING_LAYER=INTEL
    OPENBLAS_NUM_THREADS=1
    PASTIX_MIXED_PRECISION=1

    A further reduction by 10 seconds is achieved by not having the window with the solver details open while it's running.

    To those others who are confused about where to edit environment variables, as I was until 30 minutes ago, you can do it as explained at this link: https://phoenixnap.com/kb/windows-set-environment-variable
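
    Alternatively, from a Command Prompt, setx stores a variable permanently for new processes (a minimal sketch; the values are just examples, and it does not affect programs that are already open, so restart Mecway afterwards):

    setx OMP_NUM_THREADS 4
    setx PASTIX_MIXED_PRECISION 1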
  • Hi all,

    Please find below my contribution: the benchmark test run on my computer (Intel Core i7 M 640, 2.8 GHz base speed, 8 GB RAM, solid-state drive):

    Filename: Bolt Assembly Test Update 2.liml

    Mecway14 (Calculix Pardiso 2.17) - 1 CPU: 38 min 27 s (Total CalculiX Time: 2283.807720)

    Mecway15 Beta (Calculix Pardiso 2.19) - 1 CPU: 41 min 19 s (Total CalculiX Time: 2461.041211)


    Regards,
    Ivan.
  • So now I got a new workstation and Damn it is fast!

    CPU: AMD Ryzen 9 5900X 12-core processor, 3.7 GHz.
    RAM: 64 GB

    Mecway 15, CalculiX 2.19, ccx_static.exe, running PaStiX. The same environment variables as in my comment above.

    Runtime with 1 CPU core: 1 min 8 s.
    Runtime with 5 CPU cores: 23 s.
  • Below is a link to the latest version of this example. As the OP of this thread found, it also serves as a pretty good performance benchmark. The mesh size will max out 16 GB of physical memory, so if you have less than that, it will most likely crash your system. If you have more memory, you can increase the mesh density. The latest version is 'Update 3'.

    https://mecway.com/forum/discussion/737/bolt-assembly-example#latest
  • edited December 2023
    About bolt benchmark: CCX 2.21 PASTIX ccx_static

    i9-14900, 24 cores / 32 threads, 6 GHz; DDR5 (5700 MHz); 128 GB RAM
    Time elapsed: 31 seconds with 8 threads.
    Time elapsed: 34 seconds with 12 threads.


    About bolt benchmark: CCX 2.20 PASTIX ccx_static

    i9-14900, 24 cores / 32 threads, 6 GHz; DDR5 (5700 MHz); 128 GB RAM

    Time elapsed: 19 seconds with 12 threads.


    About bolt benchmark: CCX 2.19 PASTIX ccx_static

    i9-14900, 24 cores / 32 threads, 6 GHz; DDR5 (5700 MHz); 128 GB RAM

    Time elapsed: 21 seconds with 12 threads.



    PS: Setting the environment variables to use all 24 cores / 32 threads makes the calculation much slower (never ending).
  • Thanks @mmartin. Do you mind clarifying what you had to do with the environment variables for cores and threads? @bobs also had a problem with a high core count: https://mecway.com/forum/discussion/1404/error-allocating-memory-error-different-outcomes-for-different-computers and I wonder if it's the same thing.
  • edited December 2023
    We have come to accept that there is a flattening at around 8 cores, so we set up computers in this range. I think there is always a tradeoff between cores and "share time" between cores. This will be problem specific, so you may find success with a particular test case but later find inefficiency in others.
    PS - If someone has this problem beat, I'm listening!
  • JohnM, you are right. Around 6-8 cores we reach a horizontal asymptote (a flattening of the cores-vs-time plot). My experience running the bolt benchmark with 24 cores was that the computer kept running while the minutes passed by, so there is clearly something wrong. I never expected better performance, but I wanted to try it out. From one of your studies posted here, I knew performance would drop off with more than 6 cores. I hope that in the near future mathematicians will improve the PaStiX algorithm. I will go ahead with more models and let you know.
    I set OMP_NUM_THREADS=8 despite having 24 cores / 32 threads available.
    But I will change this configuration to find out whether there is any other kind of model or analysis type willing to use more cores.

    Manuel Martín
  • Tomorrow I will set the environment variables as follows:

    OMP_NUM_THREADS = [ 6 ; 8; 10; 12]
    OPENBLAS_NUM_THREADS =1
    PASTIX_MIXED_PRECISION = 1
    PASTIX_ORDERING = 0
    PASTIX_SCHEDULER = 0

    Let's see what happens!

    mmartin
  • I have a suspicion that this partly depends on the ratio between memory bandwidth and core speed. Most problems require at least the solution to be held outside the cache, usually most of the calculation data too, and L3 is usually shared. Looking through the literature, it looks like most systems, including high-end workstations, max out at 8 cores, sometimes 6, on Abaqus as well as CalculiX. There is a similar limit on the number of CPUs. It would be interesting to see if some of the new processors with a lot of L3 cache do better than their brothers with about the same core speed. Anyone out there running a 5800X3D or 7800X3D? Page file on a fast PCIe 5 SSD with a 7800X3D?
  • About bolt benchmark: CCX 2.21 PASTIX ccx_static
    i9-12900K, 32 GB RAM
    Time elapsed: 29.65 seconds with 12 threads.
  • Be careful if you are using:

    PASTIX_MIXED_PRECISION = 1

    It can speed up the analysis, but some issues have been reported with it.