Improving performance of CCX solver

edited September 6
Here's a summary of useful options for CCX. I'll keep this post updated as an easy reference. Feel free to add other knowledge as a reply and I'll incorporate it here if appropriate.

1) SPOOLES. The default CCX included with Mecway.
Speed: Slow
Node limit*: 350 000
Difficulty: None

2)** MKL CCX downloaded from http://www.dhondt.de/ where it says "For an update of the bconverged distribution replace the executables in the bconverged download by the following files .".
Speed: Fast
Node limit*: over 550 000
Difficulty: Hard or impossible

3)** MKL CCX as in 2) but also install Intel oneAPI Base Toolkit from
https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html?operatingsystem=window&distributions=webdownload&options=online
and copy all the DLL files from %ProgramFiles(x86)%/Intel/oneAPI/mkl/latest/redist/intel64/ to the same location as the CCX .exe file. This takes advantage of CPU features like AVX2.
Speed: Faster
Node limit*: over 550 000
Difficulty: Hard or impossible

4)** Compile CCX with MKL and a patch to enable Out-Of-Core (OOC) mode. The source code with step-by-step instructions is in https://mecway.com/download/ccx_win64_mkl_pardiso_source_2.21_2.zip . After compiling, set it in Mecway through Tools -> Options -> CalculiX -> Solver.
Speed: Fast or Faster
Node limit*: over 1 400 000
Difficulty: Hard

5) Download CCX 2.22 compiled with PastiX from https://dhondt.de/calculix_2.22_4win.zip. Extract ccx_static.exe then set it in Mecway through Tools -> Options -> CalculiX -> Solver. For better correctness, set the environment variable PASTIX_MIXED_PRECISION=0.
Speed: Fast
Node limit*: over 700 000
Difficulty: Easy


*Node limit is the approximate maximum number of nodes for hex20 elements with static analysis. It also depends on mesh connectivity, available RAM and available disk space.

**For multithreading, set the environment variable OMP_NUM_THREADS to the number of threads, e.g. 8.
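
As a minimal sketch of the two environment variables mentioned above (OMP_NUM_THREADS and PASTIX_MIXED_PRECISION), the Python snippet below builds an environment for a CCX run and launches the solver. The helper names `ccx_env` and `run_ccx` and the paths in the usage comment are hypothetical. The `-i jobname` argument is the usual CalculiX command-line form (job name without the `.inp` extension).

```python
import os
import subprocess


def ccx_env(threads, pastix_mixed_precision=None, base_env=None):
    """Return a copy of the environment with the CCX-related variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_NUM_THREADS"] = str(threads)
    if pastix_mixed_precision is not None:
        # 0 is the value suggested in option 5 for better correctness.
        env["PASTIX_MIXED_PRECISION"] = str(pastix_mixed_precision)
    return env


def run_ccx(ccx_exe, job_name, work_dir, threads, pastix_mixed_precision=None):
    """Run CCX on <job_name>.inp inside work_dir and return its exit code."""
    env = ccx_env(threads, pastix_mixed_precision)
    result = subprocess.run([ccx_exe, "-i", job_name], cwd=work_dir, env=env)
    return result.returncode


# Example usage (hypothetical paths, adjust to your own installation):
# run_ccx(r"C:\ccx\ccx_static.exe", "model", r"C:\models", threads=8,
#         pastix_mixed_precision=0)
```

Setting the variables per run like this leaves the system-wide environment untouched, which makes it easier to compare settings.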

Comments

  • edited June 2021
    From our company intranet:


    Basically, run PASTIX but keep PARDISO handy :)
    We have found that 4-6 processors is useful, over that is diminishing return.

    For model size, we try to keep things under 500k nodes.
  • edited January 2022
    Hi there

    How much of the Intel oneAPI Base Toolkit is required?
    For example, the Intel Distribution for Python takes up a lot of space, so I'd prefer to install only what is necessary for this to work.

    Also, files such as "mkl_core.1.dll" did not exist, but "mkl_core.2.dll" and so on did. This was the case for all of the files that had a number before the .dll. Renaming the files allowed MSYS64 to build it.

    After following 4) and solving with ccx.exe, the solver output still states that the symmetric SPOOLES solver was used for a static analysis, and ccx_MKL.exe does not work. Solving with ccx_MKL.exe gives the response "solver did not produce an output file".

    I do notice that the files in /mecway 14/ccx do not have .1 before the .dll. With that removed from the new files, nothing has changed.
  • the question about ccx_MKL.exe i can't answer. victor will have to address that. however the rest of the things i may be able to help with.

    yes the dll re-numbering is annoying. intel keeps adding numbers to the dll files for some reason. first it was .1 now it's .2. previously, there were no numbers. this causes many of the ccx distributions to not work, unless you rename one particular file.

    i know what you mean about the install size. first you have to install microsoft visual studio, then intel base toolkit, then intel hpc toolkit. so this can get huge. there is one dll file where you have to install at least one compiler to get the dll. i'm not 100% sure what options would lead to the minimum install size. i actually use intel fortran. so i install that. but i have since added most of the others, just in case someone comes up with compiler instructions for ccx using the intel compilers. if there is a language you actually use, i would try that first. if you don't use them, the base of C or C++ or whatever they are calling it will work.

    below are my latest install instructions.

    ------------------------------------------------------------------------------------------------------

    ~ Getting the Calculix Windows build to run properly ~

    ------------------------------------------------------------------------------------------------------

    Download the Calculix Windows binary files from the Calculix website.

    Set the following Windows system environment variables:

    MKL_INTERFACE_LAYER=LP64
    MKL_THREADING_LAYER=INTEL
    OMP_NUM_THREADS=(Set to desired number of cores)
    OPENBLAS_NUM_THREADS=1
    PASTIX_MIXED_PRECISION=1

    Copy the following files to the same folder as the ccx_dynamic.exe file. The following files come from installing the Intel oneAPI Base and HPC toolkits.

    libiomp5md.dll (Doesn't come with the 'Base' toolkit, have to install the 'HPC' toolkit)
    mkl_core.2.dll
    mkl_def.2.dll
    mkl_intel_thread.2.dll
    rename mkl_rt.2.dll to mkl_rt.1.dll

    One of the following three files will also be needed. You will have to experiment to find out which your computer can use. Move each file in and out of the folder with the ccx_dynamic.exe file, to find out which one you need. Try to run PARDISO each time you move a different file into the folder. You may get a message saying a file is missing or the solver may just quit without any indication of what's wrong. Only have one of the files in the folder when you test.

    mkl_avx512.2.dll (fastest)
    mkl_avx2.2.dll (faster)
    mkl_sequential.2.dll (slowest)

    Make sure to keep the Intel oneAPI toolkits up to date. After you update the toolkits, copy all of the files you needed into the ccx_dynamic.exe folder again.

    Use the Calculix SOLVER= option to call one of three available solvers:

    SPOOLES (This solver requires a lot of memory for large problems)
    PARDISO (I generally get the lowest run times by using this solver. It also has the least hardware utilization)
    PASTIX (Best multi-core utilization, but not necessarily the fastest option)

    Examples; SOLVER=SPOOLES, SOLVER=PARDISO, SOLVER=PASTIX

    If the above command is not specified, the ccx_dynamic.exe file uses PASTIX by default.

    The run times you get with the solvers seem to depend greatly on the computer hardware you have. For my budget laptop, PARDISO is the fastest solver. It also uses the hardware the most efficiently, meaning the power draw is the lowest. PASTIX does a great job using multi-core. However, the run times I get are longer than with PARDISO, and it uses the most power. SPOOLES really isn't an option for me, because it cannot solve large models with a reasonable amount of RAM.

    ------------------------------------------------------------------------------------------------------
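
To illustrate the SOLVER= option described above, here is a minimal Python sketch that rewrites the `*STATIC` keyword line of a CalculiX input deck so that it carries the chosen solver. The function name `set_static_solver` is hypothetical, and this only shows what the resulting keyword line looks like; it is not how Mecway itself edits the deck.

```python
import re


def set_static_solver(inp_text, solver):
    """Return inp_text with SOLVER=<solver> set on every *STATIC keyword line."""
    out = []
    for line in inp_text.splitlines():
        if re.match(r"\s*\*STATIC\b", line, re.IGNORECASE):
            # Split the keyword line into its comma-separated parameters,
            # drop any existing SOLVER= parameter, then append the new one.
            params = [p.strip() for p in line.split(",")]
            params = [p for p in params[1:] if not p.upper().startswith("SOLVER")]
            line = ", ".join(["*STATIC"] + params + [f"SOLVER={solver}"])
        out.append(line)
    return "\n".join(out)


# Example usage:
# print(set_static_solver("*STEP\n*STATIC\n*END STEP", "PARDISO"))
```

Only `*STATIC` is handled here. Other procedure keywords would need the same treatment if you want to force a solver for them too.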
  • @Sebastianmaklary you can uninstall OneAPI after making copies of those DLLs so I would just install everything. However, anything to do with Python is probably not needed.

    Looks like I need to update the build script and makefiles for the new OneAPI filenames. If you want to do it yourself sooner, the two files that refer to these DLLs are
    ccx/src/build.sh
    ccx/src/patches/CalculiX/ccx_2.17/src/Makefile_MKL

    I don't recommend renaming them since they may refer to each other and expect the 2 in the name.
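
As a rough sketch of the "make copies of those DLLs" step, the Python function below copies every `.dll` from a oneAPI redist folder into the CCX folder and keeps the original file names, in line with the advice not to rename them. The function name `copy_mkl_dlls` and the paths in the usage comment are assumptions for illustration.

```python
import shutil
from pathlib import Path


def copy_mkl_dlls(redist_dir, ccx_dir):
    """Copy every .dll from redist_dir into ccx_dir, keeping the original names."""
    redist_dir, ccx_dir = Path(redist_dir), Path(ccx_dir)
    copied = []
    for dll in sorted(redist_dir.glob("*.dll")):
        shutil.copy2(dll, ccx_dir / dll.name)
        copied.append(dll.name)
    return copied


# Example usage (hypothetical paths, adjust to your own installation):
# copy_mkl_dlls(
#     r"C:\Program Files (x86)\Intel\oneAPI\mkl\latest\redist\intel64",
#     r"C:\ccx",
# )
```

Re-running it after a oneAPI update refreshes the copies in the CCX folder.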
  • Are there any updates to steps for improving the performance of CCX since the release of Mecway v.15 ?

    I had upgraded my PC to Windows 10 some time back and got spun around trying to regain Pardiso functionality. Never tried Pastix, but would also like to.
  • options 2 and 5 are probably your best bet in the post at the top of this thread. Pastix is often the fastest, but can sometimes be flaky, so keep Pardiso around.
  • Unpacked the files from http://www.dhondt.de/ and was able to get Pastix running by pointing to the ccx_static_i8.exe file from Tools/Options/Calculix/Solver line. Thank You. (What's the difference between static.exe & static_i8.exe?)

    Pardiso remains elusive. Option #2) above says extract ccx_pardiso.exe, but I don't think it's called that anymore. Tried pointing to ccx_dynamic.exe, but got the red death screen (No solve). I confess ignorance. I don't know if the pardiso binary I'm searching for was compiled with the libraries, or if I have to add the libraries to its directory, or if I have to build the thing myself. Victor's update RE: Version 15.0 release -- "CCX updated to 2.19 with source code that includes all required MKL files" -- does that include the library files for Pardiso?

    So many forum members have been good enough to post their (evolving) "recipe", I'm just stuck in the intersection and need a helping Scout to cross the street. Line-by-line instruction set for third graders would suit me fine.
  • I think the i8 is for 8-byte (64-bit) integers. People were having crashes for very large models and that seems to be the fix. However, in the version I downloaded, I don't think that file is in there. So it seems to be new.

    I've posted the setup quite a few times here and in the Calculix forum. Not sure if it's worth repeating. Victor has a different method. In fact, I think everyone does it slightly differently. So it's hard to answer. It also changes when Intel renames their files.

    The last time I tested the Calculix for Windows files, the dynamic version was running slightly faster than the static version. The person who created the files was surprised by that. I run the ccx_dynamic.exe file. It runs Pastix by default now. So I have to manually force Pardiso. I prefer to stick with Pardiso. To run that, you have to get all the files from the Intel oneAPI distribution. I'm not sure what the current version of Calculix for Windows is doing. I would have to download it and see. The version I downloaded was when 2.19 first came out.
  • @prop_design

    Thanks. That clears up a few things.

    Are you also using the CCX folder modify keyword *STATIC ==> Name=SOLVER , Value=PARDISO ?
  • I just downloaded the files again and tested them. I'm confused as well. There is a ccx dynamic i8 that doesn't run. I can run ccx dynamic or the static versions. So there must be another library or something that he hasn't mentioned. I guess if you need the i8 you can use the static version. The readme file has always been out of date or missing things. I feel your frustration.

    I attached my personal readme file. It's to get the ccx dynamic without the i8 running. To use PARDISO, there are things you can modify in mecway. I have a few different ways of doing it. One way is via importing the attached file.

    Whatever the missing libraries are to get ccx dynamic i8 to run, don't appear to be anything from the Intel oneAPI. I tried all of the files there and it still doesn't run.
  • Oh, I think I remember testing the i8 versions and they were much slower at the time. I think there is an old post about it. In any event, I can't really remember. This time, I just did a quick check to see if they run. I didn't look for the speed differences. So that's something you might want to check. The ccx dynamic file can handle pretty large files. I just solved a model with a little over one million nodes, using a laptop with 16 GB of memory and the Pardiso solver. I try to stick to in-core memory usage, so I don't get a major speed decrease. I think you will need a lot of memory before i8 would come into play.
  • A year or so ago I was having trouble with large Pastix runs hanging, and ran across something in a discourse about Pastix at the time saying that large problems could overflow the array counters. I think it was 3rav who recompiled the Windows version of Pastix i8 for me. The i8 version of CalculiX is slightly slower for problem sizes where the default i4 version would work, and the memory needed is slightly more than needed for Pardiso. Therefore I use Pastix i8 for problems up to about 1,100,000 nodes, then shift to Pardiso for larger ones (the larger memory demand of Pastix slows it down more with significant paging). I run with 64 GB of RAM and about 128 GB of page file on a very fast SSD. I suspect with 16 GB of memory and no major paging, i8 is unneeded. I have not run into any problems with Pastix otherwise, though my problems are straightforward nonlinear plastic, with many passes for final convergence. Pardon my typing... keyboard dying.
  • For option 5, I downloaded Calculix_2.20_4win. There is no ccx_static_i8.exe, but there is ccx_static.exe. Should I use this, or am I missing something? Thanks, Dave
  • Depends upon your problem size. The i8 only seems to be needed for problems over 500,000 nodes or so, so if you don't need it the i4 compile should be fine and perhaps a touch faster. It was a factor for me as my problems tend to run over 500,000 nodes, but I set my computer up to be able to run them. If you run into the issue it is annoying as there is no warning, just a lack of progress somewhere mid-run. For me it is Pastix i4 for problems of 500,000 nodes or less, Pastix i8 for problems from 500,000 to 1,100,000 nodes, then Pardiso for problems over 1,100,000 nodes, as Pastix uses more active memory and the virtual memory is on the verge of thrashing. This last threshold depends on your memory and setup. If you don't have a machine set up with a lot of memory and heavy use of virtual memory, you would switch to Pardiso sooner and the advantage of the i8 compile would not exist for you.
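
The rule of thumb in the comment above can be written as a small Python function. This is only a sketch of that one poster's thresholds, which they report for a machine with 64 GB of RAM and a large page file; the function name `suggest_solver` and the default limits are illustrative assumptions, not fixed properties of the solvers.

```python
def suggest_solver(node_count, i4_limit=500_000, pastix_limit=1_100_000):
    """Pick a solver build from the node count, using adjustable thresholds."""
    if node_count <= i4_limit:
        # Small enough that the default 4-byte integer build is reported to work.
        return "PaStiX (i4 build)"
    if node_count <= pastix_limit:
        # Larger models reportedly hang with i4, so use the 8-byte integer build.
        return "PaStiX (i8 build)"
    # Beyond this, PaStiX's higher memory use reportedly causes heavy paging.
    return "Pardiso"


# Example usage: a machine with less RAM might lower the second threshold.
# suggest_solver(800_000, pastix_limit=700_000)
```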
  • i8 doesn't seem to be included in 2.20 anymore. I've updated the top post to link to 2.19.
  • I have been using ccx_dynamic.exe in 2.20. It runs Pastix unless you are doing a modal analysis, in which case it switches to Pardiso. I have found it to be more robust than previous Pastix releases. It also lets you switch to Pardiso if you need to, which can be done with a modified keyword command in Mecway.
  • Mike, Victor, John,

    Thanks, I have extracted static i8 and just need to persuade someone with the right admin permissions to copy it over. I have neither the confidence nor the permissions to attempt anything other than Victor's option 5! Will I have the option to change the SOLVER keyword as per cwharpe & prop_design's suggestion above?

    My models often have very large numbers of nodes. We do a lot of thin film thermal analysis, also components that have one thin dimension. It is very difficult to be parsimonious with the nodes and have sufficient resolution within these components. The bits of the models that I can ditch are already meshed very coarsely so it doesn't save me much. I quite often have no symmetry to take advantage of.

    One more question: I have a laptop with nearly 16 GB RAM and an Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz 1.99 GHz (copied straight from my computer settings). How many threads should I specify? By using more threads is there an increased danger of 'out of memory' failures?


  • hi dave,

    you should be able to use the SOLVER keyword, if using the CCX solver. I don't think the number of threads affects the memory. The memory is mainly set by the number of nodes. You can experiment with the number of threads. With new CPUs they don't always scale the way you would think. Especially with laptops. On my laptop, PASTIX seems to make things slower. It looks like PASTIX hammers the CPU a lot more than PARDISO and that makes the CPU frequency go down. So PARDISO ends up working better for me. For the CPU model you specified, you could try 2, 4, 6, and 8 threads and see how they perform.

    I'm not exactly sure what the i8 binary applies to. I know it means 8-byte (64-bit) integers. However, I'm not sure which solvers it applies to (SPOOLES, PASTIX, or PARDISO). I think I saw something on the CCX forum that mentioned it only applied to one of the solvers. Perhaps you or someone else knows. I tried benchmarking an earlier version of the i8 binary and it was a lot slower than the normal binary. I also have a laptop with 16GB of memory. I have been keeping my models in core to keep the solve times reasonable. It sounds like you can't do that. Once things go out of core, it takes so long I have abandoned the solve. I think I keep the node count around 700k. You may very well need the i8 binary for higher node counts.

    anthony
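
The suggestion above to try 2, 4, 6, and 8 threads can be automated with a short timing loop. The Python sketch below runs the same command once per thread count with OMP_NUM_THREADS set accordingly and records the wall-clock time. The function name `time_thread_counts` and the command in the usage comment are hypothetical.

```python
import os
import subprocess
import time


def time_thread_counts(command, thread_counts, cwd=None):
    """Run command once per thread count; return {threads: elapsed seconds}."""
    timings = {}
    for n in thread_counts:
        # Copy the current environment and override only the thread count.
        env = dict(os.environ, OMP_NUM_THREADS=str(n))
        start = time.perf_counter()
        subprocess.run(command, cwd=cwd, env=env, check=True,
                       stdout=subprocess.DEVNULL)
        timings[n] = time.perf_counter() - start
    return timings


# Example usage (hypothetical paths, adjust to your own installation):
# results = time_thread_counts([r"C:\ccx\ccx_static.exe", "-i", "model"],
#                              [2, 4, 6, 8], cwd=r"C:\models")
# for threads, seconds in results.items():
#     print(threads, "threads:", round(seconds, 1), "s")
```

Since run times also vary with whatever else the machine is doing, repeating each measurement a few times gives a fairer comparison.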
  • 6-8 processors, then significantly diminishing returns :)
  • It's so nice that there are quite a few solver options.

    I experimented once without success with:
    3)** MKL CCX as in 2) but also install Intel oneAPI Base Toolkit from
    https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html?operatingsystem=window&distributions=webdownload&options=online
    and copy all the DLL files from %ProgramFiles(x86)%/Intel/oneAPI/mkl/latest/redist/intel64/ to the same location as the CCX .exe file. This takes advantage of CPU features like AVX2.
    Speed: Faster
    Node limit*: over 550 000
    Difficulty: Hard or impossible

    I have an AMD chip and it looked like there were 3rd party items that wouldn't be compatible with it, or it was just me getting lost.

    After that I selected:
    2)** MKL CCX downloaded from http://www.dhondt.de/ where it says "For an update of the bconverged distribution replace the executables in the bconverged download by the following files .".
    Speed: Fast
    Node limit*: over 550 000
    Difficulty: Hard or impossible

    and this one worked for me just fine, no issues at all, but after installing a new version of Mecway and uninstalling the previous one I realised some DLLs were missing. This made me explore the option I am using now, which is:
    4)** Compile CCX with MKL and a patch to enable Out-Of-Core (OOC) mode. The source code with step-by-step instructions is in ccx_win64_mkl_pardiso_source_2.19.zip in Mecway's install location. After compiling, set it in Mecway through Tools -> Options -> CalculiX -> Solver.
    Speed: Fast or Faster
    Node limit*: over 1 400 000
    Difficulty: Hard

    It was a slightly longer process but pretty well defined in the description file. The advantage to me is that I straight away put all the exe and dll files in a standalone directory, so any new install will hopefully be easy to connect. I must admit that speed fast vs very fast isn't too descriptive, but running a model of 1.4m nodes instead of 550k is a massive improvement.

    My spec:
    AMD Ryzen 5 2600 Six-Core Processor 3.40 GHz
    RAM DDR4 64,0 GB
    I have both HDD and SSD
    And I am quite happy with the speed to results and very happy with the 1.4m nodes limit ;)
  • I've tried

    4)** Compile CCX with MKL and a patch to enable Out-Of-Core (OOC) mode. The source code with step-by-step instructions is in ccx_win64_mkl_pardiso_source_2.19.zip in Mecway's install location. After compiling, set it in Mecway through Tools -> Options -> CalculiX -> Solver.
    Speed: Fast or Faster
    Node limit*: over 1 400 000
    Difficulty: Hard

    But when running, ccx_MKL.exe takes more time than ccx.exe

    Is this normal?
  • That's not normal. Though maybe if your model has a small mesh and large number of time steps, it could be slower due to possible overhead of calling Pardiso. It should be clearly faster with a big linear static model.

    Maybe it doesn't include the right MKL dlls for your platform (what CPU model?) and is defaulting to something generic.
  • Acer Aspire VN7-792G

    Processor Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz 2.60 GHz
    Installed RAM 8.00 GB (7.85 GB usable)
    Device ID 61AFF701-BB57-4C66-8CD8-FAB777CE7D60
    Product ID 00325-95924-00879-AAOEM
    System type 64-bit operating system, x64-based processor
    Pen and touch Pen support
  • I ran it several more times, and it seems that the solving time varies depending on other tasks running on the PC.
    With all else equal and running several times, ccx_MKL is twice as fast as ccx.
  • 150 seconds to solve the Bolt assembly test instead of 320 seconds for ccx.exe
  • Victor,
    I'm looking for the latest version of the Pardiso CCX solver and couldn't find it online. I saw this post on building one from your option 4) and followed the instructions, but no files were in the install folder after the last step. Could you help me with this please?
  • JohnM,

    Where can I download ccx_dynamic.exe for 2.20? Could you help me?
  • edited July 2023
    If there are no binaries in the install directory, check ccx/src/x64/buildlog.txt for error messages, typically in the last few lines of the file. Disregard warnings because there are always a lot of them.

    A common reason it fails is if there's a space in the home directory name.
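
Both checks above are easy to script. The Python sketch below prints the last lines of the build log and tests a path for spaces. The helper names `tail_lines` and `path_has_space` are hypothetical, and `Path.home()` in the usage comment returns the Windows user folder, which is only an assumption about where the build's home directory lives.

```python
from pathlib import Path


def tail_lines(log_path, count=20):
    """Return the last `count` lines of a text file."""
    text = Path(log_path).read_text(encoding="utf-8", errors="replace")
    return text.splitlines()[-count:]


def path_has_space(path):
    """True if the path contains a space anywhere in its name."""
    return " " in str(path)


# Example usage (run from the folder that contains ccx/):
# for line in tail_lines("ccx/src/x64/buildlog.txt"):
#     print(line)
# print("Space in home directory:", path_has_space(Path.home()))
```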
  • Victor, thanks for your reply. I'll look into it.