Optimization of Computer Hardware for Computational Fluid Dynamics Solving in Solidworks Flow Simulation

 

Introduction

Computational Fluid Dynamics (CFD) programs such as Solidworks Flow Simulation can be a powerful tool in the design, analysis and optimization of turbomachinery. CFD simulations of a functioning turbine engine require powerful processors and long calculation times.  This computer hardware test report compares the effect of processor choice (Intel versus AMD), CPU overclocking, memory type, and memory overclock speed on Solidworks Flow Simulation calculation times.

Analysis work in CFD can take many months of non-stop CPU processing time for a single project.  A computer with the right parts and configuration can cut that processing time in half or better, so setting up a workstation properly optimized for CFD can save considerable engineering time.

System

Two different computer builds using the latest high-end consumer processors from Intel and AMD are listed in the table below.  The two computers will be referred to by their respective processors.

Table 1: System information of the two computers

|  | Intel system | AMD system |
|---|---|---|
| Processor | Intel i9-9900K | AMD Ryzen 9 3900X |
| Processor details | 8 cores, 16 threads at 4.8 GHz | 12 cores, 24 threads at 4.3 GHz |
| Motherboard | Gigabyte Aorus Ultra Z390 | Gigabyte Aorus Ultra X570 |
| RAM | Corsair Dominator [4 × 16 GB] DDR4-3200 CL 16-18-18-36 | G.Skill Trident Z Royal [2 × 16 GB] DDR4-3600 CL 19-20-20-40 |
| Storage | Samsung 970 Evo Plus 1 TB | Samsung 970 Evo 500 GB |
| Graphics | Nvidia Quadro RTX 4000 | Nvidia GTX 650 Ti Boost |

At the time of purchase each processor cost $500 USD, and both are generally considered high-end consumer processors.  However, it should be noted that the AMD processor is built from chiplets, with 8 cores on one chiplet and 4 cores on the second, connected by an interconnect called the Infinity Fabric.  The 8-core chiplet is binned higher and the 4-core chiplet is binned lower, so the first chiplet should perform better than the second.

Both motherboards are of the same tier for their respective chipsets.  Both support memory up to 4400 MHz; however, the AMD CPU links the frequency of the Infinity Fabric, which has a limit of 1800 MHz, to half of the memory frequency, so beyond 3600 MHz there is much less performance to be gained.  This also correlates with the price of memory, which increases sharply above 3600 MHz, so no faster memory was tested.  The memory used in the two systems differs considerably: the Corsair RAM has a lower clock speed and a lower CAS latency (CL), and consists of four sticks instead of two.  This difference does translate into a noticeable change in solve time, and each memory kit was cross-tested on both CPUs for comparison.

The storage and graphics cards do not have an appreciable impact on Flow Simulation solve time.

Methodology

RAM Benchmarking

The first set of tests measured solve time at different RAM frequencies (in megahertz).  Solidworks Flow Simulation is a RAM-dependent task because the entire simulation is held in memory and the CPU continuously reads and writes its calculations to and from memory, so memory speed is critical to calculation speed.  These tests were necessary to define and isolate the differences between the two sets of RAM before CPU-based testing and comparisons could be performed.

All CFD simulation tests were standardized and conducted using a simple model of hot gas flowing through a housed radial turbine vented to atmosphere.  The model is a standard turbocharger running at a pressure ratio of 3, a mass flow of 110 lb/s, and a turbine wheel speed of 66,000 RPM.  The mesh was automatically generated by Solidworks with 10 cells in x, 12 cells in y, and 8 cells in z.  The mesh was further refined with a minimum of 5 cells across each channel, a refinement level of 4, a small-feature refinement level of 4, and a curvature refinement level of 3.  This produced a mesh with 119,279 cells in total (see Figure 3).  The solver was run for four travels, with 106 iterations per travel.

The tests were run at varying RAM speeds with slight CPU overclocks of 4.8 GHz on the Intel and 4.3 GHz on the AMD.  The RAM was tested at 2133 MHz, 2666 MHz, 3200 MHz and 3600 MHz, with a few exceptions.  First, the Corsair memory is only rated for 3200 MHz, so it was not tested above that frequency.  The AMD system with four Corsair modules was not able to POST above 2400 MHz, which was determined to be a known BIOS limitation on Gigabyte motherboards at the time of testing.  The Intel system with two G.Skill modules was not able to POST at 2666 MHz and was instead tested at 2600 MHz; it is unknown why this occurred.

Core Testing

When given a very large number of simulations to process, how can CPU core allocation be optimized for the quickest overall time? The next set of tests compares the number of cores allocated per simulation against solve time.  As shown below, Solidworks Flow Simulation does not scale well with added cores: a rendering workload, for example, would show roughly 99% of theoretical scaling with additional cores, while Flow Simulation achieves only about 90% efficiency when a second core is added to a solve.  Every additional core reduced the solve time, but with rapidly diminishing returns.  This suggests it is best to use a low core count (2-4 cores) for each simulation and run multiple simulations at once.  Running parallel simulations scales much better with added core count, so tests were conducted to determine how much better they scale and how to minimize total solve time over many simulations.
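As a rough way to reason about this trade-off, the sketch below (Python, with entirely hypothetical solve times standing in for measured data) computes throughput in simulations per hour for a few (parallel solvers × cores per solver) configurations and reports the best one.

```python
# Sketch: pick the best split of CPU cores across parallel Flow Simulation solvers.
# The solve times below are hypothetical placeholders; substitute your own benchmarks.

# (number of parallel solvers, cores per solver) -> measured solve time per simulation, seconds
measured = {
    (1, 8): 300.0,   # one solver using 8 cores
    (2, 4): 420.0,   # two solvers, 4 cores each, running at the same time
    (4, 2): 640.0,   # four solvers, 2 cores each
}

def throughput_per_hour(n_solvers: int, solve_time_s: float) -> float:
    """Simulations completed per hour when n_solvers run concurrently."""
    return n_solvers * 3600.0 / solve_time_s

best = max(measured.items(), key=lambda kv: throughput_per_hour(kv[0][0], kv[1]))
for (n, cores), t in measured.items():
    print(f"{n} solver(s) x {cores} cores: {throughput_per_hour(n, t):.2f} sims/hour")
print(f"Best configuration: {best[0][0]} solvers with {best[0][1]} cores each")
```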

To run multiple solvers simultaneously, the projects can be batch-run with two solvers at once; however, this menu only allows up to two simulations at a time.  To go further, multiple Solidworks windows must be opened and the simulation files copied to separate locations so that the solvers are not reading and writing the same files.  Opening more Solidworks windows adds idle load on the processor, so the results are unavoidably skewed somewhat.  The tests were performed with one, two, four, and eight solvers at once on each system, plus six and twelve solvers on the AMD.
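Because each extra pair of solvers needs its own copy of the project in a separate folder, that copying step can be scripted.  The sketch below (Python; the folder name and path are hypothetical examples) simply clones a project folder once per planned Solidworks window. It does not attempt to launch or drive Solidworks itself.

```python
# Sketch: make one working copy of a Flow Simulation project per Solidworks window,
# so parallel solvers never read/write the same files. Paths are hypothetical examples.
import shutil
from pathlib import Path

source_project = Path(r"C:\CFD\turbine_baseline")   # folder holding the assembly and flow project
n_windows = 4                                        # one copy per Solidworks window you plan to open

for i in range(1, n_windows + 1):
    target = source_project.parent / f"{source_project.name}_copy{i}"
    if not target.exists():
        shutil.copytree(source_project, target)
    print(f"Window {i}: open the project in {target}")
```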

One of the main factors in this test is hyperthreading, which splits a single physical core into two logical cores so the core can stay busy while one thread waits on another part of the system.  Hyperthreading typically yields only about a 15-30% performance gain, and only when the workload scales well across multiple threads, which is not necessarily the case for Solidworks Flow Simulation.  Since the Intel processor has 8 cores and 16 threads, performance should start to level off at 8 cores, with only small further gains up to the full 16 threads.  Similarly, the AMD processor with 12 cores and 24 threads should taper off at 12 cores, and its chiplet design might also produce an inflection point at 16 threads, after which the slower 4-core chiplet is being used.

The model for this testing used the same setup as the RAM testing, but with a coarser mesh of 42,608 cells.  This was done to save time when running the 83 test scenarios.

Remote Solver Testing

The final portion of testing treated one computer as a workstation offloading its Flow Simulation computation to a remote solving computer over a network.  This is especially relevant in industry, since not every workstation can run such a demanding CPU load and not all engineers need to run Flow Simulation at once.  An engineering department can therefore network many workstations with low-cost CPUs to a server computer with a much more powerful CPU to solve very large computational tasks.

Using the same test model as the core testing, remote solving was compared against local solving.  Only the points of full core utilization were tested, since these are the points of highest efficiency; the same trends are expected with other core counts, so only the highest evenly divisible core counts were tested.

Results and Discussion

RAM Benchmarking

In all tests, increasing the RAM frequency lowered the solving time.

At every RAM frequency tested, the Intel i9-9900K solved faster than the AMD Ryzen 9 3900X.  However, the AMD benefited more from increasing RAM frequency.  This is to be expected, since the AMD processor increases its Infinity Fabric clock speed with RAM frequency, allowing its two chiplets to communicate with one another more quickly.

Comparing two versus four sticks of Corsair memory did not make a significant change in solve time.  This could be for two reasons.  First, the test simulation was small enough to run in a single 16 GB module, so the additional memory was not utilized.  Second, both CPUs only support dual-channel memory, meaning four sticks share the same two channels (two modules per channel) and provide no additional bandwidth to the CPU.
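A quick calculation illustrates the second point: peak DDR4 bandwidth depends on the transfer rate and the number of channels, not the number of modules, so a rough estimate (Python sketch below) gives the same figure whether two or four sticks populate a dual-channel platform.

```python
# Rough peak-bandwidth estimate for a dual-channel DDR4 platform.
# Peak GB/s = transfer rate (MT/s) x 8 bytes per transfer per channel x channels / 1000.
def peak_bandwidth_gb_s(transfer_rate_mt_s: float, channels: int = 2) -> float:
    return transfer_rate_mt_s * 8 * channels / 1000.0

for rate in (2133, 2666, 3200, 3600):
    print(f"DDR4-{rate}, dual channel: ~{peak_bandwidth_gb_s(rate):.1f} GB/s "
          f"(unchanged whether 2 or 4 modules populate those channels)")
```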

At the base frequency of 2133 MHz, all sets of RAM produced similar solve times.  At every frequency the AMD CPU was slower, but it benefited more from increasing RAM frequency, gaining 22.7% with the G.Skill 2x kit going from 2133 MHz to 3600 MHz versus 14.6% for the Intel CPU.

With a Corsair CAS latency of CL16 and a G.Skill CAS latency of CL19, the Corsair is 3 cycles, or roughly 16%, lower in latency, but this difference produced only a small performance advantage in testing.  When both kits were tested at the same frequency of 2133 MHz, the Corsair RAM yielded a 0.6% faster solve time on the AMD and a 2% faster solve time on the Intel.

Conversely, the G.Skill RAM is rated for a maximum clock frequency about 12% higher than the Corsair (3600 MHz versus 3200 MHz).  When the G.Skill RAM frequency was increased from 3200 MHz to 3600 MHz, the solve time was reduced by 3.1% on the AMD and 2.4% on the Intel.

Thus, higher-frequency memory should be prioritized over lower-latency memory.  Additionally, lower-latency, lower-frequency modules generally cost more than higher-latency, higher-frequency modules.  If RAM cost is not a consideration, the shortest solve time is obtained with the lowest-latency, highest-frequency modules available.
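One way to see why frequency wins is to convert CAS latency from clock cycles to nanoseconds: first-word latency in nanoseconds is roughly 2000 × CL / transfer rate (MT/s), so the higher-frequency CL19 kit ends up with nearly the same absolute latency as the CL16 kit while delivering more bandwidth.  The short sketch below works this out for the two kits tested.

```python
# Convert CAS latency from cycles to nanoseconds: ns ~= 2000 * CL / (transfer rate in MT/s).
def cas_latency_ns(cl_cycles: int, transfer_rate_mt_s: int) -> float:
    return 2000.0 * cl_cycles / transfer_rate_mt_s

kits = [
    ("Corsair Dominator DDR4-3200 CL16", 16, 3200),
    ("G.Skill Trident Z Royal DDR4-3600 CL19", 19, 3600),
]
for name, cl, rate in kits:
    print(f"{name}: ~{cas_latency_ns(cl, rate):.1f} ns first-word latency")
```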

Table 2: Solve time with varying RAM frequency

Solve time by RAM frequency (MHz):

| CPU | Modules | 2133 | 2400 | 2600 | 2666 | 3200 | 3600 |
|---|---|---|---|---|---|---|---|
| AMD | G.Skill 2x | 707 | – | – | 639 | 581 | 563 |
| AMD | Corsair 2x | 707 | – | – | 628 | 577 | – |
| AMD | Corsair 4x | 704 | 658 | – | – | – | – |
| Intel | G.Skill 2x | 611 | – | 571 | – | 541 | 528 |
| Intel | Corsair 2x | 609 | – | – | 558 | 530 | – |
| Intel | Corsair 4x | 606 | – | – | 559 | 533 | – |

Percent difference from 2133 MHz:

| CPU | Modules | 2400 | 2600 | 2666 | 3200 | 3600 |
|---|---|---|---|---|---|---|
| AMD | G.Skill 2x | – | – | 10.1% | 19.6% | 22.7% |
| AMD | Corsair 2x | – | – | 11.8% | 20.2% | – |
| AMD | Corsair 4x | 6.8% | – | – | – | – |
| Intel | G.Skill 2x | – | 6.8% | – | 12.2% | 14.6% |
| Intel | Corsair 2x | – | – | 8.7% | 13.9% | – |
| Intel | Corsair 4x | – | – | 8.1% | 12.8% | – |
Figure 6: Ram Frequency vs Solve Time

Core Testing

For the core testing, the solve time for one core on the Intel was defined as 1; with perfect scaling, two cores would cut the solve time in half.  However, Solidworks Flow Simulation is known not to scale well with additional cores: with two cores, the Intel CPU performed 1.79 times faster than with one core, or 90% efficiency.  As expected, the core utilization efficiency continued to slowly diminish as more cores were dedicated to solving.  When more than 8 cores were used, the core efficiency dropped sharply as the solve time started to level out.
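The scaling efficiency used throughout this section is simply the measured speedup divided by the number of cores, with the single-core solve time as the baseline.  A minimal sketch of that calculation is shown below; the solve times are hypothetical round numbers chosen only to reproduce the roughly 1.79x two-core speedup reported above.

```python
# Scaling efficiency = (single-core time / N-core time) / N.
# The solve times here are hypothetical, chosen to match the ~1.79x two-core
# speedup reported for the Intel CPU.
def efficiency(t_one_core_s: float, t_n_cores_s: float, n_cores: int) -> float:
    speedup = t_one_core_s / t_n_cores_s
    return speedup / n_cores

t1, t2 = 1000.0, 558.0        # hypothetical 1-core and 2-core solve times, seconds
print(f"Speedup with 2 cores: {t1 / t2:.2f}x")
print(f"Efficiency: {efficiency(t1, t2, 2):.0%}")   # ~90%
```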

Figure 7: Intel Core Count Scaling with Multiple Parallel Simulations

The same testing was done on the AMD CPU, with efficiency calculated two ways: once with one Intel core's solve time as the 1.00 baseline and once with one AMD core's solve time as the 1.00 baseline.  This gave a metric for comparing Intel to AMD.  The single-core point showed the AMD CPU at 78% of the Intel's performance, which is worse than expected, since the AMD's 4.3 GHz clock is about 90% of the Intel's 4.8 GHz.  As with the Intel, adding a second core was only 81% efficient, worse than the Intel's 90% scaling.  With additional cores, the AMD did not show a sharp transition point at 12 cores; instead it reached 3 times its single-core performance at 10 cores and leveled out.  There was no inflection point at the expected 8-core mark, where additional cores would be on the second chiplet, nor at the start of hyperthreading (12 cores and beyond).  This could indicate that Flow Simulation simply does not scale past 10 cores for a single solver.

Figure 8: AMD Core Count Scaling with Multiple Parallel Simulations

The next phase of testing used two simultaneous solvers.  The first point started with one core per simulation, which gave both CPUs an improvement over their two-core, single-simulation performance: the Intel went from 90% efficiency with a single solver to 96% with two solvers, and the AMD from 81% to 99%.  This suggests the AMD handles multiple simulations better, since it showed the larger jump from one solver to two; however, the trend did not continue.  With each additional core, the Intel showed the larger increase and the AMD fell behind.  Again, the Intel had an inflection point at 8 cores, after which the gains were much smaller, while the AMD had already started to taper off before its 12-core hyperthreading point.

With four simultaneous solvers, the Intel lagged significantly, barely besting its two-solver efficiency, whereas the AMD continued to improve significantly over its two-solver performance.  Four simultaneous solvers is where the hyperthreading inflection point finally disappears on the Intel, yet it is still not present on the AMD.

Beyond four simultaneous solvers there are rapidly diminishing returns, with much smaller gains than the jump from two to four solvers.  This is most likely due to the additional Solidworks window opened for every two solvers added; at 8 solvers, four Solidworks windows must be open at once.  The Intel was 2% more efficient going from four solvers with four cores each to eight solvers with two cores each, and this 8-solver configuration, at 47% efficiency, was the highest all-core efficiency measured.

With the AMD, the gains were much smaller as more solvers were added: four solvers with six cores each produced 33% efficiency, and six solvers with four cores each produced 38% efficiency.  However, increasing from six solvers to eight decreased efficiency, and going further to 12 simultaneous solvers gained only 1%.  Thus, going past four simultaneous solvers does increase performance, but the gains are so marginal that the extra work of setting up more than four Solidworks windows (with separate folders) is not worth the added manual setup time.

Additional testing examined solvers assigned more cores than an even division allows.  On the Intel this was done with two simultaneous solvers assigned nine or more cores each.  This is relevant because Solidworks defaults to assigning all cores to each solver, meaning the solvers must share cores.  Going from eight cores each to all sixteen cores each decreased performance by 2%.  Thus, the cores should be divided evenly among the solvers rather than letting Solidworks automatically assign all cores to each solver.

Figure 9: Both AMD and Intel Core Count Scaling with Multiple Parallel Simulations, with 1 Intel Core as a Baseline 1x Performance

In further testing that used multiple Solidworks windows to run more than two simultaneous simulations, the runs were noticeably less stable.  The solver reported the error "Solver Abnormally Terminated" significantly more often than with a single instance of Solidworks running, as much as 15% of the time.  This was only detected in testing with a much finer mesh, where simulations took one to two hours rather than the sub-ten-minute simulations reported here, and those larger simulations were not large enough to be crashing from lack of memory.  Reducing the number of simultaneous solvers below two did not reduce the crash rate, because it was already under 1% with only two solvers.  Due to this added instability, it is not recommended to exceed two simultaneous solvers.

Remote Solver Testing

In general, offloading increased the performance of the offloaded simulation and decreased the performance of the locally solved simulation.  When the local solve was shorter than the offloaded solve, the remote solve could finish its last portion more quickly; likewise, when the local solve was longer than the offloaded solve, the local solve finished its last portion more quickly.  This confounding variable of differing solve times on the two systems cannot be isolated without running twice as many tests, with each case forcing a longer solve time on the system not being measured.  The data is tabulated in Table 5, showing the solve times of each CPU with varying numbers of simultaneous solvers.  The first half of Table 5 shows the AMD as the host computer with Solidworks open, and the second half shows the Intel as the host.

The first metric to examine is the total solve time of the Intel and the AMD against their respective reference solve times (taken from the Core Testing section above), compared as a percent difference.  A positive percentage indicates the simulation ran faster than the reference; a negative percentage indicates it took longer to solve.  The Intel clearly benefited more when simulations were offloaded to it from the AMD, gaining 3.8% to 6.7% in solve time, whereas offloading from the Intel to the AMD gained only 1.8% to 3.2%.  However, the performance impact of offloading on the local solver was larger on the AMD than on the Intel, ranging from 2.3% to 11.3% on the AMD versus 0.9% to 6.3% on the Intel.
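As a small illustration of how these percentages are read, the sketch below computes the percent difference against a reference solve time for hypothetical offloaded and local runs (the times are placeholders, not values from Table 5); a positive result means the run was faster than its reference.

```python
# Percent difference versus a reference solve time: positive = faster than reference,
# negative = slower. The times below are hypothetical placeholders, not Table 5 data.
def percent_difference(reference_s: float, measured_s: float) -> float:
    return (reference_s - measured_s) / reference_s * 100.0

offloaded = percent_difference(reference_s=600.0, measured_s=570.0)   # ran faster remotely
local     = percent_difference(reference_s=600.0, measured_s=630.0)   # slowed by hosting duties
print(f"Offloaded simulation: {offloaded:+.1f}%")   # +5.0%
print(f"Local simulation:     {local:+.1f}%")       # -5.0%
```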

The last column of Table 5 combines the performance gain on the offloaded solver with the performance drop on the local solver.  A positive number indicates that hosting on the Intel is better; a negative number indicates that hosting on the AMD is better.  The first block of three shows just how much faster the Intel is, especially when it is relieved of the solver's background load; however, Intel's lead diminishes as the number of parallel simulations increases.  The numbers highlighted in green are cases where each CPU handled the same number of simulations; each of these is positive, meaning the Intel was the better computer to host the simulations.  This is surprising, considering the AMD's extra cores should make handling that many tasks at once easier.

Conclusion

Across all of the RAM testing, it was clear that the Intel i9-9900K was the far better option.  The AMD Ryzen 9 3900X showed a considerably steeper decline in solve time as the RAM frequency was increased, which underlines the need to pair that CPU with the fastest RAM available.  Memory latency did not significantly change performance and can be overlooked, given the higher cost of lower-latency modules.  Further testing could be done with memory above 3600 MHz, where both curves should start to level off.

In the core testing the Intel CPU was still on top, with the lowest solve times across every configuration.  However, as the number of parallel simulations increased beyond four, the AMD CPU did not fall short by nearly as much as at the start.  This is likely because the AMD improved by roughly the same margin going from two to four solvers as it had going from one to two, and only flattened out after 6 parallel simulations.  Thus, in an ideal case the Intel CPU should run 4 parallel simulations and the AMD CPU should run 6, at which point both CPUs would take nearly identical time to solve 8 simulations: 578 s for the Intel and 581 s for the AMD.  However, as it stands, Solidworks can only run that many parallel simulations with multiple windows open and separate copies of the files, so setting up a high number of parallel simulations takes considerably more work.  Also, because of the instability seen beyond two simultaneous solvers, it is not recommended to exceed two solvers, although future versions of Solidworks may mitigate this instability.

The remote solver testing showed an increase in performance for the offloaded simulation and a reduction in performance for the local simulation.  The performance increase was larger when offloading to the Intel than to the AMD, and the performance decrease was larger on the AMD than on the Intel.  Together this showed that the best arrangement is to run the simulations on the Intel and offload them to the AMD, although only by 0.1% to 3.7%.  Further testing could reduce the core count allotted to the solvers so that at least one core is reserved for each solver window, in which case the AMD CPU, with its higher core count, should be the host computer.
