Optimization of Computer Hardware for Computational Fluid Dynamics In Solidworks Flow Simulation with AMD Ryzen 9 3900x and AMD Threadripper 3970x

Introduction

Solidworks Flow Simulation can save a great deal of design time and prototyping cost. For turbomachinery, conventional calculations can give a general approximation; however, these models typically do not consider all the characteristics of the flow and of the design, and often make a very large generalization to simplify calculations based on maximum efficiency. Therefore, computational fluid dynamics (CFD) with Solidworks Flow Simulation has been used to refine a turbomachinery system. In the beginning phases of gas turbine engine development the refinement is completely open ended, so hundreds of simulations can be designed and tested. Each of these simulations can take 8-16 hours to run, so both the hardware doing the solving and the way the simulations are run on it need to be tested and optimized for the application. Solidworks Flow Simulation does not scale perfectly with added cores; thus, a large core count processor will have diminishing returns over a smaller processor. However, multiple simulations can be run in parallel, which gives more total speedup than running the simulations one after another.

System

The first system’s processor is an AMD Threadripper 3970x. It is a 32-core processor with a base clock frequency of 3.7GHz and a thermal design power (TDP) of 280W. A similar processor, the AMD Ryzen 9 3950x, has half the cores, a base clock frequency of 3.5GHz and a TDP of 105W. The Threadripper package is more than two times as large and can therefore deliver more power to the cores, which is why it can sustain a higher base clock than the 3950x. For maximum performance per dollar the 3950x is a clear winner; however, the 12-core 3900x is not far off in performance per dollar and can be viewed as a budget option. The Threadripper was chosen for its high core count and high clock speed, which should give it the fastest single-simulation solve time on the market. A newer processor, the AMD Threadripper 3990x, is due to release in February. It is a 64-core processor that retains the same TDP as its smaller version, but with a base clock of 2.9GHz. That base clock is about 22% lower than the 3970x’s. Combined with the fact that Solidworks Flow Simulation scales poorly, in a roughly logarithmic way, with added cores, and that the 3990x only has quad channel memory support, the 3990x might actually be slower in Solidworks Flow Simulation than the 32-core 3970x, except when running an extremely high number of simultaneous solvers. The only other processors potentially faster than even a 3990x would be dual-socket server setups with AMD Epyc 7742 or 7H12 processors; however, server chips run at much lower clock speeds for stability, making them less than ideal. Solidworks Flow Simulation also does not noticeably increase performance with a second CPU socket, so server-grade dual-socket systems are not worth the cost. Intel processors generally have higher clock speeds, making general modeling, loading, rebuilding and setup faster. Also, Intel processors are better optimized for Solidworks Flow Simulation.
However, at the time of writing this report Intel does not have any offerings that come close to competing with the raw performance and performance per dollar of the AMD processors, even with the added performance optimization that Intel processors have for Solidworks Flow Simulation.

The cooler used on the Threadripper was the Cooler Master Wraith Ripper. This cooler is the best air cooler on the market; however, even it is not fully sufficient to cool 280W, since it was originally designed for the 2990WX, which only consumed 250W. Only one all-in-one (AIO) liquid cooler has full contact with Threadripper’s large integrated heat spreader (IHS): the Enermax Liqtech II. However, this AIO is not reliable, since the pump can fail or corrosion can clog up the waterways. The other AIOs only contact the center of the Threadripper’s IHS, so they do not actually perform much better than the Wraith Ripper, while also having the potential for pump failure. The chosen motherboard is excessive, with overclocking features that will go entirely unused; however, it was chosen for its smaller than normal EATX form factor and its five M.2 SSD slots. The memory was repurposed from a previous build; ideally it would have been a 3600MHz or faster kit, or an ECC kit for stability. The storage was an array of drives of varying speeds. The fastest, the Gigabyte SSD, uses an M.2 slot that gives it 4 PCIE 4.0 lanes and should be 40% faster than the Samsung 970 EVO, which only uses the PCIE 3.0 specification. The Samsung 850 PRO should be 145% slower than the 970 EVO since it only utilizes a SATA connector. The Seagate Ironwolf NAS drive is not ideal to put into a PC; however, it should perform just as well as any 5400 RPM disk drive for the testing. The Quadro RTX 4000 is simply the best Solidworks graphics card of this generation without being excessively costly; however, it only has 8GB of video RAM.

The second system’s processor is an AMD Ryzen 9 3900x. This 12-core processor has the highest performance per dollar of any of the processors discussed here, but it also has the smallest core count. The cooler on this system is a custom water loop with an EKWB Velocity waterblock and two Alphacool ST30 360mm radiators. The motherboard is a midrange x570. The memory is two sticks of 3600MHz, but with looser timings of CL 19 instead of the CL 16 of the Corsair memory. The storage is the same Gigabyte Aorus SSD that is the fastest drive in the Threadripper system. The graphics card is the same Quadro RTX 4000; this one is water cooled, but that will not matter for Flow Simulation.

Solidworks Flow Simulation is capable of remote solving on a separate computer, and while this does get more out of a single instance of Solidworks, it is not tested here. We estimate that a setup of two separate computers, each with a 3950x, could outperform a 3970x when running many parallel simulations; however, the two computers cannot work together on a single simulation. The Threadripper was picked over a dual setup for the times when one simulation needs to get done as fast as possible.

Table 1: System Information of the Computers

| Component | System 1 | System 2 |
|---|---|---|
| Processor | AMD Threadripper 3970x | AMD Ryzen 9 3900X |
| CPU Cooler | Cooler Master Wraith Ripper | EKWB Velocity (AM4) Custom Loop |
| Motherboard | Asus Zenith Extreme II | Gigabyte Aorus Ultra x570 |
| RAM | Corsair Dominator [4×16] DDR-3200 CL 16-18-18-36 | G.Skill Trident Z Royal [2×16] DDR-3600 CL 19-20-20-40 |
| Storage | Gigabyte Aorus NVMe Gen 4 1TB SSD (M.2); Samsung 970 EVO 1TB SSD (M.2); Samsung 850 PRO 250GB SSD; Seagate IronWolf NAS Drive 5400RPM 1TB | Gigabyte Aorus NVMe Gen 4 1TB SSD (M.2) |
| Graphics | Nvidia Quadro RTX 4000 | Nvidia Quadro RTX 4000 |

Methodology

Thermal Testing

The first tests concerned temperature and power usage on the Threadripper 3970x. It can consume 280W of power; however, at this power level it can easily spike above AMD’s posted specification of 68 °C maximum. The 3970x can instead be power limited to lower its temperatures. The benchmark is a simple thermal flow simulation named “Plate Fin Forced Benchmark” as shown in Figure 1. It is set at a 500K mesh so that a single-thread simulation does not take over an hour. From previous testing it was known that Solidworks Flow Simulation would not push the processor to its full power unless it was running many simulations in parallel. Therefore, the benchmark was run as 16 simultaneous simulations with a variable number of threads given to each solver. Going beyond 4 threads per solver makes the solvers overlap and share threads, since the 3970x only has 64 threads. The effect of this overlap on solve time and thermals was recorded.
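The thread budget described above is simple arithmetic, and a short sketch makes the overlap explicit. The function name and the "oversubscription ratio" are assumptions made for this illustration; they are not values reported by Solidworks Flow Simulation. The thread counts of 6, 8 and 12 per solver are the overlapping cases used later in the thermal results.

```python
def thread_demand(solvers: int, threads_per_solver: int, hardware_threads: int = 64):
    """Return (threads requested, oversubscription ratio) for a set of solvers.

    hardware_threads defaults to the 64 threads of the 3970x.
    """
    requested = solvers * threads_per_solver
    return requested, requested / hardware_threads


# 16 simultaneous solvers, as in the thermal tests.
for tps in (4, 6, 8, 12):
    requested, ratio = thread_demand(16, tps)
    status = "fits" if ratio <= 1 else "solvers share threads"
    print(f"{tps:>2} threads/solver: {requested:>3} requested, {ratio:.2f}x ({status})")
```

With 4 threads per solver the 16 solvers request exactly the 64 hardware threads; anything above that means the solvers have to share.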

The 3900x was not thermally tested since it draws only 105W of power under full Flow Simulation load, which the cooler is more than capable of handling with an average temperature of 55°C. A less overkill cooling solution would be the Noctua NH-D15 air cooler, which is also more than capable of dissipating 100W of power. The 3950x can use more power, but only marginally, so either cooler should still be sufficient.

Core Testing

In the exploration design phase, we have created hundreds of different turbomachinery configurations. Each of these is known to take about 8-16 hours to solve. With hundreds of designs this could easily take months to solve. Therefore, optimal usage of limited hardware is necessary. Solidworks Flow Simulation is known to not scale perfectly with additional cores: the first additional thread scales up the performance by 90%, but the gain from each further thread quickly diminishes. Rendering applications scale almost perfectly with added cores, because they split the work into many parallel processes. The same idea can be applied to Solidworks Flow Simulation if more solvers are run in parallel.

In order to run more solvers in parallel, a specific setup is necessary. The batch run menu and the parametric solver window both have a drop-down menu for the number of simultaneous solvers; however, this is unfortunately limited to two. To get more than two, another whole Solidworks window must be opened with a separate project file. This is rather inconvenient since a single parametric study cannot be used; instead copies have to be opened, requiring a lot of manual folder organization just to keep track of the simulations. If the drop-down menu’s options were increased beyond 2, the CPU could be fully utilized with no added work for the user. After 8 Solidworks windows have been opened, the error message “Solidworks cannot start up because system desktop application resources have been exceeded” appears. We believe this is not a Solidworks issue but rather a result of the Quadro’s 8GB of video RAM running out. We do not have a graphics card with more memory to test with, nor do we think it is practical to use more than 8 Solidworks windows in the first place. Thus only 16 solvers can be set up with a single computer and license. Remote solving will not increase this 16-solver limit. An increase to Solidworks’ limit of two solvers per window would alleviate this issue.

A main idea for core testing is hyperthreading, or simultaneous multithreading (SMT). SMT allows a core to be utilized by multiple applications at once, independently of one another. The Threadripper 3970x has 32 cores and two threads per core, or 64 threads. Solidworks Flow Simulation has options to change the number of cores a solver can use; however, this is actually changing the number of threads given to a solver. The core testing was conducted with a varying number of threads per solver, up to the maximum of 64, while also introducing multiple simultaneous solvers, up to the maximum of 16. The testing only exceeded the maximum of 64 threads when each solver was given more than its equal share of the threads, so that the solvers shared threads. This edge case was tested to show just how much performance can be lost or gained when the solvers try to share threads. Sharing threads should work better in the future when SMT 4 becomes mainstream, allowing 4 threads per core, leaving fewer idle cores, and reducing the need for us to force the cores to share threads, since the sharing would be implemented at the hardware level.

The trend of CPU technology is moving to a rapidly increasing number of cores.  This is because increasing the frequency of a core is already hitting an asymptote that can only be improved by shrinking the process node, while stacking many duplicate cores within a processor is comparatively easy and efficient for the processing performance gained.   So rather than relying only on increasing core frequency, CPU power is now being multiplied by its core count and the ability to run many processes in parallel.

The 3900x was also tested with this same benchmark, with up to 8 simultaneous solvers and an equally divisible number of threads. Sixteen solvers were not tested since there was already little performance to be gained from going from 4 solvers to 8 solvers; the asymptote of performance was already hit.

The project file for these benchmarks was the same as the thermal testing, named “Plate Fin Forced Benchmark.”  See Figure 1.  The 500K mesh was picked for the shortest solve time, since a single thread still takes over 20 minutes to solve this benchmark.

Figure 1: Velocity Cutplot of the Benchmark Model

Storage Testing

The same benchmark test was run on each of our storage drives to see whether the storage device had any impact on solve time. This is important since AMD Zen 2 processors support PCIE 4.0, and this could mean a significant increase in the cost of the storage solution. The first two drives are M.2 SSDs, which are significantly faster than the other two; however, getting more than one M.2 SSD on a motherboard could require either moving up the product stack or using a whole PCIE expansion card just to add more slots. The last two are SATA drives, which are much more budget-friendly options. The disk drive is not recommended, since project files can take up massive amounts of space and moving them to another drive can take hours; however, it is the slowest drive, so it is worth testing.

Results and Discussion

Thermal Testing

The thermal testing was conducted by varying the power limit of the 3970x in Ryzen Master, and then running 16 simultaneous benchmark simulations with four cores per solver. The tests were run again with overlapping threads, allowing each solver to use 12, 8, and 6 threads. The data showed that Solidworks does not often push the Threadripper to consume 280W. Even with 16 solvers at once it still only averaged 256W, not reaching the limit of 280W. The next step down, a 240W limit, dropped the temperature and power by 6.0% and 6.3% respectively, with only a 0.6% penalty in solve time. Each step down in power dropped the temperature significantly while not making much of a change to solve time. For Solidworks Flow Simulation this is important since the computer will be running constantly, and a slight performance increase for a large chunk of power is not worth the potential thermal degradation and instability.

While computational speed is very important in this application, it must also be balanced against the need for long term CPU reliability.    Electronics reliability decreases rapidly when devices are too hot for too long.  The life span of electronic devices closely follows the Arrhenius equation where chemical reactions and diffusion rates are dependent on temperature. It is widely known in the physical chemistry community that chemical reaction and diffusion rates double for every 10°C increase in temperature.  The life of a CPU is roughly halved for every 10°C increase in operational temperature.  Since this workstation is expected to run unusually heavy CFD workloads over the long term, running the CPU continuously at its full 280W power rating may increase the risk of premature failure, thermal degradation and electromigration.
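The 10°C rule of thumb above can be written as a simple relative-life factor. The sketch below only applies that rule mechanically; the function name is hypothetical, and the result is an illustration of the rule rather than a reliability prediction for any particular CPU.

```python
def relative_life(delta_t_c: float, halving_step_c: float = 10.0) -> float:
    """Relative lifetime after a temperature change of delta_t_c (in °C).

    Assumes the rule of thumb that life halves for every +10 °C
    and doubles for every -10 °C.
    """
    return 2.0 ** (-delta_t_c / halving_step_c)


print(relative_life(10.0))   # 0.5 -> running 10 °C hotter halves the life
print(relative_life(-10.0))  # 2.0 -> running 10 °C cooler doubles it
```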

The testing of overlapping solvers showed a large decrease in solve time, especially at the lower power levels, which experienced the greatest change; however, there was minimal change in power draw, frequency or temperature when overlapping solvers. See Figures 9 to 12. It is noteworthy that the standard deviation of solve time increased the more the solvers overlapped their threads. A strange pattern emerged in Figure 9. The two higher power tests both decreased solve time and then flattened out with no added performance after 6 threads. The middle power level of 200W continued to decrease solve time up to 8 threads, and then performance regressed at 12 threads. The lowest power setting of 160W continued to gain performance up to 12 threads, and was actually faster than the unlimited 280W test. This made no sense because the processor frequency was 12% lower. We believe this had something to do with the meshing taking processor resources when all 16 of these solvers start. Thus, the highest and lowest power levels were tested again with 12 threads per solver without meshing. When the mesh was already completed, the 280W limit did have the best average solve time of all, by 68 seconds. So, for this benchmark, meshing plays a large role in the solve times. Thermal testing of the 3970x shows that an 80W decrease in the CPU power setting reduces its computational speed by 4% and reduces its die temperature by ~11°C. Hence a 4% speed reduction effectively doubles the CPU’s long-term reliability. Depending on how hard a workstation is utilized, this may be an insurance policy worth considering.

Thermal testing also exposed another issue. At the start of a simulation the project needs to be meshed, and this is always a fully multithreaded process, even if the cores per solver are limited. Also, the meshing process consumes more power, always hitting whatever power limit was imposed. Thus, when meshing, the temperature can spike significantly. Figure 13 shows a night of our normal testing, with each spike in temperature occurring when meshing. At each of these points the CPU went from consuming 250W to the maximum of 280W. However, meshing is quite a short process, so a limit of 20% less power might only slow the meshing down by a minute in an 8 hour solve. Thermal cycling and current spikes have the capacity to thermally stress the processor, in addition to the electromigration problems. Thus, we have decided to put an upper limit on the power consumption in order to limit the severity of CPU power spikes when meshing.

Core Testing

The core testing was compared using a value denoted “X times one thread performance.” This was calculated by dividing the one-thread solve time by the solve time in question and multiplying by the number of parallel solvers. The first set of data is for the Threadripper and is graphed in Figures 2 to 4. These graphs show that a single solver hit an asymptote of 8 times one-thread performance at 24 threads, which means that adding 40 more threads to that one solver did almost nothing to solve time. That leaves over half of the processor doing no useful work. However, once another simulation was run alongside the first, the asymptote shifted to 12 times the one-thread performance. This pattern continued all the way up to 16 solvers at once, which reached 19.5 times the one-thread performance.
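The “X times one thread performance” value defined above can be sketched as a small function. The function and parameter names are hypothetical, and the timings in the example are placeholders chosen for easy arithmetic, not measurements from these tests.

```python
def times_one_thread(single_thread_time_s: float,
                     solve_time_s: float,
                     parallel_solvers: int = 1) -> float:
    """Throughput relative to one solver running on one thread.

    (one-thread solve time / solve time in question) * number of parallel solvers
    """
    if solve_time_s <= 0:
        raise ValueError("solve_time_s must be positive")
    return single_thread_time_s / solve_time_s * parallel_solvers


# Placeholder timings, not data from this report.
single_thread_s = 1200.0
print(times_one_thread(single_thread_s, 300.0))     # 4.0 (one solver)
print(times_one_thread(single_thread_s, 400.0, 2))  # 6.0 (two solvers, each slower)
```

The second call shows why parallel solvers can win: each solver finishes later than it would alone, but two results arrive in that time, so the combined throughput is higher.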

 

The 3900x’s performance is graphed in Figures 5 to 7. The 3900x did not flatline on a single solver the way that the Threadripper did; however, it only had 24 threads to work with, which was the point at which the Threadripper flatlined. Under multithreaded load, the 3900x had a limit of 9.3 times its single-thread performance.

 

Another way to look at the solve time is how long it would take to solve 16 simulations. For example, a single solver would run the simulations one after another 16 times, and a dual solver would run two simulations at a time, 8 times in a row. With this as the metric we can see that a dual solver setup on the Threadripper is 39.4% faster than a single solver setup. This scales up to 16 solvers, which can finish 16 simulations 84.1% faster. It is worth noting the asymptotic nature of this scaling: 16 solvers were only 12% faster than 8 solvers. This is significant because 16 solvers are twice the work for the user to set up, since 8 Solidworks windows need to be open and 8 separate project files need to be used. If the drop-down menu went higher than two, 16 solvers and beyond would become much more appealing. A drop-down selection of “8” or more would streamline the engineer’s setup workload and take full advantage of the vast processing potential in the new generation of multicore CPUs. The result would be significantly improved Solidworks performance when solving many simulations.
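The “time to solve 16 simulations” metric above can be sketched as follows. The helper names are hypothetical and the per-run times are placeholders picked for illustration; they are not the measured values behind the percentages quoted above.

```python
import math


def time_for_batch(per_run_time_s: float, solvers: int, total_sims: int = 16) -> float:
    """Wall time to finish total_sims when `solvers` simulations run side by side.

    per_run_time_s is the time each simulation takes at that level of parallelism.
    """
    rounds = math.ceil(total_sims / solvers)
    return rounds * per_run_time_s


def percent_faster(baseline_s: float, new_s: float) -> float:
    """Percentage reduction in total time relative to the baseline."""
    return (baseline_s - new_s) / baseline_s * 100.0


# Placeholder per-run times, not measurements from this report.
single = time_for_batch(100.0, solvers=1)  # 16 rounds of one simulation
dual = time_for_batch(130.0, solvers=2)    # 8 rounds of two simulations
print(single, dual)
print(f"{percent_faster(single, dual):.1f}% faster")
```

Each simulation in the dual setup takes longer on its own, yet the batch still finishes sooner because only half as many rounds are needed.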

Figure 8 shows how the 3900x scaled with respect to the Threadripper 3970x. In this chart the 3900x started off slightly faster in a single core test (because of its higher boost frequency), yet at its maximum of 24 threads it did worse than the 3970x at that same number of threads. With 8 simultaneous solvers it was only able to match the 3970x’s scaling with two simultaneous solvers at the same number of threads. The fastest the 3900x could solve 16 simulations was with 8 solvers, giving it a solve time of 2139 seconds, which is almost double the 3970x’s fastest time of 1071 seconds. This shows that for Solidworks Flow Simulation the 3970x is about twice as fast as the 3900x when solving many parallel simulations.

Figure 8: Combined Scaling of the 3970x and 3900x as Baselined to the 3970x

With 16 simultaneous solvers each solver takes much longer to complete, so if the computer were to crash mid solve, more data could potentially be lost. However, if the simulations are simply batch run, this is not an issue since the simulations are saved as they are solved and can be restarted where they stopped. In a parametric study window there is no option for continuing the calculation; instead there is only a “take previous results” option, which starts from the beginning with all of the cells set to where the previous run had stopped. This is not the same as continuing the calculation and can produce very different results for the convergence of goals, since it might take more or less time to solve. The only way to continue the calculation is to make a separate project file for each partially solved simulation, and if the parametric study changes geometry, the simulations cannot be batch run, making it much slower to finish those solves, with much more work for the user. Thus, in a parametric study, running more simultaneous solvers carries a larger risk of losing simulation data since each solve takes longer. We have therefore struck a balance of performance, work and risk by running 8 solvers at once.

The final discussion is on memory. In order to run this many simulations at once a large amount of memory must be used. The benchmark simulation used here had a mesh of 500K and used about 3GB of memory per solver. Thus, the total memory used when solving the 16 simultaneous simulations did not exceed our 64GB of memory. The motherboard supports up to 256GB of memory, at which point each solver could be about 5 times as large, but the solve time for each would then grow by considerably more than a factor of 5. Thus, memory is not a limiting factor for simultaneous solvers, as long as the limit remains at 16 solvers.
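The memory budget above can be checked with a few lines. This is only a rough sketch: the function name is hypothetical, and it assumes that memory use grows in proportion to the size of the simulation, which was not measured here.

```python
def memory_per_solver_gb(total_memory_gb: float, solvers: int) -> float:
    """Memory available to each solver if the total is split evenly."""
    return total_memory_gb / solvers


solvers = 16
benchmark_gb = 3.0  # approximate memory per solver for the 500K-mesh benchmark

print(solvers * benchmark_gb)                                         # 48.0 GB in use, under the installed 64 GB
print(memory_per_solver_gb(256, solvers))                             # 16.0 GB per solver at the board maximum
print(round(memory_per_solver_gb(256, solvers) / benchmark_gb, 2))    # 5.33 times the benchmark's footprint
```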

Storage Testing

The storage testing was conducted with 4 simultaneous solvers.  The data is tabulated in the table below.  The data showed no advantage of any storage solution over another.  Further testing might be conducted with the Solidworks install changed to each of these drives.

Conclusion

The thermal testing showed a significant 19°C drop in temperature with only a 7% increase in solve time when running the 3970x at 160W instead of its 280W maximum. The meshing temperature spikes were also reduced by another 4°C with almost no impact on total solve time. The core testing showed that the 3970x could continue to scale past 16 simultaneous solvers, but at that point we hit a hard limit. Moreover, 16 simultaneous solvers significantly increase the user’s workload, since 8 Solidworks windows need to be managed. Thus 8 solvers will be chosen to balance performance and user workload. Storage testing showed no significant difference in solve time when using an extremely fast SSD versus a slower disk drive.
