Virtualized Petascale High-Performance Computing
System Case Study for New Material Modelling
Computational Efficiency
Rishabh Sinha, x21171203, Cloud Architecture, MSc. Cloud Computing,
National College of Ireland, Dublin, Ireland, x21171203@student.ncirl.ie
Abstract—The objective of this project report is to assess the prospective technical and financial advantages of transitioning a leading European research center to a virtualized petascale architecture. The research facility employs machine learning-based methods to address the limitations of conventional experiments in material science, aiming to achieve cost and time efficiency. The facility uses a variety of processing techniques to analyze massive datasets taken from material databases in an effort to uncover latent knowledge that can be applied in technological frameworks for material modeling, material screening, selection, and recommendation. The report analyzes the feasibility of implementing a VMware virtualized computing model for the facility's existing high-performance computing (HPC) systems. The research is structured into five primary segments: an introduction, contextual information, an evaluation of the facility's on-site HPC environment, functional and non-functional details of the petascale HPC proposal, and a cost analysis that considers multiple variables. The report includes a comprehensive bibliography and independent study to substantiate its recommendations and conclusions. The aim is to provide pragmatic insights that can aid the research institution in its deliberations regarding the potential benefits of adopting a virtualized petascale infrastructure for its high-performance computing (HPC) systems.
Index Terms—New Material Modelling, HPC, Virtualization,
Petascale Computing, Machine Learning, Material Science, Cost
Analysis.
I. INTRODUCTION
HPC has revolutionized material science by allowing researchers to model complex crystal structures and study environmental factors such as temperature and pressure for New Material Modelling and discovery. This technology has helped uncover new materials and their applications. HPC has enabled researchers to model and evaluate complex material architectures, predict material characteristics using machine learning, and manage large data sets efficiently and correctly. This study proposes adopting petascale computing at a research facility to overcome the limitations of its terascale systems. The goal is to give researchers better data management, material modeling, and simulations across scales and physics phenomena, resulting in faster and more accurate simulations and lower costs for creating new materials. New material modeling has many applications, and materials science and engineering scholars can draw on open databases such as the ICSD, quantum materials databases, and SuperCon. Pre-processing and feature engineering prepare raw data for model creation, followed by AI-based learning tools such as artificial neural networks (ANNs) and optimization methods such as genetic algorithms (GAs), particle swarm optimization (PSO), and simulated annealing algorithms (SAAs) to create an artificial intelligence model [3], [4], [5], [6]. However, such computational models overload terascale computing systems and delay result production; petascale computing is needed to handle multi-scale, multi-physics models. This paper proposes petascale computing at the research center using VMware vSphere to run new material modeling AI and machine learning workloads on NVIDIA GPUs. Running these workloads as Kubernetes containers in a Tanzu Kubernetes Cluster (TKC) would improve on-premises resource utilization and scaling, speeding up and improving new material simulations.
II. BACKGROUND
A. Understanding the application workflow for New Material
Modelling
Material discovery, the identification and synthesis of novel materials with unique properties, is crucial to scientific study. Trial-and-error experimentation has traditionally been used to find new materials; this approach is expensive and time-consuming, with limited results. The research facility's material discovery work has benefited from computational and machine learning methods, which allow rapid discovery and modeling of novel materials with unique properties. New material discovery or modeling involves paradigm discovery, synthesis, property prediction, and characterization. This report explains the Research Facility's application in New Material Modelling/Discovery, including the research methodology and the High-Performance Computing (HPC) requirements for running its AI and ML workloads, so that the facility can properly utilize its current HPC environment and scale to petascale and exascale if additional computational tasks are introduced for new research or modern ML workloads.
1) Material Discovery/Modelling Process: Pre-processing, model building and validation, and inverse design comprise material discovery [Fig. 1].
Fig. 1. New Material Modelling application workflow
Infrastructure servers manage compute units, GPUs, storage, and networking in on-premises environments. In this system, the infrastructure server, a Dell EMC PowerEdge R740 with an Intel Xeon Silver 4210 processor, schedules ML jobs for the research facility's material modeling and discovery process. Compute nodes handle the ML workloads; this system's compute unit is the Dell EMC PowerEdge C6525 with AMD EPYC 7742 processors. Finally, NVIDIA A100 Tensor Core GPUs, built for deep learning, accelerate various ML workloads in the HPC system. Material Modelling involves the following steps.
Data gathering: Multiple groups provide standardized datasets with visual representations of material names and material traits; several are listed in the references. The dataset may be an existing database or the product of numerous lab experiments or simulations. Materials are modeled using intrinsic and extrinsic knowledge. Pre-processing begins with data clean-up, and researchers keep their modeling data in the PowerScale storage cluster [3], [4], [5], [6].
Feature reduction: The second step reduces the dataset's input variables to remove less useful features. Descriptors reduce data complexity, and datasets are represented by features or identifiers. This step requires substantial computation.
Model creation and validation: Linear and non-linear regression can be used to map a model to its target properties and descriptors, using the data and reference sources. After using raw parameters/features from antecedent models, fine-tuning incorporates these features into the forecast model. Supervised learning models identify a function that can uncover new material models based on known, existing materials. Regression problems involve continuous properties; ANNs, Kriging, and SVMs are used in such cases. Decision trees and random forests can handle classification outputs. This step requires high computing power, and the effort varies with the model chosen to deliver the most accurate results for researchers to use in new material modeling [1]. (A small illustrative sketch of this step is given at the end of this section.)
Inverse design: In the final step, machine learning models are used to find novel materials with specific qualities. The trained models predict theoretical material properties and identify the best candidates for further study; inverse design uses these models to find and create new materials with the desired properties. Convolutional Neural Network (CNN) models can validate the best candidates, and neural networks reveal hidden trends in datasets: instead of using the network for prediction, this method extracts relevant data insights. Similar to combinatorial material science experiments, Ansys Rocky/Fluent simulation software and CNN-based neural network models mimic complex data in multiple dimensions. The model analyzes the main data relationships in the prediction output, and these relationships are interpreted with human expertise to gain a fundamental comprehension. Training these models requires GPUs for the CNN models and for the Ansys Rocky simulation software [7].
Discovered materials are saved and used to retrain the models. This data improves model accuracy and identifies dataset flaws, and storing it helps researchers collaborate, improving model precision and progress in the field. The steps are iterated, refining the machine learning models and discovering new materials, until the desired qualities are achieved. Researchers will need more advanced storage and high-performance computing tools to manage and analyze data as it grows, enabling them to gain valuable insights and accelerate novel material exploration. The current High-Performance Computing (HPC) system allows flexible material discovery research. The Dell EMC PowerEdge R740 infrastructure server, Dell EMC PowerEdge C6525 compute nodes, NVIDIA A100 Tensor Core GPUs, Dell EMC PowerScale OneFS storage, and Mellanox InfiniBand HDR networking components enable researchers to efficiently preprocess data, build and verify machine learning models, and run ANSYS simulations to discover novel materials with desired properties.
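As a minimal sketch of the model creation and validation step described above, the following Python snippet trains a regression model on a hypothetical descriptor table. The file name ("descriptors.csv"), the target column ("band_gap"), and the choice of a random forest are assumptions for illustration; the facility's actual pipeline, features, and models may differ.
```python
# Minimal sketch of the "model creation and validation" step (illustrative only).
# Assumes a hypothetical CSV of precomputed descriptors with a continuous target.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

data = pd.read_csv("descriptors.csv")      # pre-processed feature table (assumed file)
X = data.drop(columns=["band_gap"])        # descriptors / identifiers
y = data["band_gap"]                       # continuous target property (assumed column)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestRegressor(n_estimators=200, n_jobs=-1, random_state=42)
model.fit(X_train, y_train)

mae = mean_absolute_error(y_test, model.predict(X_test))
print(f"Hold-out MAE: {mae:.3f}")
```
In practice the same hold-out evaluation would be repeated for the other model families mentioned above (ANNs, Kriging, SVMs) to select the most accurate model.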
III. ANALYSIS OF THE RESEARCH FACILITY'S ON-PREMISES HPC
This section examines material modeling and discovery on the facility's HPC systems. The research facility's HPC system meets its current high-performance computing workloads: the infrastructure server, compute nodes, GPUs, storage, and networking components were carefully chosen to provide significant processing power, extensive memory capacity, scalability, and fast connectivity. However, as researchers adopt more advanced methods and workloads, it is becoming harder for the research facility to conduct efficient research. The on-premises HPC equipment, its components, and the facility's issues are discussed here.
Fig. 2. Research Facility Existing On-premise architecture
New material discovery drives technological growth. "Material discovery" is the systematic search for new materials with desirable properties for various uses [8]. Discovering new materials requires data gathering, pre-processing, model development and validation, and inverse design, and these steps involve complex calculations that require HPC systems. The research facility's HPC system includes servers, storage, networking, and GPUs. This study examines the on-site research infrastructure and HPC system design used to find new materials, and makes system, user, and application assertions based on the novel material modeling use cases. The facility's HPC system is designed to handle the high-performance computing workloads needed for material modeling and discovery, and its infrastructure was carefully selected to provide high computing power, large memory capacity, scalability, and fast connectivity [Fig. 2]:
1) Management node: The infrastructure server manages all system resources, including compute units, GPUs, storage, and networking, and schedules jobs. Dell EMC PowerEdge R740 servers with Intel Xeon Silver 4210 processors are this system's backbone. With 10 cores and 20 threads, this processor can handle large-scale machine learning scheduling tasks, and the server's 192 GB of memory allows efficient control of system resources. The Dell EMC PowerEdge R740 server has a 2.2 GHz Intel Xeon Silver 4210 processor that can turbo boost to 3.2 GHz, with 10 cores and 20 threads, and a CPU IPC of 2 (floating-point operations per cycle, as used in the peak-performance calculations below). Each server supports up to two CPU sockets. The 24-DIMM system supports up to 3 terabytes of RAM and can accommodate three double-width or six single-width GPUs. See the benchmark study [9].
2) Compute node: Compute nodes run the machine learning tasks. The AMD EPYC 7742-powered Dell EMC PowerEdge C6525 server is the system's compute core. A Dell EMC PowerEdge C6525 chassis with four AMD EPYC 7742 processors provides a total of 256 cores. The rack-mounted chassis holds four nodes and occupies two vertical rack units, with AMD EPYC 7742 processors powering each node. The AMD EPYC 7742 CPU has 64 cores, 128 threads, a base clock speed of 2.25 GHz, and a peak clock speed of 3.4 GHz. The server supports 4 terabytes of RAM per node. See the CPU benchmark result [10].
3) GPU: The current High-Performance Computing (HPC) system uses deep-learning-optimized NVIDIA A100 Tensor Core [11] Graphics Processing Units to accelerate the research facility's varied Machine Learning (ML) workloads. These GPUs speed up ML workloads and improve system efficiency. The Tensor Cores in the NVIDIA A100 accelerate matrix calculations, making the GPU ideal for ML workloads. It has 6,912 CUDA cores and 40 GB of HBM2 memory with 1.6 TB/s of bandwidth, and operates at clock speeds between 1.41 GHz and 1.54 GHz. The Ampere architecture makes the device well suited to high-performance computing.
4) Storage: Dell EMC PowerScale OneFS provides scalable, high-performance storage for the large machine learning datasets in the HPC system. The PowerScale OneFS storage solution can be expanded to meet workload needs, providing high-capacity, high-performance storage. Dell EMC Isilon scale-out Network Attached Storage [3] is well suited to machine learning (ML) tasks due to its high throughput and IOPS. Dell EMC PowerScale OneFS stores the machine learning data, can handle up to 50 petabytes, and provides a machine-learning-optimized computing platform. OneFS supports NFS, SMB, and HDFS, and offers data compression, deduplication, and protection.
5) Networking: Mellanox InfiniBand HDR networking connects the HPC system's components at high speed. This networking solution optimizes data transfer between the infrastructure server, compute nodes, GPUs, and storage components, improving system efficiency. Mellanox InfiniBand HDR adapters provide 200 Gb/s per port with low latency and high bandwidth, making them ideal for machine learning tasks, and Mellanox offers PCIe Gen4 and Gen5 InfiniBand HDR 200 Gb/s ports designed for HPC and AI/ML applications [12]. The Research Facility uses 4 Mellanox InfiniBand HDR switches with 40 ports each, giving an aggregate bandwidth of number of switches (4) x number of ports (40) x 200 Gb/s = 32,000 Gb/s, i.e. 32 Tb/s (about 4 TB/s).
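The aggregate fabric bandwidth quoted above can be reproduced with the short sketch below. It simply multiplies switch count, port count, and per-port line rate; it does not model topology, oversubscription, or bidirectional traffic, so it is an upper bound only.
```python
# Aggregate InfiniBand HDR fabric bandwidth (simple upper bound, no topology modelled).
switches = 4          # Mellanox InfiniBand HDR switches in the facility
ports_per_switch = 40
port_rate_gbps = 200  # HDR per-port line rate in Gb/s

aggregate_gbps = switches * ports_per_switch * port_rate_gbps
print(f"{aggregate_gbps} Gb/s = {aggregate_gbps / 1000} Tb/s "
      f"= {aggregate_gbps / 8 / 1000} TB/s")
# Prints: 32000 Gb/s = 32.0 Tb/s = 4.0 TB/s
```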
A. Performance Evaluation
Data pre-processing, machine learning model construction and verification, and ANSYS software simulations to find novel materials with desired qualities are the expected workloads. The system comprises 1 infrastructure node, 100 compute nodes, 50 GPU nodes, one storage unit, and 4 Mellanox InfiniBand HDR switches. The on-site high-performance computing (HPC) equipment uses a cluster architecture optimized for parallel processing, making it ideal for large-dataset machine learning tasks; parallel computing clusters distribute workloads across numerous nodes, improving performance and processing speed.
B. TFLOPS Calculation for On Premise HPC Systems
The dual CPU server Dell EMC PowerEdge R740 [20]
has peak node efficiency of CPU Speed (2.2)x No. of CPU
cores(10)x CPU IPC (2) x No. of CPU node (2) = 88 GFLOPS
and PowerEdge C6525 [21] servers exhibit processing capa-
bilities of CPU Speed (2.25)x No. of CPU cores(64)x CPU
IPC (2) x No. of CPU node (2) = 576 GFLOPS X 100
(nodes) = 57600 GFLOPS The processing capability of the
NVIDIA A100 Tensor Core GPU is reported to be up to 19.5
TFLOPS. We can assume that the storage and networking
components do not contribute to the TFLOPS capacity of
the system. The computation of the overall TFLOPS capacity
of the system can be achieved by employing the subsequent
formula: Infra node (88 GFLOPS) +Compute Node( 57600
GFLOPS)= 57688 GFLOPS = 57.688 TFLOPS GPU (19.5
TFLOPS) x 50 (GPU Nodes) = 975 TFLOPS Total Peak
performance of existing HPC = 975 + 57.688 TFLOPS =
1032.688 = 1.032 PFLOPS The High-Performance Computing
(HPC) architecture employed for the purpose of material
discovery involves an infrastructure node, 100 compute nodes,
50 GPU nodes, and a parallel computing cluster. The total and
combined capacity of the system is 1.032 PFLOPS
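The sketch below reproduces this peak-performance arithmetic using the report's per-node formula (clock x cores x IPC x sockets). The numbers are the existing on-premises configuration; the Section IV counts can be swapped in to check the proposed petascale figures. This is a back-of-the-envelope check, not a benchmark.
```python
# Sketch of the theoretical peak-performance arithmetic used in this section.
def node_gflops(clock_ghz: float, cores: int, ipc: int, sockets: int) -> float:
    """Peak GFLOPS of one server node: clock (GHz) x cores x IPC x sockets."""
    return clock_ghz * cores * ipc * sockets

infra = node_gflops(2.2, 10, 2, 2) * 1          # 1 x PowerEdge R740
compute = node_gflops(2.25, 64, 2, 2) * 100     # 100 x PowerEdge C6525
gpu_tflops = 19.5 * 50                          # 50 x NVIDIA A100, per the figure cited above

total_tflops = (infra + compute) / 1000 + gpu_tflops
print(f"CPU: {(infra + compute) / 1000:.3f} TFLOPS, GPU: {gpu_tflops} TFLOPS")
print(f"Total: {total_tflops:.3f} TFLOPS ({total_tflops / 1000:.3f} PFLOPS)")
# Prints: CPU: 57.688 TFLOPS, GPU: 975.0 TFLOPS / Total: 1032.688 TFLOPS (1.033 PFLOPS)
```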
C. Problem:
The research facility is having trouble managing big data: data redundancy, inconsistency, and retrieval can be difficult in a typical HPC environment. As research tasks grow in complexity and size, the HPC infrastructure may not be able to keep up, which can slow studies by affecting user performance and waiting times. The need for many computational nodes complicates infrastructure management and scalability, and power, cooling, and hardware maintenance can be costly for traditional high-performance computing (HPC) infrastructure. The research center may struggle to expand the infrastructure to meet researcher needs. Many conventional HPC setups are also not adaptable to researchers' changing needs, and the inability to handle diverse workloads can slow research progress. System resilience issues may also arise during the petascale shift.
IV. HPC PETASCALE PROPOSAL IDEA
VMware vSphere with Tanzu will help the research center manage its infrastructure, ensuring scalability, resilience, lower maintenance costs, and flexibility. The research facility is proposed to create a private HPC cloud infrastructure using VMware products and technologies on a virtualized HPC cluster. IT administrators can run Material Modelling artificial intelligence workloads such as inference, CNN model training, model development, and Ansys Rocky simulation alongside their data center applications [4]. The Tanzu Kubernetes cluster, which combines GPU resources into worker-node templates, allows researchers to run artificial intelligence workloads on VMware vSphere, and NVIDIA operators quickly optimize worker nodes for workload efficiency, increasing output. Hardware and hypervisor virtualization support in x86 microprocessors has greatly enhanced the performance of computationally demanding workloads. Virtualizing high-performance computing (HPC) environments allows research facilities to use virtualized graphics processing units (GPUs) and run compute-intensive simulation software such as Ansys Rocky and artificial intelligence (AI) workloads alongside data center applications, improving efficiency. Virtualization technology will streamline system management and supervision, leveraging Tanzu's ecosystem of software products, which help handle Kubernetes and high-performance computing workloads on virtualized infrastructure [13].
A. Benefits of Virtualization:
Research institutions benefit from virtualizing HPC environments. First, virtualization allows one host to run multiple virtual machines or container nodes, improving resource allocation and eliminating the need for dedicated hardware for each task, which reduces hardware costs. Virtualization also allows administrators to quickly allocate and deallocate virtual machines, improving the HPC ecosystem's adaptability and responsiveness. Finally, by using containers for each task, virtualization can improve security and isolation across high-performance computing workloads, reducing the risk of security breaches and workload interference.
1) Security and Governance: Implementing regulations and
policies based on workflow, physical server, environment, and
operators can improve HPC infrastructure security. For audit
reporting, user rights can limit access to specific actions,
and all activities can be monitored and logged. Segregated
workflows prevent unauthorized entry or distribution of con-
fidential information to other High-Performance Computing
(HPC) systems, workflows, or users on identical hardware.
HPC managers can protect data from unauthorized access and
breaches by implementing security measures.
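As an illustration of how such per-user, per-workflow access limits could be expressed at the Kubernetes layer of the proposed environment, the sketch below emits a namespaced read-only Role and RoleBinding. The namespace, group, and role names are hypothetical, and in a vSphere with Tanzu deployment these objects would sit alongside whatever vSphere-level permissions the facility already enforces.
```python
# Hypothetical namespaced RBAC policy: a research group gets read-only access
# to its own workloads (useful for audit logging without granting change rights).
import yaml

role = {
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "Role",
    "metadata": {"name": "material-research-viewer", "namespace": "material-modelling"},
    "rules": [{
        "apiGroups": ["", "apps", "batch"],
        "resources": ["pods", "pods/log", "jobs", "deployments"],
        "verbs": ["get", "list", "watch"],   # no create/delete: read-only
    }],
}

binding = {
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "RoleBinding",
    "metadata": {"name": "material-research-viewer-binding", "namespace": "material-modelling"},
    "subjects": [{"kind": "Group", "name": "material-researchers",
                  "apiGroup": "rbac.authorization.k8s.io"}],
    "roleRef": {"kind": "Role", "name": "material-research-viewer",
                "apiGroup": "rbac.authorization.k8s.io"},
}

print(yaml.safe_dump_all([role, binding], sort_keys=False))  # apply with: kubectl apply -f -
```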
2) Resilience and Redundancy: Virtualized high-performance computing (HPC) platforms make scalability possible, allowing fault resilience, dynamic recovery, and other operational continuity features. The proposed virtualized HPC environment facilitates:
- Uninterrupted hardware maintenance procedures while ensuring that ongoing HPC workflows or services remain unaffected.
- An automated process for restarting unsuccessful workflows on different physical servers within the cluster.
- Live migration, the process of transferring workflows to an alternative physical host, typically used when the resources of a particular host have reached their maximum capacity.
High-performance virtualization is a promising HPC technology for a research institution seeking technical computing efficiency; we therefore recommend virtualized HPC. Virtualized High-Performance Computing (HPC) can expand the facility's infrastructure to petascale and beyond, making it easier to handle extremely demanding workloads [19].
B. Software required:
Multiple software components create the high-performance computing environment for virtualized tasks. These include:
- VMware vSphere 7 with Tanzu [14]. The virtualization platform helps create and manage Kubernetes clusters, allowing workload management across VMs and containers. It also supports virtualized NVIDIA graphics processing units for data analysis and AI.
- A software-defined storage solution such as VMware vSAN 7 [15], designed for vSphere environments. It provides a highly available, easy-to-manage hyper-converged storage solution without affecting compute cluster speed.
- NVIDIA AI Enterprise [16], a collection of cloud-native software for artificial intelligence and data analytics by NVIDIA for operation on VMware vSphere systems. For Tanzu it includes the NVIDIA GPU Operator and numerous AI and data science frameworks tailored to its needs. NVIDIA AI Enterprise is sold as perpetual licenses, which allow unlimited use of the software, combined with 1-, 3-, or 5-year support packages [16].
- The NVIDIA GPU Operator and Network Operator, which simplify worker-node provisioning of NVIDIA drivers and the container runtime. These operators set up ConnectX network adapters to accelerate GPU communication.
- The Tanzu Ecosystem, which encompasses a variety of essential components for Tanzu Kubernetes clusters, including Harbor, Prometheus, and Grafana. These components serve the purposes of container image storage, metrics gathering, and metrics visualization, respectively. Container ingress and load balancing are provided by the NSX Advanced Load Balancer (Avi) [14] [17], which is available in Basic, Standard, and Advanced editions.
Fig. 3. Proposed Virtualized Peta Scale Architecture
C. Architecture Components:
The research center will use VMware and NVIDIA technologies. VMware vSphere, Tanzu, and NVIDIA AI Enterprise manage Kubernetes container clusters and conventional virtual machines on the pre-existing servers, with a few new additions, using the architectural design above. Virtualization allows full lifecycle control of computing and storage resources [Fig. 3].
1) Infrastructure/Management Node: The management cluster hosts the HPC management and vSphere components. The main node schedules workloads, and the management cluster assures the availability of essential services. Only administrators can view the management cluster, protecting the infrastructure service management containers and nodes. The management cluster should be at least a five-node vSAN cluster so that it can endure a node failure even while another node is removed for maintenance. The pre-existing management node was capacity-analyzed and the cluster adjusted to support the HPC management components: 1 x Dell EMC PowerEdge R740 server equipped with an Intel Xeon Silver 4210 processor (existing infrastructure) + 4 x Dell EMC PowerEdge R940 servers, each equipped with four Intel Xeon Platinum 8380 processors. Because several workloads represent single points of failure, VMware vSphere Enterprise Plus Edition is suggested to ensure high availability and access to advanced features.
2) Compute Node: The research center will run HPC workloads on compute clusters. VMware's vSphere Scale-Out license supports computational clusters for high-performance computing tasks at a low cost. The researchers can use the following infrastructure: 100 x Dell EMC PowerEdge C6525 servers, each equipped with AMD EPYC 7742 processors (existing infrastructure) + 400 x Dell EMC PowerEdge C6525 [22] servers, each equipped with 4 AMD EPYC 7763 processors.
VMware vSphere with Tanzu will enable the Tanzu Kubernetes clusters to be managed by Tanzu Mission Control, and NVIDIA GPUs can accelerate neural network training and inference on the PowerEdge servers.
3) GPU Nodes as Hardware Accelerators: Accelerated servers can replace many CPU servers by delegating computationally intensive jobs to them. VMware vSphere configures most compute accelerators using DirectPath I/O (passthrough) mode, which yields performance similar to non-virtualized systems. DirectPath I/O technology can configure a node with one or more accelerators that are dedicated to the node and not shared among virtual machines. The plan calls for 50 x NVIDIA A100 Tensor Core GPUs and 100 x DGX A100 systems, each GPU providing 6,912 CUDA cores, to optimize hardware acceleration. System administrators can create vGPU profiles and assign them to Kubernetes worker nodes using NVIDIA Multi-Instance GPU (MIG) functionality.
4) Storage: For data-lake storage of unstructured neural network training data, Dell EMC PowerScale OneFS is suggested. For Tanzu Kubernetes Cluster virtual machine storage, vSAN is recommended; vSAN's CSI driver automatically creates a preset StorageClass, and vSAN persistently stores pod and container data. PowerScale can act as an NFS cache for large datasets from external object storage, saving time, and Kubernetes can access the data over NFS.
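To illustrate how a workload could request vSAN-backed persistent storage in this design, the sketch below emits a minimal PersistentVolumeClaim. The StorageClass name and the requested size are assumptions; in practice the claim would reference whichever class the vSAN CSI driver publishes in the target cluster.
```python
# Hypothetical PersistentVolumeClaim for vSAN-backed storage in a Tanzu
# Kubernetes cluster. The StorageClass name below is an assumption.
import yaml

pvc = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "material-dataset-cache"},
    "spec": {
        "accessModes": ["ReadWriteOnce"],
        "storageClassName": "vsan-default-storage-policy",  # assumed class name
        "resources": {"requests": {"storage": "500Gi"}},
    },
}

print(yaml.safe_dump(pvc, sort_keys=False))  # apply with: kubectl apply -f -
```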
5) Networking: Mellanox InfiniBand HDR networking technology interconnects the cluster. In addition, 8 x 25 GbE PowerSwitch network switches are suggested. This setup is suitable for neural network training on a single node with two GPUs, and for GPU-partitioning-based model development and inference. For the petascale environment, the VMware NSX Advanced Load Balancer is the recommended ingress controller and load balancer.
This plan organizes Material Modelling workloads into worker pods on Kubernetes worker nodes with CPU and GPU resources. Node labels and nodeSelector fields in the pod specification limit pods to specific nodes (a sketch follows this subsection). There are CPU, GPU model-creation, GPU training, and GPU inference worker pods, and each class has its own node pool and worker nodes. To maximize resources, several node pools with different CPU, memory, and GPU allocations are created, improving resource management and pod assignment. We propose a Tanzu Kubernetes cluster with five control plane nodes. MIG profiles allocate GPU resources to Virtual Machine (VM) classes by specifying their memory sizes.
Statistical models, linear regression, and data analysis can run on CPU worker pods without GPU acceleration. GPU model-creation pods are attached to smaller worker groups with moderate GPU MIG partitions; these are suitable for building machine learning models, fast proofs of concept, and pipelines. Full GPUs are allocated to pods tied to high-performance worker nodes to train complex CNN models that require accelerated hardware and scalable performance, such as the Ansys Rocky simulations. GPU inference pods tied to small GPU MIG partitions simplify model deployment and prediction API endpoints [4].
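The sketch below shows one way a training pod could be pinned to a GPU node pool with a nodeSelector while requesting a single GPU. The label key/value, container image, and pod name are illustrative assumptions, not the facility's actual names.
```python
# Hypothetical pod specification: pin a training workload to a GPU node pool
# via a nodeSelector label and request one GPU (full A100 or a MIG slice).
import yaml

training_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "cnn-training", "labels": {"app": "material-modelling"}},
    "spec": {
        "nodeSelector": {"nodepool": "gpu-training"},            # assumed node-pool label
        "containers": [{
            "name": "trainer",
            "image": "registry.example.com/material-cnn:latest",  # assumed image
            "resources": {"limits": {"nvidia.com/gpu": 1}},
        }],
        "restartPolicy": "Never",
    },
}

print(yaml.safe_dump(training_pod, sort_keys=False))  # apply with: kubectl apply -f -
```
A corresponding node pool would carry the matching label (here "nodepool: gpu-training"), so inference or model-creation pods with other labels land on differently sized MIG partitions.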
D. Theoretical Peak Performance Calculation
The proposed system comprises 5 infrastructure nodes, 500 compute nodes, 150 GPU nodes, one storage unit, and 8 Mellanox InfiniBand HDR switches.
1) Theoretical peak performance of the CPU-only control plane nodes: 1 x dual-CPU Dell EMC PowerEdge R740 server [20] with a peak of CPU speed (2.2 GHz) x CPU cores (10) x CPU IPC (2) x CPUs per node (2) = 88 GFLOPS, plus 4 x Dell EMC PowerEdge R940 servers, each equipped with four Intel Xeon Platinum 8380 processors: CPU speed (2.3 GHz) x CPU cores (40) x CPU IPC (2) x CPUs per node (4) = 736 GFLOPS per server x 4 servers = 2944 GFLOPS. Total = 88 GFLOPS + 2944 GFLOPS = 3032 GFLOPS = 3.032 TFLOPS.
2) Theoretical peak performance of the compute and GPU worker nodes: 100 x Dell EMC PowerEdge C6525 servers, each equipped with dual AMD EPYC 7742 processors, deliver CPU speed (2.25 GHz) x CPU cores (64) x CPU IPC (2) x CPUs per node (2) = 576 GFLOPS per node x 100 nodes = 57,600 GFLOPS. The additional 400 x Dell EMC PowerEdge C6525 [22] servers, each equipped with 4 AMD EPYC 7763 processors, deliver CPU speed (2.4 GHz) x CPU cores (64) x CPU IPC (2) x CPUs per node (4) = 1228.8 GFLOPS per node x 400 nodes = 491,520 GFLOPS. Compute total = 57,600 + 491,520 = 549,120 GFLOPS = 549.12 TFLOPS.
The proposed system also includes 50 x NVIDIA A100 Tensor Core GPUs and 100 x DGX A100 systems; the processing capability of an NVIDIA DGX A100 is reported to be up to 39 TFLOPS [23]. The storage and networking components are assumed not to contribute to the TFLOPS capacity. GPU total = 50 x 19.5 TFLOPS + 100 x 39 TFLOPS = 975 + 3900 = 4875 TFLOPS.
The proposed petascale High-Performance Computing (HPC) system therefore involves 5 infrastructure nodes, 500 compute nodes, and 150 GPU nodes, with a combined capacity of 4875 TFLOPS (GPU) + 549.12 TFLOPS (compute) + 3.032 TFLOPS (infrastructure) = 5427.152 TFLOPS, which is approximately 5.43 PFLOPS.
3) Comparison by Calculation: Amdahl's law states that the fraction of a program that cannot be executed in parallel limits its possible acceleration. Amdahl's law is:
Speedup = 1 / [(1 - p) + (p / N)]
where p is the proportion of the program that can be parallelized and N is the number of processors. In the old architecture, the compute tier had 100 Dell EMC PowerEdge C6525 servers with AMD EPYC 7742 processors, while in the new architecture it has those 100 servers plus 400 Dell EMC PowerEdge C6525 servers with 4 AMD EPYC 7763 processors each, a fivefold increase in the number of compute servers (from 100 to 500).
Given a parallelizable proportion of 0.9, the speedup can be determined through Amdahl's law:
Old architecture: Speedup = 1 / [(1 - 0.9) + (0.9 / 100)] = 9.17
New architecture: Speedup = 1 / [(1 - 0.9) + (0.9 / 500)] = 9.82
This shows that the new architecture has a higher potential speedup than the old architecture.
In contrast, Gustafson's law considers the scalability of the problem size with the addition of processors, resulting in a higher proportion of the program that can be parallelized. Gustafson's law is:
Speedup = S + (N - S) * p
where S is the serial proportion of the program, N is the number of processors, and p is the proportion of the program that can be parallelized.
Given a parallelizable proportion of 0.9 and a serial proportion of 0.1, the speedup under Gustafson's law is:
Old architecture: Speedup = 0.1 + (100 - 0.1) * 0.9 = 90.01
New architecture: Speedup = 0.1 + (500 - 0.1) * 0.9 = 450.01
This shows that the new architecture has a much higher potential speedup than the old architecture when the problem size is scaled up.
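The figures above can be reproduced with the short sketch below, which simply evaluates both formulas for the two node counts. The 0.9/0.1 split is the assumed workload profile from the text, not a measured value.
```python
# Reproduces the Amdahl and Gustafson speedup figures quoted above.
# p = parallelizable fraction (assumed 0.9), N = processor/node count.

def amdahl(p: float, n: int) -> float:
    return 1.0 / ((1.0 - p) + p / n)

def gustafson(s: float, n: int, p: float) -> float:
    return s + (n - s) * p

for n in (100, 500):
    print(f"N={n}: Amdahl={amdahl(0.9, n):.2f}, Gustafson={gustafson(0.1, n, 0.9):.2f}")
# N=100: Amdahl=9.17, Gustafson=90.01
# N=500: Amdahl=9.82, Gustafson=450.01
```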
Overall, Amdahl's law and Gustafson's law indicate that the new architecture can achieve substantial acceleration compared with the previous one, particularly as the problem size grows. The new architecture has been specifically designed to process Material Modelling AI workloads while incorporating ethical and data governance factors. The use of virtualization and Tanzu Kubernetes clusters can offer enhanced scalability and redundancy, and the additional compute nodes and GPUs enable greater parallelism, supporting faster processing of AI workloads.
E. Problems Solved:
This section summarizes the issues that the proposed petascale architecture resolves:
- The use of virtualization for artificial intelligence workloads has the potential to improve security and isolation.
- Virtualization technology enables efficient workload management through the dynamic allocation of resources.
- The new architecture enhances scalability and redundancy by integrating additional compute nodes and GPUs, enabling the system to manage larger workloads. Tanzu Kubernetes clusters with worker nodes furnished with GPU resources allow prompt allocation of resources to sustain such workloads.
- NVIDIA operators automate the configuration of worker nodes for accelerated hardware, improving the deployment and administration of the underlying infrastructure.
- The proposed architecture improves the processing of AI workloads by adding processors, cores, and nodes to enable increased parallelism, which is expected to result in faster processing with improved speedup.
- The added redundancy of the new architecture can significantly reduce downtime and ensure the availability of the infrastructure.
- The architecture is designed to enable efficient processing of Material Modelling AI workloads while taking ethical and data governance considerations into account.
V. MULTI-VARIABLE COST
I have included common cost components such as hardware, maintenance, electricity, cooling, network, software licenses, and personnel. The total cost is calculated over 3-year, 5-year, and 7-year periods, and the cost per core per year is also calculated for comparison purposes.
For the old infrastructure, I assumed a hardware cost of 1.2 million euros, which includes the Dell EMC PowerEdge R740 infrastructure server, 100 Dell EMC PowerEdge C6525 compute nodes, 50 NVIDIA A100 Tensor Core GPUs, and Dell EMC PowerScale OneFS storage. The maintenance cost is assumed to be 1 percent of the hardware cost per year, the electricity cost 200,000 euros per year, the cooling cost 50,000 euros per year, the network cost 150,000 euros per year, the software licenses cost 300,000 euros per year, and the personnel cost 500,000 euros per year.
Fig. 4. TCO Analysis
For the new infrastructure, I assumed a hardware cost of 4.5 million euros, which includes the Dell EMC PowerEdge R740 infrastructure server, 4 Dell EMC PowerEdge R940 servers, 100 Dell EMC PowerEdge C6525 compute nodes, 400 Dell EMC PowerEdge C6525 compute nodes, 50 NVIDIA A100 Tensor Core GPUs, and 350 NVIDIA DGX A100 GPUs with the NVIDIA AI Enterprise license. The maintenance cost is assumed to be 2 percent of the hardware cost per year, the electricity cost 150,000 euros per year, the cooling cost 40,000 euros per year, the network cost 250,000 euros per year, the software licenses cost 400,000 euros per year, and the personnel cost 600,000 euros per year.
The cost per core per year is calculated by dividing the total cost by the total number of cores and by the number of years, which provides a way to compare the cost efficiency of the two infrastructures. As shown in the analysis [Fig. 4], the cost per core per year for the new infrastructure is lower than that of the old infrastructure, indicating that the new infrastructure is more cost-efficient in terms of processing power. One interesting aspect of the TCO analysis is that, despite the significant increase in the number of compute nodes and GPUs in the new infrastructure, the cost of ownership per core is lower than for the old infrastructure. This is due to several factors, such as more efficient hardware and the use of open-source software components, which reduce licensing costs. Another interesting aspect is the impact of workload on TCO. In the example provided, the workload was assumed to be constant for both infrastructures; in real-world scenarios the workload can vary significantly, and this affects the TCO. For instance, if the workload increases significantly, the cost of the new infrastructure may be higher due to the need for additional compute nodes or GPUs; if the workload decreases, the old infrastructure may become more cost-effective. It is therefore essential to consider workload variability when evaluating the TCO of different infrastructure options.
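The sketch below reproduces the cost-per-core-per-year comparison under the assumptions stated above. The core counts are derived from the compute configurations in the text (dual EPYC 7742 per existing C6525 node, quad EPYC 7763 per new C6525 node); GPU and infrastructure-node cores are ignored for simplicity, so the absolute figures are illustrative rather than a full TCO model.
```python
# Cost-per-core-per-year comparison under the assumptions stated in this section.
def total_cost(hardware, maint_rate, yearly_opex, years):
    """Hardware outlay + (maintenance as a fraction of hardware + other opex) per year."""
    return hardware + years * (hardware * maint_rate + yearly_opex)

# Yearly opex: electricity, cooling, network, software licenses, personnel (euros)
old_opex = 200_000 + 50_000 + 150_000 + 300_000 + 500_000
new_opex = 150_000 + 40_000 + 250_000 + 400_000 + 600_000

old_cores = 100 * 2 * 64                 # 12,800 compute cores (existing nodes)
new_cores = old_cores + 400 * 4 * 64     # 115,200 compute cores (existing + new nodes)

for years in (3, 5, 7):
    old = total_cost(1_200_000, 0.01, old_opex, years)
    new = total_cost(4_500_000, 0.02, new_opex, years)
    print(f"{years}y: old {old / old_cores / years:,.1f} EUR/core/yr, "
          f"new {new / new_cores / years:,.1f} EUR/core/yr")
```
Under these assumptions the per-core figure for the proposed infrastructure comes out well below the existing one for every period, consistent with the comparison above, even though the absolute spend is higher.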
VI. CONCLUSIONS AND FUTURE WORK
Reflecting on this undertaking, I have learned and come to appreciate the advantages of employing virtualization technology within High-Performance Computing (HPC) settings. Research institutions can enhance the efficiency and responsiveness of their ecosystem by establishing a secure private cloud infrastructure for HPC through the utilization of VMware products and technologies.
In the event of undertaking this project anew, I would
allocate additional time towards an in-depth exploration of
the potential of virtualization in high-performance computing
(HPC) environments, encompassing the diverse software of-
ferings within the Tanzu ecosystem. Furthermore, an inquiry
into the potential utilization of virtualization to augment the
security and governance of the High-Performance Computing
(HPC) infrastructure is warranted. Additionally, alternative
methods to amplify resilience and redundancy should be
explored.
With additional time, I would expand upon this project
by incorporating heightened security measures and isolation
protocols, including the utilization of distinct containers for
individual workloads. The investigation of virtualization’s po-
tential to reduce expenses and enhance resource allocation
in high-performance computing (HPC) settings, including the
substitution of conventional hardware for specialized HPC
hardware, would be a worthwhile pursuit. In conclusion, it
would be beneficial to investigate supplementary methods for
enhancing the scalability and fault tolerance of the virtualized
High-Performance Computing (HPC) infrastructure. One such
approach could involve the automation of workflow restarts
on alternative physical servers situated within the cluster.
In summary, the advantages of virtualization technology
in enhancing productivity, efficacy, and security within HPC
settings are acknowledged by myself as an HPC administrator.
The implementation of virtualization technology within High
Performance Computing (HPC) settings has enabled a level of
scalability and fault tolerance that was previously unachievable
through traditional HPC setups. Through the utilization of
virtualization technology, an optimized method for the ad-
ministration of high-performance computing workloads on a
virtualized infrastructure can be achieved, leading to improved
efficacy and economic benefits for research institutions.
REFERENCES
[1] Imran, Qayyum F, Kim DH, Bong SJ, Chi SY, Choi YH. A Survey of Datasets, Preprocessing, Modeling Mechanisms, and Simulation Tools Based on AI for Material Analysis and Discovery. Materials (Basel). 2022 Feb 15;15(4):1428. doi: 10.3390/ma15041428. PMID: 35207968; PMCID: PMC8875409.
[2] Design guide: virtualizing GPUs for AI with VMware and NVIDIA based on Dell infrastructure (no date). Dell Technologies Info Hub. Available at: https://infohub.delltechnologies.com/t/design-guide-virtualizing-gpus-for-ai-with-vmware-and-nvidia-based-on-dell-infrastructure-1
[3] Unstructured data storage (no date). Dell Canada. Available at: https://www.dell.com/en-ca/dt/learn/data-storage/file-storage.htm
[4] Dunn, A. et al. (2020) Benchmarking materials property prediction methods: The MATBENCH test set and Automatminer reference algorithm. Nature Publishing Group. Available at: https://www.nature.com/articles/s41524-020-00406-3
[5] Henderson, Ashley N. Materials discovery. ScienceDirect. Available at: https://www.sciencedirect.com/science/article/pii/S235234092100546
[6] Including crystal structure attributes in machine learning models of ... Available at: http://cucis.ece.northwestern.edu/publications/pdf/WLK17.pdf
[7] Petrone, G. (no date) Unleashing the full power of GPUs for Ansys Fluent. Available at: https://www.ansys.com/blog/unleashing-the-full-power-of-gpus-for-ansys-fluent
[8] Materials Discovery journal (no date). ScienceDirect.com by Elsevier. Available at: https://www.sciencedirect.com/journal/materials-discovery
[9] CPU Benchmark (2023) Intel Xeon Silver 4210: benchmark, test and specs. Available at: https://cpu-benchmark.org/cpu/intel-xeon-silver-4210/
[10] AMD EPYC 7742 (no date) iconcharts. Available at:
[11] NVIDIA A100 GPUs power the modern data center (no date). NVIDIA. Available at: https://www.nvidia.com/en-us/data-center/a100/
[12] Introducing 200G HDR InfiniBand solutions (no date). NVIDIA. Available at: https://network.nvidia.com/files/doc-2020/wp-introducing-200g-hdr-infiniband-solutions.pdf, https://network.nvidia.com/files/doc-2020/ocp-vpi-adapter-cards-brochure.pdf
[13] Data Center Virtualization 2023 (VCP-DCV) certification preparation guide (no date). VMware. Available at: https://www.vmware.com/content/dam/digitalmarketing/vmware/en/pdf/certification/vmw-VCP-DCV-certification-preparation-guide.pdf
[14] What is vSphere with Tanzu? (no date). VMware Docs. Available at: https://docs.vmware.com/en/VMware-vSphere/7.0/vmware-vsphere-with-tanzu/GUID-70CAF0BB-1722-4526-9CE7-D5C92C15D7D0.html
[15] VMware vSAN datasheet (no date). VMware. Available at: https://www.vmware.com/content/dam/digitalmarketing/vmware/en/pdf/products/vsan/vmware-vsan-datasheet.pdf
[16] NVIDIA AI Enterprise licensing and packaging guide (no date). NVIDIA. Available at: https://resources.nvidia.com/en-us-nvaie/nvidia-ai-enterprise-licensing-pg
[17] VMware vSphere, VMware vSphere+ compute virtualization pricing whitepaper (no date). VMware. Available at: https://www.vmware.com/content/dam/digitalmarketing/vmware/en/pdf/products/vsphere/vmware-vsphere-pricing-whitepaper.pdf
[18] Software components: implementation guide, virtualizing GPUs for AI with VMware and NVIDIA based on Dell infrastructure (no date). Dell Technologies Info Hub. Available at: https://infohub.delltechnologies.com/l/implementation-guide-virtualizing-gpus-for-ai-with-vmware-and-nvidia-based-on-dell-infrastrucutre/software-components-169
[19] Virtualizing HPC throughput computing environments (no date). VMware. Available at: https://www.vmware.com/content/dam/digitalmarketing/vmware/en/pdf/solutions/vmware-virtualizing-hpc-throughput-computing-environments.pdf
[20] PowerEdge R740 spec sheet (no date). Dell. Available at: https://i.dell.com/sites/doccontent/shared-content/data-sheets/en/Documents/poweredge-r740-spec-sheet.pdf
[21] PowerEdge C6525 spec sheet (no date). Dell. Available at: https://i.dell.com/sites/csdocuments/ProductDocs/en/poweredge-c6525-spec-sheet.pdf
[22] AMD EPYC 7763 specs (2023). TechPowerUp. Available at: https://www.techpowerup.com/cpu-specs/epyc-7763.c2373
[23] NVIDIA Ampere architecture in depth. NVIDIA Developer Blog. Available at: https://developer.nvidia.com/blog/nvidia-ampere-architecture-in-depth