data gathering, pre-processing, model development and validation, and inverse design. Discovering new materials involves complex calculations that demand HPC systems. The research facility's HPC system comprises servers, storage, networking, and GPUs. This study examines the on-site research infrastructure and the HPC system design used to discover new materials, and it states assertions about the system, its users, and its applications based on novel-material modeling use cases. The facility's HPC system is designed to handle the high-performance computing workloads required for material modeling and discovery, and its infrastructure was carefully selected to provide high computing power, large memory capacity, scalability, and fast connectivity (Fig. 2):
1) Management node: The infrastructure server manages all system resources, including compute units, GPUs, storage, and networking, and schedules jobs. Dell EMC PowerEdge R740 servers with Intel Xeon Silver 4210 processors form the backbone of this system. The Intel Xeon Silver 4210 runs at a base clock of 2.2 GHz, turbo boosts to 3.2 GHz, and provides 10 cores and 20 threads with an IPC of 2, which is sufficient to coordinate large-scale machine learning jobs. The server is configured with 192 GB of memory, allowing efficient control of system resources, and its 24 DIMM slots support up to 3 terabytes of RAM. Each unit is fitted with one CPU and can accommodate up to three double-width or six single-width GPUs (see the benchmark study [9]).
2) Compute node: Compute nodes run the machine learning workloads. Dell EMC PowerEdge C6525 servers powered by AMD EPYC 7742 processors form the system's compute core. A Dell EMC PowerEdge C6525 server populated with four AMD EPYC 7742 processors provides 256 cores in total. The rack-mounted C6525 chassis occupies two vertical rack units and holds four nodes, each powered by AMD EPYC 7742 processors. The AMD EPYC 7742 CPU has 64 cores and 128 threads, a base clock of 2.25 GHz, and a boost clock of 3.4 GHz. Each node supports up to 4 terabytes of RAM (see the CPU benchmark results [10]).
3) GPU: The current High-Performance Computing (HPC) system uses deep-learning-optimized NVIDIA A100 Tensor Core [11] Graphics Processing Units to accelerate the research facility's varied Machine Learning (ML) workloads, improving overall system efficiency. The Tensor Cores in the NVIDIA A100 accelerate matrix calculations, making the GPU well suited to ML workloads. Each A100 provides 6,912 CUDA cores and 40 GB of HBM2 memory with about 1.6 TB/s of bandwidth, and operates at clock speeds of 1.41 GHz and 1.54 GHz. Its Ampere architecture makes it well suited to high-performance computing.
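To give a concrete sense of how these Tensor Cores are exercised in practice, the short sketch below assumes a PyTorch-based ML stack (the text does not name the frameworks in use) and enables TF32 and FP16 mixed precision for a matrix multiplication on the A100; the matrix sizes are arbitrary placeholders.

import torch

# Hypothetical check that a CUDA device (ideally an A100) is present.
assert torch.cuda.is_available(), "No CUDA device found"
print(torch.cuda.get_device_name(0))

# Allow TF32 so ordinary float32 matmuls are routed through Tensor Cores.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")

# Mixed-precision (FP16) matmul, also executed on Tensor Cores.
with torch.autocast(device_type="cuda", dtype=torch.float16):
    c = a @ b
print(c.dtype, c.shape)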
4) Storage: Dell EMC PowerScale OneFS provides scalable, high-performance storage for the large machine learning datasets in the HPC system. The OneFS storage solution can be expanded to meet workload needs, providing high-capacity, high-performance storage. Dell EMC Isilon scale-out Network Attached Storage [3] is well suited to machine learning (ML) tasks because of its high throughput and IOPS. A PowerScale OneFS cluster can hold up to 50 petabytes of machine learning data and provides a computing platform optimized for ML. OneFS supports the NFS, SMB, and HDFS protocols and offers data compression, deduplication, and protection.
5) Networking: Mellanox InfiniBand HDR networking connects the HPC system's components at high speed. It optimizes data transfer between the infrastructure server, compute nodes, GPUs, and storage, improving overall system efficiency. Mellanox InfiniBand HDR adapters deliver 200 Gb/s per port with low latency and high bandwidth, which makes them well suited to HPC and AI/ML workloads, and are available with PCIe Gen4 and Gen5 host interfaces [12]. The research facility uses four Mellanox InfiniBand HDR switches with 40 ports each, giving an aggregate fabric bandwidth of 4 switches x 40 ports x 200 Gb/s = 32,000 Gb/s, i.e., 32 Tb/s (4 TB/s) of unidirectional bandwidth.
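A minimal sketch of this aggregate-bandwidth arithmetic, using only the switch count, port count, and port speed quoted above:

# Aggregate InfiniBand fabric bandwidth from the figures given in the text.
NUM_SWITCHES = 4        # Mellanox InfiniBand HDR switches
PORTS_PER_SWITCH = 40   # HDR ports per switch
PORT_SPEED_GBPS = 200   # Gb/s per HDR port (unidirectional)

aggregate_gbps = NUM_SWITCHES * PORTS_PER_SWITCH * PORT_SPEED_GBPS
print(f"Aggregate bandwidth: {aggregate_gbps} Gb/s "
      f"= {aggregate_gbps / 1000} Tb/s "
      f"= {aggregate_gbps / 8000} TB/s")
# -> Aggregate bandwidth: 32000 Gb/s = 32.0 Tb/s = 4.0 TB/s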
A. Performance Evaluation
The expected workloads are data pre-processing, machine learning model construction and validation, and ANSYS software simulations to find novel materials with the desired properties. The system comprises one infrastructure node, 100 compute nodes, 50 GPU nodes, one storage unit, and a four-switch Mellanox InfiniBand HDR network. The on-site high-performance computing (HPC) equipment is organized as a cluster for parallel computing. This cluster architecture is optimized for parallel processing, making it well suited to machine learning tasks on large datasets: workloads are distributed across numerous nodes, improving performance and processing speed.
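As one illustration of such distribution, the sketch below assumes an MPI-based Python workflow (mpi4py; the text does not specify the middleware) with a hypothetical preprocess() function and file list, scattering chunks of a dataset across ranks and gathering the results on the root rank:

# Run with, e.g.: mpirun -n <ranks> python preprocess_mpi.py  (script name is hypothetical)
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

def preprocess(path):
    # Hypothetical per-file pre-processing step (cleaning, feature extraction, ...).
    return f"features({path})"

if rank == 0:
    # Hypothetical list of raw material-data files, split into one chunk per rank.
    files = [f"sample_{i}.dat" for i in range(1000)]
    chunks = [files[i::size] for i in range(size)]
else:
    chunks = None

# Each rank receives its own chunk and processes it in parallel.
my_chunk = comm.scatter(chunks, root=0)
my_results = [preprocess(f) for f in my_chunk]

# The root rank collects the per-rank results.
all_results = comm.gather(my_results, root=0)
if rank == 0:
    total = sum(len(r) for r in all_results)
    print(f"Processed {total} files across {size} ranks")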
B. TFLOPS Calculation for the On-Premises HPC System
The dual-CPU Dell EMC PowerEdge R740 server [20] has a peak node performance of CPU speed (2.2 GHz) x number of CPU cores (10) x CPU IPC (2) x number of CPUs per node (2) = 88 GFLOPS. The PowerEdge C6525 servers [21] deliver CPU speed (2.25 GHz) x number of CPU cores (64) x CPU IPC (2) x number of CPUs per node (2) = 576 GFLOPS per node, or 576 GFLOPS x 100 nodes = 57,600 GFLOPS. The processing capability of the NVIDIA A100 Tensor Core GPU is reported to be up to 19.5 TFLOPS. We assume that the storage and networking components do not contribute to the TFLOPS capacity of the system. The overall peak capacity is then obtained as follows: infrastructure node (88 GFLOPS) + compute nodes (57,600 GFLOPS) = 57,688 GFLOPS = 57.688 TFLOPS; GPUs: 19.5 TFLOPS x 50 GPU nodes = 975 TFLOPS; total peak performance of the existing HPC system = 975 TFLOPS + 57.688 TFLOPS = 1,032.688 TFLOPS ≈ 1.03 PFLOPS.
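The following minimal Python sketch reproduces this back-of-the-envelope estimate from the clock speeds, core counts, and node counts quoted above:

# Peak-performance estimate for the on-premises HPC system (figures from the text).
def node_gflops(clock_ghz, cores, ipc, cpus_per_node):
    # Peak GFLOPS of one server node: clock x cores x IPC x CPUs.
    return clock_ghz * cores * ipc * cpus_per_node

infra_gflops = node_gflops(2.2, 10, 2, 2)            # Dell EMC PowerEdge R740
compute_gflops = node_gflops(2.25, 64, 2, 2) * 100   # 100 x PowerEdge C6525
cpu_tflops = (infra_gflops + compute_gflops) / 1000

gpu_tflops = 19.5 * 50                               # 50 x NVIDIA A100

total_tflops = cpu_tflops + gpu_tflops
print(f"CPU: {cpu_tflops:.3f} TFLOPS, GPU: {gpu_tflops} TFLOPS, "
      f"total: {total_tflops:.3f} TFLOPS ({total_tflops / 1000:.2f} PFLOPS)")
# -> CPU: 57.688 TFLOPS, GPU: 975.0 TFLOPS, total: 1032.688 TFLOPS (1.03 PFLOPS)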
The High-Performance Computing (HPC) architecture employed for material discovery comprises an infrastructure node, 100 compute nodes,