4.6 Working with Data & Software

Once connected to your head node, you’ll want to install software and get your data in place. ParallelCluster provides shared filesystems that make this straightforward.

Shared Filesystems

All nodes in your cluster share several filesystems:

Mount Point   Type             Description
/shared       FSx for Lustre   High-performance shared storage for data and software
/home         NFS (EBS)        Home directories, shared across all nodes
/opt/slurm    NFS              Slurm installation, shared across all nodes

You can verify the shared mounts with:

df -h -t nfs4 -t lustre    # list the NFS and Lustre mounts
showmount -e localhost     # list the NFS exports served by the head node

Note that showmount only reports NFS exports; the FSx for Lustre mount at /shared appears in the df output.
Tip

Install software and store data in /shared so it’s accessible from both the head node and all compute nodes.

Installing Software with Spack

Spack is a package manager for supercomputers that makes installing scientific software easy. It supports Python, R, C, C++, and Fortran packages, and can target specific compilers and architectures.

Setting Up Spack

# Install Spack to shared storage
export SPACK_ROOT=/shared/spack
export SPACK_VERSION=v0.22.2
git clone -c feature.manyFiles=true https://github.com/spack/spack -b $SPACK_VERSION $SPACK_ROOT

# Add to your shell profile
echo "export SPACK_ROOT=$SPACK_ROOT" >> $HOME/.bashrc
echo "source \$SPACK_ROOT/share/spack/setup-env.sh" >> $HOME/.bashrc
source $HOME/.bashrc
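After sourcing your profile, a quick sanity check confirms the install (commands assume the clone and profile edits above):

```shell
# Sanity-check the Spack install
source $SPACK_ROOT/share/spack/setup-env.sh
spack --version   # should report the release cloned above (0.22.2)
spack arch        # prints the detected platform-os-target triple
```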

Enable the Binary Cache

The Spack Binary Cache provides pre-built packages, reducing install times dramatically:

# The public build cache is published per minor release (e.g. releases/v0.22), not per patch tag
spack mirror add spack-binary https://binaries.spack.io/releases/v0.22
spack buildcache keys --install --trust
spack compiler find

Installing Packages

# Search for a package
spack list openfoam

# Install a package
spack install openfoam

# Load a package into your environment
spack load openfoam
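To confirm that a spack load took effect in your current shell, you can list what is active; as one check, simpleFoam is one of the OpenFOAM solver binaries that should land on your PATH:

```shell
# Show packages loaded into this shell session
spack find --loaded

# The package's binaries should now be on PATH
which simpleFoam
```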

Moving Data In and Out

SCP / SFTP

Transfer files to the head node using standard tools:

# Upload from your local machine
scp -i ~/.ssh/your-key.pem localfile.txt ec2-user@<head-node-ip>:/shared/data/

# Download results
scp -i ~/.ssh/your-key.pem -r ec2-user@<head-node-ip>:/shared/results/ ./
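For large or interruptible transfers, rsync over SSH is a common alternative to scp: it resumes partial copies and skips files that are already up to date. The key path and host placeholder below follow the scp examples above:

```shell
# Resumable, incremental upload of a local directory to shared storage
rsync -avz --partial -e "ssh -i ~/.ssh/your-key.pem" \
    ./data/ ec2-user@<head-node-ip>:/shared/data/
```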

AWS CLI (S3)

Transfer data to/from S3 buckets:

# Download from S3
aws s3 cp s3://bucket-name/dataset.tar.gz /shared/data/

# Upload results to S3
aws s3 cp /shared/results/ s3://bucket-name/results/ --recursive
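When a transfer may be re-run, aws s3 sync copies only new or changed files, which is faster and cheaper than repeating a full cp --recursive. The bucket name is a placeholder, as above:

```shell
# Incrementally mirror results to S3; unchanged files are skipped
aws s3 sync /shared/results/ s3://bucket-name/results/

# Preview what would be transferred without copying anything
aws s3 sync /shared/results/ s3://bucket-name/results/ --dryrun
```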

Organizing Your Data

A recommended directory structure on /shared:

/shared/
├── spack/          # Spack installation and packages
├── data/           # Input datasets
├── scripts/        # Job scripts
├── results/        # Output files
└── software/       # Manually installed software
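The layout above can be created in one command with brace expansion (the spack/ directory is created by the git clone earlier, so it is omitted here):

```shell
# Create the recommended directory tree under /shared
mkdir -p /shared/{data,scripts,results,software}
```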

Storage Considerations

Storage           Performance               Persistence                           Cost
FSx for Lustre    Very high throughput      Deleted with cluster (SCRATCH)        Based on provisioned capacity
EBS (head node)   Standard SSD              Deleted with cluster                  Based on volume size
S3                High throughput for bulk  Persistent (independent of cluster)   Per GB/month
Warning

FSx for Lustre SCRATCH storage is deleted when the cluster is deleted. Always back up important results to S3 before deleting your cluster.

Now that your software and data are ready, let’s submit some jobs: Submitting Jobs with Slurm