Cluster Overview
The Haas OpenLava cluster provides high‑performance batch computing for research workloads.
- Master / login node: hpc.haastech.org (haas-hpc00) — submit jobs here; do not run heavy workloads directly on this node
- Compute nodes: haas-hpc01 through haas-hpc10
- GPU node: haas-gpu01 — Nvidia A100 (2× 40GB partitions)
- Special server: haas-hpc11 (SQL / JupyterHub support)
- Four queues: normal for standard jobs | hi-mem for large-memory workloads | gpu40a & gpu40b for GPU computing
Node Specifications
Use this table when deciding how many cores to request or which queue to target.
| Node | CPU Cores | Available RAM | GPU | Max Cores / Job | Queue |
|---|---|---|---|---|---|
| haas-hpc01 | 64 | 755 GiB | — | 40 | hi-mem |
| haas-hpc02 | 64 | 755 GiB | — | 40 | hi-mem |
| haas-hpc03 | 64 | 661 GiB | — | 40 | hi-mem |
| haas-hpc04 | 64 | 251 GiB | — | 40 | normal |
| haas-hpc05 | 40 | 251 GiB | — | 30 | normal |
| haas-hpc06 | 40 | 251 GiB | — | 30 | normal |
| haas-hpc07 | 40 | 251 GiB | — | 30 | normal |
| haas-hpc08 | 40 | 251 GiB | — | 30 | normal |
| haas-hpc09 | 40 | 251 GiB | — | 30 | normal |
| haas-hpc10 | 40 | 251 GiB | — | 30 | normal |
| haas-gpu01 | — | 1000 GiB | Nvidia A100 (2× 40GB) | — | gpu40a / gpu40b |
Note: The Max Cores / Job column is the per-host job slot limit (MXJ in lsb.hosts). Requesting more cores than this limit will cause your job to wait indefinitely in the queue.
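A quick way to sanity-check a core request before submitting is to compare it with the table. The sketch below hard-codes the Max Cores / Job values listed above; the helper names `max_cores` and `eligible_nodes` are illustrative and are not part of OpenLava.

```python
# Per-node limits copied from the Node Specifications table in this guide.
# Each entry is (node, queue, max cores per job).
NODE_LIMITS = [
    ("haas-hpc01", "hi-mem", 40),
    ("haas-hpc02", "hi-mem", 40),
    ("haas-hpc03", "hi-mem", 40),
    ("haas-hpc04", "normal", 40),
    ("haas-hpc05", "normal", 30),
    ("haas-hpc06", "normal", 30),
    ("haas-hpc07", "normal", 30),
    ("haas-hpc08", "normal", 30),
    ("haas-hpc09", "normal", 30),
    ("haas-hpc10", "normal", 30),
]


def max_cores(queue):
    """Largest single-node core request listed for the given queue."""
    limits = [limit for _, q, limit in NODE_LIMITS if q == queue]
    if not limits:
        raise ValueError(f"no core limit listed for queue {queue!r}")
    return max(limits)


def eligible_nodes(queue, cores):
    """Nodes in the queue whose per-job limit can hold the request."""
    return [node for node, q, limit in NODE_LIMITS
            if q == queue and cores <= limit]


# A 32-core request in the normal queue fits only on the 40-core-limit node.
print(eligible_nodes("normal", 32))
```

An empty result means the request exceeds every node's limit in that queue, which is the situation that leaves a job pending indefinitely.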
Queue Guide — Which Queue Should I Use?
🟢 normal (default)
- Nodes: hpc04 – hpc10
- RAM per node: up to 251 GiB
- Max cores per job: 40 on hpc04, 30 on hpc05 – hpc10
- Max jobs per user: 80
- Max jobs in queue: 400
- Best for: most research workloads, array jobs, standard parallel jobs
🔵 hi-mem
- Nodes: hpc01, hpc02, hpc03
- RAM per node: 661 – 755 GiB
- Max cores per job: 40
- Max jobs per user: 30
- Best for: large in-memory datasets, genome assembly, ML model training requiring > 251 GiB RAM
🟣 gpu40a
- Node: haas-gpu01
- GPU: Nvidia A100 40GB (partition A)
- System RAM: 1 TB
- Max jobs in queue: 1
- Best for: deep learning, large language models, GPU-accelerated scientific computing
🟣 gpu40b
- Node: haas-gpu01
- GPU: Nvidia A100 40GB (partition B)
- System RAM: 1 TB
- Max jobs in queue: 1
- Best for: deep learning, large language models, GPU-accelerated scientific computing
Targeting a Specific Queue
# Submit to the normal queue (this is the default)
bsub -q normal ./my_job.sh
# Submit to the hi-mem queue
bsub -q hi-mem ./my_job.sh
# Submit to GPU queue A
bsub -q gpu40a ./my_gpu_job.sh
# Submit to GPU queue B
bsub -q gpu40b ./my_gpu_job.sh
Check Current Queue Status
bqueues
GPU Computing on haas-gpu01
The cluster's newest addition, haas-gpu01, has an Nvidia A100 GPU split into two independent partitions with 40GB of memory each, accessible through the queues gpu40a and gpu40b. The node also provides 1 TB of system RAM, making it suitable for both GPU-intensive and memory-intensive workloads.
When to Use the GPU Node
- Deep learning: Training neural networks with PyTorch, TensorFlow, or JAX
- Large language models: Fine-tuning or inference with transformer models
- Scientific computing: GPU-accelerated simulations, molecular dynamics, image processing
- Data preprocessing: GPU-accelerated ETL pipelines with RAPIDS (cuDF, cuML)
Quick Start: Test GPU Access
First, verify that you can see the GPU:
# Interactive test - check if GPU is visible
bsub -Is -q gpu40a /opt/openlava-4.0/scripts/gpu40a.sh \
    docker run --rm --gpus all \
    nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04 \
    nvidia-smi
Understanding the GPU Workflow
GPU jobs require three components:
- Wrapper script: /opt/openlava-4.0/scripts/gpu40a.sh or gpu40b.sh, which sets the correct GPU partition
- Docker command: docker run --rm --gpus all, which launches the container with GPU access
- NGC container: pre-built images with CUDA, PyTorch, TensorFlow, etc.
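The three components always nest in the same order: bsub, then the wrapper, then docker run, then your command. As a minimal sketch, the hypothetical helper below assembles that command line as an argument list. It assumes the wrapper scripts are named after their queues, as shown in this guide, and it does not submit anything by itself.

```python
import os
import shlex

WRAPPER_DIR = "/opt/openlava-4.0/scripts"  # wrapper location given in this guide


def gpu_bsub_command(queue, image, command, home=None, interactive=False):
    """Return the argv list: bsub -> wrapper script -> docker run -> command."""
    if queue not in ("gpu40a", "gpu40b"):
        raise ValueError("queue must be 'gpu40a' or 'gpu40b'")
    if home is None:
        home = os.path.expanduser("~")
    argv = ["bsub"]
    if interactive:
        argv.append("-Is")
    argv += ["-q", queue, f"{WRAPPER_DIR}/{queue}.sh"]
    # Mount the home directory at /workspace, as in the examples in this guide.
    argv += ["docker", "run", "--rm", "--gpus", "all",
             "-v", f"{home}:/workspace", image]
    argv += list(command)
    return argv


argv = gpu_bsub_command(
    "gpu40a",
    "nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04",
    ["nvidia-smi"],
    interactive=True,
)
print(shlex.join(argv))  # printable form; pass argv to subprocess.run to submit
```

Building the list from the queue name keeps the queue and the wrapper script in sync, which avoids submitting to gpu40b while calling gpu40a.sh.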
Example: PyTorch GPU Training Script
Create a test script to verify GPU training works correctly:
cat > ~/train_model.py << 'EOF'
#!/usr/bin/env python3
"""Simple PyTorch GPU test - trains a small network on random data"""
import torch
import torch.nn as nn
import torch.optim as optim
import time
print("=" * 60)
print("PyTorch GPU Test Script")
print("=" * 60)
# Check GPU availability
print(f"\nPyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU count: {torch.cuda.device_count()}")
    print(f"GPU name: {torch.cuda.get_device_name(0)}")
    print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
else:
    print("WARNING: CUDA not available!")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"\nUsing device: {device}\n")
# Simple neural network
class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(1000, 512)
        self.fc2 = nn.Linear(512, 256)
        self.fc3 = nn.Linear(256, 10)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        return self.fc3(x)
# Create model
model = SimpleNet().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")
print("\nTraining for 5 epochs with 100 batches each...\n")
# Training loop
start_time = time.time()
for epoch in range(5):
    epoch_loss = 0.0
    for batch in range(100):
        # Generate random data
        inputs = torch.randn(128, 1000).to(device)
        targets = torch.randint(0, 10, (128,)).to(device)
        # Train
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    print(f"Epoch {epoch+1}/5 - Loss: {epoch_loss/100:.4f}")
total_time = time.time() - start_time
print(f"\nTraining completed in {total_time:.2f}s ({total_time/5:.2f}s per epoch)")
if torch.cuda.is_available():
    print(f"GPU memory used: {torch.cuda.memory_allocated(0) / 1e9:.2f} GB")
print("\n✓ Test completed successfully!")
EOF
chmod +x ~/train_model.py
Running the GPU Training Script
Interactive Mode (good for testing)
bsub -Is -q gpu40a /opt/openlava-4.0/scripts/gpu40a.sh \
    docker run --rm --gpus all \
    -v $HOME:/workspace \
    nvcr.io/nvidia/pytorch:24.01-py3 \
    python /workspace/train_model.py
Batch Mode (for production runs)
bsub -q gpu40b -o gpu_output_%J.log -e gpu_error_%J.log \
    /opt/openlava-4.0/scripts/gpu40b.sh \
    docker run --rm --gpus all \
    -v $HOME:/workspace \
    nvcr.io/nvidia/pytorch:24.01-py3 \
    python /workspace/train_model.py
- --rm: automatically clean up the container when it exits
- --gpus all: give the container access to the GPU (required)
- -v $HOME:/workspace: mount your home directory at /workspace inside the container
GPU Wrapper Script Details
Each GPU queue has a corresponding wrapper script that configures GPU access:
# /opt/openlava-4.0/scripts/gpu40a.sh
export CUDA_VISIBLE_DEVICES=MIG-06853c02-ad40-5910-8b6b-ffa9b7c69c5d
exec "$@"
# /opt/openlava-4.0/scripts/gpu40b.sh
export CUDA_VISIBLE_DEVICES=MIG-0b771745-e87a-5aae-854d-b8dadd631b54
exec "$@"
Recommended GPU Docker Images
Use official NVIDIA NGC containers for best compatibility:
| Framework | Container Image |
|---|---|
| PyTorch | nvcr.io/nvidia/pytorch:24.01-py3 |
| TensorFlow | nvcr.io/nvidia/tensorflow:24.01-tf2-py3 |
| CUDA base | nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04 |
| RAPIDS | nvcr.io/nvidia/rapidsai/rapidsai:24.02-cuda12.0-runtime-ubuntu22.04-py3.10 |
GPU Job Script Template
For longer jobs, create a script instead of typing the full command:
#!/bin/bash
#BSUB -J gpu_training
#BSUB -q gpu40a               # or gpu40b
#BSUB -W 24:00                # Wall-clock limit
#BSUB -o output_%J.log
#BSUB -e error_%J.log
# Run your GPU workload inside a container
/opt/openlava-4.0/scripts/gpu40a.sh \
    docker run --rm --gpus all \
    -v $HOME:/workspace \
    -w /workspace \
    nvcr.io/nvidia/pytorch:24.01-py3 \
    python train_model.py --epochs 100 --batch-size 64
Submit it with:
bsub < my_gpu_job.sh
Checking GPU Queue Status
# View GPU queue status
bqueues gpu40a gpu40b
# Check jobs in gpu40a queue
bjobs -q gpu40a
# Check jobs in gpu40b queue
bjobs -q gpu40b
# Check all your jobs (all queues)
bjobs
Note: You cannot combine multiple -q flags in a single bjobs command. Run separate commands for each queue or use bjobs without -q to see all your jobs.
Common Docker Volume Mounts
Mount different directories depending on where your data lives:
# Mount your home directory
-v $HOME:/workspace
# Mount a specific project directory
-v /data/projects/myproject:/project
# Mount multiple directories
-v $HOME:/home \
-v /data/shared:/data
# Set working directory inside container
-w /workspace
Installing Additional Python Packages
If you need packages not included in the NGC container:
# Option 1: Install at runtime (packages lost when container exits)
/opt/openlava-4.0/scripts/gpu40a.sh \
    docker run --rm --gpus all -v $HOME:/workspace \
    nvcr.io/nvidia/pytorch:24.01-py3 \
    bash -c "pip install transformers accelerate && python train.py"
# Option 2: Create a requirements file in your home directory (one package per line) and install it
printf '%s\n' transformers accelerate datasets > ~/requirements.txt
/opt/openlava-4.0/scripts/gpu40a.sh \
    docker run --rm --gpus all -v $HOME:/workspace \
    nvcr.io/nvidia/pytorch:24.01-py3 \
    bash -c "pip install -r /workspace/requirements.txt && python /workspace/train.py"
Monitoring GPU Usage
If you have an active job, you can SSH to the GPU node and monitor usage:
# Only do this if you have a running GPU job
ssh haas-gpu01
# Check GPU utilization
nvidia-smi
# Continuous monitoring (updates every 2 seconds)
watch -n 2 nvidia-smi
Always submit GPU workloads with bsub -q gpu40a or bsub -q gpu40b using the appropriate wrapper script.
Troubleshooting
Error: "could not select device driver"
Make sure you're using --gpus all in your docker command.
Error: "No such file or directory" for your Python script
Check that you've mounted the directory containing your script with -v and that the path inside the container is correct (e.g., /workspace/train_model.py).
Error: "CUDA out of memory"
Reduce your batch size or model size. The A100 has 40GB of memory per partition.
Job stays in PEND state
The other GPU partition may be occupied. Try the alternate queue (gpu40a vs gpu40b) or check bjobs -u all to see who's using the GPUs.
How do I SSH into the cluster?
From macOS or Linux:
ssh yourusername@hpc.haastech.org
Windows users should use MobaXterm or PuTTY.
FastX Remote Desktop
FastX provides a full graphical desktop session on the cluster, accessible from your browser or a native desktop client. It is ideal for running GUI applications such as RStudio, MATLAB, or any graphical tool without needing a local installation.
Step 1 — Download the FastX Client
Step 2 — Install the Client
- macOS: Open the .dmg and drag FastX to your Applications folder.
- Windows: Run the .exe installer and follow the prompts.
- Linux (DEB): sudo dpkg -i fastx-client_*.deb
- Linux (RPM): sudo rpm -i fastx-client_*.rpm
Step 3 — Connect to the Cluster
- Host: hpc.haastech.org
- Port: 3300 (default FastX port)
- Username: your Haas cluster username
- Authentication: Password or SSH key (same credentials as SSH access)
Step 4 — Start a Desktop Session
Click + to create a new session. Choose XFCE or KDE for best performance. Your session persists if you close the client window — you can resume it later without losing your work.
Resuming or Ending a Session
Existing sessions appear in the FastX session list when you connect — click one to resume. To fully terminate a session and free resources, right-click it and choose Terminate. Simply closing the window suspends without terminating.
Submitting Jobs with OpenLava
Basic Job Submission
# Submit a script to the default (normal) queue
bsub ./my_job.sh
Request Multiple Cores on a Single Node
bsub -n 16 -R "span[hosts=1]" ./my_job.sh
Submit to the hi-mem Queue
bsub -q hi-mem -n 32 -R "span[hosts=1]" ./my_job.sh
Submit to a GPU Queue (with Docker)
# GPU partition A
bsub -q gpu40a /opt/openlava-4.0/scripts/gpu40a.sh \
    docker run --rm --gpus all \
    -v $HOME:/workspace \
    nvcr.io/nvidia/pytorch:24.01-py3 \
    python /workspace/train.py
# GPU partition B
bsub -q gpu40b /opt/openlava-4.0/scripts/gpu40b.sh \
    docker run --rm --gpus all \
    -v $HOME:/workspace \
    nvcr.io/nvidia/pytorch:24.01-py3 \
    python /workspace/train.py
Set a Runtime Limit (recommended)
# Request 4 cores, limit to 24 hours
bsub -n 4 -W 24:00 ./my_job.sh
Check Queues, Nodes, and Running Jobs
bqueues        # queue summary and load
bhosts         # compute node status
bjobs          # your running and pending jobs
bjobs -u all   # all users' jobs
# Check specific queues (run separately)
bjobs -q gpu40a
bjobs -q gpu40b
Job Script Template
Copy this template as a starting point. Lines beginning with #BSUB are directives read by OpenLava — they are not ordinary comments.
#!/bin/bash
#BSUB -J my_analysis          # Job name
#BSUB -q normal               # Queue: normal, hi-mem, gpu40a, or gpu40b
#BSUB -n 8                    # Number of CPU cores
#BSUB -R "span[hosts=1]"      # Keep all cores on one node
#BSUB -W 12:00                # Wall-clock limit (HH:MM)
#BSUB -o output_%J.log        # Stdout (%J = job ID)
#BSUB -e error_%J.log         # Stderr
# --- Load your environment ---
source ~/.bashrc
conda activate myenv          # or module load, etc.
# --- Your commands ---
python run_analysis.py --input data.csv --output results/
Tip: Always set a -W wall-clock limit. Jobs without a limit can hold node resources indefinitely if something goes wrong.
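If you submit many similar jobs, generating the script from a few parameters can help keep the directives consistent. The following is a minimal sketch: `render_job_script` is a hypothetical helper, and it emits only the #BSUB directives used in the template above, rejecting a wall-clock limit that is not in HH:MM form.

```python
import re


def render_job_script(name, queue, cores, walltime, commands):
    """Return job script text using the #BSUB directives from the template."""
    # Accept only the HH:MM form used throughout this guide.
    if not re.fullmatch(r"\d{1,3}:[0-5]\d", walltime):
        raise ValueError(f"walltime must look like HH:MM, got {walltime!r}")
    lines = [
        "#!/bin/bash",
        f"#BSUB -J {name}",
        f"#BSUB -q {queue}",
        f"#BSUB -n {cores}",
        '#BSUB -R "span[hosts=1]"',
        f"#BSUB -W {walltime}",
        "#BSUB -o output_%J.log",
        "#BSUB -e error_%J.log",
    ]
    lines += list(commands)
    return "\n".join(lines) + "\n"


script = render_job_script(
    "my_analysis", "normal", 8, "12:00",
    ["source ~/.bashrc", "conda activate myenv", "python run_analysis.py"],
)
print(script)
```

Write the returned text to a file and submit it with bsub < file, as described for job scripts elsewhere in this guide.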
Hi-mem Variant
#!/bin/bash
#BSUB -J big_model
#BSUB -q hi-mem
#BSUB -n 32
#BSUB -R "span[hosts=1]"
#BSUB -W 48:00
#BSUB -o output_%J.log
#BSUB -e error_%J.log
source ~/.bashrc
conda activate myenv
python train_model.py
GPU Variant (Docker-based)
#!/bin/bash
#BSUB -J gpu_job
#BSUB -q gpu40a               # or gpu40b
#BSUB -W 24:00
#BSUB -o output_%J.log
#BSUB -e error_%J.log
# Use the wrapper script to run your container
/opt/openlava-4.0/scripts/gpu40a.sh \
    docker run --rm --gpus all \
    -v $HOME:/workspace \
    -w /workspace \
    nvcr.io/nvidia/pytorch:24.01-py3 \
    python train_model.py --epochs 100
bjobs / bkill Cheat Sheet
Viewing Jobs
# Your jobs (running + pending)
bjobs
# All users' jobs
bjobs -u all
# Detailed info for a specific job
bjobs -l 12345
# Show only running jobs
bjobs -r
# Show only pending jobs
bjobs -p
# Show finished / exited jobs
bjobs -d
bjobs -x
# Jobs in a specific queue
bjobs -q gpu40a
bjobs -q gpu40b
bjobs -q hi-mem
Canceling Jobs
# Kill a specific job by ID
bkill 12345
# Kill all your pending and running jobs
bkill 0
# Kill all jobs with a specific name
bkill -J my_analysis
Checking Node Load
# Node status summary
bhosts
# Detailed info for one node
bhosts -l haas-hpc01
# Current load per host (a snapshot; run it under watch to refresh)
lsload
Note: You cannot combine multiple -q options in a single bjobs command. To check jobs in multiple queues, run separate commands:
bjobs -q gpu40a
bjobs -q gpu40b
Or use bjobs without -q to see all your jobs across all queues.
Understanding bjobs Status Codes
| Status | Meaning |
|---|---|
| RUN | Job is actively running on a compute node |
| PEND | Job is queued, waiting for available slots |
| DONE | Job completed successfully (exit code 0) |
| EXIT | Job exited with an error; check your error_%J.log |
| SSUSP | Job suspended by the system (e.g., load threshold exceeded) |
| USUSP | Job suspended by the user via bstop |
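To summarize a long bjobs listing, you can count jobs by status code. The sketch below assumes the default bjobs layout, with a header row that contains a STAT column. The sample text is made up for illustration and is not real cluster output.

```python
from collections import Counter

# Illustrative sample only; real output comes from running `bjobs`.
SAMPLE = """\
JOBID   USER    STAT  QUEUE   FROM_HOST   EXEC_HOST   JOB_NAME     SUBMIT_TIME
12345   alice   RUN   normal  haas-hpc00  haas-hpc05  my_analysis  Oct  1 09:15
12346   alice   PEND  gpu40a  haas-hpc00              gpu_job      Oct  1 09:20
12347   alice   PEND  normal  haas-hpc00              sweep[3]     Oct  1 09:21
"""


def count_by_status(bjobs_text):
    """Count jobs per status code, locating the STAT column from the header."""
    lines = [line for line in bjobs_text.splitlines() if line.strip()]
    if not lines:
        return {}
    stat_index = lines[0].split().index("STAT")
    counts = Counter()
    for line in lines[1:]:
        fields = line.split()
        if len(fields) > stat_index:
            counts[fields[stat_index]] += 1
    return dict(counts)


print(count_by_status(SAMPLE))
```

Splitting on whitespace is safe here only because STAT comes before the columns that can be empty or contain spaces. Parsing later columns would need a fixed-width approach.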
Advanced OpenLava Examples
Job Array (parameter sweep)
# Submit 100 jobs; $LSB_JOBINDEX (1–100) is available inside the script
bsub -J "sweep[1-100]" ./run_case.sh
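Inside each array element, the job has to turn its index into a concrete set of parameters. One possible approach, sketched below, enumerates a parameter grid and picks the entry for the current index. The grid values are hypothetical and were chosen so that the grid has exactly 100 combinations.

```python
import itertools
import os

# Hypothetical sweep values: 4 x 5 x 5 = 100 combinations.
LEARNING_RATES = [0.1, 0.01, 0.001, 0.0001]
BATCH_SIZES = [16, 32, 64, 128, 256]
SEEDS = [0, 1, 2, 3, 4]


def params_for_index(index):
    """Map a 1-based array index to (learning_rate, batch_size, seed)."""
    grid = list(itertools.product(LEARNING_RATES, BATCH_SIZES, SEEDS))
    if not 1 <= index <= len(grid):
        raise IndexError(f"index {index} is outside 1..{len(grid)}")
    return grid[index - 1]


if __name__ == "__main__":
    # LSB_JOBINDEX is set for array jobs; default to 1 for local testing.
    index = int(os.environ.get("LSB_JOBINDEX", "1"))
    lr, batch_size, seed = params_for_index(index)
    print(f"element {index}: lr={lr} batch_size={batch_size} seed={seed}")
```

Because the mapping is deterministic, a failed element can be rerun on its own by resubmitting just that index.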
Each array element is listed with its index after the job ID, for example 9876[42]. Use bjobs 9876 to see all array elements at once.
Interactive Session
bsub -Is -q interactive bash
Job Dependency (run B after A succeeds)
# Submit job A and capture its ID
JOB_A=$(bsub ./job_a.sh | grep -oP '(?<=Job <)\d+')
# Submit job B to run only after A finishes successfully
bsub -w "done($JOB_A)" ./job_b.sh
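The same chaining can be driven from Python when a pipeline has more than two stages. This sketch assumes bsub prints a confirmation containing "Job <ID>", which is the same pattern the grep above relies on. The helper names are illustrative, and `submit` is defined here but not called.

```python
import re
import subprocess

_JOB_ID = re.compile(r"Job <(\d+)>")


def parse_job_id(bsub_output):
    """Extract the numeric job ID from bsub's confirmation message."""
    match = _JOB_ID.search(bsub_output)
    if match is None:
        raise ValueError(f"no job ID found in: {bsub_output!r}")
    return int(match.group(1))


def submit(script, after=None):
    """Submit a script, optionally only after job `after` finishes successfully."""
    argv = ["bsub"]
    if after is not None:
        argv += ["-w", f"done({after})"]
    argv.append(script)
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return parse_job_id(result.stdout)


# Example message in the assumed format (not captured from the cluster).
print(parse_job_id("Job <12345> is submitted to default queue <normal>."))
```

On the login node, `job_b = submit("./job_b.sh", after=submit("./job_a.sh"))` would reproduce the two-step example above.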
Linux Bash Essentials
ls -lh     # list files with sizes
du -sh *   # disk usage per item
top        # live process monitor
df -h      # filesystem disk usage
htop       # interactive process viewer (if installed)
Making Conda / Mamba Work Properly
conda init bash
source ~/.bashrc
mamba create -n research python=3.11
conda activate research
Activate your environment inside each job script (after source ~/.bashrc) so compute nodes use the correct Python and packages.
MariaDB Database Server (haas-hpc11)
haas-hpc11 hosts a shared MariaDB relational database server available to all Haas cluster users. It is the right tool when your research involves structured data that benefits from SQL queries, joins across large tables, or sharing a dataset with collaborators on the cluster without copying files.
Available Datasets
| Dataset | Description | Access |
|---|---|---|
| OpenAlex | A fully open catalog of the global research system — over 250 million scholarly works, authors, institutions, journals, and citation relationships. Useful for bibliometrics, citation network analysis, and science-of-science research. | Request access from Haas IT |
| Nielsen Scanner & Panel Data | Retail point-of-sale and consumer panel data. See the Nielsen FAQ below for details. | Restricted — see Nielsen FAQ |
Connecting to the Database
Connect from any cluster node using the standard MariaDB client:
mysql -h haas-hpc11 -u your_username -p
Or specify a database directly:
mysql -h haas-hpc11 -u your_username -p openalex
Connecting from Python
import pymysql
import pandas as pd
conn = pymysql.connect(
    host="haas-hpc11",
    user="your_username",
    password="your_password",
    database="openalex",
)
df = pd.read_sql("SELECT * FROM works LIMIT 100", conn)
conn.close()
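Hard-coding a password in a script makes it easy to leak through shared directories or version control. As one possible alternative, the sketch below reads the connection settings from environment variables. The variable names (HAAS_DB_USER and so on) are hypothetical and are not defined by the cluster.

```python
import os


def db_settings(env=None):
    """Build pymysql.connect() keyword arguments from environment variables."""
    env = os.environ if env is None else env
    missing = [key for key in ("HAAS_DB_USER", "HAAS_DB_PASSWORD")
               if key not in env]
    if missing:
        raise KeyError(f"set these environment variables first: {missing}")
    return {
        "host": env.get("HAAS_DB_HOST", "haas-hpc11"),
        "user": env["HAAS_DB_USER"],
        "password": env["HAAS_DB_PASSWORD"],
        "database": env.get("HAAS_DB_NAME", "openalex"),
    }


# Usage on the cluster (requires pymysql and a database account):
#   import pymysql
#   conn = pymysql.connect(**db_settings())
```

Export the variables in your shell or job script before running the Python code, and keep any file that sets them readable only by you.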
Connecting from R
library(DBI)
library(RMariaDB)
con <- dbConnect(MariaDB(),
  host = "haas-hpc11",
  user = "your_username",
  password = "your_password",
  dbname = "openalex"
)
df <- dbGetQuery(con, "SELECT * FROM works LIMIT 100")
dbDisconnect(con)
Requesting Access
Database accounts are not created automatically. To request access, contact Haas IT and include:
- Your cluster username
- Which dataset(s) you need access to
- A brief description of your research use
Tip: Use LIMIT while developing queries, and add WHERE clauses and indexes where possible before running full-table scans.
Nielsen Scanner & Panel Data
The Haas cluster hosts a licensed copy of two Nielsen datasets widely used in marketing, economics, and consumer behavior research. Access is restricted to authorized researchers due to licensing requirements.
What is Nielsen Scanner Data?
Nielsen Scanner Data (also called Retail Scanner Data or RMS — Retail Measurement Services) captures weekly point-of-sale transaction records from a large national sample of retail stores, including grocery, drug, and mass-merchandise outlets. Each record contains:
- UPC-level product information (brand, size, category)
- Weekly unit sales and revenue by store
- Price and promotional flag (feature ad, display, temporary price reduction)
- Store identifiers with market and channel type
Scanner data is well suited for studying pricing strategy, promotional effectiveness, market competition, and demand elasticity at the product and category level.
What is Nielsen Panel Data?
Nielsen Panel Data (also called HMS — Homescan Consumer Panel) tracks the purchasing behavior of a nationally representative panel of households over time. Panelists scan all their retail purchases at home using a handheld scanner. Each record includes:
- Household demographics (income, household size, age, education — anonymized)
- Every UPC purchased, the store visited, price paid, and any coupon use
- Purchase date and quantity
Panel data is especially useful for studying household brand loyalty, switching behavior, coupon responsiveness, and the effect of marketing on individual consumers over time.
How Scanner and Panel Data Complement Each Other
Scanner data tells you what sold and at what price across stores. Panel data tells you who bought it and where they shopped. Together they are a powerful combination for linking supply-side pricing decisions to demand-side consumer responses.
Requesting Access
To request access to the Nielsen datasets, contact Haas IT with the following:
- Your name, cluster username, and faculty sponsor (if applicable)
- A brief description of your research project and intended use of the data
- Confirmation that you have read and agree to the Nielsen data use terms
Haas IT will verify your eligibility and provision database access to the relevant Nielsen schemas on haas-hpc11.
Windows Cluster Access
Use Remote Desktop (RDP) to connect to the Windows research environment. Ensure VPN is active when off campus.
AEoD Virtual Desktop
AEoD provides virtual desktops through Citrix Workspace. Request access through Haas IT.