Haas Research Computing FAQ

Cluster Overview

The Haas OpenLava cluster provides high‑performance batch computing for research workloads.

  • Master / login node: hpc.haastech.org (haas-hpc00) — submit jobs here; do not run heavy workloads directly on this node
  • Compute nodes: haas-hpc01 through haas-hpc10
  • GPU node: haas-gpu01 — Nvidia A100 (2× 40GB partitions)
  • Special server: haas-hpc11 (SQL / JupyterHub support)
  • Four queues: normal for standard jobs  |  hi-mem for large-memory workloads  |  gpu40a & gpu40b for GPU computing

Node Specifications

Use this table when deciding how many cores to request or which queue to target.

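Node       | CPU Cores | Available RAM | GPU                   | Max Cores / Job | Queue
haas-hpc01 | 64        | 755 GiB       | -                     | 40              | hi-mem
haas-hpc02 | 64        | 755 GiB       | -                     | 40              | hi-mem
haas-hpc03 | 64        | 661 GiB       | -                     | 40              | hi-mem
haas-hpc04 | 64        | 251 GiB       | -                     | 40              | normal
haas-hpc05 | 40        | 251 GiB       | -                     | 30              | normal
haas-hpc06 | 40        | 251 GiB       | -                     | 30              | normal
haas-hpc07 | 40        | 251 GiB       | -                     | 30              | normal
haas-hpc08 | 40        | 251 GiB       | -                     | 30              | normal
haas-hpc09 | 40        | 251 GiB       | -                     | 30              | normal
haas-hpc10 | 40        | 251 GiB       | -                     | 30              | normal
haas-gpu01 | -         | 1000 GiB      | Nvidia A100 (2× 40GB) | -               | gpu40a / gpu40b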
💡 Max Cores / Job is the per-job slot limit set in OpenLava (MXJ in lsb.hosts). Requesting more cores than this limit will cause your job to wait indefinitely in the queue.
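
As a rough illustration of how the table can be used before submitting, the sketch below encodes the RAM and Max Cores / Job figures from the table and lists the nodes in a queue that could hold a given request. The function name and data layout are hypothetical, not part of OpenLava; the numbers are copied from the table and should be rechecked with bhosts if the cluster changes.

```python
# Per-node figures copied from the Node Specifications table:
# node -> (queue, available RAM in GiB, max cores per job).
# Names and layout are illustrative only, not a cluster API.
NODES = {
    "haas-hpc01": ("hi-mem", 755, 40),
    "haas-hpc02": ("hi-mem", 755, 40),
    "haas-hpc03": ("hi-mem", 661, 40),
    "haas-hpc04": ("normal", 251, 40),
}
# haas-hpc05 through haas-hpc10 share the same figures.
NODES.update({f"haas-hpc{i:02d}": ("normal", 251, 30) for i in range(5, 11)})


def nodes_that_fit(queue, cores, ram_gib):
    """Return the nodes in `queue` whose core limit and RAM cover the request."""
    return sorted(
        name
        for name, (node_queue, ram, max_cores) in NODES.items()
        if node_queue == queue and cores <= max_cores and ram_gib <= ram
    )


print(nodes_that_fit("normal", 32, 100))  # only the node with a 40-core limit
print(nodes_that_fit("hi-mem", 16, 700))  # only the 755 GiB nodes
print(nodes_that_fit("normal", 48, 100))  # empty list: such a job would never start
```

An empty result corresponds to the situation described in the note above: the request exceeds every node's per-job limit, so the job would stay pending.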

Queue Guide — Which Queue Should I Use?

🟢 normal (default)

  • Nodes: hpc04 – hpc10
  • RAM per node: up to 251 GiB
  • Max cores per job: 30–40
  • Max jobs per user: 80
  • Max jobs in queue: 400
  • Best for: most research workloads, array jobs, standard parallel jobs

🔵 hi-mem

  • Nodes: hpc01, hpc02, hpc03
  • RAM per node: 661 – 755 GiB
  • Max cores per job: 40
  • Max jobs per user: 30
  • Best for: large in-memory datasets, genome assembly, ML model training requiring > 251 GiB RAM

🟣 gpu40a

  • Node: haas-gpu01
  • GPU: Nvidia A100 40GB (partition A)
  • System RAM: 1 TB
  • Max jobs in queue: 1
  • Best for: deep learning, large language models, GPU-accelerated scientific computing

🟣 gpu40b

  • Node: haas-gpu01
  • GPU: Nvidia A100 40GB (partition B)
  • System RAM: 1 TB
  • Max jobs in queue: 1
  • Best for: deep learning, large language models, GPU-accelerated scientific computing
💡 Not sure? Start with the normal queue. Switch to hi-mem only if your job runs out of memory or you know you need more than ~200 GiB RAM. Use gpu40a or gpu40b for GPU-accelerated workloads.
⚠️ Job limits: The normal queue enforces a limit of 80 concurrent jobs per user and 400 jobs total in the queue. GPU queues allow 1 job at a time per partition. Plan large array jobs accordingly.
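
To make the planning advice concrete, here is a small back-of-the-envelope sketch. It assumes every array element counts as one job against both limits, that each task takes about the same time, and that the per-user limit is the only bottleneck; the function name is hypothetical.

```python
import math

MAX_RUNNING_PER_USER = 80  # normal queue, per the limits stated above
MAX_JOBS_IN_QUEUE = 400    # normal queue total, per the limits stated above


def plan_array(n_tasks, minutes_per_task):
    """Estimate how many waves of 80 tasks are needed and the total wall time."""
    waves = math.ceil(n_tasks / MAX_RUNNING_PER_USER)
    return {
        "waves": waves,
        "est_hours": waves * minutes_per_task / 60,
        # True suggests splitting the work into several smaller arrays.
        "exceeds_queue_limit": n_tasks > MAX_JOBS_IN_QUEUE,
    }


print(plan_array(1000, 30))
```

For 1000 tasks of 30 minutes each, this gives 13 waves and about 6.5 hours, and flags that the array is larger than the 400-job queue limit.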

Targeting a Specific Queue


# Submit to the normal queue (this is the default)
bsub -q normal ./my_job.sh

# Submit to the hi-mem queue
bsub -q hi-mem ./my_job.sh

# Submit to GPU queue A
bsub -q gpu40a ./my_gpu_job.sh

# Submit to GPU queue B
bsub -q gpu40b ./my_gpu_job.sh

Check Current Queue Status


bqueues

GPU Computing on haas-gpu01

The cluster's newest addition, haas-gpu01, features an Nvidia A100 GPU partitioned into two independent 40GB compute resources, accessible through queues gpu40a and gpu40b. The node also provides 1 TB of system RAM, making it suitable for both GPU-intensive and memory-intensive workloads.

⚠️ Ubuntu 26.04 Compatibility Note: The GPU server runs Ubuntu 26.04, which is too new for direct CUDA software installation. GPU jobs must use Docker containers and are executed through special wrapper scripts that configure GPU visibility.

When to Use the GPU Node

  • Deep learning: Training neural networks with PyTorch, TensorFlow, or JAX
  • Large language models: Fine-tuning or inference with transformer models
  • Scientific computing: GPU-accelerated simulations, molecular dynamics, image processing
  • Data preprocessing: GPU-accelerated ETL pipelines with RAPIDS (cuDF, cuML)

Quick Start: Test GPU Access

First, verify that you can see the GPU:


# Interactive test - check if GPU is visible
bsub -Is -q gpu40a /opt/openlava-4.0/scripts/gpu40a.sh \
  docker run --rm --gpus all \
  nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04 \
  nvidia-smi
💡 You should see output showing one Nvidia A100 40GB GPU. If you see "no devices were found", contact Haas IT.

Understanding the GPU Workflow

GPU jobs require three components:

  1. Wrapper script: /opt/openlava-4.0/scripts/gpu40a.sh or gpu40b.sh — sets the correct GPU partition
  2. Docker command: docker run --rm --gpus all — launches the container with GPU access
  3. NGC container: Pre-built images with CUDA, PyTorch, TensorFlow, etc.
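
The three components above can be assembled programmatically. The following is a minimal sketch, not an official helper: the function name is hypothetical and /home/alice is a made-up home directory. Its main point is that the queue and the wrapper script are derived from the same partition letter, so they cannot get out of sync.

```python
import shlex

WRAPPER_DIR = "/opt/openlava-4.0/scripts"  # wrapper location given in this FAQ


def gpu_bsub_argv(partition, image, command, home):
    """Build the argument list for a GPU job.

    partition is "a" or "b"; it selects both the queue (gpu40a / gpu40b)
    and the matching wrapper script.
    """
    if partition not in ("a", "b"):
        raise ValueError("partition must be 'a' or 'b'")
    queue = f"gpu40{partition}"
    return [
        "bsub", "-q", queue,
        f"{WRAPPER_DIR}/{queue}.sh",                # 1. wrapper script
        "docker", "run", "--rm", "--gpus", "all",   # 2. docker command
        "-v", f"{home}:/workspace",
        image,                                      # 3. NGC container
        *command,
    ]


argv = gpu_bsub_argv(
    "a",
    "nvcr.io/nvidia/pytorch:24.01-py3",
    ["python", "/workspace/train_model.py"],
    home="/home/alice",  # hypothetical home directory
)
print(shlex.join(argv))  # shell-ready command line
```

The resulting list can be passed to subprocess.run on the login node, or printed and pasted into a terminal.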

Example: PyTorch GPU Training Script

Create a test script to verify GPU training works correctly:


cat > ~/train_model.py << 'EOF'
#!/usr/bin/env python3
"""Simple PyTorch GPU test - trains a small network on random data"""

import torch
import torch.nn as nn
import torch.optim as optim
import time

print("=" * 60)
print("PyTorch GPU Test Script")
print("=" * 60)

# Check GPU availability
print(f"\nPyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU count: {torch.cuda.device_count()}")
    print(f"GPU name: {torch.cuda.get_device_name(0)}")
    print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
else:
    print("WARNING: CUDA not available!")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"\nUsing device: {device}\n")

# Simple neural network
class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(1000, 512)
        self.fc2 = nn.Linear(512, 256)
        self.fc3 = nn.Linear(256, 10)
        self.relu = nn.ReLU()
    
    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        return self.fc3(x)

# Create model
model = SimpleNet().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")
print("\nTraining for 5 epochs with 100 batches each...\n")

# Training loop
start_time = time.time()
for epoch in range(5):
    epoch_loss = 0.0
    for batch in range(100):
        # Generate random data
        inputs = torch.randn(128, 1000).to(device)
        targets = torch.randint(0, 10, (128,)).to(device)
        
        # Train
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()
        
        epoch_loss += loss.item()
    
    print(f"Epoch {epoch+1}/5 - Loss: {epoch_loss/100:.4f}")

total_time = time.time() - start_time
print(f"\nTraining completed in {total_time:.2f}s ({total_time/5:.2f}s per epoch)")

if torch.cuda.is_available():
    print(f"GPU memory used: {torch.cuda.memory_allocated(0) / 1e9:.2f} GB")

print("\n✓ Test completed successfully!")
EOF

chmod +x ~/train_model.py

Running the GPU Training Script

Interactive Mode (good for testing)


bsub -Is -q gpu40a /opt/openlava-4.0/scripts/gpu40a.sh \
  docker run --rm --gpus all \
  -v $HOME:/workspace \
  nvcr.io/nvidia/pytorch:24.01-py3 \
  python /workspace/train_model.py

Batch Mode (for production runs)


bsub -q gpu40b -o gpu_output_%J.log -e gpu_error_%J.log \
  /opt/openlava-4.0/scripts/gpu40b.sh \
  docker run --rm --gpus all \
  -v $HOME:/workspace \
  nvcr.io/nvidia/pytorch:24.01-py3 \
  python /workspace/train_model.py
💡 Key Docker flags explained:
  • --rm — automatically clean up container when it exits
  • --gpus all — give container access to GPU (required)
  • -v $HOME:/workspace — mount your home directory at /workspace inside container

GPU Wrapper Script Details

Each GPU queue has a corresponding wrapper script that configures GPU access:


# /opt/openlava-4.0/scripts/gpu40a.sh
export CUDA_VISIBLE_DEVICES=MIG-06853c02-ad40-5910-8b6b-ffa9b7c69c5d
exec "$@"

# /opt/openlava-4.0/scripts/gpu40b.sh
export CUDA_VISIBLE_DEVICES=MIG-0b771745-e87a-5aae-854d-b8dadd631b54
exec "$@"

Recommended GPU Docker Images

Use official NVIDIA NGC containers for best compatibility:

Framework  | Container Image
PyTorch    | nvcr.io/nvidia/pytorch:24.01-py3
TensorFlow | nvcr.io/nvidia/tensorflow:24.01-tf2-py3
CUDA base  | nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
RAPIDS     | nvcr.io/nvidia/rapidsai/rapidsai:24.02-cuda12.0-runtime-ubuntu22.04-py3.10
💡 Browse all available containers at NGC Catalog.

GPU Job Script Template

For longer jobs, create a script instead of typing the full command:


#!/bin/bash
#BSUB -J gpu_training
#BSUB -q gpu40a                          # or gpu40b
#BSUB -W 24:00                           # Wall-clock limit
#BSUB -o output_%J.log
#BSUB -e error_%J.log

# Run your GPU workload inside a container
/opt/openlava-4.0/scripts/gpu40a.sh \
  docker run --rm --gpus all \
  -v $HOME:/workspace \
  -w /workspace \
  nvcr.io/nvidia/pytorch:24.01-py3 \
  python train_model.py --epochs 100 --batch-size 64

Submit it with:


bsub < my_gpu_job.sh
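
The template passes --epochs and --batch-size to train_model.py, but the test script shown earlier does not read any command-line flags. One way a training script might accept them is sketched below with the standard argparse module; the defaults mirror the 5 epochs and batch size of 128 used in the test script, and the function name is an assumption.

```python
import argparse


def parse_args(argv=None):
    """Parse the training flags used in the job script template."""
    parser = argparse.ArgumentParser(description="GPU training run")
    parser.add_argument("--epochs", type=int, default=5,
                        help="number of passes over the data")
    parser.add_argument("--batch-size", type=int, default=128,
                        help="samples per batch")
    # argv=None makes argparse read sys.argv, which is what a real job would use.
    return parser.parse_args(argv)


# Same flags as in the template; argparse exposes --batch-size as args.batch_size.
args = parse_args(["--epochs", "100", "--batch-size", "64"])
print(args.epochs, args.batch_size)
```

In the training loop, range(5) and the hard-coded 128 would then be replaced by args.epochs and args.batch_size.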

Checking GPU Queue Status


# View GPU queue status
bqueues gpu40a gpu40b

# Check jobs in gpu40a queue
bjobs -q gpu40a

# Check jobs in gpu40b queue
bjobs -q gpu40b

# Check all your jobs (all queues)
bjobs
⚠️ Important: You cannot use multiple -q flags in a single bjobs command. Run separate commands for each queue or use bjobs without -q to see all your jobs.
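
If you check several queues often, the separate commands can be wrapped in a loop. This is a minimal sketch with a hypothetical function name; the run argument exists only so the example can be exercised without a scheduler, and the stand-in runner below fabricates its output.

```python
import subprocess


def bjobs_per_queue(queues, run=subprocess.run):
    """Run `bjobs -q <queue>` once per queue and collect the text output."""
    results = {}
    for queue in queues:
        proc = run(["bjobs", "-q", queue], capture_output=True, text=True)
        results[queue] = proc.stdout.strip()
    return results


# Stand-in runner so the example works off-cluster; on the login node,
# call bjobs_per_queue(["gpu40a", "gpu40b"]) without the run argument.
def _fake_run(cmd, **kwargs):
    return subprocess.CompletedProcess(
        cmd, 0, stdout=f"(output of {' '.join(cmd)})\n", stderr=""
    )


demo = bjobs_per_queue(["gpu40a", "gpu40b"], run=_fake_run)
for queue, text in demo.items():
    print(queue, "->", text)
```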

Common Docker Volume Mounts

Mount different directories depending on where your data lives:


# Mount your home directory
-v $HOME:/workspace

# Mount a specific project directory
-v /data/projects/myproject:/project

# Mount multiple directories
-v $HOME:/home \
-v /data/shared:/data

# Set working directory inside container
-w /workspace

Installing Additional Python Packages

If you need packages not included in the NGC container:


# Run these inside a job script, or prefix each command with: bsub -q gpu40a

# Option 1: Install at runtime (packages lost when container exits)
/opt/openlava-4.0/scripts/gpu40a.sh \
  docker run --rm --gpus all -v $HOME:/workspace \
  nvcr.io/nvidia/pytorch:24.01-py3 \
  bash -c "pip install transformers accelerate && python /workspace/train.py"

# Option 2: Create a requirements file in your home directory and install it
# (it must live under $HOME so the container sees it at /workspace)
echo "transformers
accelerate
datasets" > $HOME/requirements.txt

/opt/openlava-4.0/scripts/gpu40a.sh \
  docker run --rm --gpus all -v $HOME:/workspace \
  nvcr.io/nvidia/pytorch:24.01-py3 \
  bash -c "pip install -r /workspace/requirements.txt && python /workspace/train.py"

Monitoring GPU Usage

If you have an active job, you can SSH to the GPU node and monitor usage:


# Only do this if you have a running GPU job
ssh haas-gpu01

# Check GPU utilization
nvidia-smi

# Continuous monitoring (updates every 2 seconds)
watch -n 2 nvidia-smi
💡 Each GPU queue (gpu40a and gpu40b) allows only 1 job at a time. Plan accordingly — if one partition is busy, submit to the other.
⚠️ Do not SSH directly to haas-gpu01 unless you have an active job running there. Always submit work through bsub -q gpu40a or bsub -q gpu40b using the appropriate wrapper script.

Troubleshooting

Error: "could not select device driver"

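Make sure your docker command includes --gpus all and that the job was submitted through a GPU queue, so that it runs on haas-gpu01. Docker typically reports this error when it cannot find the NVIDIA container runtime on the host; if it persists on haas-gpu01, contact Haas IT.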

Error: "No such file or directory" for your Python script

Check that you've mounted the directory containing your script with -v and that the path inside the container is correct (e.g., /workspace/train_model.py).
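
A quick way to reason about this error is to translate the host path by hand, or with a few lines of code. The sketch below assumes a hypothetical user directory /home/alice and a hypothetical function name; the mounts dictionary mirrors the -v HOST:CONTAINER flags on the docker command line.

```python
from pathlib import PurePosixPath


def container_path(host_path, mounts):
    """Translate a host path to the path seen inside the container.

    Returns None when no mount covers the path, meaning the container
    cannot see the file at all.
    """
    host = PurePosixPath(host_path)
    # Try the most specific (longest) host directory first.
    for src in sorted(mounts, key=len, reverse=True):
        try:
            relative = host.relative_to(src)
        except ValueError:
            continue  # host_path is not under this mount
        return str(PurePosixPath(mounts[src]) / relative)
    return None


# Equivalent to: -v /home/alice:/workspace -v /data/shared:/data
mounts = {"/home/alice": "/workspace", "/data/shared": "/data"}
print(container_path("/home/alice/train_model.py", mounts))
print(container_path("/scratch/alice/train_model.py", mounts))
```

The first call yields /workspace/train_model.py; the second yields None because nothing under /scratch was mounted, which is the situation that produces the "No such file or directory" error.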

Error: "CUDA out of memory"

Reduce your batch size or model size. The A100 has 40GB of memory per partition.
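
Before submitting, it can help to estimate whether a model can fit at all. The sketch below is a common rule of thumb rather than a measurement: it assumes fp32 values and an Adam-style optimizer that keeps two extra values per parameter, and it ignores activations and framework overhead, so real usage is higher and grows with batch size. The 3-billion-parameter model is a made-up example.

```python
PARTITION_GB = 40  # per-partition GPU memory stated above


def training_memory_gb(n_params, bytes_per_value=4, optimizer_states=2):
    """Lower-bound estimate for weights, gradients, and optimizer state, in GB."""
    copies = 1 + 1 + optimizer_states  # weights + gradients + optimizer states
    return n_params * bytes_per_value * copies / 1e9


examples = [
    ("SimpleNet from the test script", 646_410),
    ("hypothetical 3B-parameter model", 3_000_000_000),
]
for name, n_params in examples:
    need = training_memory_gb(n_params)
    verdict = "may fit" if need < PARTITION_GB else "does not fit"
    print(f"{name}: ~{need:.2f} GB before activations ({verdict})")
```

When the estimate alone exceeds 40 GB, reducing the batch size will not help; the model itself, its numeric precision, or the optimizer has to change.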

Job stays in PEND state

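Each GPU queue runs only one job at a time, so the partition you requested is probably occupied. Try the alternate queue (gpu40a vs gpu40b), remembering to switch to the matching wrapper script, or run bjobs -u all to see who is using the GPUs.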

How do I SSH into the cluster?

From macOS or Linux:


ssh yourusername@hpc.haastech.org

Windows users should use MobaXterm or PuTTY.

🔒 Off campus? Connect to the UC Berkeley VPN at vpn.berkeley.edu before logging in.

FastX Remote Desktop

FastX provides a full graphical desktop session on the cluster, accessible from your browser or a native desktop client. It is ideal for running GUI applications such as RStudio, MATLAB, or any graphical tool without needing a local installation.

Step 1 — Download the FastX Client

  • macOS: macOS 10.13+
  • Windows: Windows 10/11
  • Linux: RPM & DEB packages
💡 Browser option: You can also connect in a modern web browser without installing any client — navigate to the FastX web URL provided by Haas IT and log in with your cluster credentials.

Step 2 — Install the Client

  • macOS: Open the .dmg and drag FastX to your Applications folder.
  • Windows: Run the .exe installer and follow the prompts.
  • Linux (DEB): sudo dpkg -i fastx-client_*.deb
  • Linux (RPM): sudo rpm -i fastx-client_*.rpm

Step 3 — Connect to the Cluster

  • Host: hpc.haastech.org
  • Port: 3300 (default FastX port)
  • Username: your Haas cluster username
  • Authentication: Password or SSH key (same credentials as SSH access)
🔒 Off-campus users: You must be connected to the UC Berkeley VPN before launching a FastX session.

Step 4 — Start a Desktop Session

Click + to create a new session. Choose XFCE or KDE for best performance. Your session persists if you close the client window — you can resume it later without losing your work.

Resuming or Ending a Session

Existing sessions appear in the FastX session list when you connect — click one to resume. To fully terminate a session and free resources, right-click it and choose Terminate. Simply closing the window suspends without terminating.

Submitting Jobs with OpenLava

Basic Job Submission


# Submit a script to the default (normal) queue
bsub ./my_job.sh

Request Multiple Cores on a Single Node


bsub -n 16 -R "span[hosts=1]" ./my_job.sh

Submit to the hi-mem Queue


bsub -q hi-mem -n 32 -R "span[hosts=1]" ./my_job.sh

Submit to a GPU Queue (with Docker)


# GPU partition A
bsub -q gpu40a /opt/openlava-4.0/scripts/gpu40a.sh \
  docker run --rm --gpus all \
  -v $HOME:/workspace \
  nvcr.io/nvidia/pytorch:24.01-py3 \
  python /workspace/train.py

# GPU partition B  
bsub -q gpu40b /opt/openlava-4.0/scripts/gpu40b.sh \
  docker run --rm --gpus all \
  -v $HOME:/workspace \
  nvcr.io/nvidia/pytorch:24.01-py3 \
  python /workspace/train.py

Set a Runtime Limit (recommended)


# Request 4 cores, limit to 24 hours
bsub -n 4 -W 24:00 ./my_job.sh

Check Queues, Nodes, and Running Jobs


bqueues        # queue summary and load
bhosts         # compute node status
bjobs          # your running and pending jobs
bjobs -u all   # all users' jobs

# Check specific queues (run separately)
bjobs -q gpu40a
bjobs -q gpu40b

Job Script Template

Copy this template as a starting point. Lines beginning with #BSUB are directives read by OpenLava — they are not ordinary comments.


#!/bin/bash
#BSUB -J my_analysis          # Job name
#BSUB -q normal                # Queue: normal, hi-mem, gpu40a, or gpu40b
#BSUB -n 8                     # Number of CPU cores
#BSUB -R "span[hosts=1]"       # Keep all cores on one node
#BSUB -W 12:00                 # Wall-clock limit (HH:MM)
#BSUB -o output_%J.log         # Stdout  (%J = job ID)
#BSUB -e error_%J.log          # Stderr

# --- Load your environment ---
source ~/.bashrc
conda activate myenv            # or module load, etc.

# --- Your commands ---
python run_analysis.py --input data.csv --output results/
💡 Always set a -W wall-clock limit. Jobs without a limit can hold node resources indefinitely if something goes wrong.

Hi-mem Variant


#!/bin/bash
#BSUB -J big_model
#BSUB -q hi-mem
#BSUB -n 32
#BSUB -R "span[hosts=1]"
#BSUB -W 48:00
#BSUB -o output_%J.log
#BSUB -e error_%J.log

source ~/.bashrc
conda activate myenv
python train_model.py

GPU Variant (Docker-based)


#!/bin/bash
#BSUB -J gpu_job
#BSUB -q gpu40a                # or gpu40b
#BSUB -W 24:00
#BSUB -o output_%J.log
#BSUB -e error_%J.log

# Use the wrapper script to run your container
/opt/openlava-4.0/scripts/gpu40a.sh \
  docker run --rm --gpus all \
  -v $HOME:/workspace \
  -w /workspace \
  nvcr.io/nvidia/pytorch:24.01-py3 \
  python train_model.py --epochs 100

bjobs / bkill Cheat Sheet

Viewing Jobs


# Your jobs (running + pending)
bjobs

# All users' jobs
bjobs -u all

# Detailed info for a specific job
bjobs -l 12345

# Show only running jobs
bjobs -r

# Show only pending jobs
bjobs -p

# Show finished / exited jobs
bjobs -d
bjobs -x

# Jobs in a specific queue
bjobs -q gpu40a
bjobs -q gpu40b
bjobs -q hi-mem

Canceling Jobs


# Kill a specific job by ID
bkill 12345

# Kill all your pending and running jobs
bkill 0

# Kill all jobs with a specific name
bkill -J my_analysis

Checking Node Load


# Node status summary
bhosts

# Detailed info for one node
bhosts -l haas-hpc01

# Current load on each node (one-time snapshot; rerun to refresh)
lsload
⚠️ Important: You cannot use multiple -q options in a single bjobs command. To check jobs in multiple queues, run separate commands:

bjobs -q gpu40a
bjobs -q gpu40b

Or use bjobs without -q to see all your jobs across all queues.

Understanding bjobs Status Codes

Status | Meaning
RUN    | Job is actively running on a compute node
PEND   | Job is queued, waiting for available slots
DONE   | Job completed successfully (exit code 0)
EXIT   | Job exited with an error; check your error_%J.log
SSUSP  | Job suspended by the system (e.g., load threshold exceeded)
USUSP  | Job suspended by the user via bstop
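
For a quick summary of many jobs, the status column can be tallied with a short script. This sketch assumes the usual bjobs column layout with a header row containing STAT; the sample text is invented for illustration (job IDs, user name, and times are made up), and the function name is hypothetical.

```python
from collections import Counter

# Invented sample in the usual bjobs layout; not real cluster output.
SAMPLE = """\
JOBID   USER    STAT  QUEUE   FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1201    alice   RUN   normal  haas-hpc00  haas-hpc05  sweep[1]   Jan 10 09:00
1201    alice   RUN   normal  haas-hpc00  haas-hpc06  sweep[2]   Jan 10 09:00
1201    alice   PEND  normal  haas-hpc00              sweep[3]   Jan 10 09:00
1188    alice   SSUSP hi-mem  haas-hpc00  haas-hpc01  big_model  Jan  9 17:30
"""


def count_by_status(bjobs_text):
    """Count jobs per STAT value, locating the column from the header row."""
    lines = [line for line in bjobs_text.splitlines() if line.strip()]
    stat_col = lines[0].split().index("STAT")
    return Counter(line.split()[stat_col] for line in lines[1:])


print(count_by_status(SAMPLE))
```

Splitting on whitespace is safe here because STAT comes before EXEC_HOST, the one column that is blank for pending jobs.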

Advanced OpenLava Examples

Job Array (parameter sweep)


# Submit 100 jobs; $LSB_JOBINDEX (1–100) is available inside the script
bsub -J "sweep[1-100]" ./run_case.sh
💡 Job arrays share a single job ID with an index suffix, e.g. 9876[42]. Use bjobs 9876 to see all array elements at once.
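
Inside each array element, the index is typically turned into a concrete set of parameters. The sketch below shows one way to do this from Python; the 5 x 20 grid of batch sizes and seeds is a made-up example, and the function name is hypothetical.

```python
import itertools
import os

# Hypothetical grid: 5 batch sizes x 20 seeds = 100 combinations for sweep[1-100].
BATCH_SIZES = [16, 32, 64, 128, 256]
SEEDS = list(range(20))
GRID = list(itertools.product(BATCH_SIZES, SEEDS))


def params_for_index(index):
    """Map a 1-based array index to one (batch_size, seed) pair."""
    if not 1 <= index <= len(GRID):
        raise IndexError(f"index {index} is outside 1..{len(GRID)}")
    return GRID[index - 1]


# The scheduler sets LSB_JOBINDEX inside an array job; fall back to 1
# so the script can also be tried outside the scheduler.
index = int(os.environ.get("LSB_JOBINDEX", "1"))
batch_size, seed = params_for_index(index)
print(f"array element {index}: batch_size={batch_size}, seed={seed}")
```

Keeping the grid in one ordered list means any array element can be rerun on its own and will always receive the same parameters.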

Interactive Session


# The cluster has no separate interactive queue; use one of the four queues listed above
bsub -Is -q normal bash

Job Dependency (run B after A succeeds)


# Submit job A and capture its ID
JOB_A=$(bsub ./job_a.sh | grep -oP '(?<=Job <)\d+')

# Submit job B to run only after A finishes successfully
bsub -w "done($JOB_A)" ./job_b.sh
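
The same chaining can be driven from Python when a pipeline has many stages. This is a minimal sketch with hypothetical function names; it assumes bsub reports submissions in the form "Job <ID> is submitted to ...", which is the same pattern the grep command above relies on.

```python
import re

JOB_ID_RE = re.compile(r"Job <(\d+)>")


def parse_job_id(bsub_output):
    """Extract the numeric job ID from bsub's submission message."""
    match = JOB_ID_RE.search(bsub_output)
    if match is None:
        # Fail loudly instead of submitting a follow-up job with an empty dependency.
        raise ValueError(f"no job ID found in: {bsub_output!r}")
    return int(match.group(1))


def dependency_argv(job_id, script):
    """Build the bsub command that starts `script` only after `job_id` succeeds."""
    return ["bsub", "-w", f"done({job_id})", script]


# Example message in the assumed format; on the cluster this text would come
# from subprocess.run(["bsub", "./job_a.sh"], capture_output=True, text=True).stdout
job_a = parse_job_id("Job <12345> is submitted to default queue <normal>.")
print(dependency_argv(job_a, "./job_b.sh"))
```

Unlike the one-line shell version, this raises an error if the first submission failed, so job B is never submitted with a malformed done() condition.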

Linux Bash Essentials


ls -lh          # list files with sizes
du -sh *        # disk usage per item
top             # live process monitor
df -h           # filesystem disk usage
htop            # interactive process viewer (if installed)

Making Conda / Mamba Work Properly


conda init bash
source ~/.bashrc
mamba create -n research python=3.11
conda activate research
💡 Always activate your conda environment inside your job script (after source ~/.bashrc) so compute nodes use the correct Python and packages.

MariaDB Database Server (haas-hpc11)

haas-hpc11 hosts a shared MariaDB relational database server available to all Haas cluster users. It is the right tool when your research involves structured data that benefits from SQL queries, joins across large tables, or sharing a dataset with collaborators on the cluster without copying files.

Available Datasets

Dataset | Description | Access
OpenAlex | A fully open catalog of the global research system: over 250 million scholarly works, authors, institutions, journals, and citation relationships. Useful for bibliometrics, citation network analysis, and science-of-science research. | Request access from Haas IT
Nielsen Scanner & Panel Data | Retail point-of-sale and consumer panel data. See the Nielsen FAQ below for details. | Restricted (see Nielsen FAQ)

Connecting to the Database

Connect from any cluster node using the standard MariaDB client:


mysql -h haas-hpc11 -u your_username -p

Or specify a database directly:


mysql -h haas-hpc11 -u your_username -p openalex

Connecting from Python


import pymysql
import pandas as pd

conn = pymysql.connect(
    host="haas-hpc11",
    user="your_username",
    password="your_password",
    database="openalex"
)

df = pd.read_sql("SELECT * FROM works LIMIT 100", conn)
conn.close()

Connecting from R


library(DBI)
library(RMariaDB)

con <- dbConnect(MariaDB(),
  host     = "haas-hpc11",
  user     = "your_username",
  password = "your_password",
  dbname   = "openalex"
)

df <- dbGetQuery(con, "SELECT * FROM works LIMIT 100")
dbDisconnect(con)

Requesting Access

Database accounts are not created automatically. To request access, contact Haas IT and include:

  • Your cluster username
  • Which dataset(s) you need access to
  • A brief description of your research use
💡 OpenAlex is free to use for all Haas cluster users — just request an account. No additional approvals or licensing are required.
⚠️ The database server is a shared resource. Avoid running very large unoptimized queries during peak hours. Use LIMIT while developing queries, and add WHERE clauses and indexes where possible before running full-table scans.
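
The pattern of developing with WHERE and LIMIT and then reading results in batches can be sketched as follows. So that it runs anywhere, the example uses Python's built-in sqlite3 with an in-memory table as a stand-in for MariaDB; the table contents and the publication_year column are invented and are not the real OpenAlex schema. With pymysql the same cursor calls apply, but placeholders are written %s instead of ?, and whether rows are streamed or buffered on the client depends on the cursor class in use.

```python
import sqlite3


def fetch_in_batches(conn, query, params=(), batch_size=1000):
    """Yield lists of rows using the DB-API fetchmany call."""
    cur = conn.cursor()
    cur.execute(query, params)
    while True:
        rows = cur.fetchmany(batch_size)
        if not rows:
            break
        yield rows
    cur.close()


# Stand-in database: 500 made-up rows spread over the years 2000-2024.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE works (id INTEGER PRIMARY KEY, publication_year INTEGER)")
conn.executemany(
    "INSERT INTO works VALUES (?, ?)",
    [(i, 2000 + i % 25) for i in range(1, 501)],
)

# Filter with WHERE and cap with LIMIT while developing the query;
# lift the LIMIT only once the query returns what you expect.
query = "SELECT id FROM works WHERE publication_year = ? ORDER BY id LIMIT 10"
batches = list(fetch_in_batches(conn, query, (2020,), batch_size=4))
print([len(batch) for batch in batches])
conn.close()
```

Passing the year as a parameter rather than formatting it into the SQL string also avoids quoting mistakes as queries grow.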

Nielsen Scanner & Panel Data

The Haas cluster hosts a licensed copy of two Nielsen datasets widely used in marketing, economics, and consumer behavior research. Access is restricted to authorized researchers due to licensing requirements.

What is Nielsen Scanner Data?

Nielsen Scanner Data (also called Retail Scanner Data or RMS — Retail Measurement Services) captures weekly point-of-sale transaction records from a large national sample of retail stores, including grocery, drug, and mass-merchandise outlets. Each record contains:

  • UPC-level product information (brand, size, category)
  • Weekly unit sales and revenue by store
  • Price and promotional flag (feature ad, display, temporary price reduction)
  • Store identifiers with market and channel type

Scanner data is well suited for studying pricing strategy, promotional effectiveness, market competition, and demand elasticity at the product and category level.

What is Nielsen Panel Data?

Nielsen Panel Data (also called HMS — Homescan Consumer Panel) tracks the purchasing behavior of a nationally representative panel of households over time. Panelists scan all their retail purchases at home using a handheld scanner. Each record includes:

  • Household demographics (income, household size, age, education — anonymized)
  • Every UPC purchased, the store visited, price paid, and any coupon use
  • Purchase date and quantity

Panel data is especially useful for studying household brand loyalty, switching behavior, coupon responsiveness, and the effect of marketing on individual consumers over time.

How Scanner and Panel Data Complement Each Other

Scanner data tells you what sold and at what price across stores. Panel data tells you who bought it and where they shopped. Together they are a powerful combination for linking supply-side pricing decisions to demand-side consumer responses.

⚠️ Access is restricted. Nielsen data is licensed and may only be used for approved academic research. You must have a signed data use agreement on file before accessing these tables. Do not copy, export, or share raw data outside the cluster environment.

Requesting Access

To request access to the Nielsen datasets, contact Haas IT with the following:

  • Your name, cluster username, and faculty sponsor (if applicable)
  • A brief description of your research project and intended use of the data
  • Confirmation that you have read and agree to the Nielsen data use terms

Haas IT will verify your eligibility and provision database access to the relevant Nielsen schemas on haas-hpc11.

💡 If you are unsure whether Nielsen data is appropriate for your project, reach out to Haas IT or your faculty advisor before requesting access.

Windows Cluster Access

Use Remote Desktop (RDP) to connect to the Windows research environment. Ensure VPN is active when off campus.

AEoD Virtual Desktop

AEoD provides virtual desktops through Citrix Workspace. Request access through Haas IT.