List of Most Popular Haas Compute Services

Custom JupyterHub for Courses

https://jupyterhub.haas.berkeley.edu

Our most popular service provides custom JupyterHub environments for coursework. This service offers numerous advantages for instructors:

  • Specify custom library versions and dependencies for all students
  • Share data and notebooks through a common read-only shared folder
  • Administrative access to all student notebooks
  • Account management capabilities
  • Custom URL options (e.g., mba123.haastech.org)
  • Optional semester extension
  • Configurable resource limits (RAM and CPU) per student
  • Optional SQL database integration

High-Performance Computing (HPC) Cluster

Our HPC cluster is designed for researchers and research groups, featuring:

  • 10 compute nodes (64 or 40 cores per node)
  • Support for OpenLava and SLURM job control commands
  • Dedicated SQL database node
  • Custom shared folders for research groups
  • Access via SSH and X-Windows Desktop (FastX)
  • 20 simultaneous jobs per user
  • Flexible core allocation across nodes
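
As a minimal sketch of how a large batch of single-core jobs might be submitted while staying within the 20-simultaneous-jobs limit listed above, the hypothetical helper below builds a SLURM job-array script. The `--array`, `--ntasks`, `--cpus-per-task`, and `--output` directives are standard `sbatch` options; the function name, the job name, and the `simulate.py` command are illustrative assumptions, not part of the cluster's documented setup. The maximum array size depends on the cluster's configuration.

```python
def slurm_array_script(job_name, command, n_tasks, max_concurrent=20, cpus_per_task=1):
    """Return the text of an sbatch script for a throttled job array.

    Hypothetical helper: the names and defaults are illustrative only.
    """
    if n_tasks < 1:
        raise ValueError("n_tasks must be at least 1")
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        # The %N suffix caps how many array tasks run at the same time.
        f"#SBATCH --array=0-{n_tasks - 1}%{max_concurrent}",
        "#SBATCH --ntasks=1",
        f"#SBATCH --cpus-per-task={cpus_per_task}",
        # %x = job name, %A = array job ID, %a = array task index.
        "#SBATCH --output=%x_%A_%a.out",
        "",
        # Each task receives its own index as the last argument.
        f'{command} "$SLURM_ARRAY_TASK_ID"',
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Assumed example: 1,000 single-core runs of a hypothetical simulate.py.
    print(slurm_array_script("sim", "python simulate.py", 1000))
```

The printed text can be saved to a file and submitted with `sbatch`. A researcher who needs many cores for a single job would instead raise `cpus_per_task` and set `n_tasks` to 1.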

Available Software:

Python
R
Julia
FORTRAN
Perl
PHP
Stata
SAS
JupyterLab

Our computing cluster primarily handles batch jobs and accommodates a diverse range of research needs. Some researchers run jobs that need many cores, while others submit thousands of single-core jobs. To serve this varied user base, we migrated our data to Google Cloud in 2020 and replicated our servers using virtual machines.

Tony then initiated a one-on-one training program to guide researchers in using the new cluster. However, this process proved time-consuming. After only two weeks of training, with just seven of the fifty users transitioned, the estimated monthly Google Cloud costs reached $17,000. Tony promptly alerted the manager about this escalating expense and was told to shut down the project immediately.
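
To put the $17,000 figure in perspective, the rough arithmetic below extrapolates it to all fifty users. This is only a sketch: it assumes cost scales linearly with the number of users, which the source does not state, and it ignores fixed costs such as storage that would not grow per user. The function name is hypothetical.

```python
def project_monthly_cost(current_cost, users_now, users_total):
    """Naive linear projection of monthly cost (assumption: cost per user is constant)."""
    per_user = current_cost / users_now
    return per_user * users_total


if __name__ == "__main__":
    per_user = 17_000 / 7
    projected = project_monthly_cost(17_000, 7, 50)
    print(f"{per_user:,.0f} per user, {projected:,.0f} for all users")
    # prints "2,429 per user, 121,429 for all users"
```

Under that assumption, a full migration would have cost on the order of $120,000 per month.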

Tony still runs other projects on Google Cloud and works with research groups to use services that are only available in the cloud, such as BigQuery.

Browser-Based Jupyter Environment

https://jupyter.haastech.org

Access Python, R, Julia, and Stata through a familiar Jupyter interface:

  • Browser-based access from any device
  • User-friendly interface
  • No SSH or FastX required
  • Minimal learning curve

GPU Computing Environment

https://engineering.haas.berkeley.edu

Specialized GPU-enabled JupyterHub featuring:

  • Tesla K80 GPUs
  • 24GB GPU RAM per unit
  • 4000 GPU cores per unit
  • Optimized for research computing

Tutorial Platform

https://tutorials.haas.berkeley.edu

A collaborative platform developed by Richard Huntsinger, Thomas Lee, and Tony Cricelli offering:

  • Interactive tutorials for Jupyter notebooks
  • R and Python kernel examples
  • Preparation materials for courses
  • Persistent storage for student work

Custom Research Hubs

https://jupyterhub.haas.berkeley.edu

These are custom research JupyterHubs used by research groups who prefer the Jupyter platform:

  • Richard Huntsinger https://rah.haas.berkeley.edu
  • Heather Haveman https://h2.haastech.org
  • Tom Lee https://fcba.haastech.org
  • Xiao-Jun Zhang https://ba229.haastech.org

Nielsen Data Available on the HPC Cluster

Tony Cricelli has been appointed as the **Data Steward** for the Nielsen Data. In this role, Tony is responsible for:

  • **Data Transfer:** Using Globus, Tony will securely download the Nielsen Data to a designated storage location.
  • **Data Security:** Implementing robust security measures to protect the data's confidentiality and integrity.
  • **Data Accessibility:** Setting up shared folders and access controls to ensure that only authorized users can access the data. This will prevent unnecessary data duplication and streamline the workflow for researchers. Currently, between the compressed and uncompressed data, the Nielsen data consumes 43TB, not including researcher data.

The Nielsen data available on the cluster include:

  • **Consumer Panel Data:** The Consumer Panel Data comprise a representative panel of households that continually provide information about their purchases in a longitudinal study in which panelists stay on as long as they continue to meet Nielsen’s criteria. Nielsen consumer panelists use in-home scanners to record all of their purchases (from any outlet) intended for personal, in-home use. Consumers provide information about their households and what products they buy, as well as when and where they make purchases.
  • **PanelViews Surveys:** Complementary to the Consumer Panel Data, the PanelViews surveys contain additional data about households and their members.
  • **Retail Scanner Data:** Retail Scanner Data consist of weekly pricing, volume, and store environment information generated by point-of-sale systems from more than 90 participating retail chains across all US markets.

Open OnDemand (OOD)

Tony is working with Charles on a new test cluster to set up Open OnDemand services. The plan is to set up and test a small cluster where users can log in and request resources. If the resources are available locally, the requested program will be launched there. If not (for example, a request for 100,000 GPU cores), the user will be connected to a cloud service that has the resources, and the job will be launched there. We will thoroughly investigate how to budget for users to prevent runaway costs on the cloud.

Currently, the cluster on Google Cloud, with no users, is going to cost about $3,000 per month. As we add storage and users, the cost will rise quickly. We are also investigating other cloud providers, such as Azure and AWS. In general, GCP seems to be the least expensive, but comparing like to like is difficult because of different naming conventions and services offered.
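
The local-first plan with a per-user budget guard can be sketched roughly as follows. This is not Open OnDemand's API or the team's implementation; the class names, the `route` function, the capacity numbers, and the budget table are all hypothetical, and the sketch only illustrates the decision order described above: run locally if the request fits, otherwise go to the cloud only if the user's remaining budget covers the estimated cost.

```python
from dataclasses import dataclass


@dataclass
class Request:
    user: str
    cpu_cores: int
    gpu_cores: int
    est_cloud_cost: float  # estimated dollars if the job runs in the cloud


@dataclass
class LocalCapacity:
    free_cpu_cores: int
    free_gpu_cores: int


def route(request, local, remaining_budget):
    """Return 'local', 'cloud', or 'denied' for a resource request.

    remaining_budget maps a user name to the dollars they may still spend.
    """
    fits_locally = (
        request.cpu_cores <= local.free_cpu_cores
        and request.gpu_cores <= local.free_gpu_cores
    )
    if fits_locally:
        return "local"
    # Unknown users get a zero budget, so they can never trigger cloud spend.
    budget = remaining_budget.get(request.user, 0.0)
    if request.est_cloud_cost <= budget:
        return "cloud"
    return "denied"


if __name__ == "__main__":
    local = LocalCapacity(free_cpu_cores=64, free_gpu_cores=4000)
    budgets = {"alice": 50.0}  # assumed per-user allowance in dollars
    print(route(Request("alice", 8, 0, 0.0), local, budgets))           # local
    print(route(Request("alice", 8, 100_000, 40.0), local, budgets))    # cloud
    print(route(Request("alice", 8, 100_000, 400.0), local, budgets))   # denied
```

Denying a request outright when the budget is exhausted is one possible policy; queuing the job locally until resources free up would be another.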

Kubernetes Cluster Insights

Our JupyterHub cluster has successfully served over 10 million requests since August 2023. The cluster often hosts multiple courses with over a hundred students each; the exact number of courses varies from semester to semester. It is likely that we have served well over 100 million requests since starting the service. The only significant outages were due to campus turning off power. Although these outages were scheduled, they happened in the middle of a semester. They were not under our control, and we did our best to notify users and keep them informed. The EWMBA folks are frequent users of our tutorials. We also allow anyone with a @berkeley.edu address to use our tutorials.

As a side note, we initially experimented with Google Cloud and AWS, but students found the performance unsatisfactory. The high cost of maintaining idle servers and the significant startup time (up to 5 minutes) negatively impacted the user experience, especially during live lectures. Cloud resources are better suited to targeted research projects: start a server, run the analysis, and turn the server off.