Three tiers of computational resources
- Your own computer
- Flexibility - you can install/run whatever you want, whenever you want
- Ease of use/familiarity (no need to learn how to to use linux/bash - although you may want to anyway)
- Resource limitations - most people don’t have hundreds of CPU cores on their home machine
- Wilke lab cluster
- Not quite as many resources as TACC, but definitely still respectable
- No job submission system means you can run jobs instantaneously
- No CPU hour limits
- Shared resource - you can accidentally screw up other people’s work if you’re not being careful!
- Texas Advanced Computing Center (TACC)
- Crazy large amount of resources (CPU, RAM, storage)
- Sooner or later if you do or plan to do any kind of high performance computing, you need to gain familiarity with job submission systems and large-scale clusters
- Navigating a large cluster has a bit of a learning curve
- All jobs must be submitted to a job scheduler, which means your jobs may not run immediately
- The upside to having a scheduler is that resource use is tightly controlled, so you’re less likely to take down the whole cluster or interfere with other user’s jobs
- Jobs also have time limits - 48 hrs max, without special permission
- Finite number of CPU hours allocated per year
Ask yourself this before deciding to scale up to any computing cluster?
Do I need to run this code / store this information?
- Access to a lot of computing resources is no excuse for a sloppy algorithm or for unnecessary work
- Parkinson’s law is the adage that “work expands so as to fill the time available for its completion”. It is sometimes applied to the growth of bureaucracy in an organization
Wilke lab PODS
- Two compute servers (wilkcomp01 & wilkcomp02)
- Quad 14-core/28-thread CPUs (112 logical cores) per server
- 1 TB RAM per server
- Linux environment, ssh access
- Storage server (wilkstor01.ccbb.utexas.edu)
- 24 6-TB disks
- 144 TB raw, 80 TB usable
- No ssh access, but files can be transferred directly via scp/rsync
Set up a VPN (required to use the POD off campus):
Consult the pod wiki that is run by BRCF here for information about the pods.
Make an RCTF account here.
Once you have an account set up, issue the following command in the terminal of your computer, but replace
my_username with your UT EID.
Upon logging in you will find yourself located at
- This is your “home” folder
- Only you (and those with sudo access) can access files stored here
- Storage is limited to 100GB
- A short way of saying, do not do your work here
Navigate to the work directory
Make yourself a folder, this is where you should do the vast majority of your code writing, data storage, etc.
There are really only three main filesystems you need to worry about.
In general, most compute clusters (TACC included) are organized like this
- Only you can write/read files here
- Meant for small files - simple scripts, cron jobs, configuration files, etc
- 100 GB limit Also, some programs may write data here automatically, so be sure to check on your filesystem * usage periodically
Note: To check disk usage of any file/directory, run the following command
du -sh /path/to/your/file
Access these services via a browser at https://wilkcomp01.ccbb.utexas.edu/ or https://wilkcomp02.ccbb.utexas.edu/ (do not need VPN to access). RStudio Server provides a GUI with file system access, a text editor, and a terminal, all without VPN access.
To get a TACC account, start here
More information for specific TACC clusters can be found here
Run the following command to get info on TACC nodes/available resources
Use this to get info on submitted jobs including what node(s) a job is running on.
squeue -u your_username
Use this to access any node that you own (i.e. that a job of yours is running on, find via squeue) and monitor the job. Use htop to ensure that you are using the resources you expect and maximizing either the CPU or RAM of the node.
This can be helpful to test out code by requesting an interactive session on a compute node.
ALWAYS TEST YOUR JOB BEFORE REQUESTING MASSIVE RESOURCES!
A good rule of thumb is to do the following:
- Test a subset of data on your local machine or the lab cluster to troubleshoot your code.
- Test a subset of data on TACC to work out TACC related issues, e.g. resource usage, running in parallel, weird TACC bugs.
- Test a larger subset of data on TACC to make sure your scripts, analysis, and output are all correct. Monitor things carefully, try to capture edge cases.
- Run your full pipeline on all data but monitor output carefully.
You should also aim to max out either CPU or RAM on all nodes used as TACC allocations are based on node usage not individual CPU/RAM usage.