Start - Core benchmarks on 4 Aimos nodes, 6 GPUs/node
Middle - Extended model training with checkpointing and "Turing test" validations against standard benchmarks.
Conclusion - Initial paper submitted to NeurIPS and plans for extended experiments explored. Code and models will be released as fully open source, professionally packaged and documented, and available to any research group.
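As an illustration of the checkpointing mentioned in the middle milestone, here is a minimal sketch of a save-and-resume training loop. It is not the project's code: the function names, the JSON file format, and the running-sum "state" standing in for model weights are all assumptions made for the example, and it uses only the Python standard library.

```python
import json
import os
import tempfile


def save_checkpoint(path, step, state):
    """Write the checkpoint atomically so a crash never leaves a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp_path, path)  # atomic rename on the same filesystem


def load_checkpoint(path):
    """Return (step, state), or (0, None) when no checkpoint exists yet."""
    if not os.path.exists(path):
        return 0, None
    with open(path) as f:
        data = json.load(f)
    return data["step"], data["state"]


def train(path, total_steps, checkpoint_every):
    """Toy loop: 'state' is a running sum standing in for model weights (assumption)."""
    step, state = load_checkpoint(path)
    state = 0 if state is None else state
    while step < total_steps:
        step += 1
        state += step  # placeholder for one optimizer step
        if step % checkpoint_every == 0 or step == total_steps:
            save_checkpoint(path, step, state)
    return step, state
```

Calling `train` a second time with a larger `total_steps` picks up from the last saved step instead of starting over, which is the behavior a long job on a shared cluster would likely need after a preemption or time limit.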
{Empty}
A student with interest in ML and some programming experience would be best.
{Empty}
Can work with any level
{Empty}
MIT
Cambridge, Massachusetts. 02139
NE-MGHPCC
04/29/2021
Yes
Already behind; start date is flexible
One month
{Empty}
{Empty}
{Empty}
{Empty}
{Empty}
Tools, which will be openly published via GitHub.
At least two
1. NeurIPS
2. TBD
various ML tools
PyTorch, NCCL, Megatron, DeepSpeed
performance profiling
profiling of GPU code on the Aimos and MGHPCC Satori systems
{Empty}
This will be a collaboration that pairs some of the most energy-efficient systems with models traditionally seen as very resource-hungry.
4 Aimos nodes, 6 GPUs per node.
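For reference, the requested allocation works out to 4 × 6 = 24 GPUs. The sketch below shows how that layout maps to one process per GPU in a distributed job. The node-major rank numbering is a common convention and is assumed here rather than taken from the project; the function names are hypothetical.

```python
NODES = 4          # from the resource request above
GPUS_PER_NODE = 6  # from the resource request above


def world_size(nodes=NODES, gpus_per_node=GPUS_PER_NODE):
    """Total number of processes when one process drives each GPU."""
    return nodes * gpus_per_node


def global_rank(node_rank, local_rank, gpus_per_node=GPUS_PER_NODE):
    """Node-major numbering (assumed): node 0 holds ranks 0-5, node 1 holds 6-11, ..."""
    if not 0 <= local_rank < gpus_per_node:
        raise ValueError("local_rank out of range")
    return node_rank * gpus_per_node + local_rank


def locate(rank, gpus_per_node=GPUS_PER_NODE):
    """Inverse mapping: global rank -> (node_rank, local_rank)."""
    return divmod(rank, gpus_per_node)
```

Under these assumptions the last GPU on the last node is global rank 23, and rank 13 sits on node 2 as local GPU 1.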
There are two student facilitators (Alex Andonian and David Bau) and one mentor (John Cohn) not currently in the Cyberteams system.