Using the Mammouth Ms2 and Mp2 super computers: steps to follow

To get a working account on Sherbrooke's Mammouth Ms2 or Mp2 super computers, you must go through the following steps:

  • Create an account on the Compute Canada platform by completing the online form: https://ccdb.computecanada.ca/account_application.
    When asked for the Compute Canada Role Identifier of your sponsor, enter ycy-622-03.

  • I will then be able to grant you access to the Mammouth Ms2 and/or Mp2 super computer. This step can take two or three days, after which you will receive an email confirming your access to the super computer. The email will also specify how to connect to the super computer using the ssh program. The command should look something like this:

    ssh USERNAME@larochel-ms.ccs.usherbrooke.ca # for Ms2
    ssh USERNAME@larochel-mp2.ccs.usherbrooke.ca # for Mp2
                
    where USERNAME is the username assigned to you for your account. This will connect you to an interactive node, on which you can install software for your account and debug your code. However, never run long experiments on the interactive node: you need to use jobdispatch to launch such experiments on the compute nodes of the super computer (see below).

  • The first time you connect to the interactive node, you'll need to modify your ~/.bashrc file by adding the following lines to it (you can use the emacs program to edit the file):

    # User specific aliases and functions
    alias rm='rm -i'
    alias cp='cp -i'
    alias mv='mv -i'
    
    # To look at the ms queue
    export BQMAMMOUTH=ms
    #export BQMAMMOUTH=mp2 # use this line instead if on Mp2
    
    # Add Python
    #module add python64/2.6.4
    module add python64/2.7.1
    export CPATH=$CPATH:/opt/python64/2.7.1/lib/python2.7/site-packages/numpy/core/include
    
    # Add boost
    module add boost64/1.38.0
    
    # Add lapack
    module add lapack64
    module add mkl64/10.1.3.027
    
    # For job launching
    PATH=$PATH:/home/laroche1/software/Jobman/bin
    export PYTHONPATH=${PYTHONPATH}:/home/laroche1/software/Jobman
    

  • You'll also need to install MLPython in your account, as you would on your own computer. Simply follow the installation instructions, except those regarding the installation of other libraries required by MLPython (they should already be installed).

  • Say you have a Python script run_nnet.py which can be run as follows:

    python run_nnet.py 0.01 0 [20,10] 0 0 1234 True
                  
    You would now like to run the same script but with different values of its arguments. To launch such experiments (jobs) on the Mammouth Mp2's compute nodes, you can use the jobdispatch program, as follows:

    jobdispatch --bqtools --queue=qwork@mp2 python run_nnet.py '{{0.01,0.001}}' 0 ['{{20,10}}','{{10,40}}'] 0 0 1234 True
                  
    In this example, 8 jobs (2 x 2 x 2 = 8) will be launched, one for each combination of values drawn from the three sets written as '{{A,B,C...}}'.

    To launch experiments on Ms2, remove the --queue=qwork@mp2 option from the command line in the example above.

    Log files for each job will be written to a directory named LOGS (created the first time you use jobdispatch). Each use of jobdispatch gets its own subdirectory within LOGS. Each job will have its own ID.out (standard output) and ID.err (standard error) file, where ID is a number assigned to that job.

    You can then use the bqwatch command to track whether your batch of jobs is still running. On Mp2, bqwatch automatically groups the jobs in batches of up to 24 jobs, so don't be surprised if it displays a smaller number than the total number of jobs.

    Finally, if you launch jobs, then realize there is a bug in your code and want to kill them, use the bqkill command. Just run bqkill and a prompt will ask which of your currently running batches of jobs you'd like to kill.
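
To check which jobs a given set of '{{A,B,C...}}' arguments stands for before launching anything, the combinations can be enumerated locally. The Python sketch below is only an illustration of that expansion, not jobdispatch's actual implementation: the expand_sets function and SET_PATTERN name are hypothetical, and the order in which jobdispatch itself enumerates the jobs may differ. The single quotes from the command line are left out of the template because the shell removes them before the program sees its arguments.

```python
import itertools
import re

# Matches one '{{A,B,...}}' set, the notation used in the jobdispatch example.
# (Hypothetical helper for illustration; not part of jobdispatch.)
SET_PATTERN = re.compile(r"\{\{(.*?)\}\}")


def expand_sets(template):
    """Return every command obtained by picking one value from each {{...}} set."""
    # re.split with a single capturing group alternates literal text and
    # set contents: [text, set, text, set, ..., text]
    parts = SET_PATTERN.split(template)
    literals = parts[0::2]
    choices = [part.split(",") for part in parts[1::2]]

    commands = []
    for combo in itertools.product(*choices):
        pieces = [literals[0]]
        for value, literal in zip(combo, literals[1:]):
            pieces.append(value)
            pieces.append(literal)
        commands.append("".join(pieces))
    return commands


template = "python run_nnet.py {{0.01,0.001}} 0 [{{20,10}},{{10,40}}] 0 0 1234 True"
commands = expand_sets(template)
for command in commands:
    print(command)
print(len(commands), "jobs")
```

Running it lists the 8 commands of the example, one per line. Each of them can also be run directly on the interactive node as a short debugging test before handing the whole batch to jobdispatch.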

You can find more information and documentation on Mammouth here.

Be warned that your account is to be used only during the course's trimester and only as part of the assignments/project of this course. It will be deleted after the course.