[MAP] Using G4beamline on hopper.nersc.gov

Tom Roberts tjrob at muonsinc.com
Thu Jan 17 13:12:45 EST 2013


There were two related, subtle, but potentially devastating bugs in the MPI timing 
of G4beamline 2.14. I believe they are fixed in 2.14a, which I have installed in 
the MAP directory on hopper.nersc.gov. This is the only difference between 2.14 
and 2.14a. (There remain some minor efficiency questions.)

I intend to write a document for MAP titled "Using G4beamline on Hopper". Until 
I complete it, here are a few pointers to get you started.

* You need a login on Hopper; request one from Rob Ryne:
   mailto:rdryne at lbl.gov

* For more information on Hopper and NERSC, go to http://nersc.gov .
   In particular you can learn about the physical architecture of Hopper,
   and why jobs normally use an integer multiple of 24 cores.

* It's best to work in /project/projectdirs/map/Users/yourname -- create
   the "yourname" directory (once). It's all too easy to exceed the quota
   on your $HOME. This also fosters sharing among members of MAP.

* It's best to use "umask 002" and group "map" -- then all MAP users
   can look at your files. In particular, this permits Rob and me to
   see them for debugging. The Users directory has its SetGID bit set,
   which makes new files and directories inherit its group id (map).
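   As a one-time setup, something like this should do it (a minimal
   sketch; replace "yourname" with your own directory name, and put the
   umask line in your shell startup file so it persists):
     umask 002
     mkdir /project/projectdirs/map/Users/yourname
     ls -ld /project/projectdirs/map/Users/yourname   # group should show "map"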

* To set up the latest version of G4beamline, do this in a bash shell:
     source /project/projectdirs/map/bin/g4bl_latest.sh
   I will keep that updated as new versions are installed; currently 2.14a.
   Old versions will be kept as long as users are using them, but you
   will need to source the g4bl-setup.sh for the version you want.
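   For example, to pin a job to a specific version rather than the latest,
   source that version's setup script directly; the exact install path
   below is an assumption -- check the MAP project directory for the
   actual layout:
     source /project/projectdirs/map/G4beamline-2.14a/bin/g4bl-setup.sh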

* To use multiple cores via MPI (the whole point of using Hopper), the
   script is very similar to the g4bl script:
      g4blmpi 24 input.file name=value [...]
   where 24 is the total number of cores to use.
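   For example, to run on 48 cores while overriding a parameter named
   nEvents that your input.file uses (a sketch):
      g4blmpi 48 input.file nEvents=50000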

* In MPI terminology, a process runs on a core, and each is identified
   uniquely by its "rank". Rank is an integer from 0 thru N-1, where N
   is the total number of processes (= #cores = #ranks). In G4beamline,
   the rank 0 process does not simulate events, but handles all I/O;
   ranks 1 thru N-1 are workers that simulate events, performing I/O
   via MPI messages to/from rank 0. This makes rank 0 a "serial
   bottleneck", but in practice it is OK. Note that each process has
   its own address space, and the only inter-process communication is
   via MPI messages.
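   Concretely (a sketch), asking for 24 cores gives one I/O rank plus 23
   workers:
      g4blmpi 24 input.file   # rank 0 handles I/O; ranks 1-23 simulate events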

* The "g4bl" script works as usual on a login node (without MPI); on
   Hopper you get a benign warning about LIBDMAPP. The "g4blmpi" script
   does NOT work on a login node -- submit a batch job.
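   For a quick serial test on a login node (a sketch; same input.file and
   name=value syntax as g4blmpi):
     g4bl input.file nEvents=1000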

* To submit a job on Hopper, first you need a batch file. Here is a basic
   one that uses the "debug" queue:
#!/bin/bash
#PBS -A map                 # charge the MAP repository
#PBS -q debug               # use the debug queue
#PBS -l mppwidth=24         # number of cores to reserve
#PBS -l walltime=00:15:00   # wall clock limit for the job
#PBS -j eo                  # merge stdout and stderr into one file
cd $PBS_O_WORKDIR  # cd to the directory in which "qsub" was executed
source /project/projectdirs/map/bin/g4bl_latest.sh
g4blmpi 24 input.file nEvents=2000 MPI_Debug=1 # make 24 = mppwidth

* Note the "mppwidth" value is the maximum number of cores the job can
   use, and "walltime" is the limit on the wall clock time it can use.
   The job limits in the debug queue are 30 minutes and 12,288 cores.

* Once you have the desired batch file, submit it from the directory
   containing your input.file:
     qsub batch
   You can then see its progress in the queue:
     qs #  More detailed information is available via the "qstat" command
   You can also kill it:
     qsig --JobId--
   Output files will be placed into the same directory: stdout+stderr
   (from rank 0) are put into a file "batch.eJobID" ("batch" = name of
   file to qsub), worker outputs are "rankN.out", plus any data files
   written such as g4beamline.root.

* To see what queues are available, go to http://nersc.gov , ForUsers,
   QueuesAndPolicies.

* G4beamline has a new parameter "wallClockLimit" which is the limit on
   wall clock time in seconds. It will exit gracefully when this is
   reached. You can select a very large number of events and rely on
   this to end the job. NOTE: its value should be 5-10 minutes less
   than the walltime limit in the batch job (there are overheads, and
   closing up takes time).
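   For example, in a 30-minute debug job you might give G4beamline about
   20 minutes (a sketch; the very large nEvents just means "run until the
   clock limit is reached"):
     #PBS -l walltime=00:30:00
       ...
     g4blmpi 24 input.file nEvents=999999999 wallClockLimit=1200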

* Note that if G4beamline exceeds the walltime limit in the batch file,
   Hopper's operating system will kill it, and you will probably lose
   the output files. So make sure your job will exit on its own.

* If you need to add your own code to G4beamline, do this on a login
   node:
     source /project/projectdirs/map/bin/g4bl_latest.sh
     cd MyDir      # where your .cc files are located
     g4blmake      # build a g4beamline executable that includes your code
     export G4BEAMLINE=/path/to/MyDir/g4beamline   # point the g4bl scripts at it
     ... now use the g4bl and g4blmpi scripts as usual.
   Once you have tested it, put the export command into your batch file.
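   In the batch file that amounts to inserting the export line before the
   g4blmpi command, e.g. (the path is the same placeholder as above):
     source /project/projectdirs/map/bin/g4bl_latest.sh
     export G4BEAMLINE=/path/to/MyDir/g4beamline
     g4blmpi 24 input.file nEvents=2000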

* You probably don't know how many cores you should use. Determining
   that is more of an art than a science. Some pointers:
   - G4beamline I/O cannot handle more than about 5,000-10,000 events
     per second, putting an upper limit on the useful number of cores.
   - First run your simulation using "g4bl" on a login node. Let it
     run for a few minutes and see how many events per second it can
     simulate. ^C out of it after you have an answer.
    - Then start in the debug queue. Using your knowledge of ev/sec,
      adjust the #cores and #events (or wallClockLimit) to run for about
      5 minutes (not counting startup and closeup); a scan sketch follows
      this list.
      + first use 24 cores
      + then use 48, 96, 192, ... cores
      + Plot #events/sec vs #cores as you go, and you'll see where the
        graph saturates.
       + Use a number of cores 20-25% below saturation.
       + If your initial test gives 0.1 ev/sec or less, you can start
         with 96 cores or so.
      + I have run tests using 10,000 cores; the xbig queue has a limit of
        146,400 cores. You probably won't need more than a few thousand.
       + As you might expect, using more cores or more wall clock time
         will make your job wait longer in the queue; how long depends,
         of course, on the other jobs in the system.
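   Here is one way to automate that scan, as a sketch only -- it assumes
   the batch file shown above is named "batch" and that the core count
   appears only in the mppwidth and g4blmpi lines:
     for N in 24 48 96 192; do
         sed -e "s/mppwidth=24/mppwidth=$N/" \
             -e "s/g4blmpi 24/g4blmpi $N/" batch > batch.$N
         qsub batch.$N
     done
     # when the jobs finish, compare events/sec for each N to find the
     # point where the curve saturates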

* By default, each worker writes its stdout/stderr to the file rankN.out.
   Once you know it is working OK, you can suppress this by
     param MPI_WorkerOutput=0    # or on the g4blmpi command line
   Note that exceptions and errors in the workers are printed (once) to
   stdout (which also contains the output from rank 0). In particular,
   the exception summary at the end includes all workers plus rank 0.
   The other MPI parameters should probably be left alone.

* MPI also works on a multi-core Mac running Leopard, Snow Leopard,
   Lion, or (probably) Mountain Lion. See me for how to set it up.
   Ditto for a multi-core Linux system. G4beamline does not support
   MPI on Windows.

* For CY2013, MAP has an allocation of 4.5 million CPU-hours. That means
   that running a handful of 5-10 minute jobs to determine the number of
   cores is in the noise. It is worthwhile to ensure that a large job
   will work properly.


Good luck! Let me know how it goes.


Tom Roberts

