Here is my situation:
2 Dell R900s with 16 CPUs each and 64 GB RAM
OS: SuSE SLES 10 SP2 patched up to date
R version 2.9.1
Rmpi version 0.5-7
snow version 0.3-3
maanova library version 1.14.0
openmpi version 1.3.3
slurm version 2.0.3
With a given set of R code, we get abnormal exits when using 14 or fewer
CPUs. When using 15 or more, the job completes normally.
The error occurs during the array permutations and is a variation on:
[pdp-dev-r01:22618] [[15549,1],0] routed:binomial: Connection to lifeline [[15549,0],0] lost
Increasing the number of permutations above 200 produces similar results.
The R code is executed through sbatch. A typical command line for 14 CPUs is:
sbatch -n 14 -i ./Rtest.txt --mail-type=ALL --mail-user=steven_dale@hc-sc.gc.ca /usr/local/bin/R --no-save
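For anyone who wants to reproduce the threshold, here is a minimal sketch of how the same job could be generated for several task counts. It only prints the sbatch commands (a dry run) and does not submit anything. The helper name build_cmd and the choice of counts 13 to 16 are my own assumptions for illustration; the options and paths are copied from the command line above.

```shell
#!/bin/sh
# Hypothetical helper: print (not submit) the sbatch command for a given
# task count. Options and paths are copied from the command line above.
build_cmd() {
    printf 'sbatch -n %s -i ./Rtest.txt --mail-type=ALL --mail-user=steven_dale@hc-sc.gc.ca /usr/local/bin/R --no-save\n' "$1"
}

# Dry run over task counts on either side of the observed 14/15 boundary.
# The range 13..16 is an assumption, chosen only to bracket the threshold.
for n in 13 14 15 16; do
    build_cmd "$n"
done
```

Piping the printed lines to sh would actually submit the jobs, so each printed command should be checked first.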
Config.log, ompi_info, Rscript.txt and slurm outputs are attached. The
network is Gigabit Ethernet over copper, using TCP/IP.
I suspect this is an openmpi bug because of the routed:binomial message.
We saw the same results with openmpi-1.3.2, R 2.9.0, maanova 1.12 and
slurm 2.0.1.
No non-default MCA parameters are set.
LD_LIBRARY_PATH=/usr/local/lib.
Configuration done with defaults.
Any ideas are welcome.
____________________
Steve Dale