Here is my situation:
2 Dell R900s with 16 CPUs each and 64 GB RAM
OS: SuSE SLES 10 SP2 patched up to date
R version 2.9.1
Rmpi version 0.5-7
snow version 0.3-3
maanova library version 1.14.0
openmpi version 1.3.3
slurm version 2.0.3
With a given set of R code, we get abnormal exits when using 14 or fewer
CPUs. When using 15 or more, the job completes normally.
The error, which occurs during the array permutations, is a variation on:
[pdp-dev-r01:22618] [[15549,1],0] routed:binomial: Connection to lifeline [[15549,0],0] lost
Increasing the number of permutations above 200 also produces the same
kind of failure.
The R code is executed with a command line like the following (here for
14 CPUs):
sbatch -n 14 -i ./Rtest.txt --mail-type=ALL --mail-user=steven_dale_at_[hidden] /usr/local/bin/R --no-save
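For anyone who wants to reproduce the 14-versus-15 boundary, here is a rough sketch of how the same job could be submitted at several task counts. The loop, the submit helper and the DRY_RUN variable are only illustrative and are not part of our actual setup. The mail options are left out for brevity, and the paths are the ones from the command above.

```shell
#!/bin/sh
# Sketch only: submit the same R job at task counts around the 14/15 boundary.
# DRY_RUN is a hypothetical switch; it defaults to 1 so commands are printed,
# not submitted. Set DRY_RUN=0 to actually call sbatch.
DRY_RUN=${DRY_RUN:-1}

submit() {
    # $1 = number of tasks passed to sbatch -n
    if [ "$DRY_RUN" = "1" ]; then
        echo "sbatch -n $1 -i ./Rtest.txt /usr/local/bin/R --no-save"
    else
        sbatch -n "$1" -i ./Rtest.txt /usr/local/bin/R --no-save
    fi
}

# 13 and 14 are expected to fail, 15 and 16 to complete, per the report above.
for n in 13 14 15 16; do
    submit "$n"
done
```

Comparing the slurm output files from the 14-task and 15-task runs side by side should show where the two diverge.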
Config.log, ompi_info, Rscript.txt and slurm outputs are attached. The
network is Gigabit Ethernet over copper, using TCP/IP.
I believe this is an openmpi error/bug because of the routed:binomial
message. We saw the same results with openmpi-1.3.2, R 2.9.0,
maanova 1.12 and slurm 2.0.1.
No non-default MCA parameters are set.
LD_LIBRARY_PATH=/usr/local/lib.
Configuration done with defaults.
Any ideas are welcome.
____________________
Steve Dale