Open MPI User's Mailing List Archives

Subject: [OMPI users] Possible openmpi bug?
From: Steven Dale (steven_dale_at_[hidden])
Date: 2009-07-16 13:15:33


Here is my situation:

2 Dell R900s with 16 CPUs each and 64 GB RAM
OS: SuSE SLES 10 SP2 patched up to date
R version 2.9.1
Rmpi version 0.5-7
snow version 0.3-3
maanova library version 1.14.0
openmpi version 1.3.3
slurm version 2.0.3

With a given set of R code, we get abnormal exits when using 14 or fewer
CPUs. When using 15 or more, the job completes normally.
The error is a variation on:

[pdp-dev-r01:22618] [[15549,1],0] routed:binomial: Connection to lifeline [[15549,0],0] lost

during the array permutations.

Increasing the number of permutations above 200 produces similar
errors.

The R code is run through sbatch; a typical command line for 14 CPUs is:

sbatch -n 14 -i ./Rtest.txt --mail-type=ALL --mail-user=steven_dale_at_[hidden] /usr/local/bin/R --no-save
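
For anyone trying to reproduce the 14-versus-15 boundary, a dry-run sketch along these lines might help. It only prints the sbatch commands rather than submitting them. The helper name print_sweep, the -J job names and the choice of task counts are illustrative assumptions, not part of my actual setup, and the mail options are left out for brevity.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original setup): print, but do not
# submit, one sbatch command per task count given as an argument.
# -n, -i and -J are standard sbatch options; the job names are made up.
print_sweep() {
    for n in "$@"; do
        echo "sbatch -n $n -i ./Rtest.txt -J rtest-n$n /usr/local/bin/R --no-save"
    done
}

# Task counts on either side of the observed failure boundary.
print_sweep 13 14 15 16
```

Removing the echo would submit the jobs for real. The per-job names should make it easier to match each slurm output file to its task count.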

Config.log, ompi_info, Rscript.txt and slurm outputs are attached. The
network is Gigabit Ethernet over copper, using TCP/IP.

I believe this is an Open MPI error/bug because of the routed:binomial
message. The same failure also occurred with openmpi-1.3.2, R 2.9.0,
maanova 1.12 and slurm 2.0.1.

No non-default MCA parameters are set.

LD_LIBRARY_PATH is set to /usr/local/lib

Configuration done with defaults.

Any ideas are welcome.

____________________
Steve Dale