Dunno. Do lower np values succeed? If so, at what value of np does
the job no longer start?
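One way to find that threshold is to sweep np upward and stop at the first failure. This is a rough sketch, not a tested recipe: it assumes the executable is called job as in the original post and that a nonzero mpirun exit status means the job did not start. The names first_failing_np and MPIRUN are made up for illustration.

```shell
#!/bin/sh
# Launcher to use; override MPIRUN to point at a specific Open MPI install.
MPIRUN=${MPIRUN:-mpirun}

# Print the smallest np in 1..max for which the launch fails,
# or "none" if every process count starts successfully.
# "job" is the executable name taken from the original post.
first_failing_np() {
    max=$1
    np=1
    while [ "$np" -le "$max" ]; do
        # Same BTL selection as the failing command line.
        if ! $MPIRUN -np "$np" -mca btl self,sm job >/dev/null 2>&1; then
            echo "$np"
            return 0
        fi
        np=$((np + 1))
    done
    echo "none"
}
```

Calling first_failing_np 16 prints the smallest failing np, or "none" if every count up to 16 starts.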
Perhaps it's having a hard time creating the shared-memory backing file
in /tmp. I think this is a 64-Mbyte file. If this is the case, try
reducing the size of the shared area per this FAQ item:
http://www.open-mpi.org/faq/?category=sm#decrease-sm
Most notably, reduce mpool_sm_min_size below 67108864.
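To check that theory, something like the following compares the free space in /tmp against the 64-Mbyte figure and prints a retry command with a smaller shared area. It is only a sketch: the 32-Mbyte value is an arbitrary choice for illustration, free_kb is a hypothetical helper, and it assumes the backing file really does live under /tmp.

```shell
#!/bin/sh
# 64 Mbytes in bytes: the mpool_sm_min_size figure mentioned above.
DEFAULT_MIN=$((64 * 1024 * 1024))   # 67108864

# Free space in a directory (default /tmp), in KiB, from POSIX-format df output.
free_kb() {
    df -Pk "${1:-/tmp}" | awk 'NR == 2 { print $4 }'
}

# Warn if /tmp cannot hold a backing file of the default size.
need_kb=$((DEFAULT_MIN / 1024))
if [ "$(free_kb /tmp)" -lt "$need_kb" ]; then
    echo "/tmp has less than 64 Mbytes free"
fi

# Smaller shared area to try; 32 Mbytes is an assumption, not a recommendation.
SMALLER=$((32 * 1024 * 1024))       # 33554432

# Print the command to retry with (it is not executed here).
echo mpirun -np 16 -mca btl self,sm -mca mpool_sm_min_size "$SMALLER" job
```

If the job starts with the smaller value, the backing file size is the likely culprit.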
Also note trac ticket 2043, which describes problems with the sm BTL
exposed by GCC 4.4.x compilers. You need a sufficiently recent build
to fix that. Those problems, however, don't occur until you start
passing messages, and here the job isn't even starting up.
Nicolas Bock wrote:
Sorry, I forgot to give more details on what versions I am using:
OpenMPI 1.4
Ubuntu 9.10, kernel 2.6.31-16-generic #53-Ubuntu
gcc (Ubuntu 4.4.1-4ubuntu8) 4.4.1
On Fri, Jan 15, 2010 at 15:47, Nicolas Bock <nicolasbock@gmail.com> wrote:
Hello list,
I am running a job on a machine with four quad-core AMD Opterons. It has 16
cores, which I can verify by looking at /proc/cpuinfo. However, when I
run a job with
mpirun -np 16 -mca btl self,sm job
I get this error:
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications. This means that no Open MPI device has indicated
that it can be used to communicate between these processes. This is
an error; Open MPI requires that all MPI processes be able to reach
each other. This error can sometimes be the result of forgetting to
specify the "self" BTL.
Process 1 ([[56972,2],0]) is on host: rust
Process 2 ([[56972,1],0]) is on host: rust
BTLs attempted: self sm
Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
By adding the tcp BTL I can run the job. I don't understand why Open MPI
claims that a pair of processes cannot reach each other; all processor
cores should have access to all memory, after all. Do I need to set some
other BTL limit?