Hi, I'm trying to run an Open MPI 1.6.5 job across a set of nodes, some with Mellanox cards and some with QLogic cards.  The job fails with the error "At least one pair of MPI processes are unable to reach each other for MPI communications".  As far as I can tell, all of the nodes are properly configured and can reach each other over both IP and non-IP connections.

I've also found that I get the same error even when I turn off the IB transport with "--mca btl tcp,self".

The test works fine when I restrict it to hosts with identical IB cards.
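
The IP reachability mentioned above can be spot-checked with something like the following. This is only a minimal sketch: the function names tcp_reachable and check_hosts are hypothetical, and port 22 is an assumption that sshd is listening on every node. It tests a plain TCP connect and nothing more, so a pass here does not show that Open MPI's TCP BTL will pick the same interface or address.

```python
import socket


def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        # create_connection resolves the name and connects in one step
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers name-resolution failures, refused connections and timeouts
        return False


def check_hosts(hosts, port=22, timeout=3.0):
    """Map each host name to whether a TCP connect to `port` succeeded.

    port=22 is an assumption (sshd); use any port known to be open.
    """
    return {host: tcp_reachable(host, port, timeout) for host in hosts}
```

Calling check_hosts(["compute-g18-5", "compute-g17-33"]) on each node in turn, once with the short names and once with the fully qualified names, would show whether both forms of the host name resolve and connect from both sides.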

I'd appreciate some assistance in figuring out what I'm doing wrong.

 

Thanks,

Kevin

 

Here's a log of a failed run:

> mpirun -d --debug-daemons --mca btl tcp,self --mca orte_base_help_aggregate 0 --mca btl_base_verbose 100 -np 2 -machinefile foo.hosts /homes/kevin/alltoall.mpi-1.6.5

[compute-g18-5.deepthought.umd.edu:20574] procdir: /tmp/openmpi-sessions-kevin@compute-g18-5.deepthought.umd.edu_0/63142/0/0

[compute-g18-5.deepthought.umd.edu:20574] jobdir: /tmp/openmpi-sessions-kevin@compute-g18-5.deepthought.umd.edu_0/63142/0

[compute-g18-5.deepthought.umd.edu:20574] top: openmpi-sessions-kevin@compute-g18-5.deepthought.umd.edu_0

[compute-g18-5.deepthought.umd.edu:20574] tmp: /tmp

[compute-g18-5.deepthought.umd.edu:20574] mpirun: reset PATH: /cell_root/software/openmpi/1.6.5/gnu/4.8.1/threaded/sys/bin:/cell_root/software/openmpi/1.6.5/gnu/4.8.1/threaded/sys/bin:/cell_r      ftware/gcc/4.8.1/sys/bin:/cell_root/software/moab/bin:/cell_root/software/gold/bin:/usr/local/ofed/1.5.4/sbin:/usr/local/ofed/1.5.4/bin:/homes/kevin/bin:/homes/kevin/bin/amd64:/dept/oit/glue/      scripts:/usr/local/scripts:/usr/local/bin:/usr/bin:/bin:/sbin:/usr/sbin:/usr/afsws/bin:/usr/afsws/etc

[compute-g18-5.deepthought.umd.edu:20574] mpirun: reset LD_LIBRARY_PATH: /cell_root/software/openmpi/1.6.5/gnu/4.8.1/threaded/sys/lib:/usr/local/ofed/1.5.4/lib64

Daemon was launched on compute-g17-33.deepthought.umd.edu - beginning to initialize

[compute-g17-33.deepthought.umd.edu:20174] procdir: /tmp/openmpi-sessions-kevin@compute-g17-33.deepthought.umd.edu_0/63142/0/1

[compute-g17-33.deepthought.umd.edu:20174] jobdir: /tmp/openmpi-sessions-kevin@compute-g17-33.deepthought.umd.edu_0/63142/0

[compute-g17-33.deepthought.umd.edu:20174] top: openmpi-sessions-kevin@compute-g17-33.deepthought.umd.edu_0

[compute-g17-33.deepthought.umd.edu:20174] tmp: /tmp

Daemon [[63142,0],1] checking in as pid 20174 on host compute-g17-33.deepthought.umd.edu

[compute-g17-33.deepthought.umd.edu:20174] [[63142,0],1] orted: up and running - waiting for commands!

[compute-g18-5.deepthought.umd.edu:20574] [[63142,0],0] orted_cmd: received add_local_procs

[compute-g17-33.deepthought.umd.edu:20174] [[63142,0],1] node[0].name compute-g18-5 daemon 0

[compute-g17-33.deepthought.umd.edu:20174] [[63142,0],1] node[1].name compute-g17-33 daemon 1

[compute-g17-33.deepthought.umd.edu:20174] [[63142,0],1] orted_cmd: received add_local_procs

  MPIR_being_debugged = 0

  MPIR_debug_state = 1

  MPIR_partial_attach_ok = 1

  MPIR_i_am_starter = 0

  MPIR_forward_output = 0

  MPIR_proctable_size = 2

  MPIR_proctable:

    (i, host, exe, pid) = (0, compute-g18-5.deepthought.umd.edu, /homes/kevin/alltoall.mpi-1.6.5, 20576)

    (i, host, exe, pid) = (1, compute-g17-33, /homes/kevin/alltoall.mpi-1.6.5, 20175)

MPIR_executable_path: NULL

MPIR_server_arguments: NULL

[compute-g18-5.deepthought.umd.edu:20576] procdir: /tmp/openmpi-sessions-kevin@compute-g18-5.deepthought.umd.edu_0/63142/1/0

[compute-g18-5.deepthought.umd.edu:20576] jobdir: /tmp/openmpi-sessions-kevin@compute-g18-5.deepthought.umd.edu_0/63142/1

[compute-g18-5.deepthought.umd.edu:20576] top: openmpi-sessions-kevin@compute-g18-5.deepthought.umd.edu_0

[compute-g18-5.deepthought.umd.edu:20576] tmp: /tmp

[compute-g18-5.deepthought.umd.edu:20574] [[63142,0],0] orted_recv: received sync+nidmap from local proc [[63142,1],0]

[compute-g18-5.deepthought.umd.edu:20576] [[63142,1],0] node[0].name compute-g18-5 daemon 0

[compute-g18-5.deepthought.umd.edu:20576] [[63142,1],0] node[1].name compute-g17-33 daemon 1

[compute-g17-33.deepthought.umd.edu:20175] procdir: /tmp/openmpi-sessions-kevin@compute-g17-33.deepthought.umd.edu_0/63142/1/1

[compute-g17-33.deepthought.umd.edu:20175] jobdir: /tmp/openmpi-sessions-kevin@compute-g17-33.deepthought.umd.edu_0/63142/1

[compute-g17-33.deepthought.umd.edu:20175] top: openmpi-sessions-kevin@compute-g17-33.deepthought.umd.edu_0

[compute-g17-33.deepthought.umd.edu:20175] tmp: /tmp

[compute-g17-33.deepthought.umd.edu:20174] [[63142,0],1] orted_recv: received sync+nidmap from local proc [[63142,1],1]

[compute-g17-33.deepthought.umd.edu:20175] [[63142,1],1] node[0].name compute-g18-5 daemon 0

[compute-g17-33.deepthought.umd.edu:20175] [[63142,1],1] node[1].name compute-g17-33 daemon 1

[compute-g18-5.deepthought.umd.e:20576] mca: base: components_open: Looking for btl components

[compute-g18-5.deepthought.umd.e:20576] mca: base: components_open: opening btl components

[compute-g18-5.deepthought.umd.e:20576] mca: base: components_open: found loaded component self

[compute-g18-5.deepthought.umd.e:20576] mca: base: components_open: component self has no register function

[compute-g18-5.deepthought.umd.e:20576] mca: base: components_open: component self open function successful

[compute-g18-5.deepthought.umd.e:20576] mca: base: components_open: found loaded component tcp

[compute-g18-5.deepthought.umd.e:20576] mca: base: components_open: component tcp register function successful

[compute-g18-5.deepthought.umd.e:20576] mca: base: components_open: component tcp open function successful

[compute-g17-33.deepthought.umd.:20175] mca: base: components_open: Looking for btl components

[compute-g17-33.deepthought.umd.:20175] mca: base: components_open: opening btl components

[compute-g17-33.deepthought.umd.:20175] mca: base: components_open: found loaded component self

[compute-g17-33.deepthought.umd.:20175] mca: base: components_open: component self has no register function

[compute-g17-33.deepthought.umd.:20175] mca: base: components_open: component self open function successful

[compute-g17-33.deepthought.umd.:20175] mca: base: components_open: found loaded component tcp

[compute-g17-33.deepthought.umd.:20175] mca: base: components_open: component tcp register function successful

[compute-g17-33.deepthought.umd.:20175] mca: base: components_open: component tcp open function successful

[compute-g17-33.deepthought.umd.:20175] select: initializing btl component self

[compute-g17-33.deepthought.umd.:20175] select: init of component self returned success

[compute-g17-33.deepthought.umd.:20175] select: initializing btl component tcp

[compute-g17-33.deepthought.umd.:20175] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8

[compute-g17-33.deepthought.umd.:20175] btl: tcp: Found match: 127.0.0.1 (lo)

[compute-g17-33.deepthought.umd.:20175] select: init of component tcp returned success

[compute-g18-5.deepthought.umd.e:20576] mca: base: close: component self closed

[compute-g18-5.deepthought.umd.e:20576] mca: base: close: unloading component self

[compute-g18-5.deepthought.umd.e:20576] mca: base: close: component tcp closed

[compute-g18-5.deepthought.umd.e:20576] mca: base: close: unloading component tcp

[compute-g18-5.deepthought.umd.edu:20574] [[63142,0],0] orted_cmd: received message_local_procs

[compute-g17-33.deepthought.umd.edu:20174] [[63142,0],1] orted_cmd: received message_local_procs

--------------------------------------------------------------------------

It looks like MPI_INIT failed for some reason; your parallel process is

likely to abort.  There are many reasons that a parallel process can

fail during MPI_INIT; some of which are due to configuration or environment

problems.  This failure appears to be an internal failure; here's some

additional information (which may only be relevant to an Open MPI

developer):

 

  PML add procs failed

  --> Returned "Error" (-1) instead of "Success" (0)

--------------------------------------------------------------------------

*** An error occurred in MPI_Init

*** on a NULL communicator

*** MPI_ERRORS_ARE_FATAL: your MPI job will now abort

--------------------------------------------------------------------------

An MPI process is aborting at a time when it cannot guarantee that all

of its peer processes in the job will be killed properly.  You should

double check that everything has shut down cleanly.

 

  Reason:     Before MPI_INIT completed

  Local host: compute-g18-5.deepthought.umd.edu

  PID:        20576

--------------------------------------------------------------------------

--------------------------------------------------------------------------

At least one pair of MPI processes are unable to reach each other for

MPI communications.  This means that no Open MPI device has indicated

that it can be used to communicate between these processes.  This is

an error; Open MPI requires that all MPI processes be able to reach

each other.  This error can sometimes be the result of forgetting to

specify the "self" BTL.

 

  Process 1 ([[63142,1],1]) is on host: compute-g17-33.deepthought.umd.edu

  Process 2 ([[63142,1],0]) is on host: compute-g18-5

  BTLs attempted: self tcp

 

Your MPI job is now going to abort; sorry.

--------------------------------------------------------------------------

*** An error occurred in MPI_Init

*** on a NULL communicator

*** MPI_ERRORS_ARE_FATAL: your MPI job will now abort

--------------------------------------------------------------------------

MPI_INIT has failed because at least one MPI process is unreachable

from another.  This *usually* means that an underlying communication

plugin -- such as a BTL or an MTL -- has either not loaded or not

allowed itself to be used.  Your MPI job will now abort.

 

You may wish to try to narrow down the problem;

 

* Check the output of ompi_info to see which BTL/MTL plugins are

   available.

* Run your application with MPI_THREAD_SINGLE.

* Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,

   if using MTL-based communications) to see exactly which

   communication plugins were considered and/or discarded.

--------------------------------------------------------------------------

--------------------------------------------------------------------------

An MPI process is aborting at a time when it cannot guarantee that all

of its peer processes in the job will be killed properly.  You should

double check that everything has shut down cleanly.

 

  Reason:     Before MPI_INIT completed

  Local host: compute-g17-33.deepthought.umd.edu

  PID:        20175

--------------------------------------------------------------------------

[compute-g17-33.deepthought.umd.edu:20174] [[63142,0],1] orted_cmd: received waitpid_fired cmd

[compute-g17-33.deepthought.umd.edu:20174] [[63142,0],1] orted_cmd: received iof_complete cmd

[compute-g17-33.deepthought.umd.edu:20174] sess_dir_finalize: proc session dir not empty - leaving

[compute-g18-5.deepthought.umd.edu:20574] sess_dir_finalize: proc session dir not empty - leaving

[compute-g18-5.deepthought.umd.edu:20574] [[63142,0],0] orted_cmd: received iof_complete cmd

--------------------------------------------------------------------------

mpirun has exited due to process rank 1 with PID 20175 on

node compute-g17-33 exiting improperly. There are two reasons this could occur:

 

1. this process did not call "init" before exiting, but others in

the job did. This can cause a job to hang indefinitely while it waits

for all processes to call "init". By rule, if one process calls "init",

then ALL processes must call "init" prior to termination.

 

2. this process called "init", but exited without calling "finalize".

By rule, all processes that call "init" MUST call "finalize" prior to

exiting or it will be considered an "abnormal termination"

 

This may have caused other processes in the application to be

terminated by signals sent by mpirun (as reported here).

--------------------------------------------------------------------------

[compute-g18-5.deepthought.umd.edu:20574] [[63142,0],0] orted_cmd: received exit cmd

[compute-g17-33.deepthought.umd.edu:20174] [[63142,0],1] orted_cmd: received exit cmd

[compute-g17-33.deepthought.umd.edu:20174] [[63142,0],1] orted: finalizing

[compute-g18-5.deepthought.umd.edu:20574] sess_dir_finalize: job session dir not empty - leaving

[compute-g17-33.deepthought.umd.edu:20174] sess_dir_finalize: job session dir not empty - leaving

[compute-g18-5.deepthought.umd.edu:20574] [[63142,0],0] Releasing job data for [63142,0]

[compute-g18-5.deepthought.umd.edu:20574] [[63142,0],0] Releasing job data for [63142,1]

[compute-g18-5.deepthought.umd.edu:20574] sess_dir_finalize: proc session dir not empty - leaving

orterun: exiting with status 1
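
With btl_base_verbose set to 100 the output from the two ranks is interleaved, so it can help to group the BTL messages by process. The following is a minimal sketch, assuming the log has been saved to a text file and that the message formats match the lines shown above ("select: init of component X returned success" and "mca: base: close: component X closed"). The name summarize_btl is hypothetical, and the host part of the key is kept exactly as printed, including the truncated names.

```python
import re
from collections import defaultdict

# Prefix printed by Open MPI in the log above: [host:pid] message
PREFIX = re.compile(r"^\[(?P<host>[^:\]]+):(?P<pid>\d+)\]\s+(?P<msg>.*)$")
INIT_OK = re.compile(r"select: init of component (\w+) returned success")
CLOSED = re.compile(r"mca: base: close: component (\w+) closed")


def summarize_btl(lines):
    """Group BTL init/close messages by (host, pid) as printed in the log."""
    summary = defaultdict(lambda: {"initialized": [], "closed": []})
    for line in lines:
        m = PREFIX.match(line.strip())
        if m is None:
            continue  # help text, MPIR_* lines, blank lines, etc.
        key = (m.group("host"), int(m.group("pid")))
        msg = m.group("msg")
        hit = INIT_OK.search(msg)
        if hit:
            summary[key]["initialized"].append(hit.group(1))
            continue
        hit = CLOSED.search(msg)
        if hit:
            summary[key]["closed"].append(hit.group(1))
    return dict(summary)
```

Passing the lines of the saved log to summarize_btl returns one entry per process that printed an init or close message, listing which components reported a successful init and which were closed.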