
Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] ERROR: At least one pair of MPI processes are unable to reach each other for MPI communications.
From: RoboBeans (robobeans_at_[hidden])
Date: 2013-08-03 16:30:30


Thanks for looking into it, Ralph. I modified the hosts file but I am
still getting the same error. Any other pointers you can think of? The
difference between this 1.7.2 installation and the 1.5.4 one is that I
installed 1.5.4 using yum, while 1.7.2 I built from source, configured
with --enable-event-thread-support --enable-opal-multi-threads
--enable-orte-progress-threads --enable-mpi-thread-multiple.
Am I missing something here?
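
Based on the suggestions embedded in the error output itself, this is the
kind of narrowing-down I plan to try (a sketch, not a verified recipe; the
MCA settings are the ones the help text mentions, and ./test is my test
binary):

```shell
# See which BTL components this 1.7.2 build actually installed
ompi_info | grep btl

# Restrict the run to the tcp and self transports, bypassing openib,
# to check whether InfiniBand support differs between nodes
mpirun --mca btl tcp,self -np 10 --hostfile mpi_hostfile --bynode ./test

# Trace BTL selection to see which transports are considered
# and discarded on each node
mpirun --mca btl_base_verbose 100 -np 2 --hostfile mpi_hostfile ./test
```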

//******************************************************************

$ cat mpi_hostfile

x.x.x.22 slots=15 max-slots=15
x.x.x.24 slots=2 max-slots=2
x.x.x.26 slots=14 max-slots=14
x.x.x.28 slots=16 max-slots=16
x.x.x.29 slots=14 max-slots=14
x.x.x.30 slots=16 max-slots=16
x.x.x.41 slots=46 max-slots=46
x.x.x.101 slots=46 max-slots=46
x.x.x.100 slots=46 max-slots=46
x.x.x.102 slots=22 max-slots=22
x.x.x.103 slots=22 max-slots=22

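As a sanity check on the file itself, the declared slot totals can be
summed directly (a quick sketch; the contents below are copied verbatim
from the hostfile above and written to a temporary path purely for
illustration):

```shell
# Write the hostfile contents shown above to a temporary location
cat > /tmp/mpi_hostfile <<'EOF'
x.x.x.22 slots=15 max-slots=15
x.x.x.24 slots=2 max-slots=2
x.x.x.26 slots=14 max-slots=14
x.x.x.28 slots=16 max-slots=16
x.x.x.29 slots=14 max-slots=14
x.x.x.30 slots=16 max-slots=16
x.x.x.41 slots=46 max-slots=46
x.x.x.101 slots=46 max-slots=46
x.x.x.100 slots=46 max-slots=46
x.x.x.102 slots=22 max-slots=22
x.x.x.103 slots=22 max-slots=22
EOF

# Sum the slots= fields (max-slots= does not match the anchored pattern)
awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^slots=/) { sub("slots=", "", $i); total += $i } } END { print total }' /tmp/mpi_hostfile
# -> 259
```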
//******************************************************************
$ mpirun -d --display-map -np 10 --hostfile mpi_hostfile --bynode ./test
[SERVER-2:08907] procdir: /tmp/openmpi-sessions-mpidemo_at_SERVER-2_0/62216/0/0
[SERVER-2:08907] jobdir: /tmp/openmpi-sessions-mpidemo_at_SERVER-2_0/62216/0
[SERVER-2:08907] top: openmpi-sessions-mpidemo_at_SERVER-2_0
[SERVER-2:08907] tmp: /tmp
CentOS release 6.4 (Final)
Kernel \r on an \m
CentOS release 6.4 (Final)
Kernel \r on an \m
CentOS release 6.4 (Final)
Kernel \r on an \m
[SERVER-3:32517] procdir: /tmp/openmpi-sessions-mpidemo_at_SERVER-3_0/62216/0/1
[SERVER-3:32517] jobdir: /tmp/openmpi-sessions-mpidemo_at_SERVER-3_0/62216/0
[SERVER-3:32517] top: openmpi-sessions-mpidemo_at_SERVER-3_0
[SERVER-3:32517] tmp: /tmp
CentOS release 6.4 (Final)
Kernel \r on an \m
CentOS release 6.4 (Final)
Kernel \r on an \m
[SERVER-6:11595] procdir: /tmp/openmpi-sessions-mpidemo_at_SERVER-6_0/62216/0/4
[SERVER-6:11595] jobdir: /tmp/openmpi-sessions-mpidemo_at_SERVER-6_0/62216/0
[SERVER-6:11595] top: openmpi-sessions-mpidemo_at_SERVER-6_0
[SERVER-6:11595] tmp: /tmp
[SERVER-4:27445] procdir: /tmp/openmpi-sessions-mpidemo_at_SERVER-4_0/62216/0/2
[SERVER-4:27445] jobdir: /tmp/openmpi-sessions-mpidemo_at_SERVER-4_0/62216/0
[SERVER-4:27445] top: openmpi-sessions-mpidemo_at_SERVER-4_0
[SERVER-4:27445] tmp: /tmp
[SERVER-7:02607] procdir: /tmp/openmpi-sessions-mpidemo_at_SERVER-7_0/62216/0/5
[SERVER-7:02607] jobdir: /tmp/openmpi-sessions-mpidemo_at_SERVER-7_0/62216/0
[SERVER-7:02607] top: openmpi-sessions-mpidemo_at_SERVER-7_0
[SERVER-7:02607] tmp: /tmp
[sv-1:46100] procdir: /tmp/openmpi-sessions-mpidemo_at_sv-1_0/62216/0/8
[sv-1:46100] jobdir: /tmp/openmpi-sessions-mpidemo_at_sv-1_0/62216/0
[sv-1:46100] top: openmpi-sessions-mpidemo_at_sv-1_0
[sv-1:46100] tmp: /tmp
CentOS release 6.4 (Final)
Kernel \r on an \m
[SERVER-5:16404] procdir: /tmp/openmpi-sessions-mpidemo_at_SERVER-5_0/62216/0/3
[SERVER-5:16404] jobdir: /tmp/openmpi-sessions-mpidemo_at_SERVER-5_0/62216/0
[SERVER-5:16404] top: openmpi-sessions-mpidemo_at_SERVER-5_0
[SERVER-5:16404] tmp: /tmp
[sv-3:08575] procdir: /tmp/openmpi-sessions-mpidemo_at_sv-3_0/62216/0/9
[sv-3:08575] jobdir: /tmp/openmpi-sessions-mpidemo_at_sv-3_0/62216/0
[sv-3:08575] top: openmpi-sessions-mpidemo_at_sv-3_0
[sv-3:08575] tmp: /tmp
[SERVER-14:10755] procdir:
/tmp/openmpi-sessions-mpidemo_at_SERVER-14_0/62216/0/6
[SERVER-14:10755] jobdir: /tmp/openmpi-sessions-mpidemo_at_SERVER-14_0/62216/0
[SERVER-14:10755] top: openmpi-sessions-mpidemo_at_SERVER-14_0
[SERVER-14:10755] tmp: /tmp
[sv-4:12040] procdir: /tmp/openmpi-sessions-mpidemo_at_sv-4_0/62216/0/10
[sv-4:12040] jobdir: /tmp/openmpi-sessions-mpidemo_at_sv-4_0/62216/0
[sv-4:12040] top: openmpi-sessions-mpidemo_at_sv-4_0
[sv-4:12040] tmp: /tmp
[sv-2:07725] procdir: /tmp/openmpi-sessions-mpidemo_at_sv-2_0/62216/0/7
[sv-2:07725] jobdir: /tmp/openmpi-sessions-mpidemo_at_sv-2_0/62216/0
[sv-2:07725] top: openmpi-sessions-mpidemo_at_sv-2_0
[sv-2:07725] tmp: /tmp

  Mapper requested: NULL Last mapper: round_robin Mapping policy:
BYNODE Ranking policy: NODE Binding policy: NONE[NODE] Cpu set: NULL
PPR: NULL
      Num new daemons: 0 New daemon starting vpid INVALID
      Num nodes: 10

  Data for node: SERVER-2 Launch id: -1 State: 2
      Daemon: [[62216,0],0] Daemon launched: True
      Num slots: 15 Slots in use: 1 Oversubscribed: FALSE
      Num slots allocated: 15 Max slots: 15
      Username on node: NULL
      Num procs: 1 Next node_rank: 1
      Data for proc: [[62216,1],0]
          Pid: 0 Local rank: 0 Node rank: 0 App rank: 0
          State: INITIALIZED Restarts: 0 App_context: 0 Locale:
0-15 Binding: NULL[0]

  Data for node: x.x.x.24 Launch id: -1 State: 0
      Daemon: [[62216,0],1] Daemon launched: False
      Num slots: 2 Slots in use: 1 Oversubscribed: FALSE
      Num slots allocated: 2 Max slots: 2
      Username on node: NULL
      Num procs: 1 Next node_rank: 1
      Data for proc: [[62216,1],1]
          Pid: 0 Local rank: 0 Node rank: 0 App rank: 1
          State: INITIALIZED Restarts: 0 App_context: 0 Locale:
0-7 Binding: NULL[0]

  Data for node: x.x.x.26 Launch id: -1 State: 0
      Daemon: [[62216,0],2] Daemon launched: False
      Num slots: 14 Slots in use: 1 Oversubscribed: FALSE
      Num slots allocated: 14 Max slots: 14
      Username on node: NULL
      Num procs: 1 Next node_rank: 1
      Data for proc: [[62216,1],2]
          Pid: 0 Local rank: 0 Node rank: 0 App rank: 2
          State: INITIALIZED Restarts: 0 App_context: 0 Locale:
0-7 Binding: NULL[0]

  Data for node: x.x.x.28 Launch id: -1 State: 0
      Daemon: [[62216,0],3] Daemon launched: False
      Num slots: 16 Slots in use: 1 Oversubscribed: FALSE
      Num slots allocated: 16 Max slots: 16
      Username on node: NULL
      Num procs: 1 Next node_rank: 1
      Data for proc: [[62216,1],3]
          Pid: 0 Local rank: 0 Node rank: 0 App rank: 3
          State: INITIALIZED Restarts: 0 App_context: 0 Locale:
0-7 Binding: NULL[0]

  Data for node: x.x.x.29 Launch id: -1 State: 0
      Daemon: [[62216,0],4] Daemon launched: False
      Num slots: 14 Slots in use: 1 Oversubscribed: FALSE
      Num slots allocated: 14 Max slots: 14
      Username on node: NULL
      Num procs: 1 Next node_rank: 1
      Data for proc: [[62216,1],4]
          Pid: 0 Local rank: 0 Node rank: 0 App rank: 4
          State: INITIALIZED Restarts: 0 App_context: 0 Locale:
0-7 Binding: NULL[0]

  Data for node: x.x.x.30 Launch id: -1 State: 0
      Daemon: [[62216,0],5] Daemon launched: False
      Num slots: 16 Slots in use: 1 Oversubscribed: FALSE
      Num slots allocated: 16 Max slots: 16
      Username on node: NULL
      Num procs: 1 Next node_rank: 1
      Data for proc: [[62216,1],5]
          Pid: 0 Local rank: 0 Node rank: 0 App rank: 5
          State: INITIALIZED Restarts: 0 App_context: 0 Locale:
0-7 Binding: NULL[0]

  Data for node: x.x.x.41 Launch id: -1 State: 0
      Daemon: [[62216,0],6] Daemon launched: False
      Num slots: 46 Slots in use: 1 Oversubscribed: FALSE
      Num slots allocated: 46 Max slots: 46
      Username on node: NULL
      Num procs: 1 Next node_rank: 1
      Data for proc: [[62216,1],6]
          Pid: 0 Local rank: 0 Node rank: 0 App rank: 6
          State: INITIALIZED Restarts: 0 App_context: 0 Locale:
0-7 Binding: NULL[0]

  Data for node: x.x.x.101 Launch id: -1 State: 0
      Daemon: [[62216,0],7] Daemon launched: False
      Num slots: 46 Slots in use: 1 Oversubscribed: FALSE
      Num slots allocated: 46 Max slots: 46
      Username on node: NULL
      Num procs: 1 Next node_rank: 1
      Data for proc: [[62216,1],7]
          Pid: 0 Local rank: 0 Node rank: 0 App rank: 7
          State: INITIALIZED Restarts: 0 App_context: 0 Locale:
0-7 Binding: NULL[0]

  Data for node: x.x.x.100 Launch id: -1 State: 0
      Daemon: [[62216,0],8] Daemon launched: False
      Num slots: 46 Slots in use: 1 Oversubscribed: FALSE
      Num slots allocated: 46 Max slots: 46
      Username on node: NULL
      Num procs: 1 Next node_rank: 1
      Data for proc: [[62216,1],8]
          Pid: 0 Local rank: 0 Node rank: 0 App rank: 8
          State: INITIALIZED Restarts: 0 App_context: 0 Locale:
0-7 Binding: NULL[0]

  Data for node: x.x.x.102 Launch id: -1 State: 0
      Daemon: [[62216,0],9] Daemon launched: False
      Num slots: 22 Slots in use: 1 Oversubscribed: FALSE
      Num slots allocated: 22 Max slots: 22
      Username on node: NULL
      Num procs: 1 Next node_rank: 1
      Data for proc: [[62216,1],9]
          Pid: 0 Local rank: 0 Node rank: 0 App rank: 9
          State: INITIALIZED Restarts: 0 App_context: 0 Locale:
0-7 Binding: NULL[0]
[sv-1:46111] procdir: /tmp/openmpi-sessions-mpidemo_at_sv-1_0/62216/1/8
[sv-1:46111] jobdir: /tmp/openmpi-sessions-mpidemo_at_sv-1_0/62216/1
[sv-1:46111] top: openmpi-sessions-mpidemo_at_sv-1_0
[sv-1:46111] tmp: /tmp
[SERVER-14:10768] procdir:
/tmp/openmpi-sessions-mpidemo_at_SERVER-14_0/62216/1/6
[SERVER-14:10768] jobdir: /tmp/openmpi-sessions-mpidemo_at_SERVER-14_0/62216/1
[SERVER-14:10768] top: openmpi-sessions-mpidemo_at_SERVER-14_0
[SERVER-14:10768] tmp: /tmp
[SERVER-2:08912] procdir: /tmp/openmpi-sessions-mpidemo_at_SERVER-2_0/62216/1/0
[SERVER-2:08912] jobdir: /tmp/openmpi-sessions-mpidemo_at_SERVER-2_0/62216/1
[SERVER-2:08912] top: openmpi-sessions-mpidemo_at_SERVER-2_0
[SERVER-2:08912] tmp: /tmp
[SERVER-4:27460] procdir: /tmp/openmpi-sessions-mpidemo_at_SERVER-4_0/62216/1/2
[SERVER-4:27460] jobdir: /tmp/openmpi-sessions-mpidemo_at_SERVER-4_0/62216/1
[SERVER-4:27460] top: openmpi-sessions-mpidemo_at_SERVER-4_0
[SERVER-4:27460] tmp: /tmp
[SERVER-6:11608] procdir: /tmp/openmpi-sessions-mpidemo_at_SERVER-6_0/62216/1/4
[SERVER-6:11608] jobdir: /tmp/openmpi-sessions-mpidemo_at_SERVER-6_0/62216/1
[SERVER-6:11608] top: openmpi-sessions-mpidemo_at_SERVER-6_0
[SERVER-6:11608] tmp: /tmp
[SERVER-7:02620] procdir: /tmp/openmpi-sessions-mpidemo_at_SERVER-7_0/62216/1/5
[SERVER-7:02620] jobdir: /tmp/openmpi-sessions-mpidemo_at_SERVER-7_0/62216/1
[SERVER-7:02620] top: openmpi-sessions-mpidemo_at_SERVER-7_0
[SERVER-7:02620] tmp: /tmp
[sv-3:08586] procdir: /tmp/openmpi-sessions-mpidemo_at_sv-3_0/62216/1/9
[sv-3:08586] jobdir: /tmp/openmpi-sessions-mpidemo_at_sv-3_0/62216/1
[sv-3:08586] top: openmpi-sessions-mpidemo_at_sv-3_0
[sv-3:08586] tmp: /tmp
[sv-2:07736] procdir: /tmp/openmpi-sessions-mpidemo_at_sv-2_0/62216/1/7
[sv-2:07736] jobdir: /tmp/openmpi-sessions-mpidemo_at_sv-2_0/62216/1
[sv-2:07736] top: openmpi-sessions-mpidemo_at_sv-2_0
[sv-2:07736] tmp: /tmp
[SERVER-5:16418] procdir: /tmp/openmpi-sessions-mpidemo_at_SERVER-5_0/62216/1/3
[SERVER-5:16418] jobdir: /tmp/openmpi-sessions-mpidemo_at_SERVER-5_0/62216/1
[SERVER-5:16418] top: openmpi-sessions-mpidemo_at_SERVER-5_0
[SERVER-5:16418] tmp: /tmp
[SERVER-3:32533] procdir: /tmp/openmpi-sessions-mpidemo_at_SERVER-3_0/62216/1/1
[SERVER-3:32533] jobdir: /tmp/openmpi-sessions-mpidemo_at_SERVER-3_0/62216/1
[SERVER-3:32533] top: openmpi-sessions-mpidemo_at_SERVER-3_0
[SERVER-3:32533] tmp: /tmp
   MPIR_being_debugged = 0
   MPIR_debug_state = 1
   MPIR_partial_attach_ok = 1
   MPIR_i_am_starter = 0
   MPIR_forward_output = 0
   MPIR_proctable_size = 10
   MPIR_proctable:
     (i, host, exe, pid) = (0, SERVER-2,
/usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 8912)
     (i, host, exe, pid) = (1, x.x.x.24,
/usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 32533)
     (i, host, exe, pid) = (2, x.x.x.26,
/usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 27460)
     (i, host, exe, pid) = (3, x.x.x.28,
/usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 16418)
     (i, host, exe, pid) = (4, x.x.x.29,
/usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 11608)
     (i, host, exe, pid) = (5, x.x.x.30,
/usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 2620)
     (i, host, exe, pid) = (6, x.x.x.41,
/usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 10768)
     (i, host, exe, pid) = (7, x.x.x.101,
/usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 7736)
     (i, host, exe, pid) = (8, x.x.x.100,
/usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 46111)
     (i, host, exe, pid) = (9, x.x.x.102,
/usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 8586)
MPIR_executable_path: NULL
MPIR_server_arguments: NULL
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

   PML add procs failed
   --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[SERVER-2:8912] *** An error occurred in MPI_Init
[SERVER-2:8912] *** reported by process [140393673392129,140389596004352]
[SERVER-2:8912] *** on a NULL communicator
[SERVER-2:8912] *** Unknown error
[SERVER-2:8912] *** MPI_ERRORS_ARE_FATAL (processes in this communicator
will now abort,
[SERVER-2:8912] *** and potentially your MPI job)
--------------------------------------------------------------------------
An MPI process is aborting at a time when it cannot guarantee that all
of its peer processes in the job will be killed properly. You should
double check that everything has shut down cleanly.

   Reason: Before MPI_INIT completed
   Local host: SERVER-2
   PID: 8912
--------------------------------------------------------------------------
[sv-1][[62216,1],8][btl_openib_proc.c:157:mca_btl_openib_proc_create]
[btl_openib_proc.c:157] ompi_modex_recv failed for peer [[62216,1],0]
[sv-1][[62216,1],8][btl_tcp_proc.c:128:mca_btl_tcp_proc_create]
mca_base_modex_recv: failed with return value=-13
[sv-1][[62216,1],8][btl_tcp_proc.c:128:mca_btl_tcp_proc_create]
mca_base_modex_recv: failed with return value=-13
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications. This means that no Open MPI device has indicated
that it can be used to communicate between these processes. This is
an error; Open MPI requires that all MPI processes be able to reach
each other. This error can sometimes be the result of forgetting to
specify the "self" BTL.

   Process 1 ([[62216,1],8]) is on host: sv-1
   Process 2 ([[62216,1],0]) is on host: SERVER-2
   BTLs attempted: openib self sm tcp

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
[sv-3][[62216,1],9][btl_openib_proc.c:157:mca_btl_openib_proc_create]
[btl_openib_proc.c:157] ompi_modex_recv failed for peer [[62216,1],0]
[sv-3][[62216,1],9][btl_tcp_proc.c:128:mca_btl_tcp_proc_create]
mca_base_modex_recv: failed with return value=-13
[sv-3][[62216,1],9][btl_tcp_proc.c:128:mca_btl_tcp_proc_create]
mca_base_modex_recv: failed with return value=-13
--------------------------------------------------------------------------
MPI_INIT has failed because at least one MPI process is unreachable
from another. This *usually* means that an underlying communication
plugin -- such as a BTL or an MTL -- has either not loaded or not
allowed itself to be used. Your MPI job will now abort.

You may wish to try to narrow down the problem;

  * Check the output of ompi_info to see which BTL/MTL plugins are
    available.
  * Run your application with MPI_THREAD_SINGLE.
  * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
    if using MTL-based communications) to see exactly which
    communication plugins were considered and/or discarded.
--------------------------------------------------------------------------
[sv-2][[62216,1],7][btl_openib_proc.c:157:mca_btl_openib_proc_create]
[btl_openib_proc.c:157] ompi_modex_recv failed for peer [[62216,1],0]
[sv-2][[62216,1],7][btl_tcp_proc.c:128:mca_btl_tcp_proc_create]
mca_base_modex_recv: failed with return value=-13
[sv-2][[62216,1],7][btl_tcp_proc.c:128:mca_btl_tcp_proc_create]
mca_base_modex_recv: failed with return value=-13
[SERVER-2:08907] sess_dir_finalize: proc session dir not empty - leaving
[sv-4:12040] sess_dir_finalize: job session dir not empty - leaving
[SERVER-14:10755] sess_dir_finalize: job session dir not empty - leaving
[SERVER-2:08907] sess_dir_finalize: proc session dir not empty - leaving
[SERVER-6:11595] sess_dir_finalize: proc session dir not empty - leaving
[SERVER-6:11595] sess_dir_finalize: proc session dir not empty - leaving
[SERVER-4:27445] sess_dir_finalize: proc session dir not empty - leaving
exiting with status 0
[SERVER-4:27445] sess_dir_finalize: proc session dir not empty - leaving
[SERVER-6:11595] sess_dir_finalize: job session dir not empty - leaving
[SERVER-7:02607] sess_dir_finalize: proc session dir not empty - leaving
[SERVER-7:02607] sess_dir_finalize: proc session dir not empty - leaving
[SERVER-7:02607] sess_dir_finalize: job session dir not empty - leaving
[SERVER-5:16404] sess_dir_finalize: proc session dir not empty - leaving
[SERVER-5:16404] sess_dir_finalize: proc session dir not empty - leaving
exiting with status 0
exiting with status 0
exiting with status 0
[SERVER-4:27445] sess_dir_finalize: job session dir not empty - leaving
exiting with status 0
[SERVER-3:32517] sess_dir_finalize: proc session dir not empty - leaving
[SERVER-3:32517] sess_dir_finalize: proc session dir not empty - leaving
[sv-3:08575] sess_dir_finalize: proc session dir not empty - leaving
[sv-3:08575] sess_dir_finalize: job session dir not empty - leaving
exiting with status 0
[sv-1:46100] sess_dir_finalize: proc session dir not empty - leaving
[sv-1:46100] sess_dir_finalize: job session dir not empty - leaving
exiting with status 0
[sv-2:07725] sess_dir_finalize: proc session dir not empty - leaving
[sv-2:07725] sess_dir_finalize: job session dir not empty - leaving
exiting with status 0
[SERVER-5:16404] sess_dir_finalize: job session dir not empty - leaving
exiting with status 0
[SERVER-3:32517] sess_dir_finalize: job session dir not empty - leaving
exiting with status 0
--------------------------------------------------------------------------
mpirun has exited due to process rank 6 with PID 10768 on
node x.x.x.41 exiting improperly. There are three reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

3. this process called "MPI_Abort" or "orte_abort" and the mca parameter
orte_create_session_dirs is set to false. In this case, the run-time cannot
detect that the abort call was an abnormal termination. Hence, the only
error message you will receive is this one.

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).

You can avoid this message by specifying -quiet on the mpirun command line.

--------------------------------------------------------------------------
[SERVER-2:08907] 6 more processes have sent help message
help-mpi-runtime / mpi_init:startup:internal-failure
[SERVER-2:08907] Set MCA parameter "orte_base_help_aggregate" to 0 to
see all help / error messages
[SERVER-2:08907] 9 more processes have sent help message
help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
[SERVER-2:08907] 9 more processes have sent help message
help-mpi-runtime.txt / ompi mpi abort:cannot guarantee all killed
[SERVER-2:08907] 2 more processes have sent help message
help-mca-bml-r2.txt / unreachable proc
[SERVER-2:08907] 2 more processes have sent help message
help-mpi-runtime / mpi_init:startup:pml-add-procs-fail
[SERVER-2:08907] sess_dir_finalize: job session dir not empty - leaving
exiting with status 1

//******************************************************************
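
One more isolation step I can think of, since the error singles out the
sv-1 / SERVER-2 pair: run exactly one rank on each of those two hosts,
restricted to the tcp and self transports (a sketch; the two-line
hostfile below is hypothetical):

```shell
# Hypothetical minimal hostfile naming only the failing pair
cat > pairfile <<EOF
SERVER-2 slots=1
sv-1 slots=1
EOF

# One rank per host, tcp/self only, to see if the pair can connect at all
mpirun -np 2 --hostfile pairfile --bynode --mca btl tcp,self ./test
```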

On 8/3/13 4:34 AM, Ralph Castain wrote:
> It looks like SERVER-2 cannot talk to your x.x.x.100 machine. I note
> that you have some entries at the end of the hostfile that I don't
> understand - a list of hosts that can be reached? And I see that your
> x.x.x.22 machine isn't on it. Is that SERVER-2 by chance?
>
> Our hostfile parsing changed between the release series, but I know we
> never consciously supported the syntax you show below where you list
> capabilities, and then re-list the hosts in an apparent attempt to
> filter which ones can actually be used. It is possible that the 1.5
> series somehow used that to exclude the 22 machine, and that the 1.7
> parser now doesn't do that.
>
> If you only include machines you actually intend to use in your
> hostfile, does the 1.7 series work?
>
> On Aug 3, 2013, at 3:58 AM, RoboBeans <robobeans_at_[hidden]
> <mailto:robobeans_at_[hidden]>> wrote:
>
>> Hello everyone,
>>
>> I have installed openmpi 1.5.4 on an 11-node cluster using "yum install
>> openmpi openmpi-devel" and everything seems to be working fine. For
>> testing I am using this test program:
>>
>> //******************************************************************
>>
>> $ cat test.cpp
>>
>> #include <stdio.h>
>> #include <mpi.h>
>>
>> int main (int argc, char *argv[])
>> {
>>     int id, np;
>>     char name[MPI_MAX_PROCESSOR_NAME];
>>     int namelen;
>>
>>     MPI_Init (&argc, &argv);
>>
>>     MPI_Comm_size (MPI_COMM_WORLD, &np);
>>     MPI_Comm_rank (MPI_COMM_WORLD, &id);
>>     MPI_Get_processor_name (name, &namelen);
>>
>>     printf ("This is Process %2d out of %2d running on host %s\n",
>>             id, np, name);
>>
>>     MPI_Finalize ();
>>
>>     return 0;
>> }
>>
>> //******************************************************************
>>
>> and my hostfile looks like this:
>>
>> $ cat mpi_hostfile
>>
>> # The Hostfile for Open MPI
>>
>> # specify number of slots for processes to run locally.
>> #localhost slots=12
>> #x.x.x.16 slots=12 max-slots=12
>> #x.x.x.17 slots=12 max-slots=12
>> #x.x.x.18 slots=12 max-slots=12
>> #x.x.1x.19 slots=12 max-slots=12
>> #x.x.x.20 slots=12 max-slots=12
>> #x.x.x.55 slots=46 max-slots=46
>> #x.x.x.56 slots=46 max-slots=46
>>
>> x.x.x.22 slots=15 max-slots=15
>> x.x.x.24 slots=2 max-slots=2
>> x.x.x.26 slots=14 max-slots=14
>> x.x.x.28 slots=16 max-slots=16
>> x.x.x.29 slots=14 max-slots=14
>> x.x.x.30 slots=16 max-slots=16
>> x.x.x.41 slots=46 max-slots=46
>> x.x.x.101 slots=46 max-slots=46
>> x.x.x.100 slots=46 max-slots=46
>> x.x.x.102 slots=22 max-slots=22
>> x.x.x.103 slots=22 max-slots=22
>>
>> # The following slave nodes are available to this machine:
>> x.x.x.24
>> x.x.x.26
>> x.x.x.28
>> x.x.x.29
>> x.x.x.30
>> x.x.x.41
>> x.x.x.101
>> x.x.x.100
>> x.x.x.102
>> x.x.x.103
>>
>> //******************************************************************
>>
>> this is what my .bashrc looks like on each node:
>>
>> $ cat ~/.bashrc
>>
>> # .bashrc
>>
>> # Source global definitions
>> if [ -f /etc/bashrc ]; then
>> . /etc/bashrc
>> fi
>>
>> # User specific aliases and functions
>> umask 077
>>
>> export PSM_SHAREDCONTEXTS_MAX=20
>>
>> #export PATH=/usr/lib64/openmpi/bin${PATH:+:$PATH}
>> export PATH=/usr/OPENMPI/openmpi-1.7.2/bin${PATH:+:$PATH}
>>
>> #export
>> LD_LIBRARY_PATH=/usr/lib64/openmpi/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
>> export
>> LD_LIBRARY_PATH=/usr/OPENMPI/openmpi-1.7.2/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
>>
>> export PATH="$PATH":/bin/:/usr/lib/:/usr/lib:/usr:/usr/
>>
>> //******************************************************************
>>
>> $ mpic++ test.cpp -o test
>>
>> $ mpirun -d --display-map -np 10 --hostfile mpi_hostfile --bynode ./test
>>
>> //******************************************************************
>>
>> These nodes are running the 2.6.32-358.2.1.el6.x86_64 kernel release:
>>
>> $ uname
>> Linux
>> $ uname -r
>> 2.6.32-358.2.1.el6.x86_64
>> $ cat /etc/issue
>> CentOS release 6.4 (Final)
>> Kernel \r on an \m
>>
>> //******************************************************************
>>
>> Now, if I install openmpi 1.7.2 on each node separately, then I can
>> only use it on either the first 7 nodes or the last 4 nodes, but not
>> on all of them.
>>
>> //******************************************************************
>>
>> $ gunzip -c openmpi-1.7.2.tar.gz | tar xf -
>>
>> $ cd openmpi-1.7.2
>>
>> $ ./configure --prefix=/usr/OPENMPI/openmpi-1.7.2
>> --enable-event-thread-support --enable-opal-multi-threads
>> --enable-orte-progress-threads --enable-mpi-thread-multiple
>>
>> $ make all install
>>
>> //******************************************************************
>>
>> This is the error message that I am receiving:
>>
>>
>> $ mpirun -d --display-map -np 10 --hostfile mpi_hostfile --bynode
>> ./test
>>
>> [SERVER-2:05284] procdir:
>> /tmp/openmpi-sessions-mpidemo_at_SERVER-2_0/50535/0/0
>> [SERVER-2:05284] jobdir: /tmp/openmpi-sessions-mpidemo_at_SERVER-2_0/50535/0
>> [SERVER-2:05284] top: openmpi-sessions-mpidemo_at_SERVER-2_0
>> [SERVER-2:05284] tmp: /tmp
>> CentOS release 6.4 (Final)
>> Kernel \r on an \m
>> CentOS release 6.4 (Final)
>> Kernel \r on an \m
>> CentOS release 6.4 (Final)
>> Kernel \r on an \m
>> [SERVER-3:28993] procdir:
>> /tmp/openmpi-sessions-mpidemo_at_SERVER-3_0/50535/0/1
>> [SERVER-3:28993] jobdir: /tmp/openmpi-sessions-mpidemo_at_SERVER-3_0/50535/0
>> [SERVER-3:28993] top: openmpi-sessions-mpidemo_at_SERVER-3_0
>> [SERVER-3:28993] tmp: /tmp
>> CentOS release 6.4 (Final)
>> Kernel \r on an \m
>> CentOS release 6.4 (Final)
>> Kernel \r on an \m
>> [SERVER-6:09087] procdir:
>> /tmp/openmpi-sessions-mpidemo_at_SERVER-6_0/50535/0/4
>> [SERVER-6:09087] jobdir: /tmp/openmpi-sessions-mpidemo_at_SERVER-6_0/50535/0
>> [SERVER-6:09087] top: openmpi-sessions-mpidemo_at_SERVER-6_0
>> [SERVER-6:09087] tmp: /tmp
>> [SERVER-7:32563] procdir:
>> /tmp/openmpi-sessions-mpidemo_at_SERVER-7_0/50535/0/5
>> [SERVER-7:32563] jobdir: /tmp/openmpi-sessions-mpidemo_at_SERVER-7_0/50535/0
>> [SERVER-7:32563] top: openmpi-sessions-mpidemo_at_SERVER-7_0
>> [SERVER-7:32563] tmp: /tmp
>> [SERVER-4:15711] procdir:
>> /tmp/openmpi-sessions-mpidemo_at_SERVER-4_0/50535/0/2
>> [SERVER-4:15711] jobdir: /tmp/openmpi-sessions-mpidemo_at_SERVER-4_0/50535/0
>> [SERVER-4:15711] top: openmpi-sessions-mpidemo_at_SERVER-4_0
>> [SERVER-4:15711] tmp: /tmp
>> [sv-1:45701] procdir: /tmp/openmpi-sessions-mpidemo_at_sv-1_0/50535/0/8
>> [sv-1:45701] jobdir: /tmp/openmpi-sessions-mpidemo_at_sv-1_0/50535/0
>> [sv-1:45701] top: openmpi-sessions-mpidemo_at_sv-1_0
>> [sv-1:45701] tmp: /tmp
>> CentOS release 6.4 (Final)
>> Kernel \r on an \m
>> [sv-3:08352] procdir: /tmp/openmpi-sessions-mpidemo_at_sv-3_0/50535/0/9
>> [sv-3:08352] jobdir: /tmp/openmpi-sessions-mpidemo_at_sv-3_0/50535/0
>> [sv-3:08352] top: openmpi-sessions-mpidemo_at_sv-3_0
>> [sv-3:08352] tmp: /tmp
>> [SERVER-5:12534] procdir:
>> /tmp/openmpi-sessions-mpidemo_at_SERVER-5_0/50535/0/3
>> [SERVER-5:12534] jobdir: /tmp/openmpi-sessions-mpidemo_at_SERVER-5_0/50535/0
>> [SERVER-5:12534] top: openmpi-sessions-mpidemo_at_SERVER-5_0
>> [SERVER-5:12534] tmp: /tmp
>> [SERVER-14:08399] procdir:
>> /tmp/openmpi-sessions-mpidemo_at_SERVER-14_0/50535/0/6
>> [SERVER-14:08399] jobdir:
>> /tmp/openmpi-sessions-mpidemo_at_SERVER-14_0/50535/0
>> [SERVER-14:08399] top: openmpi-sessions-mpidemo_at_SERVER-14_0
>> [SERVER-14:08399] tmp: /tmp
>> [sv-4:11802] procdir: /tmp/openmpi-sessions-mpidemo_at_sv-4_0/50535/0/10
>> [sv-4:11802] jobdir: /tmp/openmpi-sessions-mpidemo_at_sv-4_0/50535/0
>> [sv-4:11802] top: openmpi-sessions-mpidemo_at_sv-4_0
>> [sv-4:11802] tmp: /tmp
>> [sv-2:07503] procdir: /tmp/openmpi-sessions-mpidemo_at_sv-2_0/50535/0/7
>> [sv-2:07503] jobdir: /tmp/openmpi-sessions-mpidemo_at_sv-2_0/50535/0
>> [sv-2:07503] top: openmpi-sessions-mpidemo_at_sv-2_0
>> [sv-2:07503] tmp: /tmp
>>
>> Mapper requested: NULL Last mapper: round_robin Mapping policy:
>> BYNODE Ranking policy: NODE Binding policy: NONE[NODE] Cpu set:
>> NULL PPR: NULL
>> Num new daemons: 0 New daemon starting vpid INVALID
>> Num nodes: 10
>>
>> Data for node: SERVER-2 Launch id: -1 State: 2
>> Daemon: [[50535,0],0] Daemon launched: True
>> Num slots: 15 Slots in use: 1 Oversubscribed: FALSE
>> Num slots allocated: 15 Max slots: 15
>> Username on node: NULL
>> Num procs: 1 Next node_rank: 1
>> Data for proc: [[50535,1],0]
>> Pid: 0 Local rank: 0 Node rank: 0 App rank: 0
>> State: INITIALIZED Restarts: 0 App_context: 0
>> Locale: 0-15 Binding: NULL[0]
>>
>> Data for node: x.x.x.24 Launch id: -1 State: 0
>> Daemon: [[50535,0],1] Daemon launched: False
>> Num slots: 3 Slots in use: 1 Oversubscribed: FALSE
>> Num slots allocated: 3 Max slots: 2
>> Username on node: NULL
>> Num procs: 1 Next node_rank: 1
>> Data for proc: [[50535,1],1]
>> Pid: 0 Local rank: 0 Node rank: 0 App rank: 1
>> State: INITIALIZED Restarts: 0 App_context: 0
>> Locale: 0-7 Binding: NULL[0]
>>
>> Data for node: x.x.x.26 Launch id: -1 State: 0
>> Daemon: [[50535,0],2] Daemon launched: False
>> Num slots: 15 Slots in use: 1 Oversubscribed: FALSE
>> Num slots allocated: 15 Max slots: 14
>> Username on node: NULL
>> Num procs: 1 Next node_rank: 1
>> Data for proc: [[50535,1],2]
>> Pid: 0 Local rank: 0 Node rank: 0 App rank: 2
>> State: INITIALIZED Restarts: 0 App_context: 0
>> Locale: 0-7 Binding: NULL[0]
>>
>> Data for node: x.x.x.28 Launch id: -1 State: 0
>> Daemon: [[50535,0],3] Daemon launched: False
>> Num slots: 17 Slots in use: 1 Oversubscribed: FALSE
>> Num slots allocated: 17 Max slots: 16
>> Username on node: NULL
>> Num procs: 1 Next node_rank: 1
>> Data for proc: [[50535,1],3]
>> Pid: 0 Local rank: 0 Node rank: 0 App rank: 3
>> State: INITIALIZED Restarts: 0 App_context: 0
>> Locale: 0-7 Binding: NULL[0]
>>
>> Data for node: x.x.x.29 Launch id: -1 State: 0
>> Daemon: [[50535,0],4] Daemon launched: False
>> Num slots: 15 Slots in use: 1 Oversubscribed: FALSE
>> Num slots allocated: 15 Max slots: 14
>> Username on node: NULL
>> Num procs: 1 Next node_rank: 1
>> Data for proc: [[50535,1],4]
>> Pid: 0 Local rank: 0 Node rank: 0 App rank: 4
>> State: INITIALIZED Restarts: 0 App_context: 0
>> Locale: 0-7 Binding: NULL[0]
>>
>> Data for node: x.x.x.30 Launch id: -1 State: 0
>> Daemon: [[50535,0],5] Daemon launched: False
>> Num slots: 17 Slots in use: 1 Oversubscribed: FALSE
>> Num slots allocated: 17 Max slots: 16
>> Username on node: NULL
>> Num procs: 1 Next node_rank: 1
>> Data for proc: [[50535,1],5]
>> Pid: 0 Local rank: 0 Node rank: 0 App rank: 5
>> State: INITIALIZED Restarts: 0 App_context: 0
>> Locale: 0-7 Binding: NULL[0]
>>
>> Data for node: x.x.x.41 Launch id: -1 State: 0
>> Daemon: [[50535,0],6] Daemon launched: False
>> Num slots: 47 Slots in use: 1 Oversubscribed: FALSE
>> Num slots allocated: 47 Max slots: 46
>> Username on node: NULL
>> Num procs: 1 Next node_rank: 1
>> Data for proc: [[50535,1],6]
>> Pid: 0 Local rank: 0 Node rank: 0 App rank: 6
>> State: INITIALIZED Restarts: 0 App_context: 0
>> Locale: 0-7 Binding: NULL[0]
>>
>> Data for node: x.x.x.101 Launch id: -1 State: 0
>> Daemon: [[50535,0],7] Daemon launched: False
>> Num slots: 47 Slots in use: 1 Oversubscribed: FALSE
>> Num slots allocated: 47 Max slots: 46
>> Username on node: NULL
>> Num procs: 1 Next node_rank: 1
>> Data for proc: [[50535,1],7]
>> Pid: 0 Local rank: 0 Node rank: 0 App rank: 7
>> State: INITIALIZED Restarts: 0 App_context: 0
>> Locale: 0-7 Binding: NULL[0]
>>
>> Data for node: x.x.x.100 Launch id: -1 State: 0
>> Daemon: [[50535,0],8] Daemon launched: False
>> Num slots: 47 Slots in use: 1 Oversubscribed: FALSE
>> Num slots allocated: 47 Max slots: 46
>> Username on node: NULL
>> Num procs: 1 Next node_rank: 1
>> Data for proc: [[50535,1],8]
>> Pid: 0 Local rank: 0 Node rank: 0 App rank: 8
>> State: INITIALIZED Restarts: 0 App_context: 0
>> Locale: 0-7 Binding: NULL[0]
>>
>> Data for node: x.x.x.102 Launch id: -1 State: 0
>> Daemon: [[50535,0],9] Daemon launched: False
>> Num slots: 23 Slots in use: 1 Oversubscribed: FALSE
>> Num slots allocated: 23 Max slots: 22
>> Username on node: NULL
>> Num procs: 1 Next node_rank: 1
>> Data for proc: [[50535,1],9]
>> Pid: 0 Local rank: 0 Node rank: 0 App rank: 9
>> State: INITIALIZED Restarts: 0 App_context: 0
>> Locale: 0-7 Binding: NULL[0]
>> [sv-1:45712] procdir: /tmp/openmpi-sessions-mpidemo_at_sv-1_0/50535/1/8
>> [sv-1:45712] jobdir: /tmp/openmpi-sessions-mpidemo_at_sv-1_0/50535/1
>> [sv-1:45712] top: openmpi-sessions-mpidemo_at_sv-1_0
>> [sv-1:45712] tmp: /tmp
>> [SERVER-14:08412] procdir: /tmp/openmpi-sessions-mpidemo_at_SERVER-14_0/50535/1/6
>> [SERVER-14:08412] jobdir: /tmp/openmpi-sessions-mpidemo_at_SERVER-14_0/50535/1
>> [SERVER-14:08412] top: openmpi-sessions-mpidemo_at_SERVER-14_0
>> [SERVER-14:08412] tmp: /tmp
>> [SERVER-2:05291] procdir: /tmp/openmpi-sessions-mpidemo_at_SERVER-2_0/50535/1/0
>> [SERVER-2:05291] jobdir: /tmp/openmpi-sessions-mpidemo_at_SERVER-2_0/50535/1
>> [SERVER-2:05291] top: openmpi-sessions-mpidemo_at_SERVER-2_0
>> [SERVER-2:05291] tmp: /tmp
>> [SERVER-4:15726] procdir: /tmp/openmpi-sessions-mpidemo_at_SERVER-4_0/50535/1/2
>> [SERVER-4:15726] jobdir: /tmp/openmpi-sessions-mpidemo_at_SERVER-4_0/50535/1
>> [SERVER-4:15726] top: openmpi-sessions-mpidemo_at_SERVER-4_0
>> [SERVER-4:15726] tmp: /tmp
>> [SERVER-6:09100] procdir: /tmp/openmpi-sessions-mpidemo_at_SERVER-6_0/50535/1/4
>> [SERVER-6:09100] jobdir: /tmp/openmpi-sessions-mpidemo_at_SERVER-6_0/50535/1
>> [SERVER-6:09100] top: openmpi-sessions-mpidemo_at_SERVER-6_0
>> [SERVER-6:09100] tmp: /tmp
>> [SERVER-7:32576] procdir: /tmp/openmpi-sessions-mpidemo_at_SERVER-7_0/50535/1/5
>> [SERVER-7:32576] jobdir: /tmp/openmpi-sessions-mpidemo_at_SERVER-7_0/50535/1
>> [SERVER-7:32576] top: openmpi-sessions-mpidemo_at_SERVER-7_0
>> [SERVER-7:32576] tmp: /tmp
>> [sv-3:08363] procdir: /tmp/openmpi-sessions-mpidemo_at_sv-3_0/50535/1/9
>> [sv-3:08363] jobdir: /tmp/openmpi-sessions-mpidemo_at_sv-3_0/50535/1
>> [sv-3:08363] top: openmpi-sessions-mpidemo_at_sv-3_0
>> [sv-3:08363] tmp: /tmp
>> [sv-2:07514] procdir: /tmp/openmpi-sessions-mpidemo_at_sv-2_0/50535/1/7
>> [sv-2:07514] jobdir: /tmp/openmpi-sessions-mpidemo_at_sv-2_0/50535/1
>> [sv-2:07514] top: openmpi-sessions-mpidemo_at_sv-2_0
>> [sv-2:07514] tmp: /tmp
>> [SERVER-5:12548] procdir: /tmp/openmpi-sessions-mpidemo_at_SERVER-5_0/50535/1/3
>> [SERVER-5:12548] jobdir: /tmp/openmpi-sessions-mpidemo_at_SERVER-5_0/50535/1
>> [SERVER-5:12548] top: openmpi-sessions-mpidemo_at_SERVER-5_0
>> [SERVER-5:12548] tmp: /tmp
>> [SERVER-3:29009] procdir: /tmp/openmpi-sessions-mpidemo_at_SERVER-3_0/50535/1/1
>> [SERVER-3:29009] jobdir: /tmp/openmpi-sessions-mpidemo_at_SERVER-3_0/50535/1
>> [SERVER-3:29009] top: openmpi-sessions-mpidemo_at_SERVER-3_0
>> [SERVER-3:29009] tmp: /tmp
>> MPIR_being_debugged = 0
>> MPIR_debug_state = 1
>> MPIR_partial_attach_ok = 1
>> MPIR_i_am_starter = 0
>> MPIR_forward_output = 0
>> MPIR_proctable_size = 10
>> MPIR_proctable:
>> (i, host, exe, pid) = (0, SERVER-2, /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 5291)
>> (i, host, exe, pid) = (1, x.x.x.24, /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 29009)
>> (i, host, exe, pid) = (2, x.x.x.26, /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 15726)
>> (i, host, exe, pid) = (3, x.x.x.28, /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 12548)
>> (i, host, exe, pid) = (4, x.x.x.29, /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 9100)
>> (i, host, exe, pid) = (5, x.x.x.30, /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 32576)
>> (i, host, exe, pid) = (6, x.x.x.41, /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 8412)
>> (i, host, exe, pid) = (7, x.x.x.101, /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 7514)
>> (i, host, exe, pid) = (8, x.x.x.100, /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 45712)
>> (i, host, exe, pid) = (9, x.x.x.102, /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 8363)
>> MPIR_executable_path: NULL
>> MPIR_server_arguments: NULL
>> --------------------------------------------------------------------------
>> It looks like MPI_INIT failed for some reason; your parallel process is
>> likely to abort. There are many reasons that a parallel process can
>> fail during MPI_INIT; some of which are due to configuration or environment
>> problems. This failure appears to be an internal failure; here's some
>> additional information (which may only be relevant to an Open MPI
>> developer):
>>
>> PML add procs failed
>> --> Returned "Error" (-1) instead of "Success" (0)
>> --------------------------------------------------------------------------
>> [SERVER-2:5291] *** An error occurred in MPI_Init
>> [SERVER-2:5291] *** reported by process [140508871983105,140505560121344]
>> [SERVER-2:5291] *** on a NULL communicator
>> [SERVER-2:5291] *** Unknown error
>> [SERVER-2:5291] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> [SERVER-2:5291] *** and potentially your MPI job)
>> --------------------------------------------------------------------------
>> An MPI process is aborting at a time when it cannot guarantee that all
>> of its peer processes in the job will be killed properly. You should
>> double check that everything has shut down cleanly.
>>
>> Reason: Before MPI_INIT completed
>> Local host: SERVER-2
>> PID: 5291
>> --------------------------------------------------------------------------
>> [sv-1][[50535,1],8][btl_openib_proc.c:157:mca_btl_openib_proc_create]
>> [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[50535,1],0]
>> [sv-3][[50535,1],9][btl_openib_proc.c:157:mca_btl_openib_proc_create]
>> [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[50535,1],0]
>> [sv-3][[50535,1],9][btl_tcp_proc.c:128:mca_btl_tcp_proc_create]
>> mca_base_modex_recv: failed with return value=-13
>> [sv-3][[50535,1],9][btl_tcp_proc.c:128:mca_btl_tcp_proc_create]
>> mca_base_modex_recv: failed with return value=-13
>> [sv-1][[50535,1],8][btl_tcp_proc.c:128:mca_btl_tcp_proc_create]
>> mca_base_modex_recv: failed with return value=-13
>> [sv-1][[50535,1],8][btl_tcp_proc.c:128:mca_btl_tcp_proc_create]
>> mca_base_modex_recv: failed with return value=-13
>> --------------------------------------------------------------------------
>> At least one pair of MPI processes are unable to reach each other for
>> MPI communications. This means that no Open MPI device has indicated
>> that it can be used to communicate between these processes. This is
>> an error; Open MPI requires that all MPI processes be able to reach
>> each other. This error can sometimes be the result of forgetting to
>> specify the "self" BTL.
>>
>> Process 1 ([[50535,1],8]) is on host: sv-1
>> Process 2 ([[50535,1],0]) is on host: SERVER-2
>> BTLs attempted: openib self sm tcp
>>
>> Your MPI job is now going to abort; sorry.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> MPI_INIT has failed because at least one MPI process is unreachable
>> from another. This *usually* means that an underlying communication
>> plugin -- such as a BTL or an MTL -- has either not loaded or not
>> allowed itself to be used. Your MPI job will now abort.
>>
>> You may wish to try to narrow down the problem;
>>
>> * Check the output of ompi_info to see which BTL/MTL plugins are
>> available.
>> * Run your application with MPI_THREAD_SINGLE.
>> * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
>> if using MTL-based communications) to see exactly which
>> communication plugins were considered and/or discarded.
>> --------------------------------------------------------------------------
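[Editor's sketch of the diagnostic steps suggested in the help text above. The MCA parameters shown (`btl_base_verbose`, `btl`) are standard Open MPI runtime parameters; the hostfile and binary names are the ones used earlier in this thread. Forcing `tcp,self` is one way to test whether the failing `openib` modex exchange is the culprit; this is a hedged suggestion, not the confirmed fix.]

```shell
# 1. Check which BTL components this 1.7.2 build actually installed:
ompi_info | grep btl

# 2. Re-run with verbose BTL selection to see which transports are
#    considered and/or discarded on each host:
mpirun --mca btl_base_verbose 100 -np 10 --hostfile mpi_hostfile --bynode ./test

# 3. Force plain TCP (plus the required "self" loopback BTL) to rule
#    out the openib transport implicated in the modex failures above:
mpirun --mca btl tcp,self -np 10 --hostfile mpi_hostfile --bynode ./test
```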
>> [sv-2][[50535,1],7][btl_openib_proc.c:157:mca_btl_openib_proc_create]
>> [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[50535,1],0]
>> [sv-2][[50535,1],7][btl_tcp_proc.c:128:mca_btl_tcp_proc_create]
>> mca_base_modex_recv: failed with return value=-13
>> [sv-2][[50535,1],7][btl_tcp_proc.c:128:mca_btl_tcp_proc_create]
>> mca_base_modex_recv: failed with return value=-13
>> [SERVER-2:05284] sess_dir_finalize: proc session dir not empty - leaving
>> [SERVER-2:05284] sess_dir_finalize: proc session dir not empty - leaving
>> [sv-4:11802] sess_dir_finalize: job session dir not empty - leaving
>> [SERVER-14:08399] sess_dir_finalize: job session dir not empty - leaving
>> [SERVER-6:09087] sess_dir_finalize: proc session dir not empty - leaving
>> [SERVER-6:09087] sess_dir_finalize: proc session dir not empty - leaving
>> [SERVER-4:15711] sess_dir_finalize: proc session dir not empty - leaving
>> [SERVER-4:15711] sess_dir_finalize: proc session dir not empty - leaving
>> [SERVER-6:09087] sess_dir_finalize: job session dir not empty - leaving
>> exiting with status 0
>> [SERVER-7:32563] sess_dir_finalize: proc session dir not empty - leaving
>> [SERVER-7:32563] sess_dir_finalize: proc session dir not empty - leaving
>> [SERVER-5:12534] sess_dir_finalize: proc session dir not empty - leaving
>> [SERVER-5:12534] sess_dir_finalize: proc session dir not empty - leaving
>> [SERVER-7:32563] sess_dir_finalize: job session dir not empty - leaving
>> exiting with status 0
>> exiting with status 0
>> exiting with status 0
>> [SERVER-4:15711] sess_dir_finalize: job session dir not empty - leaving
>> [SERVER-3:28993] sess_dir_finalize: proc session dir not empty - leaving
>> exiting with status 0
>> [SERVER-3:28993] sess_dir_finalize: proc session dir not empty - leaving
>> [sv-3:08352] sess_dir_finalize: proc session dir not empty - leaving
>> [sv-3:08352] sess_dir_finalize: job session dir not empty - leaving
>> [sv-1:45701] sess_dir_finalize: proc session dir not empty - leaving
>> [sv-1:45701] sess_dir_finalize: job session dir not empty - leaving
>> exiting with status 0
>> exiting with status 0
>> [sv-2:07503] sess_dir_finalize: proc session dir not empty - leaving
>> [sv-2:07503] sess_dir_finalize: job session dir not empty - leaving
>> exiting with status 0
>> [SERVER-5:12534] sess_dir_finalize: job session dir not empty - leaving
>> exiting with status 0
>> [SERVER-3:28993] sess_dir_finalize: job session dir not empty - leaving
>> exiting with status 0
>> --------------------------------------------------------------------------
>> mpirun has exited due to process rank 6 with PID 8412 on
>> node x.x.x.41 exiting improperly. There are three reasons this could
>> occur:
>>
>> 1. this process did not call "init" before exiting, but others in
>> the job did. This can cause a job to hang indefinitely while it waits
>> for all processes to call "init". By rule, if one process calls "init",
>> then ALL processes must call "init" prior to termination.
>>
>> 2. this process called "init", but exited without calling "finalize".
>> By rule, all processes that call "init" MUST call "finalize" prior to
>> exiting or it will be considered an "abnormal termination"
>>
>> 3. this process called "MPI_Abort" or "orte_abort" and the mca parameter
>> orte_create_session_dirs is set to false. In this case, the run-time cannot
>> detect that the abort call was an abnormal termination. Hence, the only
>> error message you will receive is this one.
>>
>> This may have caused other processes in the application to be
>> terminated by signals sent by mpirun (as reported here).
>>
>> You can avoid this message by specifying -quiet on the mpirun command
>> line.
>>
>> --------------------------------------------------------------------------
>> [SERVER-2:05284] 6 more processes have sent help message help-mpi-runtime / mpi_init:startup:internal-failure
>> [SERVER-2:05284] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>> [SERVER-2:05284] 9 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
>> [SERVER-2:05284] 9 more processes have sent help message help-mpi-runtime.txt / ompi mpi abort:cannot guarantee all killed
>> [SERVER-2:05284] 2 more processes have sent help message help-mca-bml-r2.txt / unreachable proc
>> [SERVER-2:05284] 2 more processes have sent help message help-mpi-runtime / mpi_init:startup:pml-add-procs-fail
>> [SERVER-2:05284] sess_dir_finalize: job session dir not empty - leaving
>> exiting with status 1
>>
>> //******************************************************************
>>
>> Any feedback will be helpful. Thank you!
>>
>> Mr. Beans
>> _______________________________________________
>> users mailing list
>> users_at_[hidden] <mailto:users_at_[hidden]>
>> http://www.open-mpi.org/mailman/listinfo.cgi/users