Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] ERROR: At least one pair of MPI processes are unable to reach each other for MPI communications.
From: RoboBeans (robobeans_at_[hidden])
Date: 2013-08-03 19:06:58


Hi Ralph, I tried using 1.5.4, 1.6.5, and 1.7.2 (compiled from source
with no configure arguments), but I am facing the same issue. When I run
a job using 1.5.4 (installed using yum), I get warnings, but they don't
affect my output.

An example of the warning I get:

sv-2.7960ipath_userinit: Mismatched user minor version (12) and driver
minor version (11) while context sharing. Ensure that driver and library
are from the same release.

Each system has a QLogic card (QLE7342-CK dual-port IB card) and runs
the same OS, but the kernel revisions differ (e.g.
2.6.32-358.2.1.el6.x86_64 vs. 2.6.32-358.el6.x86_64).
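
In case it is useful, this is how I compare the PSM library and driver
versions on each node (the package and module names below are my best
guess for our QLogic setup on CentOS 6, so adjust as needed):

$ rpm -qa | grep -i -e infinipath -e psm   # user-space PSM/InfiniPath packages
$ modinfo ib_qib | grep -i version         # kernel driver version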

Thank you for your time.

On 8/3/13 2:05 PM, Ralph Castain wrote:
> Hmmm...strange indeed. I would remove those four configure options and
> give it a try. That will eliminate all the obvious things, I would
> think, though they aren't generally involved in the issue shown here.
> Still, worth taking out potential trouble sources.
>
> What is the connectivity between SERVER-2 and node 100? Should I
> assume that the first seven nodes are connected via one type of
> interconnect, and the other four are connected to those seven by
> another type?
>
>
> On Aug 3, 2013, at 1:30 PM, RoboBeans <robobeans_at_[hidden]> wrote:
>
>> Thanks for looking into it, Ralph. I modified the hosts file but I am
>> still getting the same error. Any other pointers you can think of?
>> The difference between this 1.7.2 installation and 1.5.4 is that I
>> installed 1.5.4 using yum, while I built 1.7.2 from source and
>> configured it with --enable-event-thread-support
>> --enable-opal-multi-threads --enable-orte-progress-threads
>> --enable-mpi-thread-multiple. Am I missing something here?
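>>
>> To double-check which thread options a given build actually ended up
>> with, I look at the ompi_info output; the grep below is just a
>> convenience and the exact wording varies by release:
>>
>> $ ompi_info | grep -i thread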
>>
>> //******************************************************************
>>
>> $ cat mpi_hostfile
>>
>> x.x.x.22 slots=15 max-slots=15
>> x.x.x.24 slots=2 max-slots=2
>> x.x.x.26 slots=14 max-slots=14
>> x.x.x.28 slots=16 max-slots=16
>> x.x.x.29 slots=14 max-slots=14
>> x.x.x.30 slots=16 max-slots=16
>> x.x.x.41 slots=46 max-slots=46
>> x.x.x.101 slots=46 max-slots=46
>> x.x.x.100 slots=46 max-slots=46
>> x.x.x.102 slots=22 max-slots=22
>> x.x.x.103 slots=22 max-slots=22
>>
>> //******************************************************************
>> $ mpirun -d --display-map -np 10 --hostfile mpi_hostfile --bynode ./test
>>
>> [SERVER-2:08907] procdir:
>> /tmp/openmpi-sessions-mpidemo_at_SERVER-2_0/62216/0/0
>> [SERVER-2:08907] jobdir: /tmp/openmpi-sessions-mpidemo_at_SERVER-2_0/62216/0
>> [SERVER-2:08907] top: openmpi-sessions-mpidemo_at_SERVER-2_0
>> [SERVER-2:08907] tmp: /tmp
>> CentOS release 6.4 (Final)
>> Kernel \r on an \m
>> CentOS release 6.4 (Final)
>> Kernel \r on an \m
>> CentOS release 6.4 (Final)
>> Kernel \r on an \m
>> [SERVER-3:32517] procdir:
>> /tmp/openmpi-sessions-mpidemo_at_SERVER-3_0/62216/0/1
>> [SERVER-3:32517] jobdir: /tmp/openmpi-sessions-mpidemo_at_SERVER-3_0/62216/0
>> [SERVER-3:32517] top: openmpi-sessions-mpidemo_at_SERVER-3_0
>> [SERVER-3:32517] tmp: /tmp
>> CentOS release 6.4 (Final)
>> Kernel \r on an \m
>> CentOS release 6.4 (Final)
>> Kernel \r on an \m
>> [SERVER-6:11595] procdir:
>> /tmp/openmpi-sessions-mpidemo_at_SERVER-6_0/62216/0/4
>> [SERVER-6:11595] jobdir: /tmp/openmpi-sessions-mpidemo_at_SERVER-6_0/62216/0
>> [SERVER-6:11595] top: openmpi-sessions-mpidemo_at_SERVER-6_0
>> [SERVER-6:11595] tmp: /tmp
>> [SERVER-4:27445] procdir:
>> /tmp/openmpi-sessions-mpidemo_at_SERVER-4_0/62216/0/2
>> [SERVER-4:27445] jobdir: /tmp/openmpi-sessions-mpidemo_at_SERVER-4_0/62216/0
>> [SERVER-4:27445] top: openmpi-sessions-mpidemo_at_SERVER-4_0
>> [SERVER-4:27445] tmp: /tmp
>> [SERVER-7:02607] procdir:
>> /tmp/openmpi-sessions-mpidemo_at_SERVER-7_0/62216/0/5
>> [SERVER-7:02607] jobdir: /tmp/openmpi-sessions-mpidemo_at_SERVER-7_0/62216/0
>> [SERVER-7:02607] top: openmpi-sessions-mpidemo_at_SERVER-7_0
>> [SERVER-7:02607] tmp: /tmp
>> [sv-1:46100] procdir: /tmp/openmpi-sessions-mpidemo_at_sv-1_0/62216/0/8
>> [sv-1:46100] jobdir: /tmp/openmpi-sessions-mpidemo_at_sv-1_0/62216/0
>> [sv-1:46100] top: openmpi-sessions-mpidemo_at_sv-1_0
>> [sv-1:46100] tmp: /tmp
>> CentOS release 6.4 (Final)
>> Kernel \r on an \m
>> [SERVER-5:16404] procdir:
>> /tmp/openmpi-sessions-mpidemo_at_SERVER-5_0/62216/0/3
>> [SERVER-5:16404] jobdir: /tmp/openmpi-sessions-mpidemo_at_SERVER-5_0/62216/0
>> [SERVER-5:16404] top: openmpi-sessions-mpidemo_at_SERVER-5_0
>> [SERVER-5:16404] tmp: /tmp
>> [sv-3:08575] procdir: /tmp/openmpi-sessions-mpidemo_at_sv-3_0/62216/0/9
>> [sv-3:08575] jobdir: /tmp/openmpi-sessions-mpidemo_at_sv-3_0/62216/0
>> [sv-3:08575] top: openmpi-sessions-mpidemo_at_sv-3_0
>> [sv-3:08575] tmp: /tmp
>> [SERVER-14:10755] procdir:
>> /tmp/openmpi-sessions-mpidemo_at_SERVER-14_0/62216/0/6
>> [SERVER-14:10755] jobdir:
>> /tmp/openmpi-sessions-mpidemo_at_SERVER-14_0/62216/0
>> [SERVER-14:10755] top: openmpi-sessions-mpidemo_at_SERVER-14_0
>> [SERVER-14:10755] tmp: /tmp
>> [sv-4:12040] procdir: /tmp/openmpi-sessions-mpidemo_at_sv-4_0/62216/0/10
>> [sv-4:12040] jobdir: /tmp/openmpi-sessions-mpidemo_at_sv-4_0/62216/0
>> [sv-4:12040] top: openmpi-sessions-mpidemo_at_sv-4_0
>> [sv-4:12040] tmp: /tmp
>> [sv-2:07725] procdir: /tmp/openmpi-sessions-mpidemo_at_sv-2_0/62216/0/7
>> [sv-2:07725] jobdir: /tmp/openmpi-sessions-mpidemo_at_sv-2_0/62216/0
>> [sv-2:07725] top: openmpi-sessions-mpidemo_at_sv-2_0
>> [sv-2:07725] tmp: /tmp
>>
>> Mapper requested: NULL Last mapper: round_robin Mapping policy:
>> BYNODE Ranking policy: NODE Binding policy: NONE[NODE] Cpu set:
>> NULL PPR: NULL
>> Num new daemons: 0 New daemon starting vpid INVALID
>> Num nodes: 10
>>
>> Data for node: SERVER-2 Launch id: -1 State: 2
>> Daemon: [[62216,0],0] Daemon launched: True
>> Num slots: 15 Slots in use: 1 Oversubscribed: FALSE
>> Num slots allocated: 15 Max slots: 15
>> Username on node: NULL
>> Num procs: 1 Next node_rank: 1
>> Data for proc: [[62216,1],0]
>> Pid: 0 Local rank: 0 Node rank: 0 App rank: 0
>> State: INITIALIZED Restarts: 0 App_context: 0 Locale:
>> 0-15 Binding: NULL[0]
>>
>> Data for node: x.x.x.24 Launch id: -1 State: 0
>> Daemon: [[62216,0],1] Daemon launched: False
>> Num slots: 2 Slots in use: 1 Oversubscribed: FALSE
>> Num slots allocated: 2 Max slots: 2
>> Username on node: NULL
>> Num procs: 1 Next node_rank: 1
>> Data for proc: [[62216,1],1]
>> Pid: 0 Local rank: 0 Node rank: 0 App rank: 1
>> State: INITIALIZED Restarts: 0 App_context: 0 Locale:
>> 0-7 Binding: NULL[0]
>>
>> Data for node: x.x.x.26 Launch id: -1 State: 0
>> Daemon: [[62216,0],2] Daemon launched: False
>> Num slots: 14 Slots in use: 1 Oversubscribed: FALSE
>> Num slots allocated: 14 Max slots: 14
>> Username on node: NULL
>> Num procs: 1 Next node_rank: 1
>> Data for proc: [[62216,1],2]
>> Pid: 0 Local rank: 0 Node rank: 0 App rank: 2
>> State: INITIALIZED Restarts: 0 App_context: 0 Locale:
>> 0-7 Binding: NULL[0]
>>
>> Data for node: x.x.x.28 Launch id: -1 State: 0
>> Daemon: [[62216,0],3] Daemon launched: False
>> Num slots: 16 Slots in use: 1 Oversubscribed: FALSE
>> Num slots allocated: 16 Max slots: 16
>> Username on node: NULL
>> Num procs: 1 Next node_rank: 1
>> Data for proc: [[62216,1],3]
>> Pid: 0 Local rank: 0 Node rank: 0 App rank: 3
>> State: INITIALIZED Restarts: 0 App_context: 0 Locale:
>> 0-7 Binding: NULL[0]
>>
>> Data for node: x.x.x.29 Launch id: -1 State: 0
>> Daemon: [[62216,0],4] Daemon launched: False
>> Num slots: 14 Slots in use: 1 Oversubscribed: FALSE
>> Num slots allocated: 14 Max slots: 14
>> Username on node: NULL
>> Num procs: 1 Next node_rank: 1
>> Data for proc: [[62216,1],4]
>> Pid: 0 Local rank: 0 Node rank: 0 App rank: 4
>> State: INITIALIZED Restarts: 0 App_context: 0 Locale:
>> 0-7 Binding: NULL[0]
>>
>> Data for node: x.x.x.30 Launch id: -1 State: 0
>> Daemon: [[62216,0],5] Daemon launched: False
>> Num slots: 16 Slots in use: 1 Oversubscribed: FALSE
>> Num slots allocated: 16 Max slots: 16
>> Username on node: NULL
>> Num procs: 1 Next node_rank: 1
>> Data for proc: [[62216,1],5]
>> Pid: 0 Local rank: 0 Node rank: 0 App rank: 5
>> State: INITIALIZED Restarts: 0 App_context: 0 Locale:
>> 0-7 Binding: NULL[0]
>>
>> Data for node: x.x.x.41 Launch id: -1 State: 0
>> Daemon: [[62216,0],6] Daemon launched: False
>> Num slots: 46 Slots in use: 1 Oversubscribed: FALSE
>> Num slots allocated: 46 Max slots: 46
>> Username on node: NULL
>> Num procs: 1 Next node_rank: 1
>> Data for proc: [[62216,1],6]
>> Pid: 0 Local rank: 0 Node rank: 0 App rank: 6
>> State: INITIALIZED Restarts: 0 App_context: 0 Locale:
>> 0-7 Binding: NULL[0]
>>
>> Data for node: x.x.x.101 Launch id: -1 State: 0
>> Daemon: [[62216,0],7] Daemon launched: False
>> Num slots: 46 Slots in use: 1 Oversubscribed: FALSE
>> Num slots allocated: 46 Max slots: 46
>> Username on node: NULL
>> Num procs: 1 Next node_rank: 1
>> Data for proc: [[62216,1],7]
>> Pid: 0 Local rank: 0 Node rank: 0 App rank: 7
>> State: INITIALIZED Restarts: 0 App_context: 0 Locale:
>> 0-7 Binding: NULL[0]
>>
>> Data for node: x.x.x.100 Launch id: -1 State: 0
>> Daemon: [[62216,0],8] Daemon launched: False
>> Num slots: 46 Slots in use: 1 Oversubscribed: FALSE
>> Num slots allocated: 46 Max slots: 46
>> Username on node: NULL
>> Num procs: 1 Next node_rank: 1
>> Data for proc: [[62216,1],8]
>> Pid: 0 Local rank: 0 Node rank: 0 App rank: 8
>> State: INITIALIZED Restarts: 0 App_context: 0 Locale:
>> 0-7 Binding: NULL[0]
>>
>> Data for node: x.x.x.102 Launch id: -1 State: 0
>> Daemon: [[62216,0],9] Daemon launched: False
>> Num slots: 22 Slots in use: 1 Oversubscribed: FALSE
>> Num slots allocated: 22 Max slots: 22
>> Username on node: NULL
>> Num procs: 1 Next node_rank: 1
>> Data for proc: [[62216,1],9]
>> Pid: 0 Local rank: 0 Node rank: 0 App rank: 9
>> State: INITIALIZED Restarts: 0 App_context: 0 Locale:
>> 0-7 Binding: NULL[0]
>> [sv-1:46111] procdir: /tmp/openmpi-sessions-mpidemo_at_sv-1_0/62216/1/8
>> [sv-1:46111] jobdir: /tmp/openmpi-sessions-mpidemo_at_sv-1_0/62216/1
>> [sv-1:46111] top: openmpi-sessions-mpidemo_at_sv-1_0
>> [sv-1:46111] tmp: /tmp
>> [SERVER-14:10768] procdir:
>> /tmp/openmpi-sessions-mpidemo_at_SERVER-14_0/62216/1/6
>> [SERVER-14:10768] jobdir:
>> /tmp/openmpi-sessions-mpidemo_at_SERVER-14_0/62216/1
>> [SERVER-14:10768] top: openmpi-sessions-mpidemo_at_SERVER-14_0
>> [SERVER-14:10768] tmp: /tmp
>> [SERVER-2:08912] procdir:
>> /tmp/openmpi-sessions-mpidemo_at_SERVER-2_0/62216/1/0
>> [SERVER-2:08912] jobdir: /tmp/openmpi-sessions-mpidemo_at_SERVER-2_0/62216/1
>> [SERVER-2:08912] top: openmpi-sessions-mpidemo_at_SERVER-2_0
>> [SERVER-2:08912] tmp: /tmp
>> [SERVER-4:27460] procdir:
>> /tmp/openmpi-sessions-mpidemo_at_SERVER-4_0/62216/1/2
>> [SERVER-4:27460] jobdir: /tmp/openmpi-sessions-mpidemo_at_SERVER-4_0/62216/1
>> [SERVER-4:27460] top: openmpi-sessions-mpidemo_at_SERVER-4_0
>> [SERVER-4:27460] tmp: /tmp
>> [SERVER-6:11608] procdir:
>> /tmp/openmpi-sessions-mpidemo_at_SERVER-6_0/62216/1/4
>> [SERVER-6:11608] jobdir: /tmp/openmpi-sessions-mpidemo_at_SERVER-6_0/62216/1
>> [SERVER-6:11608] top: openmpi-sessions-mpidemo_at_SERVER-6_0
>> [SERVER-6:11608] tmp: /tmp
>> [SERVER-7:02620] procdir:
>> /tmp/openmpi-sessions-mpidemo_at_SERVER-7_0/62216/1/5
>> [SERVER-7:02620] jobdir: /tmp/openmpi-sessions-mpidemo_at_SERVER-7_0/62216/1
>> [SERVER-7:02620] top: openmpi-sessions-mpidemo_at_SERVER-7_0
>> [SERVER-7:02620] tmp: /tmp
>> [sv-3:08586] procdir: /tmp/openmpi-sessions-mpidemo_at_sv-3_0/62216/1/9
>> [sv-3:08586] jobdir: /tmp/openmpi-sessions-mpidemo_at_sv-3_0/62216/1
>> [sv-3:08586] top: openmpi-sessions-mpidemo_at_sv-3_0
>> [sv-3:08586] tmp: /tmp
>> [sv-2:07736] procdir: /tmp/openmpi-sessions-mpidemo_at_sv-2_0/62216/1/7
>> [sv-2:07736] jobdir: /tmp/openmpi-sessions-mpidemo_at_sv-2_0/62216/1
>> [sv-2:07736] top: openmpi-sessions-mpidemo_at_sv-2_0
>> [sv-2:07736] tmp: /tmp
>> [SERVER-5:16418] procdir:
>> /tmp/openmpi-sessions-mpidemo_at_SERVER-5_0/62216/1/3
>> [SERVER-5:16418] jobdir: /tmp/openmpi-sessions-mpidemo_at_SERVER-5_0/62216/1
>> [SERVER-5:16418] top: openmpi-sessions-mpidemo_at_SERVER-5_0
>> [SERVER-5:16418] tmp: /tmp
>> [SERVER-3:32533] procdir:
>> /tmp/openmpi-sessions-mpidemo_at_SERVER-3_0/62216/1/1
>> [SERVER-3:32533] jobdir: /tmp/openmpi-sessions-mpidemo_at_SERVER-3_0/62216/1
>> [SERVER-3:32533] top: openmpi-sessions-mpidemo_at_SERVER-3_0
>> [SERVER-3:32533] tmp: /tmp
>> MPIR_being_debugged = 0
>> MPIR_debug_state = 1
>> MPIR_partial_attach_ok = 1
>> MPIR_i_am_starter = 0
>> MPIR_forward_output = 0
>> MPIR_proctable_size = 10
>> MPIR_proctable:
>> (i, host, exe, pid) = (0, SERVER-2,
>> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 8912)
>> (i, host, exe, pid) = (1, x.x.x.24,
>> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 32533)
>> (i, host, exe, pid) = (2, x.x.x.26,
>> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 27460)
>> (i, host, exe, pid) = (3, x.x.x.28,
>> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 16418)
>> (i, host, exe, pid) = (4, x.x.x.29,
>> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 11608)
>> (i, host, exe, pid) = (5, x.x.x.30,
>> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 2620)
>> (i, host, exe, pid) = (6, x.x.x.41,
>> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 10768)
>> (i, host, exe, pid) = (7, x.x.x.101,
>> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 7736)
>> (i, host, exe, pid) = (8, x.x.x.100,
>> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 46111)
>> (i, host, exe, pid) = (9, x.x.x.102,
>> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 8586)
>> MPIR_executable_path: NULL
>> MPIR_server_arguments: NULL
>> --------------------------------------------------------------------------
>> It looks like MPI_INIT failed for some reason; your parallel process is
>> likely to abort. There are many reasons that a parallel process can
>> fail during MPI_INIT; some of which are due to configuration or
>> environment
>> problems. This failure appears to be an internal failure; here's some
>> additional information (which may only be relevant to an Open MPI
>> developer):
>>
>> PML add procs failed
>> --> Returned "Error" (-1) instead of "Success" (0)
>> --------------------------------------------------------------------------
>> [SERVER-2:8912] *** An error occurred in MPI_Init
>> [SERVER-2:8912] *** reported by process [140393673392129,140389596004352]
>> [SERVER-2:8912] *** on a NULL communicator
>> [SERVER-2:8912] *** Unknown error
>> [SERVER-2:8912] *** MPI_ERRORS_ARE_FATAL (processes in this
>> communicator will now abort,
>> [SERVER-2:8912] *** and potentially your MPI job)
>> --------------------------------------------------------------------------
>> An MPI process is aborting at a time when it cannot guarantee that all
>> of its peer processes in the job will be killed properly. You should
>> double check that everything has shut down cleanly.
>>
>> Reason: Before MPI_INIT completed
>> Local host: SERVER-2
>> PID: 8912
>> --------------------------------------------------------------------------
>> [sv-1][[62216,1],8][btl_openib_proc.c:157:mca_btl_openib_proc_create]
>> [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[62216,1],0]
>> [sv-1][[62216,1],8][btl_tcp_proc.c:128:mca_btl_tcp_proc_create]
>> mca_base_modex_recv: failed with return value=-13
>> [sv-1][[62216,1],8][btl_tcp_proc.c:128:mca_btl_tcp_proc_create]
>> mca_base_modex_recv: failed with return value=-13
>> --------------------------------------------------------------------------
>> At least one pair of MPI processes are unable to reach each other for
>> MPI communications. This means that no Open MPI device has indicated
>> that it can be used to communicate between these processes. This is
>> an error; Open MPI requires that all MPI processes be able to reach
>> each other. This error can sometimes be the result of forgetting to
>> specify the "self" BTL.
>>
>> Process 1 ([[62216,1],8]) is on host: sv-1
>> Process 2 ([[62216,1],0]) is on host: SERVER-2
>> BTLs attempted: openib self sm tcp
>>
>> Your MPI job is now going to abort; sorry.
>> --------------------------------------------------------------------------
>> [sv-3][[62216,1],9][btl_openib_proc.c:157:mca_btl_openib_proc_create]
>> [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[62216,1],0]
>> [sv-3][[62216,1],9][btl_tcp_proc.c:128:mca_btl_tcp_proc_create]
>> mca_base_modex_recv: failed with return value=-13
>> [sv-3][[62216,1],9][btl_tcp_proc.c:128:mca_btl_tcp_proc_create]
>> mca_base_modex_recv: failed with return value=-13
>> --------------------------------------------------------------------------
>> MPI_INIT has failed because at least one MPI process is unreachable
>> from another. This *usually* means that an underlying communication
>> plugin -- such as a BTL or an MTL -- has either not loaded or not
>> allowed itself to be used. Your MPI job will now abort.
>>
>> You may wish to try to narrow down the problem;
>>
>> * Check the output of ompi_info to see which BTL/MTL plugins are
>> available.
>> * Run your application with MPI_THREAD_SINGLE.
>> * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
>> if using MTL-based communications) to see exactly which
>> communication plugins were considered and/or discarded.
>> --------------------------------------------------------------------------
>> [sv-2][[62216,1],7][btl_openib_proc.c:157:mca_btl_openib_proc_create]
>> [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[62216,1],0]
>> [sv-2][[62216,1],7][btl_tcp_proc.c:128:mca_btl_tcp_proc_create]
>> mca_base_modex_recv: failed with return value=-13
>> [sv-2][[62216,1],7][btl_tcp_proc.c:128:mca_btl_tcp_proc_create]
>> mca_base_modex_recv: failed with return value=-13
>> [SERVER-2:08907] sess_dir_finalize: proc session dir not empty - leaving
>> [sv-4:12040] sess_dir_finalize: job session dir not empty - leaving
>> [SERVER-14:10755] sess_dir_finalize: job session dir not empty - leaving
>> [SERVER-2:08907] sess_dir_finalize: proc session dir not empty - leaving
>> [SERVER-6:11595] sess_dir_finalize: proc session dir not empty - leaving
>> [SERVER-6:11595] sess_dir_finalize: proc session dir not empty - leaving
>> [SERVER-4:27445] sess_dir_finalize: proc session dir not empty - leaving
>> exiting with status 0
>> [SERVER-4:27445] sess_dir_finalize: proc session dir not empty - leaving
>> [SERVER-6:11595] sess_dir_finalize: job session dir not empty - leaving
>> [SERVER-7:02607] sess_dir_finalize: proc session dir not empty - leaving
>> [SERVER-7:02607] sess_dir_finalize: proc session dir not empty - leaving
>> [SERVER-7:02607] sess_dir_finalize: job session dir not empty - leaving
>> [SERVER-5:16404] sess_dir_finalize: proc session dir not empty - leaving
>> [SERVER-5:16404] sess_dir_finalize: proc session dir not empty - leaving
>> exiting with status 0
>> exiting with status 0
>> exiting with status 0
>> [SERVER-4:27445] sess_dir_finalize: job session dir not empty - leaving
>> exiting with status 0
>> [SERVER-3:32517] sess_dir_finalize: proc session dir not empty - leaving
>> [SERVER-3:32517] sess_dir_finalize: proc session dir not empty - leaving
>> [sv-3:08575] sess_dir_finalize: proc session dir not empty - leaving
>> [sv-3:08575] sess_dir_finalize: job session dir not empty - leaving
>> exiting with status 0
>> [sv-1:46100] sess_dir_finalize: proc session dir not empty - leaving
>> [sv-1:46100] sess_dir_finalize: job session dir not empty - leaving
>> exiting with status 0
>> [sv-2:07725] sess_dir_finalize: proc session dir not empty - leaving
>> [sv-2:07725] sess_dir_finalize: job session dir not empty - leaving
>> exiting with status 0
>> [SERVER-5:16404] sess_dir_finalize: job session dir not empty - leaving
>> exiting with status 0
>> [SERVER-3:32517] sess_dir_finalize: job session dir not empty - leaving
>> exiting with status 0
>> --------------------------------------------------------------------------
>> mpirun has exited due to process rank 6 with PID 10768 on
>> node x.x.x.41 exiting improperly. There are three reasons this could
>> occur:
>>
>> 1. this process did not call "init" before exiting, but others in
>> the job did. This can cause a job to hang indefinitely while it waits
>> for all processes to call "init". By rule, if one process calls "init",
>> then ALL processes must call "init" prior to termination.
>>
>> 2. this process called "init", but exited without calling "finalize".
>> By rule, all processes that call "init" MUST call "finalize" prior to
>> exiting or it will be considered an "abnormal termination"
>>
>> 3. this process called "MPI_Abort" or "orte_abort" and the mca parameter
>> orte_create_session_dirs is set to false. In this case, the run-time
>> cannot
>> detect that the abort call was an abnormal termination. Hence, the only
>> error message you will receive is this one.
>>
>> This may have caused other processes in the application to be
>> terminated by signals sent by mpirun (as reported here).
>>
>> You can avoid this message by specifying -quiet on the mpirun command
>> line.
>>
>> --------------------------------------------------------------------------
>> [SERVER-2:08907] 6 more processes have sent help message
>> help-mpi-runtime / mpi_init:startup:internal-failure
>> [SERVER-2:08907] Set MCA parameter "orte_base_help_aggregate" to 0 to
>> see all help / error messages
>> [SERVER-2:08907] 9 more processes have sent help message
>> help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
>> [SERVER-2:08907] 9 more processes have sent help message
>> help-mpi-runtime.txt / ompi mpi abort:cannot guarantee all killed
>> [SERVER-2:08907] 2 more processes have sent help message
>> help-mca-bml-r2.txt / unreachable proc
>> [SERVER-2:08907] 2 more processes have sent help message
>> help-mpi-runtime / mpi_init:startup:pml-add-procs-fail
>> [SERVER-2:08907] sess_dir_finalize: job session dir not empty - leaving
>> exiting with status 1
>>
>> //******************************************************************
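>>
>> For the record, the diagnostic run suggested by the help text above
>> would look something like this (btl_base_verbose is a standard MCA
>> parameter; the rest of the command line is unchanged):
>>
>> $ mpirun --mca btl_base_verbose 100 -d --display-map -np 10 --hostfile mpi_hostfile --bynode ./test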
>>
>> On 8/3/13 4:34 AM, Ralph Castain wrote:
>>> It looks like SERVER-2 cannot talk to your x.x.x.100 machine. I note
>>> that you have some entries at the end of the hostfile that I don't
>>> understand - a list of hosts that can be reached? And I see that
>>> your x.x.x.22 machine isn't on it. Is that SERVER-2 by chance?
>>>
>>> Our hostfile parsing changed between the release series, but I know
>>> we never consciously supported the syntax you show below where you
>>> list capabilities, and then re-list the hosts in an apparent attempt
>>> to filter which ones can actually be used. It is possible that the
>>> 1.5 series somehow used that to exclude the 22 machine, and that the
>>> 1.7 parser now doesn't do that.
>>>
>>> If you only include machines you actually intend to use in your
>>> hostfile, does the 1.7 series work?
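>>>
>>> In other words, a trimmed hostfile along these lines, keeping only the
>>> slots lines and dropping the trailing list of bare hostnames (a sketch
>>> based on the file you posted):
>>>
>>> x.x.x.22 slots=15 max-slots=15
>>> x.x.x.24 slots=2 max-slots=2
>>> (and so on, one slots line per host you actually intend to use)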
>>>
>>> On Aug 3, 2013, at 3:58 AM, RoboBeans <robobeans_at_[hidden]> wrote:
>>>
>>>> Hello everyone,
>>>>
>>>> I have installed Open MPI 1.5.4 on an 11-node cluster using "yum
>>>> install openmpi openmpi-devel" and everything seems to be working
>>>> fine. For testing I am using this test program:
>>>>
>>>> //******************************************************************
>>>>
>>>> $ cat test.cpp
>>>>
>>>> #include <stdio.h>
>>>> #include <mpi.h>
>>>>
>>>> int main (int argc, char *argv[])
>>>> {
>>>>     int id, np;
>>>>     char name[MPI_MAX_PROCESSOR_NAME];
>>>>     int namelen;
>>>>
>>>>     MPI_Init (&argc, &argv);
>>>>
>>>>     MPI_Comm_size (MPI_COMM_WORLD, &np);
>>>>     MPI_Comm_rank (MPI_COMM_WORLD, &id);
>>>>     MPI_Get_processor_name (name, &namelen);
>>>>
>>>>     printf ("This is Process %2d out of %2d running on host %s\n",
>>>>             id, np, name);
>>>>
>>>>     MPI_Finalize ();
>>>>
>>>>     return (0);
>>>> }
>>>>
>>>> //******************************************************************
>>>>
>>>> and my hostfile looks like this:
>>>>
>>>> $ cat mpi_hostfile
>>>>
>>>> # The Hostfile for Open MPI
>>>>
>>>> # specify number of slots for processes to run locally.
>>>> #localhost slots=12
>>>> #x.x.x.16 slots=12 max-slots=12
>>>> #x.x.x.17 slots=12 max-slots=12
>>>> #x.x.x.18 slots=12 max-slots=12
>>>> #x.x.x.19 slots=12 max-slots=12
>>>> #x.x.x.20 slots=12 max-slots=12
>>>> #x.x.x.55 slots=46 max-slots=46
>>>> #x.x.x.56 slots=46 max-slots=46
>>>>
>>>> x.x.x.22 slots=15 max-slots=15
>>>> x.x.x.24 slots=2 max-slots=2
>>>> x.x.x.26 slots=14 max-slots=14
>>>> x.x.x.28 slots=16 max-slots=16
>>>> x.x.x.29 slots=14 max-slots=14
>>>> x.x.x.30 slots=16 max-slots=16
>>>> x.x.x.41 slots=46 max-slots=46
>>>> x.x.x.101 slots=46 max-slots=46
>>>> x.x.x.100 slots=46 max-slots=46
>>>> x.x.x.102 slots=22 max-slots=22
>>>> x.x.x.103 slots=22 max-slots=22
>>>>
>>>> # The following slave nodes are available to this machine:
>>>> x.x.x.24
>>>> x.x.x.26
>>>> x.x.x.28
>>>> x.x.x.29
>>>> x.x.x.30
>>>> x.x.x.41
>>>> x.x.x.101
>>>> x.x.x.100
>>>> x.x.x.102
>>>> x.x.x.103
>>>>
>>>> //******************************************************************
>>>>
>>>> this is what my .bashrc looks like on each node:
>>>>
>>>> $ cat ~/.bashrc
>>>>
>>>> # .bashrc
>>>>
>>>> # Source global definitions
>>>> if [ -f /etc/bashrc ]; then
>>>> . /etc/bashrc
>>>> fi
>>>>
>>>> # User specific aliases and functions
>>>> umask 077
>>>>
>>>> export PSM_SHAREDCONTEXTS_MAX=20
>>>>
>>>> #export PATH=/usr/lib64/openmpi/bin${PATH:+:$PATH}
>>>> export PATH=/usr/OPENMPI/openmpi-1.7.2/bin${PATH:+:$PATH}
>>>>
>>>> #export
>>>> LD_LIBRARY_PATH=/usr/lib64/openmpi/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
>>>> export
>>>> LD_LIBRARY_PATH=/usr/OPENMPI/openmpi-1.7.2/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
>>>>
>>>> export PATH="$PATH":/bin/:/usr/lib/:/usr/lib:/usr:/usr/
>>>>
>>>> //******************************************************************
>>>>
>>>> $ mpic++ test.cpp -o test
>>>>
>>>> $ mpirun -d --display-map -np 10 --hostfile mpi_hostfile --bynode ./test
>>>>
>>>> //******************************************************************
>>>>
>>>> These nodes are running the 2.6.32-358.2.1.el6.x86_64 kernel:
>>>>
>>>> $ uname
>>>> Linux
>>>> $ uname -r
>>>> 2.6.32-358.2.1.el6.x86_64
>>>> $ cat /etc/issue
>>>> CentOS release 6.4 (Final)
>>>> Kernel \r on an \m
>>>>
>>>> //******************************************************************
>>>>
>>>> Now, if I install Open MPI 1.7.2 on each node separately, I can only
>>>> use it on either the first 7 nodes or the last 4 nodes, but not on
>>>> all of them.
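>>>>
>>>> One sanity check I can run is to force plain TCP so that the openib
>>>> path is out of the picture ("btl" is a standard MCA parameter, and
>>>> "tcp,self" restricts Open MPI to just those two BTLs):
>>>>
>>>> $ mpirun --mca btl tcp,self -np 10 --hostfile mpi_hostfile --bynode ./test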
>>>>
>>>> //******************************************************************
>>>>
>>>> $ gunzip -c openmpi-1.7.2.tar.gz | tar xf -
>>>>
>>>> $ cd openmpi-1.7.2
>>>>
>>>> $ ./configure --prefix=/usr/OPENMPI/openmpi-1.7.2 \
>>>>       --enable-event-thread-support --enable-opal-multi-threads \
>>>>       --enable-orte-progress-threads --enable-mpi-thread-multiple
>>>>
>>>> $ make all install
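>>>>
>>>> Since the cluster now mixes a yum-installed 1.5.4 with this source
>>>> build, I also check on every node that the same installation gets
>>>> picked up (both commands are standard):
>>>>
>>>> $ which mpirun
>>>> $ mpirun --version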
>>>>
>>>> //******************************************************************
>>>>
>>>> This is the error message that I am receiving:
>>>>
>>>> $ mpirun -d --display-map -np 10 --hostfile mpi_hostfile --bynode ./test
>>>>
>>>> [SERVER-2:05284] procdir:
>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-2_0/50535/0/0
>>>> [SERVER-2:05284] jobdir:
>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-2_0/50535/0
>>>> [SERVER-2:05284] top: openmpi-sessions-mpidemo_at_SERVER-2_0
>>>> [SERVER-2:05284] tmp: /tmp
>>>> CentOS release 6.4 (Final)
>>>> Kernel \r on an \m
>>>> CentOS release 6.4 (Final)
>>>> Kernel \r on an \m
>>>> CentOS release 6.4 (Final)
>>>> Kernel \r on an \m
>>>> [SERVER-3:28993] procdir:
>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-3_0/50535/0/1
>>>> [SERVER-3:28993] jobdir:
>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-3_0/50535/0
>>>> [SERVER-3:28993] top: openmpi-sessions-mpidemo_at_SERVER-3_0
>>>> [SERVER-3:28993] tmp: /tmp
>>>> CentOS release 6.4 (Final)
>>>> Kernel \r on an \m
>>>> CentOS release 6.4 (Final)
>>>> Kernel \r on an \m
>>>> [SERVER-6:09087] procdir:
>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-6_0/50535/0/4
>>>> [SERVER-6:09087] jobdir:
>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-6_0/50535/0
>>>> [SERVER-6:09087] top: openmpi-sessions-mpidemo_at_SERVER-6_0
>>>> [SERVER-6:09087] tmp: /tmp
>>>> [SERVER-7:32563] procdir:
>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-7_0/50535/0/5
>>>> [SERVER-7:32563] jobdir:
>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-7_0/50535/0
>>>> [SERVER-7:32563] top: openmpi-sessions-mpidemo_at_SERVER-7_0
>>>> [SERVER-7:32563] tmp: /tmp
>>>> [SERVER-4:15711] procdir:
>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-4_0/50535/0/2
>>>> [SERVER-4:15711] jobdir:
>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-4_0/50535/0
>>>> [SERVER-4:15711] top: openmpi-sessions-mpidemo_at_SERVER-4_0
>>>> [SERVER-4:15711] tmp: /tmp
>>>> [sv-1:45701] procdir: /tmp/openmpi-sessions-mpidemo_at_sv-1_0/50535/0/8
>>>> [sv-1:45701] jobdir: /tmp/openmpi-sessions-mpidemo_at_sv-1_0/50535/0
>>>> [sv-1:45701] top: openmpi-sessions-mpidemo_at_sv-1_0
>>>> [sv-1:45701] tmp: /tmp
>>>> CentOS release 6.4 (Final)
>>>> Kernel \r on an \m
>>>> [sv-3:08352] procdir: /tmp/openmpi-sessions-mpidemo_at_sv-3_0/50535/0/9
>>>> [sv-3:08352] jobdir: /tmp/openmpi-sessions-mpidemo_at_sv-3_0/50535/0
>>>> [sv-3:08352] top: openmpi-sessions-mpidemo_at_sv-3_0
>>>> [sv-3:08352] tmp: /tmp
>>>> [SERVER-5:12534] procdir:
>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-5_0/50535/0/3
>>>> [SERVER-5:12534] jobdir:
>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-5_0/50535/0
>>>> [SERVER-5:12534] top: openmpi-sessions-mpidemo_at_SERVER-5_0
>>>> [SERVER-5:12534] tmp: /tmp
>>>> [SERVER-14:08399] procdir:
>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-14_0/50535/0/6
>>>> [SERVER-14:08399] jobdir:
>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-14_0/50535/0
>>>> [SERVER-14:08399] top: openmpi-sessions-mpidemo_at_SERVER-14_0
>>>> [SERVER-14:08399] tmp: /tmp
>>>> [sv-4:11802] procdir: /tmp/openmpi-sessions-mpidemo_at_sv-4_0/50535/0/10
>>>> [sv-4:11802] jobdir: /tmp/openmpi-sessions-mpidemo_at_sv-4_0/50535/0
>>>> [sv-4:11802] top: openmpi-sessions-mpidemo_at_sv-4_0
>>>> [sv-4:11802] tmp: /tmp
>>>> [sv-2:07503] procdir: /tmp/openmpi-sessions-mpidemo_at_sv-2_0/50535/0/7
>>>> [sv-2:07503] jobdir: /tmp/openmpi-sessions-mpidemo_at_sv-2_0/50535/0
>>>> [sv-2:07503] top: openmpi-sessions-mpidemo_at_sv-2_0
>>>> [sv-2:07503] tmp: /tmp
>>>>
>>>> Mapper requested: NULL Last mapper: round_robin Mapping policy:
>>>> BYNODE Ranking policy: NODE Binding policy: NONE[NODE] Cpu set:
>>>> NULL PPR: NULL
>>>> Num new daemons: 0 New daemon starting vpid INVALID
>>>> Num nodes: 10
>>>>
>>>> Data for node: SERVER-2 Launch id: -1 State: 2
>>>> Daemon: [[50535,0],0] Daemon launched: True
>>>> Num slots: 15 Slots in use: 1 Oversubscribed: FALSE
>>>> Num slots allocated: 15 Max slots: 15
>>>> Username on node: NULL
>>>> Num procs: 1 Next node_rank: 1
>>>> Data for proc: [[50535,1],0]
>>>> Pid: 0 Local rank: 0 Node rank: 0 App rank: 0
>>>> State: INITIALIZED Restarts: 0 App_context: 0
>>>> Locale: 0-15 Binding: NULL[0]
>>>>
>>>> Data for node: x.x.x.24 Launch id: -1 State: 0
>>>> Daemon: [[50535,0],1] Daemon launched: False
>>>> Num slots: 3 Slots in use: 1 Oversubscribed: FALSE
>>>> Num slots allocated: 3 Max slots: 2
>>>> Username on node: NULL
>>>> Num procs: 1 Next node_rank: 1
>>>> Data for proc: [[50535,1],1]
>>>> Pid: 0 Local rank: 0 Node rank: 0 App rank: 1
>>>> State: INITIALIZED Restarts: 0 App_context: 0
>>>> Locale: 0-7 Binding: NULL[0]
>>>>
>>>> Data for node: x.x.x.26 Launch id: -1 State: 0
>>>> Daemon: [[50535,0],2] Daemon launched: False
>>>> Num slots: 15 Slots in use: 1 Oversubscribed: FALSE
>>>> Num slots allocated: 15 Max slots: 14
>>>> Username on node: NULL
>>>> Num procs: 1 Next node_rank: 1
>>>> Data for proc: [[50535,1],2]
>>>> Pid: 0 Local rank: 0 Node rank: 0 App rank: 2
>>>> State: INITIALIZED Restarts: 0 App_context: 0
>>>> Locale: 0-7 Binding: NULL[0]
>>>>
>>>> Data for node: x.x.x.28 Launch id: -1 State: 0
>>>> Daemon: [[50535,0],3] Daemon launched: False
>>>> Num slots: 17 Slots in use: 1 Oversubscribed: FALSE
>>>> Num slots allocated: 17 Max slots: 16
>>>> Username on node: NULL
>>>> Num procs: 1 Next node_rank: 1
>>>> Data for proc: [[50535,1],3]
>>>> Pid: 0 Local rank: 0 Node rank: 0 App rank: 3
>>>> State: INITIALIZED Restarts: 0 App_context: 0
>>>> Locale: 0-7 Binding: NULL[0]
>>>>
>>>> Data for node: x.x.x.29 Launch id: -1 State: 0
>>>> Daemon: [[50535,0],4] Daemon launched: False
>>>> Num slots: 15 Slots in use: 1 Oversubscribed: FALSE
>>>> Num slots allocated: 15 Max slots: 14
>>>> Username on node: NULL
>>>> Num procs: 1 Next node_rank: 1
>>>> Data for proc: [[50535,1],4]
>>>> Pid: 0 Local rank: 0 Node rank: 0 App rank: 4
>>>> State: INITIALIZED Restarts: 0 App_context: 0
>>>> Locale: 0-7 Binding: NULL[0]
>>>>
>>>> Data for node: x.x.x.30 Launch id: -1 State: 0
>>>> Daemon: [[50535,0],5] Daemon launched: False
>>>> Num slots: 17 Slots in use: 1 Oversubscribed: FALSE
>>>> Num slots allocated: 17 Max slots: 16
>>>> Username on node: NULL
>>>> Num procs: 1 Next node_rank: 1
>>>> Data for proc: [[50535,1],5]
>>>> Pid: 0 Local rank: 0 Node rank: 0 App rank: 5
>>>> State: INITIALIZED Restarts: 0 App_context: 0
>>>> Locale: 0-7 Binding: NULL[0]
>>>>
>>>> Data for node: x.x.x.41 Launch id: -1 State: 0
>>>> Daemon: [[50535,0],6] Daemon launched: False
>>>> Num slots: 47 Slots in use: 1 Oversubscribed: FALSE
>>>> Num slots allocated: 47 Max slots: 46
>>>> Username on node: NULL
>>>> Num procs: 1 Next node_rank: 1
>>>> Data for proc: [[50535,1],6]
>>>> Pid: 0 Local rank: 0 Node rank: 0 App rank: 6
>>>> State: INITIALIZED Restarts: 0 App_context: 0
>>>> Locale: 0-7 Binding: NULL[0]
>>>>
>>>> Data for node: x.x.x.101 Launch id: -1 State: 0
>>>> Daemon: [[50535,0],7] Daemon launched: False
>>>> Num slots: 47 Slots in use: 1 Oversubscribed: FALSE
>>>> Num slots allocated: 47 Max slots: 46
>>>> Username on node: NULL
>>>> Num procs: 1 Next node_rank: 1
>>>> Data for proc: [[50535,1],7]
>>>> Pid: 0 Local rank: 0 Node rank: 0 App rank: 7
>>>> State: INITIALIZED Restarts: 0 App_context: 0
>>>> Locale: 0-7 Binding: NULL[0]
>>>>
>>>> Data for node: x.x.x.100 Launch id: -1 State: 0
>>>> Daemon: [[50535,0],8] Daemon launched: False
>>>> Num slots: 47 Slots in use: 1 Oversubscribed: FALSE
>>>> Num slots allocated: 47 Max slots: 46
>>>> Username on node: NULL
>>>> Num procs: 1 Next node_rank: 1
>>>> Data for proc: [[50535,1],8]
>>>> Pid: 0 Local rank: 0 Node rank: 0 App rank: 8
>>>> State: INITIALIZED Restarts: 0 App_context: 0
>>>> Locale: 0-7 Binding: NULL[0]
>>>>
>>>> Data for node: x.x.x.102 Launch id: -1 State: 0
>>>> Daemon: [[50535,0],9] Daemon launched: False
>>>> Num slots: 23 Slots in use: 1 Oversubscribed: FALSE
>>>> Num slots allocated: 23 Max slots: 22
>>>> Username on node: NULL
>>>> Num procs: 1 Next node_rank: 1
>>>> Data for proc: [[50535,1],9]
>>>> Pid: 0 Local rank: 0 Node rank: 0 App rank: 9
>>>> State: INITIALIZED Restarts: 0 App_context: 0
>>>> Locale: 0-7 Binding: NULL[0]
>>>> [sv-1:45712] procdir: /tmp/openmpi-sessions-mpidemo_at_sv-1_0/50535/1/8
>>>> [sv-1:45712] jobdir: /tmp/openmpi-sessions-mpidemo_at_sv-1_0/50535/1
>>>> [sv-1:45712] top: openmpi-sessions-mpidemo_at_sv-1_0
>>>> [sv-1:45712] tmp: /tmp
>>>> [SERVER-14:08412] procdir:
>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-14_0/50535/1/6
>>>> [SERVER-14:08412] jobdir:
>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-14_0/50535/1
>>>> [SERVER-14:08412] top: openmpi-sessions-mpidemo_at_SERVER-14_0
>>>> [SERVER-14:08412] tmp: /tmp
>>>> [SERVER-2:05291] procdir:
>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-2_0/50535/1/0
>>>> [SERVER-2:05291] jobdir:
>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-2_0/50535/1
>>>> [SERVER-2:05291] top: openmpi-sessions-mpidemo_at_SERVER-2_0
>>>> [SERVER-2:05291] tmp: /tmp
>>>> [SERVER-4:15726] procdir:
>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-4_0/50535/1/2
>>>> [SERVER-4:15726] jobdir:
>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-4_0/50535/1
>>>> [SERVER-4:15726] top: openmpi-sessions-mpidemo_at_SERVER-4_0
>>>> [SERVER-4:15726] tmp: /tmp
>>>> [SERVER-6:09100] procdir:
>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-6_0/50535/1/4
>>>> [SERVER-6:09100] jobdir:
>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-6_0/50535/1
>>>> [SERVER-6:09100] top: openmpi-sessions-mpidemo_at_SERVER-6_0
>>>> [SERVER-6:09100] tmp: /tmp
>>>> [SERVER-7:32576] procdir:
>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-7_0/50535/1/5
>>>> [SERVER-7:32576] jobdir:
>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-7_0/50535/1
>>>> [SERVER-7:32576] top: openmpi-sessions-mpidemo_at_SERVER-7_0
>>>> [SERVER-7:32576] tmp: /tmp
>>>> [sv-3:08363] procdir: /tmp/openmpi-sessions-mpidemo_at_sv-3_0/50535/1/9
>>>> [sv-3:08363] jobdir: /tmp/openmpi-sessions-mpidemo_at_sv-3_0/50535/1
>>>> [sv-3:08363] top: openmpi-sessions-mpidemo_at_sv-3_0
>>>> [sv-3:08363] tmp: /tmp
>>>> [sv-2:07514] procdir: /tmp/openmpi-sessions-mpidemo_at_sv-2_0/50535/1/7
>>>> [sv-2:07514] jobdir: /tmp/openmpi-sessions-mpidemo_at_sv-2_0/50535/1
>>>> [sv-2:07514] top: openmpi-sessions-mpidemo_at_sv-2_0
>>>> [sv-2:07514] tmp: /tmp
>>>> [SERVER-5:12548] procdir:
>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-5_0/50535/1/3
>>>> [SERVER-5:12548] jobdir:
>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-5_0/50535/1
>>>> [SERVER-5:12548] top: openmpi-sessions-mpidemo_at_SERVER-5_0
>>>> [SERVER-5:12548] tmp: /tmp
>>>> [SERVER-3:29009] procdir:
>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-3_0/50535/1/1
>>>> [SERVER-3:29009] jobdir:
>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-3_0/50535/1
>>>> [SERVER-3:29009] top: openmpi-sessions-mpidemo_at_SERVER-3_0
>>>> [SERVER-3:29009] tmp: /tmp
>>>> MPIR_being_debugged = 0
>>>> MPIR_debug_state = 1
>>>> MPIR_partial_attach_ok = 1
>>>> MPIR_i_am_starter = 0
>>>> MPIR_forward_output = 0
>>>> MPIR_proctable_size = 10
>>>> MPIR_proctable:
>>>> (i, host, exe, pid) = (0, SERVER-2,
>>>> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 5291)
>>>> (i, host, exe, pid) = (1, x.x.x.24,
>>>> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 29009)
>>>> (i, host, exe, pid) = (2, x.x.x.26,
>>>> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 15726)
>>>> (i, host, exe, pid) = (3, x.x.x.28,
>>>> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 12548)
>>>> (i, host, exe, pid) = (4, x.x.x.29,
>>>> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 9100)
>>>> (i, host, exe, pid) = (5, x.x.x.30,
>>>> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 32576)
>>>> (i, host, exe, pid) = (6, x.x.x.41,
>>>> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 8412)
>>>> (i, host, exe, pid) = (7, x.x.x.101,
>>>> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 7514)
>>>> (i, host, exe, pid) = (8, x.x.x.100,
>>>> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 45712)
>>>> (i, host, exe, pid) = (9, x.x.x.102,
>>>> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 8363)
>>>> MPIR_executable_path: NULL
>>>> MPIR_server_arguments: NULL
>>>> --------------------------------------------------------------------------
>>>> It looks like MPI_INIT failed for some reason; your parallel process is
>>>> likely to abort. There are many reasons that a parallel process can
>>>> fail during MPI_INIT; some of which are due to configuration or
>>>> environment
>>>> problems. This failure appears to be an internal failure; here's some
>>>> additional information (which may only be relevant to an Open MPI
>>>> developer):
>>>>
>>>> PML add procs failed
>>>> --> Returned "Error" (-1) instead of "Success" (0)
>>>> --------------------------------------------------------------------------
>>>> [SERVER-2:5291] *** An error occurred in MPI_Init
>>>> [SERVER-2:5291] *** reported by process
>>>> [140508871983105,140505560121344]
>>>> [SERVER-2:5291] *** on a NULL communicator
>>>> [SERVER-2:5291] *** Unknown error
>>>> [SERVER-2:5291] *** MPI_ERRORS_ARE_FATAL (processes in this
>>>> communicator will now abort,
>>>> [SERVER-2:5291] *** and potentially your MPI job)
>>>> --------------------------------------------------------------------------
>>>> An MPI process is aborting at a time when it cannot guarantee that all
>>>> of its peer processes in the job will be killed properly. You should
>>>> double check that everything has shut down cleanly.
>>>>
>>>> Reason: Before MPI_INIT completed
>>>> Local host: SERVER-2
>>>> PID: 5291
>>>> --------------------------------------------------------------------------
>>>> [sv-1][[50535,1],8][btl_openib_proc.c:157:mca_btl_openib_proc_create]
>>>> [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[50535,1],0]
>>>> [sv-3][[50535,1],9][btl_openib_proc.c:157:mca_btl_openib_proc_create]
>>>> [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[50535,1],0]
>>>> [sv-3][[50535,1],9][btl_tcp_proc.c:128:mca_btl_tcp_proc_create]
>>>> mca_base_modex_recv: failed with return value=-13
>>>> [sv-3][[50535,1],9][btl_tcp_proc.c:128:mca_btl_tcp_proc_create]
>>>> mca_base_modex_recv: failed with return value=-13
>>>> [sv-1][[50535,1],8][btl_tcp_proc.c:128:mca_btl_tcp_proc_create]
>>>> mca_base_modex_recv: failed with return value=-13
>>>> [sv-1][[50535,1],8][btl_tcp_proc.c:128:mca_btl_tcp_proc_create]
>>>> mca_base_modex_recv: failed with return value=-13
>>>> --------------------------------------------------------------------------
>>>> At least one pair of MPI processes are unable to reach each other for
>>>> MPI communications. This means that no Open MPI device has indicated
>>>> that it can be used to communicate between these processes. This is
>>>> an error; Open MPI requires that all MPI processes be able to reach
>>>> each other. This error can sometimes be the result of forgetting to
>>>> specify the "self" BTL.
>>>>
>>>> Process 1 ([[50535,1],8]) is on host: sv-1
>>>> Process 2 ([[50535,1],0]) is on host: SERVER-2
>>>> BTLs attempted: openib self sm tcp
>>>>
>>>> Your MPI job is now going to abort; sorry.
>>>> --------------------------------------------------------------------------
>>>> --------------------------------------------------------------------------
>>>> MPI_INIT has failed because at least one MPI process is unreachable
>>>> from another. This *usually* means that an underlying communication
>>>> plugin -- such as a BTL or an MTL -- has either not loaded or not
>>>> allowed itself to be used. Your MPI job will now abort.
>>>>
>>>> You may wish to try to narrow down the problem;
>>>>
>>>> * Check the output of ompi_info to see which BTL/MTL plugins are
>>>> available.
>>>> * Run your application with MPI_THREAD_SINGLE.
>>>> * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
>>>> if using MTL-based communications) to see exactly which
>>>> communication plugins were considered and/or discarded.
>>>> --------------------------------------------------------------------------
>>>> [sv-2][[50535,1],7][btl_openib_proc.c:157:mca_btl_openib_proc_create]
>>>> [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[50535,1],0]
>>>> [sv-2][[50535,1],7][btl_tcp_proc.c:128:mca_btl_tcp_proc_create]
>>>> mca_base_modex_recv: failed with return value=-13
>>>> [sv-2][[50535,1],7][btl_tcp_proc.c:128:mca_btl_tcp_proc_create]
>>>> mca_base_modex_recv: failed with return value=-13
>>>> [SERVER-2:05284] sess_dir_finalize: proc session dir not empty -
>>>> leaving
>>>> [SERVER-2:05284] sess_dir_finalize: proc session dir not empty -
>>>> leaving
>>>> [sv-4:11802] sess_dir_finalize: job session dir not empty - leaving
>>>> [SERVER-14:08399] sess_dir_finalize: job session dir not empty -
>>>> leaving
>>>> [SERVER-6:09087] sess_dir_finalize: proc session dir not empty -
>>>> leaving
>>>> [SERVER-6:09087] sess_dir_finalize: proc session dir not empty -
>>>> leaving
>>>> [SERVER-4:15711] sess_dir_finalize: proc session dir not empty -
>>>> leaving
>>>> [SERVER-4:15711] sess_dir_finalize: proc session dir not empty -
>>>> leaving
>>>> [SERVER-6:09087] sess_dir_finalize: job session dir not empty - leaving
>>>> exiting with status 0
>>>> [SERVER-7:32563] sess_dir_finalize: proc session dir not empty -
>>>> leaving
>>>> [SERVER-7:32563] sess_dir_finalize: proc session dir not empty -
>>>> leaving
>>>> [SERVER-5:12534] sess_dir_finalize: proc session dir not empty -
>>>> leaving
>>>> [SERVER-5:12534] sess_dir_finalize: proc session dir not empty -
>>>> leaving
>>>> [SERVER-7:32563] sess_dir_finalize: job session dir not empty - leaving
>>>> exiting with status 0
>>>> exiting with status 0
>>>> exiting with status 0
>>>> [SERVER-4:15711] sess_dir_finalize: job session dir not empty - leaving
>>>> [SERVER-3:28993] sess_dir_finalize: proc session dir not empty -
>>>> leaving
>>>> exiting with status 0
>>>> [SERVER-3:28993] sess_dir_finalize: proc session dir not empty -
>>>> leaving
>>>> [sv-3:08352] sess_dir_finalize: proc session dir not empty - leaving
>>>> [sv-3:08352] sess_dir_finalize: job session dir not empty - leaving
>>>> [sv-1:45701] sess_dir_finalize: proc session dir not empty - leaving
>>>> [sv-1:45701] sess_dir_finalize: job session dir not empty - leaving
>>>> exiting with status 0
>>>> exiting with status 0
>>>> [sv-2:07503] sess_dir_finalize: proc session dir not empty - leaving
>>>> [sv-2:07503] sess_dir_finalize: job session dir not empty - leaving
>>>> exiting with status 0
>>>> [SERVER-5:12534] sess_dir_finalize: job session dir not empty - leaving
>>>> exiting with status 0
>>>> [SERVER-3:28993] sess_dir_finalize: job session dir not empty - leaving
>>>> exiting with status 0
>>>> --------------------------------------------------------------------------
>>>> mpirun has exited due to process rank 6 with PID 8412 on
>>>> node x.x.x.41 exiting improperly. There are three reasons this
>>>> could occur:
>>>>
>>>> 1. this process did not call "init" before exiting, but others in
>>>> the job did. This can cause a job to hang indefinitely while it waits
>>>> for all processes to call "init". By rule, if one process calls "init",
>>>> then ALL processes must call "init" prior to termination.
>>>>
>>>> 2. this process called "init", but exited without calling "finalize".
>>>> By rule, all processes that call "init" MUST call "finalize" prior to
>>>> exiting or it will be considered an "abnormal termination"
>>>>
>>>> 3. this process called "MPI_Abort" or "orte_abort" and the mca
>>>> parameter
>>>> orte_create_session_dirs is set to false. In this case, the
>>>> run-time cannot
>>>> detect that the abort call was an abnormal termination. Hence, the only
>>>> error message you will receive is this one.
>>>>
>>>> This may have caused other processes in the application to be
>>>> terminated by signals sent by mpirun (as reported here).
>>>>
>>>> You can avoid this message by specifying -quiet on the mpirun
>>>> command line.
>>>>
>>>> --------------------------------------------------------------------------
>>>> [SERVER-2:05284] 6 more processes have sent help message
>>>> help-mpi-runtime / mpi_init:startup:internal-failure
>>>> [SERVER-2:05284] Set MCA parameter "orte_base_help_aggregate" to 0
>>>> to see all help / error messages
>>>> [SERVER-2:05284] 9 more processes have sent help message
>>>> help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
>>>> [SERVER-2:05284] 9 more processes have sent help message
>>>> help-mpi-runtime.txt / ompi mpi abort:cannot guarantee all killed
>>>> [SERVER-2:05284] 2 more processes have sent help message
>>>> help-mca-bml-r2.txt / unreachable proc
>>>> [SERVER-2:05284] 2 more processes have sent help message
>>>> help-mpi-runtime / mpi_init:startup:pml-add-procs-fail
>>>> [SERVER-2:05284] sess_dir_finalize: job session dir not empty - leaving
>>>> exiting with status 1
>>>>
>>>> //******************************************************************
>>>>
>>>> Any feedback will be helpful. Thank you!
>>>>
>>>> Mr. Beans
>>>
>>>
>>>
>>
>
>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users