
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] ERROR: At least one pair of MPI processes are unable to reach each other for MPI communications.
From: RoboBeans (robobeans_at_[hidden])
Date: 2013-08-03 22:09:09


On the first 7 nodes:

*[mpidemo_at_SERVER-3 ~]$ ofed_info | head -n 1*
OFED-1.5.3.2:

*[mpidemo_at_SERVER-3 ~]$ which ofed_info*
/usr/bin/ofed_info

On the last 4 nodes:

*[mpidemo_at_sv-2 ~]$ ofed_info | head -n 1*
-bash: ofed_info: command not found

*[mpidemo_at_sv-2 ~]$ which ofed_info*
/usr/bin/which: no ofed_info in
(/usr/OPENMPI/openmpi-1.7.2/bin:/usr/OPENMPI/openmpi-1.7.2/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/bin/:/usr/lib/:/usr/lib:/usr:/usr/:/bin/:/usr/lib/:/usr/lib:/usr:/usr/)

Are there specific locations where I should look for ofed_info? How can
I verify whether OFED is installed on a given node?
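
A few ways to check for an OFED installation (a minimal sketch, assuming
standard RPM-based packaging as on CentOS; the package and tool names below
are the usual defaults and may differ on these nodes):

$ rpm -qa | grep -i -E 'ofed|infinipath|psm'   # OFED / QLogic PSM packages, if any
$ ls /etc/infiniband/info 2>/dev/null          # info script normally installed by OFED
$ ibv_devinfo | head -n 5                      # from libibverbs-utils; lists HCAs if the verbs stack is present
$ lsmod | grep -E 'ib_qib|ib_core'             # kernel modules for the QLogic HCA / IB core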

Thanks again!!!

On 8/3/13 5:52 PM, Ralph Castain wrote:
> Are the ofed versions the same across all the machines? I would
> suspect that might be the problem.
>
>
> On Aug 3, 2013, at 4:06 PM, RoboBeans <robobeans_at_[hidden]
> <mailto:robobeans_at_[hidden]>> wrote:
>
>> Hi Ralph, I tried 1.5.4, 1.6.5, and 1.7.2 (compiled from source with
>> no configure arguments), but I am facing the same issue. When I run a
>> job using 1.5.4 (installed via yum), I get warnings, but they don't
>> affect my output.
>>
>> An example of the warning I get:
>>
>> sv-2.7960ipath_userinit: Mismatched user minor version (12) and
>> driver minor version (11) while context sharing. Ensure that driver
>> and library are from the same release.
>>
>> Each system has a QLogic card ("QLE7342-CK dual-port IB card") and
>> runs the same OS but a different kernel revision (e.g.
>> 2.6.32-358.2.1.el6.x86_64 vs. 2.6.32-358.el6.x86_64).
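
A hedged sketch for checking whether the qib/ipath kernel driver and the
PSM user-space library come from the same release, per the ipath_userinit
warning quoted above (the module and package names are the usual
CentOS/OFED ones and may differ on this cluster):

$ modinfo ib_qib | grep -i version                       # kernel-side QLogic driver
$ rpm -qa | grep -i -E 'infinipath|psm'                  # user-space PSM / infinipath packages
$ rpm -qi infinipath-psm 2>/dev/null | grep -i version   # if installed under this name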
>>
>> Thank you for your time.
>>
>> On 8/3/13 2:05 PM, Ralph Castain wrote:
>>> Hmmm...strange indeed. I would remove those four configure options
>>> and give it a try. That will eliminate all the obvious things, I
>>> would think, though they aren't generally involved in the issue
>>> shown here. Still, worth taking out potential trouble sources.
>>>
>>> What is the connectivity between SERVER-2 and node 100? Should I
>>> assume that the first seven nodes are connected via one type of
>>> interconnect, and the other four are connected to those seven by
>>> another type?
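
One way to narrow this down (a sketch, not verified on this cluster):
force only the TCP and self BTLs and turn up BTL verbosity, so the
openib/PSM path is taken out of the picture:

$ mpirun --mca btl tcp,self --mca btl_base_verbose 100 -np 10 --hostfile mpi_hostfile --bynode ./test

If that works across all 11 nodes, the problem is likely in the InfiniBand
stack on the last four; if it still fails, it points to basic TCP
reachability between the two groups of machines.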
>>>
>>>
>>> On Aug 3, 2013, at 1:30 PM, RoboBeans <robobeans_at_[hidden]
>>> <mailto:robobeans_at_[hidden]>> wrote:
>>>
>>>> Thanks for looking into it, Ralph. I modified the hosts file but I
>>>> am still getting the same error. Any other pointers you can think
>>>> of? The difference between this 1.7.2 installation and 1.5.4 is
>>>> that I installed 1.5.4 using yum, whereas for 1.7.2 I built from
>>>> source and configured with *--enable-event-thread-support
>>>> --enable-opal-multi-threads --enable-orte-progress-threads
>>>> --enable-mpi-thread-multiple*. Am I missing something here?
>>>>
>>>> //******************************************************************
>>>>
>>>> *$ cat mpi_hostfile*
>>>>
>>>> x.x.x.22 slots=15 max-slots=15
>>>> x.x.x.24 slots=2 max-slots=2
>>>> x.x.x.26 slots=14 max-slots=14
>>>> x.x.x.28 slots=16 max-slots=16
>>>> x.x.x.29 slots=14 max-slots=14
>>>> x.x.x.30 slots=16 max-slots=16
>>>> x.x.x.41 slots=46 max-slots=46
>>>> x.x.x.101 slots=46 max-slots=46
>>>> x.x.x.100 slots=46 max-slots=46
>>>> x.x.x.102 slots=22 max-slots=22
>>>> x.x.x.103 slots=22 max-slots=22
>>>>
>>>> //******************************************************************
>>>> *$ mpirun -d --display-map -np 10 --hostfile mpi_hostfile --bynode ./test*
>>>>
>>>> [SERVER-2:08907] procdir:
>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-2_0/62216/0/0
>>>> [SERVER-2:08907] jobdir:
>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-2_0/62216/0
>>>> [SERVER-2:08907] top: openmpi-sessions-mpidemo_at_SERVER-2_0
>>>> [SERVER-2:08907] tmp: /tmp
>>>> CentOS release 6.4 (Final)
>>>> Kernel \r on an \m
>>>> CentOS release 6.4 (Final)
>>>> Kernel \r on an \m
>>>> CentOS release 6.4 (Final)
>>>> Kernel \r on an \m
>>>> [SERVER-3:32517] procdir:
>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-3_0/62216/0/1
>>>> [SERVER-3:32517] jobdir:
>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-3_0/62216/0
>>>> [SERVER-3:32517] top: openmpi-sessions-mpidemo_at_SERVER-3_0
>>>> [SERVER-3:32517] tmp: /tmp
>>>> CentOS release 6.4 (Final)
>>>> Kernel \r on an \m
>>>> CentOS release 6.4 (Final)
>>>> Kernel \r on an \m
>>>> [SERVER-6:11595] procdir:
>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-6_0/62216/0/4
>>>> [SERVER-6:11595] jobdir:
>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-6_0/62216/0
>>>> [SERVER-6:11595] top: openmpi-sessions-mpidemo_at_SERVER-6_0
>>>> [SERVER-6:11595] tmp: /tmp
>>>> [SERVER-4:27445] procdir:
>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-4_0/62216/0/2
>>>> [SERVER-4:27445] jobdir:
>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-4_0/62216/0
>>>> [SERVER-4:27445] top: openmpi-sessions-mpidemo_at_SERVER-4_0
>>>> [SERVER-4:27445] tmp: /tmp
>>>> [SERVER-7:02607] procdir:
>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-7_0/62216/0/5
>>>> [SERVER-7:02607] jobdir:
>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-7_0/62216/0
>>>> [SERVER-7:02607] top: openmpi-sessions-mpidemo_at_SERVER-7_0
>>>> [SERVER-7:02607] tmp: /tmp
>>>> [sv-1:46100] procdir: /tmp/openmpi-sessions-mpidemo_at_sv-1_0/62216/0/8
>>>> [sv-1:46100] jobdir: /tmp/openmpi-sessions-mpidemo_at_sv-1_0/62216/0
>>>> [sv-1:46100] top: openmpi-sessions-mpidemo_at_sv-1_0
>>>> [sv-1:46100] tmp: /tmp
>>>> CentOS release 6.4 (Final)
>>>> Kernel \r on an \m
>>>> [SERVER-5:16404] procdir:
>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-5_0/62216/0/3
>>>> [SERVER-5:16404] jobdir:
>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-5_0/62216/0
>>>> [SERVER-5:16404] top: openmpi-sessions-mpidemo_at_SERVER-5_0
>>>> [SERVER-5:16404] tmp: /tmp
>>>> [sv-3:08575] procdir: /tmp/openmpi-sessions-mpidemo_at_sv-3_0/62216/0/9
>>>> [sv-3:08575] jobdir: /tmp/openmpi-sessions-mpidemo_at_sv-3_0/62216/0
>>>> [sv-3:08575] top: openmpi-sessions-mpidemo_at_sv-3_0
>>>> [sv-3:08575] tmp: /tmp
>>>> [SERVER-14:10755] procdir:
>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-14_0/62216/0/6
>>>> [SERVER-14:10755] jobdir:
>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-14_0/62216/0
>>>> [SERVER-14:10755] top: openmpi-sessions-mpidemo_at_SERVER-14_0
>>>> [SERVER-14:10755] tmp: /tmp
>>>> [sv-4:12040] procdir: /tmp/openmpi-sessions-mpidemo_at_sv-4_0/62216/0/10
>>>> [sv-4:12040] jobdir: /tmp/openmpi-sessions-mpidemo_at_sv-4_0/62216/0
>>>> [sv-4:12040] top: openmpi-sessions-mpidemo_at_sv-4_0
>>>> [sv-4:12040] tmp: /tmp
>>>> [sv-2:07725] procdir: /tmp/openmpi-sessions-mpidemo_at_sv-2_0/62216/0/7
>>>> [sv-2:07725] jobdir: /tmp/openmpi-sessions-mpidemo_at_sv-2_0/62216/0
>>>> [sv-2:07725] top: openmpi-sessions-mpidemo_at_sv-2_0
>>>> [sv-2:07725] tmp: /tmp
>>>>
>>>> Mapper requested: NULL Last mapper: round_robin Mapping policy:
>>>> BYNODE Ranking policy: NODE Binding policy: NONE[NODE] Cpu set:
>>>> NULL PPR: NULL
>>>> Num new daemons: 0 New daemon starting vpid INVALID
>>>> Num nodes: 10
>>>>
>>>> Data for node: SERVER-2 Launch id: -1 State: 2
>>>> Daemon: [[62216,0],0] Daemon launched: True
>>>> Num slots: 15 Slots in use: 1 Oversubscribed: FALSE
>>>> Num slots allocated: 15 Max slots: 15
>>>> Username on node: NULL
>>>> Num procs: 1 Next node_rank: 1
>>>> Data for proc: [[62216,1],0]
>>>> Pid: 0 Local rank: 0 Node rank: 0 App rank: 0
>>>> State: INITIALIZED Restarts: 0 App_context: 0
>>>> Locale: 0-15 Binding: NULL[0]
>>>>
>>>> Data for node: x.x.x.24 Launch id: -1 State: 0
>>>> Daemon: [[62216,0],1] Daemon launched: False
>>>> Num slots: 2 Slots in use: 1 Oversubscribed: FALSE
>>>> Num slots allocated: 2 Max slots: 2
>>>> Username on node: NULL
>>>> Num procs: 1 Next node_rank: 1
>>>> Data for proc: [[62216,1],1]
>>>> Pid: 0 Local rank: 0 Node rank: 0 App rank: 1
>>>> State: INITIALIZED Restarts: 0 App_context: 0
>>>> Locale: 0-7 Binding: NULL[0]
>>>>
>>>> Data for node: x.x.x.26 Launch id: -1 State: 0
>>>> Daemon: [[62216,0],2] Daemon launched: False
>>>> Num slots: 14 Slots in use: 1 Oversubscribed: FALSE
>>>> Num slots allocated: 14 Max slots: 14
>>>> Username on node: NULL
>>>> Num procs: 1 Next node_rank: 1
>>>> Data for proc: [[62216,1],2]
>>>> Pid: 0 Local rank: 0 Node rank: 0 App rank: 2
>>>> State: INITIALIZED Restarts: 0 App_context: 0
>>>> Locale: 0-7 Binding: NULL[0]
>>>>
>>>> Data for node: x.x.x.28 Launch id: -1 State: 0
>>>> Daemon: [[62216,0],3] Daemon launched: False
>>>> Num slots: 16 Slots in use: 1 Oversubscribed: FALSE
>>>> Num slots allocated: 16 Max slots: 16
>>>> Username on node: NULL
>>>> Num procs: 1 Next node_rank: 1
>>>> Data for proc: [[62216,1],3]
>>>> Pid: 0 Local rank: 0 Node rank: 0 App rank: 3
>>>> State: INITIALIZED Restarts: 0 App_context: 0
>>>> Locale: 0-7 Binding: NULL[0]
>>>>
>>>> Data for node: x.x.x.29 Launch id: -1 State: 0
>>>> Daemon: [[62216,0],4] Daemon launched: False
>>>> Num slots: 14 Slots in use: 1 Oversubscribed: FALSE
>>>> Num slots allocated: 14 Max slots: 14
>>>> Username on node: NULL
>>>> Num procs: 1 Next node_rank: 1
>>>> Data for proc: [[62216,1],4]
>>>> Pid: 0 Local rank: 0 Node rank: 0 App rank: 4
>>>> State: INITIALIZED Restarts: 0 App_context: 0
>>>> Locale: 0-7 Binding: NULL[0]
>>>>
>>>> Data for node: x.x.x.30 Launch id: -1 State: 0
>>>> Daemon: [[62216,0],5] Daemon launched: False
>>>> Num slots: 16 Slots in use: 1 Oversubscribed: FALSE
>>>> Num slots allocated: 16 Max slots: 16
>>>> Username on node: NULL
>>>> Num procs: 1 Next node_rank: 1
>>>> Data for proc: [[62216,1],5]
>>>> Pid: 0 Local rank: 0 Node rank: 0 App rank: 5
>>>> State: INITIALIZED Restarts: 0 App_context: 0
>>>> Locale: 0-7 Binding: NULL[0]
>>>>
>>>> Data for node: x.x.x.41 Launch id: -1 State: 0
>>>> Daemon: [[62216,0],6] Daemon launched: False
>>>> Num slots: 46 Slots in use: 1 Oversubscribed: FALSE
>>>> Num slots allocated: 46 Max slots: 46
>>>> Username on node: NULL
>>>> Num procs: 1 Next node_rank: 1
>>>> Data for proc: [[62216,1],6]
>>>> Pid: 0 Local rank: 0 Node rank: 0 App rank: 6
>>>> State: INITIALIZED Restarts: 0 App_context: 0
>>>> Locale: 0-7 Binding: NULL[0]
>>>>
>>>> Data for node: x.x.x.101 Launch id: -1 State: 0
>>>> Daemon: [[62216,0],7] Daemon launched: False
>>>> Num slots: 46 Slots in use: 1 Oversubscribed: FALSE
>>>> Num slots allocated: 46 Max slots: 46
>>>> Username on node: NULL
>>>> Num procs: 1 Next node_rank: 1
>>>> Data for proc: [[62216,1],7]
>>>> Pid: 0 Local rank: 0 Node rank: 0 App rank: 7
>>>> State: INITIALIZED Restarts: 0 App_context: 0
>>>> Locale: 0-7 Binding: NULL[0]
>>>>
>>>> Data for node: x.x.x.100 Launch id: -1 State: 0
>>>> Daemon: [[62216,0],8] Daemon launched: False
>>>> Num slots: 46 Slots in use: 1 Oversubscribed: FALSE
>>>> Num slots allocated: 46 Max slots: 46
>>>> Username on node: NULL
>>>> Num procs: 1 Next node_rank: 1
>>>> Data for proc: [[62216,1],8]
>>>> Pid: 0 Local rank: 0 Node rank: 0 App rank: 8
>>>> State: INITIALIZED Restarts: 0 App_context: 0
>>>> Locale: 0-7 Binding: NULL[0]
>>>>
>>>> Data for node: x.x.x.102 Launch id: -1 State: 0
>>>> Daemon: [[62216,0],9] Daemon launched: False
>>>> Num slots: 22 Slots in use: 1 Oversubscribed: FALSE
>>>> Num slots allocated: 22 Max slots: 22
>>>> Username on node: NULL
>>>> Num procs: 1 Next node_rank: 1
>>>> Data for proc: [[62216,1],9]
>>>> Pid: 0 Local rank: 0 Node rank: 0 App rank: 9
>>>> State: INITIALIZED Restarts: 0 App_context: 0
>>>> Locale: 0-7 Binding: NULL[0]
>>>> [sv-1:46111] procdir: /tmp/openmpi-sessions-mpidemo_at_sv-1_0/62216/1/8
>>>> [sv-1:46111] jobdir: /tmp/openmpi-sessions-mpidemo_at_sv-1_0/62216/1
>>>> [sv-1:46111] top: openmpi-sessions-mpidemo_at_sv-1_0
>>>> [sv-1:46111] tmp: /tmp
>>>> [SERVER-14:10768] procdir:
>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-14_0/62216/1/6
>>>> [SERVER-14:10768] jobdir:
>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-14_0/62216/1
>>>> [SERVER-14:10768] top: openmpi-sessions-mpidemo_at_SERVER-14_0
>>>> [SERVER-14:10768] tmp: /tmp
>>>> [SERVER-2:08912] procdir:
>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-2_0/62216/1/0
>>>> [SERVER-2:08912] jobdir:
>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-2_0/62216/1
>>>> [SERVER-2:08912] top: openmpi-sessions-mpidemo_at_SERVER-2_0
>>>> [SERVER-2:08912] tmp: /tmp
>>>> [SERVER-4:27460] procdir:
>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-4_0/62216/1/2
>>>> [SERVER-4:27460] jobdir:
>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-4_0/62216/1
>>>> [SERVER-4:27460] top: openmpi-sessions-mpidemo_at_SERVER-4_0
>>>> [SERVER-4:27460] tmp: /tmp
>>>> [SERVER-6:11608] procdir:
>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-6_0/62216/1/4
>>>> [SERVER-6:11608] jobdir:
>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-6_0/62216/1
>>>> [SERVER-6:11608] top: openmpi-sessions-mpidemo_at_SERVER-6_0
>>>> [SERVER-6:11608] tmp: /tmp
>>>> [SERVER-7:02620] procdir:
>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-7_0/62216/1/5
>>>> [SERVER-7:02620] jobdir:
>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-7_0/62216/1
>>>> [SERVER-7:02620] top: openmpi-sessions-mpidemo_at_SERVER-7_0
>>>> [SERVER-7:02620] tmp: /tmp
>>>> [sv-3:08586] procdir: /tmp/openmpi-sessions-mpidemo_at_sv-3_0/62216/1/9
>>>> [sv-3:08586] jobdir: /tmp/openmpi-sessions-mpidemo_at_sv-3_0/62216/1
>>>> [sv-3:08586] top: openmpi-sessions-mpidemo_at_sv-3_0
>>>> [sv-3:08586] tmp: /tmp
>>>> [sv-2:07736] procdir: /tmp/openmpi-sessions-mpidemo_at_sv-2_0/62216/1/7
>>>> [sv-2:07736] jobdir: /tmp/openmpi-sessions-mpidemo_at_sv-2_0/62216/1
>>>> [sv-2:07736] top: openmpi-sessions-mpidemo_at_sv-2_0
>>>> [sv-2:07736] tmp: /tmp
>>>> [SERVER-5:16418] procdir:
>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-5_0/62216/1/3
>>>> [SERVER-5:16418] jobdir:
>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-5_0/62216/1
>>>> [SERVER-5:16418] top: openmpi-sessions-mpidemo_at_SERVER-5_0
>>>> [SERVER-5:16418] tmp: /tmp
>>>> [SERVER-3:32533] procdir:
>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-3_0/62216/1/1
>>>> [SERVER-3:32533] jobdir:
>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-3_0/62216/1
>>>> [SERVER-3:32533] top: openmpi-sessions-mpidemo_at_SERVER-3_0
>>>> [SERVER-3:32533] tmp: /tmp
>>>> MPIR_being_debugged = 0
>>>> MPIR_debug_state = 1
>>>> MPIR_partial_attach_ok = 1
>>>> MPIR_i_am_starter = 0
>>>> MPIR_forward_output = 0
>>>> MPIR_proctable_size = 10
>>>> MPIR_proctable:
>>>> (i, host, exe, pid) = (0, SERVER-2,
>>>> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 8912)
>>>> (i, host, exe, pid) = (1, x.x.x.24,
>>>> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 32533)
>>>> (i, host, exe, pid) = (2, x.x.x.26,
>>>> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 27460)
>>>> (i, host, exe, pid) = (3, x.x.x.28,
>>>> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 16418)
>>>> (i, host, exe, pid) = (4, x.x.x.29,
>>>> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 11608)
>>>> (i, host, exe, pid) = (5, x.x.x.30,
>>>> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 2620)
>>>> (i, host, exe, pid) = (6, x.x.x.41,
>>>> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 10768)
>>>> (i, host, exe, pid) = (7, x.x.x.101,
>>>> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 7736)
>>>> (i, host, exe, pid) = (8, x.x.x.100,
>>>> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 46111)
>>>> (i, host, exe, pid) = (9, x.x.x.102,
>>>> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 8586)
>>>> MPIR_executable_path: NULL
>>>> MPIR_server_arguments: NULL
>>>> --------------------------------------------------------------------------
>>>> It looks like MPI_INIT failed for some reason; your parallel process is
>>>> likely to abort. There are many reasons that a parallel process can
>>>> fail during MPI_INIT; some of which are due to configuration or
>>>> environment
>>>> problems. This failure appears to be an internal failure; here's some
>>>> additional information (which may only be relevant to an Open MPI
>>>> developer):
>>>>
>>>> PML add procs failed
>>>> --> Returned "Error" (-1) instead of "Success" (0)
>>>> --------------------------------------------------------------------------
>>>> [SERVER-2:8912] *** An error occurred in MPI_Init
>>>> [SERVER-2:8912] *** reported by process
>>>> [140393673392129,140389596004352]
>>>> [SERVER-2:8912] *** on a NULL communicator
>>>> [SERVER-2:8912] *** Unknown error
>>>> [SERVER-2:8912] *** MPI_ERRORS_ARE_FATAL (processes in this
>>>> communicator will now abort,
>>>> [SERVER-2:8912] *** and potentially your MPI job)
>>>> --------------------------------------------------------------------------
>>>> An MPI process is aborting at a time when it cannot guarantee that all
>>>> of its peer processes in the job will be killed properly. You should
>>>> double check that everything has shut down cleanly.
>>>>
>>>> Reason: Before MPI_INIT completed
>>>> Local host: SERVER-2
>>>> PID: 8912
>>>> --------------------------------------------------------------------------
>>>> [sv-1][[62216,1],8][btl_openib_proc.c:157:mca_btl_openib_proc_create]
>>>> [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[62216,1],0]
>>>> [sv-1][[62216,1],8][btl_tcp_proc.c:128:mca_btl_tcp_proc_create]
>>>> mca_base_modex_recv: failed with return value=-13
>>>> [sv-1][[62216,1],8][btl_tcp_proc.c:128:mca_btl_tcp_proc_create]
>>>> mca_base_modex_recv: failed with return value=-13
>>>> --------------------------------------------------------------------------
>>>> At least one pair of MPI processes are unable to reach each other for
>>>> MPI communications. This means that no Open MPI device has indicated
>>>> that it can be used to communicate between these processes. This is
>>>> an error; Open MPI requires that all MPI processes be able to reach
>>>> each other. This error can sometimes be the result of forgetting to
>>>> specify the "self" BTL.
>>>>
>>>> Process 1 ([[62216,1],8]) is on host: sv-1
>>>> Process 2 ([[62216,1],0]) is on host: SERVER-2
>>>> BTLs attempted: openib self sm tcp
>>>>
>>>> Your MPI job is now going to abort; sorry.
>>>> --------------------------------------------------------------------------
>>>> [sv-3][[62216,1],9][btl_openib_proc.c:157:mca_btl_openib_proc_create]
>>>> [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[62216,1],0]
>>>> [sv-3][[62216,1],9][btl_tcp_proc.c:128:mca_btl_tcp_proc_create]
>>>> mca_base_modex_recv: failed with return value=-13
>>>> [sv-3][[62216,1],9][btl_tcp_proc.c:128:mca_btl_tcp_proc_create]
>>>> mca_base_modex_recv: failed with return value=-13
>>>> --------------------------------------------------------------------------
>>>> MPI_INIT has failed because at least one MPI process is unreachable
>>>> from another. This *usually* means that an underlying communication
>>>> plugin -- such as a BTL or an MTL -- has either not loaded or not
>>>> allowed itself to be used. Your MPI job will now abort.
>>>>
>>>> You may wish to try to narrow down the problem;
>>>>
>>>> * Check the output of ompi_info to see which BTL/MTL plugins are
>>>> available.
>>>> * Run your application with MPI_THREAD_SINGLE.
>>>> * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
>>>> if using MTL-based communications) to see exactly which
>>>> communication plugins were considered and/or discarded.
>>>> --------------------------------------------------------------------------
>>>> [sv-2][[62216,1],7][btl_openib_proc.c:157:mca_btl_openib_proc_create]
>>>> [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[62216,1],0]
>>>> [sv-2][[62216,1],7][btl_tcp_proc.c:128:mca_btl_tcp_proc_create]
>>>> mca_base_modex_recv: failed with return value=-13
>>>> [sv-2][[62216,1],7][btl_tcp_proc.c:128:mca_btl_tcp_proc_create]
>>>> mca_base_modex_recv: failed with return value=-13
>>>> [SERVER-2:08907] sess_dir_finalize: proc session dir not empty -
>>>> leaving
>>>> [sv-4:12040] sess_dir_finalize: job session dir not empty - leaving
>>>> [SERVER-14:10755] sess_dir_finalize: job session dir not empty -
>>>> leaving
>>>> [SERVER-2:08907] sess_dir_finalize: proc session dir not empty -
>>>> leaving
>>>> [SERVER-6:11595] sess_dir_finalize: proc session dir not empty -
>>>> leaving
>>>> [SERVER-6:11595] sess_dir_finalize: proc session dir not empty -
>>>> leaving
>>>> [SERVER-4:27445] sess_dir_finalize: proc session dir not empty -
>>>> leaving
>>>> exiting with status 0
>>>> [SERVER-4:27445] sess_dir_finalize: proc session dir not empty -
>>>> leaving
>>>> [SERVER-6:11595] sess_dir_finalize: job session dir not empty - leaving
>>>> [SERVER-7:02607] sess_dir_finalize: proc session dir not empty -
>>>> leaving
>>>> [SERVER-7:02607] sess_dir_finalize: proc session dir not empty -
>>>> leaving
>>>> [SERVER-7:02607] sess_dir_finalize: job session dir not empty - leaving
>>>> [SERVER-5:16404] sess_dir_finalize: proc session dir not empty -
>>>> leaving
>>>> [SERVER-5:16404] sess_dir_finalize: proc session dir not empty -
>>>> leaving
>>>> exiting with status 0
>>>> exiting with status 0
>>>> exiting with status 0
>>>> [SERVER-4:27445] sess_dir_finalize: job session dir not empty - leaving
>>>> exiting with status 0
>>>> [SERVER-3:32517] sess_dir_finalize: proc session dir not empty -
>>>> leaving
>>>> [SERVER-3:32517] sess_dir_finalize: proc session dir not empty -
>>>> leaving
>>>> [sv-3:08575] sess_dir_finalize: proc session dir not empty - leaving
>>>> [sv-3:08575] sess_dir_finalize: job session dir not empty - leaving
>>>> exiting with status 0
>>>> [sv-1:46100] sess_dir_finalize: proc session dir not empty - leaving
>>>> [sv-1:46100] sess_dir_finalize: job session dir not empty - leaving
>>>> exiting with status 0
>>>> [sv-2:07725] sess_dir_finalize: proc session dir not empty - leaving
>>>> [sv-2:07725] sess_dir_finalize: job session dir not empty - leaving
>>>> exiting with status 0
>>>> [SERVER-5:16404] sess_dir_finalize: job session dir not empty - leaving
>>>> exiting with status 0
>>>> [SERVER-3:32517] sess_dir_finalize: job session dir not empty - leaving
>>>> exiting with status 0
>>>> --------------------------------------------------------------------------
>>>> mpirun has exited due to process rank 6 with PID 10768 on
>>>> node x.x.x.41 exiting improperly. There are three reasons this
>>>> could occur:
>>>>
>>>> 1. this process did not call "init" before exiting, but others in
>>>> the job did. This can cause a job to hang indefinitely while it waits
>>>> for all processes to call "init". By rule, if one process calls "init",
>>>> then ALL processes must call "init" prior to termination.
>>>>
>>>> 2. this process called "init", but exited without calling "finalize".
>>>> By rule, all processes that call "init" MUST call "finalize" prior to
>>>> exiting or it will be considered an "abnormal termination"
>>>>
>>>> 3. this process called "MPI_Abort" or "orte_abort" and the mca
>>>> parameter
>>>> orte_create_session_dirs is set to false. In this case, the
>>>> run-time cannot
>>>> detect that the abort call was an abnormal termination. Hence, the only
>>>> error message you will receive is this one.
>>>>
>>>> This may have caused other processes in the application to be
>>>> terminated by signals sent by mpirun (as reported here).
>>>>
>>>> You can avoid this message by specifying -quiet on the mpirun
>>>> command line.
>>>>
>>>> --------------------------------------------------------------------------
>>>> [SERVER-2:08907] 6 more processes have sent help message
>>>> help-mpi-runtime / mpi_init:startup:internal-failure
>>>> [SERVER-2:08907] Set MCA parameter "orte_base_help_aggregate" to 0
>>>> to see all help / error messages
>>>> [SERVER-2:08907] 9 more processes have sent help message
>>>> help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
>>>> [SERVER-2:08907] 9 more processes have sent help message
>>>> help-mpi-runtime.txt / ompi mpi abort:cannot guarantee all killed
>>>> [SERVER-2:08907] 2 more processes have sent help message
>>>> help-mca-bml-r2.txt / unreachable proc
>>>> [SERVER-2:08907] 2 more processes have sent help message
>>>> help-mpi-runtime / mpi_init:startup:pml-add-procs-fail
>>>> [SERVER-2:08907] sess_dir_finalize: job session dir not empty - leaving
>>>> exiting with status 1
>>>>
>>>> //******************************************************************
>>>>
>>>> On 8/3/13 4:34 AM, Ralph Castain wrote:
>>>>> It looks like SERVER-2 cannot talk to your x.x.x.100 machine. I
>>>>> note that you have some entries at the end of the hostfile that I
>>>>> don't understand - a list of hosts that can be reached? And I see
>>>>> that your x.x.x.22 machine isn't on it. Is that SERVER-2 by chance?
>>>>>
>>>>> Our hostfile parsing changed between the release series, but I
>>>>> know we never consciously supported the syntax you show below
>>>>> where you list capabilities, and then re-list the hosts in an
>>>>> apparent attempt to filter which ones can actually be used. It is
>>>>> possible that the 1.5 series somehow used that to exclude the 22
>>>>> machine, and that the 1.7 parser now doesn't do that.
>>>>>
>>>>> If you only include machines you actually intend to use in your
>>>>> hostfile, does the 1.7 series work?
>>>>>
>>>>> On Aug 3, 2013, at 3:58 AM, RoboBeans <robobeans_at_[hidden]
>>>>> <mailto:robobeans_at_[hidden]>> wrote:
>>>>>
>>>>>> Hello everyone,
>>>>>>
>>>>>> I have installed Open MPI 1.5.4 on an 11-node cluster using "yum
>>>>>> install openmpi openmpi-devel" and everything seems to be working
>>>>>> fine. For testing I am using this test program:
>>>>>>
>>>>>> //******************************************************************
>>>>>>
>>>>>> *$ cat test.cpp*
>>>>>>
>>>>>> #include <stdio.h>
>>>>>> #include <mpi.h>
>>>>>>
>>>>>> int main (int argc, char *argv[])
>>>>>> {
>>>>>>     int id, np;
>>>>>>     char name[MPI_MAX_PROCESSOR_NAME];
>>>>>>     int namelen;
>>>>>>     int i;
>>>>>>
>>>>>>     MPI_Init (&argc, &argv);
>>>>>>
>>>>>>     MPI_Comm_size (MPI_COMM_WORLD, &np);
>>>>>>     MPI_Comm_rank (MPI_COMM_WORLD, &id);
>>>>>>     MPI_Get_processor_name (name, &namelen);
>>>>>>
>>>>>>     printf ("This is Process %2d out of %2d running on host %s\n",
>>>>>>             id, np, name);
>>>>>>
>>>>>>     MPI_Finalize ();
>>>>>>
>>>>>>     return (0);
>>>>>> }
>>>>>>
>>>>>> //******************************************************************
>>>>>>
>>>>>> and my hosts file looks like this:
>>>>>>
>>>>>> *$ cat mpi_hostfile*
>>>>>>
>>>>>> # The Hostfile for Open MPI
>>>>>>
>>>>>> # specify number of slots for processes to run locally.
>>>>>> #localhost slots=12
>>>>>> #x.x.x.16 slots=12 max-slots=12
>>>>>> #x.x.x.17 slots=12 max-slots=12
>>>>>> #x.x.x.18 slots=12 max-slots=12
>>>>>> #x.x.1x.19 slots=12 max-slots=12
>>>>>> #x.x.x.20 slots=12 max-slots=12
>>>>>> #x.x.x.55 slots=46 max-slots=46
>>>>>> #x.x.x.56 slots=46 max-slots=46
>>>>>>
>>>>>> x.x.x.22 slots=15 max-slots=15
>>>>>> x.x.x.24 slots=2 max-slots=2
>>>>>> x.x.x.26 slots=14 max-slots=14
>>>>>> x.x.x.28 slots=16 max-slots=16
>>>>>> x.x.x.29 slots=14 max-slots=14
>>>>>> x.x.x.30 slots=16 max-slots=16
>>>>>> x.x.x.41 slots=46 max-slots=46
>>>>>> x.x.x.101 slots=46 max-slots=46
>>>>>> x.x.x.100 slots=46 max-slots=46
>>>>>> x.x.x.102 slots=22 max-slots=22
>>>>>> x.x.x.103 slots=22 max-slots=22
>>>>>>
>>>>>> # The following slave nodes are available to this machine:
>>>>>> x.x.x.24
>>>>>> x.x.x.26
>>>>>> x.x.x.28
>>>>>> x.x.x.29
>>>>>> x.x.x.30
>>>>>> x.x.x.41
>>>>>> x.x.x.101
>>>>>> x.x.x.100
>>>>>> x.x.x.102
>>>>>> x.x.x.103
>>>>>>
>>>>>> //******************************************************************
>>>>>>
>>>>>> this is what my .bashrc looks like on each node:
>>>>>>
>>>>>> *$ cat ~/.bashrc*
>>>>>>
>>>>>> # .bashrc
>>>>>>
>>>>>> # Source global definitions
>>>>>> if [ -f /etc/bashrc ]; then
>>>>>> . /etc/bashrc
>>>>>> fi
>>>>>>
>>>>>> # User specific aliases and functions
>>>>>> umask 077
>>>>>>
>>>>>> export PSM_SHAREDCONTEXTS_MAX=20
>>>>>>
>>>>>> #export PATH=/usr/lib64/openmpi/bin${PATH:+:$PATH}
>>>>>> export PATH=/usr/OPENMPI/openmpi-1.7.2/bin${PATH:+:$PATH}
>>>>>>
>>>>>> #export
>>>>>> LD_LIBRARY_PATH=/usr/lib64/openmpi/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
>>>>>> export
>>>>>> LD_LIBRARY_PATH=/usr/OPENMPI/openmpi-1.7.2/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
>>>>>>
>>>>>> export PATH="$PATH":/bin/:/usr/lib/:/usr/lib:/usr:/usr/
>>>>>>
>>>>>> //******************************************************************
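
Since mpirun launches the remote orted daemons over ssh, the PATH and
LD_LIBRARY_PATH set above also need to be picked up by non-interactive
shells on every node; a quick sanity check from the head node might be
(node address taken from the hostfile, command purely illustrative):

$ ssh x.x.x.100 'which mpirun orted; echo $LD_LIBRARY_PATH'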
>>>>>>
>>>>>> *$ mpic++ test.cpp -o test*
>>>>>>
>>>>>> *$ mpirun -d --display-map -np 10 --hostfile mpi_hostfile
>>>>>> --bynode ./test*
>>>>>>
>>>>>> //******************************************************************
>>>>>>
>>>>>> These nodes are running the 2.6.32-358.2.1.el6.x86_64 kernel:
>>>>>>
>>>>>> *$ uname*
>>>>>> Linux
>>>>>> *$ uname -r*
>>>>>> 2.6.32-358.2.1.el6.x86_64
>>>>>> *$ cat /etc/issue*
>>>>>> CentOS release 6.4 (Final)
>>>>>> Kernel \r on an \m
>>>>>>
>>>>>> //******************************************************************
>>>>>>
>>>>>> Now, if I install Open MPI 1.7.2 on each node separately, I can
>>>>>> only use it on either the first 7 nodes or the last 4 nodes, but
>>>>>> not across all of them.
>>>>>>
>>>>>> //******************************************************************
>>>>>>
>>>>>> *$ gunzip -c openmpi-1.7.2.tar.gz | tar xf -*
>>>>>>
>>>>>> *$ cd openmpi-1.7.2*
>>>>>>
>>>>>> *$ ./configure --prefix=/usr/OPENMPI/openmpi-1.7.2 --enable-event-thread-support --enable-opal-multi-threads --enable-orte-progress-threads --enable-mpi-thread-multiple*
>>>>>>
>>>>>> *$ make all install*
>>>>>>
>>>>>> //******************************************************************
>>>>>>
>>>>>> This is the error message that I am receiving:
>>>>>>
>>>>>>
>>>>>> *$ mpirun -d --display-map -np 10 --hostfile mpi_hostfile
>>>>>> --bynode ./test*
>>>>>>
>>>>>> [SERVER-2:05284] procdir:
>>>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-2_0/50535/0/0
>>>>>> [SERVER-2:05284] jobdir:
>>>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-2_0/50535/0
>>>>>> [SERVER-2:05284] top: openmpi-sessions-mpidemo_at_SERVER-2_0
>>>>>> [SERVER-2:05284] tmp: /tmp
>>>>>> CentOS release 6.4 (Final)
>>>>>> Kernel \r on an \m
>>>>>> CentOS release 6.4 (Final)
>>>>>> Kernel \r on an \m
>>>>>> CentOS release 6.4 (Final)
>>>>>> Kernel \r on an \m
>>>>>> [SERVER-3:28993] procdir:
>>>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-3_0/50535/0/1
>>>>>> [SERVER-3:28993] jobdir:
>>>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-3_0/50535/0
>>>>>> [SERVER-3:28993] top: openmpi-sessions-mpidemo_at_SERVER-3_0
>>>>>> [SERVER-3:28993] tmp: /tmp
>>>>>> CentOS release 6.4 (Final)
>>>>>> Kernel \r on an \m
>>>>>> CentOS release 6.4 (Final)
>>>>>> Kernel \r on an \m
>>>>>> [SERVER-6:09087] procdir:
>>>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-6_0/50535/0/4
>>>>>> [SERVER-6:09087] jobdir:
>>>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-6_0/50535/0
>>>>>> [SERVER-6:09087] top: openmpi-sessions-mpidemo_at_SERVER-6_0
>>>>>> [SERVER-6:09087] tmp: /tmp
>>>>>> [SERVER-7:32563] procdir:
>>>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-7_0/50535/0/5
>>>>>> [SERVER-7:32563] jobdir:
>>>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-7_0/50535/0
>>>>>> [SERVER-7:32563] top: openmpi-sessions-mpidemo_at_SERVER-7_0
>>>>>> [SERVER-7:32563] tmp: /tmp
>>>>>> [SERVER-4:15711] procdir:
>>>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-4_0/50535/0/2
>>>>>> [SERVER-4:15711] jobdir:
>>>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-4_0/50535/0
>>>>>> [SERVER-4:15711] top: openmpi-sessions-mpidemo_at_SERVER-4_0
>>>>>> [SERVER-4:15711] tmp: /tmp
>>>>>> [sv-1:45701] procdir: /tmp/openmpi-sessions-mpidemo_at_sv-1_0/50535/0/8
>>>>>> [sv-1:45701] jobdir: /tmp/openmpi-sessions-mpidemo_at_sv-1_0/50535/0
>>>>>> [sv-1:45701] top: openmpi-sessions-mpidemo_at_sv-1_0
>>>>>> [sv-1:45701] tmp: /tmp
>>>>>> CentOS release 6.4 (Final)
>>>>>> Kernel \r on an \m
>>>>>> [sv-3:08352] procdir: /tmp/openmpi-sessions-mpidemo_at_sv-3_0/50535/0/9
>>>>>> [sv-3:08352] jobdir: /tmp/openmpi-sessions-mpidemo_at_sv-3_0/50535/0
>>>>>> [sv-3:08352] top: openmpi-sessions-mpidemo_at_sv-3_0
>>>>>> [sv-3:08352] tmp: /tmp
>>>>>> [SERVER-5:12534] procdir:
>>>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-5_0/50535/0/3
>>>>>> [SERVER-5:12534] jobdir:
>>>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-5_0/50535/0
>>>>>> [SERVER-5:12534] top: openmpi-sessions-mpidemo_at_SERVER-5_0
>>>>>> [SERVER-5:12534] tmp: /tmp
>>>>>> [SERVER-14:08399] procdir:
>>>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-14_0/50535/0/6
>>>>>> [SERVER-14:08399] jobdir:
>>>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-14_0/50535/0
>>>>>> [SERVER-14:08399] top: openmpi-sessions-mpidemo_at_SERVER-14_0
>>>>>> [SERVER-14:08399] tmp: /tmp
>>>>>> [sv-4:11802] procdir: /tmp/openmpi-sessions-mpidemo_at_sv-4_0/50535/0/10
>>>>>> [sv-4:11802] jobdir: /tmp/openmpi-sessions-mpidemo_at_sv-4_0/50535/0
>>>>>> [sv-4:11802] top: openmpi-sessions-mpidemo_at_sv-4_0
>>>>>> [sv-4:11802] tmp: /tmp
>>>>>> [sv-2:07503] procdir: /tmp/openmpi-sessions-mpidemo_at_sv-2_0/50535/0/7
>>>>>> [sv-2:07503] jobdir: /tmp/openmpi-sessions-mpidemo_at_sv-2_0/50535/0
>>>>>> [sv-2:07503] top: openmpi-sessions-mpidemo_at_sv-2_0
>>>>>> [sv-2:07503] tmp: /tmp
>>>>>>
>>>>>> Mapper requested: NULL Last mapper: round_robin Mapping
>>>>>> policy: BYNODE Ranking policy: NODE Binding policy: NONE[NODE]
>>>>>> Cpu set: NULL PPR: NULL
>>>>>> Num new daemons: 0 New daemon starting vpid INVALID
>>>>>> Num nodes: 10
>>>>>>
>>>>>> Data for node: SERVER-2 Launch id: -1 State: 2
>>>>>> Daemon: [[50535,0],0] Daemon launched: True
>>>>>> Num slots: 15 Slots in use: 1 Oversubscribed: FALSE
>>>>>> Num slots allocated: 15 Max slots: 15
>>>>>> Username on node: NULL
>>>>>> Num procs: 1 Next node_rank: 1
>>>>>> Data for proc: [[50535,1],0]
>>>>>> Pid: 0 Local rank: 0 Node rank: 0 App rank: 0
>>>>>> State: INITIALIZED Restarts: 0 App_context: 0 Locale:
>>>>>> 0-15 Binding: NULL[0]
>>>>>>
>>>>>> Data for node: x.x.x.24 Launch id: -1 State: 0
>>>>>> Daemon: [[50535,0],1] Daemon launched: False
>>>>>> Num slots: 3 Slots in use: 1 Oversubscribed: FALSE
>>>>>> Num slots allocated: 3 Max slots: 2
>>>>>> Username on node: NULL
>>>>>> Num procs: 1 Next node_rank: 1
>>>>>> Data for proc: [[50535,1],1]
>>>>>> Pid: 0 Local rank: 0 Node rank: 0 App rank: 1
>>>>>> State: INITIALIZED Restarts: 0 App_context: 0 Locale:
>>>>>> 0-7 Binding: NULL[0]
>>>>>>
>>>>>> Data for node: x.x.x.26 Launch id: -1 State: 0
>>>>>> Daemon: [[50535,0],2] Daemon launched: False
>>>>>> Num slots: 15 Slots in use: 1 Oversubscribed: FALSE
>>>>>> Num slots allocated: 15 Max slots: 14
>>>>>> Username on node: NULL
>>>>>> Num procs: 1 Next node_rank: 1
>>>>>> Data for proc: [[50535,1],2]
>>>>>> Pid: 0 Local rank: 0 Node rank: 0 App rank: 2
>>>>>> State: INITIALIZED Restarts: 0 App_context: 0 Locale:
>>>>>> 0-7 Binding: NULL[0]
>>>>>>
>>>>>> Data for node: x.x.x.28 Launch id: -1 State: 0
>>>>>> Daemon: [[50535,0],3] Daemon launched: False
>>>>>> Num slots: 17 Slots in use: 1 Oversubscribed: FALSE
>>>>>> Num slots allocated: 17 Max slots: 16
>>>>>> Username on node: NULL
>>>>>> Num procs: 1 Next node_rank: 1
>>>>>> Data for proc: [[50535,1],3]
>>>>>> Pid: 0 Local rank: 0 Node rank: 0 App rank: 3
>>>>>> State: INITIALIZED Restarts: 0 App_context: 0 Locale:
>>>>>> 0-7 Binding: NULL[0]
>>>>>>
>>>>>> Data for node: x.x.x.29 Launch id: -1 State: 0
>>>>>> Daemon: [[50535,0],4] Daemon launched: False
>>>>>> Num slots: 15 Slots in use: 1 Oversubscribed: FALSE
>>>>>> Num slots allocated: 15 Max slots: 14
>>>>>> Username on node: NULL
>>>>>> Num procs: 1 Next node_rank: 1
>>>>>> Data for proc: [[50535,1],4]
>>>>>> Pid: 0 Local rank: 0 Node rank: 0 App rank: 4
>>>>>> State: INITIALIZED Restarts: 0 App_context: 0 Locale:
>>>>>> 0-7 Binding: NULL[0]
>>>>>>
>>>>>> Data for node: x.x.x.30 Launch id: -1 State: 0
>>>>>> Daemon: [[50535,0],5] Daemon launched: False
>>>>>> Num slots: 17 Slots in use: 1 Oversubscribed: FALSE
>>>>>> Num slots allocated: 17 Max slots: 16
>>>>>> Username on node: NULL
>>>>>> Num procs: 1 Next node_rank: 1
>>>>>> Data for proc: [[50535,1],5]
>>>>>> Pid: 0 Local rank: 0 Node rank: 0 App rank: 5
>>>>>> State: INITIALIZED Restarts: 0 App_context: 0 Locale:
>>>>>> 0-7 Binding: NULL[0]
>>>>>>
>>>>>> Data for node: x.x.x.41 Launch id: -1 State: 0
>>>>>> Daemon: [[50535,0],6] Daemon launched: False
>>>>>> Num slots: 47 Slots in use: 1 Oversubscribed: FALSE
>>>>>> Num slots allocated: 47 Max slots: 46
>>>>>> Username on node: NULL
>>>>>> Num procs: 1 Next node_rank: 1
>>>>>> Data for proc: [[50535,1],6]
>>>>>> Pid: 0 Local rank: 0 Node rank: 0 App rank: 6
>>>>>> State: INITIALIZED Restarts: 0 App_context: 0 Locale:
>>>>>> 0-7 Binding: NULL[0]
>>>>>>
>>>>>> Data for node: x.x.x.101 Launch id: -1 State: 0
>>>>>> Daemon: [[50535,0],7] Daemon launched: False
>>>>>> Num slots: 47 Slots in use: 1 Oversubscribed: FALSE
>>>>>> Num slots allocated: 47 Max slots: 46
>>>>>> Username on node: NULL
>>>>>> Num procs: 1 Next node_rank: 1
>>>>>> Data for proc: [[50535,1],7]
>>>>>> Pid: 0 Local rank: 0 Node rank: 0 App rank: 7
>>>>>> State: INITIALIZED Restarts: 0 App_context: 0 Locale:
>>>>>> 0-7 Binding: NULL[0]
>>>>>>
>>>>>> Data for node: x.x.x.100 Launch id: -1 State: 0
>>>>>> Daemon: [[50535,0],8] Daemon launched: False
>>>>>> Num slots: 47 Slots in use: 1 Oversubscribed: FALSE
>>>>>> Num slots allocated: 47 Max slots: 46
>>>>>> Username on node: NULL
>>>>>> Num procs: 1 Next node_rank: 1
>>>>>> Data for proc: [[50535,1],8]
>>>>>> Pid: 0 Local rank: 0 Node rank: 0 App rank: 8
>>>>>> State: INITIALIZED Restarts: 0 App_context: 0 Locale:
>>>>>> 0-7 Binding: NULL[0]
>>>>>>
>>>>>> Data for node: x.x.x.102 Launch id: -1 State: 0
>>>>>> Daemon: [[50535,0],9] Daemon launched: False
>>>>>> Num slots: 23 Slots in use: 1 Oversubscribed: FALSE
>>>>>> Num slots allocated: 23 Max slots: 22
>>>>>> Username on node: NULL
>>>>>> Num procs: 1 Next node_rank: 1
>>>>>> Data for proc: [[50535,1],9]
>>>>>> Pid: 0 Local rank: 0 Node rank: 0 App rank: 9
>>>>>> State: INITIALIZED Restarts: 0 App_context: 0 Locale:
>>>>>> 0-7 Binding: NULL[0]
>>>>>> [sv-1:45712] procdir: /tmp/openmpi-sessions-mpidemo_at_sv-1_0/50535/1/8
>>>>>> [sv-1:45712] jobdir: /tmp/openmpi-sessions-mpidemo_at_sv-1_0/50535/1
>>>>>> [sv-1:45712] top: openmpi-sessions-mpidemo_at_sv-1_0
>>>>>> [sv-1:45712] tmp: /tmp
>>>>>> [SERVER-14:08412] procdir:
>>>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-14_0/50535/1/6
>>>>>> [SERVER-14:08412] jobdir:
>>>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-14_0/50535/1
>>>>>> [SERVER-14:08412] top: openmpi-sessions-mpidemo_at_SERVER-14_0
>>>>>> [SERVER-14:08412] tmp: /tmp
>>>>>> [SERVER-2:05291] procdir:
>>>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-2_0/50535/1/0
>>>>>> [SERVER-2:05291] jobdir:
>>>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-2_0/50535/1
>>>>>> [SERVER-2:05291] top: openmpi-sessions-mpidemo_at_SERVER-2_0
>>>>>> [SERVER-2:05291] tmp: /tmp
>>>>>> [SERVER-4:15726] procdir:
>>>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-4_0/50535/1/2
>>>>>> [SERVER-4:15726] jobdir:
>>>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-4_0/50535/1
>>>>>> [SERVER-4:15726] top: openmpi-sessions-mpidemo_at_SERVER-4_0
>>>>>> [SERVER-4:15726] tmp: /tmp
>>>>>> [SERVER-6:09100] procdir:
>>>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-6_0/50535/1/4
>>>>>> [SERVER-6:09100] jobdir:
>>>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-6_0/50535/1
>>>>>> [SERVER-6:09100] top: openmpi-sessions-mpidemo_at_SERVER-6_0
>>>>>> [SERVER-6:09100] tmp: /tmp
>>>>>> [SERVER-7:32576] procdir:
>>>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-7_0/50535/1/5
>>>>>> [SERVER-7:32576] jobdir:
>>>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-7_0/50535/1
>>>>>> [SERVER-7:32576] top: openmpi-sessions-mpidemo_at_SERVER-7_0
>>>>>> [SERVER-7:32576] tmp: /tmp
>>>>>> [sv-3:08363] procdir: /tmp/openmpi-sessions-mpidemo_at_sv-3_0/50535/1/9
>>>>>> [sv-3:08363] jobdir: /tmp/openmpi-sessions-mpidemo_at_sv-3_0/50535/1
>>>>>> [sv-3:08363] top: openmpi-sessions-mpidemo_at_sv-3_0
>>>>>> [sv-3:08363] tmp: /tmp
>>>>>> [sv-2:07514] procdir: /tmp/openmpi-sessions-mpidemo_at_sv-2_0/50535/1/7
>>>>>> [sv-2:07514] jobdir: /tmp/openmpi-sessions-mpidemo_at_sv-2_0/50535/1
>>>>>> [sv-2:07514] top: openmpi-sessions-mpidemo_at_sv-2_0
>>>>>> [sv-2:07514] tmp: /tmp
>>>>>> [SERVER-5:12548] procdir:
>>>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-5_0/50535/1/3
>>>>>> [SERVER-5:12548] jobdir:
>>>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-5_0/50535/1
>>>>>> [SERVER-5:12548] top: openmpi-sessions-mpidemo_at_SERVER-5_0
>>>>>> [SERVER-5:12548] tmp: /tmp
>>>>>> [SERVER-3:29009] procdir:
>>>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-3_0/50535/1/1
>>>>>> [SERVER-3:29009] jobdir:
>>>>>> /tmp/openmpi-sessions-mpidemo_at_SERVER-3_0/50535/1
>>>>>> [SERVER-3:29009] top: openmpi-sessions-mpidemo_at_SERVER-3_0
>>>>>> [SERVER-3:29009] tmp: /tmp
>>>>>> MPIR_being_debugged = 0
>>>>>> MPIR_debug_state = 1
>>>>>> MPIR_partial_attach_ok = 1
>>>>>> MPIR_i_am_starter = 0
>>>>>> MPIR_forward_output = 0
>>>>>> MPIR_proctable_size = 10
>>>>>> MPIR_proctable:
>>>>>> (i, host, exe, pid) = (0, SERVER-2,
>>>>>> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 5291)
>>>>>> (i, host, exe, pid) = (1, x.x.x.24,
>>>>>> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 29009)
>>>>>> (i, host, exe, pid) = (2, x.x.x.26,
>>>>>> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 15726)
>>>>>> (i, host, exe, pid) = (3, x.x.x.28,
>>>>>> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 12548)
>>>>>> (i, host, exe, pid) = (4, x.x.x.29,
>>>>>> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 9100)
>>>>>> (i, host, exe, pid) = (5, x.x.x.30,
>>>>>> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 32576)
>>>>>> (i, host, exe, pid) = (6, x.x.x.41,
>>>>>> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 8412)
>>>>>> (i, host, exe, pid) = (7, x.x.x.101,
>>>>>> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 7514)
>>>>>> (i, host, exe, pid) = (8, x.x.x.100,
>>>>>> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 45712)
>>>>>> (i, host, exe, pid) = (9, x.x.x.102,
>>>>>> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 8363)
>>>>>> MPIR_executable_path: NULL
>>>>>> MPIR_server_arguments: NULL
>>>>>> --------------------------------------------------------------------------
>>>>>> It looks like MPI_INIT failed for some reason; your parallel
>>>>>> process is
>>>>>> likely to abort. There are many reasons that a parallel process can
>>>>>> fail during MPI_INIT; some of which are due to configuration or
>>>>>> environment
>>>>>> problems. This failure appears to be an internal failure; here's
>>>>>> some
>>>>>> additional information (which may only be relevant to an Open MPI
>>>>>> developer):
>>>>>>
>>>>>> PML add procs failed
>>>>>> --> Returned "Error" (-1) instead of "Success" (0)
>>>>>> --------------------------------------------------------------------------
>>>>>> [SERVER-2:5291] *** An error occurred in MPI_Init
>>>>>> [SERVER-2:5291] *** reported by process
>>>>>> [140508871983105,140505560121344]
>>>>>> [SERVER-2:5291] *** on a NULL communicator
>>>>>> [SERVER-2:5291] *** Unknown error
>>>>>> [SERVER-2:5291] *** MPI_ERRORS_ARE_FATAL (processes in this
>>>>>> communicator will now abort,
>>>>>> [SERVER-2:5291] *** and potentially your MPI job)
>>>>>> --------------------------------------------------------------------------
>>>>>> An MPI process is aborting at a time when it cannot guarantee
>>>>>> that all
>>>>>> of its peer processes in the job will be killed properly. You should
>>>>>> double check that everything has shut down cleanly.
>>>>>>
>>>>>> Reason: Before MPI_INIT completed
>>>>>> Local host: SERVER-2
>>>>>> PID: 5291
>>>>>> --------------------------------------------------------------------------
>>>>>> [sv-1][[50535,1],8][btl_openib_proc.c:157:mca_btl_openib_proc_create]
>>>>>> [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[50535,1],0]
>>>>>> [sv-3][[50535,1],9][btl_openib_proc.c:157:mca_btl_openib_proc_create]
>>>>>> [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[50535,1],0]
>>>>>> [sv-3][[50535,1],9][btl_tcp_proc.c:128:mca_btl_tcp_proc_create]
>>>>>> mca_base_modex_recv: failed with return value=-13
>>>>>> [sv-3][[50535,1],9][btl_tcp_proc.c:128:mca_btl_tcp_proc_create]
>>>>>> mca_base_modex_recv: failed with return value=-13
>>>>>> [sv-1][[50535,1],8][btl_tcp_proc.c:128:mca_btl_tcp_proc_create]
>>>>>> mca_base_modex_recv: failed with return value=-13
>>>>>> [sv-1][[50535,1],8][btl_tcp_proc.c:128:mca_btl_tcp_proc_create]
>>>>>> mca_base_modex_recv: failed with return value=-13
>>>>>> --------------------------------------------------------------------------
>>>>>> At least one pair of MPI processes are unable to reach each other for
>>>>>> MPI communications. This means that no Open MPI device has indicated
>>>>>> that it can be used to communicate between these processes. This is
>>>>>> an error; Open MPI requires that all MPI processes be able to reach
>>>>>> each other. This error can sometimes be the result of forgetting to
>>>>>> specify the "self" BTL.
>>>>>>
>>>>>> Process 1 ([[50535,1],8]) is on host: sv-1
>>>>>> Process 2 ([[50535,1],0]) is on host: SERVER-2
>>>>>> BTLs attempted: openib self sm tcp
>>>>>>
>>>>>> Your MPI job is now going to abort; sorry.
>>>>>> --------------------------------------------------------------------------
>>>>>> --------------------------------------------------------------------------
>>>>>> MPI_INIT has failed because at least one MPI process is unreachable
>>>>>> from another. This *usually* means that an underlying communication
>>>>>> plugin -- such as a BTL or an MTL -- has either not loaded or not
>>>>>> allowed itself to be used. Your MPI job will now abort.
>>>>>>
>>>>>> You may wish to try to narrow down the problem;
>>>>>>
>>>>>> * Check the output of ompi_info to see which BTL/MTL plugins are
>>>>>> available.
>>>>>> * Run your application with MPI_THREAD_SINGLE.
>>>>>> * Set the MCA parameter btl_base_verbose to 100 (or
>>>>>> mtl_base_verbose,
>>>>>> if using MTL-based communications) to see exactly which
>>>>>> communication plugins were considered and/or discarded.
>>>>>> --------------------------------------------------------------------------
>>>>>> [sv-2][[50535,1],7][btl_openib_proc.c:157:mca_btl_openib_proc_create]
>>>>>> [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[50535,1],0]
>>>>>> [sv-2][[50535,1],7][btl_tcp_proc.c:128:mca_btl_tcp_proc_create]
>>>>>> mca_base_modex_recv: failed with return value=-13
>>>>>> [sv-2][[50535,1],7][btl_tcp_proc.c:128:mca_btl_tcp_proc_create]
>>>>>> mca_base_modex_recv: failed with return value=-13
>>>>>> [SERVER-2:05284] sess_dir_finalize: proc session dir not empty -
>>>>>> leaving
>>>>>> [SERVER-2:05284] sess_dir_finalize: proc session dir not empty -
>>>>>> leaving
>>>>>> [sv-4:11802] sess_dir_finalize: job session dir not empty - leaving
>>>>>> [SERVER-14:08399] sess_dir_finalize: job session dir not empty -
>>>>>> leaving
>>>>>> [SERVER-6:09087] sess_dir_finalize: proc session dir not empty -
>>>>>> leaving
>>>>>> [SERVER-6:09087] sess_dir_finalize: proc session dir not empty -
>>>>>> leaving
>>>>>> [SERVER-4:15711] sess_dir_finalize: proc session dir not empty -
>>>>>> leaving
>>>>>> [SERVER-4:15711] sess_dir_finalize: proc session dir not empty -
>>>>>> leaving
>>>>>> [SERVER-6:09087] sess_dir_finalize: job session dir not empty -
>>>>>> leaving
>>>>>> exiting with status 0
>>>>>> [SERVER-7:32563] sess_dir_finalize: proc session dir not empty -
>>>>>> leaving
>>>>>> [SERVER-7:32563] sess_dir_finalize: proc session dir not empty -
>>>>>> leaving
>>>>>> [SERVER-5:12534] sess_dir_finalize: proc session dir not empty -
>>>>>> leaving
>>>>>> [SERVER-5:12534] sess_dir_finalize: proc session dir not empty -
>>>>>> leaving
>>>>>> [SERVER-7:32563] sess_dir_finalize: job session dir not empty -
>>>>>> leaving
>>>>>> exiting with status 0
>>>>>> exiting with status 0
>>>>>> exiting with status 0
>>>>>> [SERVER-4:15711] sess_dir_finalize: job session dir not empty -
>>>>>> leaving
>>>>>> [SERVER-3:28993] sess_dir_finalize: proc session dir not empty -
>>>>>> leaving
>>>>>> exiting with status 0
>>>>>> [SERVER-3:28993] sess_dir_finalize: proc session dir not empty -
>>>>>> leaving
>>>>>> [sv-3:08352] sess_dir_finalize: proc session dir not empty - leaving
>>>>>> [sv-3:08352] sess_dir_finalize: job session dir not empty - leaving
>>>>>> [sv-1:45701] sess_dir_finalize: proc session dir not empty - leaving
>>>>>> [sv-1:45701] sess_dir_finalize: job session dir not empty - leaving
>>>>>> exiting with status 0
>>>>>> exiting with status 0
>>>>>> [sv-2:07503] sess_dir_finalize: proc session dir not empty - leaving
>>>>>> [sv-2:07503] sess_dir_finalize: job session dir not empty - leaving
>>>>>> exiting with status 0
>>>>>> [SERVER-5:12534] sess_dir_finalize: job session dir not empty -
>>>>>> leaving
>>>>>> exiting with status 0
>>>>>> [SERVER-3:28993] sess_dir_finalize: job session dir not empty -
>>>>>> leaving
>>>>>> exiting with status 0
>>>>>> --------------------------------------------------------------------------
>>>>>> mpirun has exited due to process rank 6 with PID 8412 on
>>>>>> node x.x.x.41 exiting improperly. There are three reasons this
>>>>>> could occur:
>>>>>>
>>>>>> 1. this process did not call "init" before exiting, but others in
>>>>>> the job did. This can cause a job to hang indefinitely while it waits
>>>>>> for all processes to call "init". By rule, if one process calls
>>>>>> "init",
>>>>>> then ALL processes must call "init" prior to termination.
>>>>>>
>>>>>> 2. this process called "init", but exited without calling "finalize".
>>>>>> By rule, all processes that call "init" MUST call "finalize" prior to
>>>>>> exiting or it will be considered an "abnormal termination"
>>>>>>
>>>>>> 3. this process called "MPI_Abort" or "orte_abort" and the mca
>>>>>> parameter
>>>>>> orte_create_session_dirs is set to false. In this case, the
>>>>>> run-time cannot
>>>>>> detect that the abort call was an abnormal termination. Hence,
>>>>>> the only
>>>>>> error message you will receive is this one.
>>>>>>
>>>>>> This may have caused other processes in the application to be
>>>>>> terminated by signals sent by mpirun (as reported here).
>>>>>>
>>>>>> You can avoid this message by specifying -quiet on the mpirun
>>>>>> command line.
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>> [SERVER-2:05284] 6 more processes have sent help message
>>>>>> help-mpi-runtime / mpi_init:startup:internal-failure
>>>>>> [SERVER-2:05284] Set MCA parameter "orte_base_help_aggregate" to
>>>>>> 0 to see all help / error messages
>>>>>> [SERVER-2:05284] 9 more processes have sent help message
>>>>>> help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
>>>>>> [SERVER-2:05284] 9 more processes have sent help message
>>>>>> help-mpi-runtime.txt / ompi mpi abort:cannot guarantee all killed
>>>>>> [SERVER-2:05284] 2 more processes have sent help message
>>>>>> help-mca-bml-r2.txt / unreachable proc
>>>>>> [SERVER-2:05284] 2 more processes have sent help message
>>>>>> help-mpi-runtime / mpi_init:startup:pml-add-procs-fail
>>>>>> [SERVER-2:05284] sess_dir_finalize: job session dir not empty -
>>>>>> leaving
>>>>>> exiting with status 1
>>>>>>
>>>>>> //******************************************************************
>>>>>>
>>>>>> Any feedback will be helpful. Thank you!
>>>>>>
>>>>>> Mr. Beans