Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] ERROR: At least one pair of MPI processes are unable to reach each other for MPI communications.
From: RoboBeans (robobeans_at_[hidden])
Date: 2013-08-04 21:26:46


Hi Tom,

As per your suggestion, I tried

./configure --with-psm --prefix=/opt/openmpi-1.7.2 \
    --enable-event-thread-support --enable-opal-multi-threads \
    --enable-orte-progress-threads --enable-mpi-thread-multiple

but I am getting this error:

--- MCA component mtl:psm (m4 configuration macro)
checking for MCA component mtl:psm compile mode... dso
checking --with-psm value... simple ok (unspecified)
checking --with-psm-libdir value... simple ok (unspecified)
checking psm.h usability... no
checking psm.h presence... yes
configure: WARNING: psm.h: present but cannot be compiled
configure: WARNING: psm.h: check for missing prerequisite headers?
configure: WARNING: psm.h: see the Autoconf documentation
configure: WARNING: psm.h: section "Present But Cannot Be Compiled"
configure: WARNING: psm.h: proceeding with the compiler's result
configure: WARNING: ## ------------------------------------------------------ ##
configure: WARNING: ## Report this to http://www.open-mpi.org/community/help/ ##
configure: WARNING: ## ------------------------------------------------------ ##
checking for psm.h... no
configure: error: PSM support requested but not found. Aborting
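
For context, this warning means the compiler found psm.h but could not compile a file that includes it, typically because a header that psm.h itself needs is missing; the actual compiler error is recorded in config.log. A minimal sketch that reproduces the same check outside of configure, assuming gcc and the default include path (psm_check.c is just a throwaway file name):

$ printf '#include <psm.h>\nint main(void) { return 0; }\n' > psm_check.c
$ gcc -c psm_check.c    # if this fails, the error names the missing prerequisite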

Any feedback will be helpful. Thanks for your time!

Mr. Beans

On 8/4/13 10:31 AM, Elken, Tom wrote:
>
> On 8/3/13 7:09 PM, RoboBeans wrote:
>
> On the first 7 nodes:
>
> [mpidemo_at_SERVER-3 ~]$ ofed_info | head -n 1
> OFED-1.5.3.2:
>
> On the last 4 nodes:
>
> [mpidemo_at_sv-2 ~]$ ofed_info | head -n 1
> -bash: ofed_info: command not found
>
> [Tom] This is a pretty good clue that OFED is not installed on the
> last 4 nodes. You should fix that by installing OFED 1.5.3.2 on
> the last 4 nodes, OR better (but more work) install a newer OFED
> such as 1.5.4.1 or 3.5 on ALL the nodes. (You need to look at the
> OFED release notes to see if your OS is supported by these OFEDs.)
>
> BTW, since you are using QLogic HCAs, they typically give the
> best performance when using the PSM API to the HCA. PSM is
> part of OFED. To use this by default with Open MPI, you can build
> Open MPI as follows:
>
> ./configure --with-psm --prefix=<install directory>
>
> make
>
> make install
>
> With an Open MPI that is already built, you can try to use PSM
> as follows:
> mpirun ... --mca mtl psm --mca btl ^openib ...
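>
> (For example, combined with the hostfile and test program shown later in
> this thread, the invocation might look like the following; a sketch, not
> a command verified on this cluster:)
>
> mpirun -np 10 --hostfile mpi_hostfile --bynode --mca mtl psm --mca btl ^openib ./test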
>
> -Tom
>
> [mpidemo_at_sv-2 ~]$ which ofed_info
> /usr/bin/which: no ofed_info in
> (/usr/OPENMPI/openmpi-1.7.2/bin:/usr/OPENMPI/openmpi-1.7.2/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/bin/:/usr/lib/:/usr/lib:/usr:/usr/:/bin/:/usr/lib/:/usr/lib:/usr:/usr/)
>
>
> Are there specific locations where I should look for
> ofed_info? How can I verify whether OFED is installed on a
> node?
>
> Thanks again!!!
>
>
> On 8/3/13 5:52 PM, Ralph Castain wrote:
>
> Are the ofed versions the same across all the machines? I
> would suspect that might be the problem.
>
> On Aug 3, 2013, at 4:06 PM, RoboBeans <robobeans_at_[hidden]
> <mailto:robobeans_at_[hidden]>> wrote:
>
>
>
> Hi Ralph, I tried using 1.5.4, 1.6.5 and 1.7.2 (compiled from
> source code) with no configuration arguments, but I am facing
> the same issue. When I run a job using 1.5.4 (installed using
> yum), I get warnings, but they don't affect my output.
>
> Example of warning that I get:
>
> sv-2.7960ipath_userinit: Mismatched user minor version (12)
> and driver minor version (11) while context sharing. Ensure
> that driver and library are from the same release.
>
> Each system has a QLogic card ("QLE7342-CK dual port IB
> card") and the same OS, but different kernel revisions
> (e.g. 2.6.32-358.2.1.el6.x86_64, 2.6.32-358.el6.x86_64).
>
> Thank you for your time.
>
> On 8/3/13 2:05 PM, Ralph Castain wrote:
>
> Hmmm...strange indeed. I would remove those four configure
> options and give it a try. That will eliminate all the
> obvious things, I would think, though they aren't
> generally involved in the issue shown here. Still, worth
> taking out potential trouble sources.
>
> What is the connectivity between SERVER-2 and node 100?
> Should I assume that the first seven nodes are connected
> via one type of interconnect, and the other four are
> connected to those seven by another type?
>
> On Aug 3, 2013, at 1:30 PM, RoboBeans <robobeans_at_[hidden]
> <mailto:robobeans_at_[hidden]>> wrote:
>
>
>
> Thanks for looking into it, Ralph. I modified the hosts
> file but I am still getting the same error. Any other
> pointers you can think of? The difference between this
> 1.7.2 installation and 1.5.4 is that I installed 1.5.4
> using yum and for 1.7.2, I built from source and
> configured with --enable-event-thread-support
> --enable-opal-multi-threads --enable-orte-progress-threads
> --enable-mpi-thread-multiple. Am I missing something here?
>
> //******************************************************************
>
> $ cat mpi_hostfile
>
> x.x.x.22 slots=15 max-slots=15
> x.x.x.24 slots=2 max-slots=2
> x.x.x.26 slots=14 max-slots=14
> x.x.x.28 slots=16 max-slots=16
> x.x.x.29 slots=14 max-slots=14
> x.x.x.30 slots=16 max-slots=16
> x.x.x.41 slots=46 max-slots=46
> x.x.x.101 slots=46 max-slots=46
> x.x.x.100 slots=46 max-slots=46
> x.x.x.102 slots=22 max-slots=22
> x.x.x.103 slots=22 max-slots=22
>
> //******************************************************************
> $ mpirun -d --display-map -np 10 --hostfile mpi_hostfile --bynode ./test
> [SERVER-2:08907] procdir:
> /tmp/openmpi-sessions-mpidemo_at_SERVER-2_0/62216/0/0
> [SERVER-2:08907] jobdir:
> /tmp/openmpi-sessions-mpidemo_at_SERVER-2_0/62216/0
> [SERVER-2:08907] top: openmpi-sessions-mpidemo_at_SERVER-2_0
> [SERVER-2:08907] tmp: /tmp
> CentOS release 6.4 (Final)
> Kernel \r on an \m
> CentOS release 6.4 (Final)
> Kernel \r on an \m
> CentOS release 6.4 (Final)
> Kernel \r on an \m
> [SERVER-3:32517] procdir:
> /tmp/openmpi-sessions-mpidemo_at_SERVER-3_0/62216/0/1
> [SERVER-3:32517] jobdir:
> /tmp/openmpi-sessions-mpidemo_at_SERVER-3_0/62216/0
> [SERVER-3:32517] top: openmpi-sessions-mpidemo_at_SERVER-3_0
> [SERVER-3:32517] tmp: /tmp
> CentOS release 6.4 (Final)
> Kernel \r on an \m
> CentOS release 6.4 (Final)
> Kernel \r on an \m
> [SERVER-6:11595] procdir:
> /tmp/openmpi-sessions-mpidemo_at_SERVER-6_0/62216/0/4
> [SERVER-6:11595] jobdir:
> /tmp/openmpi-sessions-mpidemo_at_SERVER-6_0/62216/0
> [SERVER-6:11595] top: openmpi-sessions-mpidemo_at_SERVER-6_0
> [SERVER-6:11595] tmp: /tmp
> [SERVER-4:27445] procdir:
> /tmp/openmpi-sessions-mpidemo_at_SERVER-4_0/62216/0/2
> [SERVER-4:27445] jobdir:
> /tmp/openmpi-sessions-mpidemo_at_SERVER-4_0/62216/0
> [SERVER-4:27445] top: openmpi-sessions-mpidemo_at_SERVER-4_0
> [SERVER-4:27445] tmp: /tmp
> [SERVER-7:02607] procdir:
> /tmp/openmpi-sessions-mpidemo_at_SERVER-7_0/62216/0/5
> [SERVER-7:02607] jobdir:
> /tmp/openmpi-sessions-mpidemo_at_SERVER-7_0/62216/0
> [SERVER-7:02607] top: openmpi-sessions-mpidemo_at_SERVER-7_0
> [SERVER-7:02607] tmp: /tmp
> [sv-1:46100] procdir:
> /tmp/openmpi-sessions-mpidemo_at_sv-1_0/62216/0/8
> [sv-1:46100] jobdir:
> /tmp/openmpi-sessions-mpidemo_at_sv-1_0/62216/0
> [sv-1:46100] top: openmpi-sessions-mpidemo_at_sv-1_0
> [sv-1:46100] tmp: /tmp
> CentOS release 6.4 (Final)
> Kernel \r on an \m
> [SERVER-5:16404] procdir:
> /tmp/openmpi-sessions-mpidemo_at_SERVER-5_0/62216/0/3
> [SERVER-5:16404] jobdir:
> /tmp/openmpi-sessions-mpidemo_at_SERVER-5_0/62216/0
> [SERVER-5:16404] top: openmpi-sessions-mpidemo_at_SERVER-5_0
> [SERVER-5:16404] tmp: /tmp
> [sv-3:08575] procdir:
> /tmp/openmpi-sessions-mpidemo_at_sv-3_0/62216/0/9
> [sv-3:08575] jobdir:
> /tmp/openmpi-sessions-mpidemo_at_sv-3_0/62216/0
> [sv-3:08575] top: openmpi-sessions-mpidemo_at_sv-3_0
> [sv-3:08575] tmp: /tmp
> [SERVER-14:10755] procdir:
> /tmp/openmpi-sessions-mpidemo_at_SERVER-14_0/62216/0/6
> [SERVER-14:10755] jobdir:
> /tmp/openmpi-sessions-mpidemo_at_SERVER-14_0/62216/0
> [SERVER-14:10755] top: openmpi-sessions-mpidemo_at_SERVER-14_0
> [SERVER-14:10755] tmp: /tmp
> [sv-4:12040] procdir:
> /tmp/openmpi-sessions-mpidemo_at_sv-4_0/62216/0/10
> [sv-4:12040] jobdir:
> /tmp/openmpi-sessions-mpidemo_at_sv-4_0/62216/0
> [sv-4:12040] top: openmpi-sessions-mpidemo_at_sv-4_0
> [sv-4:12040] tmp: /tmp
> [sv-2:07725] procdir:
> /tmp/openmpi-sessions-mpidemo_at_sv-2_0/62216/0/7
> [sv-2:07725] jobdir:
> /tmp/openmpi-sessions-mpidemo_at_sv-2_0/62216/0
> [sv-2:07725] top: openmpi-sessions-mpidemo_at_sv-2_0
> [sv-2:07725] tmp: /tmp
>
> Mapper requested: NULL Last mapper: round_robin Mapping
> policy: BYNODE Ranking policy: NODE Binding policy:
> NONE[NODE] Cpu set: NULL PPR: NULL
> Num new daemons: 0 New daemon starting vpid INVALID
> Num nodes: 10
>
> Data for node: SERVER-2 Launch id: -1 State: 2
> Daemon: [[62216,0],0] Daemon launched: True
> Num slots: 15 Slots in use: 1 Oversubscribed: FALSE
> Num slots allocated: 15 Max slots: 15
> Username on node: NULL
> Num procs: 1 Next node_rank: 1
> Data for proc: [[62216,1],0]
> Pid: 0 Local rank: 0 Node rank: 0 App
> rank: 0
> State: INITIALIZED Restarts: 0 App_context:
> 0 Locale: 0-15 Binding: NULL[0]
>
> Data for node: x.x.x.24 Launch id: -1 State: 0
> Daemon: [[62216,0],1] Daemon launched: False
> Num slots: 2 Slots in use: 1 Oversubscribed: FALSE
> Num slots allocated: 2 Max slots: 2
> Username on node: NULL
> Num procs: 1 Next node_rank: 1
> Data for proc: [[62216,1],1]
> Pid: 0 Local rank: 0 Node rank: 0 App
> rank: 1
> State: INITIALIZED Restarts: 0 App_context:
> 0 Locale: 0-7 Binding: NULL[0]
>
> Data for node: x.x.x.26 Launch id: -1 State: 0
> Daemon: [[62216,0],2] Daemon launched: False
> Num slots: 14 Slots in use: 1 Oversubscribed: FALSE
> Num slots allocated: 14 Max slots: 14
> Username on node: NULL
> Num procs: 1 Next node_rank: 1
> Data for proc: [[62216,1],2]
> Pid: 0 Local rank: 0 Node rank: 0 App
> rank: 2
> State: INITIALIZED Restarts: 0 App_context:
> 0 Locale: 0-7 Binding: NULL[0]
>
> Data for node: x.x.x.28 Launch id: -1 State: 0
> Daemon: [[62216,0],3] Daemon launched: False
> Num slots: 16 Slots in use: 1 Oversubscribed: FALSE
> Num slots allocated: 16 Max slots: 16
> Username on node: NULL
> Num procs: 1 Next node_rank: 1
> Data for proc: [[62216,1],3]
> Pid: 0 Local rank: 0 Node rank: 0 App
> rank: 3
> State: INITIALIZED Restarts: 0 App_context:
> 0 Locale: 0-7 Binding: NULL[0]
>
> Data for node: x.x.x.29 Launch id: -1 State: 0
> Daemon: [[62216,0],4] Daemon launched: False
> Num slots: 14 Slots in use: 1 Oversubscribed: FALSE
> Num slots allocated: 14 Max slots: 14
> Username on node: NULL
> Num procs: 1 Next node_rank: 1
> Data for proc: [[62216,1],4]
> Pid: 0 Local rank: 0 Node rank: 0 App
> rank: 4
> State: INITIALIZED Restarts: 0 App_context:
> 0 Locale: 0-7 Binding: NULL[0]
>
> Data for node: x.x.x.30 Launch id: -1 State: 0
> Daemon: [[62216,0],5] Daemon launched: False
> Num slots: 16 Slots in use: 1 Oversubscribed: FALSE
> Num slots allocated: 16 Max slots: 16
> Username on node: NULL
> Num procs: 1 Next node_rank: 1
> Data for proc: [[62216,1],5]
> Pid: 0 Local rank: 0 Node rank: 0 App
> rank: 5
> State: INITIALIZED Restarts: 0 App_context:
> 0 Locale: 0-7 Binding: NULL[0]
>
> Data for node: x.x.x.41 Launch id: -1 State: 0
> Daemon: [[62216,0],6] Daemon launched: False
> Num slots: 46 Slots in use: 1 Oversubscribed: FALSE
> Num slots allocated: 46 Max slots: 46
> Username on node: NULL
> Num procs: 1 Next node_rank: 1
> Data for proc: [[62216,1],6]
> Pid: 0 Local rank: 0 Node rank: 0 App
> rank: 6
> State: INITIALIZED Restarts: 0 App_context:
> 0 Locale: 0-7 Binding: NULL[0]
>
> Data for node: x.x.x.101 Launch id: -1 State: 0
> Daemon: [[62216,0],7] Daemon launched: False
> Num slots: 46 Slots in use: 1 Oversubscribed: FALSE
> Num slots allocated: 46 Max slots: 46
> Username on node: NULL
> Num procs: 1 Next node_rank: 1
> Data for proc: [[62216,1],7]
> Pid: 0 Local rank: 0 Node rank: 0 App
> rank: 7
> State: INITIALIZED Restarts: 0 App_context:
> 0 Locale: 0-7 Binding: NULL[0]
>
> Data for node: x.x.x.100 Launch id: -1 State: 0
> Daemon: [[62216,0],8] Daemon launched: False
> Num slots: 46 Slots in use: 1 Oversubscribed: FALSE
> Num slots allocated: 46 Max slots: 46
> Username on node: NULL
> Num procs: 1 Next node_rank: 1
> Data for proc: [[62216,1],8]
> Pid: 0 Local rank: 0 Node rank: 0 App
> rank: 8
> State: INITIALIZED Restarts: 0 App_context:
> 0 Locale: 0-7 Binding: NULL[0]
>
> Data for node: x.x.x.102 Launch id: -1 State: 0
> Daemon: [[62216,0],9] Daemon launched: False
> Num slots: 22 Slots in use: 1 Oversubscribed: FALSE
> Num slots allocated: 22 Max slots: 22
> Username on node: NULL
> Num procs: 1 Next node_rank: 1
> Data for proc: [[62216,1],9]
> Pid: 0 Local rank: 0 Node rank: 0 App
> rank: 9
> State: INITIALIZED Restarts: 0 App_context:
> 0 Locale: 0-7 Binding: NULL[0]
> [sv-1:46111] procdir:
> /tmp/openmpi-sessions-mpidemo_at_sv-1_0/62216/1/8
> [sv-1:46111] jobdir:
> /tmp/openmpi-sessions-mpidemo_at_sv-1_0/62216/1
> [sv-1:46111] top: openmpi-sessions-mpidemo_at_sv-1_0
> [sv-1:46111] tmp: /tmp
> [SERVER-14:10768] procdir:
> /tmp/openmpi-sessions-mpidemo_at_SERVER-14_0/62216/1/6
> [SERVER-14:10768] jobdir:
> /tmp/openmpi-sessions-mpidemo_at_SERVER-14_0/62216/1
> [SERVER-14:10768] top: openmpi-sessions-mpidemo_at_SERVER-14_0
> [SERVER-14:10768] tmp: /tmp
> [SERVER-2:08912] procdir:
> /tmp/openmpi-sessions-mpidemo_at_SERVER-2_0/62216/1/0
> [SERVER-2:08912] jobdir:
> /tmp/openmpi-sessions-mpidemo_at_SERVER-2_0/62216/1
> [SERVER-2:08912] top: openmpi-sessions-mpidemo_at_SERVER-2_0
> [SERVER-2:08912] tmp: /tmp
> [SERVER-4:27460] procdir:
> /tmp/openmpi-sessions-mpidemo_at_SERVER-4_0/62216/1/2
> [SERVER-4:27460] jobdir:
> /tmp/openmpi-sessions-mpidemo_at_SERVER-4_0/62216/1
> [SERVER-4:27460] top: openmpi-sessions-mpidemo_at_SERVER-4_0
> [SERVER-4:27460] tmp: /tmp
> [SERVER-6:11608] procdir:
> /tmp/openmpi-sessions-mpidemo_at_SERVER-6_0/62216/1/4
> [SERVER-6:11608] jobdir:
> /tmp/openmpi-sessions-mpidemo_at_SERVER-6_0/62216/1
> [SERVER-6:11608] top: openmpi-sessions-mpidemo_at_SERVER-6_0
> [SERVER-6:11608] tmp: /tmp
> [SERVER-7:02620] procdir:
> /tmp/openmpi-sessions-mpidemo_at_SERVER-7_0/62216/1/5
> [SERVER-7:02620] jobdir:
> /tmp/openmpi-sessions-mpidemo_at_SERVER-7_0/62216/1
> [SERVER-7:02620] top: openmpi-sessions-mpidemo_at_SERVER-7_0
> [SERVER-7:02620] tmp: /tmp
> [sv-3:08586] procdir:
> /tmp/openmpi-sessions-mpidemo_at_sv-3_0/62216/1/9
> [sv-3:08586] jobdir:
> /tmp/openmpi-sessions-mpidemo_at_sv-3_0/62216/1
> [sv-3:08586] top: openmpi-sessions-mpidemo_at_sv-3_0
> [sv-3:08586] tmp: /tmp
> [sv-2:07736] procdir:
> /tmp/openmpi-sessions-mpidemo_at_sv-2_0/62216/1/7
> [sv-2:07736] jobdir:
> /tmp/openmpi-sessions-mpidemo_at_sv-2_0/62216/1
> [sv-2:07736] top: openmpi-sessions-mpidemo_at_sv-2_0
> [sv-2:07736] tmp: /tmp
> [SERVER-5:16418] procdir:
> /tmp/openmpi-sessions-mpidemo_at_SERVER-5_0/62216/1/3
> [SERVER-5:16418] jobdir:
> /tmp/openmpi-sessions-mpidemo_at_SERVER-5_0/62216/1
> [SERVER-5:16418] top: openmpi-sessions-mpidemo_at_SERVER-5_0
> [SERVER-5:16418] tmp: /tmp
> [SERVER-3:32533] procdir:
> /tmp/openmpi-sessions-mpidemo_at_SERVER-3_0/62216/1/1
> [SERVER-3:32533] jobdir:
> /tmp/openmpi-sessions-mpidemo_at_SERVER-3_0/62216/1
> [SERVER-3:32533] top: openmpi-sessions-mpidemo_at_SERVER-3_0
> [SERVER-3:32533] tmp: /tmp
> MPIR_being_debugged = 0
> MPIR_debug_state = 1
> MPIR_partial_attach_ok = 1
> MPIR_i_am_starter = 0
> MPIR_forward_output = 0
> MPIR_proctable_size = 10
> MPIR_proctable:
> (i, host, exe, pid) = (0, SERVER-2,
> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 8912)
> (i, host, exe, pid) = (1, x.x.x.24,
> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 32533)
> (i, host, exe, pid) = (2, x.x.x.26,
> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 27460)
> (i, host, exe, pid) = (3, x.x.x.28,
> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 16418)
> (i, host, exe, pid) = (4, x.x.x.29,
> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 11608)
> (i, host, exe, pid) = (5, x.x.x.30,
> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 2620)
> (i, host, exe, pid) = (6, x.x.x.41,
> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 10768)
> (i, host, exe, pid) = (7, x.x.x.101,
> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 7736)
> (i, host, exe, pid) = (8, x.x.x.100,
> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 46111)
> (i, host, exe, pid) = (9, x.x.x.102,
> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 8586)
> MPIR_executable_path: NULL
> MPIR_server_arguments: NULL
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your
> parallel process is
> likely to abort. There are many reasons that a parallel
> process can
> fail during MPI_INIT; some of which are due to
> configuration or environment
> problems. This failure appears to be an internal failure;
> here's some
> additional information (which may only be relevant to an
> Open MPI
> developer):
>
> PML add procs failed
> --> Returned "Error" (-1) instead of "Success" (0)
> --------------------------------------------------------------------------
> [SERVER-2:8912] *** An error occurred in MPI_Init
> [SERVER-2:8912] *** reported by process
> [140393673392129,140389596004352]
> [SERVER-2:8912] *** on a NULL communicator
> [SERVER-2:8912] *** Unknown error
> [SERVER-2:8912] *** MPI_ERRORS_ARE_FATAL (processes in
> this communicator will now abort,
> [SERVER-2:8912] *** and potentially your MPI job)
> --------------------------------------------------------------------------
> An MPI process is aborting at a time when it cannot
> guarantee that all
> of its peer processes in the job will be killed properly.
> You should
> double check that everything has shut down cleanly.
>
> Reason: Before MPI_INIT completed
> Local host: SERVER-2
> PID: 8912
> --------------------------------------------------------------------------
> [sv-1][[62216,1],8][btl_openib_proc.c:157:mca_btl_openib_proc_create]
> [btl_openib_proc.c:157] ompi_modex_recv failed for peer
> [[62216,1],0]
> [sv-1][[62216,1],8][btl_tcp_proc.c:128:mca_btl_tcp_proc_create]
> mca_base_modex_recv: failed with return value=-13
> [sv-1][[62216,1],8][btl_tcp_proc.c:128:mca_btl_tcp_proc_create]
> mca_base_modex_recv: failed with return value=-13
> --------------------------------------------------------------------------
> At least one pair of MPI processes are unable to reach
> each other for
> MPI communications. This means that no Open MPI device
> has indicated
> that it can be used to communicate between these
> processes. This is
> an error; Open MPI requires that all MPI processes be able
> to reach
> each other. This error can sometimes be the result of
> forgetting to
> specify the "self" BTL.
>
> Process 1 ([[62216,1],8]) is on host: sv-1
> Process 2 ([[62216,1],0]) is on host: SERVER-2
> BTLs attempted: openib self sm tcp
>
> Your MPI job is now going to abort; sorry.
> --------------------------------------------------------------------------
> [sv-3][[62216,1],9][btl_openib_proc.c:157:mca_btl_openib_proc_create]
> [btl_openib_proc.c:157] ompi_modex_recv failed for peer
> [[62216,1],0]
> [sv-3][[62216,1],9][btl_tcp_proc.c:128:mca_btl_tcp_proc_create]
> mca_base_modex_recv: failed with return value=-13
> [sv-3][[62216,1],9][btl_tcp_proc.c:128:mca_btl_tcp_proc_create]
> mca_base_modex_recv: failed with return value=-13
> --------------------------------------------------------------------------
> MPI_INIT has failed because at least one MPI process is
> unreachable
> from another. This *usually* means that an underlying
> communication
> plugin -- such as a BTL or an MTL -- has either not loaded
> or not
> allowed itself to be used. Your MPI job will now abort.
>
> You may wish to try to narrow down the problem;
>
> * Check the output of ompi_info to see which BTL/MTL
> plugins are
> available.
> * Run your application with MPI_THREAD_SINGLE.
> * Set the MCA parameter btl_base_verbose to 100 (or
> mtl_base_verbose,
> if using MTL-based communications) to see exactly which
> communication plugins were considered and/or discarded.
> --------------------------------------------------------------------------
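>
> (As a sketch of the first and third suggestions above, with a hypothetical
> process count and assuming the 1.7.2 binaries are first in PATH:)
>
> $ ompi_info | grep -i -E 'btl|mtl'
> $ mpirun --mca btl_base_verbose 100 -np 2 --hostfile mpi_hostfile ./test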
> [sv-2][[62216,1],7][btl_openib_proc.c:157:mca_btl_openib_proc_create]
> [btl_openib_proc.c:157] ompi_modex_recv failed for peer
> [[62216,1],0]
> [sv-2][[62216,1],7][btl_tcp_proc.c:128:mca_btl_tcp_proc_create]
> mca_base_modex_recv: failed with return value=-13
> [sv-2][[62216,1],7][btl_tcp_proc.c:128:mca_btl_tcp_proc_create]
> mca_base_modex_recv: failed with return value=-13
> [SERVER-2:08907] sess_dir_finalize: proc session dir not
> empty - leaving
> [sv-4:12040] sess_dir_finalize: job session dir not empty
> - leaving
> [SERVER-14:10755] sess_dir_finalize: job session dir not
> empty - leaving
> [SERVER-2:08907] sess_dir_finalize: proc session dir not
> empty - leaving
> [SERVER-6:11595] sess_dir_finalize: proc session dir not
> empty - leaving
> [SERVER-6:11595] sess_dir_finalize: proc session dir not
> empty - leaving
> [SERVER-4:27445] sess_dir_finalize: proc session dir not
> empty - leaving
> exiting with status 0
> [SERVER-4:27445] sess_dir_finalize: proc session dir not
> empty - leaving
> [SERVER-6:11595] sess_dir_finalize: job session dir not
> empty - leaving
> [SERVER-7:02607] sess_dir_finalize: proc session dir not
> empty - leaving
> [SERVER-7:02607] sess_dir_finalize: proc session dir not
> empty - leaving
> [SERVER-7:02607] sess_dir_finalize: job session dir not
> empty - leaving
> [SERVER-5:16404] sess_dir_finalize: proc session dir not
> empty - leaving
> [SERVER-5:16404] sess_dir_finalize: proc session dir not
> empty - leaving
> exiting with status 0
> exiting with status 0
> exiting with status 0
> [SERVER-4:27445] sess_dir_finalize: job session dir not
> empty - leaving
> exiting with status 0
> [SERVER-3:32517] sess_dir_finalize: proc session dir not
> empty - leaving
> [SERVER-3:32517] sess_dir_finalize: proc session dir not
> empty - leaving
> [sv-3:08575] sess_dir_finalize: proc session dir not empty
> - leaving
> [sv-3:08575] sess_dir_finalize: job session dir not empty
> - leaving
> exiting with status 0
> [sv-1:46100] sess_dir_finalize: proc session dir not empty
> - leaving
> [sv-1:46100] sess_dir_finalize: job session dir not empty
> - leaving
> exiting with status 0
> [sv-2:07725] sess_dir_finalize: proc session dir not empty
> - leaving
> [sv-2:07725] sess_dir_finalize: job session dir not empty
> - leaving
> exiting with status 0
> [SERVER-5:16404] sess_dir_finalize: job session dir not
> empty - leaving
> exiting with status 0
> [SERVER-3:32517] sess_dir_finalize: job session dir not
> empty - leaving
> exiting with status 0
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 6 with PID 10768 on
> node x.x.x.41 exiting improperly. There are three reasons
> this could occur:
>
> 1. this process did not call "init" before exiting, but
> others in
> the job did. This can cause a job to hang indefinitely
> while it waits
> for all processes to call "init". By rule, if one process
> calls "init",
> then ALL processes must call "init" prior to termination.
>
> 2. this process called "init", but exited without calling
> "finalize".
> By rule, all processes that call "init" MUST call
> "finalize" prior to
> exiting or it will be considered an "abnormal termination"
>
> 3. this process called "MPI_Abort" or "orte_abort" and the
> mca parameter
> orte_create_session_dirs is set to false. In this case,
> the run-time cannot
> detect that the abort call was an abnormal termination.
> Hence, the only
> error message you will receive is this one.
>
> This may have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
>
> You can avoid this message by specifying -quiet on the
> mpirun command line.
>
> --------------------------------------------------------------------------
> [SERVER-2:08907] 6 more processes have sent help message
> help-mpi-runtime / mpi_init:startup:internal-failure
> [SERVER-2:08907] Set MCA parameter
> "orte_base_help_aggregate" to 0 to see all help / error
> messages
> [SERVER-2:08907] 9 more processes have sent help message
> help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
> [SERVER-2:08907] 9 more processes have sent help message
> help-mpi-runtime.txt / ompi mpi abort:cannot guarantee all
> killed
> [SERVER-2:08907] 2 more processes have sent help message
> help-mca-bml-r2.txt / unreachable proc
> [SERVER-2:08907] 2 more processes have sent help message
> help-mpi-runtime / mpi_init:startup:pml-add-procs-fail
> [SERVER-2:08907] sess_dir_finalize: job session dir not
> empty - leaving
> exiting with status 1
>
> //******************************************************************
>
> On 8/3/13 4:34 AM, Ralph Castain wrote:
>
> It looks like SERVER-2 cannot talk to your x.x.x.100
> machine. I note that you have some entries at the end
> of the hostfile that I don't understand - a list of
> hosts that can be reached? And I see that your
> x.x.x.22 machine isn't on it. Is that SERVER-2 by chance?
>
> Our hostfile parsing changed between the release
> series, but I know we never consciously supported the
> syntax you show below where you list capabilities, and
> then re-list the hosts in an apparent attempt to
> filter which ones can actually be used. It is possible
> that the 1.5 series somehow used that to exclude the
> 22 machine, and that the 1.7 parser now doesn't do that.
>
> If you only include machines you actually intend to
> use in your hostfile, does the 1.7 series work?
>
> On Aug 3, 2013, at 3:58 AM, RoboBeans
> <robobeans_at_[hidden] <mailto:robobeans_at_[hidden]>> wrote:
>
>
>
> Hello everyone,
>
> I have installed openmpi 1.5.4 on an 11-node cluster
> using "yum install openmpi openmpi-devel" and
> everything seems to be working fine. For testing I am
> using this test program:
>
> //******************************************************************
>
> $ cat test.cpp
>
> #include <stdio.h>
> #include <mpi.h>
>
> int main (int argc, char *argv[])
> {
> int id, np;
> char name[MPI_MAX_PROCESSOR_NAME];
> int namelen;
> int i;
>
> MPI_Init (&argc, &argv);
>
> MPI_Comm_size (MPI_COMM_WORLD, &np);
> MPI_Comm_rank (MPI_COMM_WORLD, &id);
> MPI_Get_processor_name (name, &namelen);
>
> printf ("This is Process %2d out of %2d running on
> host %s\n", id, np, name);
>
> MPI_Finalize ();
>
> return (0);
> }
>
> //******************************************************************
>
> and my hosts file look like this:
>
> $ cat mpi_hostfile
>
> # The Hostfile for Open MPI
>
> # specify number of slots for processes to run locally.
> #localhost slots=12
> #x.x.x.16 slots=12 max-slots=12
> #x.x.x.17 slots=12 max-slots=12
> #x.x.x.18 slots=12 max-slots=12
> #x.x.1x.19 slots=12 max-slots=12
> #x.x.x.20 slots=12 max-slots=12
> #x.x.x.55 slots=46 max-slots=46
> #x.x.x.56 slots=46 max-slots=46
>
> x.x.x.22 slots=15 max-slots=15
> x.x.x.24 slots=2 max-slots=2
> x.x.x.26 slots=14 max-slots=14
> x.x.x.28 slots=16 max-slots=16
> x.x.x.29 slots=14 max-slots=14
> x.x.x.30 slots=16 max-slots=16
> x.x.x.41 slots=46 max-slots=46
> x.x.x.101 slots=46 max-slots=46
> x.x.x.100 slots=46 max-slots=46
> x.x.x.102 slots=22 max-slots=22
> x.x.x.103 slots=22 max-slots=22
>
> # The following slave nodes are available to this machine:
> x.x.x.24
> x.x.x.26
> x.x.x.28
> x.x.x.29
> x.x.x.30
> x.x.x.41
> x.x.x.101
> x.x.x.100
> x.x.x.102
> x.x.x.103
>
> //******************************************************************
>
> This is how my .bashrc looks on each node:
>
> $ cat ~/.bashrc
>
> # .bashrc
>
> # Source global definitions
> if [ -f /etc/bashrc ]; then
> . /etc/bashrc
> fi
>
> # User specific aliases and functions
> umask 077
>
> export PSM_SHAREDCONTEXTS_MAX=20
>
> #export PATH=/usr/lib64/openmpi/bin${PATH:+:$PATH}
> export PATH=/usr/OPENMPI/openmpi-1.7.2/bin${PATH:+:$PATH}
>
> #export LD_LIBRARY_PATH=/usr/lib64/openmpi/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
> export LD_LIBRARY_PATH=/usr/OPENMPI/openmpi-1.7.2/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
>
> export PATH="$PATH":/bin/:/usr/lib/:/usr/lib:/usr:/usr/
>
> //******************************************************************
>
> $ mpic++ test.cpp -o test
>
> $ mpirun -d --display-map -np 10 --hostfile mpi_hostfile --bynode ./test
>
> //******************************************************************
>
> These nodes are running the 2.6.32-358.2.1.el6.x86_64 kernel release:
>
> $ uname
> Linux
> $ uname -r
> 2.6.32-358.2.1.el6.x86_64
> $ cat /etc/issue
> CentOS release 6.4 (Final)
> Kernel \r on an \m
>
> //******************************************************************
>
> Now, if I install openmpi 1.7.2 on each node
> separately, I can only use it on either the first 7
> nodes or the last 4 nodes, but not on all of them.
>
> //******************************************************************
>
> $ gunzip -c openmpi-1.7.2.tar.gz | tar xf -
>
> $ cd openmpi-1.7.2
>
> $ ./configure --prefix=/usr/OPENMPI/openmpi-1.7.2
> --enable-event-thread-support
> --enable-opal-multi-threads
> --enable-orte-progress-threads
> --enable-mpi-thread-multiple
>
> $ make all install
>
> //******************************************************************
>
> This is the error message that I am receiving:
>
>
> $ mpirun -d --display-map -np 10 --hostfile mpi_hostfile --bynode ./test
>
> [SERVER-2:05284] procdir:
> /tmp/openmpi-sessions-mpidemo_at_SERVER-2_0/50535/0/0
> [SERVER-2:05284] jobdir:
> /tmp/openmpi-sessions-mpidemo_at_SERVER-2_0/50535/0
> [SERVER-2:05284] top: openmpi-sessions-mpidemo_at_SERVER-2_0
> [SERVER-2:05284] tmp: /tmp
> CentOS release 6.4 (Final)
> Kernel \r on an \m
> CentOS release 6.4 (Final)
> Kernel \r on an \m
> CentOS release 6.4 (Final)
> Kernel \r on an \m
> [SERVER-3:28993] procdir:
> /tmp/openmpi-sessions-mpidemo_at_SERVER-3_0/50535/0/1
> [SERVER-3:28993] jobdir:
> /tmp/openmpi-sessions-mpidemo_at_SERVER-3_0/50535/0
> [SERVER-3:28993] top: openmpi-sessions-mpidemo_at_SERVER-3_0
> [SERVER-3:28993] tmp: /tmp
> CentOS release 6.4 (Final)
> Kernel \r on an \m
> CentOS release 6.4 (Final)
> Kernel \r on an \m
> [SERVER-6:09087] procdir:
> /tmp/openmpi-sessions-mpidemo_at_SERVER-6_0/50535/0/4
> [SERVER-6:09087] jobdir:
> /tmp/openmpi-sessions-mpidemo_at_SERVER-6_0/50535/0
> [SERVER-6:09087] top: openmpi-sessions-mpidemo_at_SERVER-6_0
> [SERVER-6:09087] tmp: /tmp
> [SERVER-7:32563] procdir:
> /tmp/openmpi-sessions-mpidemo_at_SERVER-7_0/50535/0/5
> [SERVER-7:32563] jobdir:
> /tmp/openmpi-sessions-mpidemo_at_SERVER-7_0/50535/0
> [SERVER-7:32563] top: openmpi-sessions-mpidemo_at_SERVER-7_0
> [SERVER-7:32563] tmp: /tmp
> [SERVER-4:15711] procdir:
> /tmp/openmpi-sessions-mpidemo_at_SERVER-4_0/50535/0/2
> [SERVER-4:15711] jobdir:
> /tmp/openmpi-sessions-mpidemo_at_SERVER-4_0/50535/0
> [SERVER-4:15711] top: openmpi-sessions-mpidemo_at_SERVER-4_0
> [SERVER-4:15711] tmp: /tmp
> [sv-1:45701] procdir:
> /tmp/openmpi-sessions-mpidemo_at_sv-1_0/50535/0/8
> [sv-1:45701] jobdir:
> /tmp/openmpi-sessions-mpidemo_at_sv-1_0/50535/0
> [sv-1:45701] top: openmpi-sessions-mpidemo_at_sv-1_0
> [sv-1:45701] tmp: /tmp
> CentOS release 6.4 (Final)
> Kernel \r on an \m
> [sv-3:08352] procdir:
> /tmp/openmpi-sessions-mpidemo_at_sv-3_0/50535/0/9
> [sv-3:08352] jobdir:
> /tmp/openmpi-sessions-mpidemo_at_sv-3_0/50535/0
> [sv-3:08352] top: openmpi-sessions-mpidemo_at_sv-3_0
> [sv-3:08352] tmp: /tmp
> [SERVER-5:12534] procdir:
> /tmp/openmpi-sessions-mpidemo_at_SERVER-5_0/50535/0/3
> [SERVER-5:12534] jobdir:
> /tmp/openmpi-sessions-mpidemo_at_SERVER-5_0/50535/0
> [SERVER-5:12534] top: openmpi-sessions-mpidemo_at_SERVER-5_0
> [SERVER-5:12534] tmp: /tmp
> [SERVER-14:08399] procdir:
> /tmp/openmpi-sessions-mpidemo_at_SERVER-14_0/50535/0/6
> [SERVER-14:08399] jobdir:
> /tmp/openmpi-sessions-mpidemo_at_SERVER-14_0/50535/0
> [SERVER-14:08399] top:
> openmpi-sessions-mpidemo_at_SERVER-14_0
> [SERVER-14:08399] tmp: /tmp
> [sv-4:11802] procdir:
> /tmp/openmpi-sessions-mpidemo_at_sv-4_0/50535/0/10
> [sv-4:11802] jobdir:
> /tmp/openmpi-sessions-mpidemo_at_sv-4_0/50535/0
> [sv-4:11802] top: openmpi-sessions-mpidemo_at_sv-4_0
> [sv-4:11802] tmp: /tmp
> [sv-2:07503] procdir:
> /tmp/openmpi-sessions-mpidemo_at_sv-2_0/50535/0/7
> [sv-2:07503] jobdir:
> /tmp/openmpi-sessions-mpidemo_at_sv-2_0/50535/0
> [sv-2:07503] top: openmpi-sessions-mpidemo_at_sv-2_0
> [sv-2:07503] tmp: /tmp
>
> Mapper requested: NULL Last mapper: round_robin
> Mapping policy: BYNODE Ranking policy: NODE Binding
> policy: NONE[NODE] Cpu set: NULL PPR: NULL
> Num new daemons: 0 New daemon starting vpid
> INVALID
> Num nodes: 10
>
> Data for node: SERVER-2 Launch id: -1 State: 2
> Daemon: [[50535,0],0] Daemon launched: True
> Num slots: 15 Slots in use: 1
> Oversubscribed: FALSE
> Num slots allocated: 15 Max slots: 15
> Username on node: NULL
> Num procs: 1 Next node_rank: 1
> Data for proc: [[50535,1],0]
> Pid: 0 Local rank: 0 Node rank: 0 App
> rank: 0
> State: INITIALIZED Restarts: 0
> App_context: 0 Locale: 0-15 Binding: NULL[0]
>
> Data for node: x.x.x.24 Launch id: -1 State: 0
> Daemon: [[50535,0],1] Daemon launched: False
> Num slots: 3 Slots in use: 1
> Oversubscribed: FALSE
> Num slots allocated: 3 Max slots: 2
> Username on node: NULL
> Num procs: 1 Next node_rank: 1
> Data for proc: [[50535,1],1]
> Pid: 0 Local rank: 0 Node rank: 0 App
> rank: 1
> State: INITIALIZED Restarts: 0
> App_context: 0 Locale: 0-7 Binding: NULL[0]
>
> Data for node: x.x.x.26 Launch id: -1 State: 0
> Daemon: [[50535,0],2] Daemon launched: False
> Num slots: 15 Slots in use: 1
> Oversubscribed: FALSE
> Num slots allocated: 15 Max slots: 14
> Username on node: NULL
> Num procs: 1 Next node_rank: 1
> Data for proc: [[50535,1],2]
> Pid: 0 Local rank: 0 Node rank: 0 App
> rank: 2
> State: INITIALIZED Restarts: 0
> App_context: 0 Locale: 0-7 Binding: NULL[0]
>
> Data for node: x.x.x.28 Launch id: -1 State: 0
> Daemon: [[50535,0],3] Daemon launched: False
> Num slots: 17 Slots in use: 1
> Oversubscribed: FALSE
> Num slots allocated: 17 Max slots: 16
> Username on node: NULL
> Num procs: 1 Next node_rank: 1
> Data for proc: [[50535,1],3]
> Pid: 0 Local rank: 0 Node rank: 0 App
> rank: 3
> State: INITIALIZED Restarts: 0
> App_context: 0 Locale: 0-7 Binding: NULL[0]
>
> Data for node: x.x.x.29 Launch id: -1 State: 0
> Daemon: [[50535,0],4] Daemon launched: False
> Num slots: 15 Slots in use: 1
> Oversubscribed: FALSE
> Num slots allocated: 15 Max slots: 14
> Username on node: NULL
> Num procs: 1 Next node_rank: 1
> Data for proc: [[50535,1],4]
> Pid: 0 Local rank: 0 Node rank: 0 App
> rank: 4
> State: INITIALIZED Restarts: 0
> App_context: 0 Locale: 0-7 Binding: NULL[0]
>
> Data for node: x.x.x.30 Launch id: -1 State: 0
> Daemon: [[50535,0],5] Daemon launched: False
> Num slots: 17 Slots in use: 1
> Oversubscribed: FALSE
> Num slots allocated: 17 Max slots: 16
> Username on node: NULL
> Num procs: 1 Next node_rank: 1
> Data for proc: [[50535,1],5]
> Pid: 0 Local rank: 0 Node rank: 0 App
> rank: 5
> State: INITIALIZED Restarts: 0
> App_context: 0 Locale: 0-7 Binding: NULL[0]
>
> Data for node: x.x.x.41 Launch id: -1 State: 0
> Daemon: [[50535,0],6] Daemon launched: False
> Num slots: 47 Slots in use: 1
> Oversubscribed: FALSE
> Num slots allocated: 47 Max slots: 46
> Username on node: NULL
> Num procs: 1 Next node_rank: 1
> Data for proc: [[50535,1],6]
> Pid: 0 Local rank: 0 Node rank: 0 App
> rank: 6
> State: INITIALIZED Restarts: 0
> App_context: 0 Locale: 0-7 Binding: NULL[0]
>
> Data for node: x.x.x.101 Launch id: -1 State: 0
> Daemon: [[50535,0],7] Daemon launched: False
> Num slots: 47 Slots in use: 1
> Oversubscribed: FALSE
> Num slots allocated: 47 Max slots: 46
> Username on node: NULL
> Num procs: 1 Next node_rank: 1
> Data for proc: [[50535,1],7]
> Pid: 0 Local rank: 0 Node rank: 0 App
> rank: 7
> State: INITIALIZED Restarts: 0
> App_context: 0 Locale: 0-7 Binding: NULL[0]
>
> Data for node: x.x.x.100 Launch id: -1 State: 0
> Daemon: [[50535,0],8] Daemon launched: False
> Num slots: 47 Slots in use: 1
> Oversubscribed: FALSE
> Num slots allocated: 47 Max slots: 46
> Username on node: NULL
> Num procs: 1 Next node_rank: 1
> Data for proc: [[50535,1],8]
> Pid: 0 Local rank: 0 Node rank: 0 App
> rank: 8
> State: INITIALIZED Restarts: 0
> App_context: 0 Locale: 0-7 Binding: NULL[0]
>
> Data for node: x.x.x.102 Launch id: -1 State: 0
> Daemon: [[50535,0],9] Daemon launched: False
> Num slots: 23 Slots in use: 1
> Oversubscribed: FALSE
> Num slots allocated: 23 Max slots: 22
> Username on node: NULL
> Num procs: 1 Next node_rank: 1
> Data for proc: [[50535,1],9]
> Pid: 0 Local rank: 0 Node rank: 0 App
> rank: 9
> State: INITIALIZED Restarts: 0
> App_context: 0 Locale: 0-7 Binding: NULL[0]
> [sv-1:45712] procdir:
> /tmp/openmpi-sessions-mpidemo_at_sv-1_0/50535/1/8
> [sv-1:45712] jobdir:
> /tmp/openmpi-sessions-mpidemo_at_sv-1_0/50535/1
> [sv-1:45712] top: openmpi-sessions-mpidemo_at_sv-1_0
> [sv-1:45712] tmp: /tmp
> [SERVER-14:08412] procdir:
> /tmp/openmpi-sessions-mpidemo_at_SERVER-14_0/50535/1/6
> [SERVER-14:08412] jobdir:
> /tmp/openmpi-sessions-mpidemo_at_SERVER-14_0/50535/1
> [SERVER-14:08412] top:
> openmpi-sessions-mpidemo_at_SERVER-14_0
> [SERVER-14:08412] tmp: /tmp
> [SERVER-2:05291] procdir:
> /tmp/openmpi-sessions-mpidemo_at_SERVER-2_0/50535/1/0
> [SERVER-2:05291] jobdir:
> /tmp/openmpi-sessions-mpidemo_at_SERVER-2_0/50535/1
> [SERVER-2:05291] top: openmpi-sessions-mpidemo_at_SERVER-2_0
> [SERVER-2:05291] tmp: /tmp
> [SERVER-4:15726] procdir:
> /tmp/openmpi-sessions-mpidemo_at_SERVER-4_0/50535/1/2
> [SERVER-4:15726] jobdir:
> /tmp/openmpi-sessions-mpidemo_at_SERVER-4_0/50535/1
> [SERVER-4:15726] top: openmpi-sessions-mpidemo_at_SERVER-4_0
> [SERVER-4:15726] tmp: /tmp
> [SERVER-6:09100] procdir:
> /tmp/openmpi-sessions-mpidemo_at_SERVER-6_0/50535/1/4
> [SERVER-6:09100] jobdir:
> /tmp/openmpi-sessions-mpidemo_at_SERVER-6_0/50535/1
> [SERVER-6:09100] top: openmpi-sessions-mpidemo_at_SERVER-6_0
> [SERVER-6:09100] tmp: /tmp
> [SERVER-7:32576] procdir:
> /tmp/openmpi-sessions-mpidemo_at_SERVER-7_0/50535/1/5
> [SERVER-7:32576] jobdir:
> /tmp/openmpi-sessions-mpidemo_at_SERVER-7_0/50535/1
> [SERVER-7:32576] top: openmpi-sessions-mpidemo_at_SERVER-7_0
> [SERVER-7:32576] tmp: /tmp
> [sv-3:08363] procdir:
> /tmp/openmpi-sessions-mpidemo_at_sv-3_0/50535/1/9
> [sv-3:08363] jobdir:
> /tmp/openmpi-sessions-mpidemo_at_sv-3_0/50535/1
> [sv-3:08363] top: openmpi-sessions-mpidemo_at_sv-3_0
> [sv-3:08363] tmp: /tmp
> [sv-2:07514] procdir:
> /tmp/openmpi-sessions-mpidemo_at_sv-2_0/50535/1/7
> [sv-2:07514] jobdir:
> /tmp/openmpi-sessions-mpidemo_at_sv-2_0/50535/1
> [sv-2:07514] top: openmpi-sessions-mpidemo_at_sv-2_0
> [sv-2:07514] tmp: /tmp
> [SERVER-5:12548] procdir:
> /tmp/openmpi-sessions-mpidemo_at_SERVER-5_0/50535/1/3
> [SERVER-5:12548] jobdir:
> /tmp/openmpi-sessions-mpidemo_at_SERVER-5_0/50535/1
> [SERVER-5:12548] top: openmpi-sessions-mpidemo_at_SERVER-5_0
> [SERVER-5:12548] tmp: /tmp
> [SERVER-3:29009] procdir:
> /tmp/openmpi-sessions-mpidemo_at_SERVER-3_0/50535/1/1
> [SERVER-3:29009] jobdir:
> /tmp/openmpi-sessions-mpidemo_at_SERVER-3_0/50535/1
> [SERVER-3:29009] top: openmpi-sessions-mpidemo_at_SERVER-3_0
> [SERVER-3:29009] tmp: /tmp
> MPIR_being_debugged = 0
> MPIR_debug_state = 1
> MPIR_partial_attach_ok = 1
> MPIR_i_am_starter = 0
> MPIR_forward_output = 0
> MPIR_proctable_size = 10
> MPIR_proctable:
> (i, host, exe, pid) = (0, SERVER-2,
> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 5291)
> (i, host, exe, pid) = (1, x.x.x.24,
> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 29009)
> (i, host, exe, pid) = (2, x.x.x.26,
> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 15726)
> (i, host, exe, pid) = (3, x.x.x.28,
> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 12548)
> (i, host, exe, pid) = (4, x.x.x.29,
> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 9100)
> (i, host, exe, pid) = (5, x.x.x.30,
> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 32576)
> (i, host, exe, pid) = (6, x.x.x.41,
> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 8412)
> (i, host, exe, pid) = (7, x.x.x.101,
> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 7514)
> (i, host, exe, pid) = (8, x.x.x.100,
> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 45712)
> (i, host, exe, pid) = (9, x.x.x.102,
> /usr2/mpidemo/dev/DISTRIBUTED_COMPUTING/./test, 8363)
> MPIR_executable_path: NULL
> MPIR_server_arguments: NULL
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your
> parallel process is
> likely to abort. There are many reasons that a
> parallel process can
> fail during MPI_INIT; some of which are due to
> configuration or environment
> problems. This failure appears to be an internal
> failure; here's some
> additional information (which may only be relevant to
> an Open MPI
> developer):
>
> PML add procs failed
> --> Returned "Error" (-1) instead of "Success" (0)
> --------------------------------------------------------------------------
> [SERVER-2:5291] *** An error occurred in MPI_Init
> [SERVER-2:5291] *** reported by process
> [140508871983105,140505560121344]
> [SERVER-2:5291] *** on a NULL communicator
> [SERVER-2:5291] *** Unknown error
> [SERVER-2:5291] *** MPI_ERRORS_ARE_FATAL (processes in
> this communicator will now abort,
> [SERVER-2:5291] *** and potentially your MPI job)
> --------------------------------------------------------------------------
> An MPI process is aborting at a time when it cannot
> guarantee that all
> of its peer processes in the job will be killed
> properly. You should
> double check that everything has shut down cleanly.
>
> Reason: Before MPI_INIT completed
> Local host: SERVER-2
> PID: 5291
> --------------------------------------------------------------------------
> [sv-1][[50535,1],8][btl_openib_proc.c:157:mca_btl_openib_proc_create]
> [btl_openib_proc.c:157] ompi_modex_recv failed for
> peer [[50535,1],0]
> [sv-3][[50535,1],9][btl_openib_proc.c:157:mca_btl_openib_proc_create]
> [btl_openib_proc.c:157] ompi_modex_recv failed for
> peer [[50535,1],0]
> [sv-3][[50535,1],9][btl_tcp_proc.c:128:mca_btl_tcp_proc_create]
> mca_base_modex_recv: failed with return value=-13
> [sv-3][[50535,1],9][btl_tcp_proc.c:128:mca_btl_tcp_proc_create]
> mca_base_modex_recv: failed with return value=-13
> [sv-1][[50535,1],8][btl_tcp_proc.c:128:mca_btl_tcp_proc_create]
> mca_base_modex_recv: failed with return value=-13
> [sv-1][[50535,1],8][btl_tcp_proc.c:128:mca_btl_tcp_proc_create]
> mca_base_modex_recv: failed with return value=-13
> --------------------------------------------------------------------------
> At least one pair of MPI processes are unable to reach
> each other for
> MPI communications. This means that no Open MPI
> device has indicated
> that it can be used to communicate between these
> processes. This is
> an error; Open MPI requires that all MPI processes be
> able to reach
> each other. This error can sometimes be the result of
> forgetting to
> specify the "self" BTL.
>
> Process 1 ([[50535,1],8]) is on host: sv-1
> Process 2 ([[50535,1],0]) is on host: SERVER-2
> BTLs attempted: openib self sm tcp
>
> Your MPI job is now going to abort; sorry.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> MPI_INIT has failed because at least one MPI process
> is unreachable
> from another. This *usually* means that an underlying
> communication
> plugin -- such as a BTL or an MTL -- has either not
> loaded or not
> allowed itself to be used. Your MPI job will now abort.
>
> You may wish to try to narrow down the problem;
>
> * Check the output of ompi_info to see which BTL/MTL
> plugins are
> available.
> * Run your application with MPI_THREAD_SINGLE.
> * Set the MCA parameter btl_base_verbose to 100 (or
> mtl_base_verbose,
> if using MTL-based communications) to see exactly which
> communication plugins were considered and/or discarded.
> --------------------------------------------------------------------------
> [sv-2][[50535,1],7][btl_openib_proc.c:157:mca_btl_openib_proc_create]
> [btl_openib_proc.c:157] ompi_modex_recv failed for
> peer [[50535,1],0]
> [sv-2][[50535,1],7][btl_tcp_proc.c:128:mca_btl_tcp_proc_create]
> mca_base_modex_recv: failed with return value=-13
> [sv-2][[50535,1],7][btl_tcp_proc.c:128:mca_btl_tcp_proc_create]
> mca_base_modex_recv: failed with return value=-13
> [SERVER-2:05284] sess_dir_finalize: proc session dir
> not empty - leaving
> [SERVER-2:05284] sess_dir_finalize: proc session dir
> not empty - leaving
> [sv-4:11802] sess_dir_finalize: job session dir not
> empty - leaving
> [SERVER-14:08399] sess_dir_finalize: job session dir
> not empty - leaving
> [SERVER-6:09087] sess_dir_finalize: proc session dir
> not empty - leaving
> [SERVER-6:09087] sess_dir_finalize: proc session dir
> not empty - leaving
> [SERVER-4:15711] sess_dir_finalize: proc session dir
> not empty - leaving
> [SERVER-4:15711] sess_dir_finalize: proc session dir
> not empty - leaving
> [SERVER-6:09087] sess_dir_finalize: job session dir
> not empty - leaving
> exiting with status 0
> [SERVER-7:32563] sess_dir_finalize: proc session dir
> not empty - leaving
> [SERVER-7:32563] sess_dir_finalize: proc session dir
> not empty - leaving
> [SERVER-5:12534] sess_dir_finalize: proc session dir
> not empty - leaving
> [SERVER-5:12534] sess_dir_finalize: proc session dir
> not empty - leaving
> [SERVER-7:32563] sess_dir_finalize: job session dir
> not empty - leaving
> exiting with status 0
> exiting with status 0
> exiting with status 0
> [SERVER-4:15711] sess_dir_finalize: job session dir
> not empty - leaving
> [SERVER-3:28993] sess_dir_finalize: proc session dir
> not empty - leaving
> exiting with status 0
> [SERVER-3:28993] sess_dir_finalize: proc session dir
> not empty - leaving
> [sv-3:08352] sess_dir_finalize: proc session dir not
> empty - leaving
> [sv-3:08352] sess_dir_finalize: job session dir not
> empty - leaving
> [sv-1:45701] sess_dir_finalize: proc session dir not
> empty - leaving
> [sv-1:45701] sess_dir_finalize: job session dir not
> empty - leaving
> exiting with status 0
> exiting with status 0
> [sv-2:07503] sess_dir_finalize: proc session dir not
> empty - leaving
> [sv-2:07503] sess_dir_finalize: job session dir not
> empty - leaving
> exiting with status 0
> [SERVER-5:12534] sess_dir_finalize: job session dir
> not empty - leaving
> exiting with status 0
> [SERVER-3:28993] sess_dir_finalize: job session dir
> not empty - leaving
> exiting with status 0
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 6 with PID 8412 on
> node x.x.x.41 exiting improperly. There are three
> reasons this could occur:
>
> 1. this process did not call "init" before exiting,
> but others in
> the job did. This can cause a job to hang indefinitely
> while it waits
> for all processes to call "init". By rule, if one
> process calls "init",
> then ALL processes must call "init" prior to termination.
>
> 2. this process called "init", but exited without
> calling "finalize".
> By rule, all processes that call "init" MUST call
> "finalize" prior to
> exiting or it will be considered an "abnormal termination"
>
> 3. this process called "MPI_Abort" or "orte_abort" and
> the mca parameter
> orte_create_session_dirs is set to false. In this
> case, the run-time cannot
> detect that the abort call was an abnormal
> termination. Hence, the only
> error message you will receive is this one.
>
> This may have caused other processes in the
> application to be
> terminated by signals sent by mpirun (as reported here).
>
> You can avoid this message by specifying -quiet on the
> mpirun command line.
>
> --------------------------------------------------------------------------
> [SERVER-2:05284] 6 more processes have sent help
> message help-mpi-runtime /
> mpi_init:startup:internal-failure
> [SERVER-2:05284] Set MCA parameter
> "orte_base_help_aggregate" to 0 to see all help /
> error messages
> [SERVER-2:05284] 9 more processes have sent help
> message help-mpi-errors.txt / mpi_errors_are_fatal
> unknown handle
> [SERVER-2:05284] 9 more processes have sent help
> message help-mpi-runtime.txt / ompi mpi abort:cannot
> guarantee all killed
> [SERVER-2:05284] 2 more processes have sent help
> message help-mca-bml-r2.txt / unreachable proc
> [SERVER-2:05284] 2 more processes have sent help
> message help-mpi-runtime /
> mpi_init:startup:pml-add-procs-fail
> [SERVER-2:05284] sess_dir_finalize: job session dir
> not empty - leaving
> exiting with status 1
>
> //******************************************************************
>
> Any feedback will be helpful. Thank you!
>
> Mr. Beans
>
> _______________________________________________
> users mailing list
> users_at_[hidden] <mailto:users_at_[hidden]>
> http://www.open-mpi.org/mailman/listinfo.cgi/users