Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] openmpi+infiniband
From: christian schmitt (schmitt_at_[hidden])
Date: 2013-07-30 12:45:21


Hello,

Thank you for this. When I start the MPI test with the option "--mca btl
openib,sm,self", I can run it on one node, but I can't start it on two
nodes. The error is:

 schmitt$ /amd/software/openmpi-1.6.5/cltest/bin/mpirun -n 2 -H
cluster1,cluster2 /worklocal/schmitt/imb/3.2.4/src/IMB-MPI1 SENDRECV
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications. This means that no Open MPI device has indicated
that it can be used to communicate between these processes. This is
an error; Open MPI requires that all MPI processes be able to reach
each other. This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[49963,1],0]) is on host: cluster1.gsc.ce.tu-darmstadt.de
  Process 2 ([[49963,1],1]) is on host: cluster2
  BTLs attempted: self sm

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
MPI_INIT has failed because at least one MPI process is unreachable
from another. This *usually* means that an underlying communication
plugin -- such as a BTL or an MTL -- has either not loaded or not
allowed itself to be used. Your MPI job will now abort.

You may wish to try to narrow down the problem;

 * Check the output of ompi_info to see which BTL/MTL plugins are
   available.
 * Run your application with MPI_THREAD_SINGLE.
 * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
   if using MTL-based communications) to see exactly which
   communication plugins were considered and/or discarded.
--------------------------------------------------------------------------
[cluster1.gsc.ce.tu-darmstadt.de:29116] *** An error occurred in MPI_Init
[cluster1.gsc.ce.tu-darmstadt.de:29116] *** on a NULL communicator
[cluster1.gsc.ce.tu-darmstadt.de:29116] *** Unknown error
[cluster1.gsc.ce.tu-darmstadt.de:29116] *** MPI_ERRORS_ARE_FATAL: your
MPI job will now abort
--------------------------------------------------------------------------
An MPI process is aborting at a time when it cannot guarantee that all
of its peer processes in the job will be killed properly. You should
double check that everything has shut down cleanly.

  Reason: Before MPI_INIT completed
  Local host: cluster1.gsc.ce.tu-darmstadt.de
  PID: 29116
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun has exited due to process rank 1 with PID 5194 on
node cluster2 exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[cluster1.gsc.ce.tu-darmstadt.de:29113] 1 more process has sent help
message help-mca-bml-r2.txt / unreachable proc
[cluster1.gsc.ce.tu-darmstadt.de:29113] Set MCA parameter
"orte_base_help_aggregate" to 0 to see all help / error messages
[cluster1.gsc.ce.tu-darmstadt.de:29113] 1 more process has sent help
message help-mpi-runtime / mpi_init:startup:pml-add-procs-fail
[cluster1.gsc.ce.tu-darmstadt.de:29113] 1 more process has sent help
message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
[cluster1.gsc.ce.tu-darmstadt.de:29113] 1 more process has sent help
message help-mpi-runtime.txt / ompi mpi abort:cannot guarantee all killed

It seems that MPI doesn't know how to communicate between the nodes.
Any ideas?
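
The error text above itself recommends raising btl_base_verbose. A minimal
sketch of such a rerun follows, reusing the paths and host names from the
failing command. The helper name debug_cmd is made up for illustration, and
the script only prints the command (a dry run) so it can be checked before
it is run on the cluster.

```shell
#!/bin/sh
# Dry-run sketch: prints the mpirun command instead of executing it.
# Paths and host names are copied from the failing command above.
MPIRUN=/amd/software/openmpi-1.6.5/cltest/bin/mpirun
HOSTS=cluster1,cluster2
BENCH="/worklocal/schmitt/imb/3.2.4/src/IMB-MPI1 SENDRECV"

# debug_cmd is a hypothetical helper, not part of Open MPI.
# It requests the openib, sm and self BTLs explicitly and raises
# btl_base_verbose to 100, as the error message recommends.
debug_cmd() {
    echo "$MPIRUN -n 2 -H $HOSTS" \
         "--mca btl openib,sm,self" \
         "--mca btl_base_verbose 100" \
         "$BENCH"
}

debug_cmd
```

Piping the printed line to sh would run it for real. With verbosity at 100,
the output should show which BTL components each process considered and, if
openib was discarded on one of the nodes, some indication of why.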

Christian Schmitt
Network and System Administrator
Technische Universität Darmstadt
Graduate School of Computational Engineering
Dolivostraße 15, S4 10/326
64293 Darmstadt

Office: +49 (0)6151 / 16-4265
Fax: +49 (0)6151 / 16-4459

schmitt_at_[hidden]

http://www.graduate-school-ce.de/

On 07/30/2013 04:34 PM, Gus Correa wrote:
> Hi Christian
>
> If I understand you right, you want to use Open MPI with
> Infiniband, not Ethernet, right?
>
> If that is the case, try
> '-mca btl openib,sm,self'
> in your mpiexec command line.
>
> I don't think IPoIB is required for Open MPI.
>
> See this FAQ entry (the FAQ is the best Open MPI documentation):
> http://www.open-mpi.org/faq/?category=openfabrics#ib-btl
>
> I hope this helps,
> Gus Correa
>
> On 07/30/2013 09:01 AM, christian schmitt wrote:
>> Hello,
>>
>> I'm trying to get Open MPI (1.6.5) running over InfiniBand.
>> My system is CentOS 6.3. I have installed the Mellanox OFED driver
>> (2.0) and everything seems to be working. ibhosts shows all hosts and
>> the switch.
>> Running "hca_self_test.ofed" shows:
>>
>> ---- Performing Adapter Device Self Test ----
>> Number of CAs Detected ................. 1
>> PCI Device Check ....................... PASS
>> Kernel Arch ............................ x86_64
>> Host Driver Version .................... MLNX_OFED_LINUX-2.0-2.0.5
>> (OFED-2.0-2.0.5): 2.6.32-279.el6.x86_64
>> Host Driver RPM Check .................. PASS
>> Firmware on CA #0 VPI .................. v2.11.500
>> Firmware Check on CA #0 (VPI) .......... PASS
>> Host Driver Initialization ............. PASS
>> Number of CA Ports Active .............. 1
>> Port State of Port #1 on CA #0 (VPI)..... UP 4X QDR (InfiniBand)
>> Error Counter Check on CA #0 (VPI)...... PASS
>> Kernel Syslog Check .................... PASS
>> Node GUID on CA #0 (VPI) ............... 00:02:c9:03:00:1f:a4:e0
>>
>>
>> Running "ompi_info | grep openib" shows:
>> MCA btl: openib (MCA v2.0, API v2.0, Component v1.6.5)
>>
>> So I compiled Open MPI with the option "--with-openib" and tried to
>> run the Intel MPI test. But it still uses the Ethernet interface to
>> communicate. Only when I configure IPoIB (ib0) and start my job with
>> "--mca btl ^openib --mca btl_tcp_if_include ib0" does it run over
>> InfiniBand. But if I'm right, it should work without the ib0 interface.
>> I'm quite new to InfiniBand, so maybe I forgot something.
>> I'm grateful for any information that helps me solve this problem.
>>
>> Thank you,
>>
>> Christian
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
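
The two transport choices discussed in this thread can be put side by side.
The sketch below assumes a hypothetical helper, btl_args, that prints the MCA
options for each case: "verbs" selects the native openib BTL suggested by
Gus, and "ipoib" reproduces the TCP-over-ib0 workaround from the original
message. Only the option strings come from the thread; the helper and the
mode names are illustrative.

```shell
#!/bin/sh
# btl_args is a hypothetical helper, not part of Open MPI.
# It prints the MCA options for one of the two setups from the thread.
btl_args() {
    case "$1" in
        # Native InfiniBand through the openib BTL (no ib0 interface needed).
        verbs) echo "--mca btl openib,sm,self" ;;
        # TCP over the IPoIB interface ib0, with the openib BTL excluded.
        ipoib) echo "--mca btl ^openib --mca btl_tcp_if_include ib0" ;;
        *)     echo "usage: btl_args verbs|ipoib" >&2; return 1 ;;
    esac
}

btl_args verbs
btl_args ipoib
```

The printed options could then be placed on the mpirun command line before
the benchmark binary. The second variant still sends MPI traffic through the
TCP BTL, just over the ib0 interface, which is why it depends on IPoIB being
configured.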