Open MPI User's Mailing List Archives

From: George Bosilca (bosilca_at_[hidden])
Date: 2006-05-11 17:30:19


This message indicates that one of the nodes is not able to set up a
route to the peer using the openib device. Did you run any openib
tests on your cluster? I mean any tests which do not involve MPI?
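
For example, assuming the libibverbs example programs are installed
(the exact names may vary with your OpenIB release), something like
this would exercise the openib stack without involving MPI at all:

   $ ibv_devinfo                   # check that the HCA is visible
   $ ibv_rc_pingpong               # on pe830-02 (server side)
   $ ibv_rc_pingpong 192.168.1.34  # on pe830-01 (client side)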

Otherwise, if you compiled Open MPI in debug mode, there are two
parameters you can use to get more information out of the system:
"--mca btl_base_debug 1" and "--mca btl_base_verbose 100". If you
don't have a debug-mode Open MPI, it may happen that nothing gets
printed.
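
For example, reusing the hosts and paths from your mail:

   $ /usr/local/openmpi/bin/mpirun --prefix /usr/local/openmpi \
         --mca pls_rsh_agent ssh --mca btl openib,self \
         --mca btl_base_debug 1 --mca btl_base_verbose 100 \
         -np 2 --host 192.168.1.34,192.168.1.32 /path/to/hello_world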

Personally, I would do these two things before anything else:
1. Make sure that all (or at least some) of the openib basic tests
succeed on your cluster.
2. Use these two MCA parameters to get more information from the
system.

   Thanks,
     george.

On May 11, 2006, at 5:06 PM, Gurhan Ozen wrote:

> Dagnabbit... I was specifying ib, not openib. When I specified
> openib, I got this error:
>
> "
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process
> is likely to abort. There are many reasons that a parallel process
> can fail during MPI_INIT; some of which are due to configuration or
> environment problems. This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
> PML add procs failed
> --> Returned value -2 instead of OMPI_SUCCESS
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (goodbye)
> "
>
> I can run it with openib,self locally, even multi-process with -np
> greater than one. But once the other node is in the picture, I get
> this error. Hmm, does the error message help to troubleshoot?
>
> Thanks,
> gurhan
> On 5/11/06, Brian Barrett <brbarret_at_[hidden]> wrote:
>> On May 11, 2006, at 10:10 PM, Gurhan Ozen wrote:
>>
>>> Brian,
>>> Thanks for the very clear answers.
>>>
>>> I did change my code to include fflush() calls after printf() ...
>>>
>>> And I did try with --mca btl ib,self. Interesting result: with
>>> --mca btl ib,self, hello_world works fine, but broadcast hangs
>>> after I enter the vector length.
>>>
>>> At any rate, with --mca btl ib,self it looks like the traffic goes
>>> over the ethernet device. I couldn't find any documentation on the
>>> "self" argument of mca; does it mean to explore alternatives if the
>>> desired btl (in this case ib) doesn't work?
>>
>> No, self is the loopback device, for sending messages to self. It is
>> never used for message routing outside of the current process, but is
>> required for almost all transports, as send to self can be a sticky
>> issue.
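>>
>> As a minimal sketch (not from your code, just to illustrate why
>> send-to-self needs its own transport), every rank here messages
>> itself:
>>
>>   #include <mpi.h>
>>   #include <stdio.h>
>>
>>   int main(int argc, char **argv) {
>>       int rank, in = -1;
>>       MPI_Init(&argc, &argv);
>>       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>       /* A message addressed to our own rank travels over the
>>          "self" BTL; with only openib or tcp selected, it would
>>          have no path. */
>>       MPI_Sendrecv(&rank, 1, MPI_INT, rank, 0, &in, 1, MPI_INT,
>>                    rank, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>>       printf("rank %d got %d from itself\n", rank, in);
>>       fflush(stdout);
>>       MPI_Finalize();
>>       return 0;
>>   }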
>>
>> You are specifying openib, not ib, as the argument to mpirun,
>> correct? Either way, I'm not really sure how data could be going
>> over TCP -- the TCP transport would definitely be disabled in that
>> case. At this point, I don't know enough about the Open IB driver to
>> be of help -- one of the other developers is going to have to jump in
>> and provide assistance.
>>
>>> Speaking of documentation, it looks like Open MPI didn't come with
>>> a man page for mpirun. I thought I had seen in one of the slides of
>>> the Open MPI developer's workshop that it did have mpirun.1. Do I
>>> need to check it out from svn?
>>
>> That's one option, or wait for us to release Open MPI 1.0.3 / 1.1.
>>
>> Brian
>>
>>
>>> On 5/11/06, Brian Barrett <brbarret_at_[hidden]> wrote:
>>>> On May 10, 2006, at 10:46 PM, Gurhan Ozen wrote:
>>>>
>>>>> My ultimate goal is to get Open MPI working with the openIB
>>>>> stack. First, I had installed LAM/MPI; I know it doesn't have
>>>>> support for openIB, but it's still relevant to some of the
>>>>> questions I will ask. Here is the setup I have:
>>>>
>>>> Yes, keep in mind throughout that while Open MPI does support
>>>> MVAPI, LAM/MPI will fall back to using IP over IB for
>>>> communication.
>>>>
>>>>> I have two machines, pe830-01 and pe830-02. Both have an
>>>>> ethernet interface and an HCA interface. The IP addresses follow:
>>>>>
>>>>>                 eth0          ib0
>>>>>   pe830-01      10.12.4.32    192.168.1.32
>>>>>   pe830-02      10.12.4.34    192.168.1.34
>>>>>
>>>>> So this has worked even though the lamhosts file is configured
>>>>> to use the ib0 interfaces. I further verified with the tcpdump
>>>>> command that none of this traffic went to eth0.
>>>>>
>>>>> Anyhow, if I change the lamhosts file to use the eth0 IPs, things
>>>>> work just the same with no issues. And in that case I see some
>>>>> traffic on eth0 with tcpdump.
>>>>
>>>> Ok, so at least it sounds like your TCP network is sanely
>>>> configured.
>>>>
>>>>> Now, when I installed and used Open MPI, things didn't work as
>>>>> easily. Here is what happens after recompiling the sources with
>>>>> the mpicc that comes with Open MPI:
>>>>>
>>>>> $ /usr/local/openmpi/bin/mpirun --prefix /usr/local/openmpi \
>>>>>       --mca pls_rsh_agent ssh --mca btl tcp -np 2 \
>>>>>       --host 10.12.4.34,10.12.4.32 /path/to/hello_world
>>>>> Hello, world, I am 0 of 2 and this is on: pe830-02.
>>>>> Hello, world, I am 1 of 2 and this is on: pe830-01.
>>>>>
>>>>> So far so good: using the eth0 interfaces, hello_world works
>>>>> just fine. Now, when I try the broadcast program:
>>>>
>>>> In reality, you always need to include two BTLs when specifying
>>>> them explicitly: the one you want to use (mvapi, openib, tcp,
>>>> etc.) and "self". You can run into issues otherwise.
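>>>>
>>>> For example, something like this (same hosts and paths as in your
>>>> mail) should exercise openib together with the required self BTL:
>>>>
>>>>   $ /usr/local/openmpi/bin/mpirun --prefix /usr/local/openmpi \
>>>>         --mca pls_rsh_agent ssh --mca btl openib,self -np 2 \
>>>>         --host 192.168.1.34,192.168.1.32 /path/to/hello_world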
>>>>
>>>>> $ /usr/local/openmpi/bin/mpirun --prefix /usr/local/openmpi \
>>>>>       --mca pls_rsh_agent ssh --mca btl tcp -np 2 \
>>>>>       --host 10.12.4.34,10.12.4.32 /path/to/broadcast
>>>>>
>>>>> It just hangs there; it doesn't prompt me with the "Enter the
>>>>> vector length:" string. So I just enter a number anyway, since I
>>>>> know the behavior of the program:
>>>>>
>>>>> 10
>>>>> Enter the vector length: i am: 0 , and i have 5 vector elements
>>>>> i am: 1 , and i have 5 vector elements
>>>>> [0] 10.000000
>>>>> [0] 10.000000
>>>>> [0] 10.000000
>>>>> [0] 10.000000
>>>>> [0] 10.000000
>>>>> [0] 10.000000
>>>>> [0] 10.000000
>>>>> [0] 10.000000
>>>>> [0] 10.000000
>>>>> [0] 10.000000
>>>>>
>>>>> So, that's the first bump with Open MPI. Now, if I try to use
>>>>> the ib0 interfaces instead of the eth0 ones, I get:
>>>>
>>>> I'm actually surprised this worked in LAM/MPI, to be honest. There
>>>> should be an fflush() after the printf() to make sure that the
>>>> output
>>>> is actually sent out of the application.
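>>>>
>>>> A minimal sketch of that pattern (the variable names are my guess
>>>> at what broadcast.c does):
>>>>
>>>>   int n = 0;
>>>>   if (rank == 0) {
>>>>       printf("Enter the vector length: ");
>>>>       /* Force the prompt out of the stdio buffer before scanf()
>>>>          blocks; mpirun only forwards output that gets flushed. */
>>>>       fflush(stdout);
>>>>       scanf("%d", &n);
>>>>   }
>>>>   /* All ranks then receive rank 0's value. */
>>>>   MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);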
>>>>
>>>>> $ /usr/local/openmpi/bin/mpirun --prefix /usr/local/openmpi \
>>>>>       --mca pls_rsh_agent ssh --mca btl openib -np 2 \
>>>>>       --host 192.168.1.34,192.168.1.32 /path/to/hello_world
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>> No available btl components were found!
>>>>>
>>>>> This means that there are no components of this type installed
>>>>> on your system or all the components reported that they could
>>>>> not be used.
>>>>>
>>>>> This is a fatal error; your MPI process is likely to abort.
>>>>> Check the output of the "ompi_info" command and ensure that
>>>>> components of this type are available on your system. You may
>>>>> also wish to check the value of the "component_path" MCA
>>>>> parameter and ensure that it has at least one directory that
>>>>> contains valid MCA components.
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>> [pe830-01.domain.com:05942]
>>>>>
>>>>> I know, it thinks that it doesn't have the openib components
>>>>> installed; however, ompi_info on both machines says otherwise:
>>>>>
>>>>> $ ompi_info | grep openib
>>>>> MCA mpool: openib (MCA v1.0, API v1.0, Component v1.0.2)
>>>>> MCA btl: openib (MCA v1.0, API v1.0, Component v1.0.2)
>>>>
>>>> I don't think it will help, but can you try again with --mca btl
>>>> openib,self? For some reason, it appears that the openib component
>>>> is saying that it can't run.
>>>>
>>>>> Now the questions are...
>>>>> 1 - In the case of using LAM/MPI over the ib0 interfaces, does
>>>>> LAM/MPI automatically just use IPoIB?
>>>>
>>>> Yes, LAM has no idea what that Open IB thing is -- it just uses the
>>>> ethernet device.
>>>>
>>>>> 2 - Is there a tcpdump-like utility to dump the traffic on
>>>>> Infiniband HCAs?
>>>>
>>>> I'm not aware of any, but one may exist.
>>>>
>>>>> 3 - In the case of Open MPI, does the --mca btl option have to
>>>>> be passed every time? For example,
>>>>>
>>>>> $ /usr/local/openmpi/bin/mpirun --prefix /usr/local/openmpi \
>>>>>       --mca pls_rsh_agent ssh --mca btl tcp -np 2 \
>>>>>       --host 10.12.4.34,10.12.4.32 /path/to/hello_world
>>>>>
>>>>> works just fine, but the same command without the "--mca btl
>>>>> tcp" bit gives the following error:
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>> It looks like MPI_INIT failed for some reason; your parallel
>>>>> process is likely to abort. There are many reasons that a
>>>>> parallel process can fail during MPI_INIT; some of which are due
>>>>> to configuration or environment problems. This failure appears to
>>>>> be an internal failure; here's some additional information (which
>>>>> may only be relevant to an Open MPI developer):
>>>>>
>>>>> PML add procs failed
>>>>> --> Returned value -2 instead of OMPI_SUCCESS
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>> *** An error occurred in MPI_Init
>>>>> *** before MPI was initialized
>>>>> *** MPI_ERRORS_ARE_FATAL (goodbye)
>>>>
>>>> This makes it sound like Open IB is failing to set up properly.
>>>> I'm a bit out of my league on this one -- is there any application
>>>> you can run that tests Open IB without MPI?
>>>>
>>>>> 4 - How come the behavior of broadcast.c was different on Open
>>>>> MPI than it is on LAM/MPI?
>>>>
>>>> I think I answered this one already.
>>>>
>>>>> 5 - Any ideas as to why I am getting the "no btl component"
>>>>> error when I want to use openib even though ompi_info shows it?
>>>>> If it helps any further, I have the following openib modules:
>>>>
>>>> This usually (but not always) indicates that something is going
>>>> wrong with initializing the hardware interface. ompi_info only
>>>> tries to load the module; it does not initialize the network
>>>> device.
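>>>>
>>>> If your build supports it, something like this may list the openib
>>>> component's MCA parameters and confirm that the module at least
>>>> loads (the exact ompi_info options can vary by version):
>>>>
>>>>   $ ompi_info --param btl openib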
>>>>
>>>>
>>>> Brian
>>>>
>>>> --
>>>> Brian Barrett
>>>> Open MPI developer
>>>> http://www.open-mpi.org/
>>>>
>>>>