Open MPI User's Mailing List Archives

From: Gurhan Ozen (gurhan.ozen_at_[hidden])
Date: 2006-05-11 17:06:09


Dagnabbit.. I was specifying ib, not openib. When I specified openib,
I got this error:

"
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned value -2 instead of OMPI_SUCCESS
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)
"

I can run it with openib,self locally, even with multiple processes and -np
greater than one. But once the other node is in the picture, I get this
error. Hmm, does the error message help to troubleshoot?

Thanks,
gurhan
On 5/11/06, Brian Barrett <brbarret_at_[hidden]> wrote:
> On May 11, 2006, at 10:10 PM, Gurhan Ozen wrote:
>
> > Brian,
> > Thanks for the very clear answers.
> >
> > I did change my code to include fflush() calls after printf() ...
> >
> > And I did try with --mca btl ib,self. Interesting result: with --mca
> > btl ib,self, hello_world works fine, but broadcast hangs after I
> > enter the vector length.
> >
> > At any rate though, with --mca btl ib,self it looks like the traffic
> > goes over the ethernet device. I couldn't find any documentation on
> > the "self" argument of mca; does it mean to explore alternatives if
> > the desired btl (in this case ib) doesn't work?
>
> No, self is the loopback device, for sending messages to self. It is
> never used for message routing outside of the current process, but is
> required for almost all transports, as send to self can be a sticky
> issue.
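>
> As a concrete sketch (using the hostnames and paths from your earlier
> mail -- adjust as needed), that would look something like:
>
>   $ /usr/local/openmpi/bin/mpirun --prefix /usr/local/openmpi \
>         --mca pls_rsh_agent ssh --mca btl openib,self -np 2 \
>         --host 192.168.1.34,192.168.1.32 /path/to/hello_world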
>
> You are specifying openib, not ib, as the argument to mpirun,
> correct? Either way, I'm not really sure how data could be going
> over TCP -- the TCP transport would definitely be disabled in that
> case. At this point, I don't know enough about the Open IB driver to
> be of help -- one of the other developers is going to have to jump in
> and provide assistance.
>
> > Speaking of documentation, it looks like Open MPI didn't come with a
> > man page for mpirun. I thought I had seen in one of the slides of the
> > Open MPI developer's workshop that it did have mpirun.1. Do I need to
> > check it out from svn?
>
> That's one option, or wait for us to release Open MPI 1.0.3 / 1.1.
>
> Brian
>
>
> > On 5/11/06, Brian Barrett <brbarret_at_[hidden]> wrote:
> >> On May 10, 2006, at 10:46 PM, Gurhan Ozen wrote:
> >>
> >>> My ultimate goal is to get Open MPI working with the OpenIB stack.
> >>> First, I had installed LAM/MPI; I know it doesn't have support for
> >>> OpenIB, but it's still relevant to some of the questions I will ask.
> >>> Here is the setup I have:
> >>
> >> Yes, keep in mind throughout that while Open MPI does support MVAPI,
> >> LAM/MPI will fall back to using IP over IB for communication.
> >>
> >>> I have two machines, pe830-01 and pe830-02. Both have an ethernet
> >>> interface and an HCA interface. The IP addresses follow:
> >>>
> >>>              eth0          ib0
> >>> pe830-01     10.12.4.32    192.168.1.32
> >>> pe830-02     10.12.4.34    192.168.1.34
> >>>
> >>> So this has worked even though the lamhosts file is configured to
> >>> use the ib0 interfaces. I further verified with the tcpdump command
> >>> that none of this traffic went to eth0.
> >>>
> >>> Anyhow, if I change the lamhosts file to use the eth0 IPs, things
> >>> work just the same with no issues. And in that case I see some
> >>> traffic on eth0 with tcpdump.
> >>
> >> Ok, so at least it sounds like your TCP network is sanely configured.
> >>
> >>> Now, when I installed and used Open MPI, things didn't work as
> >>> easily. Here is what happens. After recompiling the sources with
> >>> the mpicc that comes with Open MPI:
> >>>
> >>> $ /usr/local/openmpi/bin/mpirun --prefix /usr/local/openmpi \
> >>>       --mca pls_rsh_agent ssh --mca btl tcp -np 2 \
> >>>       --host 10.12.4.34,10.12.4.32 /path/to/hello_world
> >>> Hello, world, I am 0 of 2 and this is on : pe830-02.
> >>> Hello, world, I am 1 of 2 and this is on: pe830-01.
> >>>
> >>> So far so good, using the eth0 interfaces; hello_world works just
> >>> fine. Now, when I try the broadcast program:
> >>
> >> In reality, you always need to include two BTLs when specifying them
> >> explicitly: the one you want to use (mvapi, openib, tcp, etc.) and
> >> "self". You can run into issues otherwise.
> >>
> >>> $ /usr/local/openmpi/bin/mpirun --prefix /usr/local/openmpi \
> >>>       --mca pls_rsh_agent ssh --mca btl tcp -np 2 \
> >>>       --host 10.12.4.34,10.12.4.32 /path/to/broadcast
> >>>
> >>> It just hangs there; it doesn't prompt me with the "Enter the vector
> >>> length:" string. So I just enter a number anyway since I know the
> >>> behavior of the program:
> >>>
> >>> 10
> >>> Enter the vector length: i am: 0 , and i have 5 vector elements
> >>> i am: 1 , and i have 5 vector elements
> >>> [0] 10.000000
> >>> [0] 10.000000
> >>> [0] 10.000000
> >>> [0] 10.000000
> >>> [0] 10.000000
> >>> [0] 10.000000
> >>> [0] 10.000000
> >>> [0] 10.000000
> >>> [0] 10.000000
> >>> [0] 10.000000
> >>>
> >>> So, that's the first bump with Open MPI. Now, if I try to use the
> >>> ib0 interfaces instead of the eth0 ones, I get:
> >>
> >> I'm actually surprised this worked in LAM/MPI, to be honest. There
> >> should be an fflush() after the printf() to make sure that the output
> >> is actually sent out of the application.
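> >>
> >> For instance, a minimal sketch of the prompt-and-broadcast pattern
> >> (the names here are illustrative only, not taken from your broadcast.c):
> >>
> >>   #include <stdio.h>
> >>   #include <mpi.h>
> >>
> >>   int main(int argc, char **argv)
> >>   {
> >>       int rank, len = 0;
> >>       MPI_Init(&argc, &argv);
> >>       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> >>       if (rank == 0) {
> >>           printf("Enter the vector length: ");
> >>           fflush(stdout);  /* force the prompt out before scanf() blocks */
> >>           scanf("%d", &len);
> >>       }
> >>       /* all ranks receive the length entered on rank 0 */
> >>       MPI_Bcast(&len, 1, MPI_INT, 0, MPI_COMM_WORLD);
> >>       printf("rank %d sees vector length %d\n", rank, len);
> >>       fflush(stdout);  /* make sure the output leaves the process */
> >>       MPI_Finalize();
> >>       return 0;
> >>   }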
> >>
> >>> $ /usr/local/openmpi/bin/mpirun --prefix /usr/local/openmpi \
> >>>       --mca pls_rsh_agent ssh --mca btl openib -np 2 \
> >>>       --host 192.168.1.34,192.168.1.32 /path/to/hello_world
> >>>
> >>> --------------------------------------------------------------------------
> >>> No available btl components were found!
> >>>
> >>> This means that there are no components of this type installed on your
> >>> system or all the components reported that they could not be used.
> >>>
> >>> This is a fatal error; your MPI process is likely to abort. Check the
> >>> output of the "ompi_info" command and ensure that components of this
> >>> type are available on your system. You may also wish to check the
> >>> value of the "component_path" MCA parameter and ensure that it has at
> >>> least one directory that contains valid MCA components.
> >>> --------------------------------------------------------------------------
> >>> [pe830-01.domain.com:05942]
> >>>
> >>> I know, it thinks that it doesn't have openib components installed;
> >>> however, ompi_info on both machines says otherwise:
> >>>
> >>> $ ompi_info | grep openib
> >>> MCA mpool: openib (MCA v1.0, API v1.0, Component v1.0.2)
> >>> MCA btl: openib (MCA v1.0, API v1.0, Component v1.0.2)
> >>
> >> I don't think it will help, but can you try again with --mca btl
> >> openib,self? For some reason, it appears that the openib component
> >> is saying that it can't run.
> >>
> >>> Now the questions are...
> >>> 1 - In the case of using LAM/MPI over the ib0 interfaces, does
> >>> LAM/MPI automatically just use IPoIB?
> >>
> >> Yes, LAM has no idea what that Open IB thing is -- it just uses the
> >> ethernet device.
> >>
> >>> 2 - Is there a tcpdump-like utility to dump the traffic on
> >>> Infiniband HCAs?
> >>
> >> I'm not aware of one, but such a tool may exist.
> >>
> >>> 3 - In the case of Open MPI, does the --mca btl option have to be
> >>> passed every time? For example,
> >>>
> >>> $ /usr/local/openmpi/bin/mpirun --prefix /usr/local/openmpi \
> >>>       --mca pls_rsh_agent ssh --mca btl tcp -np 2 \
> >>>       --host 10.12.4.34,10.12.4.32 /path/to/hello_world
> >>>
> >>> works just fine, but the same command without the "--mca btl tcp"
> >>> bit gives:
> >>>
> >>>
> >>> --------------------------------------------------------------------------
> >>> It looks like MPI_INIT failed for some reason; your parallel process is
> >>> likely to abort. There are many reasons that a parallel process can
> >>> fail during MPI_INIT; some of which are due to configuration or environment
> >>> problems. This failure appears to be an internal failure; here's some
> >>> additional information (which may only be relevant to an Open MPI
> >>> developer):
> >>>
> >>>   PML add procs failed
> >>>   --> Returned value -2 instead of OMPI_SUCCESS
> >>> --------------------------------------------------------------------------
> >>> *** An error occurred in MPI_Init
> >>> *** before MPI was initialized
> >>> *** MPI_ERRORS_ARE_FATAL (goodbye)
> >>>
> >>> error ...
> >>
> >> This makes it sound like Open IB is failing to set up properly. I'm a
> >> bit out of my league on this one -- is there any application you
> >> can run
> >>
> >>> 4 - How come the behavior of broadcast.c was different on Open MPI
> >>> than it is on LAM/MPI?
> >>
> >> I think I answered this one already.
> >>
> >>> 5 - Any ideas as to why I am getting the "no btl components" error
> >>> when I want to use openib even though ompi_info shows it? If it
> >>> helps any further, I have the following openib modules:
> >>
> >> This usually (but not always) indicates that something is going wrong
> >> with initializing the hardware interface. ompi_info only tries to
> >> load the module, but does not initialize the network device.
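> >>
> >> To double-check what the openib BTL reports about itself, you could
> >> try something along these lines (assuming your ompi_info build
> >> supports the --param switch):
> >>
> >>   $ ompi_info --param btl openib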
> >>
> >>
> >> Brian
> >>
> >> --
> >> Brian Barrett
> >> Open MPI developer
> >> http://www.open-mpi.org/
> >>
> >>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>