
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2007-11-02 13:03:12


George is right: that error message indicates one of three things. Either
the dlopen() of the udapl component is failing (which seems unlikely --
you'd see errors when you run ompi_info, since we use the exact same code
to open components in ompi_info as we do in MPI processes), the udapl BTL
is electing not to run for some reason, or the udapl BTL doesn't think
that it can reach its peers (e.g., OMPI finds a pair of MPI processes
where no communication route has been defined).
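
A quick way to narrow it down might be to crank up the BTL framework's
verbosity and watch which BTLs get opened, initialized, and discarded.
This is just a sketch (check "ompi_info --param btl base" to confirm that
your build has the btl_base_verbose parameter); it reuses the command
line from your earlier mail:

   mpirun --n 2 --host vic12-10g,vic20-10g \
       -mca btl udapl,self -mca btl_base_verbose 100 \
       /usr/mpi/gcc/open*/tests/IMB*/IMB-MPI1 pingpong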

I *think* your udapl component is loading OK. I'm *guessing* that
it's electing not to run for some reason. You might want to attach a
debugger and look in the udapl BTL init functions:

mca_btl_udapl_component_open ()
mca_btl_udapl_component_init ()

I'm guessing that one of these two functions is failing / returning
NULL (meaning "I don't want to run").
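
If attaching to a running job is awkward, a rough alternative (untested
with your setup) is to run a single rank under gdb on one of the nodes
and break in the component; gdb will probably ask to make the breakpoints
pending until mca_btl_udapl.so is dlopen'ed -- say yes:

   ssh vic12-10g
   gdb /usr/mpi/gcc/open*/tests/IMB*/IMB-MPI1
   (gdb) break mca_btl_udapl_component_open
   (gdb) break mca_btl_udapl_component_init
   (gdb) run pingpong

A singleton run won't exercise the two-node reachability check, but it
should at least show whether component_init returns NULL on that node.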

I have very little experience with uDAPL on Linux; I have only tried
the udapl BTL a few times (a long time ago).

On Nov 2, 2007, at 9:38 AM, Jon Mason wrote:

> On Thu, Nov 01, 2007 at 07:41:33PM -0400, George Bosilca wrote:
>> There are two things that are reflected in your email.
>>
>> 1. You can run Open MPI (or at least ompi_info) on the head node, and
>> udapl is in the list of BTLs. This means the head node has all the
>> libraries required to load udapl, and your LD_LIBRARY_PATH is
>> correctly configured on the head node.
>>
>> 2. When running between vic12-10g and vic20-10g, udapl cannot be
>> loaded or refuses to load. This can mean two things: either some of
>> the shared libraries are missing or not in the LD_LIBRARY_PATH, or,
>> once initialized, udapl detects that the connection to the remote
>> node is impossible.
>>
>> The next thing to do is to test that your LD_LIBRARY_PATH is correctly
>> set for non-interactive shells on each node in the cluster (which
>> means it contains not only the path to the Open MPI libraries but
>> also the path to the udapl libraries). A "ssh vic12-10g printenv |
>> grep LD_LIBRARY_PATH" should give you the answer.
>
> Thanks for the help. Per your request, I get the following:
> # ssh vic12-10g printenv | grep LD
> LD_LIBRARY_PATH=/usr/mpi/gcc/openmpi-1.2-svn/lib64:
>
> That directory contains the btl udapl libraries, as you said.
> # ls -R /usr/mpi/gcc/openmpi-1.2-svn/lib64/ | grep dapl
> mca_btl_udapl.la
> mca_btl_udapl.so
>
> A search on the system shows libdaplcma and libdat in /usr/lib/. For
> giggles, I added /usr/lib to the env, but the program still fails to
> run with the same error.
>
> I believe I have the correct rpms installed for the libs. Here is
> what
> I have on the systems.
> # rpm -qa | grep dapl
> dapl-devel-1.2.1-0
> dapl-1.2.1-0
> dapl-utils-1.2.1-0
>
> What should I be looking to link against?
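
Replying inline: running ldd on the component itself should show exactly
which dapl libraries it wants and whether they resolve. I'm guessing at
the exact subdirectory under lib64 here, so adjust the path as needed:

   ldd /usr/mpi/gcc/openmpi-1.2-svn/lib64/openmpi/mca_btl_udapl.so | \
       grep -E 'dat|dapl'

If anything shows up as "not found", it's a library path problem; if
everything resolves, the libraries aren't the issue and we're back to
the component electing not to run.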
>
> Thanks,
> Jon
>
>>
>> Thanks,
>> george
>>
>> On Nov 1, 2007, at 6:52 PM, Jon Mason wrote:
>>
>>> On Wed, Oct 31, 2007 at 06:45:10PM -0400, Tim Prins wrote:
>>>> Hi Jon,
>>>>
>>>> Just to make sure, running 'ompi_info' shows that you have the
>>>> udapl btl
>>>> installed?
>>>
>>> Yes, I get the following:
>>> # ompi_info | grep dapl
>>> MCA btl: udapl (MCA v1.0, API v1.0, Component v1.2.5)
>>>
>>> If I do not include "self" in the mca, then I get an error saying it
>>> cannot find the btl component:
>>>
>>> # mpirun --n 2 --host vic12-10g,vic20-10g -mca btl udapl /usr/mpi/gcc/openmpi-1.2.2-1/tests/IMB-2.3/IMB-MPI1 pingpong
>>> --------------------------------------------------------------------------
>>> No available btl components were found!
>>>
>>> This means that there are no components of this type installed on
>>> your system or all the components reported that they could not be used.
>>>
>>> This is a fatal error; your MPI process is likely to abort. Check the
>>> output of the "ompi_info" command and ensure that components of this
>>> type are available on your system. You may also wish to check the
>>> value of the "component_path" MCA parameter and ensure that it has at
>>> least one directory that contains valid MCA components.
>>>
>>> --------------------------------------------------------------------------
>>> mpirun noticed that job rank 1 with PID 4335 on node vic20-10g exited
>>> on signal 15 (Terminated).
>>>
>>> # ompi_info --all | grep component_path
>>>     MCA mca: parameter "mca_component_path" (current value:
>>>     "/usr/mpi/gcc/openmpi-1.2-svn/lib/openmpi:/root/.openmpi/components")
>>>
>>> # ls /usr/mpi/gcc/openmpi-1.2-svn/lib/openmpi | grep dapl
>>> mca_btl_udapl.la
>>> mca_btl_udapl.so
>>>
>>> So it looks to me like it should be finding it, but perhaps I am
>>> lacking something in my configuration. Any ideas?
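
One thing I notice inline: ompi_info reports a component path under
.../lib/openmpi, while the LD_LIBRARY_PATH from your other mail points
at .../lib64. Both trees apparently contain mca_btl_udapl.so, but if you
want to rule out a 32/64-bit mix-up you could point the component path
explicitly at the lib64 tree. A sketch (I'm guessing the name of the
openmpi subdirectory under lib64):

   mpirun --n 2 --host vic12-10g,vic20-10g \
       -mca mca_component_path /usr/mpi/gcc/openmpi-1.2-svn/lib64/openmpi \
       -mca btl udapl,self \
       /usr/mpi/gcc/open*/tests/IMB*/IMB-MPI1 pingpong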
>>>
>>> Thanks,
>>> Jon
>>>
>>>
>>>>
>>>> Tim
>>>>
>>>> On Wednesday 31 October 2007 06:11:39 pm Jon Mason wrote:
>>>>> I am having a bit of a problem getting udapl to work via mpirun
>>>>> (over open-mpi, obviously). I am running a basic pingpong test and
>>>>> I get the following error.
>>>>>
>>>>> # mpirun --n 2 --host vic12-10g,vic20-10g -mca btl udapl,self /usr/mpi/gcc/open*/tests/IMB*/IMB-MPI1 pingpong
>>>>> --------------------------------------------------------------------------
>>>>> Process 0.1.1 is unable to reach 0.1.0 for MPI communication.
>>>>> If you specified the use of a BTL component, you may have
>>>>> forgotten a component (such as "self") in the list of
>>>>> usable components.
>>>>> --------------------------------------------------------------------------
>>>>> --------------------------------------------------------------------------
>>>>> It looks like MPI_INIT failed for some reason; your parallel process
>>>>> is likely to abort. There are many reasons that a parallel process
>>>>> can fail during MPI_INIT; some of which are due to configuration or
>>>>> environment problems. This failure appears to be an internal failure;
>>>>> here's some additional information (which may only be relevant to an
>>>>> Open MPI developer):
>>>>>
>>>>> PML add procs failed
>>>>> --> Returned "Unreachable" (-12) instead of "Success" (0)
>>>>> --------------------------------------------------------------------------
>>>>> *** An error occurred in MPI_Init
>>>>> *** before MPI was initialized
>>>>> *** MPI_ERRORS_ARE_FATAL (goodbye)
>>>>> --------------------------------------------------------------------------
>>>>> Process 0.1.0 is unable to reach 0.1.1 for MPI communication.
>>>>> If you specified the use of a BTL component, you may have
>>>>> forgotten a component (such as "self") in the list of
>>>>> usable components.
>>>>> --------------------------------------------------------------------------
>>>>> --------------------------------------------------------------------------
>>>>> It looks like MPI_INIT failed for some reason; your parallel process
>>>>> is likely to abort. There are many reasons that a parallel process
>>>>> can fail during MPI_INIT; some of which are due to configuration or
>>>>> environment problems. This failure appears to be an internal failure;
>>>>> here's some additional information (which may only be relevant to an
>>>>> Open MPI developer):
>>>>>
>>>>> PML add procs failed
>>>>> --> Returned "Unreachable" (-12) instead of "Success" (0)
>>>>> --------------------------------------------------------------------------
>>>>> *** An error occurred in MPI_Init
>>>>> *** before MPI was initialized
>>>>> *** MPI_ERRORS_ARE_FATAL (goodbye)
>>>>>
>>>>>
>>>>>
>>>>> The command is successful if udapl is replaced with tcp or openib.
>>>>> So I think my setup is correct. Also, dapltest successfully
>>>>> completes without any problems over IB or iWARP.
>>>>>
>>>>> Any thoughts or suggestions would be greatly appreciated.
>>>>>
>>>>> Thanks,
>>>>> Jon
>>>>>

-- 
Jeff Squyres
Cisco Systems