Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Newbie question?
From: Jingcha Joba (pukkimonkey_at_[hidden])
Date: 2012-09-16 03:21:40


John,

BTL refers to the Byte Transfer Layer, a framework for sending and receiving point-to-point messages over different kinds of networks. It has several components (implementations) such as openib, tcp, mx, shared memory (sm), etc.

^openib means do *not* use the openib component for point-to-point messages.
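
For example (a minimal sketch, assuming a stock install with ompi_info on your PATH), you can list the BTL components your build ships and then exclude or whitelist them:

  ompi_info | grep btl                       <- show available BTL components
  mpiexec -mca btl ^openib -n 4 ./a.out      <- exclude openib, use the rest
  mpiexec -mca btl sm,self,tcp -n 4 ./a.out  <- whitelist: shared memory, loopback, TCP only

Note that you cannot mix the ^ (exclusive) form and a whitelist in the same value.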

On a side note, do you have an RDMA-capable device (InfiniBand/RoCE/iWARP)? If so, is OFED installed correctly and running? If you don't have one, is OFED running anyway? It shouldn't be in that case.

The messages you are getting could be caused by this. And if you do have an RDMA-capable device, it also means you might be getting poor performance, because Open MPI is falling back to a slower transport.
 
A wealth of information on these topics is available in the FAQ section.

--
Sent from my iPhone
On Sep 15, 2012, at 9:49 PM, John Chludzinski <john.chludzinski_at_[hidden]> wrote:
> BTW, I looked up the -mca option:
> 
>  -mca |--mca <arg0> <arg1>  
>               Pass context-specific MCA parameters; they are
>               considered global if --gmca is not used and only
>               one context is specified (arg0 is the parameter
>               name; arg1 is the parameter value)
> 
> Could you explain the args: btl and ^openib?
> 
> ---John
> 
> 
> On Sun, Sep 16, 2012 at 12:26 AM, John Chludzinski <john.chludzinski_at_[hidden]> wrote:
> BINGO!  That did it.  Thanks.  ---John
> 
> 
> On Sat, Sep 15, 2012 at 9:32 PM, Ralph Castain <rhc_at_[hidden]> wrote:
> No - the mca param has to be specified *before* your executable
> 
> mpiexec -mca btl ^openib -n 4 ./a.out
> 
> Also, note the space between "btl" and "^openib"
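> 
> Equivalently (a sketch, using Open MPI's standard OMPI_MCA_<name> environment-variable convention), you can set the parameter once in your environment instead of on each command line:
> 
> export OMPI_MCA_btl=^openib
> mpiexec -n 4 ./a.out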
> 
> 
> On Sep 15, 2012, at 5:45 PM, John Chludzinski <john.chludzinski_at_[hidden]> wrote:
> 
>> Is this what you intended(?):
>> 
>> $ mpiexec -n 4 ./a.out -mca btl^openib
>> 
>> librdmacm: couldn't read ABI version.
>> librdmacm: assuming: 4
>> CMA: unable to get RDMA device list
>> --------------------------------------------------------------------------
>> [[5991,1],0]: A high-performance Open MPI point-to-point messaging module
>> was unable to find any relevant network interfaces:
>> 
>> Module: OpenFabrics (openib)
>>   Host: elzbieta
>> 
>> Another transport will be used instead, although this may result in
>> lower performance.
>> --------------------------------------------------------------------------
>> librdmacm: couldn't read ABI version.
>> librdmacm: assuming: 4
>> CMA: unable to get RDMA device list
>> librdmacm: couldn't read ABI version.
>> librdmacm: assuming: 4
>> CMA: unable to get RDMA device list
>> librdmacm: couldn't read ABI version.
>> librdmacm: assuming: 4
>> CMA: unable to get RDMA device list
>>  rank=            1  Results:    5.0000000       6.0000000       7.0000000       8.0000000    
>>  rank=            0  Results:    1.0000000       2.0000000       3.0000000       4.0000000    
>>  rank=            2  Results:    9.0000000       10.000000       11.000000       12.000000    
>>  rank=            3  Results:    13.000000       14.000000       15.000000       16.000000    
>> [elzbieta:02374] 3 more processes have sent help message help-mpi-btl-base.txt / btl:no-nics
>> [elzbieta:02374] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
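>> 
>> (Per that last line, a sketch of how to see every copy of the help message instead of the aggregated summary:
>> 
>> mpiexec -mca orte_base_help_aggregate 0 -mca btl ^openib -n 4 ./a.out )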
>> 
>> 
>> On Sat, Sep 15, 2012 at 8:22 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>> Try adding "-mca btl ^openib" to your cmd line and see if that cleans it up.
>> 
>> 
>> On Sep 15, 2012, at 12:44 PM, John Chludzinski <john.chludzinski_at_[hidden]> wrote:
>> 
>>> There was a bug in the code. So now I get this, which is correct, but how do I get rid of all these ABI, CMA, etc. messages?
>>> 
>>> $ mpiexec -n 4 ./a.out 
>>> librdmacm: couldn't read ABI version.
>>> librdmacm: couldn't read ABI version.
>>> librdmacm: assuming: 4
>>> CMA: unable to get RDMA device list
>>> librdmacm: assuming: 4
>>> CMA: unable to get RDMA device list
>>> CMA: unable to get RDMA device list
>>> librdmacm: couldn't read ABI version.
>>> librdmacm: assuming: 4
>>> librdmacm: couldn't read ABI version.
>>> librdmacm: assuming: 4
>>> CMA: unable to get RDMA device list
>>> --------------------------------------------------------------------------
>>> [[6110,1],1]: A high-performance Open MPI point-to-point messaging module
>>> was unable to find any relevant network interfaces:
>>> 
>>> Module: OpenFabrics (openib)
>>>   Host: elzbieta
>>> 
>>> Another transport will be used instead, although this may result in
>>> lower performance.
>>> --------------------------------------------------------------------------
>>>  rank=            1  Results:    5.0000000       6.0000000       7.0000000       8.0000000    
>>>  rank=            2  Results:    9.0000000       10.000000       11.000000       12.000000    
>>>  rank=            0  Results:    1.0000000       2.0000000       3.0000000       4.0000000    
>>>  rank=            3  Results:    13.000000       14.000000       15.000000       16.000000    
>>> [elzbieta:02559] 3 more processes have sent help message help-mpi-btl-base.txt / btl:no-nics
>>> [elzbieta:02559] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>>> 
>>> 
>>> On Sat, Sep 15, 2012 at 3:34 PM, John Chludzinski <john.chludzinski_at_[hidden]> wrote:
>>> BTW, here's the example code:
>>> 
>>> program scatter
>>> include 'mpif.h'
>>> 
>>> integer, parameter :: SIZE=4
>>> integer :: numtasks, rank, sendcount, recvcount, source, ierr
>>> real :: sendbuf(SIZE,SIZE), recvbuf(SIZE)
>>> 
>>> !  Fortran stores this array in column major order, so the 
>>> !  scatter will actually scatter columns, not rows.
>>> data sendbuf /  1.0,  2.0,  3.0,  4.0, &
>>>                 5.0,  6.0,  7.0,  8.0, &
>>>                 9.0, 10.0, 11.0, 12.0, &
>>>                13.0, 14.0, 15.0, 16.0 /
>>> 
>>> call MPI_INIT(ierr)
>>> call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
>>> call MPI_COMM_SIZE(MPI_COMM_WORLD, numtasks, ierr)
>>> 
>>> if (numtasks .eq. SIZE) then
>>>   source = 1
>>>   sendcount = SIZE
>>>   recvcount = SIZE
>>>   call MPI_SCATTER(sendbuf, sendcount, MPI_REAL, recvbuf, &
>>>                    recvcount, MPI_REAL, source, MPI_COMM_WORLD, ierr)
>>>   print *, 'rank= ',rank,' Results: ',recvbuf 
>>> else
>>>    print *, 'Must specify',SIZE,' processors.  Terminating.' 
>>> endif
>>> 
>>> call MPI_FINALIZE(ierr)
>>> 
>>> end program
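>>> 
>>> To build and run it (a minimal sketch; the file name scatter.f95 is just an assumption):
>>> 
>>> mpif90 scatter.f95
>>> mpiexec -n 4 ./a.out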
>>> 
>>> 
>>> On Sat, Sep 15, 2012 at 3:02 PM, John Chludzinski <john.chludzinski_at_[hidden]> wrote:
>>> # export LD_LIBRARY_PATH
>>> 
>>> 
>>> # mpiexec -n 1 printenv | grep PATH
>>> LD_LIBRARY_PATH=/usr/lib/openmpi/lib/
>>> PATH=/usr/lib/openmpi/bin/:/usr/lib/ccache:/usr/local/bin:/usr/bin:/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/jski/.local/bin:/home/jski/bin
>>> MODULEPATH=/usr/share/Modules/modulefiles:/etc/modulefiles
>>> WINDOWPATH=1
>>> 
>>> # mpiexec -n 4 ./a.out 
>>> librdmacm: couldn't read ABI version.
>>> librdmacm: assuming: 4
>>> CMA: unable to get RDMA device list
>>> --------------------------------------------------------------------------
>>> [[3598,1],0]: A high-performance Open MPI point-to-point messaging module
>>> was unable to find any relevant network interfaces:
>>> 
>>> Module: OpenFabrics (openib)
>>>   Host: elzbieta
>>> 
>>> Another transport will be used instead, although this may result in
>>> lower performance.
>>> --------------------------------------------------------------------------
>>> librdmacm: couldn't read ABI version.
>>> librdmacm: assuming: 4
>>> librdmacm: couldn't read ABI version.
>>> CMA: unable to get RDMA device list
>>> librdmacm: assuming: 4
>>> CMA: unable to get RDMA device list
>>> librdmacm: couldn't read ABI version.
>>> librdmacm: assuming: 4
>>> CMA: unable to get RDMA device list
>>> [elzbieta:4145] *** An error occurred in MPI_Scatter
>>> [elzbieta:4145] *** on communicator MPI_COMM_WORLD
>>> [elzbieta:4145] *** MPI_ERR_TYPE: invalid datatype
>>> [elzbieta:4145] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
>>> --------------------------------------------------------------------------
>>> mpiexec has exited due to process rank 1 with PID 4145 on
>>> node elzbieta exiting improperly. There are two reasons this could occur:
>>> 
>>> 1. this process did not call "init" before exiting, but others in
>>> the job did. This can cause a job to hang indefinitely while it waits
>>> for all processes to call "init". By rule, if one process calls "init",
>>> then ALL processes must call "init" prior to termination.
>>> 
>>> 2. this process called "init", but exited without calling "finalize".
>>> By rule, all processes that call "init" MUST call "finalize" prior to
>>> exiting or it will be considered an "abnormal termination"
>>> 
>>> This may have caused other processes in the application to be
>>> terminated by signals sent by mpiexec (as reported here).
>>> --------------------------------------------------------------------------
>>> 
>>> 
>>> 
>>> On Sat, Sep 15, 2012 at 2:24 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>> Ah - note that there is no LD_LIBRARY_PATH in the environment. That's the problem
>>> 
>>> On Sep 15, 2012, at 11:19 AM, John Chludzinski <john.chludzinski_at_[hidden]> wrote:
>>> 
>>>> $ which mpiexec
>>>> /usr/lib/openmpi/bin/mpiexec
>>>> 
>>>> # mpiexec -n 1 printenv | grep PATH
>>>> PATH=/usr/lib/openmpi/bin/:/usr/lib/ccache:/usr/local/bin:/usr/bin:/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/jski/.local/bin:/home/jski/bin
>>>> MODULEPATH=/usr/share/Modules/modulefiles:/etc/modulefiles
>>>> WINDOWPATH=1
>>>> 
>>>> 
>>>> 
>>>> On Sat, Sep 15, 2012 at 1:11 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>> Couple of things worth checking:
>>>> 
>>>> 1. verify that you executed the "mpiexec" you think you did - a simple "which mpiexec" should suffice
>>>> 
>>>> 2. verify that your environment is correct by "mpiexec -n 1 printenv | grep PATH". Sometimes the ld_library_path doesn't carry over like you think it should
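>>>> 
>>>> If LD_LIBRARY_PATH turns out not to carry over, exporting it usually fixes it; a sketch, assuming the Fedora package path from your listing:
>>>> 
>>>> LD_LIBRARY_PATH=/usr/lib/openmpi/lib/
>>>> export LD_LIBRARY_PATH
>>>> mpiexec -n 1 printenv | grep LD_LIBRARY_PATH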
>>>> 
>>>> 
>>>> On Sep 15, 2012, at 10:00 AM, John Chludzinski <john.chludzinski_at_[hidden]> wrote:
>>>> 
>>>>> I installed OpenMPI (I have a simple dual core AMD notebook with Fedora 16) via:
>>>>> 
>>>>> # yum install openmpi
>>>>> # yum install openmpi-devel
>>>>> # mpirun --version
>>>>> mpirun (Open MPI) 1.5.4
>>>>> 
>>>>> I added: 
>>>>> 
>>>>> $ PATH=/usr/lib/openmpi/bin/:$PATH
>>>>> $ LD_LIBRARY_PATH=/usr/lib/openmpi/lib/
>>>>> 
>>>>> Then:
>>>>> 
>>>>> $ mpif90 ex1.f95
>>>>> $ mpiexec -n 4 ./a.out 
>>>>> ./a.out: error while loading shared libraries: libmpi_f90.so.1: cannot open shared object file: No such file or directory
>>>>> ./a.out: error while loading shared libraries: libmpi_f90.so.1: cannot open shared object file: No such file or directory
>>>>> ./a.out: error while loading shared libraries: libmpi_f90.so.1: cannot open shared object file: No such file or directory
>>>>> ./a.out: error while loading shared libraries: libmpi_f90.so.1: cannot open shared object file: No such file or directory
>>>>> --------------------------------------------------------------------------
>>>>> mpiexec noticed that the job aborted, but has no info as to the process
>>>>> that caused that situation.
>>>>> --------------------------------------------------------------------------
>>>>> 
>>>>> ls -l /usr/lib/openmpi/lib/
>>>>> total 6788
>>>>> lrwxrwxrwx. 1 root root      25 Sep 15 12:25 libmca_common_sm.so -> libmca_common_sm.so.2.0.0
>>>>> lrwxrwxrwx. 1 root root      25 Sep 14 16:14 libmca_common_sm.so.2 -> libmca_common_sm.so.2.0.0
>>>>> -rwxr-xr-x. 1 root root    8492 Jan 20  2012 libmca_common_sm.so.2.0.0
>>>>> lrwxrwxrwx. 1 root root      19 Sep 15 12:25 libmpi_cxx.so -> libmpi_cxx.so.1.0.1
>>>>> lrwxrwxrwx. 1 root root      19 Sep 14 16:14 libmpi_cxx.so.1 -> libmpi_cxx.so.1.0.1
>>>>> -rwxr-xr-x. 1 root root   87604 Jan 20  2012 libmpi_cxx.so.1.0.1
>>>>> lrwxrwxrwx. 1 root root      19 Sep 15 12:25 libmpi_f77.so -> libmpi_f77.so.1.0.2
>>>>> lrwxrwxrwx. 1 root root      19 Sep 14 16:14 libmpi_f77.so.1 -> libmpi_f77.so.1.0.2
>>>>> -rwxr-xr-x. 1 root root  179912 Jan 20  2012 libmpi_f77.so.1.0.2
>>>>> lrwxrwxrwx. 1 root root      19 Sep 15 12:25 libmpi_f90.so -> libmpi_f90.so.1.1.0
>>>>> lrwxrwxrwx. 1 root root      19 Sep 14 16:14 libmpi_f90.so.1 -> libmpi_f90.so.1.1.0
>>>>> -rwxr-xr-x. 1 root root   10364 Jan 20  2012 libmpi_f90.so.1.1.0
>>>>> lrwxrwxrwx. 1 root root      15 Sep 15 12:25 libmpi.so -> libmpi.so.1.0.2
>>>>> lrwxrwxrwx. 1 root root      15 Sep 14 16:14 libmpi.so.1 -> libmpi.so.1.0.2
>>>>> -rwxr-xr-x. 1 root root 1383444 Jan 20  2012 libmpi.so.1.0.2
>>>>> lrwxrwxrwx. 1 root root      21 Sep 15 12:25 libompitrace.so -> libompitrace.so.0.0.0
>>>>> lrwxrwxrwx. 1 root root      21 Sep 14 16:14 libompitrace.so.0 -> libompitrace.so.0.0.0
>>>>> -rwxr-xr-x. 1 root root   13572 Jan 20  2012 libompitrace.so.0.0.0
>>>>> lrwxrwxrwx. 1 root root      20 Sep 15 12:25 libopen-pal.so -> libopen-pal.so.3.0.0
>>>>> lrwxrwxrwx. 1 root root      20 Sep 14 16:14 libopen-pal.so.3 -> libopen-pal.so.3.0.0
>>>>> -rwxr-xr-x. 1 root root  386324 Jan 20  2012 libopen-pal.so.3.0.0
>>>>> lrwxrwxrwx. 1 root root      20 Sep 15 12:25 libopen-rte.so -> libopen-rte.so.3.0.0
>>>>> lrwxrwxrwx. 1 root root      20 Sep 14 16:14 libopen-rte.so.3 -> libopen-rte.so.3.0.0
>>>>> -rwxr-xr-x. 1 root root  790052 Jan 20  2012 libopen-rte.so.3.0.0
>>>>> -rw-r--r--. 1 root root  301520 Jan 20  2012 libotf.a
>>>>> lrwxrwxrwx. 1 root root      15 Sep 15 12:25 libotf.so -> libotf.so.0.0.1
>>>>> lrwxrwxrwx. 1 root root      15 Sep 14 16:14 libotf.so.0 -> libotf.so.0.0.1
>>>>> -rwxr-xr-x. 1 root root  206384 Jan 20  2012 libotf.so.0.0.1
>>>>> -rw-r--r--. 1 root root  337970 Jan 20  2012 libvt.a
>>>>> -rw-r--r--. 1 root root  591070 Jan 20  2012 libvt-hyb.a
>>>>> lrwxrwxrwx. 1 root root      18 Sep 15 12:25 libvt-hyb.so -> libvt-hyb.so.0.0.0
>>>>> lrwxrwxrwx. 1 root root      18 Sep 14 16:14 libvt-hyb.so.0 -> libvt-hyb.so.0.0.0
>>>>> -rwxr-xr-x. 1 root root  428844 Jan 20  2012 libvt-hyb.so.0.0.0
>>>>> -rw-r--r--. 1 root root  541004 Jan 20  2012 libvt-mpi.a
>>>>> lrwxrwxrwx. 1 root root      18 Sep 15 12:25 libvt-mpi.so -> libvt-mpi.so.0.0.0
>>>>> lrwxrwxrwx. 1 root root      18 Sep 14 16:14 libvt-mpi.so.0 -> libvt-mpi.so.0.0.0
>>>>> -rwxr-xr-x. 1 root root  396352 Jan 20  2012 libvt-mpi.so.0.0.0
>>>>> -rw-r--r--. 1 root root  372352 Jan 20  2012 libvt-mt.a
>>>>> lrwxrwxrwx. 1 root root      17 Sep 15 12:25 libvt-mt.so -> libvt-mt.so.0.0.0
>>>>> lrwxrwxrwx. 1 root root      17 Sep 14 16:14 libvt-mt.so.0 -> libvt-mt.so.0.0.0
>>>>> -rwxr-xr-x. 1 root root  266104 Jan 20  2012 libvt-mt.so.0.0.0
>>>>> -rw-r--r--. 1 root root   60390 Jan 20  2012 libvt-pomp.a
>>>>> lrwxrwxrwx. 1 root root      14 Sep 15 12:25 libvt.so -> libvt.so.0.0.0
>>>>> lrwxrwxrwx. 1 root root      14 Sep 14 16:14 libvt.so.0 -> libvt.so.0.0.0
>>>>> -rwxr-xr-x. 1 root root  242604 Jan 20  2012 libvt.so.0.0.0
>>>>> -rwxr-xr-x. 1 root root  303591 Jan 20  2012 mpi.mod
>>>>> drwxr-xr-x. 2 root root    4096 Sep 14 16:14 openmpi
>>>>> 
>>>>> 
>>>>> The file (actually a link) it claims it can't find, libmpi_f90.so.1, is clearly there. And LD_LIBRARY_PATH=/usr/lib/openmpi/lib/.
>>>>> 
>>>>> What's the problem?
>>>>> 
>>>>> ---John