Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Newbie question?
From: John Chludzinski (john.chludzinski_at_[hidden])
Date: 2012-09-16 00:49:47


BTW, I looked up the -mca option:

 -mca |--mca <arg0> <arg1>
              Pass context-specific MCA parameters; they are
              considered global if --gmca is not used and only
              one context is specified (arg0 is the parameter
              name; arg1 is the parameter value)

Could you explain the args: btl and ^openib ?
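
(From what I can gather from the Open MPI FAQ: "btl" names the MCA framework
that picks the point-to-point byte transfer layer, i.e. the network transport,
and a value starting with "^" means "use every component except the ones
listed". Roughly, and assuming the usual component names on this build:

$ mpiexec -mca btl ^openib -n 4 ./a.out      # anything except the openib transport
$ mpiexec -mca btl sm,tcp,self -n 4 ./a.out  # or name the transports explicitly

The second line is only an illustration; the available components depend on
the installation.)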

---John

On Sun, Sep 16, 2012 at 12:26 AM, John Chludzinski <
john.chludzinski_at_[hidden]> wrote:

> BINGO! That did it. Thanks. ---John
>
>
> On Sat, Sep 15, 2012 at 9:32 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>
>> No - the mca param has to be specified *before* your executable
>>
>> mpiexec -mca btl ^openib -n 4 ./a.out
>>
>> Also, note the space between "btl" and "^openib"
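>>
>> If you don't want to type it every time, the same parameter can (if I
>> remember the syntax right) also be set once in the environment or in a
>> per-user config file, e.g.:
>>
>> export OMPI_MCA_btl=^openib
>>
>> or a line "btl = ^openib" in $HOME/.openmpi/mca-params.conf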
>>
>>
>> On Sep 15, 2012, at 5:45 PM, John Chludzinski <john.chludzinski_at_[hidden]>
>> wrote:
>>
>> Is this what you intended(?):
>>
>> $ mpiexec -n 4 ./a.out -mca btl^openib
>>
>> librdmacm: couldn't read ABI version.
>> librdmacm: assuming: 4
>> CMA: unable to get RDMA device list
>> --------------------------------------------------------------------------
>> [[5991,1],0]: A high-performance Open MPI point-to-point messaging module
>> was unable to find any relevant network interfaces:
>>
>> Module: OpenFabrics (openib)
>> Host: elzbieta
>>
>> Another transport will be used instead, although this may result in
>> lower performance.
>> --------------------------------------------------------------------------
>> librdmacm: couldn't read ABI version.
>> librdmacm: assuming: 4
>> CMA: unable to get RDMA device list
>> librdmacm: couldn't read ABI version.
>> librdmacm: assuming: 4
>> CMA: unable to get RDMA device list
>> librdmacm: couldn't read ABI version.
>> librdmacm: assuming: 4
>> CMA: unable to get RDMA device list
>> rank= 1 Results: 5.0000000 6.0000000
>> 7.0000000 8.0000000
>> rank= 0 Results: 1.0000000 2.0000000
>> 3.0000000 4.0000000
>> rank= 2 Results: 9.0000000 10.000000
>> 11.000000 12.000000
>> rank= 3 Results: 13.000000 14.000000
>> 15.000000 16.000000
>> [elzbieta:02374] 3 more processes have sent help message
>> help-mpi-btl-base.txt / btl:no-nics
>> [elzbieta:02374] Set MCA parameter "orte_base_help_aggregate" to 0 to see
>> all help / error messages
>>
>>
>> On Sat, Sep 15, 2012 at 8:22 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>
>>> Try adding "-mca btl ^openib" to your cmd line and see if that cleans it
>>> up.
>>>
>>>
>>> On Sep 15, 2012, at 12:44 PM, John Chludzinski <
>>> john.chludzinski_at_[hidden]> wrote:
>>>
>>> There was a bug in the code. So now I get this, which is correct, but
>>> how do I get rid of all these ABI, CMA, etc. messages?
>>>
>>> $ mpiexec -n 4 ./a.out
>>> librdmacm: couldn't read ABI version.
>>> librdmacm: couldn't read ABI version.
>>> librdmacm: assuming: 4
>>> CMA: unable to get RDMA device list
>>> librdmacm: assuming: 4
>>> CMA: unable to get RDMA device list
>>> CMA: unable to get RDMA device list
>>> librdmacm: couldn't read ABI version.
>>> librdmacm: assuming: 4
>>> librdmacm: couldn't read ABI version.
>>> librdmacm: assuming: 4
>>> CMA: unable to get RDMA device list
>>>
>>> --------------------------------------------------------------------------
>>> [[6110,1],1]: A high-performance Open MPI point-to-point messaging module
>>> was unable to find any relevant network interfaces:
>>>
>>> Module: OpenFabrics (openib)
>>> Host: elzbieta
>>>
>>> Another transport will be used instead, although this may result in
>>> lower performance.
>>>
>>> --------------------------------------------------------------------------
>>> rank= 1 Results: 5.0000000 6.0000000
>>> 7.0000000 8.0000000
>>> rank= 2 Results: 9.0000000 10.000000
>>> 11.000000 12.000000
>>> rank= 0 Results: 1.0000000 2.0000000
>>> 3.0000000 4.0000000
>>> rank= 3 Results: 13.000000 14.000000
>>> 15.000000 16.000000
>>> [elzbieta:02559] 3 more processes have sent help message
>>> help-mpi-btl-base.txt / btl:no-nics
>>> [elzbieta:02559] Set MCA parameter "orte_base_help_aggregate" to 0 to
>>> see all help / error messages
>>>
>>>
>>> On Sat, Sep 15, 2012 at 3:34 PM, John Chludzinski <
>>> john.chludzinski_at_[hidden]> wrote:
>>>
>>>> BTW, here's the example code:
>>>>
>>>> program scatter
>>>>   include 'mpif.h'
>>>>
>>>>   integer, parameter :: SIZE = 4
>>>>   integer :: numtasks, rank, sendcount, recvcount, source, ierr
>>>>   real :: sendbuf(SIZE,SIZE), recvbuf(SIZE)
>>>>
>>>>   ! Fortran stores this array in column-major order, so the
>>>>   ! scatter will actually scatter columns, not rows.
>>>>   data sendbuf /  1.0,  2.0,  3.0,  4.0, &
>>>>                   5.0,  6.0,  7.0,  8.0, &
>>>>                   9.0, 10.0, 11.0, 12.0, &
>>>>                  13.0, 14.0, 15.0, 16.0 /
>>>>
>>>>   call MPI_INIT(ierr)
>>>>   call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
>>>>   call MPI_COMM_SIZE(MPI_COMM_WORLD, numtasks, ierr)
>>>>
>>>>   if (numtasks .eq. SIZE) then
>>>>     source = 1
>>>>     sendcount = SIZE
>>>>     recvcount = SIZE
>>>>     call MPI_SCATTER(sendbuf, sendcount, MPI_REAL, recvbuf, &
>>>>                      recvcount, MPI_REAL, source, MPI_COMM_WORLD, ierr)
>>>>     print *, 'rank= ', rank, ' Results: ', recvbuf
>>>>   else
>>>>     print *, 'Must specify', SIZE, ' processors. Terminating.'
>>>>   endif
>>>>
>>>>   call MPI_FINALIZE(ierr)
>>>>
>>>> end program
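>>>>
>>>> (It has to be launched with exactly SIZE = 4 ranks, e.g. "mpif90 ex1.f95"
>>>> followed by "mpiexec -n 4 ./a.out". With source = 1, rank 1's sendbuf gets
>>>> scattered one column per rank: rank 0 receives 1,2,3,4, rank 1 receives
>>>> 5,6,7,8, and so on.)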
>>>>
>>>>
>>>> On Sat, Sep 15, 2012 at 3:02 PM, John Chludzinski <
>>>> john.chludzinski_at_[hidden]> wrote:
>>>>
>>>>> # export LD_LIBRARY_PATH
>>>>>
>>>>>
>>>>> # mpiexec -n 1 printenv | grep PATH
>>>>> LD_LIBRARY_PATH=/usr/lib/openmpi/lib/
>>>>>
>>>>>
>>>>> PATH=/usr/lib/openmpi/bin/:/usr/lib/ccache:/usr/local/bin:/usr/bin:/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/jski/.local/bin:/home/jski/bin
>>>>> MODULEPATH=/usr/share/Modules/modulefiles:/etc/modulefiles
>>>>> WINDOWPATH=1
>>>>>
>>>>> # mpiexec -n 4 ./a.out
>>>>> librdmacm: couldn't read ABI version.
>>>>> librdmacm: assuming: 4
>>>>> CMA: unable to get RDMA device list
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>> [[3598,1],0]: A high-performance Open MPI point-to-point messaging
>>>>> module
>>>>> was unable to find any relevant network interfaces:
>>>>>
>>>>> Module: OpenFabrics (openib)
>>>>> Host: elzbieta
>>>>>
>>>>> Another transport will be used instead, although this may result in
>>>>> lower performance.
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>> librdmacm: couldn't read ABI version.
>>>>> librdmacm: assuming: 4
>>>>> librdmacm: couldn't read ABI version.
>>>>> CMA: unable to get RDMA device list
>>>>> librdmacm: assuming: 4
>>>>> CMA: unable to get RDMA device list
>>>>> librdmacm: couldn't read ABI version.
>>>>> librdmacm: assuming: 4
>>>>> CMA: unable to get RDMA device list
>>>>> [elzbieta:4145] *** An error occurred in MPI_Scatter
>>>>> [elzbieta:4145] *** on communicator MPI_COMM_WORLD
>>>>> [elzbieta:4145] *** MPI_ERR_TYPE: invalid datatype
>>>>> [elzbieta:4145] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>> mpiexec has exited due to process rank 1 with PID 4145 on
>>>>> node elzbieta exiting improperly. There are two reasons this could
>>>>> occur:
>>>>>
>>>>> 1. this process did not call "init" before exiting, but others in
>>>>> the job did. This can cause a job to hang indefinitely while it waits
>>>>> for all processes to call "init". By rule, if one process calls "init",
>>>>> then ALL processes must call "init" prior to termination.
>>>>>
>>>>> 2. this process called "init", but exited without calling "finalize".
>>>>> By rule, all processes that call "init" MUST call "finalize" prior to
>>>>> exiting or it will be considered an "abnormal termination"
>>>>>
>>>>> This may have caused other processes in the application to be
>>>>> terminated by signals sent by mpiexec (as reported here).
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>>
>>>>>
>>>>> On Sat, Sep 15, 2012 at 2:24 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>>>
>>>>>> Ah - note that there is no LD_LIBRARY_PATH in the environment. That's
>>>>>> the problem
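>>>>>>
>>>>>> A plain assignment only sets the variable in the current shell; it isn't
>>>>>> inherited by the processes mpiexec launches, which is why printenv run
>>>>>> under mpiexec doesn't show it. Exporting it should take care of that:
>>>>>>
>>>>>> export LD_LIBRARY_PATH=/usr/lib/openmpi/lib/
>>>>>> mpiexec -n 1 printenv | grep LD_LIBRARY_PATH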
>>>>>>
>>>>>> On Sep 15, 2012, at 11:19 AM, John Chludzinski <
>>>>>> john.chludzinski_at_[hidden]> wrote:
>>>>>>
>>>>>> $ which mpiexec
>>>>>> /usr/lib/openmpi/bin/mpiexec
>>>>>>
>>>>>> # mpiexec -n 1 printenv | grep PATH
>>>>>>
>>>>>> PATH=/usr/lib/openmpi/bin/:/usr/lib/ccache:/usr/local/bin:/usr/bin:/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/jski/.local/bin:/home/jski/bin
>>>>>> MODULEPATH=/usr/share/Modules/modulefiles:/etc/modulefiles
>>>>>> WINDOWPATH=1
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sat, Sep 15, 2012 at 1:11 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>>>>
>>>>>>> Couple of things worth checking:
>>>>>>>
>>>>>>> 1. verify that you executed the "mpiexec" you think you did - a
>>>>>>> simple "which mpiexec" should suffice
>>>>>>>
>>>>>>> 2. verify that your environment is correct by "mpiexec -n 1 printenv
>>>>>>> | grep PATH". Sometimes the ld_library_path doesn't carry over like you
>>>>>>> think it should
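>>>>>>>
>>>>>>> (A related check, for what it's worth: "ldd ./a.out | grep libmpi" shows
>>>>>>> whether the runtime loader can resolve the MPI libraries for the binary
>>>>>>> at all.)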
>>>>>>>
>>>>>>>
>>>>>>> On Sep 15, 2012, at 10:00 AM, John Chludzinski <
>>>>>>> john.chludzinski_at_[hidden]> wrote:
>>>>>>>
>>>>>>> I installed OpenMPI (I have a simple dual core AMD notebook with
>>>>>>> Fedora 16) via:
>>>>>>>
>>>>>>> # yum install openmpi
>>>>>>> # yum install openmpi-devel
>>>>>>> # mpirun --version
>>>>>>> mpirun (Open MPI) 1.5.4
>>>>>>>
>>>>>>> I added:
>>>>>>>
>>>>>>> $ PATH=/usr/lib/openmpi/bin/:$PATH
>>>>>>> $ LD_LIBRARY_PATH=/usr/lib/openmpi/lib/
>>>>>>>
>>>>>>> Then:
>>>>>>>
>>>>>>> $ mpif90 ex1.f95
>>>>>>> $ mpiexec -n 4 ./a.out
>>>>>>> ./a.out: error while loading shared libraries: libmpi_f90.so.1:
>>>>>>> cannot open shared object file: No such file or directory
>>>>>>> ./a.out: error while loading shared libraries: libmpi_f90.so.1:
>>>>>>> cannot open shared object file: No such file or directory
>>>>>>> ./a.out: error while loading shared libraries: libmpi_f90.so.1:
>>>>>>> cannot open shared object file: No such file or directory
>>>>>>> ./a.out: error while loading shared libraries: libmpi_f90.so.1:
>>>>>>> cannot open shared object file: No such file or directory
>>>>>>>
>>>>>>> --------------------------------------------------------------------------
>>>>>>> mpiexec noticed that the job aborted, but has no info as to the
>>>>>>> process
>>>>>>> that caused that situation.
>>>>>>>
>>>>>>> --------------------------------------------------------------------------
>>>>>>>
>>>>>>> ls -l /usr/lib/openmpi/lib/
>>>>>>> total 6788
>>>>>>> lrwxrwxrwx. 1 root root 25 Sep 15 12:25 libmca_common_sm.so ->
>>>>>>> libmca_common_sm.so.2.0.0
>>>>>>> lrwxrwxrwx. 1 root root 25 Sep 14 16:14 libmca_common_sm.so.2
>>>>>>> -> libmca_common_sm.so.2.0.0
>>>>>>> -rwxr-xr-x. 1 root root 8492 Jan 20 2012
>>>>>>> libmca_common_sm.so.2.0.0
>>>>>>> lrwxrwxrwx. 1 root root 19 Sep 15 12:25 libmpi_cxx.so ->
>>>>>>> libmpi_cxx.so.1.0.1
>>>>>>> lrwxrwxrwx. 1 root root 19 Sep 14 16:14 libmpi_cxx.so.1 ->
>>>>>>> libmpi_cxx.so.1.0.1
>>>>>>> -rwxr-xr-x. 1 root root 87604 Jan 20 2012 libmpi_cxx.so.1.0.1
>>>>>>> lrwxrwxrwx. 1 root root 19 Sep 15 12:25 libmpi_f77.so ->
>>>>>>> libmpi_f77.so.1.0.2
>>>>>>> lrwxrwxrwx. 1 root root 19 Sep 14 16:14 libmpi_f77.so.1 ->
>>>>>>> libmpi_f77.so.1.0.2
>>>>>>> -rwxr-xr-x. 1 root root 179912 Jan 20 2012 libmpi_f77.so.1.0.2
>>>>>>> lrwxrwxrwx. 1 root root 19 Sep 15 12:25 libmpi_f90.so ->
>>>>>>> libmpi_f90.so.1.1.0
>>>>>>> lrwxrwxrwx. 1 root root 19 Sep 14 16:14 libmpi_f90.so.1 ->
>>>>>>> libmpi_f90.so.1.1.0
>>>>>>> -rwxr-xr-x. 1 root root 10364 Jan 20 2012 libmpi_f90.so.1.1.0
>>>>>>> lrwxrwxrwx. 1 root root 15 Sep 15 12:25 libmpi.so ->
>>>>>>> libmpi.so.1.0.2
>>>>>>> lrwxrwxrwx. 1 root root 15 Sep 14 16:14 libmpi.so.1 ->
>>>>>>> libmpi.so.1.0.2
>>>>>>> -rwxr-xr-x. 1 root root 1383444 Jan 20 2012 libmpi.so.1.0.2
>>>>>>> lrwxrwxrwx. 1 root root 21 Sep 15 12:25 libompitrace.so ->
>>>>>>> libompitrace.so.0.0.0
>>>>>>> lrwxrwxrwx. 1 root root 21 Sep 14 16:14 libompitrace.so.0 ->
>>>>>>> libompitrace.so.0.0.0
>>>>>>> -rwxr-xr-x. 1 root root 13572 Jan 20 2012 libompitrace.so.0.0.0
>>>>>>> lrwxrwxrwx. 1 root root 20 Sep 15 12:25 libopen-pal.so ->
>>>>>>> libopen-pal.so.3.0.0
>>>>>>> lrwxrwxrwx. 1 root root 20 Sep 14 16:14 libopen-pal.so.3 ->
>>>>>>> libopen-pal.so.3.0.0
>>>>>>> -rwxr-xr-x. 1 root root 386324 Jan 20 2012 libopen-pal.so.3.0.0
>>>>>>> lrwxrwxrwx. 1 root root 20 Sep 15 12:25 libopen-rte.so ->
>>>>>>> libopen-rte.so.3.0.0
>>>>>>> lrwxrwxrwx. 1 root root 20 Sep 14 16:14 libopen-rte.so.3 ->
>>>>>>> libopen-rte.so.3.0.0
>>>>>>> -rwxr-xr-x. 1 root root 790052 Jan 20 2012 libopen-rte.so.3.0.0
>>>>>>> -rw-r--r--. 1 root root 301520 Jan 20 2012 libotf.a
>>>>>>> lrwxrwxrwx. 1 root root 15 Sep 15 12:25 libotf.so ->
>>>>>>> libotf.so.0.0.1
>>>>>>> lrwxrwxrwx. 1 root root 15 Sep 14 16:14 libotf.so.0 ->
>>>>>>> libotf.so.0.0.1
>>>>>>> -rwxr-xr-x. 1 root root 206384 Jan 20 2012 libotf.so.0.0.1
>>>>>>> -rw-r--r--. 1 root root 337970 Jan 20 2012 libvt.a
>>>>>>> -rw-r--r--. 1 root root 591070 Jan 20 2012 libvt-hyb.a
>>>>>>> lrwxrwxrwx. 1 root root 18 Sep 15 12:25 libvt-hyb.so ->
>>>>>>> libvt-hyb.so.0.0.0
>>>>>>> lrwxrwxrwx. 1 root root 18 Sep 14 16:14 libvt-hyb.so.0 ->
>>>>>>> libvt-hyb.so.0.0.0
>>>>>>> -rwxr-xr-x. 1 root root 428844 Jan 20 2012 libvt-hyb.so.0.0.0
>>>>>>> -rw-r--r--. 1 root root 541004 Jan 20 2012 libvt-mpi.a
>>>>>>> lrwxrwxrwx. 1 root root 18 Sep 15 12:25 libvt-mpi.so ->
>>>>>>> libvt-mpi.so.0.0.0
>>>>>>> lrwxrwxrwx. 1 root root 18 Sep 14 16:14 libvt-mpi.so.0 ->
>>>>>>> libvt-mpi.so.0.0.0
>>>>>>> -rwxr-xr-x. 1 root root 396352 Jan 20 2012 libvt-mpi.so.0.0.0
>>>>>>> -rw-r--r--. 1 root root 372352 Jan 20 2012 libvt-mt.a
>>>>>>> lrwxrwxrwx. 1 root root 17 Sep 15 12:25 libvt-mt.so ->
>>>>>>> libvt-mt.so.0.0.0
>>>>>>> lrwxrwxrwx. 1 root root 17 Sep 14 16:14 libvt-mt.so.0 ->
>>>>>>> libvt-mt.so.0.0.0
>>>>>>> -rwxr-xr-x. 1 root root 266104 Jan 20 2012 libvt-mt.so.0.0.0
>>>>>>> -rw-r--r--. 1 root root 60390 Jan 20 2012 libvt-pomp.a
>>>>>>> lrwxrwxrwx. 1 root root 14 Sep 15 12:25 libvt.so ->
>>>>>>> libvt.so.0.0.0
>>>>>>> lrwxrwxrwx. 1 root root 14 Sep 14 16:14 libvt.so.0 ->
>>>>>>> libvt.so.0.0.0
>>>>>>> -rwxr-xr-x. 1 root root 242604 Jan 20 2012 libvt.so.0.0.0
>>>>>>> -rwxr-xr-x. 1 root root 303591 Jan 20 2012 mpi.mod
>>>>>>> drwxr-xr-x. 2 root root 4096 Sep 14 16:14 openmpi
>>>>>>>
>>>>>>>
>>>>>>> The file (actually a symlink) it claims it can't find, libmpi_f90.so.1,
>>>>>>> is clearly there, and LD_LIBRARY_PATH=/usr/lib/openmpi/lib/.
>>>>>>>
>>>>>>> What's the problem?
>>>>>>>
>>>>>>> ---John