Open MPI User's Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2005-11-13 19:53:40


I can't believe I missed that, sorry. :-(

None of the BTLs are capable of doing loopback communication except
"self." Hence, you really can't run "--mca btl foo" if your app ever
sends to itself -- you need to run "--mca btl foo,self" at a
minimum.
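
For example, taking the xhpl line from the script quoted below, the
corrected invocation would look something like:

   mpirun --prefix /usr/x86_64-gcc-3.3.3/openmpi-1.0rc5/ \
       --mca btl openib,self \
       -np 16 -machinefile $work_dir/node $work_dir/xhpl

and, if you run more than one process per node and want them to talk
over shared memory, something like "--mca btl openib,sm,self".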

This is not so much an optimization as a software engineering
decision: this way, we didn't have to include the special send-to-self
case in any of the other BTL components (i.e., less code and less
complex maintenance).
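
One way to double-check which BTL components your installation
actually has (and therefore what you can list after "--mca btl") is
something like:

   ompi_info | grep btl

which should list "self", "sm", "tcp", "openib", etc., depending on
how your Open MPI was configured.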

On Nov 13, 2005, at 7:12 PM, Brian Barrett wrote:

> One other thing I noticed... You specify -mca btl openib. Try
> specifying -mca btl openib,self. The self component is used for
> "send to self" operations. This could be the cause of your failures.
>
> Brian
>
> On Nov 13, 2005, at 3:02 PM, Jeff Squyres wrote:
>
>> Troy --
>>
>> Were you perchance using multiple processes per node? If so, we
>> literally just fixed some sm btl bugs that could have been affecting
>> you (they could have caused hangs). They're fixed in the nightly
>> snapshots from today (both trunk and v1.0): r8140. If you were using
>> the sm btl and multiple processes per node, could you try again?
>>
>>
>> On Nov 12, 2005, at 10:20 AM, Troy Telford wrote:
>>
>>>> We have very limited openib resources for testing at the moment. Can
>>>> you provide details on how to reproduce?
>>>
>>> My bad; I must've been in a bigger hurry to go home for the weekend
>>> than I thought.
>>>
>>> I'm going to start with the assumption you're interested in the steps
>>> to reproduce it in OpenMPI, and are less interested in the specifics
>>> of the OpenIB setup.
>>>
>>> Hardware Data:
>>> Dual Opteron
>>> 4 GB RAM
>>> PCI-X Mellanox IB HCAs
>>>
>>> Software:
>>> SuSE Linux Enterprise Server 9es, SP2
>>> Linux Kernel 2.6.14 (Kernel IB drivers)
>>> OpenIB.org svn build of the userspace libraries and utilities. (I
>>> mentioned the revision number in an earlier post)
>>>
>>> Setup:
>>> Recompiled Presta, Intel MPI Benchmark, HPL, and HPCC against Open
>>> MPI 1.0RC5
>>>
>>> HPL.dat and HPCC.dat are identical to the versions I posted
>>> previously (not included here, to reduce redundant traffic).
>>>
>>> Execution was started by un-commenting the desired binary's line in
>>> the following (truncated) script:
>>> #mpirun --prefix /usr/x86_64-gcc-3.3.3/openmpi-1.0rc5/ --mca btl
>>> openib -np 16 -machinefile $work_dir/node $work_dir/hello_world
>>> #mpirun --prefix /usr/x86_64-gcc-3.3.3/openmpi-1.0rc5/ --mca btl
>>> openib -np 16 -machinefile $work_dir/node $work_dir/IMB-MPI1
>>> #mpirun --prefix /usr/x86_64-gcc-3.3.3/openmpi-1.0rc5/ --mca btl
>>> openib -np 16 -machinefile $work_dir/node $work_dir/com -o100
>>> #mpirun --prefix /usr/x86_64-gcc-3.3.3/openmpi-1.0rc5/ --mca btl
>>> openib -np 16 -machinefile $work_dir/node $work_dir/allred 1000 100
>>> 1000
>>> #mpirun --prefix /usr/x86_64-gcc-3.3.3/openmpi-1.0rc5/ --mca btl
>>> openib -np 16 -machinefile $work_dir/node $work_dir/globalop --help
>>> #mpirun --prefix /usr/x86_64-gcc-3.3.3/openmpi-1.0rc5/ --mca ptl
>>> openib -np 16 -machinefile $work_dir/node $work_dir/laten -o 100
>>> #mpirun --prefix /usr/x86_64-gcc-3.3.3/openmpi-1.0rc5/ --mca btl
>>> openib -np 16 -machinefile $work_dir/node $work_dir/hpcc
>>> mpirun --prefix /usr/x86_64-gcc-3.3.3/openmpi-1.0rc5/ --mca btl
>>> openib -np 16 -machinefile $work_dir/node $work_dir/xhpl
>>>
>>> As to which tests produce the error: the presta 'com' test almost
>>> always produces it, although at a different place in the test on
>>> each run. (There are two files, presta.com-16.rc5 and
>>> presta.gen2-16rc5; both run the 'com' test, but note that they fail
>>> at different points.)
>>>
>>> In addition, IMB (Intel MPI Benchmark) exhibits the same behavior,
>>> halting execution in different places. Similarly, the 'allred' and
>>> 'globalop' tests behave the same way, producing the same error.
>>> (However, I did manage to get 'allred' to actually complete once...
>>> somehow.)
>>>
>>> HPL and HPCC would also exit, producing the same errors.
>>>
>>> If there's anything else I may have left out, I'll see what I can do.
>>
>> --
>> {+} Jeff Squyres
>> {+} The Open MPI Project
>> {+} http://www.open-mpi.org/
>>
>
> --
> Brian Barrett
> Open MPI developer
> http://www.open-mpi.org/
>
>
>

-- 
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/