
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2005-11-13 19:53:40


I can't believe I missed that, sorry. :-(

None of the btls are capable of doing loopback communication except
"self." Hence, you can't run "--mca btl foo" if your app ever sends
to itself -- you need to run "--mca btl foo,self" at a minimum.
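
For example, with the openib btl (this is the command from the script
in Troy's mail below, changed only to add "self" to the btl list):

    mpirun --prefix /usr/x86_64-gcc-3.3.3/openmpi-1.0rc5/ --mca btl \
        openib,self -np 16 -machinefile $work_dir/node $work_dir/xhpl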

This is not so much an optimization as a software engineering
decision: this way, we didn't have to include the special
send-to-self case in any of the other btl components (i.e., less
code and less complex maintenance).
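
To make the send-to-self case concrete, here is a minimal sketch
(hypothetical, not code from this thread) of an MPI program in which
each rank messages itself; with "--mca btl openib" alone there is no
component to carry this traffic, while "--mca btl openib,self" works:

/* send_to_self.c -- illustrative sketch, not part of the original
 * thread.  Each rank posts a receive from itself, then sends to
 * itself; the "self" btl is what moves these bytes. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, val, out;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    val = rank;

    /* Post the nonblocking receive first so the matching send
     * cannot block waiting for a receiver. */
    MPI_Irecv(&out, 1, MPI_INT, rank, 0, MPI_COMM_WORLD, &req);
    MPI_Send(&val, 1, MPI_INT, rank, 0, MPI_COMM_WORLD);
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    printf("rank %d received %d from itself\n", rank, out);
    MPI_Finalize();
    return 0;
}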

On Nov 13, 2005, at 7:12 PM, Brian Barrett wrote:

> One other thing I noticed... You specify -mca btl openib. Try
> specifying -mca btl openib,self. The self component is used for
> "send to self" operations. This could be the cause of your failures.
>
> Brian
>
> On Nov 13, 2005, at 3:02 PM, Jeff Squyres wrote:
>
>> Troy --
>>
>> Were you perchance using multiple processes per node? If so, we
>> literally just fixed some sm btl bugs that could have been affecting
>> you (they could have caused hangs). The fixes are in today's nightly
>> snapshots (both trunk and v1.0): r8140. If you were using the sm btl
>> and multiple processes per node, could you try again?
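>>
>> For instance, a quick check that exercises the sm btl with several
>> ranks per node might look like this (sketch only; "hosts" is a
>> hypothetical machinefile that lists each node multiple times):
>>
>>   mpirun --mca btl sm,openib,self -np 16 -machinefile hosts ./IMB-MPI1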
>>
>>
>> On Nov 12, 2005, at 10:20 AM, Troy Telford wrote:
>>
>>>> We have very limited openib resources for testing at the moment. Can
>>>> you provide details on how to reproduce?
>>>
>>> My bad; I must've been in a bigger hurry to go home for the weekend
>>> than I thought.
>>>
>>> I'm going to start with the assumption that you're interested in
>>> the steps to reproduce it in Open MPI, and less interested in the
>>> specifics of the OpenIB setup.
>>>
>>> Hardware Data:
>>> Dual Opteron
>>> 4 GB RAM
>>> PCI-X Mellanox IB HCAs
>>>
>>> Software:
>>> SuSE Linux Enterprise Server 9es, SP2
>>> Linux Kernel 2.6.14 (Kernel IB drivers)
>>> OpenIB.org svn build of the userspace libraries and utilities. (I
>>> mentioned the revision number in an earlier post)
>>>
>>> Setup:
>>> Recompiled Presta, the Intel MPI Benchmark, HPL, and HPCC against
>>> Open MPI 1.0rc5
>>>
>>> HPL.dat and HPCC.dat are identical to the versions I previously
>>> posted (not included, to reduce redundant traffic).
>>>
>>> Execution was started by uncommenting the line for the desired
>>> binary in the following (truncated) script:
>>> #mpirun --prefix /usr/x86_64-gcc-3.3.3/openmpi-1.0rc5/ --mca btl
>>> openib -np 16 -machinefile $work_dir/node $work_dir/hello_world
>>> #mpirun --prefix /usr/x86_64-gcc-3.3.3/openmpi-1.0rc5/ --mca btl
>>> openib -np 16 -machinefile $work_dir/node $work_dir/IMB-MPI1
>>> #mpirun --prefix /usr/x86_64-gcc-3.3.3/openmpi-1.0rc5/ --mca btl
>>> openib -np 16 -machinefile $work_dir/node $work_dir/com -o100
>>> #mpirun --prefix /usr/x86_64-gcc-3.3.3/openmpi-1.0rc5/ --mca btl
>>> openib -np 16 -machinefile $work_dir/node $work_dir/allred 1000 100
>>> 1000
>>> #mpirun --prefix /usr/x86_64-gcc-3.3.3/openmpi-1.0rc5/ --mca btl
>>> openib -np 16 -machinefile $work_dir/node $work_dir/globalop --help
>>> #mpirun --prefix /usr/x86_64-gcc-3.3.3/openmpi-1.0rc5/ --mca ptl
>>> openib -np 16 -machinefile $work_dir/node $work_dir/laten -o 100
>>> #mpirun --prefix /usr/x86_64-gcc-3.3.3/openmpi-1.0rc5/ --mca btl
>>> openib -np 16 -machinefile $work_dir/node $work_dir/hpcc
>>> mpirun --prefix /usr/x86_64-gcc-3.3.3/openmpi-1.0rc5/ --mca btl
>>> openib -np 16 -machinefile $work_dir/node $work_dir/xhpl
>>>
>>> As to which tests produce the error: the presta 'com' test almost
>>> always produces it, although at a different place in the test on
>>> each run. (There are two files, presta.com-16.rc5 and
>>> presta.gen2-16rc5; both are from the 'com' test, but note that
>>> they fail at different points.)
>>>
>>> In addition, IMB (the Intel MPI Benchmark) exhibits the same
>>> behavior, halting execution in different places. The 'allred' and
>>> 'globalop' tests behave the same way, producing the same error.
>>> (However, I did manage to get 'allred' to actually complete
>>> once... somehow.)
>>>
>>> HPL and HPCC would also exit, producing the same errors.
>>>
>>> If there's anything else I may have left out, I'll see what I can do.
>>
>> --
>> {+} Jeff Squyres
>> {+} The Open MPI Project
>> {+} http://www.open-mpi.org/
>>
>
> --
> Brian Barrett
> Open MPI developer
> http://www.open-mpi.org/
>
>

-- 
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/