
Open MPI User's Mailing List Archives


From: Brian Barrett (brbarret_at_[hidden])
Date: 2005-11-13 19:12:50

One other thing I noticed... You specify -mca btl openib. Try
specifying -mca btl openib,self. The self component is used for
"send to self" operations. This could be the cause of your failures.
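
As a rough sketch of what I mean (the ensure_self helper is a made-up name for illustration, not anything shipped with Open MPI), a launch script could make sure "self" is always in the BTL list before it builds the mpirun line. The sketch only prints the command so it can be checked before anything is run; the paths and arguments are the ones from your script.

```shell
#!/bin/sh
# ensure_self: hypothetical helper (not part of Open MPI).
# Prints the given comma-separated BTL list, appending "self" if it is missing.
ensure_self() {
    case ",$1," in
        *,self,*) printf '%s\n' "$1" ;;       # already listed, leave unchanged
        *)        printf '%s\n' "$1,self" ;;  # add the send-to-self component
    esac
}

btl_list=$(ensure_self "openib")

# Print the command rather than executing it, so it can be reviewed first.
echo "mpirun --mca btl $btl_list -np 16 -machinefile \$work_dir/node \$work_dir/com -o100"
```

If you also run more than one process per node, the same list is where the sm component would go (for example openib,sm,self), assuming you want shared memory used for on-node traffic.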


On Nov 13, 2005, at 3:02 PM, Jeff Squyres wrote:

> Troy --
> Were you perchance using multiple processes per node? If so, we
> literally just fixed some sm btl bugs that could have been affecting
> you (they could have caused hangs). They're fixed in the nightly
> snapshots from today (both trunk and v1.0): r8140. If you were using
> the sm btl and multiple processes per node, could you try again?
> On Nov 12, 2005, at 10:20 AM, Troy Telford wrote:
>>> We have very limited openib resources for testing at the moment. Can
>>> you provide details on how to reproduce?
>> My bad; I must've been in a bigger hurry to go home for the weekend
>> than I thought.
>> I'm going to start with the assumption you're interested in the steps
>> to reproduce it in OpenMPI, and are less interested in the specifics
>> of the OpenIB setup.
>> Hardware Data:
>>   Dual Opteron
>>   4 GB RAM
>>   PCI-X Mellanox IB HCA's
>> Software:
>>   SuSE Linux Enterprise Server 9es, SP2
>>   Linux Kernel 2.6.14 (Kernel IB drivers)
>>   svn build of the userspace libraries and utilities. (I
>>   mentioned the revision number in an earlier post)
>> Setup:
>> Recompiled Presta, Intel MPI Benchmark, HPL, and HPCC against OpenIB
>> 1.0RC5
>> HPL.dat and HPCC.dat are identical to versions previously posted by
>> myself. (not included to reduce redundant traffic)
>> Execution was started by uncommenting the line for the desired binary
>> in the following (truncated) script:
>> #mpirun --prefix /usr/x86_64-gcc-3.3.3/openmpi-1.0rc5/ --mca btl
>> openib -np 16 -machinefile $work_dir/node $work_dir/hello_world
>> #mpirun --prefix /usr/x86_64-gcc-3.3.3/openmpi-1.0rc5/ --mca btl
>> openib -np 16 -machinefile $work_dir/node $work_dir/IMB-MPI1
>> #mpirun --prefix /usr/x86_64-gcc-3.3.3/openmpi-1.0rc5/ --mca btl
>> openib -np 16 -machinefile $work_dir/node $work_dir/com -o100
>> #mpirun --prefix /usr/x86_64-gcc-3.3.3/openmpi-1.0rc5/ --mca btl
>> openib -np 16 -machinefile $work_dir/node $work_dir/allred 1000 100
>> 1000
>> #mpirun --prefix /usr/x86_64-gcc-3.3.3/openmpi-1.0rc5/ --mca btl
>> openib -np 16 -machinefile $work_dir/node $work_dir/globalop --help
>> #mpirun --prefix /usr/x86_64-gcc-3.3.3/openmpi-1.0rc5/ --mca btl
>> openib -np 16 -machinefile $work_dir/node $work_dir/laten -o 100
>> #mpirun --prefix /usr/x86_64-gcc-3.3.3/openmpi-1.0rc5/ --mca btl
>> openib -np 16 -machinefile $work_dir/node $work_dir/hpcc
>> mpirun --prefix /usr/x86_64-gcc-3.3.3/openmpi-1.0rc5/ --mca btl
>> openib -np 16 -machinefile $work_dir/node $work_dir/xhpl
>> As to which tests produce the error: The presta 'com' test almost
>> always produces it; although at different places in the test on each
>> run. (there are two files, and presta.gen2-16rc5.
>> Both of these are running the 'com' test; however, note both fail at
>> different points).
>> In addition IMB (Intel MPI Benchmark) also exhibits the same
>> behavior, halting execution in different places. Similarly, the
>> 'allred' and 'globalop' tests would also behave the same way,
>> producing the same error. (However, I did manage to get 'allred' to
>> actually complete once... somehow.)
>> HPL and HPCC also would exit, producing the same errors.
>> If there's anything else I may have left out, I'll see what I can do.
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
> --
> {+} Jeff Squyres
> {+} The Open MPI Project
> {+}

   Brian Barrett
   Open MPI developer