
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] RDMA_CM
From: Lenny Verkhovsky (lenny.verkhovsky_at_[hidden])
Date: 2008-09-25 14:16:09


I think it's an sm bug again. I tested with the latest revision, I think it
was r19588 (before Jeff shut the svn down).
I ran the mpi_p test (BW between pairs of nodes) with many nodes and it
got stuck; it also works without sm. I am sorry I couldn't test it
earlier.
# i=1 ; while [ 1 ] ; do echo " ****************** i=$i ******** ";
/home/USERS/lenny/OMPI_ORTE_TRUNK/bin/mpirun -np 84 -hostfile hostfile
/home/USERS/lenny/TESTS/TRUNK/mpi_p1_4_TRUNK -t bw ; let i=i+1; sleep 1 ;
done
  ****************** i=1 ********
BW (84) (size min max avg) 1048576 660.152249 2075.115025 1325.838953
  ****************** i=2 ********
[stuck]
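Since stack traces of the hung ranks were requested further down the thread, here is a rough sketch of how one might grab them on each node while the job is still hung (this assumes gdb is installed on the compute nodes; the `mpi_p` process-name pattern is just illustrative):

```shell
# Hedged sketch: dump a backtrace of every still-running mpi_p rank on this
# node. This only works on live processes, not <defunct> ones, so run it
# before hitting Ctrl-C on mpirun. Assumes gdb is available on the node.
for pid in $(pgrep -f mpi_p); do
    echo "=== backtrace for pid $pid ==="
    gdb --batch -p "$pid" -ex "thread apply all bt" 2>/dev/null
done
```

Running this on the last machine in the hostfile should show which call the stuck ranks are blocked in.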

p.s. I will be on vacation until 5-Oct; I hope to follow the mails and run a few
tests.
Best Regards
Lenny.
On Thu, Sep 25, 2008 at 6:44 PM, Jeff Squyres <jsquyres_at_[hidden]> wrote:

> Note that there *are* other changes to the openib BTL in that branch
> besides just the CPC (meaning: changing the CPC meant changing other things
> as well).
>
> So if you can run with the trunk and you can't run with this branch, then
> there may be something different wrong with the hg tree other than just the
> RDMA CM stuff...
>
> Let me know what you find.
>
>
> On Sep 25, 2008, at 9:21 AM, Lenny Verkhovsky wrote:
>
>> after a few more tests it seems like -mca btl_openib_cpc_include oob hangs
>> too.
>>
>> so, maybe it's something environmental.
>>
>> let me recheck it.
>>
>>
>> On 9/25/08, Jeff Squyres <jsquyres_at_[hidden]> wrote:
>>
>> On Sep 25, 2008, at 7:25 AM, Lenny Verkhovsky wrote:
>>
>> I have RDMACM hanging on np=16 (dual core, dual cpu).
>>
>>
>> Yuck. I've run all of the intel tests at 32 procs (4ppn). What exactly
>> did you run and where exactly did it hang? Can you get stack traces?
>>
>> it seems like it hung on the last machine
>> (witch1, witch2, witch3, witch4)
>>
>> when I ctrl-c the mpirun, I got defunct procs on the last machine.
>>
>> #ps -ef |grep mpi
>> root 5321 5320 98 14:09 ? 00:03:47 [mpi_p_TRUNK_rdm] <defunct>
>> root 5322 5320 98 14:09 ? 00:03:47 [mpi_p_TRUNK_rdm] <defunct>
>> root 5323 5320 98 14:09 ? 00:03:47 [mpi_p_TRUNK_rdm] <defunct>
>> root 5324 5320 98 14:09 ? 00:03:47 [mpi_p_TRUNK_rdm] <defunct>
>>
>>
>> Are you seeing ORTE problems?
>>
>> --
>> Jeff Squyres
>> Cisco Systems
>>
>>
>>
>
> --
> Jeff Squyres
> Cisco Systems
>
>