I think it's an sm bug again. I tested with the latest revision, I think it was r19588 (before Jeff shut the svn down).
I ran the mpi_p test (BW between pairs of nodes) with many nodes and it got stuck; it works without sm. I am sorry I couldn't test it earlier.
# i=1 ; while [ 1 ] ; do echo " ****************** i=$i ******** "; /home/USERS/lenny/OMPI_ORTE_TRUNK/bin/mpirun -np 84 -hostfile hostfile /home/USERS/lenny/TESTS/TRUNK/mpi_p1_4_TRUNK -t bw ; let i=i+1; sleep 1 ; done
****************** i=1 ********
BW (84) (size min max avg) 1048576 660.152249 2075.115025 1325.838953
****************** i=2 ********
[stuck]
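A loop like the one above stays blocked forever once an iteration hangs. As a minimal sketch only, the same reproduction loop can be wrapped so that it stops and reports the failing iteration. It assumes the GNU coreutils `timeout` utility is available (it exits with status 124 when the time limit expires); the function name `run_until_hang` and its arguments are hypothetical and not part of the original test setup.

```shell
#!/bin/sh
# Hypothetical helper: rerun a command until it fails or exceeds a time limit.
# Usage: run_until_hang LIMIT_SECONDS MAX_ITERATIONS command [args...]
# Assumes GNU coreutils `timeout`, which returns 124 when the limit expires.
run_until_hang() {
    limit=$1
    max=$2
    shift 2
    i=1
    while [ "$i" -le "$max" ]; do
        echo " ****************** i=$i ******** "
        timeout "$limit" "$@"
        rc=$?
        if [ "$rc" -eq 124 ]; then
            # The command did not exit in time: treat this iteration as a hang.
            echo "iteration $i: no exit after ${limit}s, treating as a hang"
            return 124
        elif [ "$rc" -ne 0 ]; then
            # The command exited on its own with an error status.
            echo "iteration $i: command failed with status $rc"
            return "$rc"
        fi
        i=$((i + 1))
        sleep 1
    done
    return 0
}
```

It could be invoked as, for example, `run_until_hang 300 100 /home/USERS/lenny/OMPI_ORTE_TRUNK/bin/mpirun -np 84 -hostfile hostfile /home/USERS/lenny/TESTS/TRUNK/mpi_p1_4_TRUNK -t bw`, where the 300-second limit and the 100-iteration cap are arbitrary assumptions. Note that `timeout` terminates the hung mpirun, so a run meant for collecting stack traces would need a different approach.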
P.S. I will be on vacation until 5-Oct; I hope to follow mails and run a few tests.
Best Regards
Lenny.
On Thu, Sep 25, 2008 at 6:44 PM, Jeff Squyres <jsquyres@cisco.com> wrote:
Note that there *are* other changes to the openib BTL in that branch besides just the CPC (meaning: changing the CPC meant changing other things as well).
So if you can run with the trunk and you can't run with this branch, then there may be something different wrong with the hg tree other than just the RDMA CM stuff...
Let me know what you find.
On Sep 25, 2008, at 9:21 AM, Lenny Verkhovsky wrote:
After a few more tests, it seems like -mca btl_openib_cpc_include oob hangs too.
So maybe it's something environmental.
Let me recheck it.
On 9/25/08, Jeff Squyres <jsquyres@cisco.com> wrote:
On Sep 25, 2008, at 7:25 AM, Lenny Verkhovsky wrote:
RDMACM hung for me at np=16 (dual core, dual CPU).
Yuck. I've run all of the intel tests at 32 procs (4ppn). What exactly did you run and where exactly did it hang? Can you get stack traces?
It seems to have hung on the last machine (witch1, witch2, witch3, witch4).
When I Ctrl-C the mpirun, I get defunct procs on the last machine.
# ps -ef | grep mpi
root 5321 5320 98 14:09 ? 00:03:47 [mpi_p_TRUNK_rdm] <defunct>
root 5322 5320 98 14:09 ? 00:03:47 [mpi_p_TRUNK_rdm] <defunct>
root 5323 5320 98 14:09 ? 00:03:47 [mpi_p_TRUNK_rdm] <defunct>
root 5324 5320 98 14:09 ? 00:03:47 [mpi_p_TRUNK_rdm] <defunct>
Are you seeing ORTE problems?
--
Jeff Squyres
Cisco Systems