Hmmm... well, according to this, it looks like the process ranks are being incorrectly assigned. That shouldn't have anything to do with which environment we are in (slurm, rsh, etc.).
I'll look into it - thanks!
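
One way to see the rank problem in the --display-devel-map output quoted below is to look for process names that appear more than once. The following is a minimal sketch, not part of Open MPI: the helper name duplicate_ranks is hypothetical, and it assumes the map has been captured as plain text containing "Data for proc: [[job-family,job],vpid]" entries as they appear in this thread.

```python
import re
from collections import Counter

# Matches an ORTE process name as printed in the map,
# e.g. "Data for proc: [[51942,1],0]".
PROC_RE = re.compile(r"Data for proc: \[\[(\d+),(\d+)\],(\d+)\]")


def duplicate_ranks(map_output):
    """Return the sorted process names that occur more than once in the map text."""
    names = [tuple(int(x) for x in m.groups())
             for m in PROC_RE.finditer(map_output)]
    counts = Counter(names)
    return sorted(name for name, n in counts.items() if n > 1)


# Proc entries copied from the two runs in this thread (other fields elided).
np_run = "Data for proc: [[51868,1],0] ... Data for proc: [[51868,1],1]"
npernode_run = "Data for proc: [[51942,1],0] ... Data for proc: [[51942,1],0]"

print(duplicate_ranks(np_run))        # []
print(duplicate_ranks(npernode_run))  # [(51942, 1, 0)]
```

In the -np run each proc has its own vpid (0 and 1), while in the -npernode run both nodes list a proc named [[51942,1],0], which is consistent with the ranks being assigned incorrectly.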
OK. The -np only run:
---
sh-3.1$ mpirun -np 2 --display-allocation --display-devel-map mpi_hello

====================== ALLOCATED NODES ======================

Data for node: Name: cut1n7 Launch id: -1 Arch: ffc91200 State: 2
    Num boards: 1 Num sockets/board: 2 Num cores/socket: 4
    Daemon: [[51868,0],0] Daemon launched: True
    Num slots: 1 Slots in use: 0
    Num slots allocated: 1 Max slots: 0
    Username on node: NULL
    Num procs: 0 Next node_rank: 0
Data for node: Name: cut1n8 Launch id: -1 Arch: 0 State: 2
    Num boards: 1 Num sockets/board: 2 Num cores/socket: 4
    Daemon: Not defined Daemon launched: False
    Num slots: 0 Slots in use: 0
    Num slots allocated: 0 Max slots: 0
    Username on node: NULL
    Num procs: 0 Next node_rank: 0

=================================================================

Map generated by mapping policy: 0400
Npernode: 0 Oversubscribe allowed: TRUE CPU Lists: FALSE
Num new daemons: 1 New daemon starting vpid 1
Num nodes: 2

Data for node: Name: cut1n7 Launch id: -1 Arch: ffc91200 State: 2
    Num boards: 1 Num sockets/board: 2 Num cores/socket: 4
    Daemon: [[51868,0],0] Daemon launched: True
    Num slots: 1 Slots in use: 1
    Num slots allocated: 1 Max slots: 0
    Username on node: NULL
    Num procs: 1 Next node_rank: 1
    Data for proc: [[51868,1],0]
        Pid: 0 Local rank: 0 Node rank: 0
        State: 0 App_context: 0 Slot list: NULL
Data for node: Name: cut1n8 Launch id: -1 Arch: 0 State: 2
    Num boards: 1 Num sockets/board: 2 Num cores/socket: 4
    Daemon: [[51868,0],1] Daemon launched: False
    Num slots: 0 Slots in use: 1
    Num slots allocated: 0 Max slots: 0
    Username on node: NULL
    Num procs: 1 Next node_rank: 1
    Data for proc: [[51868,1],1]
        Pid: 0 Local rank: 0 Node rank: 0
        State: 0 App_context: 0 Slot list: NULL
Hello, I am node cut1n8 with rank 1
Hello, I am node cut1n7 with rank 0
---
Before the segfault I got (using -npernode):
---
sh-3.1$ mpirun -npernode 1 --display-allocation --display-devel-map mpi_hello

====================== ALLOCATED NODES ======================

Data for node: Name: cut1n7 Launch id: -1 Arch: ffc91200 State: 2
    Num boards: 1 Num sockets/board: 2 Num cores/socket: 4
    Daemon: [[51942,0],0] Daemon launched: True
    Num slots: 1 Slots in use: 0
    Num slots allocated: 1 Max slots: 0
    Username on node: NULL
    Num procs: 0 Next node_rank: 0
Data for node: Name: cut1n8 Launch id: -1 Arch: 0 State: 2
    Num boards: 1 Num sockets/board: 2 Num cores/socket: 4
    Daemon: Not defined Daemon launched: False
    Num slots: 0 Slots in use: 0
    Num slots allocated: 0 Max slots: 0
    Username on node: NULL
    Num procs: 0 Next node_rank: 0

=================================================================

Map generated by mapping policy: 0400
Npernode: 1 Oversubscribe allowed: TRUE CPU Lists: FALSE
Num new daemons: 1 New daemon starting vpid 1
Num nodes: 2

Data for node: Name: cut1n7 Launch id: -1 Arch: ffc91200 State: 2
    Num boards: 1 Num sockets/board: 2 Num cores/socket: 4
    Daemon: [[51942,0],0] Daemon launched: True
    Num slots: 1 Slots in use: 1
    Num slots allocated: 1 Max slots: 0
    Username on node: NULL
    Num procs: 1 Next node_rank: 1
    Data for proc: [[51942,1],0]
        Pid: 0 Local rank: 0 Node rank: 0
        State: 0 App_context: 0 Slot list: NULL
Data for node: Name: cut1n8 Launch id: -1 Arch: 0 State: 2
    Num boards: 1 Num sockets/board: 2 Num cores/socket: 4
    Daemon: [[51942,0],1] Daemon launched: False
    Num slots: 0 Slots in use: 1
    Num slots allocated: 0 Max slots: 0
    Username on node: NULL
    Num procs: 1 Next node_rank: 1
    Data for proc: [[51942,1],0]
        Pid: 0 Local rank: 0 Node rank: 0
        State: 0 App_context: 0 Slot list: NULL
[cut1n7:19375] *** Process received signal ***
[cut1n7:19375] Signal: Segmentation fault (11)
[cut1n7:19375] Signal code: Address not mapped (1)
[cut1n7:19375] Failing at address: 0x50
[cut1n7:19375] [ 0] /lib64/libpthread.so.0 [0x37bda0de80]
[cut1n7:19375] [ 1] /apps/mpi/openmpi/1.4.2-gcc-4.1.2-may.12.10/lib/libopen-rte.so.0(orte_util_encode_pidmap+0xdb) [0x2aed0f93af8b]
[cut1n7:19375] [ 2] /apps/mpi/openmpi/1.4.2-gcc-4.1.2-may.12.10/lib/libopen-rte.so.0(orte_odls_base_default_get_add_procs_data+0x655) [0x2aed0f9462f5]
[cut1n7:19375] [ 3] /apps/mpi/openmpi/1.4.2-gcc-4.1.2-may.12.10/lib/libopen-rte.so.0(orte_plm_base_launch_apps+0x10b) [0x2aed0f94d31b]
[cut1n7:19375] [ 4] /apps/mpi/openmpi/1.4.2-gcc-4.1.2-may.12.10/lib/openmpi/mca_plm_slurm.so [0x2aed107f6ecf]
[cut1n7:19375] [ 5] mpirun [0x40335a]
[cut1n7:19375] [ 6] mpirun [0x4029f3]
[cut1n7:19375] [ 7] /lib64/libc.so.6(__libc_start_main+0xf4) [0x37bce1d8b4]
[cut1n7:19375] [ 8] mpirun [0x402929]
[cut1n7:19375] *** End of error message ***
Segmentation fault
---
I'll look into a slurm version update. Previously, SLURM 1.0.30 and Open MPI 1.3.2 were working together. Just curious what was giving me heartache here ...

On Mon, May 17, 2010 at 4:06 PM, Ralph Castain <rhc@open-mpi.org> wrote:
That's a pretty old version of slurm - I don't have access to anything that old to test against. You could try running it with --display-allocation --display-devel-map to see what ORTE thinks the allocation is and how it mapped the procs. It sounds like something may be having a problem there...

On Mon, May 17, 2010 at 11:08 AM, Christopher Maestas <cdmaestas@gmail.com> wrote:
Hello,

I've been having some troubles with OpenMPI 1.4.X and slurm recently. I seem to be able to run jobs this way ok:
---
sh-3.1$ mpirun -np 2 mpi_hello
Hello, I am node cut1n7 with rank 0
Hello, I am node cut1n8 with rank 1
---
However if I try and use the -npernode option I get:
---
sh-3.1$ mpirun -npernode 1 mpi_hello
[cut1n7:16368] *** Process received signal ***
[cut1n7:16368] Signal: Segmentation fault (11)
[cut1n7:16368] Signal code: Address not mapped (1)
[cut1n7:16368] Failing at address: 0x50
[cut1n7:16368] [ 0] /lib64/libpthread.so.0 [0x37bda0de80]
[cut1n7:16368] [ 1] /apps/mpi/openmpi/1.4.2-gcc-4.1.2-may.12.10/lib/libopen-rte.so.0(orte_util_encode_pidmap+0xdb) [0x2b73eb84df8b]
[cut1n7:16368] [ 2] /apps/mpi/openmpi/1.4.2-gcc-4.1.2-may.12.10/lib/libopen-rte.so.0(orte_odls_base_default_get_add_procs_data+0x655) [0x2b73eb8592f5]
[cut1n7:16368] [ 3] /apps/mpi/openmpi/1.4.2-gcc-4.1.2-may.12.10/lib/libopen-rte.so.0(orte_plm_base_launch_apps+0x10b) [0x2b73eb86031b]
[cut1n7:16368] [ 4] /apps/mpi/openmpi/1.4.2-gcc-4.1.2-may.12.10/lib/openmpi/mca_plm_slurm.so [0x2b73ec709ecf]
[cut1n7:16368] [ 5] mpirun [0x40335a]
[cut1n7:16368] [ 6] mpirun [0x4029f3]
[cut1n7:16368] [ 7] /lib64/libc.so.6(__libc_start_main+0xf4) [0x37bce1d8b4]
[cut1n7:16368] [ 8] mpirun [0x402929]
[cut1n7:16368] *** End of error message ***
Segmentation fault
---
This is ompi 1.4.2, gcc 4.1.1 and slurm 2.0.9 ... I'm sure it's a rather silly detail on my end, but figure I should start this thread for any insights and feedback I can help provide to resolve this.

Thanks,
-cdm
_______________________________________________
users mailing list
users@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users