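Hmmm... well, according to this, it looks like the process ranks are being incorrectly assigned: in the -npernode map, both procs come out as [[51942,1],0], whereas the -np 2 map has distinct names ([[51868,1],0] and [[51868,1],1]). That shouldn't have anything to do with which environment we are in (slurm, rsh, etc.).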
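
As a rough illustration of the symptom (a sketch only, not anything that exists in ORTE): every "Data for proc" entry in a devel map should carry a process name that is unique within the job. The Python snippet below scans pasted map text for names that show up more than once. The helper name find_duplicate_procs and both regular expressions are assumptions based on the output layout quoted in this thread, and the two sample strings are abbreviated copies of the quoted maps.

```python
import re
from collections import defaultdict

# Both patterns are assumptions drawn from the --display-devel-map text in this thread.
NODE_RE = re.compile(r"Data for node: Name:\s*(\S+)")
PROC_RE = re.compile(r"Data for proc:\s*(\[\[\d+,\d+\],\d+\])")


def find_duplicate_procs(map_text):
    """Return {proc_name: [node, ...]} for proc names listed more than once."""
    nodes_by_proc = defaultdict(list)
    current_node = None
    for line in map_text.splitlines():
        node_match = NODE_RE.search(line)
        if node_match:
            # Remember which node the following proc entries belong to.
            current_node = node_match.group(1)
            continue
        proc_match = PROC_RE.search(line)
        if proc_match:
            nodes_by_proc[proc_match.group(1)].append(current_node)
    return {name: nodes for name, nodes in nodes_by_proc.items() if len(nodes) > 1}


# Abbreviated node/proc lines from the -np 2 map.
NP_MAP = """\
 Data for node: Name: cut1n7            Launch id: -1
        Data for proc: [[51868,1],0]
 Data for node: Name: cut1n8            Launch id: -1
        Data for proc: [[51868,1],1]
"""

# Abbreviated node/proc lines from the -npernode 1 map.
NPERNODE_MAP = """\
 Data for node: Name: cut1n7            Launch id: -1
        Data for proc: [[51942,1],0]
 Data for node: Name: cut1n8            Launch id: -1
        Data for proc: [[51942,1],0]
"""

if __name__ == "__main__":
    print(find_duplicate_procs(NP_MAP))        # {}
    print(find_duplicate_procs(NPERNODE_MAP))  # {'[[51942,1],0]': ['cut1n7', 'cut1n8']}
```

The -np 2 map comes back clean, while the -npernode map reports the same name on both nodes.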

I'll look into it - thanks!

On Mon, May 17, 2010 at 4:25 PM, Christopher Maestas <cdmaestas@gmail.com> wrote:
OK.  The -np only run:
---
sh-3.1$ mpirun -np 2 --display-allocation --display-devel-map mpi_hello

======================   ALLOCATED NODES   ======================

 Data for node: Name: cut1n7            Launch id: -1   Arch: ffc91200  State: 2
        Num boards: 1   Num sockets/board: 2    Num cores/socket: 4
        Daemon: [[51868,0],0]   Daemon launched: True
        Num slots: 1    Slots in use: 0
        Num slots allocated: 1  Max slots: 0
        Username on node: NULL
        Num procs: 0    Next node_rank: 0
 Data for node: Name: cut1n8            Launch id: -1   Arch: 0 State: 2
        Num boards: 1   Num sockets/board: 2    Num cores/socket: 4
        Daemon: Not defined     Daemon launched: False
        Num slots: 0    Slots in use: 0
        Num slots allocated: 0  Max slots: 0
        Username on node: NULL
        Num procs: 0    Next node_rank: 0

=================================================================

 Map generated by mapping policy: 0400
        Npernode: 0     Oversubscribe allowed: TRUE     CPU Lists: FALSE
        Num new daemons: 1      New daemon starting vpid 1
        Num nodes: 2

 Data for node: Name: cut1n7            Launch id: -1   Arch: ffc91200  State: 2
        Num boards: 1   Num sockets/board: 2    Num cores/socket: 4
        Daemon: [[51868,0],0]   Daemon launched: True
        Num slots: 1    Slots in use: 1
        Num slots allocated: 1  Max slots: 0
        Username on node: NULL
        Num procs: 1    Next node_rank: 1
        Data for proc: [[51868,1],0]
                Pid: 0  Local rank: 0   Node rank: 0
                State: 0        App_context: 0  Slot list: NULL

 Data for node: Name: cut1n8            Launch id: -1   Arch: 0 State: 2
        Num boards: 1   Num sockets/board: 2    Num cores/socket: 4
        Daemon: [[51868,0],1]   Daemon launched: False
        Num slots: 0    Slots in use: 1
        Num slots allocated: 0  Max slots: 0
        Username on node: NULL
        Num procs: 1    Next node_rank: 1
        Data for proc: [[51868,1],1]
                Pid: 0  Local rank: 0   Node rank: 0
                State: 0        App_context: 0  Slot list: NULL
Hello, I am node cut1n8 with rank 1
Hello, I am node cut1n7 with rank 0

---

With -npernode, this is what I got before the segfault:
---
sh-3.1$ mpirun -npernode 1 --display-allocation --display-devel-map mpi_hello

======================   ALLOCATED NODES   ======================

 Data for node: Name: cut1n7            Launch id: -1   Arch: ffc91200  State: 2
        Num boards: 1   Num sockets/board: 2    Num cores/socket: 4
        Daemon: [[51942,0],0]   Daemon launched: True
        Num slots: 1    Slots in use: 0
        Num slots allocated: 1  Max slots: 0
        Username on node: NULL
        Num procs: 0    Next node_rank: 0
 Data for node: Name: cut1n8            Launch id: -1   Arch: 0 State: 2
        Num boards: 1   Num sockets/board: 2    Num cores/socket: 4
        Daemon: Not defined     Daemon launched: False
        Num slots: 0    Slots in use: 0
        Num slots allocated: 0  Max slots: 0
        Username on node: NULL
        Num procs: 0    Next node_rank: 0
=================================================================

 Map generated by mapping policy: 0400
        Npernode: 1     Oversubscribe allowed: TRUE     CPU Lists: FALSE
        Num new daemons: 1      New daemon starting vpid 1
        Num nodes: 2

 Data for node: Name: cut1n7            Launch id: -1   Arch: ffc91200  State: 2
        Num boards: 1   Num sockets/board: 2    Num cores/socket: 4
        Daemon: [[51942,0],0]   Daemon launched: True
        Num slots: 1    Slots in use: 1
        Num slots allocated: 1  Max slots: 0
        Username on node: NULL
        Num procs: 1    Next node_rank: 1
        Data for proc: [[51942,1],0]
                Pid: 0  Local rank: 0   Node rank: 0
                State: 0        App_context: 0  Slot list: NULL

 Data for node: Name: cut1n8            Launch id: -1   Arch: 0 State: 2
        Num boards: 1   Num sockets/board: 2    Num cores/socket: 4
        Daemon: [[51942,0],1]   Daemon launched: False
        Num slots: 0    Slots in use: 1
        Num slots allocated: 0  Max slots: 0
        Username on node: NULL
        Num procs: 1    Next node_rank: 1
        Data for proc: [[51942,1],0]
                Pid: 0  Local rank: 0   Node rank: 0
                State: 0        App_context: 0  Slot list: NULL
[cut1n7:19375] *** Process received signal ***
[cut1n7:19375] Signal: Segmentation fault (11)
[cut1n7:19375] Signal code: Address not mapped (1)
[cut1n7:19375] Failing at address: 0x50
[cut1n7:19375] [ 0] /lib64/libpthread.so.0 [0x37bda0de80]
[cut1n7:19375] [ 1] /apps/mpi/openmpi/1.4.2-gcc-4.1.2-may.12.10/lib/libopen-rte.so.0(orte_util_encode_pidmap+0xdb) [0x2aed0f93af8b]  
[cut1n7:19375] [ 2] /apps/mpi/openmpi/1.4.2-gcc-4.1.2-may.12.10/lib/libopen-rte.so.0(orte_odls_base_default_get_add_procs_data+0x655) [0x2aed0f9462f5]
[cut1n7:19375] [ 3] /apps/mpi/openmpi/1.4.2-gcc-4.1.2-may.12.10/lib/libopen-rte.so.0(orte_plm_base_launch_apps+0x10b) [0x2aed0f94d31b]
[cut1n7:19375] [ 4] /apps/mpi/openmpi/1.4.2-gcc-4.1.2-may.12.10/lib/openmpi/mca_plm_slurm.so [0x2aed107f6ecf]
[cut1n7:19375] [ 5] mpirun [0x40335a]
[cut1n7:19375] [ 6] mpirun [0x4029f3]
[cut1n7:19375] [ 7] /lib64/libc.so.6(__libc_start_main+0xf4) [0x37bce1d8b4]
[cut1n7:19375] [ 8] mpirun [0x402929]
[cut1n7:19375] *** End of error message ***
Segmentation fault
---
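
The named frames in that trace place the crash in orte_util_encode_pidmap, with mca_plm_slurm.so further up the stack. For anyone who wants to pull those frames out of a longer log, here is a minimal Python sketch. The helper name named_frames is hypothetical, and the regular expression assumes the backtrace layout shown above ("[ N] /path/lib.so(symbol+0xOFF) [0xADDR]"); frames without a symbol in parentheses are skipped.

```python
import re

# Assumed frame layout: "[host:pid] [ N] /path/lib.so(symbol+0xOFF) [0xADDR]"
FRAME_RE = re.compile(r"\[\s*(\d+)\]\s+\S+\((\w+)\+(0x[0-9a-f]+)\)")


def named_frames(trace_text):
    """Return (frame_index, symbol, offset) for each frame that names a symbol."""
    frames = []
    for line in trace_text.splitlines():
        match = FRAME_RE.search(line)
        if match:
            frames.append((int(match.group(1)), match.group(2), match.group(3)))
    return frames


# A few frames copied from the trace above.
TRACE = """\
[cut1n7:19375] [ 0] /lib64/libpthread.so.0 [0x37bda0de80]
[cut1n7:19375] [ 1] /apps/mpi/openmpi/1.4.2-gcc-4.1.2-may.12.10/lib/libopen-rte.so.0(orte_util_encode_pidmap+0xdb) [0x2aed0f93af8b]
[cut1n7:19375] [ 2] /apps/mpi/openmpi/1.4.2-gcc-4.1.2-may.12.10/lib/libopen-rte.so.0(orte_odls_base_default_get_add_procs_data+0x655) [0x2aed0f9462f5]
[cut1n7:19375] [ 5] mpirun [0x40335a]
"""

if __name__ == "__main__":
    for index, symbol, offset in named_frames(TRACE):
        print(index, symbol, offset)
```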

I'll look into a SLURM version update.  Previously, we had SLURM 1.0.30 and Open MPI 1.3.2 working together.  I'm just curious what is giving me heartache here ...

On Mon, May 17, 2010 at 4:06 PM, Ralph Castain <rhc@open-mpi.org> wrote:
That's a pretty old version of slurm - I don't have access to anything that old to test against. You could try running it with --display-allocation --display-devel-map to see what ORTE thinks the allocation is and how it mapped the procs. It sounds like something may be having a problem there...


On Mon, May 17, 2010 at 11:08 AM, Christopher Maestas <cdmaestas@gmail.com> wrote:
Hello,

I've been having some trouble with Open MPI 1.4.x and SLURM recently.  I seem to be able to run jobs this way OK:
---
sh-3.1$ mpirun -np 2 mpi_hello
Hello, I am node cut1n7 with rank 0
Hello, I am node cut1n8 with rank 1
---

However, if I try to use the -npernode option, I get:
---
sh-3.1$ mpirun -npernode 1 mpi_hello
[cut1n7:16368] *** Process received signal ***
[cut1n7:16368] Signal: Segmentation fault (11)
[cut1n7:16368] Signal code: Address not mapped (1)
[cut1n7:16368] Failing at address: 0x50
[cut1n7:16368] [ 0] /lib64/libpthread.so.0 [0x37bda0de80]
[cut1n7:16368] [ 1] /apps/mpi/openmpi/1.4.2-gcc-4.1.2-may.12.10/lib/libopen-rte.so.0(orte_util_encode_pidmap+0xdb) [0x2b73eb84df8b]
[cut1n7:16368] [ 2] /apps/mpi/openmpi/1.4.2-gcc-4.1.2-may.12.10/lib/libopen-rte.so.0(orte_odls_base_default_get_add_procs_data+0x655) [0x2b73eb8592f5]
[cut1n7:16368] [ 3] /apps/mpi/openmpi/1.4.2-gcc-4.1.2-may.12.10/lib/libopen-rte.so.0(orte_plm_base_launch_apps+0x10b) [0x2b73eb86031b]
[cut1n7:16368] [ 4] /apps/mpi/openmpi/1.4.2-gcc-4.1.2-may.12.10/lib/openmpi/mca_plm_slurm.so [0x2b73ec709ecf]
[cut1n7:16368] [ 5] mpirun [0x40335a]
[cut1n7:16368] [ 6] mpirun [0x4029f3]
[cut1n7:16368] [ 7] /lib64/libc.so.6(__libc_start_main+0xf4) [0x37bce1d8b4]
[cut1n7:16368] [ 8] mpirun [0x402929]
[cut1n7:16368] *** End of error message ***
Segmentation fault
---

This is Open MPI 1.4.2, gcc 4.1.1, and SLURM 2.0.9. I'm sure it's a rather silly detail on my end, but I figured I should start this thread to get any insights and to offer whatever feedback I can provide to help resolve this.

Thanks,
-cdm

_______________________________________________
users mailing list
users@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

