
Subject: Re: [OMPI users] sge tight intregration leads to bad allocation
From: Tom Bryan (tombry_at_[hidden])
Date: 2012-04-03 09:49:23


How are you launching the application?

I had an app that did a Spawn_multiple with tight SGE integration, and
there was a difference in behavior depending on whether or not the app was
launched via mpiexec. I'm not sure whether it's the same issue as you're
seeing, but Reuti describes the problem here:
http://www.open-mpi.org/community/lists/users/2012/01/18348.php
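For what it's worth, the pattern I mean is roughly the following sketch, not
our actual code; the "worker" binary name and the count of three children
are made up:

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        /* One parent process asks the runtime for three copies of a
         * separate "worker" binary at run time. */
        char *cmds[1]    = { "worker" };      /* placeholder binary name */
        int maxprocs[1]  = { 3 };
        MPI_Info info[1] = { MPI_INFO_NULL };
        int errcodes[3];
        MPI_Comm intercomm;

        MPI_Init(&argc, &argv);

        /* Under tight SGE integration the spawned children are supposed to
         * land on the slots the scheduler granted to the job. */
        MPI_Comm_spawn_multiple(1, cmds, MPI_ARGVS_NULL, maxprocs, info,
                                0, MPI_COMM_SELF, &intercomm, errcodes);

        MPI_Comm_disconnect(&intercomm);
        MPI_Finalize();
        return 0;
    }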

It will be resolved at some point, but I imagine that the fix will only go
into new releases:
http://www.open-mpi.org/community/lists/users/2012/02/18399.php

In my case, the workaround was simply to launch the app with mpiexec, after
which the allocation was handled correctly.
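
In case a concrete example helps, our submit script now boils down to
something like the following; the job name, PE name, slot count, and binary
are placeholders for whatever your site uses:

    #!/bin/sh
    #$ -N spawn_job
    #$ -pe orte 4    # a PE set up for tight integration (control_slaves TRUE)
    #$ -cwd
    # Starting the parent through mpiexec lets Open MPI pick up the SGE
    # allocation; the parent then spawns onto the remaining granted slots.
    mpiexec -np 1 ./spawner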

---Tom

On 4/3/12 9:23 AM, "Eloi Gaudry" <eloi.gaudry_at_[hidden]> wrote:

> Hi,
>
>
>
> I've observed a strange behavior in rank allocation for a distributed run
> scheduled and submitted with SGE (Son of Grid Engine 8.0.0d) and Open MPI 1.4.4.
>
> Briefly, there is a one-slot difference between what SGE allocates and what
> Open MPI actually uses, with the result that one node becomes oversubscribed
> at runtime.
>
>
>
> Here is the allocation as read from gridengine:
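>
> (For reference, the two dumps below are what mpirun prints when its
> -display-allocation and -display-map options are enabled; the command line
> shown here is only an illustration, not our actual one:
>
>     mpirun -display-allocation -display-map -np 4 ./solver
>
> where ./solver stands in for our application.)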
>
>
>
> ====================== ALLOCATED NODES ======================
>
>  Data for node: Name: barney  Launch id: -1  Arch: ffc91200  State: 2
>      Num boards: 1  Num sockets/board: 2  Num cores/socket: 2
>      Daemon: [[22904,0],0]  Daemon launched: True
>      Num slots: 1  Slots in use: 0
>      Num slots allocated: 1  Max slots: 0
>      Username on node: NULL
>      Num procs: 0  Next node_rank: 0
>  Data for node: Name: carl.fft  Launch id: -1  Arch: 0  State: 2
>      Num boards: 1  Num sockets/board: 2  Num cores/socket: 2
>      Daemon: Not defined  Daemon launched: False
>      Num slots: 1  Slots in use: 0
>      Num slots allocated: 1  Max slots: 0
>      Username on node: NULL
>      Num procs: 0  Next node_rank: 0
>  Data for node: Name: charlie.fft  Launch id: -1  Arch: 0  State: 2
>      Num boards: 1  Num sockets/board: 2  Num cores/socket: 2
>      Daemon: Not defined  Daemon launched: False
>      Num slots: 2  Slots in use: 0
>      Num slots allocated: 2  Max slots: 0
>      Username on node: NULL
>      Num procs: 0  Next node_rank: 0
>
> =================================================================
>
> And here is the allocation finally used:
>
>  Map generated by mapping policy: 0200
>      Npernode: 0  Oversubscribe allowed: TRUE  CPU Lists: FALSE
>      Num new daemons: 2  New daemon starting vpid 1
>      Num nodes: 3
>
>  Data for node: Name: barney  Launch id: -1  Arch: ffc91200  State: 2
>      Num boards: 1  Num sockets/board: 2  Num cores/socket: 2
>      Daemon: [[22904,0],0]  Daemon launched: True
>      Num slots: 1  Slots in use: 2
>      Num slots allocated: 1  Max slots: 0
>      Username on node: NULL
>      Num procs: 2  Next node_rank: 2
>      Data for proc: [[22904,1],0]
>          Pid: 0  Local rank: 0  Node rank: 0
>          State: 0  App_context: 0  Slot list: NULL
>      Data for proc: [[22904,1],3]
>          Pid: 0  Local rank: 1  Node rank: 1
>          State: 0  App_context: 0  Slot list: NULL
>  Data for node: Name: carl.fft  Launch id: -1  Arch: 0  State: 2
>      Num boards: 1  Num sockets/board: 2  Num cores/socket: 2
>      Daemon: [[22904,0],1]  Daemon launched: False
>      Num slots: 1  Slots in use: 1
>      Num slots allocated: 1  Max slots: 0
>      Username on node: NULL
>      Num procs: 1  Next node_rank: 1
>      Data for proc: [[22904,1],1]
>          Pid: 0  Local rank: 0  Node rank: 0
>          State: 0  App_context: 0  Slot list: NULL
>  Data for node: Name: charlie.fft  Launch id: -1  Arch: 0  State: 2
>      Num boards: 1  Num sockets/board: 2  Num cores/socket: 2
>      Daemon: [[22904,0],2]  Daemon launched: False
>      Num slots: 2  Slots in use: 1
>      Num slots allocated: 2  Max slots: 0
>      Username on node: NULL
>      Num procs: 1  Next node_rank: 1
>      Data for proc: [[22904,1],2]
>          Pid: 0  Local rank: 0  Node rank: 0
>          State: 0  App_context: 0  Slot list: NULL
>
>
>
> Has anyone already encountered the same behavior?
>
> Is there a simpler fix than not using the tight integration mode between SGE
> and Open MPI?
>
>
>
> Eloi
>
>
>
>