
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts
From: Chris Jewell (chris.jewell_at_[hidden])
Date: 2010-11-13 09:39:15


Hi Dave, Reuti,

Sorry for kicking off this thread and then disappearing; I've been away for a bit. Anyway, Dave, I'm glad you experienced the same issue I had with my installation of SGE 6.2u5 and OpenMPI with core binding. Namely, with 'qsub -pe openmpi 8 -binding set linear:1 <myscript.com>', if two or more of the parallel processes get scheduled to the same execution node, they all end up bound to the same core. Not good!

I've been playing around quite a bit trying to understand this issue, and ended up on the GE dev list:

http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=39&dsMessageId=285878

It seems that most people expect calls to 'qrsh -inherit' (which I assume OpenMPI uses to start parallel processes in the reserved GE slots) to activate a separate binding for each process. This does not appear to be the case. I *was* hoping that using '-binding pe linear:1' might enable me to write a script that reads the pe_hostfile and creates a machine file for OpenMPI, but this fails because GE does not appear to report which cores are unbound, only the number required.
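
For reference, the hostfile-conversion half of that idea is easy enough. Below is a minimal sketch in Python, assuming the usual pe_hostfile layout of one line per host with whitespace-separated fields (host name, slot count, queue instance, processor range) and that GE exports the file's path in $PE_HOSTFILE. The function names are my own, purely for illustration. It only copies host names and slot counts into "host slots=N" lines; it does nothing about choosing free cores, which is exactly the missing piece.

```python
import os


def parse_pe_hostfile(text):
    """Return (host, slots) pairs from the text of a GE pe_hostfile.

    Assumed layout: "<host> <slots> <queue> <processor range>" per line.
    Only the first two fields are used here.
    """
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        entries.append((fields[0], int(fields[1])))
    return entries


def to_openmpi_hostfile(entries):
    """Format (host, slots) pairs as Open MPI hostfile lines."""
    return "".join(f"{host} slots={slots}\n" for host, slots in entries)


if __name__ == "__main__":
    # Inside a running parallel job, PE_HOSTFILE points at the file to read.
    path = os.environ.get("PE_HOSTFILE")
    if path:
        with open(path) as fh:
            print(to_openmpi_hostfile(parse_pe_hostfile(fh.read())), end="")
```

The output could be redirected to a file in the job script and handed to mpirun with its hostfile option, but since the core information is not there, the processes would still be unpinned or doubly pinned.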

So, for now, my solution has been to use a JSV to remove core binding for the MPI jobs (but retain it for serial and SMP jobs). Any more ideas??

Cheers,

Chris

(PS. Dave: how is my alma mater these days??)

--
Dr Chris Jewell
Department of Statistics
University of Warwick
Coventry
CV4 7AL
UK
Tel: +44 (0)24 7615 0778