On Dec 5, 2011, at 5:49 AM, arnaud Heritier wrote:

Hello,

I found the solution, thanks to Qlogic support.

The "can't open /dev/ipath, network down (err=26)" message from the ipath driver is really misleading.

Actually, this is a hardware context problem in the QLogic PSM. PSM can't allocate any hardware context for the job because other MPI job(s) have already used all available contexts. To avoid this problem, every MPI job has to set the PSM_SHAREDCONTEXTS_MAX variable to the good value, according to the number of processes that will run on the node. If we don't set this variable, PSM will "greedily" use all contexts for the first MPI job spawned on the node.

Sounds like we should be setting this value when starting the process - yes? If so, what is the "good" value, and how do we compute it?
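
One possible way to compute it, as a rough sketch only: give each job a share of the node's hardware contexts proportional to the number of processes it runs on that node, rounded down so that the shares of all jobs never add up to more than the node has. This proportional policy is an assumption, not something QLogic support stated. The helper names (psm_context_cap, mpirun_command) and the node_contexts input are hypothetical; the real number of contexts per HCA has to come from the QLogic documentation for the card. The only Open MPI feature used is mpirun's -x option for exporting an environment variable.

```python
def psm_context_cap(job_procs, node_slots, node_contexts):
    """Return a per-job value for PSM_SHAREDCONTEXTS_MAX.

    Assumed policy (not from QLogic): a job running job_procs of the
    node's node_slots process slots gets the same fraction of the
    node's node_contexts hardware contexts, rounded down.
    """
    if job_procs <= 0 or node_slots <= 0 or node_contexts <= 0:
        raise ValueError("all arguments must be positive")
    if job_procs > node_slots:
        raise ValueError("job_procs cannot exceed node_slots")
    share = (node_contexts * job_procs) // node_slots
    # Never return 0: a job needs at least one context to open the device.
    return max(1, share)


def mpirun_command(app, job_procs, node_slots, node_contexts):
    """Build an mpirun argument list that exports the computed cap."""
    cap = psm_context_cap(job_procs, node_slots, node_contexts)
    return [
        "mpirun",
        "-np", str(job_procs),
        "-x", "PSM_SHAREDCONTEXTS_MAX=%d" % cap,  # export to all ranks
        app,
    ]
```

For example, mpirun_command("./a.out", 8, 16, 16) would cap an 8-process job at 8 contexts on a node assumed to have 16 slots and 16 contexts, leaving the other 8 for a second job. Rounding down can leave a few contexts unused, which seemed safer than letting the last job fail to open the device.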


Regards,

Arnaud


On Tue, Nov 29, 2011 at 6:44 PM, Jeff Squyres <jsquyres@cisco.com> wrote:
On Nov 28, 2011, at 11:53 PM, arnaud Heritier wrote:

> I do have a contract and I tried to open a case, but their support is ......

What happens if you put a delay between the two jobs?  E.g., if you just delay a few seconds before the 2nd job starts?  Perhaps the ipath device just needs a little time before it will be available...?  (that's a total guess)

I suggest this because the PSM device will definitely give you better overall performance than the QLogic verbs support.  Their verbs support basically barely works -- PSM is their primary device and the one that we always recommend.

> Anyway. I'm still working on the strange error message from mpirun saying it can't allocate memory when at the same time it also reports that the memory is unlimited ...
>
>
> Arnaud
>
> On Tue, Nov 29, 2011 at 4:23 AM, Jeff Squyres <jsquyres@cisco.com> wrote:
> I'm afraid we don't have any contacts left at QLogic to ask them any more... do you have a support contract, perchance?
>
> On Nov 27, 2011, at 3:11 PM, Arnaud Heritier wrote:
>
> > Hello,
> >
> > I ran into a strange problem with QLogic OFED and Open MPI. When I submit (through SGE) 2 jobs on the same node, the second job ends up with:
> >
> > (ipath/PSM)[10292]: can't open /dev/ipath, network down (err=26)
> >
> > I'm pretty sure the InfiniBand fabric is working well, as the other job runs fine.
> >
> > Here are details about the configuration:
> >
> > Qlogic HCA: InfiniPath_QMH7342 (2 ports but only one connected to a switch)
> > qlogic_ofed-1.5.3-7.0.0.0.35 (rocks cluster roll)
> > openmpi 1.5.4 (./configure --with-psm --with-openib --with-sge)
> >
> > -------------
> >
> > In order to fix this problem I recompiled openmpi without PSM support, but I faced another problem:
> >
> > The OpenFabrics (openib) BTL failed to initialize while trying to
> > allocate some locked memory.  This typically can indicate that the
> > memlock limits are set too low.  For most HPC installations, the
> > memlock limits should be set to "unlimited".  The failure occured
> > here:
> >
> >   Local host:    compute-0-6.local
> >   OMPI source:   btl_openib.c:329
> >   Function:      ibv_create_srq()
> >   Device:        qib0
> >   Memlock limit: unlimited
> >
> >
> > _______________________________________________
> > users mailing list
> > users@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> --
> Jeff Squyres
> jsquyres@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>


--
Jeff Squyres
jsquyres@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

