Open MPI User's Mailing List Archives


From: Andrus, Mr. Brian (Contractor) (brian.andrus_at_[hidden])
Date: 2007-11-07 19:37:05


I have checked those out.

I am trying to verify the limits. If I ssh directly to a node and check,
everything looks fine:
[andrus_at_login1 ~]$ ssh n01 ulimit -l
unlimited

The settings in /etc/security/limits.conf are right too.
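
For comparison, here is a minimal sketch of a check that could be run both over ssh and from inside a Torque job script (for example, just before the mpirun line), so the two environments can be compared directly. The function name report_memlock is hypothetical; ulimit -S/-H and uname -n are standard shell facilities.

```shell
#!/bin/bash
# Print the soft and hard locked-memory limits that this shell inherited,
# tagged with the node name so output from several nodes can be told apart.
report_memlock() {
    printf 'host=%s soft=%s hard=%s\n' \
        "$(uname -n)" "$(ulimit -S -l)" "$(ulimit -H -l)"
}

report_memlock
```

If the values printed inside the job differ from those printed over ssh, the limit is being set by whatever launches the job rather than by the login environment.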

Brian Andrus perotsystems
Site Manager | Sr. Computer Scientist
Naval Research Lab
7 Grace Hopper Ave, Monterey, CA 93943
Phone (831) 656-4839 | Fax (831) 656-4866

-----Original Message-----
From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On Behalf Of Jeff Squyres
Sent: Wednesday, November 07, 2007 4:26 PM
To: Open MPI Users
Subject: Re: [OMPI users] openib errors as user, but not root

Check out:

http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages-more

In particular, see the stuff about using resource managers.
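
The resource-manager angle matters because processes launched by a daemon inherit that daemon's limits rather than those of a fresh login shell. As a rough sketch, assuming a Linux kernel new enough to expose /proc/<pid>/limits and that the Torque daemon on the compute node is named pbs_mom, the limit the daemon is running with could be read like this (memlock_from_limits is a hypothetical helper name):

```shell
#!/bin/bash
# Print the soft and hard "Max locked memory" values from a file in the
# format of /proc/<pid>/limits (columns: name, soft limit, hard limit, units).
memlock_from_limits() {
    awk '/^Max locked memory/ { print $4, $5 }' "$1"
}
```

On a compute node this might be called as memlock_from_limits "/proc/$(pgrep -o pbs_mom)/limits". A finite value there, rather than "unlimited", would be consistent with jobs seeing a low memlock limit even when an ssh login does not.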

On Nov 7, 2007, at 7:22 PM, Andrus, Mr. Brian (Contractor) wrote:

> Ok, I am having some difficulty troubleshooting this.
>
> If I run my hello program without torque, it works fine:
> [root_at_login1 root]# mpirun --mca btl openib,self -host n01,n02,n03,n04,n05 /data/root/hello
> Hello from process 0 of 5 on node n01
> Hello from process 1 of 5 on node n02
> Hello from process 2 of 5 on node n03
> Hello from process 3 of 5 on node n04
> Hello from process 4 of 5 on node n05
>
> If I submit it as root, it seems happy:
> [root_at_login1 root]# qsub
> #!/bin/bash
> #PBS -j oe
> #PBS -l nodes=5:ppn=1
> #PBS -W x=NACCESSPOLICY:SINGLEJOB
> #PBS -N TestJob
> #PBS -q long
> #PBS -o output.txt
> #PBS -V
> cd $PBS_O_WORKDIR
> rm -f output.txt
> date
> mpirun --mca btl openib,self /data/root/hello
> 103.cluster.default.domain
> [root_at_login1 root]# cat output.txt
> Wed Nov 7 16:20:33 PST 2007
> Hello from process 0 of 5 on node n05
> Hello from process 1 of 5 on node n04
> Hello from process 2 of 5 on node n03
> Hello from process 3 of 5 on node n02
> Hello from process 4 of 5 on node n01
>
> If I do it as me, not so good:
> [andrus_at_login1 data]$ qsub
> #!/bin/bash
> #PBS -j oe
> #PBS -l nodes=1:ppn=1
> #PBS -W x=NACCESSPOLICY:SINGLEJOB
> #PBS -N TestJob
> #PBS -q long
> #PBS -o output.txt
> #PBS -V
> cd $PBS_O_WORKDIR
> rm -f output.txt
> date
> mpirun --mca btl openib,self /data/root/hello
> 105.littlemac.default.domain
> [andrus_at_login1 data]$ cat output.txt
> Wed Nov 7 16:23:00 PST 2007
> --------------------------------------------------------------------------
> The OpenIB BTL failed to initialize while trying to allocate some
> locked memory. This typically can indicate that the memlock limits
> are set too low. For most HPC installations, the memlock limits
> should be set to "unlimited". The failure occured here:
>
> Host: n01
> OMPI source: btl_openib.c:828
> Function: ibv_create_cq()
> Device: mthca0
> Memlock limit: 32768
>
> You may need to consult with your system administrator to get this
> problem fixed. This FAQ entry on the Open MPI web site may also be
> helpful:
>
> http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel
> process is likely to abort. There are many reasons that a parallel
> process can fail during MPI_INIT; some of which are due to
> configuration or environment problems. This failure appears to be an
> internal failure; here's some additional information (which may only
> be relevant to an Open MPI developer):
>
> PML add procs failed
> --> Returned "Error" (-1) instead of "Success" (0)
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (goodbye)
>
>
>
> I have checked that ulimit is unlimited. I cannot seem to figure this out.
> Any help?
> Brian Andrus perotsystems
> Site Manager | Sr. Computer Scientist
> Naval Research Lab
> 7 Grace Hopper Ave, Monterey, CA 93943
> Phone (831) 656-4839 | Fax (831) 656-4866
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users

--
Jeff Squyres
Cisco Systems