
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] openib errors as user, but not root
From: pat.o'bryant_at_[hidden]
Date: 2007-11-08 08:54:54


What we discovered is that our PBS mom daemon did not have unlimited locked
memory. When the mom daemon creates your job, the job inherits the daemon's
memory limits. The fix was to cycle the PBS mom daemon after a boot (and
yes, we do start the mom daemon at boot, but for some reason it doesn't
get unlimited locked memory at that point). To determine whether this is
your problem, place a "ulimit -a" in the text of your PBS job. Run your job
and you will see a locked-memory limit of 32K. Next, cycle the mom daemon
on the node(s) of interest and re-run your job. You will now see unlimited
locked memory.
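
As a rough sketch of that check, a small job script along these lines could
be submitted with qsub. The job name and the memlock_status helper are
made-up names for illustration, and the #PBS directives are only an example;
adjust them for your site.

```shell
#!/bin/bash
#PBS -j oe
#PBS -l nodes=1:ppn=1
#PBS -N MemlockCheck

# Hypothetical helper: classify a locked-memory limit as printed by
# "ulimit -l" (either the word "unlimited" or a size in kilobytes).
memlock_status() {
    if [ "$1" = "unlimited" ]; then
        echo "ok: locked memory is unlimited"
    else
        echo "too low: locked memory is limited to $1 KB"
    fi
}

# Inside a job, this reports the limit inherited from the mom daemon,
# which can differ from what an interactive login shell shows.
echo "host: $(uname -n)"
memlock_status "$(ulimit -l)"
```

Comparing the job output before and after cycling the mom daemon shows
whether the daemon's limit was the culprit.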
         Thanks,
          Pat O'Bryant

J.W. (Pat) O'Bryant,Jr.
Business Line Infrastructure
Technical Systems, HPC

                                                                           
From: Jeff Squyres <jsquyres_at_cisco.com>
Sent by: users-bounces@open-mpi.org
Date: 11/07/07 06:25 PM
To: Open MPI Users <users_at_[hidden]>
Subject: Re: [OMPI users] openib errors as user, but not root
Please respond to: Open MPI Users <users_at_open-mpi.org>

Check out:

http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages-more

In particular, see the stuff about using resource managers.
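
One way to look at the limit the resource manager daemon itself is running
with, as a sketch only: it assumes the daemon process is named pbs_mom and
that the kernel on the compute node exposes /proc/<pid>/limits (not all
kernels do). The memlock_of helper is a hypothetical name.

```shell
#!/bin/bash

# Hypothetical helper: print the "Max locked memory" row from a file
# in the /proc/<pid>/limits format.
memlock_of() {
    grep -i '^Max locked memory' "$1"
}

# Assumption: the mom daemon shows up as "pbs_mom" on this node.
# pgrep -o picks the oldest matching process.
if pid=$(pgrep -o pbs_mom 2>/dev/null); then
    memlock_of "/proc/$pid/limits"
else
    echo "pbs_mom does not appear to be running on this host"
fi
```

If the row shows a small byte count rather than "unlimited", jobs started by
that daemon will inherit the same low limit.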

On Nov 7, 2007, at 7:22 PM, Andrus, Mr. Brian (Contractor) wrote:

> Ok, I am having some difficulty troubleshooting this.
>
> If I run my hello program without torque, it works fine:
> [root_at_login1 root]# mpirun --mca btl openib,self -host n01,n02,n03,n04,n05 /data/root/hello
> Hello from process 0 of 5 on node n01
> Hello from process 1 of 5 on node n02
> Hello from process 2 of 5 on node n03
> Hello from process 3 of 5 on node n04
> Hello from process 4 of 5 on node n05
>
> If I submit it as root, it seems happy:
> [root_at_login1 root]# qsub
> #!/bin/bash
> #PBS -j oe
> #PBS -l nodes=5:ppn=1
> #PBS -W x=NACCESSPOLICY:SINGLEJOB
> #PBS -N TestJob
> #PBS -q long
> #PBS -o output.txt
> #PBS -V
> cd $PBS_O_WORKDIR
> rm -f output.txt
> date
> mpirun --mca btl openib,self /data/root/hello
> 103.cluster.default.domain
> [root_at_login1 root]# cat output.txt
> Wed Nov 7 16:20:33 PST 2007
> Hello from process 0 of 5 on node n05
> Hello from process 1 of 5 on node n04
> Hello from process 2 of 5 on node n03
> Hello from process 3 of 5 on node n02
> Hello from process 4 of 5 on node n01
>
> If I do it as me, not so good:
> [andrus_at_login1 data]$ qsub
> #!/bin/bash
> #PBS -j oe
> #PBS -l nodes=1:ppn=1
> #PBS -W x=NACCESSPOLICY:SINGLEJOB
> #PBS -N TestJob
> #PBS -q long
> #PBS -o output.txt
> #PBS -V
> cd $PBS_O_WORKDIR
> rm -f output.txt
> date
> mpirun --mca btl openib,self /data/root/hello
> 105.littlemac.default.domain
> [andrus_at_login1 data]$ cat output.txt
> Wed Nov 7 16:23:00 PST 2007
> --------------------------------------------------------------------------
> The OpenIB BTL failed to initialize while trying to allocate some
> locked memory. This typically can indicate that the memlock limits
> are set too low. For most HPC installations, the memlock limits
> should be set to "unlimited". The failure occured here:
>
> Host: n01
> OMPI source: btl_openib.c:828
> Function: ibv_create_cq()
> Device: mthca0
> Memlock limit: 32768
>
> You may need to consult with your system administrator to get this
> problem fixed. This FAQ entry on the Open MPI web site may also be
> helpful:
>
> http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems. This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
> PML add procs failed
> --> Returned "Error" (-1) instead of "Success" (0)
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (goodbye)
>
>
>
> I have checked that ulimit is unlimited. I cannot seem to figure
> this out. Any help?
> Brian Andrus perotsystems
> Site Manager | Sr. Computer Scientist
> Naval Research Lab
> 7 Grace Hopper Ave, Monterey, CA 93943
> Phone (831) 656-4839 | Fax (831) 656-4866
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users

--
Jeff Squyres
Cisco Systems