
Open MPI User's Mailing List Archives


From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2007-11-07 19:25:50


Check out:

http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages-more

In particular, see the stuff about using resource managers.
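The short version: when a job is launched through Torque, the MPI processes
inherit the memlock limit of the pbs_mom daemon on each node, not the limit
of your login shell -- which is why "ulimit -l" can look fine interactively
while the job still hits the small default limit (the 32768 in the error
below). A rough sketch of the usual fix (file paths and the init-script name
are assumptions for a typical Linux/Torque install):

# On every compute node (as root):
#
# 1) Raise the limit for PAM-based logins (ssh, interactive shells)
#    by adding to /etc/security/limits.conf:
#       *  soft  memlock  unlimited
#       *  hard  memlock  unlimited
#
# 2) pbs_mom is started at boot and does not go through PAM, so also
#    put "ulimit -l unlimited" in its startup script (e.g. an
#    /etc/init.d/pbs_mom init script -- name assumed) and restart it:
/etc/init.d/pbs_mom restart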

On Nov 7, 2007, at 7:22 PM, Andrus, Mr. Brian (Contractor) wrote:

> Ok, I am having some difficulty troubleshooting this.
>
> If I run my hello program without torque, it works fine:
> [root_at_login1 root]# mpirun --mca btl openib,self -host
> n01,n02,n03,n04,n05 /data/root/hello
> Hello from process 0 of 5 on node n01
> Hello from process 1 of 5 on node n02
> Hello from process 2 of 5 on node n03
> Hello from process 3 of 5 on node n04
> Hello from process 4 of 5 on node n05
>
> If I submit it as root, it seems happy:
> [root_at_login1 root]# qsub
> #!/bin/bash
> #PBS -j oe
> #PBS -l nodes=5:ppn=1
> #PBS -W x=NACCESSPOLICY:SINGLEJOB
> #PBS -N TestJob
> #PBS -q long
> #PBS -o output.txt
> #PBS -V
> cd $PBS_O_WORKDIR
> rm -f output.txt
> date
> mpirun --mca btl openib,self /data/root/hello
> 103.cluster.default.domain
> [root_at_login1 root]# cat output.txt
> Wed Nov 7 16:20:33 PST 2007
> Hello from process 0 of 5 on node n05
> Hello from process 1 of 5 on node n04
> Hello from process 2 of 5 on node n03
> Hello from process 3 of 5 on node n02
> Hello from process 4 of 5 on node n01
>
> If I do it as me, not so good:
> [andrus_at_login1 data]$ qsub
> #!/bin/bash
> #PBS -j oe
> #PBS -l nodes=1:ppn=1
> #PBS -W x=NACCESSPOLICY:SINGLEJOB
> #PBS -N TestJob
> #PBS -q long
> #PBS -o output.txt
> #PBS -V
> cd $PBS_O_WORKDIR
> rm -f output.txt
> date
> mpirun --mca btl openib,self /data/root/hello
> 105.littlemac.default.domain
> [andrus_at_login1 data]$ cat output.txt
> Wed Nov 7 16:23:00 PST 2007
> --------------------------------------------------------------------------
> The OpenIB BTL failed to initialize while trying to allocate some
> locked memory. This typically can indicate that the memlock limits
> are set too low. For most HPC installations, the memlock limits
> should be set to "unlimited". The failure occured here:
>
> Host: n01
> OMPI source: btl_openib.c:828
> Function: ibv_create_cq()
> Device: mthca0
> Memlock limit: 32768
>
> You may need to consult with your system administrator to get this
> problem fixed. This FAQ entry on the Open MPI web site may also be
> helpful:
>
> http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems. This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
> PML add procs failed
> --> Returned "Error" (-1) instead of "Success" (0)
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (goodbye)
>
>
>
> I have checked that ulimit is unlimited. I cannot seem to figure
> this out. Any help?
> Brian Andrus perotsystems
> Site Manager | Sr. Computer Scientist
> Naval Research Lab
> 7 Grace Hopper Ave, Monterey, CA 93943
> Phone (831) 656-4839 | Fax (831) 656-4866
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
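One quick way to confirm it's the daemon's limit that matters: submit a
trivial job that just prints the limit it sees inside the batch environment
(a minimal sketch; the queue and output file names are placeholders):

#!/bin/bash
#PBS -j oe
#PBS -l nodes=1:ppn=1
#PBS -q long
#PBS -o limit.txt
# If this prints a small number instead of "unlimited", the pbs_mom
# startup environment is what needs fixing, not your shell's ulimit.
ulimit -l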

-- 
Jeff Squyres
Cisco Systems