Open MPI User's Mailing List Archives

From: Noam Meltzer (noam_at_[hidden])
Date: 2007-08-22 10:20:44


Hi,

Thanks to all who answered. The problem was indeed the maximum locked
memory limit.
However, changing it in <SGE_ROOT>/default/common/settings.sh was not enough.
I also had to add ". <SGE_ROOT>/default/common/settings.sh" to
<SGE_ROOT>/default/common/sgeexecd (and to /etc/init.d/sgeexecd on the
compute nodes), because when sgeexecd was started at boot it ignored
limits.conf.
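
For anyone hitting the same error, here is a minimal sketch of how the
limit can be checked. It is not part of the fix itself: the helper name
check_memlock and the MIN_KB threshold are illustrative assumptions, and
only "ulimit -l" (which reports the limit in kB, or "unlimited") is
standard shell.

```shell
#!/bin/bash
# Illustrative threshold in kB (an assumption, not an Open MPI requirement).
MIN_KB=${MIN_KB:-1048576}

# check_memlock is a hypothetical helper: it classifies a value as printed
# by `ulimit -l`, which is either a number of kB or the word "unlimited".
check_memlock() {
    local limit="$1"
    if [ "$limit" = "unlimited" ]; then
        echo "ok"
    elif [ "$limit" -ge "$MIN_KB" ] 2>/dev/null; then
        echo "ok"
    else
        echo "too-low"
    fi
}

# Report the limit seen by the current process.
current="$(ulimit -l)"
echo "$(hostname): max locked memory = ${current} -> $(check_memlock "$current")"
```

Running this once from an interactive shell and once from inside a job
submitted with qrsh should show whether jobs started by sgeexecd inherit
a smaller limit than a login shell does.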

Best regards,
Noam Meltzer
Software Support Engineer & RHCE
E&M Computing

http://www.emet.co.il

Jeff Squyres wrote:
> I suspect that your SGE daemons are not starting with the proper
> locked memory limits (and therefore jobs started under SGE get
> severely limited locked memory limits).
>
> See these FAQ entries -- the issues described for SLURM apply
> to all resource managers (including SGE):
>
> http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
> http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages-more
>
>
> On Aug 22, 2007, at 8:31 AM, Noam Meltzer wrote:
>
>
>> Hi,
>>
>> I am running openmpi-1.2.3 compiled for 64bit on RHEL4u4.
>> I also have a Voltaire InfiniBand interconnect.
>> When I manually run jobs using the following command:
>>
>> /opt/local/openmpi-1.2.3-gcc4/bin/orterun -np 8 -hostfile ~/myHostList \
>>     -mca btl self,openib /tcc/eandm/performance/igor/main.exe.openmpi123
>>
>> The job runs just fine.
>>
>> However, when I run it through SGE, I get the following error
>> (on all hosts in my list):
>> --------------------------------------------------------------------------
>> The OpenIB BTL failed to initialize while trying to create an internal
>> queue. This typically indicates a failed OpenFabrics installation or
>> faulty hardware. The failure occured here:
>>
>> Host: node4.grid.technion.ac.il
>> OMPI source: btl_openib.c:828
>> Function: ibv_create_cq()
>> Error: Invalid argument (errno=22)
>> Device: mthca0
>>
>> You may need to consult with your system administrator to get this
>> problem fixed.
>> --------------------------------------------------------------------------
>>
>> To send a job to the grid I use the following command:
>> qrsh -cwd -q noam.q -pe orte 8 ./myScript
>>
>> while "myScript" looks like:
>>
>> #!/bin/bash
>> /opt/local/openmpi-1.2.3-gcc4/bin/orterun -np $NSLOTS \
>>     -mca btl self,openib /tcc/eandm/performance/igor/main.exe.openmpi123
>>
>> If I change "openib" to "tcp" (in myScript) everything works just
>> fine.
>>
>> Any ideas?
>>
>> --
>> Best regards,
>> Noam Meltzer
>> Software Support Engineer & RHCE
>> E&M Computing
>>
>> http://www.emet.co.il