Dear Terry,
Thanks for the reply, and sorry for the delay in getting back to you. Here is the relevant part of the gdb output:
Program terminated with signal 11, Segmentation fault.
#0 0x00002b63ba7f9291 in PMPI_Comm_size () at ./pcomm_size.c:46
46 if ( ompi_comm_invalid (comm)) {
(gdb) where
#0 0x00002b63ba7f9291 in PMPI_Comm_size () at ./pcomm_size.c:46
#1 0x000000000062cb6c in blacs_pinfo_ () at ./blacs_pinfo_.c:29
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
Do you think the problem is being caused by SGE feeding the wrong number of processors to BLACS in someway?
As I mentioned previously I am requesting a different number of processors than I am running on, as I run several jobs on the requested processors.
Thanks for your time & help.
Conn
________________________________
From: TERRY DONTJE <terry.dontje_at_[hidden]>
To: users_at_[hidden]
Sent: Friday, 13 January 2012, 13:21
Subject: Re: [OMPI users] Openmpi SGE and BLACS
Do you have a stack of where exactly things are seg faulting in blacs_pinfo?
--td
On 1/13/2012 8:12 AM, Conn ORourke wrote:
Dear Openmpi Users,
>
>
>I am reserving several processors with SGE upon which I want to run a number of openmpi jobs, all of which individually (and combined) use less than the reserved number of processors. The code I am using uses BLACS, and when blacs_pinfo is called I get a seg fault. If the code doesn't call blacs_pinfo it runs fine being submitted in this manner. blacs_pinfo simply returns the number of available processors, so I suspect this is an issue with SGE and openmpi and the requested node number being different to that given to mpirun.
>
>
>
>Can anyone explain why this would happen with openmpi jobs using BLACS on the SGE? And suggest maybe a way around it?
>
>
>
>Many thanks
>
>Conn
>
>
>
>example submission script:
>#!/bin/bash -f -l#$ -V #$ -N test #$ -S /bin/bash#$ -cwd#$ -l vf=1800M#$ -pe ib-ompi 12 #$ -q infiniband.q BIN=~/bin/program
fori inXPOL,YPOL,ZPOL;do mkdir ${TMPDIR}/4ZP; mkdir ${TMPDIR}/4ZP/$i; cp ./4ZP/$i/*${TMPDIR}/4ZP/$i; done cd ${TMPDIR}/4ZP/XPOL; mpirun -np 4-machinefile ${TMPDIR}/machines $BIN >output & cd ${TMPDIR}/4ZP/YPOL; mpirun -np 4-machinefile ${TMPDIR}/machines $BIN >output & cd ${TMPDIR}/4ZP/ZPOL; mpirun -np 4-machinefile ${TMPDIR}/machines $BIN >output ; fori in XPOL YPOL ZPOL ;do cp ${TMPDIR}/4ZP/$i/*${HOME}/4ZP/$i; doneblacs_pinfo.c: #include "Bdef.h"#if (INTFACE == C_CALL)void Cblacs_pinfo(int *mypnum,int *nprocs)#elseF_VOID_FUNC blacs_pinfo_(int *mypnum,int *nprocs)#endif{ int ierr; extern int BI_Iam,BI_Np;/* *Ifthis is our first call,will need to setup some stuff
*/ if(BI_F77_MPI_COMM_WORLD ==NULL) {/* * TheBLACS always call f77's mpi_init. If the user is using C, he should
* explicitly call MPI_Init . . .
*/
MPI_Initialized(nprocs);
#ifdef MainInF77
if (!(*nprocs)) bi_f77_init_();
#else
if (!(*nprocs))
BI_BlacsErr(-1, -1, __FILE__,
"Users with C main programs must explicitly call MPI_Init");
#endif
BI_F77_MPI_COMM_WORLD = (int *) malloc(sizeof(int));
#ifdef UseF77Mpi
BI_F77_MPI_CONSTANTS = (int *) malloc(23*sizeof(int));
ierr = 1;
bi_f77_get_constants_(BI_F77_MPI_COMM_WORLD, &ierr, BI_F77_MPI_CONSTANTS);
#else
ierr = 0;
bi_f77_get_constants_(BI_F77_MPI_COMM_WORLD, &ierr, nprocs);
#endif
BI_MPI_Comm_size(BI_MPI_COMM_WORLD, &BI_Np, ierr);
BI_MPI_Comm_rank(BI_MPI_COMM_WORLD, &BI_Iam, ierr);
}
*mypnum = BI_Iam;
*nprocs = BI_Np;
}
>
_______________________________________________
users mailing list users_at_[hidden] http://www.open-mpi.org/mailman/listinfo.cgi/users
--
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.dontje_at_[hidden]
_______________________________________________
users mailing list
users_at_[hidden]
http://www.open-mpi.org/mailman/listinfo.cgi/users
|