Dear Terry,

Thanks for the reply, and sorry for the delay in getting back to you. Here is the relevant part of the gdb output:

Program terminated with signal 11, Segmentation fault.
#0  0x00002b63ba7f9291 in PMPI_Comm_size () at ./pcomm_size.c:46
46            if ( ompi_comm_invalid (comm)) {
(gdb) where
#0  0x00002b63ba7f9291 in PMPI_Comm_size () at ./pcomm_size.c:46
#1  0x000000000062cb6c in blacs_pinfo_ () at ./blacs_pinfo_.c:29
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Do you think the problem is being caused by SGE feeding the wrong number of processors to BLACS in some way?
As I mentioned previously, I am requesting more processors from SGE than any single mpirun uses, because I run several jobs at once on the requested processors.
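
To make the "several jobs on one reservation" setup concrete, here is a minimal sketch of how the reserved slots could be divided so that each job gets its own machinefile. It assumes the machines file lists one slot per line, which may not match how this SGE installation writes it. The function name split_machinefile and the machines.N output names are hypothetical, not part of SGE or Open MPI. Whether overlapping machinefiles have anything to do with the segfault is not established; this only illustrates the layout.

```shell
# Hypothetical helper: split a machinefile (assumed: one slot per line)
# into consecutive chunks of per_job lines, written as machines.0,
# machines.1, ... in outdir. Prints the number of chunks written.
split_machinefile() {
    machinefile=$1
    per_job=$2
    outdir=$3

    total=$(wc -l < "$machinefile")
    njobs=$(( total / per_job ))   # leftover lines (if any) are ignored

    j=0
    while [ "$j" -lt "$njobs" ]; do
        start=$(( j * per_job + 1 ))
        end=$(( start + per_job - 1 ))
        # Copy lines start..end of the machinefile into this job's chunk.
        sed -n "${start},${end}p" "$machinefile" > "${outdir}/machines.${j}"
        j=$(( j + 1 ))
    done

    echo "$njobs"
}

# Possible use in the submission script (not run here):
#   split_machinefile ${TMPDIR}/machines 4 ${TMPDIR}
#   mpirun -np 4 -machinefile ${TMPDIR}/machines.0 $BIN > output &
```

With 12 reserved slots and 4 processes per job this would give three 4-line files, one per mpirun, instead of all three jobs reading the same 12-line file.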

Thanks for your time & help.


Sent: Friday, 13 January 2012, 13:21
Subject: Re: [OMPI users] Openmpi SGE and BLACS

Do you have a stack trace showing exactly where things are segfaulting in blacs_pinfo?


On 1/13/2012 8:12 AM, Conn ORourke wrote:
Dear Openmpi Users,

I am reserving several processors with SGE upon which I want to run a number of openmpi jobs, all of which individually (and combined) use less than the reserved number of processors. The code I am using uses BLACS, and when blacs_pinfo is called I get a seg fault. If the code doesn't call blacs_pinfo it runs fine being submitted in this manner. blacs_pinfo simply returns the number of available processors, so I suspect this is an issue with SGE and openmpi and the requested node number being different to that given to mpirun.

Can anyone explain why this would happen with openmpi jobs using BLACS on SGE, and perhaps suggest a way around it?

Many thanks

example submission script:
#!/bin/bash -f -l
#$ -V 
#$ -N test 
#$ -S /bin/bash
#$ -cwd
#$ -l vf=1800M
#$ -pe ib-ompi 12 
#$ -q infiniband.q

    # Stage the input files for each polarisation into the job's scratch area.
    for i in XPOL YPOL ZPOL; do
       mkdir -p ${TMPDIR}/4ZP/$i;
       cp ./4ZP/$i/* ${TMPDIR}/4ZP/$i;
    done

    cd ${TMPDIR}/4ZP/XPOL;
    mpirun -np 4 -machinefile ${TMPDIR}/machines $BIN > output &
    cd ${TMPDIR}/4ZP/YPOL;
    mpirun -np 4 -machinefile ${TMPDIR}/machines $BIN > output &
    cd ${TMPDIR}/4ZP/ZPOL;
    mpirun -np 4 -machinefile ${TMPDIR}/machines $BIN > output ;
    # Wait for the two background jobs before copying results back.
    wait

    for i in  XPOL YPOL ZPOL  ; do
     cp ${TMPDIR}/4ZP/$i/* ${HOME}/4ZP/$i;
    done

blacs_pinfo_.c:
#include "Bdef.h"

void Cblacs_pinfo(int *mypnum, int *nprocs)
F_VOID_FUNC blacs_pinfo_(int *mypnum, int *nprocs)
{
   int ierr;
   extern int BI_Iam, BI_Np;

/*
 * If this is our first call, will need to set up some stuff
 *    The BLACS always call f77's mpi_init.  If the user is using C, he should
 *    explicitly call MPI_Init . . .
 */
#ifdef MainInF77
      if (!(*nprocs)) bi_f77_init_();
#else
      if (!(*nprocs))
         BI_BlacsErr(-1, -1, __FILE__,
            "Users with C main programs must explicitly call MPI_Init");
#endif
      BI_F77_MPI_COMM_WORLD = (int *) malloc(sizeof(int));
#ifdef UseF77Mpi
      BI_F77_MPI_CONSTANTS = (int *)
      ierr = 1;
      bi_f77_get_constants_(BI_F77_MPI_COMM_WORLD, &ierr, BI_F77_MPI_CONSTANTS);
#else
      ierr = 0;
      bi_f77_get_constants_(BI_F77_MPI_COMM_WORLD, &ierr, nprocs);
#endif
      BI_MPI_Comm_size(BI_MPI_COMM_WORLD, &BI_Np, ierr);
      BI_MPI_Comm_rank(BI_MPI_COMM_WORLD, &BI_Iam, ierr);
   *mypnum = BI_Iam;
   *nprocs = BI_Np;
}

_______________________________________________ users mailing list

Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
