Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Segmentation fault with SLURM and non-local nodes
From: Michael Curtis (michael.curtis_at_[hidden])
Date: 2011-02-06 23:12:24


On 07/02/2011, at 12:36 PM, Michael Curtis wrote:

>
> On 04/02/2011, at 9:35 AM, Samuel K. Gutierrez wrote:
>
> Hi,
>
>> I just tried to reproduce the problem that you are experiencing and was unable to.
>>
>> SLURM 2.1.15
>> Open MPI 1.4.3 configured with: --with-platform=./contrib/platform/lanl/tlcc/debug-nopanasas
>
> I compiled OpenMPI 1.4.3 (vanilla from source tarball) with the same platform file (the only change was to re-enable btl-tcp).
>
> Unfortunately, the result is the same:

Replying to my own post again (sorry!): I also tried Open MPI 1.5.1, and that works fine:
salloc -n16 ~/../openmpi/bin/mpirun --display-map mpi
salloc: Granted job allocation 151

 ======================== JOB MAP ========================

 Data for node: ipc3 Num procs: 8
         Process OMPI jobid: [3365,1] Process rank: 0
         Process OMPI jobid: [3365,1] Process rank: 1
         Process OMPI jobid: [3365,1] Process rank: 2
         Process OMPI jobid: [3365,1] Process rank: 3
         Process OMPI jobid: [3365,1] Process rank: 4
         Process OMPI jobid: [3365,1] Process rank: 5
         Process OMPI jobid: [3365,1] Process rank: 6
         Process OMPI jobid: [3365,1] Process rank: 7

 Data for node: ipc4 Num procs: 8
         Process OMPI jobid: [3365,1] Process rank: 8
         Process OMPI jobid: [3365,1] Process rank: 9
         Process OMPI jobid: [3365,1] Process rank: 10
         Process OMPI jobid: [3365,1] Process rank: 11
         Process OMPI jobid: [3365,1] Process rank: 12
         Process OMPI jobid: [3365,1] Process rank: 13
         Process OMPI jobid: [3365,1] Process rank: 14
         Process OMPI jobid: [3365,1] Process rank: 15

 =============================================================
Process 2 on eng-ipc3.{FQDN} out of 16
Process 4 on eng-ipc3.{FQDN} out of 16
Process 5 on eng-ipc3.{FQDN} out of 16
Process 0 on eng-ipc3.{FQDN} out of 16
Process 1 on eng-ipc3.{FQDN} out of 16
Process 6 on eng-ipc3.{FQDN} out of 16
Process 3 on eng-ipc3.{FQDN} out of 16
Process 7 on eng-ipc3.{FQDN} out of 16
Process 8 on eng-ipc4.{FQDN} out of 16
Process 11 on eng-ipc4.{FQDN} out of 16
Process 12 on eng-ipc4.{FQDN} out of 16
Process 14 on eng-ipc4.{FQDN} out of 16
Process 15 on eng-ipc4.{FQDN} out of 16
Process 10 on eng-ipc4.{FQDN} out of 16
Process 9 on eng-ipc4.{FQDN} out of 16
Process 13 on eng-ipc4.{FQDN} out of 16
salloc: Relinquishing job allocation 151
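
For anyone who wants to check a run like this without eyeballing it, here is a rough sketch of a checker for the test program's output. It assumes the "Process N on HOST out of M" line format shown above; the function name summarize and the regular expression are my own illustration and are not part of Open MPI or SLURM.

```python
import re
from collections import defaultdict

# Assumed line format, taken from the run above:
#   "Process 2 on eng-ipc3.{FQDN} out of 16"
LINE_RE = re.compile(r"^Process (\d+) on (\S+) out of (\d+)$")


def summarize(output):
    """Return (ranks_by_host, missing_ranks, duplicate_ranks) for a run log.

    Hypothetical helper: it only understands the line format above and
    silently skips everything else (salloc messages, the job map, blanks).
    """
    ranks_by_host = defaultdict(list)
    seen = []
    world_size = None
    for line in output.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        rank, host, size = int(m.group(1)), m.group(2), int(m.group(3))
        if world_size is None:
            world_size = size  # trust the first reported world size
        ranks_by_host[host].append(rank)
        seen.append(rank)
    expected = set(range(world_size or 0))
    missing = sorted(expected - set(seen))
    duplicates = sorted({r for r in seen if seen.count(r) > 1})
    by_host = {h: sorted(r) for h, r in ranks_by_host.items()}
    return by_host, missing, duplicates


if __name__ == "__main__":
    import sys

    hosts, missing, duplicates = summarize(sys.stdin.read())
    for host, ranks in sorted(hosts.items()):
        print(host, ranks)
    print("missing ranks:", missing)
    print("duplicate ranks:", duplicates)
```

Piping the salloc/mpirun output into the script should list the ranks per host and flag any rank that never reported or reported twice. It says nothing about why a rank failed, so it is only a quick sanity check.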

Since 1.5.1 works, it does look very much like there is a bug of some sort in 1.4.3.

Michael