Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Segmentation fault with SLURM and non-local nodes
From: Ralph Castain (rhc_at_[hidden])
Date: 2011-02-07 00:56:37


The 1.4 series is regularly tested on SLURM machines after every modification, and it has been running at LANL (and other SLURM installations) for quite some time, so I doubt that's the core issue. Likewise, nothing in the system depends on the FQDN (or anything else about the hostname); it's just used to print diagnostics.

I'm not sure of the issue, and I no longer have the ability to test/debug SLURM, so I'll have to let Sam continue to look into this for you. It's probably some trivial difference in setup, unfortunately. I don't know if you said before, but it would help to know which SLURM version you are using. SLURM tends to change a lot between versions (even minor releases), and it is one of the more finicky platforms we support.

On Feb 6, 2011, at 9:12 PM, Michael Curtis wrote:

>
> On 07/02/2011, at 12:36 PM, Michael Curtis wrote:
>
>>
>> On 04/02/2011, at 9:35 AM, Samuel K. Gutierrez wrote:
>>
>> Hi,
>>
>>> I just tried to reproduce the problem that you are experiencing and was unable to.
>>>
>>> SLURM 2.1.15
>>> Open MPI 1.4.3 configured with: --with-platform=./contrib/platform/lanl/tlcc/debug-nopanasas
>>
>> I compiled OpenMPI 1.4.3 (vanilla from source tarball) with the same platform file (the only change was to re-enable btl-tcp).
>>
>> Unfortunately, the result is the same:
>
> To reply to my own post again (sorry!), I tried OpenMPI 1.5.1. This works fine:
> salloc -n16 ~/../openmpi/bin/mpirun --display-map mpi
> salloc: Granted job allocation 151
>
> ======================== JOB MAP ========================
>
> Data for node: ipc3 Num procs: 8
> Process OMPI jobid: [3365,1] Process rank: 0
> Process OMPI jobid: [3365,1] Process rank: 1
> Process OMPI jobid: [3365,1] Process rank: 2
> Process OMPI jobid: [3365,1] Process rank: 3
> Process OMPI jobid: [3365,1] Process rank: 4
> Process OMPI jobid: [3365,1] Process rank: 5
> Process OMPI jobid: [3365,1] Process rank: 6
> Process OMPI jobid: [3365,1] Process rank: 7
>
> Data for node: ipc4 Num procs: 8
> Process OMPI jobid: [3365,1] Process rank: 8
> Process OMPI jobid: [3365,1] Process rank: 9
> Process OMPI jobid: [3365,1] Process rank: 10
> Process OMPI jobid: [3365,1] Process rank: 11
> Process OMPI jobid: [3365,1] Process rank: 12
> Process OMPI jobid: [3365,1] Process rank: 13
> Process OMPI jobid: [3365,1] Process rank: 14
> Process OMPI jobid: [3365,1] Process rank: 15
>
> =============================================================
> Process 2 on eng-ipc3.{FQDN} out of 16
> Process 4 on eng-ipc3.{FQDN} out of 16
> Process 5 on eng-ipc3.{FQDN} out of 16
> Process 0 on eng-ipc3.{FQDN} out of 16
> Process 1 on eng-ipc3.{FQDN} out of 16
> Process 6 on eng-ipc3.{FQDN} out of 16
> Process 3 on eng-ipc3.{FQDN} out of 16
> Process 7 on eng-ipc3.{FQDN} out of 16
> Process 8 on eng-ipc4.{FQDN} out of 16
> Process 11 on eng-ipc4.{FQDN} out of 16
> Process 12 on eng-ipc4.{FQDN} out of 16
> Process 14 on eng-ipc4.{FQDN} out of 16
> Process 15 on eng-ipc4.{FQDN} out of 16
> Process 10 on eng-ipc4.{FQDN} out of 16
> Process 9 on eng-ipc4.{FQDN} out of 16
> Process 13 on eng-ipc4.{FQDN} out of 16
> salloc: Relinquishing job allocation 151
>
> It does very much seem like there is a bug of some sort in 1.4.3?
>
> Michael
>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
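
For reference, the "Process N on <host> out of 16" lines in the quoted run are what a standard MPI rank/hostname test prints. The source of the "mpi" test binary is not included in the thread, so the following is only a minimal sketch of such a program, assuming it uses MPI_Comm_rank, MPI_Comm_size, and MPI_Get_processor_name:

/* Minimal sketch of a rank/hostname test; the actual "mpi" binary used
 * in the quoted run is not shown in the thread. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of ranks */
    MPI_Get_processor_name(name, &len);     /* hostname of this node */

    /* Matches the "Process N on <host> out of M" lines quoted above. */
    printf("Process %d on %s out of %d\n", rank, name, size);

    MPI_Finalize();
    return 0;
}

Compiled with mpicc and launched with the same salloc/mpirun command quoted above, a program like this prints one "Process ... out of 16" line per rank, as in the output shown.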