Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Problem with the openmpi-default-hostfile (on the trunk)
From: pascal.deveze_at_[hidden]
Date: 2012-02-28 08:48:12


devel-bounces_at_[hidden] wrote on 28/02/2012 10:54:15:

> From: Ralph Castain <rhc_at_[hidden]>
> To: Open MPI Developers <devel_at_[hidden]>
> Date: 28/02/2012 10:54
> Subject: Re: [OMPI devel] Problem with the openmpi-default-hostfile
> (on the trunk)
> Sent by: devel-bounces_at_[hidden]
>
> I'll see what I can do when next I have access to a slurm machine -
> hopefully in a day or two.
>
> Are you sure you are at the top of the trunk? I reviewed the code,
> and it clearly detects that the default hostfile is empty and ignores
> it if so. Like I said, I'm not seeing this behavior, and neither are
> the slurm machines on MTT.
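
The empty-file check described above could look roughly like the sketch below. It is not the ORTE code: the function name `hostfile_is_effectively_empty` is hypothetical, and treating a missing file, or a file that holds only blank lines and `#` comments, as empty is an assumption made for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical helper, not an ORTE function.
 * Returns true when the file at `path` cannot be opened or contains
 * nothing but whitespace and '#' comments (both are assumptions). */
static bool hostfile_is_effectively_empty(const char *path)
{
    assert(NULL != path);

    FILE *fp = fopen(path, "r");
    if (NULL == fp) {
        return true;            /* assumption: missing file counts as empty */
    }

    bool empty = true;
    bool in_comment = false;
    int c;
    while (EOF != (c = fgetc(fp))) {
        if ('\n' == c) {
            in_comment = false; /* a comment ends at the end of its line */
        } else if (in_comment) {
            continue;           /* skip everything inside a comment */
        } else if ('#' == c) {
            in_comment = true;
        } else if (!isspace((unsigned char)c)) {
            empty = false;      /* found the start of a real entry */
            break;
        }
    }
    fclose(fp);
    return empty;
}
```

A caller would skip the default hostfile when this returns true and fall back to whatever nodes the resource manager (slurm here) allocated.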

I was running a version from Feb 12th (I had a synchronization problem).
With the latest patches (Feb 27th), I no longer have the problem by default.

But it is no longer possible to change the MCA parameter
"orte_default_hostfile".
For example, in $HOME/.openmpi/mca-params.conf I put:
   orte_default_hostfile=none
Then, even with ompi_info, I get a segfault:

[xxxx:17426] *** Process received signal ***
[xxxx:17426] Signal: Segmentation fault (11)
[xxxx:17426] Signal code: Address not mapped (1)
[xxxx:17426] Failing at address: (nil)
[xxxx:17426] [ 0] /lib64/libpthread.so.0() [0x327220f490]
[xxxx:17426] [ 1] /lib64/libc.so.6() [0x3271f24676]
[xxxx:17426] [ 2] /..../lib/libopen-rte.so.0(orte_register_params+0xaac) [0x7fa46989677a]
[xxxx:17426] [ 3] mpirun(orterun+0xeb) [0x4039ed]
[xxxx:17426] [ 4] mpirun(main+0x20) [0x4034b4]
[xxxx:17426] [ 5] /lib64/libc.so.6(__libc_start_main+0xfd) [0x3271e1ec9d]
[xxxx:17426] [ 6] mpirun() [0x4033d9]
[xxxx:17426] *** End of error message ***

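After looking at orte/runtime/orte_mca_params.c, I propose the following
patch. The strcmp appears to be called on orte_default_hostfile before it
has been assigned, whereas the value that was read is in strval: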
--- a/orte/runtime/orte_mca_params.c Mon Feb 27 15:53:14 2012 +0000
+++ b/orte/runtime/orte_mca_params.c Tue Feb 28 14:44:11 2012 +0100
@@ -301,7 +301,7 @@
         asprintf(&orte_default_hostfile, "%s/etc/openmpi-default-hostfile", opal_install_dirs.prefix);
         /* flag that nothing was given */
         orte_default_hostfile_given = false;
-    } else if (0 == strcmp(orte_default_hostfile, "none")) {
+    } else if (0 == strcmp(strval, "none")) {
         orte_default_hostfile = NULL;
         /* flag that it was given */
         orte_default_hostfile_given = true;
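
As a minimal sketch of the logic the patch is aiming for, the standalone C below mirrors the cases in the hunk. It is not the ORTE source: `resolve_default_hostfile`, its `prefix` argument, the `copy_string` helper, the file-local globals, and the final `else` branch are assumptions for illustration. Only the `NULL` and "none" cases come from the hunk above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for the ORTE globals named in the patch. */
static char *default_hostfile = NULL;
static bool default_hostfile_given = false;

/* Hypothetical helper: heap copy of a string (NULL on allocation failure). */
static char *copy_string(const char *s)
{
    size_t len = strlen(s) + 1;
    char *copy = malloc(len);
    if (NULL != copy) {
        memcpy(copy, s, len);
    }
    return copy;
}

/* strval: the parameter value as looked up, or NULL if none was set.
 * prefix: install prefix used to build the default path (assumption). */
static void resolve_default_hostfile(const char *strval, const char *prefix)
{
    assert(NULL != prefix);

    free(default_hostfile);
    default_hostfile = NULL;

    if (NULL == strval) {
        /* nothing was given: build the default path */
        const char *suffix = "/etc/openmpi-default-hostfile";
        size_t len = strlen(prefix) + strlen(suffix) + 1;
        default_hostfile = malloc(len);
        if (NULL != default_hostfile) {
            snprintf(default_hostfile, len, "%s%s", prefix, suffix);
        }
        default_hostfile_given = false;
    } else if (0 == strcmp(strval, "none")) {
        /* test the looked-up value, never the still-NULL global */
        default_hostfile = NULL;
        default_hostfile_given = true;
    } else {
        /* assumption: any other value is taken as the hostfile path */
        default_hostfile = copy_string(strval);
        default_hostfile_given = true;
    }
}
```

The point of the sketch is the second branch: the comparison reads `strval`, which is known to be non-NULL there, so passing "none" can no longer hand a NULL pointer to `strcmp`.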

>
> On Feb 28, 2012, at 1:25 AM, pascal.deveze_at_[hidden] wrote:
>
>
> devel-bounces_at_[hidden] wrote on 27/02/2012 15:53:06:
>
> > From: Ralph Castain <rhc_at_[hidden]>
> > To: Open MPI Developers <devel_at_[hidden]>
> > Date: 27/02/2012 16:17
> > Subject: Re: [OMPI devel] Problem with the openmpi-default-hostfile
> > (on the trunk)
> > Sent by: devel-bounces_at_[hidden]
> >
> > That's strange - I run on slurm frequently and never have this
> > problem, and my default hostfile is present and empty. Do you have
> > anything in your default mca param file that might be telling us to
> > use the hostfile?
> >
> > The only way I can find to get that behavior is if your default mca
> > param file includes the orte_default_hostfile value. In that case,
> > you are telling us to use the default hostfile, and so we will
> > enforce it.
>
> Hi Ralph,
>
> On my side, the default value of orte_default_hostfile is a pointer
> to etc/openmpi-default-hostfile.
> The command ompi_info -a gives:
>
> MCA orte: parameter "orte_default_hostfile" (current value:
> <..../etc/openmpi-default-hostfile>, data source: default value)
> Name of the default hostfile (relative or absolute path, "none" to
> ignore environmental or default MCA param setting)
>
> The following files are empty:
> - .../etc/openmpi-mca-params.conf
> - $HOME/.openmpi/mca-params.conf
> Another solution for me is to put "orte_default_hostfile=none" in
> one of these files.
>
> Pascal
>
> >
> > On Feb 27, 2012, at 5:57 AM, pascal.deveze_at_[hidden] wrote:
> >
> > Hi all,
> >
> > I have problems with the openmpi-default-hostfile since the
> > following patch on the trunk
> >
> > changeset: 19874:088fc6c84a9f
> > user: rhc
> > date: Wed Feb 01 17:40:44 2012 +0000
> > summary: In accordance with prior releases, we are supposed to
> > default to looking at the openmpi-default-hostfile as a default
> > hostfile. Restore that behavior, but ignore the file if it is empty.
> > Allow the user to ignore any MCA param setting pointing to a default
> > hostfile by setting the param to "none" (via cmd line or whatever) -
> > this allows them to override a setting in the system default MCA
> > param file.
> >
> > According to the summary of this patch, the openmpi-default-hostfile
> > is ignored if it is empty.
> > But when I run my jobs with slurm + mpirun, I get the following
> > message:
> >
> > --------------------------------------------------------------------------
> > No nodes are available for this job, either due to a failure to
> > allocate nodes to the job, or allocated nodes being marked
> > as unavailable (e.g., down, rebooting, or a process attempting
> > to be relocated to another node when none are available).
> > --------------------------------------------------------------------------
> >
> > I am able to run my job if I do one of the following:
> > - put my node(s) in the file etc/openmpi-default-hostfile
> > - use "-mca orte_default_hostfile none" on the mpirun command line
> > - set "export OMPI_MCA_orte_default_hostfile=none" in my environment
> >
> > It appears that an empty openmpi-default-hostfile is not ignored.
> > The patch seems to be incomplete.
> >
> > Or am I misunderstanding something?
> >
> > Pascal Devèze
> > _______________________________________________
> > devel mailing list
> > devel_at_[hidden]
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel