Open MPI Development Mailing List Archives

This web mail archive is frozen.

You can still navigate around this archive, but know that no new mails have been added to it since July of 2016.

Subject: Re: [OMPI devel] Problem with the openmpi-default-hostfile (on the trunk)
From: Ralph Castain (rhc_at_[hidden])
Date: 2012-02-28 09:30:04


Thanks - I'll fix that bug!

On Feb 28, 2012, at 6:48 AM, pascal.deveze_at_[hidden] wrote:

> devel-bounces_at_[hidden] wrote on 28/02/2012 10:54:15:
>
> > From: Ralph Castain <rhc_at_[hidden]>
> > To: Open MPI Developers <devel_at_[hidden]>
> > Date: 28/02/2012 10:54
> > Subject: Re: [OMPI devel] Problem with the openmpi-default-hostfile
> > (on the trunk)
> > Sent by: devel-bounces_at_[hidden]
> >
> > I'll see what I can do when next I have access to a slurm machine -
> > hopefully in a day or two.
> >
> > Are you sure you are at the top of the trunk? I reviewed the code,
> > and it clearly detects that the default hostfile is empty and ignores
> > it if so. Like I said, I'm not seeing this behavior, and neither are
> > the slurm machines on MTT.
>
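For readers following the thread, here is a minimal sketch of what "detect that the default hostfile is empty and ignore it" could look like. It is not the ORTE code. The helper names are hypothetical, and treating '#' as a comment marker and a missing file as empty are assumptions made for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical helper: true if the line holds only whitespace or a comment.
 * Assumption: '#' starts a comment. */
static bool line_is_blank_or_comment(const char *line)
{
    assert(NULL != line);
    while ('\0' != *line && isspace((unsigned char)*line)) {
        line++;
    }
    return '\0' == *line || '#' == *line;
}

/* Hypothetical helper: true if the file cannot be opened or contains no
 * host entries. Simplification: lines longer than the buffer are read in
 * chunks, and each chunk is checked as if it were its own line. */
static bool hostfile_is_effectively_empty(const char *path)
{
    char line[1024];
    bool empty = true;
    FILE *fp;

    assert(NULL != path);
    fp = fopen(path, "r");
    if (NULL == fp) {
        return true;
    }
    while (empty && NULL != fgets(line, sizeof(line), fp)) {
        if (!line_is_blank_or_comment(line)) {
            empty = false;
        }
    }
    fclose(fp);
    return empty;
}
```

A caller would skip the default hostfile when hostfile_is_effectively_empty() returns true and fall back to the resource manager's allocation.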
> I ran with a version from Feb 12th (I had a synchronization problem).
> Now, with the latest patches (Feb 27th), I no longer have the problem by default.
>
> But it is no longer possible to change the MCA parameter "orte_default_hostfile".
> For example in $HOME/.openmpi/mca-params.conf I put:
> orte_default_hostfile=none
> Then, even with ompi_info, I get a segfault:
>
> [xxxx:17426] *** Process received signal ***
> [xxxx:17426] Signal: Segmentation fault (11)
> [xxxx:17426] Signal code: Address not mapped (1)
> [xxxx:17426] Failing at address: (nil)
> [xxxx:17426] [ 0] /lib64/libpthread.so.0() [0x327220f490]
> [xxxx:17426] [ 1] /lib64/libc.so.6() [0x3271f24676]
> [xxxx:17426] [ 2] /..../lib/libopen-rte.so.0(orte_register_params+0xaac) [0x7fa46989677a]
> [xxxx:17426] [ 3] mpirun(orterun+0xeb) [0x4039ed]
> [xxxx:17426] [ 4] mpirun(main+0x20) [0x4034b4]
> [xxxx:17426] [ 5] /lib64/libc.so.6(__libc_start_main+0xfd) [0x3271e1ec9d]
> [xxxx:17426] [ 6] mpirun() [0x4033d9]
> [xxxx:17426] *** End of error message ***
>
> After a look at orte/runtime/orte_mca_params.c, I propose the following patch:
> --- a/orte/runtime/orte_mca_params.c Mon Feb 27 15:53:14 2012 +0000
> +++ b/orte/runtime/orte_mca_params.c Tue Feb 28 14:44:11 2012 +0100
> @@ -301,7 +301,7 @@
>          asprintf(&orte_default_hostfile, "%s/etc/openmpi-default-hostfile", opal_install_dirs.prefix);
>          /* flag that nothing was given */
>          orte_default_hostfile_given = false;
> -    } else if (0 == strcmp(orte_default_hostfile, "none")) {
> +    } else if (0 == strcmp(strval, "none")) {
>          orte_default_hostfile = NULL;
>          /* flag that it was given */
>          orte_default_hostfile_given = true;
>
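To make the reasoning behind the patch easier to follow, here is a self-contained sketch of the same three-way decision. It is not the ORTE source. The struct, the function names, and the "copy the value as a path" branch are assumptions for illustration. The point it shows is that the comparison against "none" uses the raw value, which is known to be non-NULL in that branch, so strcmp() never receives a NULL pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the two ORTE globals touched by the patch. */
struct hostfile_setting {
    char *path;  /* NULL means "use no default hostfile" */
    bool given;  /* true if the user supplied a value */
};

/* Portable string copy (avoids relying on strdup). */
static char *dup_string(const char *src)
{
    size_t len = strlen(src) + 1;
    char *copy = malloc(len);
    if (NULL != copy) {
        memcpy(copy, src, len);
    }
    return copy;
}

/* strval: raw parameter value, or NULL if nothing was set.
 * prefix: install prefix used to build the built-in default path. */
static struct hostfile_setting resolve_default_hostfile(const char *strval,
                                                        const char *prefix)
{
    static const char suffix[] = "/etc/openmpi-default-hostfile";
    struct hostfile_setting s = { NULL, false };

    assert(NULL != prefix);
    if (NULL == strval) {
        /* nothing was given, so build the default path */
        size_t len = strlen(prefix) + sizeof(suffix);
        s.path = malloc(len);
        if (NULL != s.path) {
            snprintf(s.path, len, "%s%s", prefix, suffix);
        }
    } else if (0 == strcmp(strval, "none")) {
        /* compare the raw value: it cannot be NULL in this branch */
        s.given = true;
    } else {
        /* assumption: any other value is taken as the hostfile path */
        s.path = dup_string(strval);
        s.given = true;
    }
    return s;
}
```

Calling resolve_default_hostfile("none", prefix) yields a NULL path with given set to true, which matches the intent of the patched branch.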
>
> >
> > On Feb 28, 2012, at 1:25 AM, pascal.deveze_at_[hidden] wrote:
> >
> >
> > devel-bounces_at_[hidden] wrote on 27/02/2012 15:53:06:
> >
> > > From: Ralph Castain <rhc_at_[hidden]>
> > > To: Open MPI Developers <devel_at_[hidden]>
> > > Date: 27/02/2012 16:17
> > > Subject: Re: [OMPI devel] Problem with the openmpi-default-hostfile
> > > (on the trunk)
> > > Sent by: devel-bounces_at_[hidden]
> > >
> > > That's strange - I run on slurm frequently and never have this
> > > problem, and my default hostfile is present and empty. Do you have
> > > anything in your default mca param file that might be telling us to
> > > use the hostfile?
> > >
> > > The only way I can find to get that behavior is if your default mca
> > > param file includes the orte_default_hostfile value. In that case,
> > > you are telling us to use the default hostfile, and so we will enforce it.
> >
> > Hi Ralph,
> >
> > On my side, the default value of orte_default_hostfile is a pointer
> > to etc/openmpi-default-hostfile.
> > The command ompi_info -a gives:
> >
> > MCA orte: parameter "orte_default_hostfile" (current value:
> > <..../etc/openmpi-default-hostfile>, data source: default value)
> > Name of the default hostfile (relative or absolute path, "none" to
> > ignore environmental or default MCA param setting)
> >
> > The following files are empty:
> > - .../etc/openmpi-mca-params.conf
> > - $HOME/.openmpi/mca-params.conf
> > Another solution for me is to put "orte_default_hostfile=none" in
> > one of these files.
> >
> > Pascal
> >
> > >
> > > On Feb 27, 2012, at 5:57 AM, pascal.deveze_at_[hidden] wrote:
> > >
> > > Hi all,
> > >
> > > I have problems with the openmpi-default-hostfile since the
> > > following patch on the trunk
> > >
> > > changeset: 19874:088fc6c84a9f
> > > user: rhc
> > > date: Wed Feb 01 17:40:44 2012 +0000
> > > summary: In accordance with prior releases, we are supposed to
> > > default to looking at the openmpi-default-hostfile as a default
> > > hostfile. Restore that behavior, but ignore the file if it is empty.
> > > Allow the user to ignore any MCA param setting pointing to a default
> > > hostfile by setting the param to "none" (via cmd line or whatever) -
> > > this allows them to override a setting in the system default MCA
> > > > param file.
> > >
> > > According to the summary of this patch, the openmpi-default-hostfile
> > > is ignored if it is empty.
> > > But, when I run my jobs with slurm + mpirun, I get the following message:
> > > --------------------------------------------------------------------------
> > > No nodes are available for this job, either due to a failure to
> > > allocate nodes to the job, or allocated nodes being marked
> > > as unavailable (e.g., down, rebooting, or a process attempting
> > > to be relocated to another node when none are available).
> > > --------------------------------------------------------------------------
> > >
> > > I am able to run my job if:
> > > - either I put my node(s) in the file etc/openmpi-default-hostfile
> > > > - or use "-mca orte_default_hostfile none" in the mpirun command line
> > > > - or "export OMPI_MCA_orte_default_hostfile=none" in my environment
> > >
> > > It appears that an empty openmpi-default-hostfile is not ignored.
> > > > This patch seems to be incomplete.
> > > >
> > > > Or am I misunderstanding something?
> > >
> > > > Pascal Devèze
> > > > _______________________________________________
> > > devel mailing list
> > > devel_at_[hidden]
> > > http://www.open-mpi.org/mailman/listinfo.cgi/devel