Open MPI Development Mailing List Archives

Subject: [OMPI devel] New MCA param: odls_base_exit_status_77_fatal
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2012-05-08 17:56:00


This commit adds a new MCA param that people should set in their MTT testing environments:

  MCA odls: parameter "odls_base_exit_status_77_fatal" (current
            value: <1>, data source: default value)
            Whether to kill an entire job if any process in that
            job exits normally with a status of 77 (exit status 77
            in the GNU testing standards means "this test was
            skipped", and therefore we wouldn't want to kill the
            entire job)

It defaults to 1, meaning that exit status 77 is treated like any other nonzero status: per a prior telecon discussion and subsequent email discussion (and RFC?), whenever a process exits with a nonzero status, the entire job is killed (recall that we polled other MPI implementations and found that they all adhere to this behavior, too).

However, in MTT, we have a lot of tests that adhere to the GNU standard of "if I exit(77), I'm just indicating that I skipped this test."

In this case, we don't want ORTE to kill the entire job, which MTT would interpret as a failure. Hence, we added this MCA param for the special case of testing: setting it to 0 means that if any proc calls exit(77), we won't kill the entire job.
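
To make the rule concrete, here is a minimal sketch of the decision as described above. It is a simplified model, not the ORTE code: exit_kills_job() and EXIT_STATUS_SKIPPED are hypothetical names made up for illustration, and the sketch ignores aborts, signals, and the other checks the real odls code performs.

```c
#include <stdbool.h>

/* GNU testing convention: exit status 77 means "this test was skipped". */
#define EXIT_STATUS_SKIPPED 77

/*
 * Hypothetical helper (NOT an ORTE function): decide whether a process
 * that exited normally with `exit_code` should cause the whole job to
 * be killed, given the value of odls_base_exit_status_77_fatal.
 */
bool exit_kills_job(int exit_code, bool is_exit_status_77_fatal)
{
    /* A clean exit never kills the job. */
    if (0 == exit_code) {
        return false;
    }

    /* Status 77 is tolerated only when the param is set to 0 (false). */
    if (EXIT_STATUS_SKIPPED == exit_code && !is_exit_status_77_fatal) {
        return false;
    }

    /* Any other nonzero status is fatal to the job. */
    return true;
}
```

With the default (param = 1), exit_kills_job(77, true) is true, so a skipped test takes the job down; with the param set to 0, exit_kills_job(77, false) is false while exit_kills_job(1, false) is still true.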

In my MTT for the trunk, I now set this MCA param in my MPI Get section thusly:

-----
[MPI get: ompi-nightly-trunk]
mpi_details = OMPI trunk

module = OMPI_Snapshot
ompi_snapshot_url = http://www.open-mpi.org/nightly/trunk
ompi_snapshot_version_file = &getenv("HOME")/mtt-versions/trunk&getenv("MTT_VERSION_FILE_SUFFIX").txt

# Set this MCA param to 0 so that ORTE will not kill a job when a
# process exits cleanly with status 77 (which indicates that the test
# was simply skipped).
setenv = OMPI_MCA_odls_base_exit_status_77_fatal 0
-----

This environment variable is then carried through to the subsequent MPI Install, Test Build, and Test Run phases that derive from this MPI Get.
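
For anyone wondering how the setenv line ends up as a boolean: Open MPI picks up MCA params from environment variables prefixed with OMPI_MCA_. The sketch below only illustrates that int-to-bool mapping under stated assumptions. parse_exit_status_77_fatal() is a hypothetical helper, not part of the MCA param system, which also consults other sources (param files, the mpirun command line). Treating an unset or empty value as the default of 1 is an assumption of this sketch.

```c
#include <stdbool.h>
#include <stdlib.h>

/*
 * Hypothetical helper (NOT the MCA param API): turn the string value of
 * OMPI_MCA_odls_base_exit_status_77_fatal into a bool.
 *
 * Assumption of this sketch: an unset or empty value falls back to the
 * default of 1 (true). Non-numeric text parses as 0 via strtol().
 */
bool parse_exit_status_77_fatal(const char *value)
{
    if (NULL == value || '\0' == *value) {
        return true;   /* default value: 1 */
    }

    /* Any nonzero integer counts as "true", zero as "false". */
    return 0 != strtol(value, NULL, 10);
}
```

It would be called as parse_exit_status_77_fatal(getenv("OMPI_MCA_odls_base_exit_status_77_fatal")); with the MTT setenv line above in effect, the value is "0" and the result is false.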

Begin forwarded message:

> From: <jsquyres_at_[hidden]>
> Subject: [OMPI svn-full] svn:open-mpi r26413
> Date: May 8, 2012 5:49:06 PM EDT
> To: <svn-full_at_[hidden]>
> Reply-To: <devel_at_[hidden]>
>
> Author: jsquyres
> Date: 2012-05-08 17:49:05 EDT (Tue, 08 May 2012)
> New Revision: 26413
> URL: https://svn.open-mpi.org/trac/ompi/changeset/26413
>
> Log:
> ORTE defaults to killing the entire job when any process exits with a
> nonzero status (we polled other MPI implementations since no one in
> the OMPI community had a concrete opinion on what behavior to do here
> -- all other MPI's seem to adhere to this behavior, too).
>
> This commit adds an MCA parameter that allows us to tell ORTE to
> ''not'' kill jobs when a process exits with a status of 77, meaning
> the GNU testing standard of "this test was skipped". In all the OMPI
> tests, all procs will either return 77 or not. So if they all return
> 77, mpirun won't consider it an error, but will still return an exit
> status of 77 (so that MTT can know that the test was cleanly skipped).
>
> Text files modified:
> trunk/orte/mca/odls/base/odls_base_default_fns.c | 13 ++++++++++++-
> trunk/orte/mca/odls/base/odls_base_open.c | 7 ++++++-
> trunk/orte/mca/odls/base/odls_private.h | 4 +++-
> 3 files changed, 21 insertions(+), 3 deletions(-)
>
> Modified: trunk/orte/mca/odls/base/odls_base_default_fns.c
> ==============================================================================
> --- trunk/orte/mca/odls/base/odls_base_default_fns.c (original)
> +++ trunk/orte/mca/odls/base/odls_base_default_fns.c 2012-05-08 17:49:05 EDT (Tue, 08 May 2012)
> @@ -13,7 +13,7 @@
> * Copyright (c) 2011 Oak Ridge National Labs. All rights reserved.
> * Copyright (c) 2011-2012 Los Alamos National Security, LLC.
> * All rights reserved.
> - * Copyright (c) 2011 Cisco Systems, Inc. All rights reserved.
> + * Copyright (c) 2011-2012 Cisco Systems, Inc. All rights reserved.
> * $COPYRIGHT$
> *
> * Additional copyrights may follow
> @@ -2073,6 +2073,17 @@
> state = ORTE_PROC_STATE_CALLED_ABORT;
> goto MOVEON;
> }
> +
> + /* If the exit status of this proc was 77 and the
> + odls_base_exit_status_77_fatal MCA param was set to false,
> + then don't kill the whole job. The rationale is that the
> + GNU testing standards specify that an exit status of 77
> + indicates that a test was skipped -- it should not be
> + treated as a fatal error (to the whole job). */
> + if (!orte_odls_globals.is_exit_status_77_fatal && 77 == proc->exit_code) {
> + state = ORTE_PROC_STATE_WAITPID_FIRED;
> + goto MOVEON;
> + }
>
> /* check to see if a sync was required and if it was received */
> if (proc->registered) {
>
> Modified: trunk/orte/mca/odls/base/odls_base_open.c
> ==============================================================================
> --- trunk/orte/mca/odls/base/odls_base_open.c (original)
> +++ trunk/orte/mca/odls/base/odls_base_open.c 2012-05-08 17:49:05 EDT (Tue, 08 May 2012)
> @@ -10,7 +10,7 @@
> * Copyright (c) 2004-2005 The Regents of the University of California.
> * All rights reserved.
> * Copyright (c) 2010-2011 Oracle and/or its affiliates. All rights reserved.
> - * Copyright (c) 2011 Cisco Systems, Inc. All rights reserved.
> + * Copyright (c) 2011-2012 Cisco Systems, Inc. All rights reserved.
> * Copyright (c) 2011-2012 Los Alamos National Security, LLC.
> * All rights reserved.
> * $COPYRIGHT$
> @@ -103,6 +103,11 @@
> "Time to wait for a process to die after issuing a kill signal to it",
> false, false, 1, &orte_odls_globals.timeout_before_sigkill);
>
> + mca_base_param_reg_int_name("odls", "base_exit_status_77_fatal",
> + "Whether to kill an entire job if any process in that job exits normally with a status of 77 (exit status 77 in the GNU testing standards means \"this test was skipped\", and therefore we wouldn't want to kill the entire job)",
> + false, false, 1, &i);
> + orte_odls_globals.is_exit_status_77_fatal = OPAL_INT_TO_BOOL(i);
> +
> /* initialize the global array of local children */
> orte_local_children = OBJ_NEW(opal_pointer_array_t);
> if (OPAL_SUCCESS != (rc = opal_pointer_array_init(orte_local_children,
>
> Modified: trunk/orte/mca/odls/base/odls_private.h
> ==============================================================================
> --- trunk/orte/mca/odls/base/odls_private.h (original)
> +++ trunk/orte/mca/odls/base/odls_private.h 2012-05-08 17:49:05 EDT (Tue, 08 May 2012)
> @@ -9,7 +9,7 @@
> * University of Stuttgart. All rights reserved.
> * Copyright (c) 2004-2005 The Regents of the University of California.
> * All rights reserved.
> - * Copyright (c) 2011 Cisco Systems, Inc. All rights reserved.
> + * Copyright (c) 2011-2012 Cisco Systems, Inc. All rights reserved.
> * Copyright (c) 2011 Los Alamos National Security, LLC. All rights
> * reserved.
> * $COPYRIGHT$
> @@ -62,6 +62,8 @@
> opal_list_t xterm_ranks;
> /* the xterm cmd to be used */
> char **xtermcmd;
> + /* whether to consider an exit code of 77 fatal to a job or not */
> + bool is_exit_status_77_fatal;
> } orte_odls_globals_t;
>
> ORTE_DECLSPEC extern orte_odls_globals_t orte_odls_globals;

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/