Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] oshmem test suite errors
From: Brian Barrett (brian_at_[hidden])
Date: 2014-02-20 11:48:14


On Feb 20, 2014, at 7:10 AM, Jeff Squyres (jsquyres) <jsquyres_at_[hidden]> wrote:

> For all of these, I'm using the openshmem test suite that is now committed to the ompi-svn SVN repo. I don't know if the errors are with the tests or with oshmem itself.
>
> 1. I'm running the oshmem test suite at 32 processes across 2 16-core servers. I'm seeing a segv in "examples/shmem_2dheat.x 10 10". It seems to run fine at lower np values such as 2, 4, and 8; I didn't try to determine where the crossover to badness occurs.

My memory is bad and my notes are on a machine I no longer have access to, but this is the change I made to the test suite for the Portals SHMEM runs:

Index: shmem_2dheat.c
===================================================================
--- shmem_2dheat.c (revision 270)
+++ shmem_2dheat.c (revision 271)
@@ -129,6 +129,11 @@
   p = _num_pes ();
   my_rank = _my_pe ();
 
+  if (p > 8) {
+    fprintf(stderr, "Ignoring test when run with more than 8 pes\n");
+    return 77;
+  }
+
   /* argument processing done by everyone */
   int c, errflg;
   extern char *optarg;

The commit comment said there was a scaling issue in the test code itself; I just wish I could remember exactly what it was.
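
For illustration only, the guard in that patch can be sketched as a standalone helper. This assumes the test harness follows the Automake convention that exit status 77 means "skipped" rather than "failed". The names TEST_SKIP, skip_if_too_many_pes, and max_pes are hypothetical and do not appear in the test suite.

```c
#include <assert.h>
#include <stdio.h>

/* Exit status that Automake-style harnesses report as SKIP (assumed here). */
#define TEST_SKIP 77

/* Hypothetical helper: returns TEST_SKIP when the job has more PEs than
 * the test is known to handle, and 0 when the test should run. */
static int skip_if_too_many_pes(int npes, int max_pes)
{
    assert(max_pes > 0);
    if (npes > max_pes) {
        fprintf(stderr, "Ignoring test when run with more than %d pes\n",
                max_pes);
        return TEST_SKIP;
    }
    return 0;
}
```

A test would call this right after querying the PE count and return the result from main() if it is nonzero, so a 32-PE run reports a skip instead of a segv.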

> 2. "examples/adjacent_32bit_amo.x 10 10" seems to hang with both tcp and usnic BTLs, even when running at np=2 (I let it run for several minutes before killing it).

If atomics aren't fast, this test can run for a very long time. It also takes no arguments, so the "10 10" is being ignored. It's essentially looking for a race by blasting 32-bit atomic ops at both halves of a 64-bit word.
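
As a rough single-process analogue of that idea (not the actual test, which uses SHMEM atomics across PEs), the sketch below uses C11 atomics and POSIX threads. Two threads each increment one 32-bit half of the same aligned 64-bit word, and the check passes only if neither half lost or gained updates. All names here (adjacent_pair, hammer, run_adjacent_amo_check) are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

/* Two 32-bit atomics packed into one 8-byte-aligned 64-bit word. */
typedef struct {
    _Alignas(8) _Atomic uint32_t half[2];
} adjacent_pair;

/* Assumption: a 32-bit atomic occupies exactly 4 bytes on this platform. */
static_assert(sizeof(adjacent_pair) == 8, "halves must share one 64-bit word");

typedef struct {
    adjacent_pair *pair;  /* shared word */
    int which;            /* 0 = first half, 1 = second half */
    uint32_t iters;       /* number of atomic increments to issue */
} worker_arg;

/* Each worker repeatedly does a 32-bit atomic add on its own half. */
static void *hammer(void *p)
{
    worker_arg *a = p;
    for (uint32_t i = 0; i < a->iters; i++)
        atomic_fetch_add(&a->pair->half[a->which], 1u);
    return NULL;
}

/* Returns 0 if both halves end at iters, 1 if an update bled into or was
 * lost from a neighbour, -1 if a thread could not be created. */
static int run_adjacent_amo_check(uint32_t iters)
{
    adjacent_pair pair;
    atomic_init(&pair.half[0], 0);
    atomic_init(&pair.half[1], 0);

    worker_arg args[2] = { { &pair, 0, iters }, { &pair, 1, iters } };
    pthread_t tid[2];

    if (pthread_create(&tid[0], NULL, hammer, &args[0]) != 0)
        return -1;
    if (pthread_create(&tid[1], NULL, hammer, &args[1]) != 0) {
        pthread_join(tid[0], NULL);
        return -1;
    }
    pthread_join(tid[0], NULL);
    pthread_join(tid[1], NULL);

    return (atomic_load(&pair.half[0]) == iters &&
            atomic_load(&pair.half[1]) == iters) ? 0 : 1;
}
```

With hardware atomics this finishes quickly. If each 32-bit atomic is emulated slowly, as it may be over a network transport, the same number of operations can take long enough to look like a hang.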

> 3. Ditto for "examples/ptp.x 10 10".
>
> 4. "examples/shmem_matrix.x 10 10" seems to run fine at np=32 on usnic, but hangs with TCP (i.e., I let it run for 8+ minutes before killing it -- perhaps it would have finished eventually?).
>
> ...there are more results (more timeouts and more failures), but they're not yet complete, and I've got to keep working on my own features for v1.7.5, so I need to move to other things right now.

These start to sound like issues in the code; those last two are pretty decent tests.

> I think I have oshmem running well enough to add these to Cisco's nightly MTT runs now, so the results will start showing up there without needing my manual attention.

Woot.

Brian

-- 
 Brian Barrett
 There is an art . . . to flying. The knack lies in learning how to
 throw yourself at the ground and miss.
     Douglas Adams, 'The Hitchhikers Guide to the Galaxy'