Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Hair depleting issue with Ompi143 and one program
From: Dave Goodell (goodell_at_[hidden])
Date: 2011-01-20 18:44:22


I can't speak to what OMPI might be doing to your program, but I have a few suggestions for looking into the Valgrind issues.

Valgrind's "--track-origins=yes" option is usually helpful for figuring out where the uninitialized values came from. However, if I understand you correctly and if you are correct in your assumption that _mm_setzero_ps is not actually zeroing your xEv variable for some reason, then this option will unhelpfully tell you that it was caused by a stack allocation at the entrance to the function where the variable is declared. But it's worth turning on because it's easy to do and it might show you something obvious that you are missing.

The next thing you can do is disable optimization when building your code, in case GCC is taking a shortcut that is either incorrect or just doesn't play nicely with Valgrind. Valgrind might run pretty slowly though, because -O0 code can be really verbose and slow to check.

After that, if you really want to dig in, you can try reading the assembly code that is generated for that _mm_setzero_ps line. The easiest way is to pass "-save-temps" to gcc and it will keep a copy of "sourcefile.s" corresponding to "sourcefile.c". Sometimes "-fverbose-asm" helps, sometimes it makes things harder to follow.
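
To keep the generated assembly short enough to read, it may help to pull the zeroing out into a file of its own. What follows is only a minimal sketch under assumptions: the file name zero_lanes.c and the functions zero_lanes() and print_lanes() are made up for illustration and are not part of HMMER. It also prints the vector by storing its four lanes to a float array first, because handing an __m128 to a "%lld" conversion is not something printf defines, so that debug output can mislead all by itself.

```c
#include <stdio.h>
#include <xmmintrin.h>   /* SSE1 intrinsics: __m128, _mm_setzero_ps, _mm_storeu_ps */

/* Hypothetical stand-in for the start of forward_engine(): zero a vector
 * the same way fwdback.c does, then copy its four lanes out to memory.
 * Not static, so the function survives in the .s file when built with -c. */
void zero_lanes(float out[4])
{
    __m128 xEv = _mm_setzero_ps();

    /* Unaligned store, so "out" needs no special alignment. */
    _mm_storeu_ps(out, xEv);
}

/* Print the four lanes as floats; each should read 0 if the zeroing worked. */
void print_lanes(const char *label, const float lanes[4])
{
    fprintf(stderr, "DEBUG %s %g %g %g %g\n",
            label, lanes[0], lanes[1], lanes[2], lanes[3]);
    fflush(stderr);
}
```

Building it with the same flags as the real module plus "-save-temps" (for example "gcc -std=gnu99 -O1 -g -m32 -msse -mno-sse2 -save-temps -c zero_lanes.c") should leave a zero_lanes.s that is only a few dozen lines long. I would expect to see a single register-clearing instruction such as xorps there, but that is a guess about what your GCC emits, not something I have checked.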

And the last semi-desperate step is to dig into what Valgrind thinks is going on. You'll want to read up on how memcheck really works [1] before doing this. Then read up on client requests [2,3]. You can then use the VALGRIND_GET_VBITS client request on your xEv variable in order to see which parts of the variable Valgrind thinks are undefined. If the vbits don't match with what you expect, there's a chance that you might have found a bug in Valgrind itself. It doesn't happen often, but the SSE code can be complicated and isn't exercised as often as the non-vector portions of Valgrind.
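
A rough sketch of that last step, under assumptions: dump_vbits() is a hypothetical helper, not something Valgrind or HMMER provides, and it presumes the Valgrind development headers are installed where the compiler can find them. Since xEv is declared "register" you cannot take its address directly; you would either drop the "register" keyword temporarily or store the vector to a local float[4] with _mm_storeu_ps and inspect that copy, as memcheck carries the definedness bits along with the copy.

```c
#include <stddef.h>
#include <stdio.h>

/* Use the real header when the compiler can find it (the __has_include test
 * needs a reasonably recent GCC or Clang); otherwise build a no-op version. */
#if defined(__has_include)
#  if __has_include(<valgrind/memcheck.h>)
#    include <valgrind/memcheck.h>
#    define HAVE_MEMCHECK_H 1
#  endif
#endif

/* Hypothetical helper: print the validity bits memcheck holds for the
 * first n bytes at p (at most 64). A set bit means "undefined", so 00 is
 * a fully defined byte and ff a fully undefined one.
 * Returns the client request result: 0 when not running under Valgrind,
 * 1 on success, 3 if part of the memory is not addressable. */
int dump_vbits(const char *label, const void *p, size_t n)
{
    unsigned char vbits[64] = { 0 };
    int rc = 0;
    size_t i;

    if (n > sizeof vbits)
        n = sizeof vbits;

#ifdef HAVE_MEMCHECK_H
    rc = (int)VALGRIND_GET_VBITS(p, vbits, n);
#else
    (void)p;   /* header not found: behave as if not under Valgrind */
#endif

    if (rc != 1) {
        fprintf(stderr, "%s: vbits unavailable (rc=%d)\n", label, rc);
        return rc;
    }

    fprintf(stderr, "%s vbits:", label);
    for (i = 0; i < n; i++)
        fprintf(stderr, " %02x", vbits[i]);
    fputc('\n', stderr);
    return rc;
}
```

Called as dump_vbits("xEv", lanes, sizeof lanes) right after the _mm_setzero_ps line, a run under Valgrind should print sixteen 00 bytes if memcheck agrees the vector is defined. Anything else narrows down which lanes it considers undefined.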

Good luck,
-Dave

[1] http://valgrind.org/docs/manual/mc-manual.html#mc-manual.machine
[2] http://valgrind.org/docs/manual/manual-core-adv.html#manual-core-adv.clientreq
[3] http://valgrind.org/docs/manual/mc-manual.html#mc-manual.clientreqs

On Jan 20, 2011, at 5:07 PM CST, David Mathog wrote:

> I have been working on slightly modifying a software package by Sean
> Eddy called Hmmer 3. The hardware acceleration was originally SSE2 but
> since most of our compute nodes only have SSE1 and MMX I rewrote a few
> small sections to just use those instructions. (And yes, as far as I
> can tell it invokes emms before any floating point operations are run
> after each MMX usage.) On top of that each binary has 3 options for
> running the programs: single threaded, threaded, or MPI (using
> Ompi143). For all other programs in this package everything works
> everywhere. For one called "jackhmmer" this table results (+=runs
> correctly, - = problems), where the exact same problem is run in each
> test (theoretically exercising exactly the same routines, just under
> different threading control):
>
>            SSE2  SSE1
> Single      +     +
> Threaded    +     +
> Ompi143     +     -
>
> The negative result for the SSE1/Ompi143 combination happens whether the
> worker nodes are Athlon MP (SSE1 only) or Athlon64. The test machine
> for the single and threaded runs is a two CPU Opteron 280 (4 cores
> total). Ompi143 is 32 bit everywhere (local copies though). There have
> been no modifications whatsoever made to the main jackhmmer.c file,
> which is where the various run methods are implemented.
>
> Now if there was some intrinsic problem with my SSE1 code it should
> presumably manifest in both the Single and Threaded versions as well
> (the thread control is different, but they all feed through the same
> underlying functions), or in one of the other programs, which isn't
> seen. Running under valgrind using Single or Threaded produces no
> warnings. Using mpirun with valgrind on the SSE2 build produces 3: two
> related to OMPI itself which are seen in every OMPI program run in
> valgrind, and one caused by an MPI_Send operation where the buffer
> contains some uninitialized data (this is nothing toxic, just bytes in
> fixed length fields which were never set because a shorter string
> is stored there).
>
> ==19802== Syscall param writev(vector[...]) points to uninitialised byte(s)
> ==19802== at 0x4C77AC1: writev (in /lib/libc-2.10.1.so)
> ==19802== by 0x8A069B5: mca_btl_tcp_frag_send (in
> /opt/ompi143.X32/lib/openmpi/mca_btl_tcp.so)
> ==19802== by 0x8A0626E: mca_btl_tcp_endpoint_send (in
> /opt/ompi143.X32/lib/openmpi/mca_btl_tcp.so)
> ==19802== by 0x8A01ADC: mca_btl_tcp_send (in
> /opt/ompi143.X32/lib/openmpi/mca_btl_tcp.so)
> ==19802== by 0x7FA24A9: mca_pml_ob1_send_request_start_prepare (in
> /opt/ompi143.X32/lib/openmpi/mca_pml_ob1.so)
> ==19802== by 0x7F98443: mca_pml_ob1_send (in
> /opt/ompi143.X32/lib/openmpi/mca_pml_ob1.so)
> ==19802== by 0x4A8530F: PMPI_Send (in
> /opt/ompi143.X32/lib/libmpi.so.0.0.2)
> ==19802== by 0x808D5F2: p7_oprofile_MPISend (mpi.c:101)
> ==19802== by 0x805762E: main (jackhmmer.c:1149)
> ==19802== Address 0x770bc9d is 15,101 bytes inside a block of size
> 15,389 alloc'd
> ==19802== at 0x49E3A12: realloc (vg_replace_malloc.c:476)
> ==19802== by 0x808D4E3: p7_oprofile_MPISend (mpi.c:88)
> ==19802== by 0x805762E: main (jackhmmer.c:1149)
>
> Do that for the SSE1 version and the same 3 errors are seen, plus many
> more like the following:
>
> ==9416== Conditional jump or move depends on uninitialised value(s)
> ==9416== at 0x807FE3E: forward_engine (fwdback.c:420)
> ==9416== by 0x8080051: p7_ForwardParser (fwdback.c:143)
> ==9416== by 0x806C3CC: p7_Pipeline (p7_pipeline.c:590)
> ==9416== by 0x80564F0: main (jackhmmer.c:1426)
>
> Unfortunately this makes absolutely no sense. Line 420 is
>
> if (xE > 1.0e4)
>
> which tells us that xE wasn't set (fine), so probing for uninitialized
> values with statements like:
>
> fprintf(stderr,"DEBUG xEv %lld\n",xEv);fflush(stderr);
>
> (each of which generates its own uninitialized value message) the first
> uninitialized variable appears very early in the code after this
> _mm_setzero_ps:
>
> register __m128 xEv;
> //other stuff that does not touch xEv
> xEv = _mm_setzero_ps();
>
> Now this is hair pulling for many reasons. The first is that nothing of
> substance was changed in this file (just some #defines that
> resolve to the same values as they had originally). The second is that
> this is an SSE1 operation even in the original unmodified code. The
> third is that it just isn't possible for xEv to be uninitialized after
> that statement - yet it is. (Valgrind with --smc-check=all turns up
> nothing more than leaving out that parameter.) Here is the relevant
> section in xmmintrin.h:
>
> /* Create a vector of zeros. */
> extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__,
> __artificial__))
> _mm_setzero_ps (void)
> {
>   return __extension__ (__m128){ 0.0f, 0.0f, 0.0f, 0.0f };
> }
>
> Of course all of this nonsense is happening on a worker node, which
> isn't making getting to the root of the problem any easier.
>
> The module where these uninitialized variables are seen was compiled like this:
>
> mpicc -std=gnu99 -O1 -g -m32 -pthread -msse -mno-sse2 -DHAVE_CONFIG_H
> -I../../easel -I../../easel -I. -I.. -I. -I../../src -o fwdback.o -c
> fwdback.c
>
> Building it on a 64 bit machine (that's why the -m32 is there) or a 32
> bit machine gives the same result.
>
> If any of you have seen something like this before and can suggest a way
> to proceed I would be very grateful.
>
> Thanks,
>
> David Mathog
> mathog_at_[hidden]
> Manager, Sequence Analysis Facility, Biology Division, Caltech