Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] [EXTERNAL] OpenSHMEM round 2
From: Joshua Ladd (joshual_at_[hidden])
Date: 2013-08-14 19:02:32


Thanks, Ralph. We'll have a look. Admittedly, we've done little testing with the tcp BTL. I was under the impression that the yoda interface could work with all BTLs, but it seems we need more testing. It certainly works with the SM and OpenIB BTLs.

Josh

-----Original Message-----
From: devel [mailto:devel-bounces_at_[hidden]] On Behalf Of Ralph Castain
Sent: Wednesday, August 14, 2013 6:13 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] [EXTERNAL] OpenSHMEM round 2

Here's the backtrace:

(gdb) where
#0 0x0000000000000000 in ?? ()
#1 0x00007fac6b8d8921 in mca_bml_base_get (bml_btl=0x239a130, des=0x220e880) at ../../../../ompi/mca/bml/bml.h:326
#2 0x00007fac6b8db767 in mca_spml_yoda_get (src_addr=0x601500, size=4, dst_addr=0x7fff3b00b370, src=1) at spml_yoda.c:1091
#3 0x00007fac6f1ea56d in shmem_int_g (addr=0x601500, pe=1) at shmem_g.c:47
#4 0x0000000000400bc7 in main ()
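
A frame #0 at address 0x0000000000000000 is what a call through a NULL function pointer usually looks like. One plausible reading, not confirmed anywhere in this thread, is that the transport selected for the remote PE leaves its one-sided "get" entry unset and the caller invokes it without checking. The following is a minimal sketch of that failure mode and a defensive check. All names in it (`struct transport`, `transport_get`, `copy_get`, the `XFER_*` codes) are hypothetical stand-ins and are not the Open MPI BML/BTL API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical return codes for this sketch only. */
enum { XFER_OK = 0, XFER_NOT_SUPPORTED = -1 };

/* Hypothetical signature of a one-sided "get" operation. */
typedef int (*get_fn_t)(void *dst, const void *src, size_t len);

/* Hypothetical transport descriptor: each transport fills in the
 * operations it supports and leaves the others NULL. */
struct transport {
    const char *name;
    get_fn_t    get;   /* NULL when the transport has no one-sided get */
};

/* A trivial "get" that copies within the local address space. */
static int copy_get(void *dst, const void *src, size_t len)
{
    assert(dst != NULL && src != NULL);
    memcpy(dst, src, len);
    return XFER_OK;
}

/* Calling t->get directly when it is NULL jumps to address 0, which
 * shows up in a debugger as "#0 0x0000000000000000 in ?? ()".
 * Checking first turns the crash into an error the caller can handle,
 * for example by falling back to a send/receive based path. */
static int transport_get(const struct transport *t,
                         void *dst, const void *src, size_t len)
{
    if (t == NULL || t->get == NULL) {
        return XFER_NOT_SUPPORTED;
    }
    return t->get(dst, src, len);
}
```

With a descriptor such as `{ "no-get", NULL }`, `transport_get` returns `XFER_NOT_SUPPORTED` instead of faulting. Whether the real code should report an error or emulate the get over another path is a design decision this sketch does not try to settle.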

On Aug 14, 2013, at 3:12 PM, Ralph Castain <rhc_at_[hidden]> wrote:

> Hmmm...well, it works fine as long as the procs are on the same node. However, if they are on different nodes, it segfaults:
>
> [rhc_at_bend002 shmem]$ shmemrun -npernode 1 ./test_shmem
> running on bend001
> running on bend002
> [bend001:06590] *** Process received signal ***
> [bend001:06590] Signal: Segmentation fault (11)
> [bend001:06590] Signal code: Address not mapped (1)
> [bend001:06590] Failing at address: (nil)
> [bend001:06590] [ 0] /lib64/libpthread.so.0() [0x307d40f500]
> [bend001:06590] *** End of error message ***
> [bend002][[62090,1],1][btl_tcp_frag.c:219:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> --------------------------------------------------------------------------
> shmemrun noticed that process rank 0 with PID 6590 on node bend001 exited on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
>
> I would have thought it should work in that situation - yes?
>
>
> On Aug 14, 2013, at 2:52 PM, Joshua Ladd <joshual_at_[hidden]> wrote:
>
>> The attached simple test code exercises the following calls:
>>
>> start_pes()
>>
>> shmalloc()
>>
>> shmem_int_get()
>>
>> shmem_int_put()
>>
>> shmem_barrier_all()
>>
>> To compile:
>>
>> shmemcc test_shmem.c -o test_shmem
>>
>> To launch:
>>
>> shmemrun -np 2 test_shmem
>>
>> or for those who prefer to launch with SLURM
>>
>> srun -n 2 test_shmem
>>
>> Josh
>>
>>
>> -----Original Message-----
>> From: devel [mailto:devel-bounces_at_[hidden]] On Behalf Of Ralph Castain
>> Sent: Wednesday, August 14, 2013 5:32 PM
>> To: Open MPI Developers
>> Subject: Re: [OMPI devel] [EXTERNAL] OpenSHMEM round 2
>>
>> Can you point me to a test program that would exercise it? I'd like to give it a try first.
>>
>> I'm okay with it being on by default, since it builds its own separate
>> library, and I'm okay with the RFC.
>>
>> On Aug 14, 2013, at 2:03 PM, "Barrett, Brian W" <bwbarre_at_[hidden]> wrote:
>>
>>> Josh -
>>>
>>> In general, I don't have a strong opinion of whether OpenSHMEM is on
>>> by default or not. It might cause unexpected behavior for some
>>> users (like on Crays, where one should really use Cray's SHMEM), but
>>> maybe it's better on other platforms.
>>>
>>> I also would have no objection to the RFC, provided the segfaults I
>>> found get resolved.
>>>
>>> Brian
>>>
>>> On 8/14/13 2:08 PM, "Joshua Ladd" <joshual_at_[hidden]> wrote:
>>>
>>>> Ralph, and Brian
>>>>
>>>> Thanks a bunch for taking the time to review this. It is extremely
>>>> helpful. Let me comment on the building of OSHMEM and solicit some
>>>> feedback from you guys (along with the rest of the community).
>>>> Originally we had planned to build OSHMEM only if the
>>>> '--with-oshmem' flag was passed at configure time. However,
>>>> (unbeknownst to me) this behavior was changed, and OSHMEM is now
>>>> built by default. So yes, Ralph, this is the intended behavior now.
>>>> I am wondering whether this is such a good idea. Do folks have a
>>>> strong opinion on this one way or the other? From my perspective,
>>>> I can see arguments for both sides.
>>>>
>>>> Other than cleaning up the warnings and resolving the segfault that
>>>> Brian observed, are we on a good course to getting this upstream? Is
>>>> it reasonable to file an RFC for three weeks out?
>>>>
>>>> Josh
>>>>
>>>> -----Original Message-----
>>>> From: devel [mailto:devel-bounces_at_[hidden]] On Behalf Of Barrett, Brian W
>>>> Sent: Sunday, August 11, 2013 1:42 PM
>>>> To: Open MPI Developers
>>>> Subject: Re: [OMPI devel] [EXTERNAL] OpenSHMEM round 2
>>>>
>>>> Ralph -
>>>>
>>>> I think those warnings are just because of when they last synced
>>>> with the trunk; it looks like they haven't updated in the last
>>>> week, when those (and some usnic fixes) went in.
>>>>
>>>> More concerning is the --enable-picky stuff and the disabling of
>>>> SHMEM in the right places.
>>>>
>>>> Brian
>>>>
>>>> On 8/11/13 11:24 AM, "Ralph Castain" <rhc_at_[hidden]> wrote:
>>>>
>>>>> Turning off the enable_picky, I get it to compile with the
>>>>> following warnings:
>>>>>
>>>>> pget_elements_x_f.c:70: warning: no previous prototype for 'ompi_get_elements_x_f'
>>>>> pstatus_set_elements_x_f.c:70: warning: no previous prototype for 'ompi_status_set_elements_x_f'
>>>>> ptype_get_extent_x_f.c:69: warning: no previous prototype for 'ompi_type_get_extent_x_f'
>>>>> ptype_get_true_extent_x_f.c:69: warning: no previous prototype for 'ompi_type_get_true_extent_x_f'
>>>>> ptype_size_x_f.c:69: warning: no previous prototype for 'ompi_type_size_x_f'
>>>>>
>>>>> I also found that OpenShmem is still building by default. Is that
>>>>> intended? I thought you were only going to build if --with-shmem
>>>>> (or whatever option) was given.
>>>>>
>>>>> Looks like some cleanup is required
>>>>>
>>>>> On Aug 10, 2013, at 8:54 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>>>
>>>>>> FWIW, I couldn't get it to build - this is on a simple Xeon-based
>>>>>> system under CentOS 6.2:
>>>>>>
>>>>>> cc1: warnings being treated as errors
>>>>>> spml_yoda_getreq.c: In function 'mca_spml_yoda_get_completion':
>>>>>> spml_yoda_getreq.c:98: error: pointer targets in passing argument 1 of 'opal_atomic_add_32' differ in signedness
>>>>>> ../../../../opal/include/opal/sys/amd64/atomic.h:174: note: expected 'volatile int32_t *' but argument is of type 'uint32_t *'
>>>>>> spml_yoda_getreq.c:98: error: signed and unsigned type in conditional expression
>>>>>> cc1: warnings being treated as errors
>>>>>> spml_yoda_putreq.c: In function 'mca_spml_yoda_put_completion':
>>>>>> spml_yoda_putreq.c:81: error: pointer targets in passing argument 1 of 'opal_atomic_add_32' differ in signedness
>>>>>> ../../../../opal/include/opal/sys/amd64/atomic.h:174: note: expected 'volatile int32_t *' but argument is of type 'uint32_t *'
>>>>>> spml_yoda_putreq.c:81: error: signed and unsigned type in conditional expression
>>>>>> make[2]: *** [spml_yoda_getreq.lo] Error 1
>>>>>> make[2]: *** Waiting for unfinished jobs....
>>>>>> make[2]: *** [spml_yoda_putreq.lo] Error 1
>>>>>> cc1: warnings being treated as errors
>>>>>> spml_yoda.c: In function 'mca_spml_yoda_put_internal':
>>>>>> spml_yoda.c:725: error: pointer targets in passing argument 1 of 'opal_atomic_add_32' differ in signedness
>>>>>> ../../../../opal/include/opal/sys/amd64/atomic.h:174: note: expected 'volatile int32_t *' but argument is of type 'uint32_t *'
>>>>>> spml_yoda.c:725: error: signed and unsigned type in conditional expression
>>>>>> spml_yoda.c: In function 'mca_spml_yoda_get':
>>>>>> spml_yoda.c:1107: error: pointer targets in passing argument 1 of 'opal_atomic_add_32' differ in signedness
>>>>>> ../../../../opal/include/opal/sys/amd64/atomic.h:174: note: expected 'volatile int32_t *' but argument is of type 'uint32_t *'
>>>>>> spml_yoda.c:1107: error: signed and unsigned type in conditional expression
>>>>>> make[2]: *** [spml_yoda.lo] Error 1
>>>>>> make[1]: *** [all-recursive] Error 1
>>>>>>
>>>>>> Only configure arguments:
>>>>>>
>>>>>> enable_picky=yes
>>>>>> enable_debug=yes
>>>>>>
>>>>>>
>>>>>> gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-3)
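
The errors above all have the same shape: a `uint32_t *` is passed where the compiler note says `'volatile int32_t *'` is expected, and the picky build promotes the resulting warning to an error. One common way to resolve this kind of mismatch is to declare the counter with the signed type the atomic helper expects, rather than casting at each call site. The sketch below illustrates that idea only. `atomic_add_32` is a hypothetical stand-in built on the GCC/Clang `__sync_add_and_fetch` builtin, not the real `opal_atomic_add_32`, and `struct request_counter` and its helpers are invented for illustration. It is not necessarily how the OSHMEM authors fixed the code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in that takes the parameter type quoted in the
 * compiler note ('volatile int32_t *'). It relies on a GCC/Clang
 * builtin and returns the value after the addition. */
static int32_t atomic_add_32(volatile int32_t *addr, int32_t delta)
{
    return __sync_add_and_fetch(addr, delta);
}

/* Hypothetical bookkeeping structure. Declaring the counter as int32_t
 * (not uint32_t) lets its address be passed without a signedness
 * mismatch, so the code stays clean when warnings are errors. */
struct request_counter {
    volatile int32_t pending;
};

/* Record that one more request is in flight; returns the new count. */
static int32_t request_started(struct request_counter *c)
{
    return atomic_add_32(&c->pending, 1);
}

/* Record that one request completed; returns the new count. */
static int32_t request_completed(struct request_counter *c)
{
    int32_t now = atomic_add_32(&c->pending, -1);
    assert(now >= 0);   /* a negative count means unbalanced calls */
    return now;
}
```

A signed counter also makes an accidental extra decrement visible as a negative value, whereas an unsigned one would silently wrap around.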
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Aug 10, 2013, at 7:21 PM, "Barrett, Brian W" <bwbarre_at_[hidden]> wrote:
>>>>>>
>>>>>>> On 8/6/13 10:30 AM, "Joshua Ladd" <joshual_at_[hidden]> wrote:
>>>>>>>
>>>>>>>> Dear OMPI Community,
>>>>>>>>
>>>>>>>> Please find on Bitbucket the latest round of OSHMEM changes
>>>>>>>> based on community feedback. Please git and test at your leisure.
>>>>>>>>
>>>>>>>> https://bitbucket.org/jladd_math/mlnx-oshmem.git
>>>>>>>
>>>>>>> Josh -
>>>>>>>
>>>>>>> In general, I think everything looks ok. However, the "right"
>>>>>>> thing doesn't happen if the CM PML is used (at least, when using
>>>>>>> the Portals 4 MTL). When configured with:
>>>>>>>
>>>>>>> ./configure
>>>>>>> --enable-mca-no-build=pml-ob1,pml-bfo,pml-v,btl,bml,mpool
>>>>>>>
>>>>>>> The build segfaults trying to run a SHMEM program:
>>>>>>>
>>>>>>> mpirun -np 2 ./bcast
>>>>>>> [shannon:90397] *** Process received signal ***
>>>>>>> [shannon:90397] Signal: Segmentation fault (11)
>>>>>>> [shannon:90397] Signal code: Address not mapped (1)
>>>>>>> [shannon:90397] Failing at address: (nil)
>>>>>>> [shannon:90398] *** Process received signal ***
>>>>>>> [shannon:90398] Signal: Segmentation fault (11)
>>>>>>> [shannon:90398] Signal code: Address not mapped (1)
>>>>>>> [shannon:90398] Failing at address: (nil)
>>>>>>> [shannon:90397] [ 0] /lib64/libpthread.so.0() [0x38b7a0f4a0]
>>>>>>> [shannon:90397] *** End of error message ***
>>>>>>> [shannon:90398] [ 0] /lib64/libpthread.so.0() [0x38b7a0f4a0]
>>>>>>> [shannon:90398] *** End of error message ***
>>>>>>>
>>>>>>> --------------------------------------------------------------------------
>>>>>>> mpirun noticed that process rank 1 with PID 90398 on node
>>>>>>> shannon exited on signal 11 (Segmentation fault).
>>>>>>> --------------------------------------------------------------------------
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Brian
>>>>>>>
>>>>>>> --
>>>>>>> Brian W. Barrett
>>>>>>> Scalable System Software Group
>>>>>>> Sandia National Laboratories
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> devel mailing list
>>>>>>> devel_at_[hidden]
>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Brian W. Barrett
>>>> Scalable System Software Group
>>>> Sandia National Laboratories
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Brian W. Barrett
>>> Scalable System Software Group
>>> Sandia National Laboratories
>>>
>>>
>>>
>>
>> <test_shmem.c>
>
