Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] 1.7 rc4 compilation error
From: Ralph Castain (rhc_at_[hidden])
Date: 2012-10-30 15:06:02


Sure - I can do that.

On Oct 30, 2012, at 11:29 AM, Edgar Gabriel <gabriel_at_[hidden]> wrote:

> glad to hear that. However, since we are also having the problem with
> the lustre-fs module for static builds, I think it would still make
> sense to disable fs/lustre/ for 1.7.0
>
> Edgar
>
> On 10/30/2012 12:34 PM, Ralph Castain wrote:
>> I hate odin :-(
>>
>> FWIW: it all works fine today, no matter how I configure it. No earthly idea what happened.
>>
>> Ignore these droids....
>>
>>
>> On Oct 30, 2012, at 7:28 AM, Edgar Gabriel <gabriel_at_[hidden]> wrote:
>>
>>> ok, so a couple of things.
>>>
>>> I still think it is the same issue that I observed 1-2 days ago. Could
>>> you try to remove the fs/lustre component from your compilation, e.g. by
>>> adding an .ompi_ignore file into that directory, and see whether this
>>> fixes the issue?
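A minimal sketch of the `.ompi_ignore` suggestion above, written as a small Python helper. The marker-file name and the fs/lustre component come from the message; the `ompi/mca/` prefix is inferred from the other paths quoted in this thread, and the helper name `ignore_component` and its arguments are hypothetical. As far as I understand the build system, the marker is only looked at when the tree's autogen step is re-run, so a plain `make` after creating it would presumably not be enough.

```python
from pathlib import Path


def ignore_component(source_tree, framework, component):
    """Create an empty .ompi_ignore marker in one MCA component directory.

    The layout ompi/mca/<framework>/<component> is an assumption based on
    the paths quoted elsewhere in this thread.
    """
    component_dir = Path(source_tree) / "ompi" / "mca" / framework / component
    if not component_dir.is_dir():
        raise FileNotFoundError(f"no such component directory: {component_dir}")
    marker = component_dir / ".ompi_ignore"
    marker.touch()  # create an empty marker file (no-op if it already exists)
    return marker
```

For the case discussed here the call would be something like `ignore_component("openmpi-1.7rc4", "fs", "lustre")`, with the first argument pointing at wherever the source tree actually lives.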
>>>
>>> I tried on my machine (no lustre, no ib) compilations with
>>> --disable-mpi-io *or* --disable-io-romio, and both worked correctly and
>>> I could run things. Note that the two flags now behave differently:
>>> the second flag is equivalent to --enable-mca-no-build=io:romio, while
>>> the first flag disables the io, fcoll, fs and sharedfp frameworks
>>> (prior to ompio they had basically the same effect).
>>>
>>> In your particular case this means that you disabled romio, but the
>>> entire ompio stack is still compiled, and the error must come from that
>>> portion. If my suspicion is correct, it is still liblustre
>>> messing around with the malloc hooks, and that causes the stack frame to
>>> be completely broken. I thought I had fixed that since we did not have the
>>> issue on trunk, but we did observe it in the 1.7 branch 1-2 days back
>>> as well, and I was looking into that.
>>>
>>> That being said, there is another malloc-hooks issue that makes me a bit
>>> nervous. The compilation of the otf stuff produced a ton of warnings on
>>> my machine with gcc 4.6.2, also with respect to the _malloc_hooks and
>>> _realloc_hooks. Not sure whether this contributed to the problem as
>>> well, I just thought I'd bring it up since we seem to have a corrupted
>>> stack frame problem.
>>>
>>> Thanks
>>> Edgar
>>>
>>>
>>> On 10/30/2012 8:29 AM, Edgar Gabriel wrote:
>>>> ok, I'll look into this. I noticed a problem with static builds on
>>>> lustre file systems recently, and I was wondering whether it's the same
>>>> issue or not. But I'll check what's going on.
>>>>
>>>> Thanks
>>>> Edgar
>>>>
>>>> On 10/30/2012 7:22 AM, Ralph Castain wrote:
>>>>> No to Lustre, and I didn't build static
>>>>>
>>>>> I'm not sure what, if any, parallel file system might be present. In the case that works, I just built with no configure args other than prefix. ompi_info shows both romio and ompio built, but nothing more about what support they built internally.
>>>>>
>>>>>
>>>>> On Oct 30, 2012, at 4:14 AM, Edgar Gabriel <gabriel_at_[hidden]> wrote:
>>>>>
>>>>>> Ralph,
>>>>>>
>>>>>> just out of curiosity: is there a lustre file system on the machine and is
>>>>>> this a static build ?
>>>>>>
>>>>>> Thanks
>>>>>> Edgar
>>>>>>
>>>>>> On 10/29/2012 9:17 PM, Ralph Castain wrote:
>>>>>>> Hmmm...I added that directory and tried this on odin (which is an IB-based machine). Any MPI proc segfaults:
>>>>>>>
>>>>>>> Core was generated by `./hello'.
>>>>>>> Program terminated with signal 11, Segmentation fault.
>>>>>>> #0 _sysio_p_validate (pno=0x0, intnt=0x0, path=0x0) at src/inode.c:574
>>>>>>> 574 src/inode.c: No such file or directory.
>>>>>>> in src/inode.c
>>>>>>> (gdb) where
>>>>>>> #0 _sysio_p_validate (pno=0x0, intnt=0x0, path=0x0) at src/inode.c:574
>>>>>>> #1 0x00002aaaabd3f3e9 in _sysio_path_walk (parent=0x0, nd=0x7fffffffd8e0) at src/namei.c:216
>>>>>>> #2 0x00002aaaabd3faad in _sysio_namei (parent=0x0, path=<value optimized out>, flags=0, intnt=0x7fffffffd950, pnop=0x7fffffffd970) at src/namei.c:505
>>>>>>> #3 0x00002aaaabd3fd98 in open (path=0x2aaaac24280f "/sys/devices/system/node", flags=<value optimized out>) at src/open.c:179
>>>>>>> #4 0x00002aaaabd43d5b in opendir (name=0x2aaaac24280f "/sys/devices/system/node") at src/stddir.c:60
>>>>>>> #5 0x00002aaaac241825 in numa_max_node () from /usr/lib64/libnuma.so.1
>>>>>>> #6 0x00002aaaac241d13 in numa_init () from /usr/lib64/libnuma.so.1
>>>>>>> #7 0x00002aaaaaab845b in call_init () from /lib64/ld-linux-x86-64.so.2
>>>>>>> #8 0x00002aaaaaab8565 in _dl_init_internal () from /lib64/ld-linux-x86-64.so.2
>>>>>>> #9 0x00002aaaaaaabaaa in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
>>>>>>> #10 0x0000000000000001 in ?? ()
>>>>>>> #11 0x00007fffffffe03c in ?? ()
>>>>>>> #12 0x0000000000000000 in ?? ()
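To make it easier to see where each frame of the backtrace above comes from, here is a rough Python sketch that splits gdb `where` output into frame number, function, and origin. The regular expression and the name `parse_frames` are assumptions of mine, tuned only to the frame lines quoted here; this is not a general gdb parser.

```python
import re

# One gdb frame line: "#N [0xADDR in ]func (args)[ at file:line | from lib]"
FRAME_RE = re.compile(
    r"#(?P<num>\d+)\s+(?:0x[0-9a-f]+ in )?(?P<func>\S+) \(.*\)"
    r"(?: at (?P<src>\S+)| from (?P<lib>\S+))?\s*$"
)


def parse_frames(backtrace):
    """Yield (frame number, function, origin) for each gdb frame line.

    Lines that do not look like frames (gdb banners, blank lines) are skipped.
    Leading quote markers such as '>>> ' are tolerated because the pattern is
    searched for anywhere in the line.
    """
    for line in backtrace.splitlines():
        match = FRAME_RE.search(line)
        if match:
            # "at file:line" means gdb found source info; "from lib" names
            # a shared object; neither is present for unresolved frames.
            origin = match["src"] or match["lib"] or "unknown"
            yield int(match["num"]), match["func"], origin
```

Run over this trace, it would list `open` and `opendir` with origins `src/open.c:179` and `src/stddir.c:60`, with no shared object named, while the libnuma and ld-linux frames carry their library paths. The trace itself does not say which library supplies those `src/` files.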
>>>>>>>
>>>>>>> I got the same thing whether I excluded openib or not. I then ran on my Linux cluster, which doesn't have IB at all - and it ran fine. Also runs clean on the Mac. However, in both those cases, I had left IO romio enabled.
>>>>>>>
>>>>>>> Now on odin, I always disable-io-romio. So I tried deliberately enabling it, and everything works. So this appears to be something that the IO work has broken.
>>>>>>>
>>>>>>> Edgar: can you please fix --disable-io-romio?
>>>>>>>
>>>>>>> Thanks
>>>>>>> Ralph
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Oct 29, 2012, at 11:55 AM, Edgar Gabriel <gabriel_at_[hidden]> wrote:
>>>>>>>
>>>>>>>> I'm sorry to add one more thing to the list, but beyond this file, it
>>>>>>>> looks like the entire ompi/mca/common/verbs/ directory is also
>>>>>>>> missing in the 1.7 branch, but is required to compile the bcol
>>>>>>>> framework. It is there in the trunk, but missing in the 1.7 branch...
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Edgar
>>>>>>>>
>>>>>>>>
>>>>>>>> On 10/26/2012 5:31 PM, Ralph Castain wrote:
>>>>>>>>> Okay, I'll fix for tonight's tarball.
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>>
>>>>>>>>> On Oct 26, 2012, at 3:28 PM, "Shamis, Pavel" <shamisp_at_[hidden]> wrote:
>>>>>>>>>
>>>>>>>>>> There is a bug in the Makefile. The file exists in svn, but it is not listed in Makefile.am. As a result, it wasn't pulled into the tarball.
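The kind of mismatch described above (a file present in the checkout but never named in Makefile.am) can be looked for with a short script. This is only a sketch under simple assumptions: the function name `unlisted_files` is hypothetical, it looks at a single directory, and it treats any token in Makefile.am as a listed file, so it does not understand variables, path prefixes, or names that appear only in comments.

```python
import re
from pathlib import Path


def unlisted_files(component_dir, suffixes=(".c", ".h")):
    """Return source files present on disk but never named in Makefile.am."""
    component_dir = Path(component_dir)
    makefile_text = (component_dir / "Makefile.am").read_text()
    # Treat every whitespace- or backslash-separated token as a possible
    # file name; this is deliberately crude (see the caveats in the lead-in).
    listed = set(re.split(r"[\s\\]+", makefile_text))
    on_disk = sorted(
        p.name for p in component_dir.iterdir()
        if p.is_file() and p.suffix in suffixes
    )
    return [name for name in on_disk if name not in listed]
```

Pointed at a checkout of the component directory, a header that is in the repository but missing from the file list would show up in the returned list, which is the situation that kept it out of the tarball here.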
>>>>>>>>>>
>>>>>>>>>> Pavel (Pasha) Shamis
>>>>>>>>>> ---
>>>>>>>>>> Computer Science Research Group
>>>>>>>>>> Computer Science and Math Division
>>>>>>>>>> Oak Ridge National Laboratory
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Oct 26, 2012, at 2:33 PM, Edgar Gabriel wrote:
>>>>>>>>>>
>>>>>>>>>> we have trouble compiling the 1.7 series on a machine in Dresden.
>>>>>>>>>> Specifically, we receive an error message when compiling the
>>>>>>>>>> bcol/iboffload component (other infiniband components compile fine).
>>>>>>>>>>
>>>>>>>>>> Any idea/suggestions what we might be doing wrong or what to look for?
>>>>>>>>>>
>>>>>>>>>> make[2]: Entering directory
>>>>>>>>>> `/home/h2/gabriel/openmpi-1.7rc4/ompi/mca/bcol/iboffload'
>>>>>>>>>> CC bcol_iboffload_module.lo
>>>>>>>>>> CC bcol_iboffload_mca.lo
>>>>>>>>>> CC bcol_iboffload_endpoint.lo
>>>>>>>>>> CC bcol_iboffload_frag.lo
>>>>>>>>>> In file included from bcol_iboffload_frag.c:16:0:
>>>>>>>>>> bcol_iboffload.h:46:36: fatal error: bcol_iboffload_qp_info.h: No such
>>>>>>>>>> file or directory
>>>>>>>>>> compilation terminated.
>>>>>>>>>> make[2]: *** [bcol_iboffload_frag.lo] Error 1
>>>>>>>>>> make[2]: *** Waiting for unfinished jobs....
>>>>>>>>>> In file included from bcol_iboffload_mca.c:18:0:
>>>>>>>>>> bcol_iboffload.h:46:36: fatal error: bcol_iboffload_qp_info.h: No such
>>>>>>>>>> file or directory
>>>>>>>>>> compilation terminated.
>>>>>>>>>> make[2]: *** [bcol_iboffload_mca.lo] Error 1
>>>>>>>>>> In file included from bcol_iboffload_endpoint.c:23:0:
>>>>>>>>>> bcol_iboffload.h:46:36: fatal error: bcol_iboffload_qp_info.h: No such
>>>>>>>>>> file or directory
>>>>>>>>>> compilation terminated.
>>>>>>>>>> make[2]: *** [bcol_iboffload_endpoint.lo] Error 1
>>>>>>>>>> In file included from bcol_iboffload_module.c:39:0:
>>>>>>>>>> bcol_iboffload.h:46:36: fatal error: bcol_iboffload_qp_info.h: No such
>>>>>>>>>> file or directory
>>>>>>>>>> compilation terminated.
>>>>>>>>>> make[2]: *** [bcol_iboffload_module.lo] Error 1
>>>>>>>>>> make[2]: Leaving directory
>>>>>>>>>> `/home/h2/gabriel/openmpi-1.7rc4/ompi/mca/bcol/iboffload'
>>>>>>>>>> make[1]: *** [all-recursive] Error 1
>>>>>>>>>> make[1]: Leaving directory `/home/h2/gabriel/openmpi-1.7rc4/ompi'
>>>>>>>>>> make: *** [all-recursive] Error 1
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> Edgar
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Edgar Gabriel
>>>>>>>>>> Associate Professor
>>>>>>>>>> Parallel Software Technologies Lab http://pstl.cs.uh.edu
>>>>>>>>>> Department of Computer Science University of Houston
>>>>>>>>>> Philip G. Hoffman Hall, Room 524 Houston, TX-77204, USA
>>>>>>>>>> Tel: +1 (713) 743-3857 Fax: +1 (713) 743-3335
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> devel mailing list
>>>>>>>>>> devel_at_[hidden]
>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel