Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] 1.7 rc4 compilation error
From: Edgar Gabriel (gabriel_at_[hidden])
Date: 2012-10-30 10:28:13


ok, so a couple of things.

I still think it is the same issue that I observed 1-2 days ago. Could
you try to remove the fs/lustre component from your compilation, e.g. by
adding an .ompi_ignore file into that directory, and see whether this
fixes the issue?
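
For reference, a minimal sketch of that step. It assumes the usual source layout, with the component under ompi/mca/fs/lustre; the helper name ignore_component is made up for illustration and is not part of Open MPI:

```shell
#!/bin/sh
# Sketch only: drop an .ompi_ignore marker into one MCA component directory.
# ignore_component is a hypothetical helper, not something Open MPI ships.
#   $1 = top of the Open MPI source tree (assumed layout: ompi/mca/<framework>/<component>)
#   $2 = framework/component, e.g. fs/lustre
ignore_component() {
    dir="$1/ompi/mca/$2"
    if [ ! -d "$dir" ]; then
        echo "no such component directory: $dir" >&2
        return 1
    fi
    # The marker is an empty file; only its presence matters.
    touch "$dir/.ompi_ignore"
}
```

From the top of the tree this would be called as `ignore_component . fs/lustre`. As far as I know the marker is picked up when autogen runs, so autogen and configure would have to be rerun before rebuilding.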

I tried compilations on my machine (no lustre, no ib) with
--disable-mpi-io *or* --disable-io-romio, and both worked correctly and
I could run things. Note that the two flags really are different by now:
the second flag is now equivalent to --enable-mca-no-build=io:romio,
while the first flag disables the io, fcoll, fs and sharedfp frameworks
(prior to ompio they had basically the same effect).
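
One way to check which of these frameworks actually ended up in a given build is to filter the ompi_info listing. This is a rough sketch: filter_frameworks is a made-up helper name, and the line format it matches ("MCA <framework>: <component> ...") is an assumption based on typical ompi_info output:

```shell
#!/bin/sh
# Sketch only: keep the ompi_info lines that belong to the named frameworks.
# filter_frameworks is a hypothetical helper; it reads ompi_info output on stdin.
# Assumed line format: "MCA <framework>: <component> (...)".
filter_frameworks() {
    # Join the arguments into an alternation, e.g. "io|fcoll|fs|sharedfp".
    pattern=$(printf '%s|' "$@")
    pattern=${pattern%|}
    grep -E "MCA (${pattern}):"
}
```

It would be used as `ompi_info | filter_frameworks io fcoll fs sharedfp`. Going by the description above, a --disable-mpi-io build should list none of them, while a --disable-io-romio build should still show the ompio pieces but no romio.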

In your particular case this means that you disabled romio, but the
entire ompio stack is still compiled, and the error must come from that
portion. If my suspicion is correct, it is still liblustre
messing around with the malloc hooks, which causes the stack frame to
be completely broken. I thought I had fixed that, since we did not have
the issue on trunk, but we did observe it in the 1.7 branch 1-2 days back
as well, and I was looking into that.

That being said, there is another malloc-hooks issue that makes me a bit
nervous. The compilation of the otf stuff produced a ton of warnings on
my machine with gcc 4.6.2, also with respect to the _malloc_hooks and
_realloc_hooks. Not sure whether this contributed to the problem as
well, I just thought I'd bring it up since we seem to have a corrupted
stack frame problem.

Thanks
Edgar

On 10/30/2012 8:29 AM, Edgar Gabriel wrote:
> ok, I'll look into this. I noticed a problem with static builds on
> lustre file systems recently, and I was wondering whether it's the same
> issue or not. But I'll check what's going on.
>
> Thanks
> Edgar
>
> On 10/30/2012 7:22 AM, Ralph Castain wrote:
>> No to Lustre, and I didn't build static
>>
>> I'm not sure what, if any, parallel file system might be present. In the case that works, I just built with no configure args other than prefix. ompi_info shows both romio and ompio built, but nothing more about what support they built internally.
>>
>>
>> On Oct 30, 2012, at 4:14 AM, Edgar Gabriel <gabriel_at_[hidden]> wrote:
>>
>>> Ralph,
>>>
>>> just out of curiosity: is there a lustre file system on the machine and is
>>> this a static build?
>>>
>>> Thanks
>>> Edgar
>>>
>>> On 10/29/2012 9:17 PM, Ralph Castain wrote:
>>>> Hmmm...I added that directory and tried this on odin (which is an IB-based machine). Any MPI proc segfaults:
>>>>
>>>> Core was generated by `./hello'.
>>>> Program terminated with signal 11, Segmentation fault.
>>>> #0 _sysio_p_validate (pno=0x0, intnt=0x0, path=0x0) at src/inode.c:574
>>>> 574 src/inode.c: No such file or directory.
>>>> in src/inode.c
>>>> (gdb) where
>>>> #0 _sysio_p_validate (pno=0x0, intnt=0x0, path=0x0) at src/inode.c:574
>>>> #1 0x00002aaaabd3f3e9 in _sysio_path_walk (parent=0x0, nd=0x7fffffffd8e0) at src/namei.c:216
>>>> #2 0x00002aaaabd3faad in _sysio_namei (parent=0x0, path=<value optimized out>, flags=0, intnt=0x7fffffffd950, pnop=0x7fffffffd970) at src/namei.c:505
>>>> #3 0x00002aaaabd3fd98 in open (path=0x2aaaac24280f "/sys/devices/system/node", flags=<value optimized out>) at src/open.c:179
>>>> #4 0x00002aaaabd43d5b in opendir (name=0x2aaaac24280f "/sys/devices/system/node") at src/stddir.c:60
>>>> #5 0x00002aaaac241825 in numa_max_node () from /usr/lib64/libnuma.so.1
>>>> #6 0x00002aaaac241d13 in numa_init () from /usr/lib64/libnuma.so.1
>>>> #7 0x00002aaaaaab845b in call_init () from /lib64/ld-linux-x86-64.so.2
>>>> #8 0x00002aaaaaab8565 in _dl_init_internal () from /lib64/ld-linux-x86-64.so.2
>>>> #9 0x00002aaaaaaabaaa in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
>>>> #10 0x0000000000000001 in ?? ()
>>>> #11 0x00007fffffffe03c in ?? ()
>>>> #12 0x0000000000000000 in ?? ()
>>>>
>>>> I got the same thing whether I excluded openib or not. I then ran on my Linux cluster, which doesn't have IB at all - and it ran fine. Also runs clean on the Mac. However, in both those cases, I had left IO romio enabled.
>>>>
>>>> Now on odin, I always disable-io-romio. So I tried deliberately enabling it, and everything works. So this appears to be something that the IO work has broken.
>>>>
>>>> Edgar: can you please fix --disable-io-romio?
>>>>
>>>> Thanks
>>>> Ralph
>>>>
>>>>
>>>>
>>>>
>>>> On Oct 29, 2012, at 11:55 AM, Edgar Gabriel <gabriel_at_[hidden]> wrote:
>>>>
>>>>> I'm sorry to add one more thing to the list, but beyond this file, it
>>>>> looks like the entire ompi/mca/common/verbs/ directory is also
>>>>> missing in the 1.7 branch, but is required to compile the bcol
>>>>> framework. It is there in the trunk, but missing in the 1.7 branch...
>>>>>
>>>>> Thanks
>>>>> Edgar
>>>>>
>>>>>
>>>>> On 10/26/2012 5:31 PM, Ralph Castain wrote:
>>>>>> Okay, I'll fix for tonight's tarball.
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> On Oct 26, 2012, at 3:28 PM, "Shamis, Pavel" <shamisp_at_[hidden]> wrote:
>>>>>>
>>>>>>> There is a bug in the makefile. The file exists in svn, but it is not listed in the Makefile.am. As a result, it wasn't pulled into the tarball.
>>>>>>>
>>>>>>> Pavel (Pasha) Shamis
>>>>>>> ---
>>>>>>> Computer Science Research Group
>>>>>>> Computer Science and Math Division
>>>>>>> Oak Ridge National Laboratory
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Oct 26, 2012, at 2:33 PM, Edgar Gabriel wrote:
>>>>>>>
>>>>>>> we have trouble compiling the 1.7 series on a machine in Dresden.
>>>>>>> Specifically, we receive an error message when compiling the
>>>>>>> bcol/iboffload component (other infiniband components compile fine).
>>>>>>>
>>>>>>> Any idea/suggestions what we might be doing wrong or what to look for?
>>>>>>>
>>>>>>> make[2]: Entering directory
>>>>>>> `/home/h2/gabriel/openmpi-1.7rc4/ompi/mca/bcol/iboffload'
>>>>>>> CC bcol_iboffload_module.lo
>>>>>>> CC bcol_iboffload_mca.lo
>>>>>>> CC bcol_iboffload_endpoint.lo
>>>>>>> CC bcol_iboffload_frag.lo
>>>>>>> In file included from bcol_iboffload_frag.c:16:0:
>>>>>>> bcol_iboffload.h:46:36: fatal error: bcol_iboffload_qp_info.h: No such
>>>>>>> file or directory
>>>>>>> compilation terminated.
>>>>>>> make[2]: *** [bcol_iboffload_frag.lo] Error 1
>>>>>>> make[2]: *** Waiting for unfinished jobs....
>>>>>>> In file included from bcol_iboffload_mca.c:18:0:
>>>>>>> bcol_iboffload.h:46:36: fatal error: bcol_iboffload_qp_info.h: No such
>>>>>>> file or directory
>>>>>>> compilation terminated.
>>>>>>> make[2]: *** [bcol_iboffload_mca.lo] Error 1
>>>>>>> In file included from bcol_iboffload_endpoint.c:23:0:
>>>>>>> bcol_iboffload.h:46:36: fatal error: bcol_iboffload_qp_info.h: No such
>>>>>>> file or directory
>>>>>>> compilation terminated.
>>>>>>> make[2]: *** [bcol_iboffload_endpoint.lo] Error 1
>>>>>>> In file included from bcol_iboffload_module.c:39:0:
>>>>>>> bcol_iboffload.h:46:36: fatal error: bcol_iboffload_qp_info.h: No such
>>>>>>> file or directory
>>>>>>> compilation terminated.
>>>>>>> make[2]: *** [bcol_iboffload_module.lo] Error 1
>>>>>>> make[2]: Leaving directory
>>>>>>> `/home/h2/gabriel/openmpi-1.7rc4/ompi/mca/bcol/iboffload'
>>>>>>> make[1]: *** [all-recursive] Error 1
>>>>>>> make[1]: Leaving directory `/home/h2/gabriel/openmpi-1.7rc4/ompi'
>>>>>>> make: *** [all-recursive] Error 1
>>>>>>>
>>>>>>> Thanks
>>>>>>> Edgar
>>>>>>>
>>>>>>> --
>>>>>>> Edgar Gabriel
>>>>>>> Associate Professor
>>>>>>> Parallel Software Technologies Lab http://pstl.cs.uh.edu
>>>>>>> Department of Computer Science University of Houston
>>>>>>> Philip G. Hoffman Hall, Room 524 Houston, TX-77204, USA
>>>>>>> Tel: +1 (713) 743-3857 Fax: +1 (713) 743-3335
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> devel mailing list
>>>>>>> devel_at_[hidden]<mailto:devel_at_[hidden]>
>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
>
>
>
>

-- 
Edgar Gabriel
Associate Professor
Parallel Software Technologies Lab      http://pstl.cs.uh.edu
Department of Computer Science          University of Houston
Philip G. Hoffman Hall, Room 524        Houston, TX-77204, USA
Tel: +1 (713) 743-3857                  Fax: +1 (713) 743-3335