
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] openmpi configuration error?
From: Gus Correa (gus_at_[hidden])
Date: 2014-05-21 17:02:50


Hi Ben

One of the ranks (52) called MPI_Abort.
This may be a bug in the code, or a problem with the setup
(e.g. a missing or incorrect input file).
For instance, the CCTM Wiki says:
"AERO6 expects emissions inputs for 13 new PM species. CCTM will crash
if any emitted PM species is not included in the emissions input file"
I am not familiar with CCTM, so these are just guesses.

It doesn't look like an MPI problem, though.

You may want to check any other logs that the CCTM code may
produce, for any clue on where it fails.
Otherwise, you could compile with -g -traceback (and remove any
optimization options in FFLAGS, FCFLAGS, CFLAGS, etc.)
The code may also have a -DDEBUG or similar macro that can be turned
on via CPPFLAGS, which in many models produces a more verbose log.
This *may* tell you where it fails (source file, subroutine and line),
and may help understand why it fails.
If it dumps a core file, you can trace the failure point with
a debugger.
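As a sketch of that workflow (the make variables, log text, and binary
name below are assumptions based on common conventions, not CCTM
specifics; adapt them to CCTM's own build scripts):

```shell
# Hypothetical rebuild with symbols and tracebacks, optimization off:
#   make clean && make FFLAGS="-g -traceback -O0" CFLAGS="-g -O0"

# Pull the aborting rank out of the mpiexec output
# (the log line here is copied from the error below):
log='MPI_ABORT was invoked on rank 52 in communicator MPI_COMM_WORLD'
rank=$(printf '%s\n' "$log" | sed -n 's/.*rank \([0-9][0-9]*\).*/\1/p')
echo "aborting rank: $rank"

# If a core file was dumped, get a backtrace from it:
#   gdb ./CCTM_V5g_Linux2_x core -ex bt -ex quit
```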

I hope this helps,
Gus

On 05/21/2014 03:20 PM, Ben Lash wrote:
> I used a different build of netcdf 4.1.3, and the code seems to run now.
> I have a totally different, non-mpi related error in part of it, but
> there's no way for the list to help, I mostly just wanted to report that
> this particular problem seems to be solved for the record. It doesn't
> seem to fail quite as gracefully anymore, but I'm still getting enough
> of the error messages to know what's going on.
>
> MPI_ABORT was invoked on rank 52 in communicator MPI_COMM_WORLD
> with errorcode 0.
>
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
> --------------------------------------------------------------------------
> [cn-099.davinci.rice.edu:26185]
> [[63355,0],4]-[[63355,1],52] mca_oob_tcp_msg_recv: readv failed:
> Connection reset by peer (104)
> [cn-099.davinci.rice.edu:26185]
> [[63355,0],4]-[[63355,1],54] mca_oob_tcp_msg_recv: readv failed:
> Connection reset by peer (104)
> [cn-099.davinci.rice.edu:26185]
> [[63355,0],4]-[[63355,1],55] mca_oob_tcp_msg_recv: readv failed:
> Connection reset by peer (104)
> [cn-158.davinci.rice.edu:12459]
> [[63355,0],1]-[[63355,1],15] mca_oob_tcp_msg_recv: readv failed:
> Connection reset by peer (104)
> [cn-158.davinci.rice.edu:12459]
> [[63355,0],1]-[[63355,1],17] mca_oob_tcp_msg_recv: readv failed:
> Connection reset by peer (104)
> [cn-099.davinci.rice.edu:26185]
> [[63355,0],4]-[[63355,1],56] mca_oob_tcp_msg_recv: readv failed:
> Connection reset by peer (104)
> [cn-099.davinci.rice.edu:26185]
> [[63355,0],4]-[[63355,1],53] mca_oob_tcp_msg_recv: readv failed:
> Connection reset by peer (104)
> [cn-099.davinci.rice.edu:26185]
> [[63355,0],4]-[[63355,1],51] mca_oob_tcp_msg_recv: readv failed:
> Connection reset by peer (104)
> [cn-099.davinci.rice.edu:26185]
> [[63355,0],4]-[[63355,1],57] mca_oob_tcp_msg_recv: readv failed:
> Connection reset by peer (104)
> forrtl: error (78): process killed (SIGTERM)
> Image PC Routine Line Source
>
> ....
>
> [cn-158.davinci.rice.edu:12459]
> [[63355,0],1]-[[63355,1],16] mca_oob_tcp_msg_recv: readv failed:
> Connection reset by peer (104)
> --------------------------------------------------------------------------
> mpiexec has exited due to process rank 49 with PID 26187 on
> node cn-099 exiting improperly. There are two reasons this could occur:
>
> 1. this process did not call "init" before exiting, but others in
> the job did. This can cause a job to hang indefinitely while it waits
> for all processes to call "init". By rule, if one process calls "init",
> then ALL processes must call "init" prior to termination.
>
> 2. this process called "init", but exited without calling "finalize".
> By rule, all processes that call "init" MUST call "finalize" prior to
> exiting or it will be considered an "abnormal termination"
>
> This may have caused other processes in the application to be
> terminated by signals sent by mpiexec (as reported here).
> --------------------------------------------------------------------------
> forrtl: error (78): process killed (SIGTERM)
> Image PC Routine Line Source
> CCTM_V5g_Linux2_x 00000000007FEA29 Unknown Unknown Unknown
> CCTM_V5g_Linux2_x 00000000007FD3A0 Unknown Unknown Unknown
> CCTM_V5g_Linux2_x 00000000007BA9A2 Unknown Unknown Unknown
> CCTM_V5g_Linux2_x 0000000000759288 Unknown Unknown Unknown
>
> ...
>
>
>
> On Wed, May 21, 2014 at 2:08 PM, Gus Correa <gus_at_[hidden]> wrote:
>
> Hi Ben
>
> My guess is that your sys admins may have built NetCDF
> with parallel support, pnetcdf, and the latter with OpenMPI,
> which could explain the dependency.
> Ideally, they should have built it again with the latest default
> OpenMPI (1.6.5?)
>
> Check if there is a NetCDF module that either doesn't have any
> dependence on MPI, or depends on the current Open MPI that
> you are using (1.6.5 I think).
> A 'module show netcdf/bla/bla'
> on the available netcdf modules will tell.
>
> If the application code is old as you said, it probably doesn't use
> any pnetcdf. In addition, it should work even with NetCDF 3.X.Y,
> which probably doesn't have any pnetcdf built in.
> Newer netcdf (4.Z.W > 4.1.3) should also work, and in this case
> pick one that requires the default OpenMPI, if available.
>
> Just out of curiosity, besides netcdf/4.1.3, did you load openmpi/1.6.5?
> Somehow the openmpi/1.6.5 should have been marked
> to conflict with 1.4.4.
> Is it?
> Anyway, you may want to do a 'which mpiexec' to see which one is
> taking precedence in your environment (1.6.5 or 1.4.4).
> Probably 1.6.5.
>
> Does the code work now, or does it continue to fail?
>
>
> I hope this helps,
> Gus Correa
>
>
>
> On 05/21/2014 02:36 PM, Ben Lash wrote:
>
> Yep, there it is.
>
> [bl10_at_login2 USlogsminus10]$ module show netcdf/4.1.3
> -------------------------------------------------------------------
> /opt/apps/modulefiles/netcdf/4.1.3:
>
> module load openmpi/1.4.4-intel
> prepend-path PATH
> /opt/apps/netcdf/4.1.3/bin:/opt/apps/netcdf/4.1.3/deps/hdf5/1.8.7/bin
> prepend-path LD_LIBRARY_PATH
> /opt/apps/netcdf/4.1.3/lib:/opt/apps/netcdf/4.1.3/deps/hdf5/1.8.7/lib:/opt/apps/netcdf/4.1.3/deps/szip/2.1/lib
>
> prepend-path MANPATH /opt/apps/netcdf/4.1.3/share/man
> -------------------------------------------------------------------
>
>
>
> On Wed, May 21, 2014 at 1:34 PM, Douglas L Reeder
> <dlr1_at_[hidden]> wrote:
>
> Ben,
>
> The netcdf/4.1.3 module may be loading the openmpi/1.4.4
> module. Can you do a module show on the netcdf
> module file to see if there is a
> module load openmpi command?
>
> Doug Reeder
>
> On May 21, 2014, at 12:23 PM, Ben Lash <bl10_at_[hidden]> wrote:
>
> I just wanted to follow up for anyone else who got a
> similar
> problem - module load netcdf/4.1.3 *also* loaded
> openmpi/1.4.4. Don't ask me why. My code doesn't seem
> to fail as
> gracefully but otherwise works now. Thanks.
>
>
> On Sat, May 17, 2014 at 6:02 AM, Jeff Squyres (jsquyres)
> <jsquyres_at_[hidden]> wrote:
>
> Ditto -- Lmod looks pretty cool. Thanks for the
> heads up.
>
>
> On May 16, 2014, at 6:23 PM, Douglas L Reeder
> <dlr1_at_[hidden]>
> wrote:
>
> > Maxime,
> >
> > I was unaware of Lmod. Thanks for bringing it to
> my attention.
> >
> > Doug
> > On May 16, 2014, at 4:07 PM, Maxime Boissonneault
> <maxime.boissonneault_at_[hidden]> wrote:
> >
> >> Instead of using the outdated and unmaintained
> Module
> environment, why not use Lmod:
> https://www.tacc.utexas.edu/tacc-projects/lmod
> >>
> >> It is a drop-in replacement for the Module
> environment that
> supports all of its features and much, much more,
> such as:
> >> - module hierarchies
> >> - module properties and color highlighting (we
> use it to
> highlight bioinformatics modules or tools, for example)
> >> - module caching (very useful for a parallel
> filesystem
> with tons of modules)
> >> - path priorities (useful to make sure personal
> modules
> take precedence over system modules)
> >> - export of the module tree to JSON
> >>
> >> It works like a charm, understands both TCL and
> Lua modules,
> and is actively developed and debugged. There are
> literally
> new features every month or so. If it does not do
> what you
> want, odds are that the developer will add it
> shortly (I've
> had it happen).
> >>
> >> Maxime
> >>
> >> On 2014-05-16 17:58, Douglas L Reeder wrote:
> >>> Ben,
> >>>
> >>> You might want to use module (SourceForge) to
> manage
> paths to different MPI implementations. It is
> fairly easy to
> set up and very robust for this type of problem.
> You would
> remove contentious application paths from your
> standard PATH
> and then use module to switch them in and out as
> needed.
> >>>
> >>> Doug Reeder
> >>> On May 16, 2014, at 3:39 PM, Ben Lash
> <bl10_at_[hidden]> wrote:
> >>>
> >>>> My cluster has just upgraded to a new version
> of MPI, and
> I'm using an old one. It seems that I'm having trouble
> compiling due to the compiler wrapper file moving
> (full error
> here: http://pastebin.com/EmwRvCd9)
> >>>> "Cannot open configuration file
>
> /opt/apps/openmpi/1.4.4-intel/share/openmpi/mpif90-wrapper-data.txt"
> >>>>
> >>>> I've found the file on the cluster at
>
> /opt/apps/openmpi/retired/1.4.4-intel/share/openmpi/mpif90-wrapper-data.txt
> >>>> How do I tell the old mpi wrapper where this
> file is?
> >>>> I've already corrected one link to mpich ->
> /opt/apps/openmpi/retired/1.4.4-intel/, which is
> in the
> lib folder of the software I'm trying to recompile
> (/home/bl10/CMAQv5.0.1/lib/x86_64/ifort). Thanks
> for any
> ideas. I also tried changing $pkgdatadir based on
> what I read
> here:
> >>>>
> http://www.open-mpi.org/faq/?category=mpi-apps#default-wrapper-compiler-flags
> >>>>
> >>>> Thanks.
> >>>>
> >>>> --Ben L
> >>>> _______________________________________________
> >>>> users mailing list
> >>>> users_at_[hidden]
> >>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >>>
> >>>
> >>>
> >>
> >>
> >> --
> >> ------------------------------__---
> >> Maxime Boissonneault
> >> Analyste de calcul - Calcul Québec, Université Laval
> >> Ph. D. en physique
> >>
> >
>
>
> --
> Jeff Squyres
> jsquyres_at_[hidden]
>
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
>
>
>
> --
> --Ben L
>
>
>
>
>
>
>
> --
> --Ben L
>
>
>
>
>
>
>
>
> --
> --Ben L
>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>