Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] openmpi configuration error?
From: Ben Lash (bl10_at_[hidden])
Date: 2014-05-21 15:20:23


I used a different build of netcdf 4.1.3, and the code seems to run now. I
have a totally different, non-MPI-related error in part of it, but there's
no way for the list to help with that. I mostly just wanted to report, for
the record, that this particular problem seems to be solved. It doesn't seem
to fail quite as gracefully anymore, but I'm still getting enough of the
error messages to know what's going on:

MPI_ABORT was invoked on rank 52 in communicator MPI_COMM_WORLD
with errorcode 0.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[cn-099.davinci.rice.edu:26185] [[63355,0],4]-[[63355,1],52]
mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[cn-099.davinci.rice.edu:26185] [[63355,0],4]-[[63355,1],54]
mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[cn-099.davinci.rice.edu:26185] [[63355,0],4]-[[63355,1],55]
mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[cn-158.davinci.rice.edu:12459] [[63355,0],1]-[[63355,1],15]
mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[cn-158.davinci.rice.edu:12459] [[63355,0],1]-[[63355,1],17]
mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[cn-099.davinci.rice.edu:26185] [[63355,0],4]-[[63355,1],56]
mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[cn-099.davinci.rice.edu:26185] [[63355,0],4]-[[63355,1],53]
mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[cn-099.davinci.rice.edu:26185] [[63355,0],4]-[[63355,1],51]
mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[cn-099.davinci.rice.edu:26185] [[63355,0],4]-[[63355,1],57]
mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine  Line     Source

....

[cn-158.davinci.rice.edu:12459] [[63355,0],1]-[[63355,1],16]
mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
--------------------------------------------------------------------------
mpiexec has exited due to process rank 49 with PID 26187 on
node cn-099 exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpiexec (as reported here).
--------------------------------------------------------------------------
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine  Line     Source
CCTM_V5g_Linux2_x  00000000007FEA29  Unknown  Unknown  Unknown
CCTM_V5g_Linux2_x  00000000007FD3A0  Unknown  Unknown  Unknown
CCTM_V5g_Linux2_x  00000000007BA9A2  Unknown  Unknown  Unknown
CCTM_V5g_Linux2_x  0000000000759288  Unknown  Unknown  Unknown

...
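
For anyone who finds this in the archives: the check that turned up the
culprit (looking for a "module load" line in the "module show" output) can
be scripted. What follows is only a minimal sketch. The two function names
are hypothetical helpers of my own, not part of Environment Modules or
Open MPI, and "openmpi/1.6.5" in the usage comment is an assumed module
name that may be spelled differently on your cluster.

```sh
# Hypothetical helper: reads the text printed by `module show <name>`
# on stdin and prints every module that the modulefile pulls in via a
# `module load` line, one module per output line.
find_module_loads() {
    awk '$1 == "module" && $2 == "load" { for (i = 3; i <= NF; i++) print $i }'
}

# Hypothetical helper: reads module names on stdin (one per line).
# Succeeds (exit 0) if any of them belongs to the family given as $1
# (e.g. "openmpi") but is not exactly the expected module given as $2.
has_conflicting_load() {
    awk -v fam="$1" -v want="$2" '
        index($0, fam "/") == 1 && $0 != want { found = 1 }
        END { exit found ? 0 : 1 }
    '
}

# Usage on a cluster (module show often writes to stderr, hence 2>&1):
#   module show netcdf/4.1.3 2>&1 | find_module_loads
#   module show netcdf/4.1.3 2>&1 | find_module_loads \
#       | has_conflicting_load openmpi openmpi/1.6.5 \
#       && echo "this netcdf module loads a different Open MPI"
```

Run against the netcdf/4.1.3 modulefile quoted further down, the first
command should list openmpi/1.4.4-intel. A 'which mpiexec' afterwards is
still worth doing to confirm which installation wins on PATH.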

On Wed, May 21, 2014 at 2:08 PM, Gus Correa <gus_at_[hidden]> wrote:

> Hi Ben
>
> My guess is that your sys admins may have built NetCDF
> with parallel support, pnetcdf, and the latter with OpenMPI,
> which could explain the dependency.
> Ideally, they should have built it again with the latest default OpenMPI
> (1.6.5?)
>
> Check if there is a NetCDF module that either doesn't have any
> dependence on MPI, or depends on the current Open MPI that
> you are using (1.6.5 I think).
> A 'module show netcdf/bla/bla'
> on the available netcdf modules will tell.
>
> If the application code is old as you said, it probably doesn't use
> any pnetcdf. In addition, it should work even with NetCDF 3.X.Y,
> which probably doesn't have any pnetcdf built in.
> Newer netcdf (4.Z.W > 4.1.3) should also work, and in this case
> pick one that requires the default OpenMPI, if available.
>
> Just out of curiosity, besides netcdf/4.1.3, did you load openmpi/1.6.5?
> Somehow the openmpi/1.6.5 should have been marked
> to conflict with 1.4.4.
> Is it?
> Anyway, you may want to do a 'which mpiexec' to see which one is
> taking precedence in your environment (1.6.5 or 1.4.4)
> Probably 1.6.5.
>
> Does the code work now, or does it continue to fail?
>
>
> I hope this helps,
> Gus Correa
>
>
>
> On 05/21/2014 02:36 PM, Ben Lash wrote:
>
>> Yep, there it is.
>>
>> [bl10_at_login2 USlogsminus10]$ module show netcdf/4.1.3
>> -------------------------------------------------------------------
>> /opt/apps/modulefiles/netcdf/4.1.3:
>>
>> module load openmpi/1.4.4-intel
>> prepend-path PATH /opt/apps/netcdf/4.1.3/bin:/opt/apps/netcdf/4.1.3/deps/hdf5/1.8.7/bin
>> prepend-path LD_LIBRARY_PATH /opt/apps/netcdf/4.1.3/lib:/opt/apps/netcdf/4.1.3/deps/hdf5/1.8.7/lib:/opt/apps/netcdf/4.1.3/deps/szip/2.1/lib
>>
>> prepend-path MANPATH /opt/apps/netcdf/4.1.3/share/man
>> -------------------------------------------------------------------
>>
>>
>>
>> On Wed, May 21, 2014 at 1:34 PM, Douglas L Reeder <dlr1_at_[hidden]> wrote:
>>
>> Ben,
>>
>> The netcdf/4.1.3 module may be loading the openmpi/1.4.4 module. Can
>> you run module show on the netcdf module file to see if there is a
>> module load openmpi command?
>>
>> Doug Reeder
>>
>> On May 21, 2014, at 12:23 PM, Ben Lash <bl10_at_[hidden]> wrote:
>>
>>> I just wanted to follow up for anyone else who runs into a similar
>>> problem: module load netcdf/4.1.3 *also* loaded openmpi/1.4.4.
>>> Don't ask me why. My code doesn't seem to fail as gracefully, but
>>> otherwise it works now. Thanks.
>>>
>>>
>>> On Sat, May 17, 2014 at 6:02 AM, Jeff Squyres (jsquyres) <jsquyres_at_[hidden]> wrote:
>>>
>>> Ditto -- Lmod looks pretty cool. Thanks for the heads up.
>>>
>>>
>>> On May 16, 2014, at 6:23 PM, Douglas L Reeder <dlr1_at_[hidden]> wrote:
>>>
>>> > Maxime,
>>> >
>>> > I was unaware of Lmod. Thanks for bringing it to my attention.
>>> >
>>> > Doug
>>> > On May 16, 2014, at 4:07 PM, Maxime Boissonneault <maxime.boissonneault_at_[hidden]> wrote:
>>> >
>>> >> Instead of using the outdated and unmaintained Module
>>> >> environment, why not use Lmod:
>>> >> https://www.tacc.utexas.edu/tacc-projects/lmod
>>> >>
>>> >> It is a drop-in replacement for the Module environment that
>>> >> supports all of its features and much, much more, such as:
>>> >> - module hierarchies
>>> >> - module properties and color highlighting (we use it to
>>> >>   highlight bioinformatics modules or tools, for example)
>>> >> - module caching (very useful for a parallel filesystem
>>> >>   with tons of modules)
>>> >> - path priorities (useful to make sure personal modules
>>> >>   take precedence over system modules)
>>> >> - export of the module tree to JSON
>>> >>
>>> >> It works like a charm, understands both TCL and Lua modules,
>>> >> and is actively developed and debugged. There are literally
>>> >> new features every month or so. If it does not do what you
>>> >> want, odds are that the developer will add it shortly (I've
>>> >> had it happen).
>>> >>
>>> >> Maxime
>>> >>
>>> >> On 2014-05-16 17:58, Douglas L Reeder wrote:
>>> >>> Ben,
>>> >>>
>>> >>> You might want to use module (SourceForge) to manage
>>> >>> paths to different MPI implementations. It is fairly easy to
>>> >>> set up and very robust for this type of problem. You would
>>> >>> remove contentious application paths from your standard PATH
>>> >>> and then use module to switch them in and out as needed.
>>> >>>
>>> >>> Doug Reeder
>>> >>> On May 16, 2014, at 3:39 PM, Ben Lash <bl10_at_[hidden]> wrote:
>>> >>>
>>> >>>> My cluster has just upgraded to a new version of MPI, and
>>> >>>> I'm using an old one. It seems that I'm having trouble
>>> >>>> compiling because the compiler wrapper data file has moved
>>> >>>> (full error here: http://pastebin.com/EmwRvCd9):
>>> >>>> "Cannot open configuration file
>>> >>>> /opt/apps/openmpi/1.4.4-intel/share/openmpi/mpif90-wrapper-data.txt"
>>> >>>>
>>> >>>> I've found the file on the cluster at
>>> >>>> /opt/apps/openmpi/retired/1.4.4-intel/share/openmpi/mpif90-wrapper-data.txt
>>> >>>> How do I tell the old MPI wrapper where this file is?
>>> >>>> I've already corrected one link, mpich ->
>>> >>>> /opt/apps/openmpi/retired/1.4.4-intel/, which is in the lib
>>> >>>> folder of the software I'm trying to recompile
>>> >>>> (/home/bl10/CMAQv5.0.1/lib/x86_64/ifort). Thanks for any
>>> >>>> ideas. I also tried changing $pkgdatadir based on what I read
>>> >>>> here:
>>> >>>>
>>> >>>> http://www.open-mpi.org/faq/?category=mpi-apps#default-wrapper-compiler-flags
>>> >>>>
>>> >>>> Thanks.
>>> >>>>
>>> >>>> --Ben L
>>> >>>> _______________________________________________
>>> >>>> users mailing list
>>> >>>> users_at_[hidden] <mailto:users_at_[hidden]>
>>>
>>> >>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> >>
>>> >>
>>> >> --
>>> >> ---------------------------------
>>> >> Maxime Boissonneault
>>> >> Analyste de calcul - Calcul Québec, Université Laval
>>> >> Ph. D. en physique
>>> >>
>>>
>>>
>>> --
>>> Jeff Squyres
>>> jsquyres_at_[hidden]
>>>
>>> For corporate legal information go to:
>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>
>>>
>>> --
>>> --Ben L
>>>
>>
>>
>>
>> --
>> --Ben L
>>
>>
>>
>

-- 
--Ben L