
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] OMPI-MIGRATE error
From: Hugo Meyer (meyer.hugo_at_[hidden])
Date: 2011-01-31 06:47:48


Hi Joshua.

I've tried the migration again, and this is what I get (the processes run on
the same node as mpirun):

Terminal 1:

[hmeyer_at_clus9 whoami]$ /home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun -np 2 -am ft-enable-cr-recovery --mca orte_base_help_aggregate 0 ./whoami 10 10
Antes de MPI_Init
Antes de MPI_Init
--------------------------------------------------------------------------
Warning: Could not find any processes to migrate on the nodes specified.
         You provided the following:
Nodes: node9
Procs: (null)
--------------------------------------------------------------------------
Soy el número 1 (100000000)
Terminando, una instrucción antes del finalize
Soy el número 0 (100000000)
Terminando, una instrucción antes del finalize

Terminal 2:

[hmeyer_at_clus9 build]$ /home/hmeyer/desarrollo/ompi-code/binarios/bin/ompi-migrate -x node9 -t node3 11724
--------------------------------------------------------------------------
Error: The Job identified by PID (11724) was not able to migrate processes in this
       job. This could be caused by any of the following:
       - Invalid node or rank specified
       - No processes on the indicated node can by migrated
       - Process migration was not enabled for this job. Make sure to indicate
         the proper AMCA file: "-am ft-enable-cr-recovery".
--------------------------------------------------------------------------

Then I tried another approach, and I got the following:

Terminal 1:

[hmeyer_at_clus9 whoami]$ /home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun -np 3 -am ft-enable-cr-recovery ./whoami 10 10
Antes de MPI_Init
Antes de MPI_Init
Antes de MPI_Init
--------------------------------------------------------------------------
Notice: A migration of this job has been requested.
        The processes below will be migrated.
        Please standby.
        [[40382,1],1] Rank 1 on Node clus9
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Error: The process below has failed. There is no checkpoint available for
       this job, so we are terminating the application since automatic
       recovery cannot occur.
Internal Name: [[40382,1],1]
MCW Rank: 1
--------------------------------------------------------------------------
Soy el número 0 (100000000)
Terminando, una instrucción antes del finalize
Soy el número 2 (100000000)
Terminando, una instrucción antes del finalize

Terminal 2:

[hmeyer_at_clus9 build]$ /home/hmeyer/desarrollo/ompi-code/binarios/bin/ompi-migrate -r 1 -t node3 11784
[clus9:11795] *** Process received signal ***
[clus9:11795] Signal: Segmentation fault (11)
[clus9:11795] Signal code: Address not mapped (1)
[clus9:11795] Failing at address: (nil)
[clus9:11795] [ 0] /lib64/libpthread.so.0 [0x2aaaac0b9d40]
[clus9:11795] *** End of error message ***
Segmentation fault
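
Since Josh asked earlier in this thread for a backtrace from the core file, I
plan to collect one roughly like this (just a sketch: the core file name is a
guess, and core dumps have to be enabled in the shell first):

```shell
# Enable core dumps in this shell before reproducing the crash
ulimit -c unlimited

# Load the core into gdb and print a backtrace non-interactively.
# The binary path matches my install; "core.11795" is a guessed
# core file name based on the failing PID above.
gdb /home/hmeyer/desarrollo/ompi-code/binarios/bin/ompi-migrate core.11795 \
    -batch -ex bt
```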

Am I using the ompi-migrate command in the right way, or am I missing
something? The first attempt didn't find any processes to migrate.
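
For clarity, here is the full sequence I am attempting, as a sketch (the PID
is a placeholder; whether an explicit ompi-checkpoint is required before
ompi-migrate is exactly what I am unsure about):

```shell
# Start the job with the recovery-enabled AMCA parameter
mpirun -np 2 -am ft-enable-cr-recovery ./whoami 10 10

# From a second terminal, against mpirun's PID (placeholder here):
# possibly take a checkpoint first, since the errors above mention
# "no checkpoint available"
ompi-checkpoint <mpirun_pid>

# Then request the migration, either by node...
ompi-migrate -x node9 -t node3 <mpirun_pid>
# ...or by rank
ompi-migrate -r 1 -t node3 <mpirun_pid>
```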

Best Regards.

Hugo Meyer

2011/1/28 Hugo Meyer <meyer.hugo_at_[hidden]>

> Thanks to you Joshua.
>
> I will try the procedure with these modifications, and I will let you know
> how it goes.
>
> Best Regards.
>
> Hugo Meyer
>
> 2011/1/27 Joshua Hursey <jjhursey_at_[hidden]>
>
> I believe that this is now fixed on the trunk. All the details are in the
>> commit message:
>> https://svn.open-mpi.org/trac/ompi/changeset/24317
>>
>> In my testing yesterday, I did not test the scenario where the node with
>> mpirun also contains processes (the test cluster I was using does not by
>> default run this way). So I was able to reproduce by running on a single
>> node. There were a couple bugs that emerged that are fixed in the commit.
>> The two bugs that were hurting you were the TCP socket cleanup (which caused
>> the looping of the automatic recovery), and the incorrect accounting of
>> local process termination (which caused the modex errors).
>>
>> Let me know if that fixes the problems that you were seeing.
>>
>> Thanks for the bug report and your patience while I pursued a fix.
>>
>> -- Josh
>>
>> On Jan 27, 2011, at 11:28 AM, Hugo Meyer wrote:
>>
>> > Hi Josh.
>> >
>> > Thanks for your reply. I'll tell you what I'm getting now from the
>> executions below.
>> > When I run without taking a checkpoint, I get this output and the processes
>> don't finish:
>> >
>> > [hmeyer_at_clus9 whoami]$
>> /home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun -np 2 -am
>> ft-enable-cr-recovery ./whoami 10 10
>> > Antes de MPI_Init
>> > Antes de MPI_Init
>> > [clus9:04985] [[41167,0],0] ORTE_ERROR_LOG: Error in file
>> ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
>> > [clus9:04985] [[41167,0],0] ORTE_ERROR_LOG: Error in file
>> ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
>> > Soy el número 1 (100000000)
>> > Terminando, una instrucción antes del finalize
>> > Soy el número 0 (100000000)
>> > Terminando, una instrucción antes del finalize
>> >
>> --------------------------------------------------------------------------
>> > Error: The process below has failed. There is no checkpoint available
>> for
>> > this job, so we are terminating the application since automatic
>> > recovery cannot occur.
>> > Internal Name: [[41167,1],0]
>> > MCW Rank: 0
>> >
>> >
>> --------------------------------------------------------------------------
>> > [clus9:04985] 1 more process has sent help message
>> help-orte-errmgr-hnp.txt / autor_failed_to_recover_proc
>> > [clus9:04985] Set MCA parameter "orte_base_help_aggregate" to 0 to see
>> all help / error messages
>> >
>> > If I take a checkpoint of the mpirun process from another terminal during
>> the execution, I get this output:
>> >
>> > [clus9:06105] [[42095,0],0] ORTE_ERROR_LOG: Error in file
>> ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
>> > [clus9:06105] [[42095,0],0] ORTE_ERROR_LOG: Error in file
>> ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
>> > [clus9:06107] [[42095,1],1] ORTE_ERROR_LOG: Data unpack would read past
>> end of buffer in file ../../../../orte/mca/grpcomm/base/grpcomm_base_modex.c
>> at line 350
>> > [clus9:06107] [[42095,1],1] ORTE_ERROR_LOG: Data unpack would read past
>> end of buffer in file
>> ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c at line 323
>> > [clus9:06107] pml:ob1: ft_event(Restart): Failed orte_grpcomm.modex() =
>> -26
>> > [clus9:06106] [[42095,1],0] ORTE_ERROR_LOG: Data unpack would read past
>> end of buffer in file ../../../../orte/mca/grpcomm/base/grpcomm_base_modex.c
>> at line 350
>> > [clus9:06106] [[42095,1],0] ORTE_ERROR_LOG: Data unpack would read past
>> end of buffer in file
>> ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c at line 323
>> > [clus9:06106] pml:ob1: ft_event(Restart): Failed orte_grpcomm.modex() =
>> -26
>> >
>> --------------------------------------------------------------------------
>> > Notice: The job has been successfully recovered from the
>> > last checkpoint.
>> >
>> --------------------------------------------------------------------------
>> > Soy el número 1 (100000000)
>> > Terminando, una instrucción antes del finalize
>> > Soy el número 0 (100000000)
>> > Terminando, una instrucción antes del finalize
>> > [clus9:06105] 1 more process has sent help message
>> help-orte-errmgr-hnp.txt / autor_recovering_job
>> > [clus9:06105] Set MCA parameter "orte_base_help_aggregate" to 0 to see
>> all help / error messages
>> > [clus9:06105] [[42095,0],0] ORTE_ERROR_LOG: Error in file
>> ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
>> > [clus9:06105] [[42095,0],0] ORTE_ERROR_LOG: Error in file
>> ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
>> > [clus9:06107] [[42095,1],1] ORTE_ERROR_LOG: Data unpack would read past
>> end of buffer in file ../../../../orte/mca/grpcomm/base/grpcomm_base_modex.c
>> at line 350
>> > [clus9:06107] [[42095,1],1] ORTE_ERROR_LOG: Data unpack would read past
>> end of buffer in file
>> ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c at line 323
>> > [clus9:06107] pml:ob1: ft_event(Restart): Failed orte_grpcomm.modex() =
>> -26
>> > [clus9:06106] [[42095,1],0] ORTE_ERROR_LOG: Data unpack would read past
>> end of buffer in file ../../../../orte/mca/grpcomm/base/grpcomm_base_modex.c
>> at line 350
>> > [clus9:06106] [[42095,1],0] ORTE_ERROR_LOG: Data unpack would read past
>> end of buffer in file
>> ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c at line 323
>> > [clus9:06106] pml:ob1: ft_event(Restart): Failed orte_grpcomm.modex() =
>> -26
>> > [clus9:06105] 1 more process has sent help message
>> help-orte-errmgr-hnp.txt / autor_recovery_complete
>> > Soy el número 0 (100000000)
>> > Terminando, una instrucción antes del finalize
>> > Soy el número 1 (100000000)
>> > Terminando, una instrucción antes del finalize
>> > [clus9:06105] 1 more process has sent help message
>> help-orte-errmgr-hnp.txt / autor_recovering_job
>> > [clus9:06105] [[42095,0],0] ORTE_ERROR_LOG: Error in file
>> ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
>> > [clus9:06105] [[42095,0],0] ORTE_ERROR_LOG: Error in file
>> ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
>> > [clus9:06107] [[42095,1],1] ORTE_ERROR_LOG: Data unpack would read past
>> end of buffer in file ../../../../orte/mca/grpcomm/base/grpcomm_base_modex.c
>> at line 350
>> > [clus9:06107] [[42095,1],1] ORTE_ERROR_LOG: Data unpack would read past
>> end of buffer in file
>> ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c at line 323
>> > [clus9:06106] [[42095,1],0] ORTE_ERROR_LOG: Data unpack would read past
>> end of buffer in file ../../../../orte/mca/grpcomm/base/grpcomm_base_modex.c
>> at line 350
>> > [clus9:06106] [[42095,1],0] ORTE_ERROR_LOG: Data unpack would read past
>> end of buffer in file
>> ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c at line 323
>> > [clus9:06106] pml:ob1: ft_event(Restart): Failed orte_grpcomm.modex() =
>> -26
>> > [clus9:06107] pml:ob1: ft_event(Restart): Failed orte_grpcomm.modex() =
>> -26
>> >
>> > As you can see, it keeps looping on the recovery. Then when I try to
>> migrate these processes using ompi-migrate, I get this:
>> >
>> > [hmeyer_at_clus9 ~]$
>> /home/hmeyer/desarrollo/ompi-code/binarios/bin/ompi-migrate -x node9 -t
>> node3 18082
>> >
>> --------------------------------------------------------------------------
>> > Error: The Job identified by PID (18082) was not able to migrate
>> processes in this
>> > job. This could be caused by any of the following:
>> > - Invalid node or rank specified
>> > - No processes on the indicated node can by migrated
>> > - Process migration was not enabled for this job. Make sure to
>> indicate
>> > the proper AMCA file: "-am ft-enable-cr-recovery".
>> >
>> --------------------------------------------------------------------------
>> > But in the terminal where the application is running, I get this:
>> >
>> > [hmeyer_at_clus9 whoami]$
>> /home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun -np 2 -am
>> ft-enable-cr-recovery ./whoami 10 10
>> > Antes de MPI_Init
>> > Antes de MPI_Init
>> > [clus9:18082] [[62740,0],0] ORTE_ERROR_LOG: Error in file
>> ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
>> > [clus9:18082] [[62740,0],0] ORTE_ERROR_LOG: Error in file
>> ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
>> >
>> --------------------------------------------------------------------------
>> > Warning: Could not find any processes to migrate on the nodes specified.
>> > You provided the following:
>> > Nodes: node9
>> > Procs: (null)
>> >
>> --------------------------------------------------------------------------
>> >
>> --------------------------------------------------------------------------
>> > Notice: The processes have been successfully migrated to/from the
>> specified
>> > machines.
>> >
>> --------------------------------------------------------------------------
>> > Soy el número 1 (100000000)
>> > Terminando, una instrucción antes del finalize
>> > Soy el número 0 (100000000)
>> > Terminando, una instrucción antes del finalize
>> >
>> --------------------------------------------------------------------------
>> > Error: The process below has failed. There is no checkpoint available
>> for
>> > this job, so we are terminating the application since automatic
>> > recovery cannot occur.
>> > Internal Name: [[62740,1],0]
>> > MCW Rank: 0
>> >
>> >
>> --------------------------------------------------------------------------
>> > [clus9:18082] 1 more process has sent help message
>> help-orte-errmgr-hnp.txt / autor_failed_to_recover_proc
>> > [clus9:18082] Set MCA parameter "orte_base_help_aggregate" to 0 to see
>> all help / error messages
>> >
>> > I assume that orte_get_job_data_object is the problem, because it is
>> not obtaining the proper value.
>> >
>> > If you need more data, just let me know.
>> >
>> > Best Regards.
>> >
>> > Hugo Meyer
>> >
>> >
>> >
>> >
>> > 2011/1/26 Joshua Hursey <jjhursey_at_[hidden]>
>> > I found a few more bugs after testing the C/R functionality this
>> morning. I just committed some more C/R fixes in r24306 (things are now
>> working correctly on my test cluster).
>> > https://svn.open-mpi.org/trac/ompi/changeset/24306
>> >
>> > One thing I just noticed in your original email was that you are
>> specifying the wrong parameter for migration (it is different than the
>> standard C/R functionality for backwards compatibility reasons). You need to
>> use the 'ft-enable-cr-recovery' AMCA parameter:
>> > mpirun -np 2 -am ft-enable-cr-recovery ./whoami 10 10
>> >
>> > If you still get the segmentation fault after upgrading to the current
>> trunk, can you send me a backtrace from the core file? That will help me
>> narrow down on the problem.
>> >
>> > Thanks,
>> > Josh
>> >
>> >
>> > On Jan 26, 2011, at 8:40 AM, Hugo Meyer wrote:
>> >
>> > > Josh.
>> > >
>> > > The ompi-checkpoint and its restart now work great, but the
>> same error persists with ompi-migrate. I've also tried using "-r", but I get
>> the same error.
>> > >
>> > > Best regards.
>> > >
>> > > Hugo Meyer
>> > >
>> > > 2011/1/26 Hugo Meyer <meyer.hugo_at_[hidden]>
>> > > Thanks Josh.
>> > >
>> > > I've already checked the prelink setting, and it is set to "no".
>> > >
>> > > I'm going to try with the trunk head, and then I'll let you know how
>> it goes.
>> > >
>> > > Best regards.
>> > >
>> > > Hugo Meyer
>> > >
>> > > 2011/1/25 Joshua Hursey <jjhursey_at_[hidden]>
>> > >
>> > > Can you try with the current trunk head (r24296)?
>> > > I just committed a fix for the C/R functionality in which restarts
>> were getting stuck. This will likely affect the migration functionality, but
>> I have not had an opportunity to test just yet.
>> > >
>> > > Another thing to check is that prelink is turned off on all of your
>> machines.
>> > > https://upc-bugs.lbl.gov//blcr/doc/html/FAQ.html#prelink
>> > >
>> > > Let me know if the problem persists, and I'll dig into a bit more.
>> > >
>> > > Thanks,
>> > > Josh
>> > >
>> > > On Jan 24, 2011, at 11:37 AM, Hugo Meyer wrote:
>> > >
>> > > > Hello @ll
>> > > >
>> > > > I've got a problem when I try to use the ompi-migrate command.
>> > > >
>> > > > What I'm doing is running, for example, the following application on one
>> node of a cluster (both processes will run on the same node):
>> > > >
>> > > > mpirun -np 2 -am ft-enable-cr ./whoami 10 10
>> > > >
>> > > > Then in the same node i try to migrate the processes to another
>> node:
>> > > >
>> > > > ompi-migrate -x node9 -t node3 14914
>> > > >
>> > > > And then i get this message:
>> > > >
>> > > > [clus9:15620] *** Process received signal ***
>> > > > [clus9:15620] Signal: Segmentation fault (11)
>> > > > [clus9:15620] Signal code: Address not mapped (1)
>> > > > [clus9:15620] Failing at address: (nil)
>> > > > [clus9:15620] [ 0] /lib64/libpthread.so.0 [0x2aaaac0b8d40]
>> > > > [clus9:15620] *** End of error message ***
>> > > > Segmentation fault
>> > > >
>> > > > I assume that maybe there is something wrong with the thread level,
>> but I have configured Open MPI like this:
>> > > >
>> > > > ../configure --prefix=/home/hmeyer/desarrollo/ompi-code/binarios/
>> --enable-debug --enable-debug-symbols --enable-trace --with-ft=cr
>> --disable-ipv6 --enable-opal-multi-threads --enable-ft-thread
>> --without-hwloc --disable-vt --with-blcr=/soft/blcr-0.8.2/
>> --with-blcr-libdir=/soft/blcr-0.8.2/lib/
>> > > >
>> > > > The checkpoint and restart work fine, but when I restore an
>> application that has more than one process, it is restored and
>> executed until the last line before MPI_Finalize(); the processes never
>> finalize, so I assume that they never call MPI_Finalize(). With one
>> process, ompi-checkpoint and ompi-restart work great.
>> > > >
>> > > > Best regards.
>> > > >
>> > > > Hugo Meyer
>> > > > _______________________________________________
>> > > > devel mailing list
>> > > > devel_at_[hidden]
>> > > > http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> > >
>> > > ------------------------------------
>> > > Joshua Hursey
>> > > Postdoctoral Research Associate
>> > > Oak Ridge National Laboratory
>> > > http://users.nccs.gov/~jjhursey
>> > >
>> > >
>> > >
>> > >
>> >
>> > ------------------------------------
>> > Joshua Hursey
>> > Postdoctoral Research Associate
>> > Oak Ridge National Laboratory
>> > http://users.nccs.gov/~jjhursey
>> >
>> >
>> >
>>
>> ------------------------------------
>> Joshua Hursey
>> Postdoctoral Research Associate
>> Oak Ridge National Laboratory
>> http://users.nccs.gov/~jjhursey
>>
>>
>>