MPIEXEC and Intel MPI library 1.0.1

Anton Starikov A.Starikov at utwente.nl
Mon May 2 17:33:37 EDT 2005


It seems that I was a bit too optimistic.
The problem is not only in the HOSTNAME variable.
It seems that Intel also changed something else in the init protocol.
So, currently it works with Intel MPI with the "shm", "rdma", and "rdssm"
drivers, but it doesn't work with "sock".
I'm trying to debug the problem.

Anton.


Thomas Zeiser wrote:
> Hello Anton,
> 
> On Mon, May 02, 2005 at 06:06:52PM +0200, Anton Starikov wrote:
> 
>>Can you describe in detail:
>>1) Which fabric do you use?
> 
> 
> the default TCP socket device on IA32 or EM64T nodes.
> 
> 
>>2) You wrote that the version is 1.0.1; does that actually mean version
>>1.0.035? So far it is the latest version available from Intel. I've tested
>>my patch with this version.
> 
> 
> the Package ID is l_mpi_pu_1.0.035; the release notes call it v1.0.1
> (it's available on premier.intel since April 15)
> 
> probably we are talking about the same version
> 
> 
>>The Finalize hang itself can be related to a general problem with MPICH2;
>>it's described on the mpiexec site. Try to add "-kill" to the mpiexec options.
> 
> 
> the -kill option does not help in my case (except that I only have
> to kill one process manually to terminate the program); this
> probably means NO task tries to exit!?
> 
> It also might be related to the strange hostname output I see when
> starting the program with mpiexec. (When I use the Intel-MPI
> mpdboot/mpiexec/mpdallexit sequence, the output is correct and the
> program terminates normally.)
> 
> 
>>Anton Starikov.
> 
> 
> Regards,
> 
> thomas
> 
> 
> 
>>Thomas Zeiser wrote:
>>
>>>Dear All!
>>>
>>>I just tested the patch to get Intel MPI running.  
>>>
>>>With version 1.0 of Intel MPI everything is fine. 
>>>
>>>However, when I try the recent update (Intel MPI 1.0.1) I get very
>>>strange results:
>>>
>>>- I start the MPI program with
>>> mpiexec -comm pmi [-verbose] ./test-f-g77-intelmpi101
>>>
>>>- the processes are correctly started on all nodes (twice on
>>> snode163 and snode164; verified with "ps")
>>>
>>> mpiexec: resolve_exe: using absolute exe "./test-f-g77-intelmpi101".
>>> mpiexec: accept_pmi_conn: got request: cmd=initack pmiid=0.
>>> mpiexec: accept_pmi_conn: rank 0 checks in.
>>> mpiexec: accept_pmi_conn: got request: cmd=init pmi_version=1.1.
>>> mpiexec: accept_pmi_conn: got request: cmd=initack pmiid=1.
>>> mpiexec: accept_pmi_conn: rank 1 checks in.
>>> mpiexec: accept_pmi_conn: got request: cmd=init pmi_version=1.1.
>>> mpiexec: accept_pmi_conn: got request: cmd=initack pmiid=2.
>>> mpiexec: accept_pmi_conn: rank 2 checks in.
>>> mpiexec: accept_pmi_conn: got request: cmd=init pmi_version=1.1.
>>> mpiexec: accept_pmi_conn: got request: cmd=initack pmiid=3.
>>> mpiexec: accept_pmi_conn: rank 3 checks in.
>>> mpiexec: accept_pmi_conn: got request: cmd=init pmi_version=1.1.
>>> mpiexec: All 4 tasks started.
>>>
>>>- in the test program all MPI processes get their hostname using
>>> MPI_GET_PROCESSOR_NAME and send it with MPI_SEND to the master.
>>> The master receives the messages with MPI_RECV and outputs them
>>> (it's the simple test.f* program from the Intel MPI test
>>> directory). The output is the following:
>>>
>>> Hello world: rank  0 of  4 running on snode164
>>> Hello world: rank  1 of  4 running on snode164
>>> Hello world: rank  2 of  4 running on snode164
>>> Hello world: rank  3 of  4 running on snode164
>>>
>>> All processes seem to be running on the same node!
>>>
>>>- now the program reaches MPI_FINALIZE. However, the processes
>>> hang. When I now kill all processes one by one, I get
>>>
>>> accept_pmi_conn: waiting for info
>>> accept_pmi_conn: waiting for info
>>> accept_pmi_conn: waiting for info
>>> accept_pmi_conn: waiting for info
>>> wait_one_task_start: evt = 2, task 0 host snode164
>>> wait_one_task_start: evt = 3, task 1 host snode164
>>> wait_one_task_start: evt = 4, task 2 host snode163
>>> wait_one_task_start: evt = 5, task 3 host snode163
>>> wait_tasks: waiting for snode164 snode164 snode163 snode163
>>> wait_tasks: waiting for snode164 snode163 snode163
>>> wait_tasks: waiting for snode164 snode163
>>> wait_tasks: waiting for snode163
>>>
>>> Killed
>>> ABORT - process 3: failure: Other MPI error
>>> mpiexec: wait_tasks: numspawned = 4, got evt 6 for tid 2 host snode164 status 1.
>>> mpiexec: wait_tasks: numspawned = 3, got evt 9 for tid 5 host snode163 status 13.
>>> Killed
>>> mpiexec: wait_tasks: numspawned = 2, got evt 7 for tid 3 host snode164 status 1.
>>> Killed
>>> mpiexec: wait_tasks: numspawned = 1, got evt 8 for tid 4 host snode163 status 1.
>>> mpiexec: Warning: tasks 0-2 exited with status 1.
>>> mpiexec: Warning: task 3 exited with status 13.
>>>
>>>
>>>
>>>
>>>Any ideas (except using totalview to get a better insight)?
>>>
>>>
>>>Kind regards,
>>>
>>>Thomas Zeiser
> 
> _______________________________________________
> mpiexec mailing list
> mpiexec at osc.edu
> http://email.osc.edu/mailman/listinfo/mpiexec
> 
> 
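
The hostname mismatch in the quoted report can be spotted mechanically by
comparing where mpiexec says it started each task with what each rank
reports from MPI_GET_PROCESSOR_NAME. The following is only a rough sketch
of that comparison, not part of mpiexec or Intel MPI. The function name
find_host_mismatches and the SAMPLE text are made up for illustration, the
two line formats are copied from the output quoted above, and it assumes
that mpiexec's task number N corresponds to MPI rank N.

```python
import re

# Line formats copied from the quoted output; all other lines are ignored.
TASK_RE = re.compile(r"wait_one_task_start: evt = \d+, task (\d+) host (\S+)")
HELLO_RE = re.compile(r"Hello world: rank\s+(\d+) of\s+\d+ running on (\S+)")


def find_host_mismatches(text):
    """Return {rank: (spawned_host, reported_host)} for ranks that disagree.

    Assumption: mpiexec task number N is MPI rank N.
    """
    spawned = {int(task): host for task, host in TASK_RE.findall(text)}
    reported = {int(rank): host for rank, host in HELLO_RE.findall(text)}
    return {
        rank: (spawned[rank], reported[rank])
        for rank in sorted(spawned.keys() & reported.keys())
        if spawned[rank] != reported[rank]
    }


# Excerpt of the output from the quoted report.
SAMPLE = """\
 Hello world: rank  0 of  4 running on snode164
 Hello world: rank  1 of  4 running on snode164
 Hello world: rank  2 of  4 running on snode164
 Hello world: rank  3 of  4 running on snode164
 wait_one_task_start: evt = 2, task 0 host snode164
 wait_one_task_start: evt = 3, task 1 host snode164
 wait_one_task_start: evt = 4, task 2 host snode163
 wait_one_task_start: evt = 5, task 3 host snode163
"""

if __name__ == "__main__":
    for rank, (spawned, reported) in find_host_mismatches(SAMPLE).items():
        print(f"rank {rank}: started on {spawned}, reports {reported}")
```

On the excerpt above this flags ranks 2 and 3: mpiexec started them on
snode163, but they report snode164.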


