Weird GM/MPIexec error

Christopher D. Maestas cdmaest at sandia.gov
Tue Aug 26 19:46:26 EDT 2003


Hello,

I've figured out what's happening within the mpirun and mpiexec calls against 
an mpich compiled for gm2 usage.  Here's what mpirun.ch_gm.pl will show for 
the data it gets on those ports (I made the printing prettier :)
===============================================================================
[examples]$ mpirun.ch_gm -v -np 1 cpi
Program binary is: /usr/local/scratch/mpich-gm2-intel/examples/cpi
Machines file is /usr/local/scratch/mpich-gm2-intel/share/machines.ch_gm.LINUX
Shared memory for intra-nodes coms is enabled.
GM receive mode used: polling.
1 processes will be spawned:
        Process 0 (/usr/local/scratch/mpich-gm2-intel/examples/cpi ) on dell530
Open a socket on dell530...
Got a first socket opened on port 8000.
Shared memory file: /tmp/gmpi_shmem-8266279:[0-9]*.tmp

ssh dell530 cd /usr/local/scratch/mpich-gm2-intel/examples ; env  GMPI_MASTER=dell530 GMPI_PORT=8000 GMPI_SHMEM=1 LD_LIBRARY_PATH=/usr/local/scratch/mpich-gm2-intel/lib:/opt/gm/lib:/usr/java/jdk1.3.1/lib:/usr/local/totalview/lib:/usr/local/pbs/lib:/usr/local/mpich/lib GMPI_MAGIC=8266279 GMPI_ID=0 GMPI_NP=1 GMPI_BOARD=-1 GMPI_SLAVE=134.253.175.61 /usr/local/scratch/mpich-gm2-intel/examples/cpi
MPI Id 0 is using GM port 2, board 0, GM_id 3716117495.
------------------------------
MAGIC           8266279
MPI ID          0
PORT            2 (GM port)
BOARD           0
NODE            3716117495 (GM_id)
NUMANODE        0
PID             1572
REM PORT        8001
------------------------------
Received data from all 1 MPI processes.
Sending mapping to MPI Id 0.
Data sent to all processes.
Process 0 on dell530.sandia.gov
pi is approximately 3.1416009869231254, Error is 0.0000083333333323
wall clock time = 0.000144
All remote MPI processes have exited.
Reap remote processes:
        ssh dell530 -n kill -9 1572 2>/dev/null
===============================================================================

As you can see, the GM_id is some whacked-out number.  This is because of the GM
global-to-local id scheme Myricom now has in mpich-gm when compiled against gm2
(the section with the "What a mess" comment).
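
The GM_id above looks less arbitrary when printed in hex. A minimal Python sketch (the helper name `gm_id_bytes` is made up for illustration) splits the 32-bit value into bytes. The result has the same shape as the low four bytes of a Myricom MAC address (00:60:dd:...). That the GM2 global id is built this way is an inference from the numbers in this post, not something confirmed here against the GM documentation.

```python
def gm_id_bytes(gm_id: int) -> str:
    """Format a 32-bit id as colon-separated hex bytes, most significant first."""
    if not 0 <= gm_id <= 0xFFFFFFFF:
        raise ValueError("expected an unsigned 32-bit value")
    return ":".join(f"{b:02x}" for b in gm_id.to_bytes(4, "big"))


print(hex(3716117495))          # 0xdd7f73f7
print(gm_id_bytes(3716117495))  # dd:7f:73:f7
```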

When I run a modified mpiexec that prints the same information as above, we get:
===============================================================================
[examples]$ mpiexec -v -np 1 cpi
resolve_exe: prefixing dot to executable: "./cpi"
node  0: name = dell530, mpname = dell530, cpu = 1
wait_one_task_start: evt = 2, task 0 host dell530
All 1 task started.
read_gm_startup_ports: waiting for info
MAGIC           4
MPI ID          0
PORT            2
BOARD           0
NODE            2147483647
NUMANODE        0
PID             1592
REM PORT        8000
read_gm_startup_ports: mpich gm version 12510
read_gm_startup_ports: id 0 port 2 board 0 gm_node_id 2147483647
  numanode 0 pid  1592 remote_port  8000
[0] Error: Unable to translate GM global node id (2147483647)to local node id for the MPI id 0 !
mpiexec: Warning: accept_abort_conn: MPI_Abort from IP 134.253.175.61, killing all.
wait_tasks: got evt 0, did not match any
wait_tasks: numspawned = 1, got evt 4 for tid 2 host dell530 status 0
===============================================================================
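
A note on the NODE value mpiexec printed: 2147483647 is 2^31 - 1, the largest signed 32-bit integer, while the value seen through mpirun (3716117495) is larger than that and only fits in an unsigned 32-bit field. One possible explanation, offered as a guess and not verified against the mpiexec source, is that the id gets clamped somewhere by a signed 32-bit conversion (C's `strtol` saturates this way when `long` is 32 bits). The Python sketch below only checks the arithmetic; `clamp_to_int32` is a hypothetical helper, not an mpiexec function.

```python
INT32_MAX = 2**31 - 1
INT32_MIN = -(2**31)


def clamp_to_int32(value: int) -> int:
    """Saturate value into the signed 32-bit range."""
    return max(INT32_MIN, min(INT32_MAX, value))


mpirun_node = 3716117495   # NODE as printed via mpirun.ch_gm
mpiexec_node = 2147483647  # NODE as printed via mpiexec

print(mpiexec_node == INT32_MAX)                    # True
print(mpirun_node > INT32_MAX)                      # True
print(clamp_to_int32(mpirun_node) == mpiexec_node)  # True
```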

So we are somehow getting the wrong GM global node id here ... I haven't figured out why.
In gm 1.X the gm_ids were the same all around.  For example, on all nodes you had:

gmID MAC Address                                 gmName Route
---- ----------------- -------------------------------- ---------------------
   1 00:60:dd:7f:1a:0f                            node1 80 (this node)
   2 00:60:dd:7f:41:6c                            node2 bc b9 83 8e 85

gm_board_info on node2 displayed the same mapping (1 and 2): node2 was gmID 2
there as well.

However, in GM2 you have something like this on node1:
gmID MAC Address                                 gmName Route
---- ----------------- -------------------------------- ---------------------
   1 00:60:dd:7f:74:04                            node1 (this node) (mapper)
   2 00:60:dd:7f:73:f7                            node2

and on node2 you have:
gmID MAC Address                                 gmName Route
---- ----------------- -------------------------------- ---------------------
   1 00:60:dd:7f:73:f7                            node2 (this node)
   2 00:60:dd:7f:74:04                            node1 (mapper)

Here node2 is gmID 1 and node1 is gmID 2!
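
To make the difference concrete, here is a small Python sketch of the GM2 situation using the two tables above. Each node has its own gmID numbering, so an id only means something on the node that issued it, and a node-independent key is needed to translate between them. The sketch uses the MAC address as that key purely for illustration; what GM2 really uses as its global id, and the names `NODE1_TABLE`, `NODE2_TABLE`, and `local_id`, are assumptions.

```python
# Local gm_board_info tables from the GM2 example: MAC address -> local gmID.
NODE1_TABLE = {"00:60:dd:7f:74:04": 1, "00:60:dd:7f:73:f7": 2}  # as seen on node1
NODE2_TABLE = {"00:60:dd:7f:73:f7": 1, "00:60:dd:7f:74:04": 2}  # as seen on node2


def local_id(table: dict, mac: str) -> int:
    """Translate a node-independent key (here the MAC) into a local gmID."""
    try:
        return table[mac]
    except KeyError:
        raise LookupError(f"unable to translate {mac} to a local node id") from None


node2_mac = "00:60:dd:7f:73:f7"
print(local_id(NODE1_TABLE, node2_mac))  # 2: node2 as seen from node1
print(local_id(NODE2_TABLE, node2_mac))  # 1: node2 as seen from itself
```

A key that appears in neither table makes `local_id` raise, which is roughly the shape of the "Unable to translate GM global node id" failure in the mpiexec run.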

I'll keep pounding my head, but if you have any further insight I'd appreciate it. :-)

On Thu, 14 Aug 2003, Pete Wyckoff wrote:

> cdmaest at sandia.gov said on Wed, 13 Aug 2003 21:36 -0600:
> > On Wed, 13 Aug 2003, Pete Wyckoff wrote:
> > > cdmaest at sandia.gov said on Wed, 13 Aug 2003 14:12 -0600:
> > > > [0] Error: Unable to translate GM global node id (2147483647)to local node id for the MPI id
> > > 
> > > It looks like GM is reporting that the node id is -1.  It gives that to
> > > mpiexec, which dutifully returns it back to the process, which then
> > > complains that the number is not found.  All else looks okay.
> > > 
> > > If you do:
> > > 	gm_board_info | grep "This is"
> > > do you see something other than 2147483647 for the node id?
> > > Can you get the code to run using mpirun?
> [..]
> > I guess I forgot that this is against gm 2.0.5 as well. :-)
> 
> Now he tells us. :)  The mpich code I looked at yesterday to try to
> figure out from whence that error message had come includes an #ifdef
> section for GM2 that I ignored.  (mpich-1.2.5..10/mpid/ch_gm/gmpi_priv.c
> line 392 where the comment reads "What a mess".)
> 
> I'm not too interested in trying to run gm 2.0.5 here just yet.  If you
> manage to figure out what's wrong, we can certainly fix mpiexec to work
> around whatever changed in going to gm 2.0.5.
> 
> 		-- Pete
> 

-- Chris






