NETWORK MANAGEMENT TOOLS

Roger Moore Vice President I.P. Sharp Associates Limited

Toronto, Ontario

APL is often used from terminals, so some method of connecting terminals to an APL system is required. Various methods exist. When the distances from terminal to APL system become large, some scheme for sharing of communication links becomes economically important. The requirement of shared communication links reduces the number of network technology choices. The I.P. Sharp network also has the constraint that one terminal must be able to access several APL systems. The terminals used on the I.P. Sharp network are asynchronous terminals with erratic bandwidth requirements. This combination of requirements is normally met either by packet-switching or by sophisticated statistical multiplexing systems. Both technologies can meet the requirements of shared communication links and multiple hosts. Immunity from communication link errors and system overload is standard in both systems. The I.P. Sharp network uses packet-switching. Most public networks such as Datapac, Telenet, Datex-P, etc., also use packet-switching.

A packet-switched system has several types of components. Internal communication within the network is via packets. Packets are transmitted between network nodes. Communication links are the media used to transmit packets between adjacent nodes. A packet can be transmitted between two non-adjacent nodes by packet forwarding. A packet is forwarded over several communication links from the originating node to the destination node. Most traffic within the network is output from a T-task to a terminal. Two different kinds of computers are used as network nodes. The majority of the nodes are Alpha computers from Computer Automation Inc. These serve as originating nodes and have terminals connected to them. The IBM 3705 is used for connection to an APL system. Destination nodes are normally 3705s.

The network comprises 143 links and 131 nodes. With this scale of operation, a support system is required. Support is provided by 18 people in four cities and various programs and data bases. Except for the programs resident in the network nodes, all of the software to support the network is written in APL. All of the data bases are stored as APL files. Use of APL for network support has been fairly successful and has contributed to the steady growth of the network.

The data bases which describe the network and its operation can be divided into two categories. The "offline" data bases are maintained from terminals. The terminals are usually connected to the network and there may be provision for simultaneous update from multiple terminals. The important aspect of these "offline" data bases is that they have no special connection to the network. The "online" data bases are fed by the network. Events within the network result in changes to these "online" data bases.

Program preparation and loading

The oldest network support programs deal with program preparation for network nodes. Independent systems exist for the two types of computer used in the network. This allows program preparation to be performed from any APL terminal. The Alpha software is written in a conventional line-oriented assembly language. The 3705 software is written in a medium-level language which is processed by a simple compiler rather than an assembler. Both programming systems include schemes for managing source and object programs. The final 3705 object program is eventually transferred to an MVS load library. A simple MVS program moves the object program across the channel interface to the 3705.

In the very early days of the network, several clumsy methods were used for loading programs into the Alphas. Reading a hundred feet of paper tape at 10 cps was the worst. Some of the others involving floppy disc or changing a printed circuit card were not much better. All of these methods depended on a rather cumbersome system for sending a program across a single communication link. The need for a network-oriented system such that an Alpha in Stockholm could be easily reloaded was painfully obvious. After some discussion, a method for moving the object program image through the network while sustaining normal terminal traffic, except to the node being reloaded, was specified.

Some method for moving the object program from an APL file to the network was required. An N-task with a special connection to the network is used to link the APL files involved in loading to the network. The down line load task exists today as )PORT DLL in approximately its original specification. It is responsible for converting the object program into a sequence of load packets which are sent to the node adjacent to the node being reloaded and thence to the node being reloaded. It also reformats the returned packets into a core dump. The core dump is written to file for possible analysis. Some other duties include deciding which node has requested reloading and object program customization. The exact version of the loader which is receiving the load packets is sometimes of interest. If the APL task decides that the wrong loader is being used, it loads the preferred loader and forces use of the new loader.

Down line load has been a fairly satisfactory APL system. The interface with the network is via the System/370 byte multiplexor channel and a four thousand byte buffer in main store. One block of statements fills the buffer with load packets and a channel program. The channel program sends the load packets to the network and fills the buffer with dump packets. This process takes around ten seconds (exact time depends upon network delays). A few more statements extract the dump packets from the buffer and reformat the information as 16 bit words. The ability to process many object program words with a few APL statements results in a reasonable execution cost. The task is usually waiting for input/output completion. The most common state is to wait for a load request to arrive from the network. In this "wait for work" state, the loader uses about the same resources as a T-task connected to an abandoned terminal.
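The two buffer-handling steps described above (cutting the object program into load packets, and reformatting returned dump packets as 16 bit words) can be illustrated with a short sketch. This is Python rather than the APL actually used by the down line load task, and it is not a description of the real packet format: the payload size `WORDS_PER_PACKET`, the big-endian byte order and both function names are assumptions made purely for illustration.

```python
import struct

WORDS_PER_PACKET = 32  # hypothetical payload size; the real load packet format is not given


def to_load_packets(words, words_per_packet=WORDS_PER_PACKET):
    """Split a list of 16-bit object program words into packet payloads (bytes)."""
    packets = []
    for start in range(0, len(words), words_per_packet):
        chunk = words[start:start + words_per_packet]
        # Pack the chunk as big-endian unsigned 16-bit words (byte order is assumed).
        packets.append(struct.pack(">%dH" % len(chunk), *chunk))
    return packets


def dump_to_words(dump_packets):
    """Concatenate returned dump payloads and reformat them as 16-bit words."""
    raw = b"".join(dump_packets)
    if len(raw) % 2:
        raise ValueError("dump data does not contain a whole number of 16-bit words")
    return list(struct.unpack(">%dH" % (len(raw) // 2), raw))
```

As in the APL original, each step handles many words with a single array-style operation, which is what keeps the execution cost reasonable.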

A minor drawback to the load scheme is that it produces a core dump of every Alpha which it reloads. These are occasionally useful for analysis of hardware or software problems. In practice most of the three megabytes per week of dumps are useless and have to be discarded to conserve file storage.

Network parameters

One problem associated with loading Alphas is that the nodes are not strictly identical. Every node has a unique number assigned to it. Two different types of communication hardware might be installed in the same Alpha. Some customization of software is required for the different types of hardware. Sundry other parameters control logging, low speed line configuration and some special features which are not present in all nodes. The original solution to customization was to link a slightly different object program for every node. APL software to describe phase customization was introduced in 1976. The following quotation from the user documentation explains the need:

The growth of the IPSA/ITS concentrator network from two nodes to more than twenty has been possible only by centralized configuration control. The satisfactory operation of the network requires that all nodes be loaded with globally consistent route tables. Convenient maintenance of the software requires that the number of custom modules and phases be kept to a minimum. Local requirements sometimes dictate special features (the American TTY problem is a good example).

To meet the twin goals of minimizing the number of phases in the system and allowing local requirements (especially route tables) to be satisfied, the solution of patching a phase during loading has been adopted for the Alpha nodes. This solution has the advantages of late binding and separation of most site dependent material from site independent material (such as the executable code). It has the drawback of being in a different format than the executable code and thus requiring specialized display and update functions. This document attempts to describe the functions which have been provided.

Some of the original network control parameters have vanished. The Teletype problem was circumvented by software modifications which have made the concentrator immune to failures in the Teletype interface. Route table calculation was a very important part of configuration control until 1981. The original routing algorithm had a strong dependence upon globally consistent route tables. As the network topology became more complex, the APL functions to compute consistent route tables became more complex. In 1981, the routing algorithm was drastically changed and the need for route tables evaporated.

The central network control file remains as a convenient repository of network parameters. About twenty people are allowed to alter it; all users have read access. For a particular node, the following items are stored:

1) Node name (usually geographical location)

2) Name of the APL file which contains the program to be loaded into the node

3) Destination for logging messages

4) Hardware used on every network communications link of this node (a binary-valued parameter)

5) Baud rate and hardware type for every asynchronous communication line

6) Destination node for optional Tally printer

7) Public network type for X.25 interface nodes

The above lists all the node parameters which are in use at the present time (summer 1982). These node parameters and the applicable link parameters are used by the down line load task to customize the object program when it is loaded into an Alpha. One component in the file is an integer matrix with one row for every communication link in the network. The parameters which describe a single link are:

1) The node numbers for the two endpoints of the link

2) The line numbers within the endnodes of the link

3) The approximate delay time imposed by the link (normal, submarine, or satellite)

4) The link speed in bits per second

5) Theoretical worst case acknowledgement delay in milliseconds (computed from previous two parameters)

6) Class of service (used in alarming system but not the online network)

Adding a new node

Addition of a new node requires that the network control parameters be specified in the network control file. The parameters required by the 1 TS workspace are also entered at this time. A new node usually implies a new communication link. Some confirmation that the communication link is usable is desirable before attempting to proceed with the installation. The usual testing method is to connect one end of the new link to the network in its permanent location. The link termination for the new node is then placed in a state called "loopback". When the link is looped back upon itself, the existing network node should receive its own transmissions. If the node detects receipt of its own transmissions, an event message is sent to the logging system indicating that a particular link is in loopback. With this assurance that the link is operational, the link can be connected to the new node. It is possible to ship a node with the proper object program loaded into core storage. In this case the node will be in communication with the network shortly after it is attached to the communication link and switched on. If the machine was not shipped with the proper program, a simple console procedure can be used to initiate a reload from the APL down line load task. (If the console is defective or absent, the load can be started with a judiciously applied paper clip.) The progress of the load can be monitored from the console lights. Program loading normally requires from two to five minutes. When loading is complete, the connection to the network is automatically initialized and usable for data transmission.

The node installer will usually attach a terminal to the node at this time to confirm that it does indeed support normal traffic. A node may have between four and twenty-eight terminals connected to it either directly or via dial-up modems. Each of these requires a cable from the Alpha to the terminal or modem. All of these connections have to be tested by attempting to sign-on to APL. Testing of the terminal connections may reveal some boards in the node to be faulty and replacement might be required. After all of the terminal connections have been tested, sundry "paperwork" remains. This takes the form of signing on and updating several data bases which further describe the node. None of these are used in the online network but they are rather useful in the day to day administration of the network. These administrative data bases are fairly simple and specialized. They include the following kinds of information:

1) Communication link repair: Most of the links in the network use circuits leased from a telephone company or PTT. The provider of the circuit has a serial number for the link which must be used when reporting a fault on the link. One data base provides a circuit number and trouble reporting phone number for every network link which terminates in a particular city. Trouble reporting numbers are also provided for the dialup circuits which connect terminals to the node.

2) Replaceable parts: A typical node has about twelve field replaceable parts. The serial numbers and exact modification level of these capital goods must be recorded in a data base. Defective parts detected during installation must also be recorded in the data base. (Some of this work is often done before the node is shipped.)

3) Low speed documentation: The connections of terminals to the node must be documented. Every network port has a unique number which is visible to the APL user as (2 ⎕WS 3)[⎕IO+9]. There is a data base which relates that port number to a specific telephone circuit or hardwired terminal identification. Updating this data base is part of the installation job. This data base is used for two purposes. When a fault is reported in a specific terminal or telephone line, knowledge of the associated port number is useful in problem diagnosis and repair. Statistical information about port usage is maintained and analyzed. The primary purpose is to monitor usage of dial-in facilities. If all dial-in ports in a particular city are often in use, extra ports should be ordered and installed. Similarly an unused dial-in port may indicate excess capacity (or a defective port). Both overuse and underuse are conditions which should be monitored for efficient management of the network. This requires accurate documentation of the cabling so that hardwired terminals in an I.P. Sharp branch office are not confused with the dial-in ports.

4) Pending installs: Installation of a new node usually implies installation of new telephone lines. A small data base lists pending installations and removals of telephone lines. The new lines are marked installed for control of telephone company invoices.

Network logging

One major problem in 1976 was ascertaining whether a particular node was operational. The desperate solution of sampling )PORTS was used for several months. The original concentrator had some provisions for generating event messages and logging them on a Teletype connected to some node. This scheme was slightly modified by replacing the Teletype with an APL T-task. Logging messages originating in various network nodes are forwarded to the logging task. The network logging task analyzes and stores these messages. Storage is in APL files which can be read by any user. The logging messages fall into three categories:

1) Event messages are emitted when a node detects an event worth logging.

2) Statistical messages are generated at regular intervals by all nodes.

3) Some messages are replies to query messages emitted by the logging task.

An event message often records the failure or restoration of a network link. Event messages are normally written to file within ten seconds of the event. This is almost as fast as Teletype logging. It has the additional advantage of not being tied to a specific workstation. Any terminal can examine event messages which have been recorded in the file. Distributed access to the central event data base is quite useful. A substantial amount of fault analysis is possible simply by examining stored event messages. If all of the communication links connecting a particular node to the network are out of service, a reasonable inference is that the node itself has failed. The ability to obtain this information from any terminal with a connection to the APL system greatly assists in repair of faults.
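The inference described above (a node whose links are all out of service has probably failed) amounts to a simple subset test. The following Python sketch is illustrative only; the data layout, the function name and the sample topology are hypothetical and are not taken from the logging files.

```python
def suspected_failed_nodes(links, links_down):
    """Return the nodes whose every link is currently out of service.

    links      -- {link_id: (node_a, node_b)} describing the network (assumed layout)
    links_down -- set of link ids reported out of service by event messages
    """
    links_by_node = {}
    for link_id, (node_a, node_b) in links.items():
        links_by_node.setdefault(node_a, set()).add(link_id)
        links_by_node.setdefault(node_b, set()).add(link_id)
    # A node is suspect when the set of its links is contained in the set of failed links.
    return sorted(node for node, ids in links_by_node.items() if ids <= links_down)


# Hypothetical five-link network of four nodes.
links = {1: ("A", "B"), 2: ("B", "C"), 3: ("A", "C"), 4: ("C", "D"), 5: ("A", "D")}
print(suspected_failed_nodes(links, {2, 3, 4}))
```

With links 2, 3 and 4 down, only node C has lost every connection, so it is the only node reported. The result is an inference rather than a diagnosis: a node isolated by failures of the links themselves would be reported in the same way.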

Statistical messages record link and network behaviour. Link measurements are made by incrementing counters. The counters are periodically sampled and zeroed. Received and transmitted packets are counted. Packet retransmissions and line errors are also counted. All of the statistics are formatted into numeric matrices and appended to a file. To avoid particularly small components, data is buffered in the workspace until about ten thousand bytes of data have been accumulated. This buffering may delay logging up to eight minutes. The logging task examines the statistical data to detect links with particularly high error rates. These links and the corresponding error rates are flagged by a special status variable.
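The counter handling and error-rate check described above can be sketched as follows. This is a minimal Python illustration, not the node or logging task code: the counter names, the definition of error rate as retransmissions per transmitted packet, and the 5% threshold are all assumptions introduced for the example.

```python
class LinkCounters:
    """Per-link counters which are incremented on events, then sampled and zeroed."""

    def __init__(self):
        self.counts = {"rx_packets": 0, "tx_packets": 0,
                       "retransmissions": 0, "line_errors": 0}

    def increment(self, name, amount=1):
        self.counts[name] += amount

    def sample_and_zero(self):
        """Return a snapshot of the counters and reset them for the next interval."""
        snapshot = dict(self.counts)
        for name in self.counts:
            self.counts[name] = 0
        return snapshot


def high_error_links(samples, threshold=0.05):
    """Return {link_id: rate} for links whose retransmission rate exceeds threshold.

    The rate used here (retransmissions / transmitted packets) is an assumed definition.
    """
    flagged = {}
    for link_id, sample in samples.items():
        if sample["tx_packets"] == 0:
            continue  # an idle link has no meaningful rate
        rate = sample["retransmissions"] / sample["tx_packets"]
        if rate > threshold:
            flagged[link_id] = rate
    return flagged
```

Sampling and zeroing in one step means each statistical message describes exactly one interval, so rates can be computed from a single sample without reference to earlier ones.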

Statistical data accumulates at over half a megabyte per day. Some of the detailed statistics are useful for investigating specific problems. Much of the data is quite boring and worthless. A daily B-task attempts to preserve the interesting data and drop the detail. Detailed data is retained for two days. Two different methods of identifying "interesting" data are used. Medium and high error rate data is collected for the entire year. Peak traffic information is also preserved. Little software for subsequent processing of the statistical data has been provided. Functions for extracting subsets from the file and displaying the matrices constitute the bulk of the support. Simple APL manipulations of the data allow the user to arrange data according to his current needs. APL expressions to find measurements or combinations of measurements which exceed a threshold are easily constructed. Selected data can be sorted or otherwise massaged with trivial APL statements.

The logging task attempts to analyze certain event messages to determine current network topology. Logging within the network uses a pyramid of nodes to reduce the number of logging calls terminated in individual nodes. The logging task is at the apex of the pyramid. The logging calls are established from bottom to top. When a call is established, status reports from all nodes in the sub-pyramid are forwarded upward. Sign-on of the logging task allows calls to be established to the highest level of the pyramid. This results in status reports from all nodes in the network. A status report lists the network links terminating in the node including the name of the adjacent node and the link status. The actual network topology can be derived from these status reports. Comparison of the observed topology with the theoretical topology from the network control file is often useful. Links which are missing in the actual topology can be assumed to be out of service. A cabling error at a particular node may have permuted the line number to adjacent node correspondence for the node. This miscabling is not particularly serious and does not interfere with normal network operation. It does interfere rather seriously with down line load as the APL program searches the link parameter matrix to determine which node is being reloaded. The search uses the line number within the node adjacent to the node being reloaded as the search argument. The error can be corrected by simply changing the network control file to reflect the actual topology.
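Both the topology comparison and the down line load search described above are easy to express over a link table. The Python sketch below is an illustration under assumptions: the row layout (node, line, node, line) is a simplification of the link parameter matrix, and the function names are hypothetical.

```python
def missing_links(theoretical, observed):
    """Return links present in the control file but absent from the status reports.

    Each argument is an iterable of (node, node) pairs; link direction is ignored.
    """
    def normalize(links):
        return {frozenset(pair) for pair in links}

    return sorted(tuple(sorted(link)) for link in normalize(theoretical) - normalize(observed))


def node_being_reloaded(link_rows, adjacent_node, adjacent_line):
    """Find the node at the far end of the link on a given line of the adjacent node.

    link_rows -- iterable of (node_a, line_a, node_b, line_b), an assumed row layout
    Returns the far-end node number, or None if no row matches.
    """
    for node_a, line_a, node_b, line_b in link_rows:
        if (node_a, line_a) == (adjacent_node, adjacent_line):
            return node_b
        if (node_b, line_b) == (adjacent_node, adjacent_line):
            return node_a
    return None
```

The second function shows why miscabling matters to down line load: if two cables at the adjacent node are swapped, the line number in the load request still matches a row of the table, but the row names the wrong far-end node. Correcting the table to match the actual cabling restores the correct answer.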

Some network control capabilities are embedded within the logging task. These operate upon command from some task within the APL system. Examples include alteration of minor node parameters and inspection of tables within a node. All control tasks read a request from file or shared variable and report a result back to the request source. To service a request, the logging task establishes a call to the node and examines various storage locations in the node. For some types of requests, the exact locations examined in the later stages of request service may be determined by the results of prior examinations. Requests which alter node parameters sometimes require examination of the current state of the node to refine the subsequent commands. There was also a requirement to process several requests simultaneously.

A crude multi-programming system is used to service these requests. The variables associated with a specific request are tucked away in a package when a request is awaiting input. When input arrives, a function which depends on request type is called with the current input packet for the request as an argument. The function analyzes the input and alters the variables associated with the request. The function may emit certain types of packets to the node being examined. The function may also indicate successful completion of the request to the supervisory system. At request completion the package containing the variables is returned to the request source.
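The scheme just described resembles a small event-driven dispatcher: state is parked in a package between inputs, and a handler chosen by request type advances it. The Python sketch below illustrates the idea only. The class, the `inspect_table` handler and its two-step protocol (first reply gives an address, second reply gives the contents) are hypothetical, and a dictionary stands in for the APL package.

```python
class RequestSupervisor:
    """Services several requests at once by saving each request's variables between inputs."""

    def __init__(self, handlers):
        self.handlers = handlers   # request type -> handler function
        self.pending = {}          # request id -> package (dict of saved variables)
        self.completed = {}        # request id -> package returned to the request source

    def start(self, request_id, request_type, **variables):
        self.pending[request_id] = dict(variables, type=request_type, done=False)

    def on_input(self, request_id, packet):
        """Resume a request with its input packet; return packets to send to the node."""
        package = self.pending[request_id]
        outgoing = self.handlers[package["type"]](package, packet)
        if package["done"]:
            self.completed[request_id] = self.pending.pop(request_id)
        return outgoing


def inspect_table(package, packet):
    """Hypothetical handler: the location read second depends on the first reply."""
    if "address" not in package:
        package["address"] = packet
        return [("read", packet)]      # ask the node for the storage at that address
    package["contents"] = packet
    package["done"] = True             # signal successful completion to the supervisor
    return []
```

Because all state lives in the package, requests can be interleaved freely: input for one request can be processed while another is still waiting for its node to reply.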

Other support software

The logging task is supported by several other workspaces. There are two different workspaces for presenting link status information. The MONITOR workspace attempts to display current and recent status. The emphasis is on displaying conditions which might require manual action. Examples would be links or nodes which are not currently operational. A CRT is normally used for the monitor display. With a finite number of lines on the screen, conservation is desirable. An early step was to introduce a "service status" for every link in the network. Service status roughly corresponds to geographical area. The principal divisions are Europe and North America. A special status of "test" is used for certain links. Links with service status of test are links whose condition is of relatively little interest. Some of these links connect hardware or software test nodes to the network; others represent planned links which have not yet been installed. The current network contains fifteen links with test status. The network monitor never displays the status of the test links. There are provisions for further selection by service status so that display of European link status can be suppressed on North American screens. One universal need in an alarming system is some method of acknowledging alarms. Any authorized user of the monitor system can enter a line of text to provide extra information about some event on the screen. Examples include estimated time for repair, phone company reference number for the trouble, scheduled outages and various other things. The brief notes which are sometimes amplified by mailbox messages provide an adequate alarm acknowledgement system.

The other scheme for displaying link status is oriented towards hard copy reports. When a link fails the nodes at both ends of the link generate link failure messages. When the link becomes operational again, both nodes signal the improved status. Thus one link outage can generate four different event messages. The reporting workspace attempts to gather all messages referring to a single link in a single day and build a link by link report of incidents. The report includes failure codes and outage duration. The reason for failure may be different at the two ends of the link. Both codes are useful in problem diagnosis. The three most common codes represent: reset request from other node, timeout with loss of carrier, timeout with good carrier. A code pair such as: reset request/timeout with good carrier suggests one-way transmission difficulties. The node which received the reset request could receive properly but its transmissions were not received by the other node. A pseudo-code indicates link reset due to a power failure in a node. Examination of this report provides a daily summary of network faults. Selective reporting to examine the behaviour of a particular node in a specific time period is also possible.
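Collapsing the event messages for one outage into a single report line can be sketched as below. This Python fragment is an illustration, not the reporting workspace: the event tuple layout, the times in seconds and the rule that an incident closes once every node which reported the failure has reported restoration are assumptions.

```python
def link_incidents(events):
    """Group failure and restoration events into one incident per link outage.

    events -- iterable of (time, link_id, node, kind, code), where kind is
              "fail" or "restore" (an assumed layout; code is None on restore)
    Returns a list of dicts with link, start, duration and the failure code per node.
    """
    open_incidents = {}
    report = []
    for time, link_id, node, kind, code in sorted(events, key=lambda e: e[0]):
        if kind == "fail":
            incident = open_incidents.setdefault(
                link_id, {"link": link_id, "start": time, "codes": {}, "restored": set()})
            incident["codes"][node] = code
        elif link_id in open_incidents:
            incident = open_incidents[link_id]
            incident["restored"].add(node)
            # Close the incident once every node which reported failure has recovered.
            if incident["restored"] >= set(incident["codes"]):
                incident["duration"] = time - incident["start"]
                del incident["restored"]
                report.append(open_incidents.pop(link_id))
    return report
```

Keeping the failure code from each end side by side is what makes a pair such as reset request/timeout with good carrier visible in a single report entry.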

Sundry other support tools exist. There are workspaces for analysis of core dumps from network nodes. An attempt is made to match the observed contents of storage with the expected contents of storage. This is often useful in identifying failures in the core storage system. Certain other hardware errors with known "dump signatures" are also flagged by this workspace.

A somewhat fragile workspace attempts to draw a complete diagram of network topology. The preferred display device is an APL terminal or printer. This tends to restrict the number of angles at which links can be drawn to eight. The present result is a rather precarious tangle which bears no relation to geography. It does manage to present the detailed topology of the network in a form which some people consider usable.

Private workspaces for various special reports also exist. Many of these look at various network parameters and logs from the previous 24 hours and select features of interest to the workspace author. Examples include reloads, topology changes, high retransmission rates and various other things.

Another source of network statistics is from the APL systems rather than from network nodes and terminals operated by communication department staff. At every sign-off a record of the APL session is written to a file called the "sign-off history file". The record includes the network port from which the session originated, sign-off time and session duration. Other information such as characters transmitted and received and billing information is also included. By merging the sign-off history records from all inhouse systems, the occupancy of network ports over a time interval can be obtained. This dynamic data when combined with the static node cabling data described above allows usage of a specific group of ports to be monitored. Another use of the merged sign-off history file is to compute traffic between an originating node and a specific APL system. This information is used for network balancing purposes.
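Port occupancy can be derived from sign-off records because each record fixes both ends of a session: it began at the sign-off time less the duration. The Python sketch below illustrates the calculation under assumptions. The record layout (port, sign-off time, duration, all in seconds) is a simplification of the sign-off history file, and the function names are hypothetical; records from several systems are assumed to have been merged into one list already.

```python
def port_occupancy(records, start, end):
    """Return {port: fraction of the interval [start, end) during which it was in use}.

    records -- iterable of (port, signoff_time, duration), an assumed layout
    """
    busy = {}
    for port, signoff_time, duration in records:
        session_start = signoff_time - duration
        # Only the part of the session which falls inside the interval is counted.
        overlap = min(signoff_time, end) - max(session_start, start)
        if overlap > 0:
            busy[port] = busy.get(port, 0) + overlap
    return {port: seconds / (end - start) for port, seconds in busy.items()}


def group_occupancy(occupancy, ports):
    """Mean occupancy of a group of ports; ports with no sessions count as zero."""
    return sum(occupancy.get(port, 0.0) for port in ports) / len(ports)
```

The list of ports passed to `group_occupancy` would come from the static cabling data, for example all dial-in ports in one city, which is why accurate cabling records matter to the result.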
