NETWORK MANAGEMENT TOOLS

Roger Moore
Vice President, I.P. Sharp Associates Limited
Toronto, Ontario
APL is often used from terminals, which raises the requirement for some method
of connecting terminals to an APL system. Various methods exist. When the
distance from terminal to APL system becomes large, some scheme for sharing
communication links becomes economically important. The requirement for shared
communication links reduces the number of network technology choices. The I.P.
Sharp network also has the constraint that one terminal must be able to access
several APL systems. The terminals used on the I.P. Sharp network are
asynchronous terminals with erratic bandwidth requirements. This combination of
requirements is normally met either by packet-switching or by sophisticated
statistical multiplexing; both technologies can support shared communication
links and multiple hosts, and immunity from communication link errors and
system overload is standard in both. The I.P. Sharp network uses
packet-switching, as do most public networks such as Datapac, Telenet and
Datex-P.
A packet-switched
system has several types of components. Internal communication within the
network is via packets. Packets are transmitted between network nodes.
Communication links are the media used to transmit packets between adjacent
nodes. A packet can be transmitted between two non-adjacent nodes by packet
forwarding. A packet is forwarded over several communication links from the
originating node to the destination node. Most traffic within the network is
output from a T-task to a terminal. Two different kinds of computers are used
as network nodes. The majority of the nodes are Alpha computers from Computer
Automation Inc. These serve as originating nodes and have terminals connected
to them. The IBM 3705 is used for connection to an APL system. Destination
nodes are normally 3705s.
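To illustrate the forwarding idea, the following sketch (written in modern Python purely for illustration; the real node software was assembly language, and the packet fields and routing-table layout shown are assumptions) moves a packet hop by hop from an originating node to a destination node.

    # Illustrative sketch only: store-and-forward packet transmission.
    # Node numbers, packet fields and the routing-table layout are
    # assumptions, not the actual Alpha or 3705 formats.
    from dataclasses import dataclass

    @dataclass
    class Packet:
        source: int        # originating node number
        destination: int   # destination node number
        payload: bytes

    # For each node, the table names the adjacent node to which a packet
    # bound for a given destination is handed next.
    ROUTES = {
        10: {30: 20},      # node 10 forwards traffic for node 30 via node 20
        20: {30: 30},      # node 20 is adjacent to node 30
    }

    def forward(packet, node):
        """Return the sequence of nodes the packet visits, starting at node."""
        path = [node]
        while node != packet.destination:
            node = ROUTES[node][packet.destination]   # one hop over one link
            path.append(node)
        return path

    print(forward(Packet(10, 30, b"T-task output"), 10))   # -> [10, 20, 30]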
The network comprises
143 links and 131 nodes. With this scale of operation, a support system is
required. Support is provided by 18 people in four cities and various programs
and data bases. Except for the programs resident in the network nodes, all of
the software to support the network is written in APL. All of the data bases
are stored as APL files. Use of APL for network support has been fairly
successful and has contributed to the steady growth of the network.
The data bases which
describe the network and its operation can be divided into two categories. The
"offline" data bases are maintained from terminals. The terminals are
usually connected to the network and there may be provision for simultaneous
update
from multiple
terminals. The important aspect of these "offline" data bases is that
they have no special connection to the network. The "online" data
bases are fed by the network. Events within the network result in changes to
these "online" data bases.
Program preparation and loading
The oldest network
support programs deal with program preparation for network nodes. Independent
systems exist for the two types of computer used in the network. This allows
program preparation to be performed from any APL terminal. The Alpha software
is written in a conventional line-oriented assembly language. The 3705 software
is written in a medium-level language which is processed by a simple compiler
rather than an assembler. Both programming systems include schemes for managing
source and object programs. The final 3705 object program is eventually
transferred to an MVS load library. A simple MVS program moves the object
program across the channel interface to the 3705.
In
the very early days of the network, several clumsy methods were used for
loading programs into the Alphas. Reading a hundred feet of paper tape at 10
cps was the worst. Some of the others involving floppy disc or changing a
printed circuit card were not much better. All of these methods depended on a
rather cumbersome system for sending a program across a single communication
link. The need for a network-oriented system such that an Alpha in Stockholm
could be easily reloaded was painfully obvious. After some discussion, a method
for moving the object program image through the network while sustaining normal
terminal traffic, except to the node being reloaded, was specified.
Some
method for moving the object program from an APL file to the network was
required. An N-task with a special connection to the network is used to link
the APL files involved in loading to the network. The down line load task exists
today as )PORT DLL in
approximately its original specification. It is responsible for converting the
object program into a sequence of load packets which are sent to the node
adjacent to the node being reloaded and thence to the node being reloaded. It
also reformats the returned packets into a core dump. The core dump is written
to file for possible analysis. Other duties include deciding which node has
requested reloading and customizing the object program. The exact version of
the loader which is receiving the load packets is sometimes of interest. If the
APL task decides that the wrong loader is being used, it loads the preferred
loader and forces use of the new loader.
Down
line load has been a fairly satisfactory APL system. The interface with the
network is via the System/370 byte multiplexor channel and a four-thousand-byte buffer
in main store. One block of statements fills the buffer with load packets and a
channel program. The channel program sends the load packets to the network and
fills the buffer with dump packets. This process takes around ten seconds
(exact time depends upon network delays). A few more statements extract the
dump packets from the buffer and reformat the information as 16 bit words. The
ability to process many object program words with a few APL statements results
in a reasonable execution cost. The task is usually waiting for input/output
completion. The most common state is to wait for a load request to arrive from
the network. In this "wait for work" state, the loader uses about the
same resources as a T-task connected to an abandoned terminal.
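The two bulk conversions mentioned above, slicing an object-program image into load packets and repacking the returned dump bytes as 16-bit words, are exactly the kind of operation a few array-oriented statements handle cheaply. The sketch below shows the same transformations in Python for illustration only; the packet size and framing are assumptions, not the real load-packet format.

    # Sketch of the two bulk conversions described above (illustrative only).

    def make_load_packets(image, chunk=256):
        """Slice an object-program image into fixed-size load packets."""
        return [image[i:i + chunk] for i in range(0, len(image), chunk)]

    def dump_to_words(dump):
        """Repack returned dump bytes as big-endian 16-bit words."""
        if len(dump) % 2:
            dump += b"\x00"                    # pad an odd trailing byte
        return [int.from_bytes(dump[i:i + 2], "big")
                for i in range(0, len(dump), 2)]

    packets = make_load_packets(bytes(1000))
    print(len(packets), len(packets[-1]))      # 4 packets, last one 232 bytes
    print(dump_to_words(b"\x12\x34\xab\xcd"))  # [4660, 43981]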
A
minor drawback to the load scheme is that it produces a core dump of every
Alpha which it reloads. These are occasionally useful for analysis of hardware
or software problems. In practice most of the three megabytes per week of dumps
are useless and have to be discarded to conserve file storage.
Network parameters
One problem associated
with loading Alphas is that the nodes are not strictly identical. Every node
has a unique number assigned to it. Two different types of communication
hardware might be installed in the same Alpha. Some customization of software
is required for the different types of hardware. Sundry other parameters
control logging, low speed line configuration and some special features which
are not present in all nodes. The original solution to customization was to
link a slightly different object program for every node. APL software to
describe phase customization was introduced in 1976. The following quotation from
the user documentation explains the need:
The
growth of the IPSA/ITS concentrator network from two nodes to more than twenty
has been possible only by centralized configuration control. The satisfactory
operation of the network requires that all nodes be loaded with globally
consistent route tables. Convenient maintenance of the software requires
that the number of custom modules and phases be kept to a minimum. Local
requirements sometimes dictate special features (the American TTY problem is a
good example).
To meet the twin goals of minimizing the number of phases
in the system and allowing local requirements (especially route tables) to be
satisfied, the solution of patching a phase during loading has been adopted for
the Alpha nodes. This solution has the advantages of late binding and
separation of most site dependent material from site independent material
(such as the executable code). It has the drawback of being in a different
format than the executable code and thus requiring specialized display and
update functions. This document attempts to describe the functions which
have been provided.
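The patching scheme described in the quotation can be pictured as follows. The sketch below (Python, for illustration only; the (address, word) patch format is an assumption) overwrites site-dependent words in a copy of a common phase just before it is loaded, so that one site-independent object program serves many nodes.

    # Sketch of late-binding customization: site-dependent parameters are
    # patched into a copy of the shared phase as it is loaded (illustrative).

    def patch_phase(image, patches):
        """Return a copy of image with 16-bit words written at byte addresses."""
        out = bytearray(image)
        for addr, word in patches.items():
            out[addr:addr + 2] = word.to_bytes(2, "big")
        return out

    phase = bytearray(64)                          # one shared object program
    node_41 = patch_phase(phase, {0x10: 41, 0x12: 9600})
    node_42 = patch_phase(phase, {0x10: 42, 0x12: 4800})
    print(node_41[0x10:0x14].hex())                # 00292580
    print(node_42[0x10:0x14].hex())                # 002a12c0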
Some of the original network control parameters have vanished. The
Teletype problem was circumvented by software modifications which have made the
concentrator immune to failures in the Teletype interface. Route table
calculation was a very important part of configuration control until 1981. The
original routing algorithm had a strong dependence upon globally
consistent route tables. As the network topology became more complex, the APL
functions to compute consistent route tables became more complex. In 1981, the
routing algorithm was drastically changed and the need for route tables
evaporated.
The central network control file remains as a convenient repository of
network parameters. About twenty people are allowed to alter it; all users
have read access. For a particular node, the following items are stored:
1) Node name (usually geographical location)
2) Name of the APL file which contains the program to be loaded into the node
3) Destination for logging messages
4) Hardware used on every network communications link of this node (a binary-valued parameter)
5) Baud rate and hardware type for every asynchronous communication line
6) Destination node for optional Tally printer
7) Public network type for X.25 interface nodes
The above lists all the node parameters which are in use at the present
time (summer 1982). These node parameters and the applicable link parameters
are used by the down
line load task to
customize the object program when it is loaded into an Alpha. One component in
the file is an integer matrix with one row for every communication link in the
network. The parameters which describe a single link are:
1) The node numbers for the two endpoints of the link
2) The line numbers within the end nodes of the link
3) The approximate delay time imposed by the link (normal, submarine, or satellite)
4) The link speed in bits per second
5) Theoretical worst case acknowledgement delay in milliseconds (computed from the previous two parameters)
6) Class of service (used in the alarming system but not in the online network)
Adding a new node
Addition of a new node
requires that the network control parameters be specified in the network
control file. The parameters required by the 1 TS workspace are also entered at this time. A new
node usually implies a new communication link. Some confirmation that the
communication link is usable is desirable before attempting to proceed with the
installation. The usual testing method is to connect one end of the new link to
the network in its permanent location. The link termination for the new node is
then placed in a state called "loopback". When the link is looped back
upon itself, the existing network node should receive its own transmissions. If
the node detects receipt of its own transmissions an event message is sent to
the logging system indicating that a particular link is in loopback. With
this assurance that the link is operational, the link can be connected to
the new node. It is possible to ship a node with the proper object program
loaded into core storage. In this case the node will be in communication with
the network shortly after it is attached to the communication link and switched
on. If the machine was not shipped with the proper program, a simple console
procedure can be used to initiate a reload from the APL down line load task.
(If the console is defective or absent, the load can be started with a
judiciously applied paper clip). The progress of the load can be monitored from
the console lights. Program loading normally requires from two to five minutes.
When loading is complete, the connection to the network is automatically
initialized and usable for data transmission.
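The loopback check amounts to a node noticing that it is receiving its own transmissions. A small sketch in Python, illustrative only; the frame fields and the logging text are assumptions.

    # Sketch of the loopback test: if received frames carry this node's own
    # number, the far end of the link is looped back (illustrative only).

    def link_in_loopback(own_node, received_frames):
        """True when the received traffic was originated by this node."""
        return any(f.get("origin") == own_node for f in received_frames)

    frames = [{"origin": 17, "seq": 1}, {"origin": 17, "seq": 2}]
    if link_in_loopback(17, frames):
        print("EVENT: link 3 of node 17 is in loopback")  # sent to logging task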
The
node installer will usually attach a terminal to the node at this time to
confirm that it does indeed support normal traffic. A node may have between
four and twenty-eight terminals connected to it either directly or via dial-up
modems. Each of these requires a cable from the Alpha to the terminal or modem.
All of these connections have to be tested by attempting to sign-on to APL.
Testing of the terminal connections may reveal that some boards in the node are
faulty and must be replaced. After all of the terminal connections
have been tested, sundry "paperwork" remains. This takes the form of
signing on and updating several data bases which further describe the node.
None of these are used in the online network but they are rather useful in the
day to day administration of the network. These administrative data bases are
fairly simple and specialized. They include the following kinds of information:
1) Communication link repair: Most of the links in the network use circuits leased
from a telephone company or PTT. The provider of the circuit has a serial
number for the link which must be used when reporting a fault on the link. One
data base provides a circuit number and trouble reporting phone number for
every network link which terminates in a particular city.
Trouble reporting numbers are also provided for the dialup
circuits which connect terminals to the node.
2) Replaceable parts: A typical node has about
twelve field replaceable parts. The serial numbers and exact modification level
of these capital goods must be recorded in a data base. Defective parts
detected during installation must also be recorded in the data base. (Some of
this work is often done before the node is shipped.)
3) Low speed documentation: The connections of
terminals to the node must be documented. Every network port has a unique
number which is visible to the APL user as (2 ⎕WS 3)[⎕IO+9]. There is a data base which relates that port number to
a specific telephone circuit or hardwired terminal identification. Updating
this data base is part of the installation job. This data base is used for two
purposes. When a fault is reported in a specific terminal or telephone line,
knowledge of the associated port number is useful in problem diagnosis and
repair. Statistical information about port usage is maintained and analyzed.
The primary purpose is to monitor usage of dial-in facilities. If all dial-in
ports in a particular city are often in use, extra ports should be ordered and
installed. Similarly, an unused dial-in port may indicate excess capacity (or a
defective port). Both overuse and underuse are conditions which should be
monitored for efficient management of the network. This requires accurate
documentation of the cabling so that hardwired terminals in an I.P. Sharp
branch office are not confused with the dial-in ports. (A sketch of this usage
analysis appears after the list.)
4) Pending installs: Installation of a new node usually
implies installation of new telephone lines. A small data base lists pending
installations and removals of telephone lines. The new lines are marked
installed for control of telephone company invoices.
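The dial-in usage analysis mentioned in item 3 can be sketched as follows, in Python purely for illustration; the thresholds, port names and occupancy figures are assumptions. It flags cities whose dial-in group is usually full and individual ports that are never used.

    # Sketch of the dial-in port usage analysis (illustrative only).
    # busy[port] = fraction of samples in which the port was occupied.
    busy = {"Toronto-1": 0.92, "Toronto-2": 0.88, "Toronto-3": 0.90,
            "Ottawa-1": 0.35, "Ottawa-2": 0.00}

    def by_city(ports):
        groups = {}
        for name, load in ports.items():
            groups.setdefault(name.split("-")[0], []).append(load)
        return groups

    for city, loads in by_city(busy).items():
        if min(loads) > 0.8:                       # whole group usually full
            print(city, "dial-in group often full: order extra ports")
    for name, load in busy.items():
        if load == 0.0:                            # never used
            print(name, "unused: excess capacity or defective port")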
Network logging
One major problem in
1976 was ascertaining whether a particular node was operational. The
desperate solution of sampling )PORTS was used for several months. The original concentrator had
some provisions for generating event messages and logging them on a Teletype
connected to some node. This scheme was slightly modified by replacing the
Teletype with an APL T-task. Logging messages originating in various network
nodes are forwarded to the logging task. The network logging task analyzes and
stores these messages. Storage is in APL files which can be read by any user.
The logging messages fall into three categories:
1) Event messages are emitted when a node detects an event worth logging.
2) Statistical messages are generated at regular intervals by all nodes.
3) Some messages are replies to query messages emitted by the logging task.
An
event message often records the failure or restoration of a network link. Event
messages are normally written to file within ten seconds of the event. This is
almost as fast as Teletype logging. It has the additional advantage of not
being tied to a specific workstation. Any terminal can examine event messages
which have been recorded in the file. Distributed access to the central event
data base is quite useful. A substantial amount of fault analysis is possible
simply by examining stored event messages. If all of the communication links
connecting a particular node to the network are out of
service, a reasonable
inference is that the node itself has failed. The ability to obtain this
information from any terminal with a connection to the APL system greatly
assists in repair of faults.
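The inference described above (every link terminating in a node is down, therefore the node itself has probably failed) is simple to mechanize. A sketch in Python, with an assumed observed-status list purely for illustration:

    # Sketch of the fault inference: when every link that terminates in a
    # node is out of service, suspect the node itself (illustrative only).

    # (node A, node B, link up?) for each link in the observed topology
    link_status = [(10, 20, True), (20, 30, False),
                   (30, 40, False), (40, 10, True)]

    def suspected_dead_nodes(links):
        nodes = {n for a, b, _ in links for n in (a, b)}
        return sorted(n for n in nodes
                      if all(not up for a, b, up in links if n in (a, b)))

    print(suspected_dead_nodes(link_status))       # -> [30]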
Statistical messages
record link and network behaviour. Link measurements are made by incrementing
counters. The counters are periodically sampled and zeroed. Received and
transmitted packets are counted. Packet retransmissions and line errors are
also counted. All of the statistics are formatted into numeric matrices and
appended to a file. To avoid particularly small file components, data is buffered in
the workspace until about ten thousand bytes of data have been accumulated.
This buffering may delay logging up to eight minutes. The logging task examines
the statistical data to detect links with particularly high error rates. These
links and the corresponding error rates are flagged by a special status
variable.
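A sketch of the counter handling and error-rate flagging described above, in Python for illustration; the counter names, sample values and the five per cent threshold are assumptions.

    # Sketch of link statistics handling: sample and zero the counters,
    # then flag links whose error rate exceeds a threshold (illustrative).

    counters = {   # link -> running counters since the last sample
        "Toronto-Ottawa": {"tx": 1200, "rx": 1180, "retrans": 4,  "errors": 3},
        "Toronto-London": {"tx": 800,  "rx": 790,  "retrans": 90, "errors": 60},
    }

    def sample_and_zero(all_counters):
        snapshot = {link: dict(c) for link, c in all_counters.items()}
        for c in all_counters.values():
            for key in c:
                c[key] = 0
        return snapshot

    def high_error_links(snapshot, threshold=0.05):
        flagged = {}
        for link, c in snapshot.items():
            total = c["tx"] + c["rx"]
            rate = (c["retrans"] + c["errors"]) / total if total else 0.0
            if rate > threshold:
                flagged[link] = round(rate, 3)
        return flagged

    print(high_error_links(sample_and_zero(counters)))  # {'Toronto-London': 0.094}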
Statistical
data accumulates at over half a megabyte per day. Some of the detailed
statistics are useful for investigating specific problems. Much of the data is
quite boring and worthless. A daily B-task attempts to preserve the interesting
data and drop the detail. Detailed data is retained for two days. Two different
methods of identifying "interesting" data are used. Medium and high
error rate data is collected for the entire year. Peak traffic information is
also preserved. Little software for subsequent processing of the
statistical data has been provided. Functions for extracting subsets from the
file and displaying the matrices constitute the bulk of the support. Simple APL
manipulations of the data allow the user to arrange data according to his
current needs. APL expressions to find measurements or combinations of
measurements which exceed a threshold are easily constructed. Selected data can
be sorted or otherwise massaged with trivial APL statements.
The
logging task attempts to analyze certain event messages to determine current
network topology. Logging within the network uses a pyramid of nodes to reduce
the number of logging calls terminated in individual nodes. The logging task is
at the apex of the pyramid. The logging calls are established from bottom to
top. When a call is established, status reports from all nodes in the
sub-pyramid are forwarded upward. Sign-on of the logging task allows calls to
be established to the highest level of the pyramid. This results in status
reports from all nodes in the network. A status report lists the network links
terminating in the node including the name of the adjacent node and the link
status. The actual network topology can be derived from these status reports.
Comparison of the observed topology with the theoretical topology from the
network control file is often useful. Links which are missing in the actual
topology can be assumed to be out of service. A cabling error at a particular
node may have permuted the correspondence between line numbers and adjacent
nodes for that node. This miscabling is not particularly serious and does not interfere with
normal network operation. It does interfere rather seriously with down line
load as the APL program searches the link parameter matrix to determine which
node is being reloaded. The search uses the line number within the node
adjacent to the node being reloaded as the search argument. The error can be
corrected by simply changing the network control file to reflect the actual
topology.
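A sketch of the topology comparison, in Python for illustration; links are modelled as unordered node pairs and the example sets are assumptions.

    # Sketch of comparing observed topology (from status reports) with the
    # theoretical topology (from the network control file).  Illustrative only.

    theoretical = {frozenset(p) for p in [(10, 20), (20, 30), (30, 40), (20, 40)]}
    observed    = {frozenset(p) for p in [(10, 20), (20, 30), (20, 40)]}

    out_of_service = theoretical - observed   # in the control file, not seen
    unexpected     = observed - theoretical   # seen, not in the control file

    print([tuple(sorted(l)) for l in out_of_service])   # -> [(30, 40)]
    print([tuple(sorted(l)) for l in unexpected])       # -> []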
Some network control
capabilities are embedded within the logging task. These operate upon command
from some task within the APL system. Examples include alteration of minor node
parameters and inspection of tables within a node. All control tasks read a
request from file or shared variable and report a result back to the request
source. To service a request, the logging task establishes a call to the node
and examines various storage locations in the node. For some types of requests,
the exact locations examined in the later stages of request service may be
determined by the results of
prior examinations.
Requests which alter node parameters sometimes require examination of the
current state of the node to refine the subsequent commands. There was also a
requirement to process several requests simultaneously.
A
crude multi-programming system is used to service these requests. The variables
associated with a specific request are tucked away in a package when a request
is awaiting input. When input arrives, a function which depends on request type
is called with the current input packet for the request as an argument. The
function analyzes the input and alters the variables associated with the
request. The function may emit certain types of packets to the node being
examined. The function may also indicate successful completion of the request
to the supervisory system. At request completion the package containing the
variables is returned to the request source.
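The scheme amounts to a small event-driven dispatcher: per-request state parked in a package, a handler selected by request type, and completion reported when the handler says so. A sketch in Python, with the request type, fields and completion rule invented purely for illustration:

    # Sketch of the crude multi-programming scheme described above
    # (illustrative only; request types and fields are assumptions).

    pending = {}   # request id -> package of variables for that request

    def start_request(req_id, req_type, node):
        pending[req_id] = {"type": req_type, "node": node, "collected": []}

    def handle_inspect(package, packet):
        """Handler for table-inspection requests: gather words until done."""
        package["collected"].extend(packet["words"])
        return packet.get("last", False)           # True = request complete

    HANDLERS = {"inspect": handle_inspect}

    def on_input(req_id, packet):
        package = pending[req_id]
        if HANDLERS[package["type"]](package, packet):
            print("request", req_id, "complete:", package["collected"])
            del pending[req_id]                    # package goes back to source

    start_request(7, "inspect", node=41)
    on_input(7, {"words": [1, 2, 3]})
    on_input(7, {"words": [4], "last": True})      # request 7 complete: [1, 2, 3, 4]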
Other support software
The logging task is
supported by several other workspaces. There are two different workspaces for
presenting link status information. The MONITOR workspace attempts to display current and recent status.
The emphasis is on displaying conditions which might require manual action.
Examples would be links or nodes which are not currently operational. A
CRT is normally used for the monitor display. With a finite number of lines on
the screen, conservation is desirable. An early step was to introduce a
"service status" for every link in the network. Service status
roughly corresponds to geographical area. The principal divisions are Europe
and North America. A special status of "test" is used for certain links.
Links with service status of test are links whose condition is of relatively
little interest. Some of these links connect hardware or software test nodes to
the network; others represent planned links which have not yet been installed.
The current network contains fifteen links with test status. The network
monitor never displays the status of the test links. There are provisions for
further selection by service status so that display of European link status can
be suppressed on North American screens. One universal need in an alarming
system is some method of acknowledging alarms. Any authorized user of the
monitor system can enter a line of text to provide extra information about some
event on the screen. Examples include estimated time for repair, phone company
reference number for the trouble, scheduled outages and various other things.
The brief notes which are sometimes amplified by mailbox messages provide an
adequate alarm acknowledgement system.
The other scheme for displaying link status is oriented towards hard
copy reports. When a link fails the nodes at both ends of the link generate
link failure messages. When the link becomes operational again, both nodes
signal the improved status. Thus one link outage can generate four different
event messages. The reporting workspace attempts to gather all messages
referring to a single link in a single day and build a link by link report of
incidents. The report includes failure codes and outage duration. The reason
for failure may be different at the two ends of the link. Both codes are useful
in problem diagnosis. The three most common codes represent: reset request from
other node, timeout with loss of carrier, timeout with good carrier. A code
pair such as: reset request/timeout with good carrier suggests one-way
transmission difficulties. The node which received the reset request could
receive properly but its transmissions were not received by the other
node. A pseudo-code indicates link reset due to a power failure in a node.
Examination of this report provides a daily summary of network faults.
Selective reporting to examine the behaviour of a particular node in a specific
time period is also possible.
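A sketch of the code pairing used in the report, in Python for illustration; the failure codes and the sample events are assumptions.

    # Sketch of pairing the failure codes reported by the two ends of a link
    # (illustrative only; codes and sample events are assumptions).

    RESET, TO_NO_CARRIER, TO_GOOD_CARRIER = "reset", "timeout-nc", "timeout-gc"

    events = [   # (link, reporting node, failure code) from one day's log
        ("Toronto-London", "Toronto", RESET),
        ("Toronto-London", "London",  TO_GOOD_CARRIER),
    ]

    def pair_codes(day_events):
        report = {}
        for link, node, code in day_events:
            report.setdefault(link, {})[node] = code
        return report

    for link, codes in pair_codes(events).items():
        print(link, codes)
        if set(codes.values()) == {RESET, TO_GOOD_CARRIER}:
            print("  suggests one-way transmission difficulty")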
Sundry other support
tools exist. There are workspaces for analysis of coredumps from network nodes.
An attempt is made to match the observed contents of storage with the expected
contents of storage. This is often useful in identifying failures in the core
storage system. Certain other hardware errors with known "dump
signatures" are also flagged by this workspace.
A
somewhat fragile workspace attempts to draw a complete diagram of network
topology. The preferred display device is an APL terminal or printer. This
tends to restrict the number of angles at which links can be drawn to eight.
The present result is a rather precarious tangle which bears no relation to
geography. It does manage to present the detailed topology of the network in a
form which some people consider usable.
Private workspaces for
various special reports also exist. Many of these look at various network
parameters and logs from the previous 24 hours and select features of interest
to the workspace author. Examples include reloads, topology changes, high
retransmission rates and various other things.
Another
source of network statistics is the APL systems themselves, rather than the network
nodes and terminals operated by communications department staff. At every
sign-off a record of the APL session is written to a file called the
"sign-off history file". The record includes the network port from
which the session originated, sign-off time and session duration. Other
information such as characters transmitted and received and billing information
is also included. By merging the sign-off history records from all in-house
systems, the occupancy of network ports over a time interval can be obtained.
This dynamic data when combined with the static node cabling data described
above allows usage of a specific group of ports to be monitored. Another use of
the merged sign-off history file is to compute traffic between an originating
node and a specific APL system. This information is used for network balancing
purposes.
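A sketch of the merge and occupancy calculation, in Python for illustration; the record fields, times (in minutes) and port grouping are assumptions.

    # Sketch of merging sign-off history records from several APL systems and
    # computing port occupancy over an interval (illustrative only).

    # one record per session: (port, sign-on minute, duration in minutes, host)
    system_a = [(101, 540, 30, "A"), (102, 600, 120, "A")]
    system_b = [(101, 580, 45, "B")]

    def occupancy(records, ports, start, end):
        """Total connected minutes on `ports` falling inside [start, end)."""
        total = 0
        for port, on, dur, _ in records:
            if port in ports:
                total += max(0, min(end, on + dur) - max(start, on))
        return total

    merged = system_a + system_b
    print(occupancy(merged, {101, 102}, start=540, end=660), "port-minutes")  # 135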