Network Management Protocols
There were several sub-protocols in IPSANET which were inaccessible to the normal user. They were used for network administration or special applications and sometimes required hardware other than an asynchronous terminal.
Network Administration
There were two online network administration applications. These were DLL and logging.
DOWNLINE LOAD
DLL was used to load software from an APL file into an Alpha. Loading one Alpha required the use of four different computers.
1] Lazarus ran on the Alpha which was being reloaded. The initial version was stored in the bootstrap ROM. An improved version of Lazarus was loaded rather early in the process. The name was inspired by John XI:44: “And he that was dead came forth.”
2] Neighbour ran on the adjacent node. It used an uninitiated network link to communicate with Lazarus. A virtual call connected it to the Librarian Interface running on node one.
3] Librarian Interface ran on the network administration version of the 3705 software. It mediated between the VC to Neighbour and a command sequence on a System/360 byte multiplexer channel.
4] DLL Librarian was a SHARP APL N-task running on the IPSA PROD system. (An N-task is an APL task with no attached terminal. When started it loads a workspace which must have an autostart to begin execution of some APL function.) This N-task was privileged in the APL\360 sense. It had access to a dedicated 4096 byte buffer in 370 mainstore and access to a dedicated byte multiplexer channel address.
These four processes acted together to reload one Alpha. As a by-product of the load, a partial dump of Alpha storage was obtained. (The dump was partial as it excluded areas, such as the buffer pool, which were not initialised by DLL.) The time required to reload one Alpha could be as short as 90 seconds; this assumed that Neighbour and the Librarian Interface were connected via 9600 bps links or were co-resident in node one, and that the link between Neighbour and Lazarus was fast and free of delay such as that of a trans-Atlantic cable link. A load could take as long as ten minutes in harsh conditions.
The process operated as follows:
1] The DLL Librarian task, on startup or after finishing a prior load, would execute an initial channel program on the DLL ESC address. The channel program waited for a DLL request; its execution time could be hours or seconds.
2] Librarian Interface accepted the first command of the channel program and then waited for a call request from Neighbour (via normal IPSANET call establishment).
3] The Lazarus program began execution in an Alpha. It sent a short packet out via the first high order line to the network.
4] Neighbour link control software detected a packet from Lazarus on an uninitiated line. Either the special short packet or a normal dump packet would suffice. Receipt of this packet triggered an attempt to establish a network call to the Librarian Interface. If this attempt was rejected by the network (DLL busy or not running), a future packet from Lazarus would trigger a retry of call establishment.
5] The Librarian Interface, in state 2] above, would accept the network call and an initial data packet from Neighbour. The initial data packet specified the node number of Neighbour and the HO line number that had received the packet from Lazarus. These two words were passed to 3705 mainstore via the channel program, and the channel program (CCW chain) was ended normally.
6] DLL Librarian resumed execution when the channel program completed. The two words describing the Neighbour task were fetched (via 0 ibeam). By reading one component of the network control file and applying dyadic iota, the identity of Lazarus could be determined (see the sketch after this list). The control file provided parameters describing the Lazarus node, such as the version of the software to be loaded and sundry parameters to slightly customize this software.
7] Librarian controlled the overall process, loading segments ranging in size from one to many words and sometimes examining the returned dump information. This was used to determine whether the ROM or core version of Lazarus was in use, the Alpha storage size, etc.
8] The final segment was the IPSANET software, followed by a GOTO packet to start execution of the Alpha program. The dump was written to file, successful completion was logged, and the virtual call was terminated. Librarian then returned to its initial state to await a new DLL request.
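The step 6] lookup lends itself to a short illustration. Here is a minimal present-day sketch in Python (the original was APL; the table layout and all names here are hypothetical) of matching the (node, line) pair reported by Neighbour against the network control file:

    # One control-file component, one row per configured link:
    # (neighbour node, HO line) identifies the Alpha behind that link.
    CONTROL_FILE = [
        # (neighbour_node, ho_line, lazarus_node, software_version)
        (1, 2, 7, "alpha-v12"),
        (4, 1, 9, "alpha-v12"),
    ]

    def identify_lazarus(neighbour_node, ho_line):
        """Map the two words reported by Neighbour to load parameters."""
        keys = [(n, l) for (n, l, _, _) in CONTROL_FILE]
        try:
            i = keys.index((neighbour_node, ho_line))   # dyadic-iota analogue
        except ValueError:
            raise LookupError("no control-file entry for this link")
        _, _, lazarus_node, version = CONTROL_FILE[i]
        return lazarus_node, version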
Segment loading protocol
The Librarian split the load text into load packets. Each load packet began with a load address in Lazarus, followed by 25 or fewer 16-bit words. The Librarian task issued write commands with load text and read commands to acquire dump text. The protocol required that a write command be paired with a read command with the same byte count. These two commands could strictly alternate in the channel program: W1 R1 W2 R2 W3 R3. If there was a significant network delay between Neighbour and node one, this half-duplex scheme was a little slow. For large segments Librarian used a slightly more complex arrangement:
W1 W2 W3 R1 W4 {Rn Wn+3} Rf-1 Rf.
This was the CCW sequence that the 3705 saw. Due to finite buffer size, this CCW sequence was broken into multiple channel programs; the break point was always after a read CCW. The initial write commands allowed the Neighbour-to-Lazarus operation to continue despite minor network delays. Librarian could also be sluggish in responding to channel program completion: it was an ordinary APL N-task in most respects, subject to swapping delay, file system delay and competition with other APL users for resources.
The alternation of read and write commands was the only flow control used on the Neighbour to 3705 virtual call. If Librarian had presented a long sequence of write commands to the 3705, some computer might have exhausted its buffer pool and crashed.
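The pipelined ordering is easy to generate mechanically. A minimal Python sketch (hypothetical names; the original was a CCW chain built by the APL Librarian) that reproduces the W/R interleave above for f load packets:

    def ccw_sequence(f, window=3):
        """Return the pipelined CCW order W1 W2 W3 R1 W4 R2 W5 ... for f
        load packets, keeping up to `window` writes unanswered so the
        Neighbour/Lazarus side never sits idle waiting for node one."""
        seq = []
        next_write, next_read = 1, 1
        while next_read <= f:
            while next_write <= f and next_write - next_read < window:
                seq.append(("W", next_write))
                next_write += 1
            seq.append(("R", next_read))   # reads drain the window
            next_read += 1
        return seq

    # ccw_sequence(6) -> W1 W2 W3 R1 W4 R2 W5 R3 W6 R4 R5 R6
    # The real chain was split into several channel programs because of
    # finite buffer space; the split always fell after a read CCW.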
The 3705 Librarian Interface processing of read and write commands was rather simple. A write command generated a data packet to Neighbour. A read command caused a wait for the next dump packet from Neighbour. These packets carried normal interend sequence numbers and so were processed in emission order rather than receipt order.
Communication between Lazarus and Neighbour was strictly half-duplex. Neighbour sent a load packet to Lazarus. The framing of packets to Lazarus differed slightly from normal network framing: the initial byte was an even-parity SOH (hex 81) rather than the normal STX. This meant that Lazarus ignored normal network packets, which might be sent towards it before Neighbour received the first packet from Lazarus. After Neighbour sent a load packet to Lazarus, it expected a dump packet with the same address in response. If the expected packet did not arrive within a reasonable time, a special Lazarus enquiry frame was sent; Lazarus would answer with a copy of its most recent dump packet.
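A sketch of this stop-and-wait exchange, in Python with hypothetical frame and timeout details (the text specifies only the SOH framing, the address-matching rule and the enquiry retry):

    from collections import namedtuple

    Frame = namedtuple("Frame", "kind address words")
    SOH = 0x81   # even-parity SOH opened a Lazarus frame; normal packets used STX

    def load_one_packet(send_frame, recv_frame, address, words, timeout=2.0):
        """Send one load packet, then wait for the dump packet bearing
        the same address.  recv_frame returns None on timeout; the
        timeout value itself is an assumption."""
        send_frame(Frame("load", address, words))
        while True:
            reply = recv_frame(timeout)
            if reply is None:
                # silence: send the special enquiry frame and Lazarus
                # answers with a copy of its most recent dump packet
                send_frame(Frame("enquiry", None, None))
            elif reply.kind == "dump" and reply.address == address:
                return reply.words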
In hindsight this scheme for remote reloading was fairly satisfactory. At the peak of the Alpha population there was sometimes contention for use of the single downline load port on the IPSA PROD system; the pressure to allow multiple simultaneous downline loads eased when Beta nodes supplanted the Alphas. The scheme provided central control of software levels and configuration, which might have been difficult to achieve with removable storage media such as floppy disc. (The Northern Telecom Datapac node had a tape drive for reloading.) There were also cost and simplicity benefits in the Alpha associated with not having an external storage device.
The Beta node, which was based on an IBM PC-AT (Intel 286), came with a hard drive, and so real-time software loading was unnecessary. There was some cleverness in the start-up software so that the default program chosen from the hard drive was the most recent version to have had a successful shutdown rather than the most recently loaded version.
LOGGING
Operating a network requires information about events taking place far from the network operations centres. Sending event and statistical information to a data analysis and display centre is the usual solution.
In the very first implementation with two Alphas and no 3705, logging was quite simple. Local log messages were printed on the attached Model 33 TTY. These messages could also be sent as datagrams to another node where they were printed on the TTY.
This scheme grew into a network logging system. Node parameters were set to send all log messages to node one. Node one attempted to start a logging task whenever APL started (or whenever the log task failed). The log task signon and initial workspace )LOAD command were supplied by the 3705. Log messages were transferred through node one to the log task. This task was almost an ordinary SHARP APL T-task; it differed from a normal T-task in that it performed 8-bit ARBIN/ARBOUT without byte reversal. (The R-task used in Bisync had the same property.)
The log task separated the statistical messages from the event messages. Event messages were converted to hex and logged. There was also an analysis of those event messages which described state changes in network links; this allowed the log task to publish the status of network links as new event messages arrived. Statistical messages from the Alpha, Beta and 3705 nodes had slightly different formats and information content. The three statistical message types were processed separately into three APL matrices. When these reached a certain size, they were written to a statistical log file.
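A rough sketch of the statistics side, assuming a simple row-count flush threshold (the real threshold is not recorded here):

    FLUSH_ROWS = 100   # assumed threshold; the real value is not recorded here

    class StatAccumulator:
        """One accumulator per statistical message type (Alpha, Beta, 3705)."""
        def __init__(self, node_type, append_to_file):
            self.node_type = node_type
            self.rows = []
            self.append_to_file = append_to_file

        def add(self, fields):
            self.rows.append(fields)
            if len(self.rows) >= FLUSH_ROWS:                    # matrix big enough
                self.append_to_file(self.node_type, self.rows)  # one file write
                self.rows = []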
One problem with passing log messages to an APL task is that there can be a lot of them. Sharp APL T-tasks operated according to strict half-duplex rules: an input stream was assumed to have a line turnaround event every now and then. I knew that the buffer limit on the APL side was about 4000 characters, so the 3705 inserted an artificial turnaround before this limit was reached. The APL task could not process any input until the turnaround event occurred. If character count had been the only criterion for inserting a turnaround, processing of event messages could have been delayed for minutes. It was also undesirable to send log messages one at a time to APL, because APL programs work better on a large chunk of data than on a tiny scalar. The solution used in the 3705 was to record the time when the first event message of an ARBIN was transferred to APL; one second later, turnaround was forced (unless the character count criterion had already forced it). This provided a good compromise between timely processing of event messages and reducing the resource consumption of the APL log task.
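The turnaround rule itself is simple enough to sketch. A Python rendering, with assumed margins, of the two criteria (character count, and one second after the first message):

    import time

    CHAR_LIMIT = 4000    # approximate APL-side input buffer limit
    HOLD_SECONDS = 1.0   # turnaround forced one second after the first message

    class TurnaroundBuffer:
        """Batches log messages for one ARBIN, forcing an artificial
        turnaround on either criterion (the margin of 256 is assumed)."""
        def __init__(self, deliver):
            self.deliver = deliver        # hand the batch to the APL task
            self.pending = []
            self.first_at = None

        def add_message(self, text, now=None):
            now = time.monotonic() if now is None else now
            if not self.pending:
                self.first_at = now       # remember the first message's arrival
            self.pending.append(text)
            self._check(now)

        def tick(self, now=None):
            """Poll periodically so the one-second rule fires on a quiet link."""
            if self.pending:
                self._check(time.monotonic() if now is None else now)

        def _check(self, now):
            near_limit = sum(map(len, self.pending)) >= CHAR_LIMIT - 256
            held_long = now - self.first_at >= HOLD_SECONDS
            if near_limit or held_long:
                self.deliver("".join(self.pending))   # the turnaround
                self.pending = []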
Tracking of event messages doesn't provide a complete picture of network status: when the log task starts operation, many of the network links are already operational. Log task initialisation solved this problem by interrogating all the nodes which it could find. Every link control table in a node recorded the number of the adjacent node as well as the uninitiated/initiated status of the link. Three inspect datagrams (remote PEEK) per node were required to generate a matrix showing actual network topology. These inspect datagrams had to be sent in a half-duplex fashion, as the results of one inspection were used to compute storage addresses for the subsequent inspection. The process ran against up to five nodes at a time. I think it took about a minute to determine the topology of a hundred-node network.
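A sketch of the per-node interrogation, in Python with an invented storage layout (the real Alpha addresses and table packing are not given here); the three PEEKs are sequential because each answer feeds the next address computation:

    LINK_TABLE_POINTER = 0x0100   # illustrative address only

    def probe_node(peek, node):
        """Topology probe: three dependent PEEKs against one node.
        peek(node, addr, count) -> list of `count` words."""
        base = peek(node, LINK_TABLE_POINTER, 1)[0]   # 1: locate link tables
        n_links = peek(node, base, 1)[0]              # 2: how many links
        entries = peek(node, base + 1, n_links)       # 3: all entries at once
        # assumed packing: adjacent node number plus an initiated bit
        return [(i + 1, e >> 1, bool(e & 1)) for i, e in enumerate(entries)]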
Alpha nodes had the option of sending log messages (perhaps event messages only) to a second log destination. This was always an Alpha, which printed them on the TTY. Europe favoured this, and I believe both the London and Amsterdam nodes were used for this purpose.
Some of this had to change when the routing protocol was altered and the datagram forwarding capability was lost. Log messages were then transmitted via flow-controlled virtual calls. To avoid swamping node one's capacity to terminate virtual calls, a log fan-in node was introduced. A log fan-in node was an Alpha which could terminate about twenty log VCs. Received packets were printed and sent out on the single outgoing log virtual call of the fan-in node. This outgoing call was terminated in another fan-in node or in node one. Node one terminated about five log calls (there was no hard limit).
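A sketch of the fan-in data path (Python, hypothetical names); the essential point is that the downstream call's flow control provides the backpressure:

    class LogFanIn:
        """Sketch of a log fan-in node: many upstream log VCs, one
        downstream log VC towards node one."""
        MAX_UPSTREAM = 20          # an Alpha could terminate about twenty

        def __init__(self, downstream_send, printer=print):
            self.downstream_send = downstream_send   # blocks under flow control
            self.printer = printer                   # local TTY printing
            self.upstream = set()

        def accept_call(self, call_id):
            if len(self.upstream) >= self.MAX_UPSTREAM:
                return False       # reject: at capacity
            self.upstream.add(call_id)
            return True

        def on_packet(self, call_id, packet):
            self.printer(packet)                 # print locally
            self.downstream_send(packet)         # forward; backpressure applies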
A minor benefit of this was that if the network log task was temporarily inaccessible, flow control would cause a finite number of messages to accumulate in the log fan-in nodes. When a downstream log VC was established, these messages were passed on via the flow-controlled logging call. (It was only by accident that I discovered this caused frequent restarts of node five in Amsterdam when the log task was unavailable. Node five was improperly configured and didn't really have enough storage to act as a log fan-in node.)
The topology determination procedure was also changed. The active inspection of nodes by the log task was replaced by new messages. A high order status message was introduced: for every high order link, it indicated whether the link was initiated and the number of the adjacent node. (I think this was done by entering zero for an inoperative link, as zero was an illegal node number.) This HO status message was emitted in two circumstances:
1] When the downstream (towards node one) log call was established.
2] When the downstream log fan-in node sent an HO status request.
After the HO status message was sent, a log fan-in node iterated through its upstream log calls and sent HO status requests. This iteration proved troublesome in some fan-in nodes. All of these calls were allowed to send one message in response to the status request; the replies tended to arrive simultaneously at the fan-in node and cause buffer depletion problems which were sometimes serious. The empirical solution was to slow down the iteration: introducing a half-second delay between upstream status requests made the problem go away.
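The fix is almost trivial to express. A Python sketch of the staggered sweep:

    import time

    def sweep_upstream_status(upstream_calls, send_status_request, delay=0.5):
        """Send an HO status request to each upstream log call, pausing
        between requests so the single-message replies do not all land
        at once and deplete the buffer pool."""
        for call in upstream_calls:
            send_status_request(call)
            time.sleep(delay)    # the empirical half-second fix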
A third network management protocol was remote debug, which allowed inspection of storage in a remote node. It initially operated via datagrams but was easily converted to a virtual call protocol. The original use was TTY oriented and only allowed inspection of a single location per inquiry. As the time to process the inspect request was usually less than 100 ms, the time to print a five-character response on the 10 cps Model 33 dominated. After the old value was displayed, the user was invited to type a single character to do one of the following:
Prepare to alter the storage location
Inspect the next location
Inspect the previous location
Inspect the same location again.
For programmers the sequential display had obvious uses. I found that repeating the inspect was a useful operational tool. Consider the sequence:
N4. /* select node four as the inspect target */
I108. /* learn the value X of location 108 (pointer to second high order line) */
Manually compute Y := X+18. /* 18 is hex; Y is the address of the received error count for HO line two */
IY. /* repeat every second */
Repeating the IY inspection every second gave a hint as to the error rate on this link.
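A sketch of the original single-location debug loop (Python; the actual command characters are not recorded here, so 'a', 'n', 'p' and 'r' are hypothetical):

    def debug_loop(inspect, alter, read_char, addr):
        """inspect(addr) -> 16-bit word; alter(addr, value) patches it;
        read_char() returns the user's single-character choice."""
        while True:
            value = inspect(addr)
            print(f"{addr:04X}: {value:04X}")
            c = read_char()
            if c == "a":                  # prepare to alter this location
                alter(addr, int(input("new value (hex): "), 16))
            elif c == "n":                # inspect the next location
                addr += 1
            elif c == "p":                # inspect the previous location
                addr -= 1
            elif c == "r":                # inspect the same location again
                continue
            else:
                break                     # any other character ends the session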
When the log task started generating inspect requests it was obvious that the capability to inspect many locations in the same node was required. A multiple inspect packet which specified several addresses to inspect (up to 25 words) worked well on the Alpha and 3705, because a 15-bit address fitted within the 16-bit data size. The extra bit (the high-order bit on the Alpha, the low-order bit on the 3705) was used in an indirect addressing scheme. Addresses in a multi-inspect packet were offsets from a base, which had an initial value of zero; until after the first inspect address with the indirect bit set, addresses were therefore absolute. Subsequent addresses were treated as offsets from the base. A complex inspect involved a sequence of inspects with the indirect bit set to find some specific area, such as the second data buffer in a transmit queue. Then an address sequence of (1, 2, 3, ... for the Alpha; 2, 4, 6, ... for the 3705) would display the buffer contents.
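A Python sketch of the base/indirect address resolution as described above (constants are illustrative):

    INDIRECT = 0x8000   # the spare bit: high-order on the Alpha (low-order on the 3705)

    def multi_inspect(memory, entries):
        """Resolve up to 25 packed address words against `memory`
        (a mapping from address to 16-bit word)."""
        base, results = 0, []
        for e in entries:
            value = memory[(e & ~INDIRECT) + base]   # offset from current base
            results.append(value)
            if e & INDIRECT:
                base = value    # pointer chased: later addresses offset from here
        return results

    # e.g. [queue_head | INDIRECT, 0 | INDIRECT, 1, 2, 3] locates the queue,
    # follows the head pointer to a buffer, then displays three of its words.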
Some modification to the scheme would have been required for the Beta node which had larger storage.
There was a hook in the log task so that users who were privileged with respect to the log task could generate inspect requests, and even patch requests if super-privileged. IBNL recipes (see the Routing essay) calculated by a traffic analysis N-task were distributed by the log task.
These were the true online support protocols. The files maintained by the two APL tasks (downline load and logging) were examined by various other tasks. A particularly important one was the communications monitor task, which displayed network status on an HDS 108 video terminal. This task was designed by Gary Follows and maintained by his group. It indicated non-test network links which were inoperative, and also gave an indication of downline load activity.
There was also a B-task which ran every day at 0030 UTC to purge the statistics file. The intent was to preserve two full days of data in the raw collection file. 'Interesting' statistical data messages from the most recent day were captured to a long-term file. 'Interesting' was defined for a 3705 link as a load in excess of some pre-determined threshold. For Alpha links the heaviest load of the 24-hour period was preserved. High error rate periods were also of interest, although the data reduction discarded most information from the sample other than timestamp and error count.
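A sketch of the data reduction, with invented thresholds (the real values are not recorded here):

    LOAD_THRESHOLD = 0.5    # assumed 'interesting' load fraction for 3705 links
    ERROR_THRESHOLD = 100   # assumed error-count cutoff

    def reduce_day(samples_3705, samples_alpha):
        """Each sample: {'link', 'load', 'errors', 'timestamp'}."""
        keep = [s for s in samples_3705 if s["load"] > LOAD_THRESHOLD]
        heaviest = {}                       # Alpha: heaviest sample per link
        for s in samples_alpha:
            best = heaviest.get(s["link"])
            if best is None or s["load"] > best["load"]:
                heaviest[s["link"]] = s
        keep.extend(heaviest.values())
        keep.extend({"timestamp": s["timestamp"], "errors": s["errors"]}
                    for s in samples_3705 + samples_alpha
                    if s["errors"] > ERROR_THRESHOLD)   # keep only two fields
        return keep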
Following a suggestion from a Datamation article, I wrote an APL function to give a crude 3D plot of the link error data of one line over a month. The horizontal axis was 48 half-hour periods; each line represented a different day. For each point in this 31 x 48 diagram I printed a letter representing the base-two logarithm of the error count. For a link with only a medium load, the expected correlation between time and error rate could be observed. (Telecommunication legend has it that error rates increase during daylight hours and decrease at night and on weekends.) For reasons discussed under Framing Protocol, IPSANET's ability to detect line errors varied with traffic: under heavy traffic the error counts dropped due to the limitations of the detection scheme.
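The plot function is easy to approximate. A Python sketch (letter assignment assumed: 'A' for a count of 1, 'B' for 2-3, 'C' for 4-7, and so on):

    import string

    def error_plot(counts):
        """counts: 31 (days) x 48 (half-hours) error counts for one line."""
        def letter(n):
            if n <= 0:
                return " "                             # no errors that period
            # n.bit_length() - 1 is floor(log2(n)) for n >= 1
            return string.ascii_uppercase[min(n.bit_length() - 1, 25)]
        return "\n".join("".join(letter(c) for c in row) for row in counts)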
I assume all of this is academic in 2005, when transmission channels use fibre optics with an expected error rate far lower than that of the voice grade lines which IPSANET used.