DEFINITION
BYNET, acronym for "BanYan NETwork," is a folded banyan switching network built upon the capacity of the YNET. It acts as a distributed multi-fabric inter-connect to link PEs, AMPs and nodes on a Massively Parallel Processing (MPP) system.
OVERVIEW
Interconnect technology is important for parallel computing. The BYNET is Teradata's "system interconnect for high-speed, fault tolerant warehouse-optimized messaging between nodes."[11] As an indispensable part of the Teradata MPP system, it can be understood better with its predecessor "YNET" in the background.
In 1982, the YNET interconnect technology used on the DBC 1012 was patented for its parallelism. As a broadcast-based hardware solution, it linked all the IFPs, COPs, and AMPs together with circuit boards and cables in a dual bus architecture. Two custom-built buses operated concurrently within the interconnect framework: YNET A to connect the IFPs and COPs on one side, and YNET B to connect the AMPs on the other. The YNET was characterized by 1) its 6 MB/sec bandwidth, 2) a hardware-based sorting mechanism, 3) a binary tree structure with dual channels, and 4) global synchronization. Teradata V1 achieved its parallel processing by balancing the workload across all AMPs via the YNET.
The YNET had its own weaknesses. Its design assumed that messages would be broadcast to the processors that owned the data, but in fact many messages turned out to be point-to-point. Furthermore, when the AMPs had data to return, the YNET actually moved the data together from the relevant AMPs. This overhead could be expensive, e.g., extra message traffic, task initiations and disk accesses. Although the YNET allowed hundreds of processors to share the same bandwidth, that bandwidth did not scale to the maximum configuration of 1024 processors.
The BYNET interconnect was designed to address the YNET's weaknesses, especially its limited scalability. The BYNET handles inter-vproc messaging via shared memory. Whereas the YNET actually transported the data for a join, the BYNET changes the AMP-based ownership of the memory location to that of the destination AMP. By minimizing data traffic in this way, it preserves interconnect bandwidth. Notably, the BYNET is linearly scalable up to a system configuration of 4096 processor modules.[8]
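The difference between transporting rows and handing over ownership can be pictured with a minimal Python sketch. The names `SharedSegment`, `send_by_copy` and `transfer_ownership` are hypothetical, and the sketch only models the idea inside one process; it is not BYNET or PDE code.

```python
from dataclasses import dataclass, field


@dataclass
class SharedSegment:
    """Hypothetical shared-memory segment tagged with the AMP that owns it."""
    owner_amp: int
    rows: list = field(default_factory=list)


def send_by_copy(segment: SharedSegment, dest_amp: int) -> SharedSegment:
    """Copy-style delivery: every row is duplicated for the destination AMP."""
    return SharedSegment(owner_amp=dest_amp, rows=list(segment.rows))


def transfer_ownership(segment: SharedSegment, dest_amp: int) -> SharedSegment:
    """Hand-off: only the owner tag changes; the rows stay where they are."""
    segment.owner_amp = dest_amp
    return segment


seg = SharedSegment(owner_amp=1, rows=[("k1", 10), ("k2", 20)])
copied = send_by_copy(seg, dest_amp=2)
moved = transfer_ownership(seg, dest_amp=2)

print(copied.rows is seg.rows)  # a second list was built for the copy
print(moved is seg)             # the same segment, now owned by AMP 2
```

In the hand-off case no row is moved, which is the sense in which the ownership change saves data traffic.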
With the BYNET, the YNET hardware-based sorting mechanism has been replaced with a software-based one. The software-based sorting offers more flexibility* in selecting the sorting key and the sorting order. This redesign does not affect application performance, for the following reasons:
- Hardware has developed rapidly: the sorting function can be left to larger memories and faster processors;*
- Each BYNET interface controller is equipped with a dedicated SPARC processor.
Therefore, a simple high-speed interconnect subsystem with enhanced scalability has become more significant.
NOTICE
- More flexibility: The YNET hardware-based sorting mechanism limited the sorting key to 512 bytes.[8]
- Larger memories and faster processors: In the 1980s, the Intel 8086 processors were quite slow, and memories were not large enough. Hence it made sense to distribute processing between the processors and the interconnect component. However, such clever but complex distributed-processing designs seem unnecessary in an era of ever faster and cheaper CPUs and memories.
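As a rough illustration of what "more flexibility in selecting the sorting key and the sorting order" means in software, the Python sketch below sorts the same rows by a chosen key in a chosen direction. The row layout and column names are invented for the example and say nothing about how Teradata implements its sort.

```python
from operator import itemgetter

# Hypothetical rows; the column names are made up for illustration.
rows = [
    {"region": "west", "amount": 250, "customer": "Cho"},
    {"region": "east", "amount": 900, "customer": "Abe"},
    {"region": "west", "amount": 900, "customer": "Bly"},
]

# Order by amount descending, then customer ascending.
# Python's sort is stable, so sort on the secondary key first.
by_customer = sorted(rows, key=itemgetter("customer"))
result = sorted(by_customer, key=itemgetter("amount"), reverse=True)

print([r["customer"] for r in result])
```

Changing the key or the direction is a one-line change here, whereas a hardware sorter fixes such choices in the circuitry.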
FUNCTIONS
Physical Level
- Linking all the vprocs in an SMP node;
- Linking all the nodes in a multinode system.
Signaling / Messaging Level
- Carrying bi-directional* point-to-point, multicast and broadcast messages among AMPs and PEs;
- Carrying bi-directional point-to-point and broadcast messages among the nodes.
Application Level
- Merging answer sets back to PEs;
- Transporting data.
NOTICE
- Bi-directional signaling / messaging: The BYNET transmission channel comprises two concurrent subchannels: a high capacity forward channel for executing the main BYNET activities and a low capacity back-channel for monitoring and signaling the status of those activities. The BYNET-driven semaphores (semaphore count or semaphore flags) are passed through the back-channel to signal the BYNET activity status (progress, failure or success). The bi-directional signaling aims to "minimize message flow within the database software by offering simpler alternatives to intense message-passing when parallel units require coordination." [10]
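The idea of signaling status through a count or a flag, instead of exchanging a full message per parallel unit, can be sketched as follows. `CompletionSemaphore` is a hypothetical name, and the class is a plain in-memory model rather than the BYNET back-channel itself.

```python
class CompletionSemaphore:
    """Hypothetical count-style semaphore: starts at the number of
    participants and is decremented as each one reports success."""

    def __init__(self, participants: int):
        self.remaining = participants
        self.failed = False  # flag-style status, set on any failure

    def signal(self, ok: bool = True) -> None:
        """Record one participant's outcome."""
        if ok:
            self.remaining -= 1
        else:
            self.failed = True

    def status(self) -> str:
        """Summarize the activity as progress, failure or success."""
        if self.failed:
            return "failure"
        return "success" if self.remaining == 0 else "in progress"


sem = CompletionSemaphore(participants=3)
sem.signal()
sem.signal()
print(sem.status())
sem.signal()
print(sem.status())
```

The coordinator reads a single status value instead of collecting one reply message from every participant.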
KEY FEATURES
- Linear Scalability: The BYNET's folded banyan architecture was designed to address the linear performance scalability limitation of its predecessor - the YNET. It allows the overall network bandwidth to scale linearly with each node added to the configuration. Hence performance penalty for system expansion can be avoided.
- Fault Tolerance: Firstly, the BYNET banyan topology delivers multiple connection paths for each network, and a Teradata MPP system is typically equipped with two BYNET networks (BYNET 0 and BYNET 1). Secondly, the BYNET can automatically detect faults and then reconfigure the network. If an unworkable connection path is detected in a network, that network is automatically reconfigured so that all tasks avoid the unworkable path. Furthermore, in the unusual case that a network (say, BYNET 0) cannot be reconfigured, the hardware on BYNET 0 is disabled and tasks are re-routed around the failed components to BYNET 1.
- Enhanced Performance: By default, a Teradata MPP system is equipped with two BYNET networks. Since both BYNET networks in a system are active, the system performance can be enhanced by using the combined bandwidth of the two networks.
- Load Balancing: Workload or data traffic is distributed automatically and dynamically between the two BYNET networks.
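The interplay of load balancing and failover across the two networks can be sketched in Python. The round-robin policy and the `DualBynet` class are assumptions made for the example; the text above only says that traffic is distributed automatically and dynamically, without naming the policy.

```python
from itertools import cycle


class DualBynet:
    """Hypothetical model of two active networks, BYNET 0 and BYNET 1."""

    def __init__(self):
        self.up = {0: True, 1: True}
        self._next = cycle([0, 1])  # assumed round-robin policy

    def disable(self, network: int) -> None:
        """Mark a network as unusable, e.g. after reconfiguration fails."""
        self.up[network] = False

    def pick_network(self) -> int:
        """Choose a network for the next message, skipping disabled ones."""
        for _ in range(2):
            candidate = next(self._next)
            if self.up[candidate]:
                return candidate
        raise RuntimeError("no BYNET network available")


net = DualBynet()
print([net.pick_network() for _ in range(4)])  # traffic alternates
net.disable(0)
print([net.pick_network() for _ in range(3)])  # everything goes to BYNET 1
```

While both networks are up, the traffic uses their combined bandwidth; once one is disabled, the other carries all of it.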
HARDWARE AND SOFTWARE
The BYNET is the combination of hardware and software that enables the high speed communication inside and between the nodes.
Hardware
The BYNET hardware is used to connect nodes on the MPP system, including the following components:
- BYNET switches (e.g., 8-port BYNET Ethernet switch, BYA4M switch board, BYA32 switch chassis and BYA64/BYC64 switch cabinet);
- BYNET interface adapter boards (e.g., Node BIC Adapter like PCIe BIC2E);
- BYNET network cables (e.g., BYNET Ethernet Switch-to-Node Cables, BYA32-to-Node Cables and BYA64-to-BYC64 Cables).
Software
The BYNET software consists of the following:
- The BYNET driver: It is installed on all the nodes of the MPP system, and used as an interface between the BYNET hardware and the PDE software.
- Boardless BYNET: It is installed on the SMP system or the single node platform to emulate the activity of the BYNET hardware. SMP systems do not use the BYNET hardware, and that explains why the software is named "Boardless BYNET".
MESSAGING TYPES
Among Nodes
The following types of messaging are carried out among nodes via BYNET hardware and software (i.e., the BYNET driver):
- Broadcast messaging from one node to all nodes;
- Point-to-point messaging between two nodes.
Among Vprocs
The following types of messaging are carried out among vprocs via the PDE (Parallel Database Extensions) and BYNET software (i.e., the BYNET driver on the MPP system or the Boardless BYNET on the SMP system):
- Point-to-point messaging:
  - Point-to-point messaging among vprocs on the same node, via the PDE and the BYNET software;
  - Point-to-point messaging among vprocs on different nodes: Firstly, the message is sent to the recipient node using the inter-node point-to-point messaging via the BYNET hardware and software; then the message is delivered to the recipient vproc using the inter-vproc point-to-point messaging via the PDE and the BYNET software.
- Multicast messaging: Firstly, the message is sent to all nodes using the inter-node broadcast messaging via the BYNET hardware and software; then the PDE and the BYNET software identify the recipient vprocs and deliver the message to them.
- Broadcast messaging: Firstly, the message is sent to all nodes using the inter-node broadcast messaging via the BYNET hardware and software; then the message is delivered to the recipient vprocs using the inter-vproc broadcast messaging via the PDE and the BYNET software.
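The two-stage delivery described in the list above (first to the node, then to the vproc) can be sketched in Python. The node layout, the vproc names and the three functions are hypothetical and only mirror the steps in the text.

```python
# Hypothetical layout: node id -> vprocs hosted on that node.
NODES = {
    "node0": ["amp0", "amp1", "pe0"],
    "node1": ["amp2", "amp3", "pe1"],
}


def node_of(vproc: str) -> str:
    """Find the node that hosts a given vproc."""
    return next(node for node, vprocs in NODES.items() if vproc in vprocs)


def point_to_point(sender: str, recipient: str) -> list:
    """Return the hops a point-to-point message takes."""
    src, dst = node_of(sender), node_of(recipient)
    hops = []
    if src != dst:
        hops.append(f"{src} -> {dst} (inter-node point-to-point)")
    hops.append(f"{dst} -> {recipient} (inter-vproc point-to-point)")
    return hops


def multicast(group: set) -> dict:
    """Broadcast to every node, then deliver only to vprocs in the group."""
    return {node: [v for v in vprocs if v in group]
            for node, vprocs in NODES.items()}


def broadcast() -> dict:
    """Broadcast to every node, then deliver to every vproc on each node."""
    return {node: list(vprocs) for node, vprocs in NODES.items()}


print(point_to_point("amp0", "amp1"))
print(point_to_point("amp0", "amp3"))
print(multicast({"amp1", "amp2"}))
```

Note that multicast still reaches every node; the filtering down to the recipient vprocs happens on each node.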
I/O TYPES
BYNET I/O can be classified into three types:
- Point-to-point;
- Broadcast;
- Merge.
The BYNET I/O statistics can be collected and analyzed in support of performance management.
Point-to-point
The point-to-point I/O comes from the inter-vproc point-to-point messaging. This messaging type involves a sender and a recipient.
Used for:
- Row redistribution between AMPs;
- Communication between PEs and AMPs on a single AMP operation.
Caused by:
- Aggregation
- Create USI
- Create fallback tables
- Create Referential Integrity (RI) relationship
- FastExports
- FastLoad
- INSERT SELECTs
- Joins, including merge joins and exclusion merge joins, some large table/small table joins, and nested joins
- MultiLoads
- Updates to fallback tables
- USI access
Broadcast
The broadcast I/O comes from the inter-vproc broadcast messaging. This messaging type involves multiple recipients - one message to multiple AMPs simultaneously. It has a subtype called "multicast messaging", in which only a subset of all vprocs (i.e., a dynamic group) is passed to the BYNET messaging task. This eliminates or reduces expensive but unneeded all-AMP operations: messages are not sent to the many irrelevant vprocs, so those vprocs are not involved in processing the message.
In practice, to send a message to multiple AMPs, an AMP can send a broadcast message to all nodes. The BYNET software on each recipient node identifies the AMPs on that node that are involved with the message, and only those AMPs receive it. When traffic is high, broadcast messaging can be restricted and traffic limited to point-to-point messaging.
Used for:
- Broadcasting an all-AMP step to all AMPs;
- Multicasting a multi-AMP step to a dynamic group of AMPs;
- Row duplication.
Merge
The merge I/O comes from the BYNET merge.
Used for:
- Returning a single answer set of a single SELECT statement.
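Combining per-AMP results into one ordered answer set can be illustrated with a k-way merge. The sketch assumes that each AMP returns rows that are already sorted locally; that assumption and the sample data are illustrative, and the code uses Python's standard `heapq.merge`, not the BYNET merge itself.

```python
import heapq

# Hypothetical per-AMP result streams, each already sorted by its key.
amp_results = [
    [(1, "a"), (4, "d")],
    [(2, "b"), (5, "e")],
    [(3, "c")],
]

# Merge the sorted streams into a single ordered answer set.
answer_set = list(heapq.merge(*amp_results))
print(answer_set)
```

Because the inputs are already ordered, the merge only compares the current head of each stream and never re-sorts the whole result.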
COMMENT FROM EXPERTS
Teradata | Tech Tips
The BYNET works like a phone switch, quite different from the typical network. Its switched "folded banyan" architecture delivers additional network bandwidth for each node added to the configuration. Each connection to a node delivers 120 MB/sec. A 2-node system has a 240 MB/sec. interconnect; 4 nodes, 480 MB/sec.; 8 nodes, 960 MB/sec.; 100 nodes, 12 GB/sec.
The BYNET can broadcast-deliver a single message to some or all of the nodes in the MPP configuration. There are many database functions that need to be performed on all nodes at once. With broadcast, the database has to send and manage only one message and one response, lowering the cost and increasing the performance.
The BYNET guarantees delivery of every message and ensures that broadcasts get to every target node. So the database isn't plagued by communication errors or network failures and does not have to pay the price of acknowledgements or other error-detection protocols.
The BYNET performs all of these functions using low-level communication protocols. It is a circuit-switched network, so the large messages that the database sends get through quickly.
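The bandwidth figures quoted in the Tech Tips excerpt follow from multiplying the per-node connection rate by the number of nodes. A short Python check, with the hypothetical helper `aggregate_bandwidth_mb` and the 120 MB/sec rate taken from the excerpt:

```python
PER_NODE_MB_PER_SEC = 120  # per-node connection rate quoted in the excerpt


def aggregate_bandwidth_mb(nodes: int) -> int:
    """Total interconnect bandwidth in MB/sec under linear scaling."""
    return nodes * PER_NODE_MB_PER_SEC


for n in (2, 4, 8, 100):
    print(n, "nodes:", aggregate_bandwidth_mb(n), "MB/sec")
```

For 100 nodes this gives 12000 MB/sec, which matches the quoted 12 GB/sec when 1 GB is counted as 1000 MB.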
A minor nit...
"YNET A to connect the IFPs and COPs on one side, and YNET B to connect the AMPs on the other." isn't quite correct.
There was no distinction between YNET A and B in terms of which type of nodes used which YNET to communicate with other types of nodes. Every node (AMP, IFP/COP) had full access to both YNETs and used either to communicate with every other node. If one of the YNETs was offline, all the communications between nodes was of course done on the remaining one.
In practice, there was relatively little communication between IFPs and COPs. I believe most traffic in typical installations was between AMPs doing row redistribution while executing a query plan and the (distant?) second heaviest traffic was between IFP/COPs and AMPs in the form of steps/responses and data to/from clients (merge results, dump/restore and Fast/Multi loads). It would have made little sense to leave one YNET idle while multiple concurrently running queries competed for use of the other for row redistribution.