ODBC AND JDBC

JDBC: Java Database Connectivity (JDBC) provides Java developers with a
standard API (application Programming Interfaces) that is used to access
databases regardless of the driver and database product. To use JDBC you'll
need at least JDK 1.1 a database and a JDBC driver. There are several
characteristics of JDBC:ODBC : ODBC is the acronym for Open DataBase
Connectivity a Microsoft Universal Data Access standard that started life as
the Windows implementation of the X/Open SQL Call Level Interface specification.

ODBC Driver Manager - an application binds to this generic library which is
responsible for loading the requested ODBC Driver.

ODBC Driver - dynamically loaded by the ODBC Driver manager for making
connection to target Database.

Difference b/w them is

1.ODBC is for Microsoft and JDBC is for java applications.

2.ODBC can't be directly used with Java because it uses a C interface.

3.ODBC makes use of pointers which have been removed totally from java.

4.ODBC mixes simple and advanced features together and has complex options for
simple queries But JDBC is designed to keep things simple while allowing
advanced capabilities when required.

5.ODBC requires manual installation of the ODBC driver manager and driver on all
client machines. JDBC drivers are written in java and JDBC code is automatically
installable secure and portable on all platforms.

6. JDBC API is a natural Java Interface and is built on ODBC. JDBC retains some
of the basic feature of ODBC


You can use Squirrel SQL client available here to connect to databases via JDBC, it works with Teradata as well.

SET VS MULTISET

Table Type Specifications of SET VS MULTISET

There are two different table type philosophies so there are two different type tables. They are SET and MULTISET. It has been said, “A man with one watch knows the time, but a man with two watches is never sure”. When Teradata was originally designed it did not allow duplicate rows in a table. If any row in the same table had the same values in every column Teradata would throw one of the rows out. They believed a second row was a mistake. Why would someone need two watches and why would someone need two rows exactly the same? This is SET theory and a SET table kicks out duplicate rows. The ANSI standard believed in a different philosophy. If two rows are entered into a table that are exact duplicates then this is acceptable. If a person wants to wear two watches then they probably have a good reason. This is a MULTISET table and duplicate rows are allowed. If you do not specify SET or MULTISET, one is used as a default. Here is the issue: the default in Teradata mode is SET and the default in ANSI mode is MULTISET.
Therefore, to eliminate confusion it is important to explicitly define which one is desired. Otherwise, you must know in which mode the CREATE TABLE will execute in so that the correct type is used for each table. The implication of using a SET or MULTISET table is discussed further.

SET and MULTISET Tables

A SET table does not allow duplicate rows so Teradata checks to ensure that no two rows in a table are exactly the same. This can be a burden. One way around the duplicate row check is to have a column in the table defined as UNIQUE. This could be a Unique Primary Index (UPI), Unique Secondary Index (USI) or even a column with a UNIQUE or PRIMARY KEY constraint. Since all must be unique, a duplicate row may never exist. Therefore, the check on either the index or constraint eliminates the need for the row to be examined for uniqueness. As a result, inserting new rows can be much faster by eliminating the duplicate row check.
However, if the table is defined with a NUPI and the table uses SET as the table type, now a duplicate row check must be performed. Since SET tables do not allow duplicate rows a check must be performed every time a NUPI DUP (duplicate of an existing row NUPI value) value is inserted or updated in the table. Do not be fooled! A duplicate row check can be a very expensive operation in terms of processing time. This is because every new row inserted must be checked to see if it is a duplicate of any existing row with the same NUPI Row Hash value. The number of checks increases exponentially as each new row is added to the table.
What is the solution? There are two: either make the table a MULTISET table (only if you want duplicate rows to be possible) or define at least one column or composite columns as UNIQUE. If neither is an option then the SET table with no unique columns will work, but inserts and updates will take more time because of the mandatory duplicate row check.
Below is an example of creating a SET table:
image\img00604.gif
Notice the UNIQUE PRIMARY INDEX on the column emp. Because this is a SET table it is much more efficient to have at least one unique key so the duplicate row check is eliminated.
The following is an example of creating the same table as before, but this time as a MULTISET table:

Notice also that the PI is now a NUPI because it does not use the word UNIQUE. This is important! As mentioned previously, if the UPI is requested, no duplicate rows can be inserted. Therefore, it acts more like a SET table. This MULTISET example allows duplicate rows. Inserts will take longer because of the mandatory duplicate row check.

Teradata Sql Assistant (TSA)


Definition:
Teradata SQL Assistant (TSA), as part of Teradata Tools and Utilities (TTU), is an ODBC-based client utility used to access and manipulate data on ODBC-compliant database servers. It has two editions:
1) Teradata SQL Assistant for Microsoft Windows
2) Teradata SQL Assistant/Web Edition

Teradata SQL Assistant is an information discovery tool designed for Windows XP and Windows 2000. Teradata SQL Assistant retrieves data from any ODBC-compliant database server. The data can then be manipulated and stored on the desktop PC.

Teradata SQL Assistant/Web Edition is a web-based query tool that allows you to compose queries, submit them to the Teradata Database, and then view the results in a web browser.

Overview:
Teradata SQL Assistant for Microsoft Windows, originally called "Queryman" (before V. 6.0)or "QueryMan" (V. 6.0 and up), is also known as "SQLA" among programmers. It supports import / export tasks, but not the serious ones. With its user-friendly GUI, TSA for Windows is generally oriented to business users and casual data consumers. (Refer to BTEQ for comparison.) TSA / Web Edition has fewer functions than TSA for Windows. The key features of each are presented below.

The key features of Teradata SQL Assistant are:
    * Send queries to any ODBC database or the same query to many different databases;
    * Export data from the database to a file on a PC;
    * Create reports from any RDBMS that provides an ODBC interface;
    * Import data from a PC file directly to the database;
    * Use an import file to create many similar reports (query results or Answersets). For example, display the DDL (SQL) that was used to create a list of tables;
    * Create a historical record of the submitted SQL with timings and status information such as success or failure;
    * Use SQL syntax examples to tailor statements;
    * Use the Database Explorer Tree to easily view database objects;
    * Use a procedure builder that provides a list of valid statements for building the logic of a stored procedure;
    * Limit data returned to prevent runaway queries.

The key features of SQL Assistant / Web Edition are:
    * Send queries to any Teradata Database;
    * View and sort the results by column and save them to a file;
    * Save queries to a file so you can run them automatically from the same file in the future;
    * Load queries from a file and make changes before submitting them to the database;
    * View historical records of the queries you have submitted with timings and status information. You can sort the history rows by column, update them with your notes, and save the records to a file;
    * View a hierarchical listing of databases with object names to reduce the time required to create SQL statements;
    * Use the browser’s built-in find features to do a string search on a query, an answer set, or any information in the historical table visible in the browser window;
    * Abort a query in progress if you change your mind about running a query or if a query is taking longer than you expected;
    * Client browser’s connection ends after 30 minutes of inactivity.

PE - Parsing Engine


PE, acronym for "Parsing Engine," is the type of vproc (Virtual Processor) for session control, task dispatching and SQL parsing in the multi-tasking and possibly parallel-processing environment of the Teradata Database.


DEFINITION


PE, acronym for "Parsing Engine," is the type of vproc (Virtual Processor) for session control, task dispatching and SQL parsing in the multi-tasking and possibly parallel-processing environment of the Teradata Database. Vproc is a software process running in an SMP (Symmetric Multiprocessing) environment or a node.


COMPONENTS

The components of a PE vproc can be classified as the following:

   1. Parser: It desolves SQL statements into RDBMS processing steps;
   2. Query Optimizer: It decides the optimal way to access data;
   3. Step Generator: It 1) produces processing steps, and 2) encapsulates them into packages;
   4. Dispatcher: It 1) transmits the encapsulated steps from the parser to the pertinent AMPs, and 2) performs monitoring and error-handling functionalities during step processing;
   5. Session Controller: It 1) manipulates session-control activities (e.g., logon, authentication, and logoff), and 2) restores sessions after client or server failures.


OVERVIEW

PE is an instance of the database management software responsible for the communication between the requesting client and the relevant AMPs, usually via the BYNET. Each PE runs independently to handle sessions, parse SQL statements into processing steps with optimization, dispatching the steps to the relevant AMPs and sends the processing results back to the requesting client.

The PE vproc was invented in Teradata V2 to replace the following dedicated physical processors that performed the similar functions on the DBC 1012 systems:

    * The InterFace Processor(IFP) - The IFP was responsible for the communication between the DBC and the HOST. Its components included parser, dispatcher, session controller, client interface and YNET interface.
    * The COmmunication Processor (COP) - The COP was similar in function as IFP, but responsible for communication with network-attached hosts (DOS-PC/UNIX).

AMP


DEFINITION


AMP, acronym for "Access Module Processor," is the type of vproc (Virtual Processor) used to manage the database, handle file tasks and and manipulate the disk subsystem in the multi-tasking and possibly parallel-processing environment of the Teradata Database.


OVERVIEW


In reality, each AMP is an instance of the database management software responsible for accessing and manipulating data. As such, every AMP is allowed a part of the database to manage, and a part of the physical disk space to keep its set of database tables. Usually, the AMP obtains its portion of the disk space by being associated with a virtual disk (vdisk). It handles its disk reading / writing by using its file system software, which converts AMP steps (i.e., the steps from PEs) into physical data block requests. The AMPs are responsible to access and manipulate the data so as to complete the request processing. There may be mutiple AMPs on one node, and the communication among the AMPs is controlled by the BYNET.

The AMP vproc was invented with Teradata V2 to replace its dedicated physical predecessor on the DBC 1012 systems. In Teradata V1, the Access Module Processor (AMP) was the physical processing unit for all the Teradata database functions. Each AMP then contained its own microprocessor, disk drive, file system, database software (Database Manager), Teradata Operating System (TOS), and YNET interface. In that sense, each AMP was a node.

In Teradata V2, AMPs become software entities, and thus more flexible units that "deliver basic query parallelism to all work in the system." The number of AMPs (2 - 20 per node) is defined for the Teradata database before the database is loaded. The system partitions database tables across all the defined AMPs via hash functions to enable subquery-level parallel processing. In practice, all the database operations are run in parallel across all the AMPs, where all the related data rows are being processed simultaneously but independently.


FUNCTIONS

The functions of AMP can be classified as the following:

   1. BYNET interface, or Boardless BYNET interface;
   2. Database management:
         1. Locking;
         2. Joining;
         3. Sorting;
         4. Aggregation;
         5. Output data conversion;
         6. Disk space management;
         7. Accounting;
         8. Journaling;
   3. File-subsystem management;
   4. Disk-subsystem management.


SIZE LIMITS

 AMP SIZE LIMITS FOR TERADATA DATABASE

TERADATA DATABASE RELEASE     MAX CYLINDERS     MAX (BASE 2)     MAX (BASE 10)
V2R6.2.0.0 - UP                                 700,000                 1.26 TB     1.39 TB
V2R5.0.0.0 - V2R6.1.x.x                     600,000                 1.08 TB     1.19 TB
V2R4.0.1.0 - V2R4.1.3.x                     700,000                 1.26 TB     1.39 TB

Vprocs


DEFINITION

Vproc, acronym for "Virtual PROCessor," is a self-contained instance of the processor software on a Teradata platform (SMP platform or a node). Vprocs are the basic units of parallelism that make up the Teradata database server.


OVERVIEW


To put it simply, a virtual processor is a simulated processor in a processing software system, or a software version of a dedicated physical processor. Each vproc uses a portion of the physical processor resource, and runs independently of other vprocs.  The vprocs co-existing within a node share a single memory pool - a collection of free memory in the node. The portion of memory allocated from the pool to one vproc will be returned to the pool after usage for use by other vprocs.

Vprocs are the basic units of parallelism that make up the Teradata database server. A single SMP node is made up of multiple vprocs, and a single Teradata MPP system is made up of multiple SMP nodes.  Vprocs run as multi-threaded processes to enable Teradata's parallel execution of tasks without using specialized physical processors.  They are labeled as "Virtual Processors" because they perform the similar processing functions as physical processors (CPU).

The concept of virtual processors is often utilized in the multithreading technology for parallel processing. Some examples are available from IBM Informix, HP POSIX and VM Host (HP-UX 11i) and SUN Solaris OS, though the concept is used differently in each case.

The concept of vprocs was introduced to Teradata to eliminate the need for proprietary technology in the form of physical processors. With Teradata V1 - the version used on DBC 1012, there were three types of physical processors:

    * The InterFace Processor(IFP) was responsible for the communication between the DBC and the HOST. Its components included parser, dispatcher, session controller, client interface and YNET interface.
    * The COmmunication Processor (COP) was the type of IFP that contained a gateway process for communication with hosts (DOS-PC/UNIX) via a network.     * The Access Module Processor (AMP) was the physical processing unit for all the Teradata database functions. Each AMP contained its own microprocessor, disk drive, file system, database software (Database Manager), Teradata Operating System (TOS), and YNET interface. In some sense, an AMP was a node.

The YNET Interconnection Network linked all the IFPs, COPs, and AMPs together with circuit boards and cables.

In Teradata V2, the above-mentioned physical processors (IFP, COP, and AMP) were replaced by Virtual PROCessors (vprocs), which were invented as an abstraction layer between the work-unit multithreading and the physical parallel processing system. [6] Logically, each vproc is a seperate AMP or PE instance within an SMP node. Physically, a vproc is a directly addressable collection of processes that allow data correction and fault tolerance. Teradata allocates its data rows across all AMP vprocs via hash partitioning. PE vprocs manage sessions, break work units down to multiple steps, make query plans and distribute the steps to the relevant AMP vprocs. The AMPs process the steps in parallel. Differently from Teradata V1, there may be multiple AMPs in a node.

To sum up, as Carrie Ballinger and Ron Fryer observe, "the VPROC is the fundamental unit of apportionment, and delivers basic query parallelism to all work in the system."


 
TYPES

There are two types of vprocs on the Teradata platform: AMP(Access Module Processor) and PE(Parsing Engine).

    * AMPs are the type of vprocs that contain a BYNET interface and perform database, file and disk functions.
    * PEs are the type of vprocs that handle session control, SQL parsing/optimizing, step generating and dispatching.



RELATED UTILITIES

Vproc Manager (vprocmanager)

Vproc Manager(vprocmanager) is used for the following administrative functions:
    * To check the status of the specific vproc or vprocs;
    * To alter the state of a vproc or the states for a series of vprocs;
    * To initialize and start the specific vproc - mostly an AMP vproc;
    * To initialize the file subsystem on the vdisk associated with a certain AMP vproc;
    * To force a manual database restart.

BYNET

DEFINITION
BYNET, acronym for "BanYan NETwork," is a folded banyan switching network built upon the capacity of the YNET. It acts as a distributed multi-fabric inter-connect to link PEs, AMPs and nodes on a Massively Parallel Processing (MPP) system.


OVERVIEW

Interconnect technology is important for parallel computing. The BYNET is Teradata's "system interconnect for high-speed, fault tolerant warehouse-optimized messaging between nodes."[11] As an indispensable part of the Teradata MPP system, it can be understood better with its predecessor "YNET" in the background.
In 1982, the YNET interconnecting technology used on the DBC 1012 was patented for the parallelism. As a broadcast-based hardware solution, it linked all the IFPs, COPs, and AMPs together with circuit boards and cables in a dual bus architecture. Two costom-built busses operated concurrently within the interconnect framework: YNET A to connect the IFPs and COPs on one side, and YNET B to connect the AMPs on the other. The YNET was featured by 1) its 6MB/sec bandwidth, 2) hardware-based sorting mechanism, 3) binary tree structure with dual channels, and 4) global synchronization. Teradata V1 enabled its parallel processing by balancing the workload across all AMPs via the YNET.
The YNET had its own weaknesses. The assumption underlying the YNET design was that messages would be broadcasted to the processors that owned the data, but in fact, many messages turned out to be point-to-point. Furthermore, When the AMPs have data to return, the YNET actually moved the data together from the relevant AMPs. This overhead could be expensive - e.g., extra message traffic, task initiations and disk accesses. Although the YNET allowed hundreds of processors to share the same bandwidth, its bandwidth was not scalable to the maximum configuration of 1024 processors.
The BYNET interconnect was designed to address the YNET's weaknesses, especially that of scalability limitation. The BYNET handles the inter-vproc messaging via shared memory. Unlike the YNET actually transporting the data for a join, it changes the AMP-based object ownership of the memory location to that of the destination AMP. By minimizing the data traffic in this way, it preserves the interconnect bandwidth effectively. It is noteworthy that the BYNET is linearly scalable to the system configuration of 4096 processor modules.[8]
With the BYNET, the YNET hardware-based sorting mechanism has been replaced with a software-based one. The software-based sorting offers more flexibility* in selecting the sorting key and the sorting order. This redesign does not affect the application performance due to the reasons listed below:
  1. The hardware develops rapidly: the sorting function can be left to larger memories and faster processors;*
  2. Each BYNET interface controller is equipped with a dedicated SPARC processor.
Therefore, a simple high-speed interconnect subsystem with enhanced scalability has become more significant.
NOTICE
  • More flexibility: The YNET hardware-based sorting mechanism limited the sorting key to 512 bytes.[8]
  • Larger memories and faster processors: In 1980s, the Intel 8086 processors were quite slow, and the memories were not large enough. Hence it was meaningful to distribute processing between processors and the interconnect component. However, such clever but complex distributed-processing designs seem unnecessary in an era of ever faster and cheaper CPUs and memories.

 

FUNCTIONS

Physical Level

  • Linking all the vprocs in an SMP node;
  • Linking all the nodes in a multinode system.

Signaling / Messaging Level

  • Carrying bi-directional* point-to-point, multicast and broadcast messages among AMPs and PEs;
  • Carrying bi-directional point-to-point and broadcast messages among the nodes;

Application Level

  • Merging answer sets back to PEs;
  • Transporting data.
NOTICE
  • Bi-directional signaling / messaging: The BYNET transmission channel comprises two concurrent subchannels: a high capacity forward channel for executing the main BYNET activities and a low capacity back-channel for monitoring and signaling the status of those activities. The BYNET-driven semaphores (semaphore count or semaphore flags) are passed through the back-channel to signal the BYNET activity status (progress, failure or success). The bi-directional signaling aims to "minimize message flow within the database software by offering simpler alternatives to intense message-passing when parallel units require coordination." [10]


KEY FEATURES

  • Linear Scalability: The BYNET's folded banyan architecture was designed to address the linear performance scalability limitation of its predecessor - the YNET. It allows the overall network bandwidth to scale linearly with each node added to the configuration. Hence performance penalty for system expansion can be avoided.
  • Fault Tolerance: Firstly, the BYNET Banyan topology delivers multiple connection paths for each network. A Teradata MPP system is typically equipped with two BYNET networks (BYNET 0 and BYNET 1). Secondly, the BYNET can automatically detect faults and then reconfigure the network. If an unworkable connection path is detected in a certain network, that particular network will be automatically reconfigured so that the unworkable path will be avoided by all tasks. Furthermore, if that particular "BYNET 0" fails to be reconfigured - an unusual case, the hardware will be disabled on BYNET 0, and tasks will be re-routed around the failed components to BYNET 1.
  • Enhanced Performance: By default, a Teradata MPP system is equipped with two BYNET networks. Since both BYNET networks in a system are active, the system performance can be enhanced by using the combined bandwidth of the two networks.
  • Load Balancing: Workload or data traffic is distributed automatically and dynamically between the two BYNET networks.


HARDWARE AND SOFTWARE

The BYNET is the combination of hardware and software that enables the high speed communication inside and between the nodes.

Hardware

The BYNET hardware is used to connect nodes on the MPP system, including the following components:
  • BYNET switches (e.g., 8-port BYNET Ethernet switch, BYA4M switch board, BYA32 switch chassis and BYA64/BYC64 switch cabinet);
  • BYNET interface adapter boards (e.g., Node BIC Adapter like PCIe BIC2E);
  • BYNET network cables (e.g., BYNET Ethernet Switch-to-Node Cables, BYA32-to-Node Cables and BYA64-to-BYC64 Cables).

Software

The BYNET software consists of the following:
  • The BYNET driver: It is installed on all the nodes of the MPP system, and used as an interface between the BYNET hardware and the PDE software.
  • Boardless BYNET: It is installed on the SMP system or the single node platform to emulate the activity of the BYNET hardware. SMP systems do not use the BYNET hardware, and that explains why the software is named "Boardless BYNET". 


MESSAGING TYPES

Among Nodes

The following types of messaging are carried out among nodes via BYNET hardware and software (i.e., the BYNET driver):
  • Broadcast messaging from one node to all nodes;
  • Point-to-point messaging between two nodes.



Among Vprocs

The following types of messaging are carried out among vprocs via the PDE (Parallel Database Extensions) and BYNET software (i.e., the BYNET driver on the MPP system or the Boardless BYNET on the SMP system):
  • Point-to-point messaging:


    1. Point-to-Point messaging among vprocs from the same node via the PDE and the BYNET software;
    2. Point-to-Point messaging among vprocs from different nodes: Firstly, the message is sent to the recipient node using the inter-node point-to-point messaging via the BYNET hardware and software; then the message is delivered to the recipient vproc using the inter-vproc point-to-point messaging via the PDE and the BYNET software.
  • Multicast messaging: Firstly, the message is sent to all nodes using the inter-node broadcast messaging via the BYNET hardware and software; then the PDE and the BYNET software identify the recipient vprocs and deliver the message to them.
  • Broadcast messaging: Firstly, the message is sent to all nodes using the inter-node broadcast messaging via the BYNET hardware and software; then the message is delivered to the recipient vprocs using the inter-vproc broadcast messaging via the PDE and the BYNET software.


I/O TYPES 

BYNET I/O can be classified into three types:
  1. Point-to-point;
  2. Broadcast;
  3. Merge.
The BYNET I/O statistics can be collected and analyzed in support of performance management.

Point-to-point

The point-to-point I/O comes from the inter-vproc point-to-point messaging. This messaging type involves a sender and a recipient.
Used for:
  • Row redistribution between AMPs;
  • Communication between PEs and AMPs on a single AMP operation.
Caused by:
  • Aggregation
  • Create USI
  • Create fallback tables
  • Create Referential Integrity (RI) relationship
  • FastExports
  • FastLoad
  • INSERT SELECTs
  • Joins, including merge joins and exclusion merge joins, some large table/small table joins, and nested joins
  • MulitLoads
  • Updates to fallback tables
  • USI access

 

Broadcast

The broadcast I/O comes from the inter-vproc broadcast messaging. This messaging type involves multiple recipients - one message to multiple AMPs simultaneously. It has a subtype called "Multicast messaging", where only a subset of all vprocs (i.e., Dynamic Group) are passed to the BYNET messaging task. This is to eliminate or reduce the expensive but unneeded all-AMP operation: point-to-point messages will not be sent to many irrelevant vprocs, and thus those vprocs will not be involved in processing the message.
In practice, to send a message to multiple AMPs, an AMP can send a broadcast message to all nodes. The BYNET software on the recipient node identifies the AMPs on the node that are involved with the message; and only those involved AMPs can receive the message. Broadcasting messaging can be restricted when traffic is high and limited to point-to-point messaging.
Used for:
  • Broadcasting an all-AMP step to all AMPs;
  • Multicasting a multi-AMP step to a dynamic group of AMPs;
  • Row duplication.

Merge

The merge I/O comes from the BYNET merge.
Used for:
  • Returning a single answer set of a single SELECT statement.


COMMENT FROM EXPERTS

Teradata | Tech Tips 

The BYNET works like a phone switch, quite different from the typical network. Its switched "folded banyan" architecture delivers additional network bandwidth for each node added to the configuration. Each connection to a node delivers 120 MB/sec. A 2-node system has a 240 MB/sec. interconnect; 4 nodes, 480 MB/sec.; 8 nodes, 960 MB/sec.; 100 nodes, 12 GB/sec.
The BYNET can broadcast-deliver a single message to some or all of the nodes in the MPP configuration. There are many database functions that need to be performed on all nodes at once. With broadcast, the database has to send and manage only one message and one response, lowering the cost and increasing the performance.
The BYNET guarantees delivery of every message and ensures that broadcasts get to every target node. So the database isn't plagued by communication errors or network failures and does not have to pay the price of acknowledgements or other error-detection protocols.
The BYNET performs all of these functions using low-level communication protocols. It is a circuit-switched network, so the large messages that the database sends get through quickly.

Teradata Architecture


Symmetric multiprocessing (SMP) - A single node that contains multiple CPUs sharing a memory pool.

Massively parallel processing
(MPP) - Multiple SMP nodes working together comprise a larger configuration. The nodes are connected using the BYNET, which allows multiple virtual processors on multiple system nodes to communicate with each other.

Shared Nothing Architecture(MPP) - means that each vproc(Access Module Processors and Parsing Engines are Virtual processors) is responsible for its own portion of the database and do not share common components.each AMP manages its own dedicated memory space and the data on its own vdisk -- these are not shared with other AMPs. Each AMP uses system resources independently of the other AMPs so they can all work in parallel for high system performance overall

A node is made up of various hardware and softwares

A clique is a set of Teradata nodes that share a common set of disk arrays. Cabling a subset of nodes to the same disk arrays creates a clique.

A disk array is a configuration of disk drives that utilizes specialized controllers to manage and distribute data and parity acroos the disks while providing fast access and data integrity

RAID 5 Data and parity protection striped across multiple disks

RAID 1 Each disk has a physical mirror replicating the data


Teradata Storage Process
  • The Parsing Engine interprets the SQL command and converts the data record from the host into an AMP message
  • The BYNET distributes the row to the appropriate AMP
  • The AMP formats the row and writes it to its associated disks
  • The disk holds the row for subsequent access

Teradata Retrieval Process
  • The Parsing Engine dispatches a request to retrieve one or more rows
  • The BYNET ensures that appropriate AMP(s) are activated
  • The AMPs locate and retrieve desired rows in parallel access and will sort, aggregate or format if needed
  • The BYNET returns retrieved rows to parsing engine
  • The Parsing Engine returns row(s) to requesting client application

The BYNET is responsible for
  • Point-to-point communications between nodes and virtual processors
  • Merging answer sets back to the PE
  • making Teradata parallelism possible

The Parsing Engine is responsible for
  • Managing Individual sessions (up to 120)
  • Parsing and optimizing SQL requests
  • Dispatching the optimized plan to the AMPs
  • Sending the answer set response back to the requesting client

The AMP is responsible for
  • Storing and retrieving rows to and from the disks
  • Lock Management
  • Sorting rows and aggregating columns
  • Join Processing
  • Output conversions and formatting
  • Creating answer sets for clients
  • Disk space management and accounting

Datawarehousing Concepts


Data warehousing :

Data warehousing is combining data from multiple and usually varied sources into one comprehensive and easily manipulated database. Common accessing systems of data warehousing include queries, analysis and reporting. Because data warehousing creates one database in the end, the number of sources can be anything you want it to be, provided that the system can handle the volume, of course. The final result, however, is homogeneous data, which can be more easily manipulated.

Data warehousing is comprised of two primary tools: databases and hardware. In a data warehouse, there are multiple databases and data tables used to store information. These tables are related to each other through the use of common information or keys. The size of a data warehouse is limited by the storage capacity of the hardware.

The hardware required for a data warehouse includes a server, hard drives, and processors. In most organizations, the data warehouse is accessible via the shared network or Intranet. A data architect usually is responsible for setting up the database structure and managing the process for the updating of data from the original sources.                                                                                                                       


Data warehouse:

A data warehouse is a repository of an organization's electronically stored data. It is designed to facilitate reporting and analysis.
  • The purpose of data warehouse is to store data consistently across the organization and to make the organizational information accessible.
  • It is adaptive and resilient source of information. When new data is added to the Data Warehouse, the existing data and technologies are not disrupted. The design of separate data marts that make up the data warehouse must be distributed and incremental. Anything else is a compromise.
  • The data warehouse not only controls the access to the data, but gives its owners great visibility into the uses and abuses of the data, even after it has left the data warehouse.
  • Data warehouse is the foundation for decision-making.;


Difference Between Data Warehousing And Business Intelligence:

Data warehousing and business intelligence are two terms that are a common source of confusion, both inside and outside of the information technology (IT) industry. 
Data warehousing refers to the technology used to actually create a repository of data. Business intelligence refers to the tools and applications used in the analysis and interpretation of data. 
 
Data warehousing and business intelligence have grown substantially and are forecast to experience continued growth into the future.     



Different Types of Data Warehouse Design: 

There are two main types of data warehouse design: top-down and bottom-up. The two designs have their own advantages and disadvantages.
Bottom-up is easier and cheaper to implement, but it is less complete, and data correlations are more sporadic.
In a top-down design, connections between data are obvious and well-established, but the data may be out of date, and the system is costly to implement.

Data marts are the central figure in data warehouse design. A data mart is a collection of data based around a single concept. Each data mart is a unique and complete subset of data. Each of these collections is completely correlated internally and often has connections to external data marts.

The way data marts are handled is the main difference between the two styles of data warehouse design. In the top-down design, data marts occur naturally as data is put into the system. In the bottom-up design, data marts are made directly and connected together to form the warehouse. While this may seem like a minor difference, it makes for a very different design.

The top-down method was the original data warehouse design. Using this method, all of the information the organization holds is put into the system. Each broad subject will have its own general area within the databases. As the data is used, connections will appear between correlative data points, and data marts will appear. In addition, any data in the system stays there forever—even if the data is superseded or trivialized by later information, it will stay in the system as a record of past events.

The bottom-up method of data warehouse design works from the opposite direction. A company puts in information as a standalone data mart. As time goes on, other data sets are added to the system, either as their own data mart or as part of one that already exists. When two data marts are considered connected enough, they merge together into a single unit.

The two data warehouse designs each have their own strong and weak points. The top-down method is a huge project for even smaller data sets. Since big projects are also more costly, it is the most expensive in terms of money and manpower. If the data warehouse is finished and maintained, it is a vast collection, containing everything that the company knows.

The bottom-up process is much faster and cheaper, but since the data is entered as needed, the database will never actually be complete. In addition, correlations between data marts are only as strong as their usage makes them. If a strong correlation exists, but no users see it, it goes unconnected.       


Data Warehouse architecture:

 







 

 

 

 


Source Systems/Data Sources
Typically in any organization the data is stored in various databases, usually divided up by the systems. There may be data for marketing, sales, payroll, engineering, etc. These systems might be legacy/mainframe systems,Flat files or relational database systems.


 Staging Area
The data coming from various source systems is first kept in a staging area. The staging area is used to clean, transform, combine, de-duplicate, household, archive, and to prepare source data for use in data warehouse. The data coming from source system is kept as it is in this area. This need not be based on relational terminology. Sometimes managers of the data are comfortable with normalized set of data. In these cases, normalized structure of the data staging storage is certainly acceptable. Also, staging area doesnt provide querying/presentation services. 


 Ware House
Once the data is in staging area, it is cleansed, transformed and then sent to Data warehouse. You may or may not have ODS before transferring data to Data Warehouse.
 


OLAP
The data in Data Warehouse has to be easily manipulated in order to answer the business questions from management and other users. This is accomplished by connecting the data to fast and easy-to-use tools known as Online Analytical Processing (OLAP) tools. OLAP tools can be thought of as super high-speed forklifts that have knowledge of the warehouse and the operators built into them in order to allow ordinary people off the street to jump in and quickly find products by asking English-like questions. Within the OLAP server, data is reorganized to meet the reporting and analysis requirements of the business, including:
    * Exception reporting
    * Ad-hoc analysis
    * Actual vs. budget reporting
    * Data mining (looking for trends or anomalies in the data)
In order to process business queries at high speed, answers to common questions are preprocessed in some OLAP servers, resulting in exceptional query responses at the cost of having an OLAP database that may be several times bigger than the data warehouse itself.