Oracle Exadata and ZFS FAQs

In this blog I have chosen a topic on which I had trouble finding answers. I will try to address the FAQs on Exadata and ZFS; I hope this becomes a starting point for everyone new to these technologies.

                                              Exadata

Exadata Machine:

ð An Exadata Machine is hardware and software engineered together to provide optimized functionality for enterprise-class databases and their associated workloads. Its architecture features intelligent storage, which can process offloaded requests and return only the piece of information the client asked for; a flash layer, which can be configured in two modes, 1) Flash Write Through and 2) Flash Write Back, to provide optimal performance for the workload; and a high-speed InfiniBand switch, which binds the database servers and storage servers into a scalable, high-performance cluster.

IDB Protocol:

  ð  iDB stands for Intelligent Database protocol. Oracle Exadata uses iDB to transfer data between the database nodes (ASM instances) and the storage cell nodes. It provides both simple I/O functionality, such as block-oriented reads and writes, and advanced I/O functionality, including offloading and I/O resource management. It is implemented in the database kernel and works on a function-shipping architecture that maps database operations to Exadata cell storage operations.

  ð  IDB implements a function shipping architecture in addition to the traditional data block shipping provided by the database.

  ð  IDB protocol provides interconnection bandwidth aggregation, Redundancy and failover.

  ð  IDB is built on the industry standard Reliable Datagram Sockets (RDSv3) protocol and runs over InfiniBand ZDP (Zero-loss Zero-copy Datagram Protocol), a zero-copy implementation of RDS.

What is interconnection bandwidth aggregation?
  ð  Bandwidth aggregation refers to the bundling of multiple physical interfaces to create a single virtual interface that supports the combined bandwidth of each physical interface. Not only does this offer a performance increase over using just a single physical interface to connect to a device, but it also offers redundancy because a physical interface failure does not take down the virtual interface as other physical interfaces are still operational. 
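The idea above can be sketched in a few lines of Python (a conceptual model only, not Exadata or bonding-driver code; class names are illustrative):

```python
# Conceptual sketch of bandwidth aggregation: a virtual interface bonds
# several physical links; its usable bandwidth is the sum of the links
# that are still up, so one link failure only degrades throughput
# instead of taking the interface down.

class PhysicalLink:
    def __init__(self, gbps):
        self.gbps = gbps
        self.up = True

class BondedInterface:
    def __init__(self, links):
        self.links = links

    def bandwidth(self):
        # Only healthy links contribute to the aggregate.
        return sum(l.gbps for l in self.links if l.up)

links = [PhysicalLink(40), PhysicalLink(40)]   # e.g. two 40 Gb/s InfiniBand ports
bond = BondedInterface(links)
print(bond.bandwidth())    # 80 while both links are up

links[0].up = False        # a port failure degrades, but does not kill, the bond
print(bond.bandwidth())    # 40
```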

ZDP:

ð ZDP stands for Zero-loss Zero-copy Datagram Protocol; its objective is to eliminate unnecessary copying of blocks. iDB runs over ZDP on the InfiniBand network.

*** Before we continue the FAQ’s I would like to elaborate more on Storage Server Architecture and Flow. ***

 EXADATA STORAGE SERVER ARCHITECTURE

  ð  Like any traditional storage server, an Exadata Storage Cell uses commodity hardware: CPUs, memory, network interface controllers (NICs), storage disks, and so on. The secret sauce lies in the software provided by Oracle: each Exadata Storage Cell comes preloaded with the Oracle Enterprise Linux OS and the intelligent Exadata Storage Server Software.

  ð  An ASM instance running on the Compute node communicates with a storage server through an InfiniBand network connection using the special Intelligent Database (iDB) protocol.
                                                      
          
  ð  This figure depicts a typical Quarter Rack Exadata architecture, with two Compute Nodes and three Storage Cells, and shows how the communication between a Compute Node and the Storage Cells is established. The Quarter Rack comes with two InfiniBand network switches, known as leaf switches, configured between the cells and the Compute Nodes to provide a communication path that tolerates any single switch failure.
*********
Now that we have some background on what an Exadata Quarter Rack looks like, let's continue with some more FAQs.
*********


Compute Nodes in Exadata:
  ð  Compute Nodes are nothing but Database Servers.

Cell Server in Exadata:
  ð  The Cell Server is the storage node. Its core functionality is delivered by three key components that run in the background: CELLSRV, MS, and RS.

Cellsrv:
  ð  Cell Server (CELLSRV) is a multithreaded process. It consumes the most CPU cycles on the cell and uses the special iDB protocol over InfiniBand for communication between ASM instances and the Storage Cells. It is the primary component running on the cell and is responsible for Exadata's advanced capabilities, such as SQL offloading (Smart Scans), I/O prioritization via IORM, cell-to-cell data transfer, and so on.

Management Server:
 ð  Management Server (MS) provides standard cell configuration and management functionality in coordination with CellCLI.

 ð  It performs the following additional tasks:
o   Tracks hardware-level changes on the cell server and notifies CELLSRV through an ioctl system call.

o   Collects, computes, and manages storage server metrics.

o   Rebuilds the virtual drives when a disk is replaced. Typically, when a disk performs poorly, the associated grid disk and cell disk will be taken offline, and MS service will notify the CELLSRV service.

o   MS also triggers the following automated tasks every hour:
§   Deletes files older than seven days from the ADR directory, $LOG_HOME, and all metric history.
§   Performs alert log auto-maintenance whenever the file reaches 10 MB in size, and deletes previous copies of the alert log once they are seven days old.
§   Notifies when file utilization reaches 80%.
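The hourly housekeeping rules above can be sketched as follows (illustrative Python; the thresholds come from the list, while the file names and function names are hypothetical, not Oracle's implementation):

```python
# Toy model of the MS hourly maintenance policy described above.
import time

SEVEN_DAYS = 7 * 24 * 3600
TEN_MB = 10 * 1024 * 1024

def files_to_delete(files, now):
    """files: list of (name, mtime). Return names older than seven days."""
    return [name for name, mtime in files if now - mtime > SEVEN_DAYS]

def needs_rotation(alert_log_size):
    """Alert log auto-maintenance kicks in at 10 MB."""
    return alert_log_size >= TEN_MB

def utilization_alert(used, capacity):
    """Notify once file utilization reaches 80%."""
    return used / capacity >= 0.80

now = time.time()
old = [("trace_1.trc", now - 8 * 24 * 3600)]    # 8 days old -> purged
fresh = [("trace_2.trc", now - 3600)]           # 1 hour old -> kept
print(files_to_delete(old + fresh, now))  # ['trace_1.trc']
print(needs_rotation(11 * 1024 * 1024))   # True
print(utilization_alert(82, 100))         # True
```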
Restart Server?
  ð  Restart Server (RS) monitors other services on the cell server and restarts them automatically in case any service needs to be restarted. It also handles planned service restarts as part of any software updates.

  ð  The cellrssrm is the main RS process and spans three child processes:
o    cellrsomt
o    cellrsbmt
o    cellrsmmt
What is Offloading?
ð  Offloading is a generic term for pushing requested work down to the storage layer instead of performing it on the database server. In Exadata, some of the major offload optimizations are:

o    Traditional Scan vs. Smart Scan

Traditional Scan

§   With traditional storage, all database intelligence resides in database hosts.
§   Database CPU is used to evaluate the predicates and perform further processing, such as join operations.
§   A large percentage of the retrieved data may not be of interest, since only a subset is actually required.
Smart Scan
§   Only relevant columns are returned to the hosts (Column Projection).
§   Unlike a traditional scan, the CPU consumed by predicate evaluation is offloaded to the storage cells.
§   Predicate filtering also avoids a large percentage of the I/O incurred by a traditional scan.
§   Other features of Smart Scans are mentioned in Smart Scan Optimization detail below.
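The contrast above can be sketched in a toy Python model (conceptual only; no real iDB or CELLSRV involved, and the table data is invented):

```python
# Toy simulation: traditional scan ships every row and column to the
# database host; smart scan applies the predicate (predicate filtering)
# and returns only the requested columns (column projection) at the cell.
rows = [
    {"id": 1, "amount": 500, "region": "EU", "notes": "x" * 100},
    {"id": 2, "amount": 50,  "region": "US", "notes": "y" * 100},
    {"id": 3, "amount": 900, "region": "EU", "notes": "z" * 100},
]

def traditional_scan(table):
    # Storage returns everything; the host filters and projects itself.
    return table

def smart_scan(table, predicate, columns):
    # The cell filters rows and projects columns before shipping data.
    return [{c: r[c] for c in columns} for r in table if predicate(r)]

full = traditional_scan(rows)
projected = smart_scan(rows, lambda r: r["amount"] > 100, ["id", "amount"])
print(len(full), len(projected))   # 3 full rows vs 2 filtered rows
print(projected)                   # only id and amount come back
```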

o    Smart Scan optimization includes:
§   Column Projection
§   Predicate Filtering
§   Simple Joins
§   Function Offloading
§   Virtual Column Evaluation
§   HCC Decompression
§   Decryption

o    Storage Indexes:

§   Oracle Exadata Storage Servers maintain a storage index which contains a summary of the data distribution on the disk. The storage index is maintained automatically, and is transparent to Oracle Database. It is a collection of in-memory region indexes, and each region index stores summaries for up to eight columns. There is one region index for each 1 MB of disk space. Each region index maintains the minimum and maximum values of the columns of the table. The minimum and maximum values are used to eliminate unnecessary I/O, also known as I/O filtering.

§   The content stored in one region index is independent of the other region indexes. This makes them highly scalable, and avoids latch contention.

§   Queries using the following comparisons are improved by the storage index:
·          Equality (=)
·          Inequality (<, !=, or >)
·          Less than or equal (<=)
·          Greater than or equal (>=)
·          IS NULL
·          IS NOT NULL

§   Storage indexes are built automatically after Oracle Exadata Storage Server Software receives a query with a comparison predicate that is greater than the maximum or less than the minimum value for the column in a region, and that would have benefited if a storage index had been present. The software automatically learns which storage indexes would have benefited a query, and then creates them so that subsequent similar queries benefit. A few of the advantages are:
1)     Elimination of Disk I/O with Storage Index.
2)     Partition Pruning-like Benefits with Storage Index
3)     Improved Join Performance Using Storage Index
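The min/max pruning idea can be sketched as follows (illustrative Python, not cell software; the two regions mirror the figure discussed next):

```python
# Sketch of storage-index I/O elimination: each 1 MB region keeps the
# min/max per column; a region whose [min, max] range cannot satisfy
# the predicate is skipped without any disk I/O.

class RegionIndex:
    def __init__(self, col_min, col_max):
        self.col_min = col_min
        self.col_max = col_max

    def may_contain(self, op, value):
        # Conservative check: True means "must read the region".
        if op == "=":
            return self.col_min <= value <= self.col_max
        if op == ">":
            return self.col_max > value
        if op == "<":
            return self.col_min < value
        return True  # unknown operator: never skip incorrectly

regions = [RegionIndex(1, 5), RegionIndex(3, 8)]
query = ("=", 7)  # e.g. WHERE col = 7
to_read = [i for i, r in enumerate(regions) if r.may_contain(*query)]
print(to_read)  # [1] -> the region with range (1, 5) is skipped entirely
```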
Examples:-
·          The following figure shows a table and its region indexes. The values in the table range from 1 to 8. One region index stores a minimum of 1 and a maximum of 5; the other stores a minimum of 3 and a maximum of 8.


·          Using storage index allows table joins to skip unnecessary I/O operations. For example, the following query would perform an I/O operation and apply a Bloom filter to only the first block of the fact table. The I/O for the second block of the fact table is completely eliminated by storage index as its minimum/maximum range (5,8) is not present in the Bloom filter.

SELECT count(*) FROM fact, dim WHERE fact.m = dim.m AND dim.product = 'Hard drive';

o    Cell to Cell Data Transfer:
§   In earlier releases, Exadata Cells did not directly communicate to each other. Any data movement between the cells was done through the database servers. Data was read from a source cell into database server memory, and then written out to the destination cell. Starting with Oracle Exadata Storage Server Software 12c Release 1 (12.1), database server processes can offload data transfer to cells.

§   A database server instructs the destination cell to read the data directly from the source cell, and write the data to its local storage. This reduces the amount of data transferred across the fabric by half, reducing InfiniBand bandwidth consumption, and memory requirements on the database server.

§   Oracle Automatic Storage Management (Oracle ASM) resynchronization, resilver, and rebalance use this feature to offload data movement. This provides improved bandwidth utilization at the InfiniBand fabric level in Oracle ASM instances. No configuration is needed to utilize this feature. 

o    Incremental Backup Offloading:
§   To optimize the performance of incremental backups, the database can offload block filtering to Oracle Exadata Storage Server. This optimization is only possible when taking backups using Oracle Recovery Manager (RMAN). The offload processing is done transparently without user intervention. During offload processing, Oracle Exadata Storage Server Software filters out the blocks that are not required for the incremental backup in progress. Therefore, only the blocks that are required for the backup are sent to the database, making backups significantly faster.
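The block-filtering idea can be modeled in a few lines (a hypothetical helper, assuming each block carries a last-change SCN; the real filtering happens inside the cell software):

```python
# Conceptual model of incremental-backup offloading: the cell ships
# only blocks changed after the incremental baseline SCN, so the
# database receives just the blocks the backup actually needs.
def filter_blocks_for_incremental(blocks, baseline_scn):
    """blocks: list of (block_id, last_change_scn) tuples."""
    return [bid for bid, scn in blocks if scn > baseline_scn]

blocks = [(1, 900), (2, 1500), (3, 800), (4, 2100)]
print(filter_blocks_for_incremental(blocks, baseline_scn=1000))  # [2, 4]
```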

  ð  To find out which Oracle functions can be offloaded, use the below query:

SQL> SELECT * FROM v$sqlfn_metadata WHERE offloadable = 'YES';
What is Flash Cache?
  ð  Exadata Smart Flash Cache caches frequently accessed data on the flash storage in the cells. It can be configured in two modes:
o   Write through Cache?

§   Write-Through means that write I/O coming from the database layer first goes to the spinning drives, where it is mirrored according to the redundancy of the disk group containing the file being written. Afterwards, the cells may populate the Flash Cache if they decide it will benefit subsequent reads, but no mirroring on flash is required.

§   Write operations are not signalled as complete until the I/O to disk has completed.

§   In the case of a hardware failure, the data is already sufficiently mirrored on the spinning drives.

o   Write Back Cache?
§   Exadata Smart Flash Cache transparently caches frequently accessed data on fast solid-state storage, improving query response times and throughput. Write operations serviced by flash instead of by disk are referred to as "write back flash cache".

§    An active data block can remain in write back flash cache for months or years. All data is copied to disk, as needed.

§   If there is a problem with the flash cache, operations transparently fail over to the mirrored copies on flash; no user intervention is required. The data on flash is mirrored based on allocation units, which means the amount of data re-written after a failure is proportional to the lost cache size, not the disk size.
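The two modes can be contrasted in a small Python sketch (conceptual only; the function names and flow are illustrative, not cell software):

```python
# Write-through: the write is acknowledged only after it hits disk;
# flash is populated afterwards purely as a read cache.
# Write-back:    the write is acknowledged once it is on (mirrored)
# flash; it is destaged to disk asynchronously, later.

def write_through(block, disk, flash):
    disk.append(block)          # must reach the spinning drives first
    flash.append(block)         # cache may be populated for later reads
    return "ack-after-disk"

def write_back(block, disk_queue, flash):
    flash.append(block)         # durable on flash -> write can complete
    disk_queue.append(block)    # destaged to disk in the background
    return "ack-after-flash"

disk, flash, disk_queue = [], [], []
print(write_through("b1", disk, flash))     # ack-after-disk
print(write_back("b2", disk_queue, flash))  # ack-after-flash
print(disk, flash, disk_queue)
```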
Smart Flash Log:
  ð  Oracle Exadata Smart Flash Log only uses Exadata flash storage for temporary storage of redo log data. By default, Oracle Exadata Smart Flash Log uses 32 MB on each flash disk, for a total of 512 MB across each Exadata Cell. It is automatically configured and enabled, no additional configuration is needed. Oracle Exadata Smart Flash Log performs redo writes simultaneously to both flash memory and the disk controller cache, and completes the write when the first of the two completes. This improves the user transaction response time, and increases overall database throughput for I/O intensive workloads.
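The "completes when the first of the two finishes" behaviour amounts to a simple minimum over the two write latencies (a conceptual timing model, not real cell code):

```python
# Smart Flash Log issues the redo write to flash AND the disk
# controller cache simultaneously; the transaction waits only for
# whichever destination finishes first.
def redo_write_latency(flash_ms, disk_ms):
    return min(flash_ms, disk_ms)

# A flash hiccup is hidden by the disk path, and vice versa:
print(redo_write_latency(flash_ms=0.5, disk_ms=8.0))   # 0.5
print(redo_write_latency(flash_ms=20.0, disk_ms=2.0))  # 2.0
```

This is why redo write latency becomes consistently low: an outlier on either device no longer stalls the commit.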

Flash Cache Compression:

  ð  Flash cache compression dynamically increases the logical capacity of the flash cache by transparently compressing user data as it is loaded into the flash cache. This allows much more data to be kept in flash, and decreases the need to access data on disk drives. I/O to data in flash is orders of magnitude faster than I/Os to data on disk. The compression and decompression operations are completely transparent to the application and database, and have no performance overhead.

  ð  Depending on the user data compressibility, Oracle Exadata Storage Server Software dynamically expands the flash cache size up to two times. Compression benefits vary based on the redundancy in the data. Tables and indexes that are uncompressed have the largest space reductions. Tables and indexes that are OLTP compressed have significant space reductions. Tables that use Hybrid Columnar Compression have minimal space reductions. Oracle Advanced Compression Option is required to enable flash cache compression.

  ð  This feature is enabled using the CellCLI ALTER CELL flashCacheCompress=true command.
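As a rough illustration of the capacity effect (using zlib as a stand-in compressor; the cells' actual compression hardware and the 2x cap behaviour are modeled, not reproduced):

```python
# Compressible data expands the *logical* flash capacity: at a 2:1
# ratio the same physical cache holds twice the user data.
import zlib

uncompressed = b"customer_row_padding " * 1000   # highly redundant data
compressed = zlib.compress(uncompressed)
ratio = len(uncompressed) / len(compressed)
print(ratio > 2)   # True: redundant data compresses well

def logical_cache_size(physical_bytes, ratio, cap=2.0):
    # Exadata expands the flash cache size up to two times.
    return physical_bytes * min(ratio, cap)

print(logical_cache_size(100, ratio=3.0))  # 200.0 (capped at 2x)
print(logical_cache_size(100, ratio=1.5))  # 150.0
```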

Smart Block Transfer:

  ð  Minimum software required: 12.1.0.2 BP12. Many OLTP workloads can have hot blocks that need to be updated frequently across multiple nodes in Oracle Real Application Clusters (Oracle RAC). One example is Right Growing Index (RGI), where new rows are added to a table with an index from several Oracle RAC nodes. The index leaf block becomes a hot block that needs frequent updates across all nodes.

  ð  Without Exadata's Smart Block Transfer feature, a hot block can be transferred from a sending node to a receiving node only after the sending node has made the changes in its redo log buffer durable in its redo log. With Smart Block Transfer, this redo log write latency at the sending node is eliminated: the block is transferred as soon as the I/O to the redo log is issued at the sending node, without waiting for it to complete.

  ð  To enable Smart Block Transfer, set the hidden static parameter "_cache_fusion_pipelined_updates"=FALSE on ALL Oracle RAC nodes.
Exadata Storage Layout:



LUN:
ð  A LUN is created from a physical disk; each physical disk is presented to the cell as a single LUN.

Cell Disk:
ð  A cell disk is created on a LUN; there is a one-to-one relationship between a physical disk (LUN) and a cell disk. A segment of cell disk storage, referred to as the cell system area, is reserved for the Oracle Exadata Storage Server Software.

Grid Disk:
ð  A grid disk is a logical partition of a cell disk; one cell disk can be divided into multiple grid disks.

ASM Disk:
ð  Each grid disk is presented to Oracle ASM as one ASM disk.
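The hierarchy above (physical disk → LUN → cell disk → grid disks → ASM disks) can be sketched as a small data model (names are illustrative, patterned on typical grid disk naming; not an Oracle API):

```python
# Toy model of the Exadata storage layout described above.
from dataclasses import dataclass, field
from typing import List

@dataclass
class GridDisk:
    """A slice of a cell disk; ASM sees each grid disk as one ASM disk."""
    name: str

@dataclass
class CellDisk:
    """One per LUN/physical disk, minus the cell system area."""
    lun: str
    grid_disks: List[GridDisk] = field(default_factory=list)

# One physical disk -> one LUN -> one cell disk -> several grid disks.
cd = CellDisk(lun="0_0")
cd.grid_disks = [GridDisk("DATA_CD_00_cell01"), GridDisk("RECO_CD_00_cell01")]
asm_disks = [gd.name for gd in cd.grid_disks]   # each grid disk = one ASM disk
print(asm_disks)
```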
Migration Techniques of an Oracle Database to an Exadata Machine?
  ð  Using RMAN.
  ð  Using Data Pump (EXPDP/IMPDP).
  ð  Using Oracle GoldenGate replication.
  ð  Creating a standby database and switching over.
Considerations for DBAs before migrating to an Exadata server:
  ð  Ensure the PGA is adequately sized (on Exadata the PGA is used heavily, and sizing it appropriately helps direct path read operations).
  ð  Make sure you are on the latest Exadata patch bundle (MOS Note: 888828.1).
  ð  A fast network.
Considerations for developers before migrating to an Exadata server:
  ð  Using SQL Performance Analyzer (SPA), DBAs can simulate application behaviour with the Exadata simulation feature and test the given load.

  ð  Use Oracle Real Application Testing to test the performance.
  ð  Order data in tables (for example, by sorting on load) for better use of storage indexes.

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
ZFS
What is ZFS and Pooled Storage in ZFS?
ð  ZFS is robust, scalable, and easy to administer. It uses the concept of storage pools to manage physical storage and eliminates traditional volume management: instead of forcing you to create virtualized volumes, ZFS aggregates devices into a storage pool.
ð  The storage pool describes the physical characteristics of the storage (device layout, data redundancy, and so on) and acts as an arbitrary data store from which file systems can be created.
ð  File systems are no longer constrained to individual devices, allowing them to share disk space with all file systems in the pool.
ð  You no longer need to predetermine the size of a file system, as file systems grow automatically within the disk space allocated to the storage pool. When new storage is added, all file systems within the pool can immediately use the additional disk space without additional work.
ð  ZFS is a transactional file system, which means that the file system state is always consistent on disk.
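The pooled-storage model above can be sketched in a few lines (conceptual Python, not ZFS code; class names are illustrative):

```python
# Toy model of ZFS pooled storage: file systems have no fixed size and
# draw from one shared pool; adding devices to the pool is immediately
# visible to every file system.
class Pool:
    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0

    def free(self):
        return self.capacity - self.used

class FileSystem:
    def __init__(self, pool):
        self.pool = pool        # no pre-sized volume: grows within the pool
        self.used = 0

    def write(self, size):
        if size > self.pool.free():
            raise IOError("pool out of space")
        self.used += size
        self.pool.used += size

pool = Pool(capacity=100)
fs1, fs2 = FileSystem(pool), FileSystem(pool)
fs1.write(30)
fs2.write(20)            # both file systems share the same free space
print(pool.free())       # 50
pool.capacity += 100     # adding storage to the pool...
print(pool.free())       # 150 -> immediately usable by all file systems
```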
ZFS Snapshots:
ð  A snapshot is a read-only copy of a file system or volume. Snapshots can be created quickly and easily. Initially, snapshots consume no additional disk space within the pool.
ð  As data within the active dataset changes, the snapshot consumes disk space by continuing to reference the old data. As a result, the snapshot prevents the data from being freed back to the pool.
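Snapshot space accounting can be modeled like this (a toy copy-on-write sketch; blocks are dictionary entries rather than real disk blocks):

```python
# A snapshot costs nothing while the live data is unchanged; it only
# consumes space for blocks the live file system has since rewritten,
# because the snapshot keeps the old copies referenced.
def snapshot_space(original_blocks, current_blocks):
    held = {bid: data for bid, data in original_blocks.items()
            if current_blocks.get(bid) != data}
    return len(held)

original = {1: "a", 2: "b", 3: "c"}   # state at snapshot time
current = dict(original)              # live dataset, initially identical
print(snapshot_space(original, current))  # 0: snapshot is initially free

current[2] = "B"                          # live data diverges...
print(snapshot_space(original, current))  # 1: old copy of block 2 is retained
```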
ZIL:
o    ZIL stands for ZFS Intent Log. The write SSD cache is called the log device, and it is used by the ZIL. The ZIL effectively turns synchronous writes into asynchronous writes, which helps NFS or databases, for example. All data is written to the ZIL like a journal log, but it is only read back after a crash.
o    The ZIL handles synchronous writes by immediately writing their data and information to stable storage, an "intent log", so that ZFS can claim that the write completed. The written data hasn't reached its final destination on the ZFS filesystem yet, that will happen sometime later when the transaction group is written. In the meantime, if there is a system failure, the data will not be lost as ZFS can replay that intent log and write the data to its final destination. This is a similar principle as Oracle redo logs, for example.
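The ZIL flow above (acknowledge on log write, apply at transaction-group commit, replay the log after a crash) can be sketched as:

```python
# Conceptual model of the ZFS Intent Log: synchronous writes are made
# durable in the intent log immediately, applied to the main pool later,
# and the log is only *read* during crash recovery. Not real ZFS code.
class ZIL:
    def __init__(self):
        self.log = []       # stable storage (the intent log / journal)
        self.pool = {}      # final on-disk state of the file system

    def sync_write(self, key, value):
        self.log.append((key, value))   # durable -> write can be ACKed now

    def txg_commit(self):
        # Transaction-group flush: apply pending writes, then clear log.
        for key, value in self.log:
            self.pool[key] = value
        self.log.clear()

    def replay_after_crash(self):
        # Un-flushed log entries are replayed into the pool on recovery.
        self.txg_commit()

zil = ZIL()
zil.sync_write("block7", "new-data")  # ACKed before reaching the pool
# -- crash happens before the transaction group is written --
zil.replay_after_crash()
print(zil.pool)   # {'block7': 'new-data'}: no acknowledged write is lost
```

This mirrors the Oracle redo log analogy made above: commit durability comes from the sequential log, not from the final data placement.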
Slog:
o    SLOG stands for Separate Intent Log. A SLOG can use write-optimized SSDs (known as "Logzillas") to accelerate the performance of synchronous write workloads.
o    ZIL stores the log data on the same disks as the ZFS pool. While this works, the performance of synchronous writes can suffer as they compete for disk access along with the regular pool requests (reads, and transaction group writes.) Having two different workloads compete for the same disk can negatively affect both workloads, as the heads seek between the hot data for one and the other. The solution is to give the ZIL its own dedicated "log devices" to write to. These dedicated log devices form the separate intent log: the SLOG.
Logzilla:
o    By writing to dedicated log devices, performance can be improved further by choosing a device best suited for fast writes: the Logzilla. Logzilla is the name given to write-optimized, flash-memory-based solid-state disks (SSDs). Flash memory is known for slow write performance, so to compensate, a Logzilla device buffers the write in DRAM and uses a supercapacitor to power the device long enough to flush the DRAM buffer to flash should it lose power. These devices can write an I/O in as little as 0.1 ms (depending on I/O size), and do so consistently. Used as the SLOG, they let synchronous write workloads be served consistently fast.

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Some Important SQL’s
To Find out if Exadata is in use or not:
SQL> set serveroutput on
DECLARE
 i NUMBER;
 j NUMBER;
 k CLOB;
BEGIN
  dbms_feature_exadata (i, j, k);
  dbms_output.put_line('1: ' || i);
  dbms_output.put_line('2: ' || j);
  dbms_output.put_line('3: ' || k);
END;
/
I have tested a few of these solutions myself, but many answers remain somewhat bookish. This is an attempt to answer a few of the FAQs related to Exadata and ZFS. Below are the references I used to collate the information in one place.
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
References:
Oracle Exadata Expert Handbook.
MOS Note: 50415.1 - WAITEVENT: "direct path read"
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
!!!!!!Hope you enjoyed reading the BLOG. !!!!!!
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
