For this blog I have chosen a topic I had trouble finding answers to. I will try to address the FAQs on Exadata and ZFS, and I hope this becomes a starting point for everyone new to this technology.
Exadata
Exadata Machine:
- An Exadata Machine is hardware and software engineered together to provide optimized functionality for enterprise-class databases and their associated workloads. Its architecture features intelligent storage that can process offloaded requests and return only the piece of information requested by the client; a flash technology that can be configured in two modes, 1) Flash Write-Through and 2) Flash Write-Back, to provide optimal performance based on the workload; and a high-speed InfiniBand switch that binds the database servers and storage servers into a scalable, high-performance cluster.
IDB Protocol:
- iDB stands for Intelligent Database protocol. Oracle Exadata uses iDB to transfer data between the database nodes, the ASM instances, and the storage cell nodes. It provides both simple I/O functionality, such as block-oriented reads and writes, and advanced I/O functionality, including offloading and I/O resource management. It is implemented in the database kernel and works on a function shipping architecture to map database operations to Exadata storage cell operations.
- iDB implements a function shipping architecture in addition to the traditional data block shipping provided by the database.
- iDB provides interconnection bandwidth aggregation, redundancy, and failover.
- iDB is built on the industry-standard Reliable Datagram Sockets (RDSv3) protocol and runs over InfiniBand ZDP (Zero-loss Zero-copy Datagram Protocol), a zero-copy implementation of RDS.
What is interconnection bandwidth aggregation?
- Bandwidth aggregation refers to the bundling of multiple physical interfaces to create a single virtual interface that supports the combined bandwidth of the physical interfaces. Not only does this offer a performance increase over a single physical interface, it also offers redundancy: a physical interface failure does not take down the virtual interface, because the other physical interfaces remain operational.
ZDP:
- ZDP is the Zero-loss Zero-copy Datagram Protocol, and its objective is to eliminate unnecessary copying of blocks. On Exadata, iDB traffic over InfiniBand uses ZDP.
*** Before we continue the FAQs, I would like to elaborate more on the Storage Server architecture and flow. ***
EXADATA STORAGE SERVER ARCHITECTURE
- Like any traditional storage server, Exadata uses commodity hardware: CPUs, memory, network interface controllers (NICs), storage disks, and so on. The secret sauce lies in the software provided by Oracle; an Exadata Storage Cell comes preloaded with the Oracle Enterprise Linux OS and the intelligent Exadata Storage Server Software.
- An ASM instance running on a compute node communicates with a storage server through an InfiniBand network connection using the special Intelligent Database (iDB) protocol.
- A typical Quarter Rack consists of two compute nodes and three storage cells, with communication established between the compute nodes and the storage cells over InfiniBand. The Quarter Rack comes with two InfiniBand network switches, known as leaf switches, configured between the cells and the compute nodes to provide a communication path that tolerates any single switch failure.
*********
Now that we have some background on what an Exadata Quarter Rack looks like, let's continue with some more FAQs.
*********
Compute Nodes in Exadata:
- Compute nodes are simply the database servers.
Cell Server in Exadata:
- The cell server is the storage node. Its core functionality is delivered by three key components that run in the background: CELLSRV, MS, and RS.
Cellsrv:
- Cell Server (CELLSRV) is a multithreaded process. It consumes the most CPU cycles on the cell and uses the special iDB protocol over InfiniBand for communication between ASM instances and the storage cells. It is the primary component running on the cell and is responsible for Exadata's advanced functionality, such as SQL offloading (Smart Scans), I/O prioritization (IORM), cell-to-cell data transfer, and so on.
Management Server:
- Management Server (MS) provides standard cell configuration and management functionality in coordination with CellCLI.
- It performs the following additional tasks:
  - Tracks hardware-level changes on the cell server and notifies CELLSRV through an ioctl system call.
  - Collects, computes, and manages storage server metrics.
  - Rebuilds the virtual drives when a disk is replaced. Typically, when a disk performs poorly, the associated grid disks and cell disk are taken offline, and the MS service notifies CELLSRV.
  - MS also triggers the following automated tasks every hour:
    - Deletes files older than seven days from the ADR directory, $LOG_HOME, and all metric history.
    - Performs alert log auto-maintenance whenever the file reaches 10 MB in size, and deletes previous copies of the alert log when they become seven days old.
    - Sends a notification when file utilization reaches 80%.
Restart Server?
- Restart Server (RS) monitors the other services on the cell server and restarts them automatically whenever a service needs to be restarted. It also handles planned service restarts as part of software updates.
- cellrssrm is the main RS process, and it spawns three child processes:
  - cellrsomt
  - cellrsbmt
  - cellrsmmt
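To verify that all three services are up on a cell, the cell attributes can be listed from CellCLI; the command is standard, and the sample output is illustrative only:

CellCLI> LIST CELL ATTRIBUTES cellsrvStatus, msStatus, rsStatus DETAIL
         cellsrvStatus:      running
         msStatus:           running
         rsStatus:           running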
What is Offloading?
- Offloading is the generic term for pushing the processing of requested information down to the storage layer instead of doing it on the database server. (A query to identify which functions can be offloaded appears at the end of this section.) In Exadata, the major offload mechanisms include:
- Traditional Scan vs. Smart Scan
  Traditional Scan:
  - With traditional storage, all database intelligence resides in the database hosts.
  - Database CPU is used to evaluate the predicates and do further processing, such as join operations.
  - A large percentage of data is retrieved that might not be of interest, when only a subset of the data is required.
  Smart Scan:
  - Only the relevant columns are returned to the hosts (Column Projection).
  - Unlike a traditional scan, the CPU consumed by predicate evaluation is offloaded to the storage cells.
  - Predicate filtering also avoids a large percentage of the I/O that a traditional scan would incur.
  - Other features of Smart Scans are covered in the Smart Scan optimizations below.
- Smart Scan optimizations include (see the example query after this list):
  - Column Projection
  - Predicate Filtering
  - Simple Joins
  - Function Offloading
  - Virtual Column Evaluation
  - HCC Decompression
  - Decryption
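To see whether Smart Scans are actually saving work on a running system, the cumulative offload statistics can be queried. A minimal sketch; the statistic names below are standard on Exadata-enabled databases, and the values are cumulative since instance startup:

SQL> SELECT name, ROUND(value/1024/1024) AS mb
       FROM v$sysstat
      WHERE name IN ('cell physical IO bytes eligible for predicate offload',
                     'cell physical IO interconnect bytes returned by smart scan');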
- Storage Indexes:
  - Oracle Exadata Storage Servers maintain a storage index, which contains a summary of the data distribution on the disk. The storage index is maintained automatically and is transparent to Oracle Database. It is a collection of in-memory region indexes, and each region index stores summaries for up to eight columns. There is one region index for each 1 MB of disk space. Each region index maintains the minimum and maximum values of the columns of the table. The minimum and maximum values are used to eliminate unnecessary I/O, also known as I/O filtering.
  - The content stored in one region index is independent of the other region indexes. This makes them highly scalable and avoids latch contention.
  - Queries using the following comparisons are improved by the storage index:
    - Equality (=)
    - Inequality (<, !=, or >)
    - Less than or equal (<=)
    - Greater than or equal (>=)
    - IS NULL
    - IS NOT NULL
  - Storage indexes are built automatically after Oracle Exadata Storage Server Software receives a query with a comparison predicate that is greater than the maximum or less than the minimum value for the column in a region, and that would have benefited if a storage index had been present. The software automatically learns which storage indexes would have benefited a query, and then creates them so that subsequent similar queries benefit. A few advantages are as follows:
  - Partition pruning-like benefits with storage indexes. Examples:
    - Consider a table whose values range from one to eight, covered by two region indexes: one region index stores a minimum of 1 and a maximum of 5, and the other stores a minimum of 3 and a maximum of 8.
    - Using the storage index allows table joins to skip unnecessary I/O operations. For example, the following query would perform an I/O operation and apply a Bloom filter to only the first block of the fact table. The I/O for the second block of the fact table is completely eliminated by the storage index, as its minimum/maximum range (5, 8) is not present in the Bloom filter. (A query to observe the I/O saved by storage indexes follows this example.)

SQL> SELECT count(*) FROM fact, dim
     WHERE fact.m = dim.m AND dim.product = 'Hard drive';
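The cumulative I/O avoided by storage indexes is tracked by a dedicated statistic (a standard statistic name; the value depends entirely on your workload):

SQL> SELECT ROUND(value/1024/1024) AS mb_saved
       FROM v$sysstat
      WHERE name = 'cell physical IO bytes saved by storage index';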
- Cell-to-Cell Data Transfer:
  - In earlier releases, Exadata cells did not communicate with each other directly. Any data movement between the cells went through the database servers: data was read from a source cell into database server memory, and then written out to the destination cell. Starting with Oracle Exadata Storage Server Software 12c Release 1 (12.1), database server processes can offload data transfer to the cells.
  - A database server instructs the destination cell to read the data directly from the source cell and write it to its local storage. This halves the amount of data transferred across the fabric, reducing InfiniBand bandwidth consumption and memory requirements on the database server.
  - Oracle Automatic Storage Management (Oracle ASM) resynchronization, resilver, and rebalance use this feature to offload data movement, improving bandwidth utilization at the InfiniBand fabric level. No configuration is needed to use this feature.
- Incremental Backup Offloading:
  - To optimize the performance of incremental backups, the database can offload block filtering to Oracle Exadata Storage Server. This optimization is only possible when taking backups using Oracle Recovery Manager (RMAN), and the offload processing happens transparently, without user intervention. During offload processing, the storage server software filters out the blocks that are not required for the incremental backup in progress, so only the required blocks are sent to the database, making backups significantly faster. (A sketch follows below.)
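As an illustration, the offload requires nothing special in the backup command itself; a plain RMAN incremental backup benefits automatically. Pairing it with block change tracking (a general Oracle feature, not Exadata-specific) further narrows what the cells must scan. A sketch, with a hypothetical disk group name:

SQL> ALTER DATABASE ENABLE BLOCK CHANGE TRACKING USING FILE '+DATA';

RMAN> BACKUP INCREMENTAL LEVEL 1 DATABASE;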
- To find out which Oracle functions can be offloaded, the details can be retrieved using the query below:

SQL> SELECT * FROM v$sqlfn_metadata WHERE offloadable = 'YES';
What is Flash Cache?
- Write-Through Cache:
  - Write-through means that write I/O coming from the database layer first goes to the spinning drives, where it is mirrored according to the redundancy of the disk group in which the file being written is placed. Afterwards, the cells may populate the Flash Cache if they think it will benefit subsequent reads, but no mirroring in flash is required.
  - Write operations are not signalled as complete until the I/O to the disk has completed.
  - In case of a hardware failure, the mirroring has already been done on the spinning drives.
- Write-Back Cache:
  - Exadata Smart Flash Cache transparently caches frequently accessed data on fast solid-state storage, improving query response times and throughput. Write operations serviced by flash instead of by disk are referred to as "write-back flash cache".
  - An active data block can remain in the write-back flash cache for months or years. All data is copied to disk, as needed.
  - If there is a problem with the flash cache, the operations transparently fail over to the mirrored copies on flash. No user intervention is required. The data on flash is mirrored based on its allocation units, which means the amount of data to be rewritten is proportional to the lost cache size, not the disk size.
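The current mode can be inspected, and switched, from CellCLI. The write-back switch below is a hedged sketch of the documented procedure; note that DROP FLASHCACHE discards the cache contents, so always follow the MOS procedure for your software release:

CellCLI> LIST CELL ATTRIBUTES flashCacheMode
CellCLI> DROP FLASHCACHE
CellCLI> ALTER CELL SHUTDOWN SERVICES CELLSRV
CellCLI> ALTER CELL flashCacheMode = WriteBack
CellCLI> ALTER CELL STARTUP SERVICES CELLSRV
CellCLI> CREATE FLASHCACHE ALL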
Smart Flash Log:
- Oracle Exadata Smart Flash Log uses Exadata flash storage only for temporary storage of redo log data. By default, it uses 32 MB on each flash disk, for a total of 512 MB per Exadata cell. It is automatically configured and enabled; no additional configuration is needed. Smart Flash Log performs redo writes simultaneously to both flash memory and the disk controller cache, and completes the write when the first of the two completes. This improves user transaction response time and increases overall database throughput for I/O-intensive workloads.
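To confirm the flash log is present on a cell and see its size and status (a standard CellCLI command; the attributes shown vary by release):

CellCLI> LIST FLASHLOG DETAIL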
Flash Cache Compression:
- Flash cache compression dynamically increases the logical capacity of the flash cache by transparently compressing user data as it is loaded into the flash cache. This allows much more data to be kept in flash and decreases the need to access data on disk drives. I/O to data in flash is orders of magnitude faster than I/O to data on disk. The compression and decompression operations are completely transparent to the application and database, and have no performance overhead.
- Depending on the compressibility of the user data, Oracle Exadata Storage Server Software dynamically expands the flash cache size up to two times. Compression benefits vary based on the redundancy in the data: uncompressed tables and indexes see the largest space reductions, OLTP-compressed tables and indexes see significant reductions, and tables using Hybrid Columnar Compression see minimal reductions. The Oracle Advanced Compression option is required to enable flash cache compression.
- This feature is enabled using the CellCLI command ALTER CELL flashCacheCompress=TRUE.
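Enabling and verifying it might look like this (a sketch; some hardware generations require an additional support attribute to be set first, so check the documentation for your model):

CellCLI> ALTER CELL flashCacheCompress = TRUE
CellCLI> LIST CELL ATTRIBUTES flashCacheCompress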
Smart Block Transfer:
- Minimum software required: 12.1.0.2 BP12. Many OLTP workloads have hot blocks that need to be updated frequently across multiple nodes in Oracle Real Application Clusters (Oracle RAC). One example is a Right Growing Index (RGI), where new rows are added to an indexed table from several Oracle RAC nodes; the index leaf block becomes a hot block that needs frequent updates across all nodes.
- Without Exadata's Smart Block Transfer feature, a hot block can be transferred from a sending node to a receiving node only after the sending node has made the changes in its redo log buffer durable in its redo log. With Smart Block Transfer, this redo log write latency at the sending node is eliminated: the block is transferred as soon as the I/O to the redo log is issued at the sending node, without waiting for it to complete.
- To enable Smart Block Transfer, set the hidden static parameter "_cache_fusion_pipelined_updates"=FALSE on ALL Oracle RAC nodes.
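Because this is a static underscore parameter, it has to go into the SPFILE on every instance, followed by a restart; a sketch, to be applied only under Oracle Support guidance:

SQL> ALTER SYSTEM SET "_cache_fusion_pipelined_updates" = FALSE
     SCOPE = SPFILE SID = '*';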
Exadata Storage Layout:
LUN:
- A LUN is created from a physical disk; each physical disk is presented as a LUN.
Cell Disk:
- A cell disk is created on a LUN. A segment of cell disk storage, referred to as the cell system area, is used by the Oracle Exadata Storage Server Software system. There is a one-to-one relationship between a physical disk and a cell disk: one physical disk corresponds to a single cell disk.
Grid Disk:
- A cell disk is divided into one or more grid disks; each grid disk is a slice of a cell disk.
ASM Disk:
- Each grid disk corresponds to one ASM disk.
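The whole layout can be walked from CellCLI; the CREATE line below is a sketch with a hypothetical prefix and size:

CellCLI> LIST LUN
CellCLI> LIST CELLDISK
CellCLI> LIST GRIDDISK
CellCLI> CREATE GRIDDISK ALL HARDDISK PREFIX=DATA, SIZE=500G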
Migration Techniques for Moving an Oracle Database to Exadata:
- Using RMAN.
- Using EXPDP/IMPDP (Data Pump).
- Using Oracle GoldenGate replication.
- Creating a standby database and switching over.
Considerations for DBAs before migrating to Exadata:
- Ensure the PGA is adequately sized (on Exadata the PGA is used heavily, and sizing it appropriately helps direct path read operations).
- Make sure you are on the latest Exadata patch bundle (MOS Note 888828.1).
- A fast network.
Considerations for Developers before migrating to Exadata:
- Using SQL Performance Analyzer (SPA) with the Exadata simulator, a DBA can simulate application behaviour and test the given load.
- Use Oracle Real Application Testing to test the performance.
- Order data in tables for better usage of storage indexes (see the sketch below).
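As a sketch of that last point (table and column names are hypothetical): reloading a table sorted on a commonly filtered column clusters similar values into the same 1 MB regions, which makes the min/max ranges kept by the storage index far more selective:

SQL> CREATE TABLE sales_sorted AS
     SELECT * FROM sales ORDER BY customer_id;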
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
ZFS
What is ZFS and Pooled Storage in ZFS?
- ZFS is robust, scalable, and easy to administer. ZFS uses the concept of storage pools to manage physical storage. ZFS eliminates volume management: instead of forcing you to create virtualized volumes, ZFS aggregates devices into a storage pool (see the sketch after this list).
- The storage pool describes the physical characteristics of the storage (device layout, data redundancy, and so on) and acts as an arbitrary data store from which file systems can be created.
- File systems are no longer constrained to individual devices, allowing them to share disk space with all file systems in the pool.
- You no longer need to predetermine the size of a file system, as file systems grow automatically within the disk space allocated to the storage pool. When new storage is added, all file systems within the pool can immediately use the additional disk space without additional work.
- ZFS is a transactional file system, which means that the file system state is always consistent on disk.
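A minimal sketch of pooled storage in practice (pool and device names are hypothetical): create a mirrored pool once, then carve file systems out of it with no up-front sizing:

# zpool create tank mirror c1t0d0 c1t1d0
# zfs create tank/home
# zfs create tank/db
# zfs list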
ZFS Snapshots:
- A snapshot is a read-only copy of a file system or volume. Snapshots can be created quickly and easily, and initially they consume no additional disk space within the pool.
- As data within the active dataset changes, the snapshot consumes disk space by continuing to reference the old data. As a result, the snapshot prevents that data from being freed back to the pool.
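For example (dataset and snapshot names are hypothetical):

# zfs snapshot tank/home@monday
# zfs list -t snapshot
# zfs rollback tank/home@monday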
ZIL:
- Stands for ZFS Intent Log. The write SSD cache is called the Log Device, and it is used by the ZIL. The ZIL essentially turns synchronous writes into asynchronous writes, which helps workloads such as NFS or databases. All data is written to the ZIL like a journal log, but it is only read back after a crash.
- The ZIL handles synchronous writes by immediately writing their data and information to stable storage, an "intent log", so that ZFS can claim that the write has completed. The written data has not yet reached its final destination on the ZFS filesystem; that happens some time later, when the transaction group is written. In the meantime, if there is a system failure, the data is not lost, as ZFS can replay the intent log and write the data to its final destination. This is a similar principle to Oracle redo logs, for example.
SLOG:
- Stands for Separate Intent Log. A SLOG can use write-optimized SSDs (which we call "Logzillas") to accelerate the performance of synchronous write workloads.
- By default, the ZIL stores its log data on the same disks as the ZFS pool. While this works, the performance of synchronous writes can suffer, as they compete for disk access with the regular pool requests (reads, and transaction group writes). Having two different workloads compete for the same disks can negatively affect both, as the heads seek between the hot data for one and the other. The solution is to give the ZIL its own dedicated "log devices" to write to. These dedicated log devices form the separate intent log: the SLOG.
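Attaching a dedicated log device to an existing pool is a one-liner (device name hypothetical):

# zpool add tank log c4t0d0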
Logzilla:
- By writing to dedicated log devices, we can improve performance further by choosing a device best suited for fast writes. Enter Logzilla: the name we gave to write-optimized, flash-memory-based solid state disks (SSDs). Flash memory is known for slow write performance, so to improve this a Logzilla device buffers the write in DRAM and uses a supercapacitor to power the device long enough to flush the DRAM buffer to flash should it lose power. These devices can write an I/O in as little as 0.1 ms (depending on I/O size), and do so consistently. By using them as our SLOG, we can serve synchronous write workloads consistently fast.
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Some Important SQLs
To find out if Exadata is in use or not:
SQL> set serveroutput on
DECLARE
  i NUMBER;
  j NUMBER;
  k CLOB;
BEGIN
  dbms_feature_exadata(i, j, k);
  dbms_output.put_line('1: ' || i);
  dbms_output.put_line('2: ' || j);
  dbms_output.put_line('3: ' || k);
END;
/
I have tested a few of these solutions myself, but many of the answers remain more bookish. This is an attempt to answer a few of the FAQs related to Exadata and ZFS. Below are the references I used to collate the information in one place.
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
References:
- http://www.oracle.com/technetwork/articles/oem/11g-exadata-simulator-315027.html (Exadata Simulator)
- Oracle Exadata Expert Handbook
- MOS Note 888828.1 - Exadata Database Machine and Exadata Storage Server Supported Versions
- MOS Note 50415.1 - WAITEVENT: "direct path read"
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
!!!!!! Hope you enjoyed reading the blog! !!!!!!
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++