
The Storage Area Network (SAN) is now a key component of most reasonably sized IT infrastructures. The SAN we rely on today evolved from very humble beginnings into a mission-critical Enterprise IT component. The first SANs were introduced to connect small numbers of enterprise-scale systems (zSeries and iSeries computers from IBM) to storage arrays a few cabinets away.

Note: SANs support block-level access to data; support for filesystems and metadata is provided at a higher level, in operating system code or sometimes within the database or Java appliance. There are other types of shared storage that are file or object oriented (as opposed to block oriented); these are known as Network Attached Storage (NAS) and Object Stores (ObS).

NAS storage often supports CIFS (Common Internet File System) and NFS (Network File System). CIFS is a Windows-centric protocol whilst NFS is usually found on UNIX-based platforms.

Both NAS and ObS use TCP/IP to communicate between the host computers and the shared storage controllers. This IP traffic is usually carried over the Local Area Network (LAN) rather than a dedicated SAN.

SAN traffic between the application and storage array is significantly more sensitive to delayed or lost data than is similar LAN traffic. To prevent customer-affecting events, the SAN is typically designed to avoid overloading. One common solution is to construct the SAN so that it can accommodate every port sending and receiving at full speed. Unfortunately, SAN connections are expensive and SANs are cost-constrained. SAN architects balance cost and risk and recognize that a mistake can result in customer-impacting outages.

The protocols of the SAN are:

  • FC (Fibre Channel)
  • FICON (Fibre Connectivity) – a version of FC for IBM zSeries computers
  • FCoE (Fibre Channel over Ethernet)
  • FCIP (Fibre Channel over IP)
  • iSCSI (IP Small Computer Systems Interface)

All are enterprise-scale derivatives of the peer-to-peer SCSI (Small Computer Systems Interface) of the 1980s. The latter two (FCIP and iSCSI) are IP-encapsulated protocols and are carried over a LAN.
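
To keep the distinctions straight, the sketch below (plain Python, purely illustrative; the descriptions are informal summaries, not a formal specification) lists what each protocol carries and what it runs over.

    # Illustrative summary of the SAN protocols above; descriptions are
    # informal and the table is not exhaustive.
    san_protocols = {
        # name:    (what it carries,          what it runs over)
        "FC":      ("SCSI traffic (FCP)",     "a native Fibre Channel fabric"),
        "FICON":   ("zSeries channel I/O",    "Fibre Channel, as an upper-layer protocol"),
        "FCoE":    ("FC frames",              "converged, lossless Ethernet"),
        "FCIP":    ("FC frames",              "a TCP/IP tunnel across a LAN or WAN"),
        "iSCSI":   ("SCSI commands",          "TCP/IP directly, with no FC layer at all"),
    }

    for name, (payload, transport) in san_protocols.items():
        print(f"{name:6} carries {payload:22} over {transport}")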

The Application – Storage interface and its requirements
Neither SCSI nor its derivatives (FC, FCoE) includes a transport-layer protocol to perform error correction and retransmission, so they object strongly to data loss. Additionally, SCSI and its derivatives assume a direct peer-to-peer connection between the initiator and the target, so end devices, and the applications that use them, do not tolerate latency and delays well.

FC is implemented at the same protocol level as Ethernet but, unlike Ethernet, is designed as a lossless protocol with very predictable latency characteristics. FC was designed to support the sensitive application-to-storage interface, maximising throughput and reliability.
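
The losslessness comes from credit-based (buffer-to-buffer) flow control: a sender may only transmit a frame while it holds a credit, and the receiver returns a credit once it has freed a buffer, so frames are never dropped for lack of buffer space. The sketch below is a minimal, purely illustrative model of that behaviour; the class name and buffer sizes are invented and are not taken from any real FC implementation.

    # Minimal sketch of credit-based (buffer-to-buffer) flow control, the
    # mechanism behind FC's lossless behaviour. Names and sizes are invented.
    from collections import deque

    class CreditedLink:
        def __init__(self, buffer_slots):
            self.credits = buffer_slots          # sender may only transmit while credits > 0
            self.receive_buffer = deque()

        def send(self, frame):
            if self.credits == 0:
                return False                     # sender must wait -- the frame is never dropped
            self.credits -= 1
            self.receive_buffer.append(frame)
            return True

        def receiver_drains_one(self):
            if self.receive_buffer:
                self.receive_buffer.popleft()
                self.credits += 1                # credit returned (R_RDY) once a buffer is freed

    link = CreditedLink(buffer_slots=4)
    sent = sum(link.send(f"frame-{i}") for i in range(6))
    print(f"accepted {sent} of 6 frames; the sender holds the rest until credits return")
    link.receiver_drains_one()
    print("after one credit is returned, sending succeeds again:", link.send("frame-retry"))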

Note: FCIP and iSCSI don’t have any mechanism to deal with LAN overloading, so they are prone to delays and frame loss. As a result they are not suitable in larger environments where engineered levels of performance and resilience are required. Their main benefit is that they can be implemented at low cost on an existing LAN.

FCoE was created to enable network convergence, allowing delay- and loss-sensitive data flows to share the same physical network as normal LAN traffic. As a result, FCoE is implemented at the same protocol level as FC and Ethernet and uses smart switches to ensure that the FCoE traffic is shaped to give good application – storage performance.

Enterprise Scale SAN Design
SANs are carefully designed to avoid frame loss, latency and delay by integrating non-blocking switch fabrics and a judicious fan-out ratio across the hierarchy of connected host and edge switch ports, the inter-switch links (ISLs) to the core switches, and the connections to the storage array. Typically, SAN designs assume that all hosts may be communicating at full line rate at the same time, implying significantly over-provisioned SAN capacity.
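
As a back-of-the-envelope illustration of why the full-line-rate assumption is so expensive, consider a single edge switch. The port counts and speeds below are assumptions chosen for the arithmetic, not figures from the article.

    # Rough ISL sizing for one edge switch. All figures (8 Gb/s ports,
    # a 48-port switch, 40 host-facing ports) are assumed for illustration.
    port_speed_gbps   = 8
    host_facing_ports = 40                      # edge switch ports connected to hosts
    isl_ports         = 48 - host_facing_ports  # remaining ports uplinked to the core

    host_bandwidth = host_facing_ports * port_speed_gbps   # 320 Gb/s if every host runs flat out
    isl_bandwidth  = isl_ports * port_speed_gbps            # 64 Gb/s of uplink

    print(f"oversubscription ratio: {host_bandwidth / isl_bandwidth:.0f}:1")  # 5:1
    # A strictly non-blocking design needs host_bandwidth == isl_bandwidth,
    # i.e. as many ISL ports as host ports -- which is why it costs so much.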

Physical-layer over-provisioning is compounded when implementing complex, multi-tier, edge-core-edge switch designs in very large-scale environments. Intended to provide physical reconfiguration flexibility to accommodate growth, these designs actually introduce substantial additional cost, complexity and resultant risk.

SANs are typically designed around a fan-out ratio between host connections and storage array ports of between 6:1 and 12:1, depending on the estimated intensity of the storage activity. Low-utilization hosts can in theory be supported at the upper fan-out ratios, whilst highly utilized hosts need much lower fan-out ratios.
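
The fan-out ratio translates directly into the number of storage array ports a host estate implies. The host count below is invented purely to show the arithmetic.

    # Fan-out arithmetic: array ports implied by a host estate at the 6:1
    # and 12:1 ratios quoted above. The host count is illustrative.
    import math

    host_ports = 240

    for fan_out in (6, 12):
        array_ports = math.ceil(host_ports / fan_out)
        print(f"{host_ports} host ports at {fan_out}:1 fan-out -> {array_ports} array ports")
    # 6:1  -> 40 array ports (busy hosts)
    # 12:1 -> 20 array ports (lightly loaded hosts)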

[Figure: SAN Layout]

SAN design in the Cloud
Good SAN design involves balancing high-intensity and low-intensity hosts on the same edge switch to maximize switch utilization. Choosing the correct fan-out ratio is a difficult enough decision at the initial implementation stage, but it becomes difficult to maintain in a mature and growing SAN – and time-consuming to manage with a virtualized workload that is automatically and transparently moving between physical hosts and their associated SAN connections in real time.
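
One way to picture the balancing exercise is as a simple load-spreading problem: place each host on the edge switch that currently carries the least estimated I/O intensity, so hot and quiet hosts end up mixed on every switch. The sketch below is purely illustrative; the host names, intensities and switch names are invented, and real placement would also have to respect port counts, zoning and the fan-out ratios discussed above.

    # Illustrative greedy placement: put each host on the edge switch with
    # the least accumulated I/O intensity. Intensities (relative units),
    # host names and switch names are invented.
    hosts = {"db-01": 9, "db-02": 8, "web-01": 2, "web-02": 2,
             "web-03": 1, "batch-01": 6, "file-01": 4, "dev-01": 1}

    switch_load = {"edge-A": 0, "edge-B": 0, "edge-C": 0}
    placement = {}

    # Place the hottest hosts first, each onto the currently quietest switch.
    for host, intensity in sorted(hosts.items(), key=lambda kv: -kv[1]):
        target = min(switch_load, key=switch_load.get)
        placement[host] = target
        switch_load[target] += intensity

    print(placement)
    print(switch_load)   # loads come out even with these figures: 11 per switch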

In a virtualized world, applications are untethered from the underlying physical hardware, the same hardware that has the physical network connections to the SAN. For example, VMware VMotion (deployed in production at 60-70% of VMware customers, according to VMware) enables the migration of an entire running virtual machine, application and all, from one physical host and physical storage system to another, seamlessly and without user interruption.

Cloud computing offers many benefits to the Enterprise, obscuring the physical complexity of servers, storage and networks from applications, enabling rapid deployment and enhanced availability through outages. Cloud does not, however, offer a free lunch; an inadequate underlying SAN infrastructure will very rapidly be exposed as more and more workload is virtualized and automated tools move and migrate critical business applications between nodes on ever larger clusters of servers and storage.

Cloud not only insulates and obscures the physical complexity of the underlying hardware from applications; it also has the potential to obscure the cause of an outage from IT Operations, increasing time to resolve and reducing right-first-time diagnostics. It is not unusual for a fairly simple fault that can be repaired in a few minutes to take many hours to diagnose, with much technical hand-off between support teams along the way.

The choice is stark: over-engineer, pay the price and hope for the best or adopt a more scientific approach and manage the SAN proactively.

  • http://www.virtualinstruments.com/uhrdahlblog/ Mark Urdahl

    Steve:

    This is a great start on a very fundamental question: What is “Good” SAN Design? I want to put some of your points in context.

    If we designed office space the way we design SANs, the standard office would be 5,000 square feet!

    The fact is that there is massive, habit-based over-provisioning that is not only extremely costly but actually adds risk… because there are more things that can go wrong.

    The interesting thing about storage infrastructure is that you don’t have to virtualize to consolidate. You just need to measure…twice, cut once.

    http://www.virtualinstruments.com/uhrdahlblog/

  • http://thehotaisle.com Steve O’Donnell

    Mark

    You got it completely right. One of the biggest problems is that the SAN is a black hole that we hope to route our valuable traffic through. We can’t see what is going on inside, all we can do is hope that it will work. Because we are so risk averse and can’t afford to be overloaded or have an outage, we over-provision.

    SANs actually work quite well except when they don’t and then we have a nightmare.

    Thanks

    Steve

  • http://www.afsnetworks.com Terry Bellinger

    Steve,

    Very enlightening for me. I am not an IT expert but even so your descriptions were understandable and cleared up some fuzzy areas for me. I am a last mile transport provider via my own fiber network. I am especially interested in your comments on FCIP and ISCSI –

    “Note: FCIP and iSCSI don’t have any mechanism to deal with LAN overloading, so they are prone to delays and frame loss. As a result they are not suitable in larger environments where engineered levels of performance and resilience are required. Their main benefit is that they can be implemented at low cost on an existing LAN.”

    Here is my question: If I am providing WAN transport at the Layer 2 Ethernet level which is not oversubscribed and carries latency and packet loss SLAs, would FCIP and iSCSI be a viable option for a larger environment as you mentioned above? If so, what kind of cost savings on the hardware/software might a company expect versus going with FC, FICON or FCoE?

    I know it is a broad question but if you can give some back of the napkin or rule of thumb metrics it would be very helpful.

    Thanks
    Terry

  • http://thehotaisle.com Steve O’Donnell

    Hi Terry,

    Interesting question. To support FC, FCoE and FICON over a WAN we would need a network capable of layer 2 operation. Dark fiber, typically. Expensive! IP networks have an obvious benefit: even good ones, with SLAs and not overbooked, are relatively cheap and easily available.

  • Pingback: Storage Connectivity and why it is important « Enterprise Strategy Group