A quick introduction to GFS2 over iSCSI

Posted by Ryan Uber | Clustering, High Availability, Linux, Networking, Performance | Friday 8 October 2010 9:55 pm

What does one do when a large, clustered environment, say 15 or more front-end application servers, starts to choke an NFS server, or when your nodes are reading stale data due to NFS protocol limitations? What should you do when gigabit Ethernet no longer provides sufficient throughput on your back-end storage infrastructure to deliver your files to the World Wide Web? You implement GFS2.

If you want to know more about the difference between these two filesystems, you can read the comparison here.

I recently set up a small test environment to muck around with GFS2, and the result turned out fairly decent and didn’t take a whole lot of time, either.

I’m assuming that if you are reading my blog, technology interests you, and you have heard of GFS2 or the Red Hat Clustering Suite before. But what is it? How does it work? I’ll try to explain a bit about it here.

GFS2 operates on block devices, not virtual “export” directories or “shares”, like CIFS or NFS do. So how is that going to help us achieve a shared file system among all of our web servers if it operates on block devices? There are many answers to that question, only one of which I have explored so far. You can attach block storage to remote systems in a number of different ways, to name a few:

  1. iSCSI – Probably the easiest, certainly the least expensive.
  2. ATAoE – Similar to the above; however, ATA-over-Ethernet carries ATA commands directly in Ethernet frames, whereas iSCSI carries SCSI commands over TCP/IP.
  3. Fibre Channel – Probably the best-performing option, and probably the most expensive.

At work, I will most likely be using Ethernet exclusively, so I naturally chose iSCSI as my transport. iSCSI will enable me to export a block device on my shared storage for use on multiple remote systems. Let’s explore how that is done, as without shared block storage, GFS2 isn’t going to do much for us.

Defining an iSCSI target
Firstly, we need to designate our iSCSI target, or in other words, the storage target that the application servers will be accessing. In this example I will be using just one iSCSI target, which I intend to later have serving my “vhosts” directory for web data. Let’s call our iSCSI target system ‘storage1’. It is going to need software for creating iSCSI targets, so let’s go ahead and install the package “scsi-target-utils”:

# /usr/bin/yum -y install scsi-target-utils

This package contains some useful commands that we will use to create the iSCSI target, namely “tgtadm”. Before we can use any of them, though, we need to start up the target daemon, tgtd:

# service tgtd start
Starting SCSI target daemon:                               [  OK  ]

Now let’s attempt to create a new target. For simplicity’s sake, I will be naming this particular target “vhosts”. In other examples that I have read through, a rather long string was used as the target name. Using a short name like I am here may have its downside, but other than being easily identifiable I am not sure at this point what the significance is. Fire away:
(more…)
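On the long target names seen in other examples: they are most likely iSCSI Qualified Names (IQNs), which RFC 3720 defines as "iqn.", a year and month, a reversed domain name, and an optional ":" plus a free-form identifier. A rough sketch of building one is below. The function name make_iqn, the date 2010-10, and the domain example.com are placeholders chosen for illustration, not values from the original setup.

```shell
# Build an IQN-style target name: iqn.<yyyy-mm>.<reversed domain>:<identifier>
# All arguments here are illustrative placeholders.
make_iqn() {
    # $1 = year-month the domain was owned, $2 = domain, $3 = short identifier
    # Reverse the dot-separated labels of the domain (example.com -> com.example).
    reversed=$(printf '%s\n' "$2" | awk -F. '{ for (i = NF; i > 0; i--) printf "%s%s", $i, (i > 1 ? "." : "\n") }')
    printf 'iqn.%s.%s:%s\n' "$1" "$reversed" "$3"
}

make_iqn 2010-10 example.com vhosts
```

The point of the longer form is uniqueness: a bare name like “vhosts” is easy to read, but two storage boxes could both export a target by that name, whereas the date and domain prefix keep names from colliding.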

CVE-2010-3081

Posted by Ryan Uber | Kernel, Linux, Security | Saturday 25 September 2010 12:33 pm

As pretty much every system administrator is now aware, a major security flaw was introduced to the Linux kernel in April of 2008, and just recently got some big exposure when a public exploit was published for it. Red Hat describes this vulnerability as:

an issue in the 32/64-bit compatibility layer implementation in the Linux kernel, versions 2.6.26-rc1 to 2.6.36-rc4. The compat_alloc_user_space() function is missing a sanity check on the length argument, and also a check to make sure the pointer to the block of memory in user-space that the process is attempting to write to is valid.

When panic struck, and systems began to be compromised, users of Red Hat Enterprise Linux and similar rebuild distributions such as CentOS remained vulnerable for some days. Red Hat is known for stability, and thus it makes perfect sense that they were not so quick to apply a patch, build an RPM, and push it to their mirrors. They tested their patch thoroughly.

However, some companies, like the one I work for, wanted a fix in place before an official release was available. Seeing as I build all of the RPMs around here, I was tasked with patching the kernel into a distributable format for all of our managed customers.

I first went searching for any patches that already existed. I was not about to try writing my own and be held responsible for it. On September 19th, 2010, Roberto Yokota posted a patch on the Red Hat Bugzilla page. His patch does indeed work, as he demonstrates in his three posts from the 19th. However, his patch would not apply cleanly to the current RHEL kernel build, as it duplicated patches from previous releases. I consequently needed to modify his patch slightly to build cleanly into the kernel RPMs that we released. Essentially all I needed to do was remove all instances where “%eax” was replaced by “%rax”, as those bug fixes were included in other patches. I added my patch to the RPM (%patch25159):
(more…)

An epic RAID 10 Failure — Or was it a triumph?

Posted by Ryan Uber | Linux | Wednesday 8 September 2010 2:14 pm

One day a RAID issue came to my attention by means of a Nagios alert. I logged into the system to check out the 3ware array status, figuring a disk was likely going bad and would need to be replaced and the array rebuilt. However, to my horror and dismay, this is what I saw in tw_cli:

//server> show

Ctl   Model        (V)Ports  Drives   Units   NotOpt  RRate   VRate  BBU
------------------------------------------------------------------------
c0    9650SE-4LPML 4         4        1       1       1       1      -        

//server> focus c0

//server/c0> show

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-10   DEGRADED       -       -       64K     279.377   Ri     ON     

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     DEVICE-ERROR     u0     139.73 GB   293046768     WD-WX20C7938667
p1     DEVICE-ERROR     u0     139.73 GB   293046768     WD-WX20C6901772
p2     DEVICE-ERROR     u0     139.73 GB   293046768     WD-WX20C7935022
p3     OK               u0     139.73 GB   293046768     WD-WXD0C7952295     

//server/c0>

A RAID 10 with 3 dying hard drives, and somehow, it was still online? This ran counter to everything I had come to know about RAID 10. How do you lose 3 disks and still be up and running? Losing two I can see, if by coincidence the two drives that died belonged to different mirror pairs, but this just doesn’t make any sense now, does it?
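The textbook reasoning behind that surprise can be sketched in a few lines of shell. This is only an illustration under a stated assumption: the four ports are taken to form the mirror pairs (p0, p1) and (p2, p3), which the tw_cli output above does not actually confirm, and raid10_state is a made-up helper name. A four-disk RAID 10 stays up only as long as no mirror pair has lost both of its members.

```shell
# Assumption for illustration: mirror pairs are (p0,p1) and (p2,p3).
# Arguments are the port numbers of the drives that have fully failed.
raid10_state() {
    failed=" $* "
    for pair in "0 1" "2 3"; do
        a=${pair% *}
        b=${pair#* }
        case "$failed" in
            *" $a "*)
                case "$failed" in
                    # Both members of one mirror are gone: the stripe is broken.
                    *" $b "*) echo "offline"; return 0 ;;
                esac ;;
        esac
    done
    echo "online"
}

# Every possible way to lose three of the four drives:
for combo in "0 1 2" "0 1 3" "0 2 3" "1 2 3"; do
    echo "failed: $combo -> $(raid10_state $combo)"
done
```

With only two mirror pairs, any three failed drives must include a complete pair, so every three-drive combination comes out offline, while a two-drive loss such as p0 and p2 survives. That is why an array reporting three errored ports and still serving data looks impossible on paper.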

Notice the serial on each drive. All of the drives with serials beginning with “WD-WX20C” died at the same time, while the one disk with “WD-WXD0C” is running just fine. Bad batch of drives? Possible, although the serials differ quite a bit among the “WX20C” drives.
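To make the pattern easier to spot on a larger array, the port table can be grouped by serial prefix. The sketch below assumes the table layout shown above (port in the first column, status in the second, serial in the last) and treats the first eight characters of the serial as the "batch" prefix, which is purely a guess at how to group them. The function name status_by_serial_prefix is made up for this example.

```shell
# Reads a tw_cli port table on stdin and counts ports per
# (serial prefix, status) combination.
# Assumption: the first 8 characters of the serial are a meaningful prefix.
status_by_serial_prefix() {
    awk '$1 ~ /^p[0-9]+$/ { count[substr($NF, 1, 8) " " $2]++ }
         END { for (k in count) print k, count[k] }' | LC_ALL=C sort
}

# Example usage, with the controller output saved to a file beforehand:
#   status_by_serial_prefix < c0-show.txt
```

Fed the port table from this server, it should print one line for WD-WX20C with three DEVICE-ERROR ports and one line for WD-WXD0C with a single OK port.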

This particular server was a miracle: no data was lost, and we were able to get the array healthy again by taking the server offline, then replacing one disk and rebuilding, three times over.

The moral of the story? Monitor your RAID arrays if you like having usable data on your systems.
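As a starting point for that kind of monitoring, here is a minimal sketch that picks the unhealthy ports out of a tw_cli port table. It is not the check that raised the alert in this story; failed_ports is a hypothetical helper, and the way tw_cli is invoked in the usage comment is an assumption that may need adjusting for your controller and CLI version.

```shell
# Reads a tw_cli port table on stdin and prints "<port> <status>"
# for every port whose status is anything other than OK.
failed_ports() {
    awk '$1 ~ /^p[0-9]+$/ && $2 != "OK" { print $1, $2 }'
}

# Example usage (assumed invocation, adjust the controller ID as needed):
#   tw_cli /c0 show | failed_ports
```

A Nagios plugin built around this would exit with status 2 (CRITICAL) whenever the output is non-empty and 0 otherwise. On the array above, it would have listed p0, p1 and p2 long before the fourth drive had a chance to follow them.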
