By ryan on October 27, 2010

Those of us in the systems world who work with NetApp products in a Microsoft-centric environment are probably familiar with the old-fashioned way of scripting against NetApp filers:

1. Set up an SSH key pair

2. Use plink in your script to run commands against the filer

3. To get information from the filer, do some crazy text manipulation in PowerShell.

For example, in the past, if you wanted to do anything in a script, your command for the filer would have to be prefaced with something like this:
plink <filer> -i filer.ppk -l root

This only establishes a connection to the filer with root permissions, and the only way to get data back is as console text.

With the PowerShell toolkit for Data ONTAP, we can now make things easier, and get the object-oriented command line experience that PowerShell provides:

Connect-NaController <filername> - This establishes a connection to the filer, so any subsequent commands run against that filer in the context of the logged-on user. This way, no files that grant root access are sitting around anywhere.

Once we've connected to the controller, let's say we want to store all of the volumes with zero free space in a variable:

$vols = Get-NaVol | Where-Object { $_.SizeAvailable -eq 0 }

Now we not only have a list, but it's manipulable as a variable. If we wanted to add 10GB to each volume to allow snapshots to take place, we'd just use a foreach loop as follows:
foreach ($vol in $vols)
{
    Set-NaVolSize $vol -NewSize +10g
}

And this is just the tip of the iceberg. I'll be posting more about different functionality with the Toolkit in the future. In the meantime, if you have NetApp equipment in your environment, I'd advise downloading and using the toolkit.

By ryan on June 10, 2010
There's been a lot of buzz around the ability of Windows Server 2008 R2 to use Cluster Shared Volumes (CSVs). CSVs let you use a single LUN for multiple VMs in clustered Hyper-V deployments. They grant better space utilization and easier management, and the file-lock approach (as opposed to the LUN lock) allows for some amazing capabilities. In most situations, this is ideal.

There are some situations, however, where CSVs don't make sense. What if you're using the advanced capabilities of your storage devices to mirror your data to another site? In the CSV scenario, DR is all or nothing. But if you're using a hot/hot site recovery design, you may just want to use the multi-site capacity to handle hard failures of individual machines. Or you may want to move a department and its servers from one site to another. In these cases, individual LUNs can provide that functionality. While I realize that most environments don't need these capabilities, they can be very attractive to more dynamic companies that want a fast-up DR scenario.
By ryan on May 04, 2010
So, let's say you have two sites with Hyper-V host clusters using NetApp devices for storage. NetApp includes a feature called SnapMirror that lets you mirror volumes between two or more NetApp devices. In this scenario, you have your Hyper-V VM LUNs mirrored, but how do you bring them up again?

Enter Microsoft SCVMM (System Center Virtual Machine Manager). SCVMM is a handy server app that can manage Hyper-V, Virtual Server, and VMware servers. In addition to management, SCVMM stores configuration information about each VM in its database. That sounds great to me! So it would seem that to bring up my DR site, all I have to do is pull the config from SCVMM and apply it to the existing VHDs, right? Unfortunately, this functionality is not natively included in SCVMM. There are solutions and scripts out there that add many layers of complexity to accomplish what should be the simple task of creating a new VM from configuration information and a VHD that already exist.

The solution I found is to take advantage of a new feature in SCVMM R2 called Rapid Provisioning. Rapid Provisioning was included to do almost what we want to do here. It's meant to build a new VM using a template and an existing sysprepped image, and yes folks, it checks to see if the VHD has been properly sysprepped.

Back to the drawing board: I took a close look at Rapid Provisioning, as well as the creation of new templates. The result is a PowerShell script that generates a new template using configuration information obtained from the SCVMM database. After the template is generated, the script then attempts to rapid provision it using the VHD that's already there (as a result of the SnapMirror). The outcome is an error state in SCVMM, and a VM that's up and running properly on the DR cluster. The next step is to figure out how to clear the error state.
By ryan on February 23, 2010

If you’re using the email router for Microsoft CRM 4.0, chances are you’ve run into some issues. The configuration page seems simple enough: three tabs and some buttons. For some of you it will be simple, but for those of us with environments that are in flux, or nonstandard, things can get pretty hairy very quickly.

For example, say you have a multi-domain and multi-forest environment. You want your CRM deployment to handle support emails and workflows. In this case, you may have email recipients that are outside of your Exchange organization. This brings some interesting caveats to the table:
1. Exchange 2007 doesn’t like to act as an anonymous email relay (sure, there are ways around that, but we try to do things right). This means that in order to follow best practices and route email outside your organization, you must run your outgoing profile with Windows authentication. So if you want your email router to send emails as another user, you have to create an account to run the outgoing profile under that has send-as permissions to every mailbox that may be a sender (I’ve seen it happen). This tends to be frowned upon as well in most environments.
2. If you don’t have to send emails outside your organization, you can set a receive connector on your hub transport server to allow anonymous email from the server, but that’s something that most of us want to avoid.
So if your email needs to go outside the organization, the sender must be a service account (again there are other ways, but we don’t talk about them. Kind of like that one crazy uncle that always shows up to family occasions despite nobody telling him they’re occurring).

Ok, so our scenario is now that email is flowing to and from a service account mailbox that the CRM email router has access to. If something goes wrong, DO NOT try to fix it by messing with your profiles. For example, if you see this message:
The E-mail router service cannot access system state file Microsoft.Crm.Tools.EmailAgent.SystemState.xml. The file may be missing or may not be accessible
You need to delete the XML file and restart the router. The CRM email router will automatically recreate the file.
Additionally, it’s easy to eliminate Exchange as the culprit by connecting to it with telnet and issuing ESMTP commands by hand.
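For reference, the telnet check amounts to a short ESMTP conversation on port 25. The PowerShell sketch below only builds the lines you would type into that session. The function name, host name, and addresses are placeholders made up for illustration, not values from a real environment.

```powershell
# Hypothetical helper: returns the ESMTP lines to type into a telnet session
# opened against the Exchange server on port 25.
function Get-EsmtpTestLines {
    param(
        [string]$ClientName = 'crmrouter.example.com',  # placeholder host name
        [string]$From = 'crmservice@example.com',       # placeholder service account mailbox
        [string]$To = 'someone@example.net'             # placeholder external recipient
    )

    # Standard SMTP verbs: greet, name the sender, name the recipient, hang up.
    return @(
        "EHLO $ClientName"
        "MAIL FROM:<$From>"
        "RCPT TO:<$To>"
        'QUIT'
    )
}

Get-EsmtpTestLines
```

If the server answers the RCPT TO line with a 250 reply, Exchange is willing to accept the message and the problem most likely sits with the router. A refusal at that step, such as a 550 relay error, points back at the Exchange side.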

By ryan on January 27, 2010

Actually, SnapDrive is a wonderful tool for provisioning LUNs on NetApp storage devices. But sometimes when you’re clustering with Windows Server 2008, you’ll get an error when you try to provision a disk that says “The RPC server is unavailable”. The guidance available states that you must disable the firewall on the host to resolve this issue, but it goes further than that. SnapDrive is a cluster-aware application, so when a disk is provisioned for a cluster, RPC calls are made to the filer and to the other nodes in the cluster. This means that the firewall settings on ALL nodes in the cluster must match. Remember, though, that there are three network profiles for the Windows Firewall in Server 2008. Additionally, while Network Location Awareness is a step up from just using the connection DNS suffix (as was done in Server 2003), you can end up with two nodes deciding that a connection to the same network falls into different profiles. So you either need to enable an exception for the SWsvc.exe program in your SnapDrive install directory for ALL profiles, or disable the firewall for ALL profiles on ALL of your nodes.
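As a rough sketch of the exception route, the PowerShell below builds a netsh advfirewall command that allows inbound traffic to SWsvc.exe on every profile. The function name, rule name, and install path are assumptions, so substitute your own. The function only returns the command text, so you can review it before running it from an elevated prompt on each node.

```powershell
# Hypothetical helper: builds (but does not run) a firewall rule command
# for the SnapDrive service executable.
function Get-SnapDriveFirewallCommand {
    param(
        # Assumed install location; check where SnapDrive actually lives on your nodes.
        [string]$ProgramPath = 'C:\Program Files\NetApp\SnapDrive\SWsvc.exe',
        # Assumed rule name; any descriptive label will do.
        [string]$RuleName = 'SnapDrive SWsvc'
    )

    # profile=any applies the rule to the domain, private, and public profiles,
    # so it no longer matters which profile each node picks for the network.
    return "netsh advfirewall firewall add rule name=`"$RuleName`" dir=in action=allow program=`"$ProgramPath`" profile=any enable=yes"
}

Get-SnapDriveFirewallCommand
```

Running the same generated command on every node keeps the firewall settings identical across the cluster, which is the point of the exercise.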

By ryan on November 20, 2009
Quick scenario for you: You have forest A and forest B. Your SCOM implementation is in forest A and there’s a forest trust between forest A and forest B. But there’s a caveat: the authentication between forest A and forest B is selective. In this case, the guidance for allowing access states that you need to give the SCOM server the Allowed to Authenticate permission on the client’s AD object and vice versa. What you also need is to grant the SCOM server the same permission on the DCs in forest B.
By ryan on November 20, 2009
There’s a saying among Exchange admins: "If something really weird is happening, it’s DNS". Well, it seems that with SCOM, "if something really weird is happening, it’s your management packs". If new agents aren’t registering correctly, it’s probably from installing a new management pack without first upgrading the existing management packs to their latest level. So the guidance I’m giving on this is that before you add a new MP, make sure all the ones you have in place are at their latest version and fully tested.
By ryan on November 20, 2009

There are a lot of articles out there about Hyper-V host clustering. But there isn’t much about the associated caveats. The main issue I ran into was with creating new physical disk resources. When you create a cluster and add nodes to it, you may end up adding a disk resource from a machine that does not own the cluster group. Yes, contrary to what it looks like in the GUI, there are still cluster groups, and much of the underlying clustering administration is unchanged from Windows Server 2003. If you’re in a situation where you need to add physical disk resources and the GUI won’t see them, here’s what you do:

1. On one of the nodes that is a possible owner for the resource, open two command prompts as an administrator.

2. Log on to the LUN via the iSCSI Initiator.

3. In Disk Management, bring the disk online, initialize the disk, and create a simple volume (note: you can initialize the disk as MBR or GPT, but the disk MUST be a basic disk).

4. In the first command prompt, type diskpart. Then type select disk <appropriate disk number here>, followed by detail disk. You should see something like this:
MSFT Virtual HD SCSI Disk Device
Disk ID: 2E3EA1FE
Type : iSCSI
Bus : 0
Target : 10
LUN ID : 0
Read-only : No
Boot Disk : No
Pagefile Disk : No
Hibernation File Disk : No
Crashdump Disk : No

5. In the second command prompt, type Cluster res "<resource name here>" /create /group:"Available Storage" /type:"Physical Disk" This will create the resource.

6. In the second command prompt, type Cluster res "<resource name here>" /priv DiskSignature="0x<DiskID from first command prompt>" The 0x before the disk ID is important, and you won’t be able to bring the disk online if you leave it out.

7. In the second command prompt, type Cluster res "<resource name here>" /on This will bring your new disk resource online.

When doing this, make sure that the Available Storage resource group is owned by a machine with access to the LUN you’re trying to bring online. It seems that Microsoft wants you to grant access to every node in your cluster, but if you’re like me and are a bit nitpicky about which machines have access to which LUNs, you’ll want to follow these steps.
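Steps 4 through 7 lend themselves to a small script. What follows is a minimal PowerShell sketch, not a tested tool: it pulls the Disk ID out of saved detail disk output and prints the three cluster commands from the steps above. The function names and the resource name VM1 Disk are made up for illustration. The sketch only handles the 8-digit hexadecimal Disk ID shown in the sample output, and it builds the command strings rather than running them.

```powershell
# Hypothetical helper: extract the Disk ID from the text printed by
# diskpart's "detail disk" command.
function Get-DiskIdFromDetail {
    param([Parameter(Mandatory = $true)][string]$DetailText)

    # Look for a line such as "Disk ID: 2E3EA1FE" (8 hex digits).
    if ($DetailText -match '(?m)^\s*Disk ID\s*:\s*([0-9A-Fa-f]{8})\s*$') {
        return $Matches[1].ToUpper()
    }
    throw 'No 8-digit hexadecimal Disk ID found in the supplied text.'
}

# Hypothetical helper: build the commands from steps 5 to 7 for one resource.
function Get-ClusterDiskCommands {
    param(
        [Parameter(Mandatory = $true)][string]$ResourceName,
        [Parameter(Mandatory = $true)][string]$DiskId
    )

    # Note the 0x prefix on the signature, as called out in step 6.
    return @(
        "cluster res `"$ResourceName`" /create /group:`"Available Storage`" /type:`"Physical Disk`""
        "cluster res `"$ResourceName`" /priv DiskSignature=`"0x$DiskId`""
        "cluster res `"$ResourceName`" /on"
    )
}

# Usage with a trimmed copy of the sample output from step 4.
$sample = @"
MSFT Virtual HD SCSI Disk Device
Disk ID: 2E3EA1FE
Type : iSCSI
"@
$diskId = Get-DiskIdFromDetail -DetailText $sample
Get-ClusterDiskCommands -ResourceName 'VM1 Disk' -DiskId $diskId
```

Printing the commands instead of executing them leaves room to check the disk number and resource name before anything touches the cluster.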

By ryan on November 20, 2009

As you may or may not know, I love clustering. I’ve been working with MSCS for a very long time, and I look at it as a way to make my daily life easier. An admin with a cluster can take hardware down in the middle of the day, and can patch at any time knowing that if there’s a problem, all it takes is failing over to an unpatched node. That and it makes for some really cool toys.

Up to the present: I decided to check out MSCS (now known as failover clustering) on Windows Server 2008. I’m a big fan of building your own iSCSI devices, and I like Openfiler and FreeNAS. Both provide a large list of features, and I find the capability of Openfiler to run a two-node failover cluster of its own (thus providing redundant storage for your cluster) to be a wonderful thing.

Back to a blast from the past, clustering on Windows Server 2003. Here’s another caveat for everyone out there: We’ll often build our nodes with two disks in them and mirror those disks. Be careful about what method you use, because some onboard RAID solutions will use BUS0 for their volumes. If this is the case with your RAID controller, switch to software mirroring. The reason is that the iSCSI initiator also latches on to BUS0, putting the shared disks on the same bus as your boot volume. This will cause cluster creation to pop up a warning about being unable to find a quorum device (BEWARE: no warning will pop up if you use the advanced creation option, you just won’t be able to add any shared disks later).

By ryan on November 20, 2009

Just a few things to keep in mind when dealing with clusters using MSCS:

STORAGE:

Arguably the most important component of the cluster is the shared storage device. If there’s any single point where you want to spend money to make sure your data is safe, this is it. And yet, I have seen many companies skimp on the storage by going with cheap or non-redundant devices. The bottom line is, if you’re not going to be able to guarantee the uptime of your storage, you’re defeating the purpose of clustering in the first place. In any case, if you’re in a bit of a budget crunch, there are numerous less-expensive (notice the lack of the word cheap) solutions. iSCSI devices can give you great performance at a much lower cost than traditional cluster storage devices. For my part, that’s what I use exclusively.

ACTIVE/ACTIVE vs ACTIVE/PASSIVE

A simple online search will provide you with thousands of opinions on this topic. What I’ve found in my travels is that it all comes down to the situation. Most of the time, going with the recommended Active/Passive approach is best. But there will always be exceptions. For example, I’ve seen a two-node Exchange cluster that was running into client-side performance issues. The problem wasn’t with the hardware, but the switching infrastructure. The client-facing interfaces on these boxes were on a 10 Mbps switch (with a 100 Mbps uplink), and there was no budget for a switch upgrade. So a plan was put in to make room in the budget for a switch infrastructure upgrade in the next fiscal year. In the meantime, the cluster was changed to be Active/Active with the understanding that performance would suffer in the event of a node failure. Keep in mind, though, that this is the exception, not the rule. I put this in to illustrate a point to the “never use Active/Active” crowd.

OTHER THOUGHTS

If you are planning a cluster that requires a DTC, I’ve always found it easier to give the DTC its own disk, IP, and Name resources. To me it seems a tidier way to go. Also, if you have the DTC tied to a group that fails for some other reason and there’s a second group relying on it, then you’ve just brought on an extra failure.

If you’re planning on clustering, most people (and MS) recommend having a cluster for SQL, a cluster for Exchange, and so on. Unfortunately, out here in the real world, many of us have found reasons that we are required to put multiple services on a single cluster (note: single cluster, not single node, but that happens too). When this happens, spend time planning your failover scenarios. More complex is not necessarily worse, just more complex.

For something that seems to be an important server technology, MS doesn’t seem to have much available for those wishing to learn about clustering. Considering that I’ve heard from CTECs that the barrier to offering MSCS classes is the cost of the hardware for the class, and that MS has Windows Storage Server, I’d like to suggest that they combine a class on the two (especially since their classes are run with VMs these days anyway). Given 3 VMs, you could cover clustering in depth, and Windows Storage Server too (something I’d love to get my hands on, since I’ve been having to build Linux iSCSI targets in VMs for my testing).

Finally, if you notice a strange behavior in your cluster, where a node fails to take ownership of a resource, try opening regedit, going to HKEY_LOCAL_MACHINE\Cluster\Resources and finding the GUID of the resource. From there, give the Network Service account Full Control access to the key.
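If you would rather script that last change than click through regedit, something along these lines may work. Treat it as an untested sketch under assumptions: the function names are invented for illustration, the resource GUID has to come from your own cluster, and the inheritance settings are a guess at what regedit would apply. It needs an elevated PowerShell session on the affected node.

```powershell
# Hypothetical helper: build the registry path for a cluster resource key.
function Get-ClusterResourceKeyPath {
    param([Parameter(Mandatory = $true)][string]$ResourceGuid)

    # The cast rejects anything that is not a well-formed GUID.
    $guid = [guid]$ResourceGuid
    return "HKLM:\Cluster\Resources\$guid"
}

# Hypothetical helper: grant Network Service full control on that key.
function Grant-NetworkServiceOnClusterResource {
    param([Parameter(Mandatory = $true)][string]$ResourceGuid)

    $path = Get-ClusterResourceKeyPath -ResourceGuid $ResourceGuid

    # Read the current ACL, add one allow rule, and write the ACL back.
    $acl = Get-Acl -Path $path
    $rule = New-Object System.Security.AccessControl.RegistryAccessRule('NT AUTHORITY\NETWORK SERVICE', 'FullControl', 'ContainerInherit', 'None', 'Allow')
    $acl.AddAccessRule($rule)
    Set-Acl -Path $path -AclObject $acl
}

# Example call (the GUID is a placeholder, not a real resource):
# Grant-NetworkServiceOnClusterResource -ResourceGuid '01234567-89ab-cdef-0123-456789abcdef'
```

Exporting the key from regedit before changing its permissions gives you something to restore from if the change does not help.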