Friday, September 27, 2013

DFSR canary script

Param(
    [parameter(Mandatory=$true)]
    [alias("share")]
    $shareName,
    [alias("timeout")]
    $copyTimeout)

$emailTo = "status@example.com" # Recipient of notification e-mails
$emailFrom = "status@example.com" # E-mail address notifications will come from
$smtpServer = "smtp.example.com" # Your SMTP server
$testDir = "DFSTest" # This is the directory that the test file will be created in
$successMail = 1 # Set to 0 for no e-mail on 100% success or 1 to get report on success or fail

# List of servers to check for files

$serverList = @(
"fileserver1", 
"fileserver2", 
"fileserver3", 
"fileserver4")

if($copyTimeout -eq $null) {
    $copyTimeout = 300 # Set the default timeout if one is not specified on the command line
}

$repFail = 0

$msg = new-object Net.Mail.MailMessage
$smtp = new-object Net.Mail.SmtpClient($smtpServer)
$msg.From = $emailFrom
$msg.ReplyTo = $emailFrom
$msg.To.Add($emailTo)
$msg.Body = "Copy status messages:`n" # Apparently Powershell uses backticks for escaping characters

# Copy file to remote servers and verify replication

ForEach($sourceServer in $serverList) {
    $dateTime = Get-Date
    $curDate = $dateTime.Month, $dateTime.Day, $dateTime.Year
    $curTime = $dateTime.Hour.ToString("00"), $dateTime.Minute.ToString("00")

    $strFullPath = "\\$sourceServer\$shareName\$testDir\"
    $strFileName = "DFSTest_" + [string]::join('-', $curDate) + "_" + [string]::join('',$curTime) + "_$sourceServer.txt"

    $msg.Body = $msg.Body + "`nCopying $strFileName from $sourceServer`n"
    # Create a new text file on the current server in the list
    New-Item $strFullPath$strFileName -type file | Out-Null
    "Test data from $sourceServer" >> $strFullPath$strFileName # Write some data to the text file

    ForEach($server in $serverList) {
        $timer = 0
        $msg.Body = $msg.Body + "Checking $server for file..."
        $testFileName = "\\$server\$shareName\$testDir\$strFileName"
        Do {
            If(Test-Path "$testFileName") {
                $msg.Body = $msg.Body + "SUCCESS: $testFileName exists ($timer seconds)`n"
                break # If the file is there break out of the loop
            }
            $timer = $timer + 1
            Start-Sleep -s 1 # Time is in seconds
        } While($timer -ne $copyTimeout) # Had to use this method because I couldn't find a way to test for not exist
        if($timer -eq $copyTimeout) {
            $msg.Body = $msg.Body + "FAIL: Copy to $shareName on $server did not occur within $copyTimeout seconds`n"
            $repFail = $repFail + 1
        }
    }

    # Delete the file from the server and verify it disappears

    Remove-Item $strFullPath$strFileName | Out-Null

    ForEach($server in $serverList) {
        $timer = 0
        $msg.Body = $msg.Body + "Checking $server for file removal..."
        $testFileName = "\\$server\$shareName\$testDir\$strFileName"
        While (Test-Path "$testFileName") {
            $timer = $timer + 1
            Start-Sleep -s 1 # Time is in seconds
            if($timer -eq $copyTimeout) {
                $msg.Body = $msg.Body + "FAIL: Delete from $shareName on $server did not occur within $copyTimeout seconds`n"
                $repFail = $repFail + 1
                break
            }
        }
        if($timer -ne $copyTimeout) {
            $msg.Body = $msg.Body + "SUCCESS: $testFileName deleted ($timer seconds)`n"
        }
    }
}

if($repFail -ne 0) {
    if($repFail -gt 1) {
        $processes = "processes"
    } else {
        $processes = "process" # Singular form when exactly one check failed
    }
    $msg.Subject = "WARNING: DFS replication to $shareName failed $repFail $processes"
} else {
    $msg.Subject = "DFSR replication successful for $shareName ($copyTimeout second threshold)"
}

if ($successMail -eq 1 -or $repFail -ne 0) {
    $smtp.Send($msg)
}

DFSR is great until it isn't

If you follow DFSR's best practices you can have a solid replication infrastructure. The list below covers common mistakes to avoid.

  • Too small of a Staging Area Quota
Are you seeing a lot of event IDs 4202 and 4204? If so, your staging area is not sized correctly. The downside of an improperly sized staging area is that replication performance suffers, because the service has to spend time cleaning up the staging area instead of replicating files.
DFSR servers are more efficient with a full staging area for at least these two reasons:
  1. It is much more efficient to stage a file once and send it to all downstream partners than to stage, replicate, and purge the file separately for each downstream partner.
  2. If at least one member is running Enterprise Edition, the servers can take advantage of Cross File RDC.
An improperly sized staging area can also cause a replication “loop”. This happens when a file is replicated to the downstream server and is present in the staging area, but the staging area cleanup process purges it before it can be installed into the Replicated Folder. Because the server never installed the file, the purged file is replicated to it again. This process keeps repeating until the file gets installed.
  • Improper or Untested Pre-seeding procedure
Pre-seeding is the act of copying the data that will be replicated to a new replica member before that member is added to the Replicated Folder, with the goal of reducing the time it takes to complete initial replication. Most failed pre-seeding cases I have worked on failed in one of three ways:
  1. ACL mismatch between source and target.
  2. Changes were made to the files after they were copied to the new member
  3. No testing was done to verify the pre-seeding process they were using worked as expected.
Robocopy is the best way to pre-seed. For example:
    robocopy s:\home \\fileserver2\home /e /copyall /mt:32 /r:6 /xd DfsrPrivate /mir /b /log:c:\temp\robocopy-home-commit.log
  • High backlogs for extended periods of time
Besides the fact that having high backlogs for extended periods of time means your data is out of sync, it can lead to unwanted conflict resolution where a file with older content wins. The most common way I have seen this condition hit is when rolling out new RFs. Instead of doing a phased rollout, some admins will add 20 new RFs from 20 different branch offices at once, overloading their hub server. Stagger your rollouts so that initial replication is finished in a reasonable amount of time.
To check the backlog:
    dfsrdiag backlog /receivingmember:fileserver1 /sendingmember:fileserver2 /RGName:home /RFName:"home"
I would check replication in both directions (swap the sending and receiving members), because I have seen it get stuck or backed up going the opposite way.
  • DFSR used as a backup solution
Believe it or not, some admins run DFSR without backing up the replicated content offline. DFSR was not designed as a backup solution. One of DFSR’s design goals is to be part of an enterprise backup strategy, in that it gets your geographically distributed data to a centralized site for backup, restore and archival. Multiple members do offer protection from server failure; however, this does not protect you from accidental deletions. To be fully protected you must back up your data.
  • One way Replication – Using it, or Fixing One way replication the wrong way
In an attempt to prevent unwanted updates from occurring on servers where they know the data will never be changed (or they don’t want changes made there), many customers have configured one-way replication by removing outbound connections from replica members. One-way replication is not supported on any version of DFSR before Windows Server 2008 R2. On Windows Server 2008 R2, one-way replication is supported provided you configure Read-Only replicated folders. Using Read-Only DFSR members allows you to accomplish the goal of one-way replication, which is preventing unwanted changes to replicated content. If you must use one-way replication with DFSR, then you must use Windows Server 2008 R2 and mark the members as read-only where changes to content should not occur.
  • Hub Server – Single Point of Failure or Overworked Hub Servers
I have seen many deployments with just one hub server. When that hub server fails, the entire deployment is at risk. If you’re using Windows Server 2003 or 2008 you should have at least 2 hub servers, so that if one fails the other can handle the load while the failed one is repaired, with little to no impact to the end users. Starting with Windows Server 2008 R2 you can deploy DFSR on a Windows Failover Cluster, which gives you high availability with half of the storage requirement.
Other times admins will have too many branch office servers replicating with a single hub server. This can lead to delays in replication. Knowing how many branch office servers a single hub server can service is a matter of monitoring your backlogs. There is no magic formula as each environment is unique and there are many variables.
  • Too many Replicated Folders in a single Jet Database
DFSR maintains one Jet database per volume. As a result, placing all of your RFs on the same volume puts them all in the same Jet database. If that Jet database has a problem that requires repair or recovery, all of the RFs on that drive are affected. It is better to spread the RFs out over as many drives as possible to provide maximum uptime for the data.
  • Bargain Basement iSCSI deployments
I have seen more than one DFSR iSCSI deployment where the cheapest hardware available was used. Usually if you are using DFSR it is for some mission critical purpose such as data redundancy, backup consolidation, pushing out applications and OS upgrades on a schedule. Depending on low-end hardware with little to no vendor support is not a good plan. If the data is important to your business then the hardware that runs the OS and replication engine is important to your business.
  • Did not maintain the DFSR service at the current patch level
DFSR is actively maintained by Microsoft and has updates released for it as needed. You should update DFSR when a new release is available during your normal patch cycle. Make sure your servers are up to date.
Proactively patching the DFSR servers is advisable even if everything is running normally, as it will prevent you from being affected by a known issue.
  • Did not maintain NIC Drivers
DFSR will only work as well as the network you provide for it. Running drivers that are 5 years old is not smart. Yes, I have talked with more than a few customers with NIC drivers that old whose DFSR replication issue was resolved by updating the NIC driver.
  • Did not monitor DFSR
Despite the fact that the data DFSR is moving around is usually mission critical, many admins have no idea what DFSR is doing until they discover a problem. Savvy admins have created their own backlog scripts to monitor their servers, but a lot of customers just “hoped for the best”. The DFSR Management Pack has been out for almost a year now (and other versions for much longer). Deploy it and use it so you can detect problems and respond before they become a nightmare. If you can’t use the DFSR Ops Management Pack, at a minimum write a script to track your backlogs on a daily basis so that you know whether DFSR is replicating files.
  • Making changes to disk storage without first backing up the data
    If you must replace or add hard drive space to your DFSR server, it is critical that you have a current backup of the data in case something goes wrong. Any number of things can go wrong, the most common being conflict events created due to accidental changes to a parent folder, or unintentional deletion of a parent folder that is replicated to all partners. You must back up your data before starting the changes and maintain the backup until the project is completed.
  • Stopping the DFSR service to temporarily prevent replication
    Sometimes there is a need to temporarily stop replication. The proper way to accomplish this is to set the schedule to no replication for the Replication Group in question. The DFSR service must be running to be able to read updates to the USN Journal. Do not stop the DFSR service as doing so may cause a journal wrap to occur. DFSR will recover from the journal wrap but in large deployments this will take a long time and replication will not occur or will happen very slowly during the journal wrap recovery. You may also see very high backlogs until the journal wrap recovery completes.

RHEL is great with updates

I use RHEL and CentOS at work, and they're awesome except when they're missing the latest and greatest packages you need.

Here are some third-party repositories that fill the gap:

This is an update of DAG's old stuff and RPMforge's re-branding
http://repoforge.org/

Remi usually has the latest PHP updates, critical for web servers
http://rpms.famillecollet.com/

Third-party packages from the Fedora Project which will work with RHEL and CentOS
http://fedoraproject.org/wiki/EPEL

Here is a quick tutorial on how to add a repo to yum

download the packages
#wget http://dl.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
#wget http://rpms.famillecollet.com/enterprise/remi-release-6.rpm

install them
#sudo rpm -Uvh remi-release-6*.rpm epel-release-6*.rpm

Personally, I never enable third-party repos permanently, so their updates cannot conflict with my standard packages. Instead, I enable the repo only for the package I want to install, as in the command that follows:

#yum --enablerepo=remi install php

If you would rather enable the repo permanently, edit its file:
#sudo vim /etc/yum.repos.d/remi.repo

Edit the [remi] portion of the file so that the enabled option is set to 1. This will enable the remi repository.

[remi]
name=Les RPM de remi pour Enterprise Linux $releasever - $basearch
#baseurl=http://rpms.famillecollet.com/enterprise/$releasever/remi/$basearch/
mirrorlist=http://rpms.famillecollet.com/enterprise/$releasever/remi/mirror
enabled=1
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-remi
failovermethod=priority
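
If you would rather not edit the file by hand, the same change can be scripted. The following is only a sketch: enable_repo_section is a made-up helper name (it is not part of yum), and it assumes the section header sits on a line by itself, such as [remi]. It rewrites the enabled line inside that one section and leaves every other section alone.

```shell
#!/bin/sh
# Sketch only: enable_repo_section is a hypothetical helper, not a yum command.
# It sets enabled=1 inside a single [section] of a yum .repo file.
enable_repo_section() {
    repo_file=$1   # path to the .repo file, e.g. /etc/yum.repos.d/remi.repo
    section=$2     # section name without brackets, e.g. remi
    tmp_file=$(mktemp) || return 1
    awk -v want="[$section]" '
        /^\[/ { in_section = ($0 == want) }   # remember which [section] we are inside
        in_section && /^enabled *=/ { print "enabled=1"; next }
        { print }                             # every other line passes through unchanged
    ' "$repo_file" > "$tmp_file" && cat "$tmp_file" > "$repo_file"
    status=$?
    rm -f "$tmp_file"
    return $status
}
```

Run as root, it would be called as enable_repo_section /etc/yum.repos.d/remi.repo remi. Writing the result back with cat rather than moving the temporary file keeps the original file's ownership and permissions. If the yum-utils package is installed, yum-config-manager --enable remi should achieve the same thing.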


Advance to the next level in the game

Ned Pyle is a Microsoft Stud. I have been following many of his best practices in my day to day tasks. I am placing a copy of his great post here in case it ever goes away.

http://blogs.technet.com/b/askds/archive/2011/09/02/accelerating-your-it-career.aspx

Your career isn’t win or lose anymore, it is win or die. The days of guaranteed work, pensions, and sticking with one company for fifty years are gone. Success has returned to something Cro-Magnon man would recognize: if you’re good at what you do, you get to eat.
I recently spoke to university graduates about their future as new Microsoft engineers. For the first time, that meant organizing my beliefs. It distills four simple pillars: discipline, technical powerhouse, communication, and legacy. In the tradition of Eric Brechner and in honor of Labor Day, I’d like to share my philosophy.

Discipline

Learn constantly, not just when life is forcing you. Read every trustworthy article you can get your hands on, before you need to know it; the time to learn AD replication is not when its failure blocks your schema upgrade. Understanding architecture is the key to deploying and troubleshooting any complex system. If you get nothing else from this post, remember that statement - it can alter your life. For Directory Services, start here.
Don’t be good at one thing - be amazing at a few things, and good at the rest. We all know someone who's the expert on X. He guards X jealously, making sure he is "indispensable.” Notice how he’s always in a lousy mood: he's not allowing anyone to relieve his boredom and he lives in fear that if anyone does, he'll be replaced. Learn several components inside and out. When you get jaded, move on to a few more and give someone else a turn. You'll still be the expert while they learn and if things get gnarly, you can still save the day. Over time, you become remarkable in many areas. Keep your skills up on the rest so that you can pinch hit when needed. Surround yourself with smart people and absorb their knowledge.
Admit your mistakes. The only thing worse than making a mistake is trying to cover it up. Eventually, everyone is caught or falls under a career-limiting cloud of suspicion. Now colleagues will remember times they trusted you, and won’t make that "mistake" again. Plead guilty and start serving community service, where you help the team fix the glitch.
Get a grip. It's never as bad as you think. Losing your composure costs you concentration and brainpower. Remaining emotional and depressed makes you a poor engineer, and a lousy person to be around to boot. Learn how to relax so you can get back to business.
Never surrender. Your career path is a 45-degree angle leading up to infinity, not an arc - arcs come back down! Keep learning, keep practicing, keep refreshing, keep growing. Keep a journal of "I don't know" topics, and then revisit it weekly to see what you've learned. IT makes this easy: it's the most dynamic industry ever created. In my experience, the Peter Principle is usually a self-induced condition and not the true limit of the individual.

Technical Powerhouse

Figure out what makes you remember long term. There is a heck-of-a-lot to know when dealing with complex distributed systems - you can't always stop to look things up. Find a recall technique that works for you and practice it religiously. You’re not cramming for a test; you’re building a library in your brain to serve you for fifty years. No amount of learning will help if you can’t put it to good use.
Be able to repro anything. When I first came to Microsoft, people had fifteen computers at their desk. Thanks to free virtualization, that nonsense is over and you can run as many test environments as you need, all on one PC. "Oh, but Ned, those virtual machines will cost a fortune!" Gimme a break, it’s walking-around money. A lab pays for itself a thousand times every year, thanks to the rewards of your knowledge and time. It's the best investment you can make. Study and memory are powered by experience.
Know your dependencies. What does the File Replication Service need to work? DNS, LDAP, Kerberos, RPC. What about AD replication? DNS, LDAP, Kerberos, RPC. Interactive user logon? DNS, LDAP, Kerberos, RPC. Windows developers tend to stick with trusted protocols. If you learn the common building blocks of one component, you become good at many other components. That means you can troubleshoot, design, teach, and recognize risks to them all.
Understand network captures. It's hard to find an IT system talking only to itself. Notepad, maybe (until you save a file to a network share). There are many free network capture tools out there, and they all have their place. Network analysis is often the only way to know how something works between computers, especially when logging and error messages stink - and they usually do. I'd estimate that network analysis solves a quarter of cases worked in my group. Learn by exploring controlled, working scenarios; the differences become simple to spot in failure captures. Your lab is the key.
Learn at least one scripting language. PowerShell, CMD, VBS, KiXtart, Perl, Python, WinBatch, etc. – any is fine. Show me an IT pro who cannot script and I'll show you one that grinds too many hours and doesn't get the bonus. Besides making your life easier, scripting may save your business someday and therefore, your career. An introductory programming course often helps, as they teach fundamental computer science and logic that applies to all languages. This also makes dependencies easier to grasp.
Learn how to search and more importantly, how to judge the results. You can't know everything, and that means looking for help. Most people on the Internet are spewing uninformed nonsense, and you must figure out how to filter them. A vendor is probably trustworthy, but only when talking about their own product. TechNet and KB trump random blogs. Stay skeptical with un-moderated message boards and "enthusiast" websites. Naturally, search results from AskDS are to be trusted implicitly. ;-P

Communication

Learn how to converse. I don’t mean talk, I mean converse. This is the trickiest of all my advice: how to be both interesting and interested. The hermit geek in the boiler room - that guy does not get promotions, bonuses, or interesting projects. He doesn't gel with a team. He can't explain his plans or convince anyone to proceed with them. He can't even fill the dead air of waiting… and IT troubleshooting is a lot of waiting. Introverts don’t get the opportunities of extroverts. If I could learn to suppress my fear of heights, you can learn to chat.
Get comfortable teaching. IT is education. You’re instructing business units in the benefits and behavior of software. You're schooling upper management why they should buy new systems or what you did to fix a broken one. You're coaching your colleagues on network configuration, especially if you don’t want to be stuck maintaining them forever. If you can learn to teach effortlessly and likably, a new aspect to your career opens up. Moreover, there's a tremendous side effect: teaching forces you to learn.
Learn to like an audience. As you rise in IT, the more often you find yourself speaking to larger groups. Over time they become upper management or experienced peers; an intimidating mix. If you let anxiety or poor skills get in the way, your career will stall. Arm yourself with technique and get out in front of people often. It's easier with practice. Do you think Mark Russinovich gets that fat paycheck for his immaculate hair?
Project positive. Confidence is highly contagious. When the bullets are flying, people want to follow the guy with the plan and the grin. Even if deep down he's quivering with fear, it doesn’t show and he charges forward, knowing that everyone is behind him. People want to be alongside him when the general hands out medals. Self-assurance spreads throughout an organization and you'll be rewarded for it your whole career. Often by managers who "just can't put their finger" on why they like you.
Be dominant without domineering. One of the hardest things to teach new employees in Microsoft Support is how to control a conference call. You’re on the phone with a half dozen scared customers, bad ideas are flying everywhere, and managers are interrupting for “status updates”. You can’t be rude; you have to herd the cats gently but decisively. Concentration and firmness are paramount. Not backing down comes with confidence. Steering the useless off to harmless tasks lets you focus (making them think the task is important is the sign of an artist). There's no reason to yell or demand; if you sound decisive and have a plan, everyone will get out of the way. They crave your leadership.

Legacy

Share everything. Remember "the expert?" He's on a desert island but doesn’t signal passing ships. Share what you learn with your colleagues. Start your own internal company knowledgebase then fill it. Have gab sessions, where you go over interesting topics you learned that week. Talk shop at lunch. Find a reason to hang out with other teams. Set up triages where everyone takes turn teaching the IT department. Not only do you grow relationships, you're leading and following; everyone is improving, and the team is stronger. A tight team won't crumble under pressure later, and that's good for you.
Did you ever exist? Invent something. Create documentation, construct training, write scripts, and design new distributed systems. Don’t just consume and maintain - build. When the fifty years have passed, leave some proof that you were on this earth. If a project comes down the pipe, volunteer - then go beyond its vision. If no projects are coming, conceive them yourself and push them through. The world is waiting for you to make your mark.
I used many synonyms in this post, but not once did I say “job.” Jobs end at quitting time. A career is something that wakes you up at midnight with a solution. I can’t guarantee success with these approaches, but they've kept me happy with my IT career for 15 years. I hope they help with yours.

2 is not better than 1 (for snapshots)

In a production environment a VM should never have more than 2 snapshots. The best practice is to keep no snapshots at all, or at most 1 for testing purposes. Delete every snapshot within 24 hours if possible.

  • Use no single snapshot for more than 24-72 hours. Snapshots should not be maintained over long periods of time for application or Virtual Machine version control purposes.
    • This prevents snapshots from growing so large as to cause issues when deleting/committing them to the original virtual machine disks. Take the snapshot, make the changes to the virtual machine, and delete/commit the snapshot as soon as you have verified the proper working state of the virtual machine.
    • Be especially diligent with snapshot use on high-transaction virtual machines such as email and database servers. These snapshots can very quickly grow in size, filling datastore space. Commit snapshots on these virtual machines as soon as you have verified the proper working state of the process you are testing.
  • An excessive number of delta files in a chain (caused by an excessive number of snapshots) or large delta files may cause decreased virtual machine and host performance.

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1025279

VDI virtual machine hardware optimizations

After you have configured your virtual machine and installed the OS, the most important thing to do is install VMware Tools.

After you do that, I would suggest changing your NIC to the latest paravirtualized version. I would do the same for the SCSI disk controller. See the screenshot below.







I already saved it

Remember to make a backup before you change something like a configuration file or a database. Don't assume that the .tar or .zip file contains your latest version. I would name the backup config.conf.mm-dd-yyyy rather than config.conf.good, so the name tells you when the copy was taken.
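
A dated copy is easy to script. This is a minimal sketch, and backup_with_date is a name I made up for illustration; it copies a file next to itself with an mm-dd-yyyy suffix and refuses to overwrite a backup that already exists for today.

```shell
#!/bin/sh
# Sketch only: backup_with_date is a hypothetical helper name.
# Copies FILE to FILE.mm-dd-yyyy and prints the path of the backup.
backup_with_date() {
    src=$1
    stamp=$(date +%m-%d-%Y)      # mm-dd-yyyy, matching the naming suggested above
    dest="$src.$stamp"
    if [ -e "$dest" ]; then
        echo "refusing to overwrite existing backup: $dest" >&2
        return 1
    fi
    cp -p "$src" "$dest" && echo "$dest"   # -p keeps mode and timestamps
}
```

For example, backup_with_date /etc/httpd/conf/httpd.conf (the path is only an illustration) would leave a dated copy beside the original before you start editing.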

Have you tried turning it (selinux) off and on again?

If you are configuring something on RHEL/CentOS and you can't wrap your brain around why it isn't working when you know it should, it's probably SELinux. Disable SELinux temporarily by doing the following:

edit /etc/sysconfig/selinux

change SELINUX=enforcing to SELINUX=disabled

Reboot and see if it's working. If you're on a production box I would enable SELinux again by changing it back to enforcing.
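
The edit can also be scripted so that switching back is one command. The sketch below rests on a few assumptions: set_selinux_mode is a made-up helper name, the file uses the usual SELINUX= line, and the values the file accepts are enforcing, permissive and disabled. It saves a .bak copy first and leaves the SELINUXTYPE= line untouched.

```shell
#!/bin/sh
# Sketch only: set_selinux_mode is a hypothetical helper, not an SELinux tool.
# It rewrites the SELINUX= line of the given config file and keeps a .bak copy.
set_selinux_mode() {
    config_file=$1   # e.g. /etc/sysconfig/selinux
    mode=$2          # enforcing, permissive, or disabled
    case $mode in
        enforcing|permissive|disabled) ;;
        *) echo "invalid mode: $mode" >&2; return 1 ;;
    esac
    cp -p "$config_file" "$config_file.bak" || return 1
    # Only the line starting with SELINUX= matches; SELINUXTYPE= is left alone
    sed "s/^SELINUX=.*/SELINUX=$mode/" "$config_file.bak" > "$config_file"
}
```

As root, set_selinux_mode /etc/sysconfig/selinux disabled followed by a reboot matches the steps above, and set_selinux_mode /etc/sysconfig/selinux enforcing puts it back. For a quick test without a reboot, setenforce 0 switches a running system to permissive mode until the next boot, and getenforce shows the current mode.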