Channel: The System Center Operations Manager Support Team Blog

Microsoft System Center Operations Manager 2012: Easy setup, bundled features - by John Joyner MVP

http://www.techrepublic.com/blog/networking/microsoft-system-center-operations-manager-2012-easy-setup-bundled-features/4596 Takeaway: The public beta release of System Center Operations Manager 2012 is now available. Microsoft continues to add features...(read more)

OpsMgr 2012 Community Evaluation Program (CEP) is now taking applications.

At TechEd 2011, we announced that the OpsMgr 2012 Community Evaluation Program (CEP) is now taking applications. What is a CEP Many of you are likely familiar with Microsoft TAP’s, Technology Adoption Programs, where a small pool of customers...(read more)

Authoring Console looking for Microsoft.SystemCenter.Library MP 6.1.7221.61 or later

From KB 2590414 You are prompted for the latest version of the Microsoft.SystemCenter.Library management pack when you try to edit a new management pack in Operations Manager 2007 R2 When you try to edit a new management pack in the Authoring Console...(read more)

Cumulative Update 5 for OpsMgr 2007 R2 is now available

The SCOM team is very happy to announce the release of Cumulative Update 5 for System Center Operations Manager 2007 R2 . Cumulative Update 5 for Operations Manager 2007 R2 resolves the following issues: Restart of non-Operations Manager services...(read more)

Graham Davies - MVP

http://www.systemcentersolutions.com/blog/ I am a Microsoft System Center Operations Manager MVP and work for AKCSL, a Microsoft Gold Partner in the UK. I’ve been working with Enterprise Management Systems since 1999, when I joined NetIQ to...(read more)

SCOM 2007 R2 and SP1 now support SQL Server 2005 SP4

OM Community, System Center Operations Manager 2007 SP1 and System Center Operations Manager 2007 R2 now support SQL Server 2005 SP4.  Note: We will have the Supported Configuration and a KB article posted in the next few weeks to make this more...(read more)

Troubleshooting Network Discovery in SCOM 2012 by Stefan Koell (MVP)

The following article applies to SCOM 2012 BETA and may or may not apply to RC or RTM release. I’ll try to repro the issue in the upcoming releases to see if the behavior changed and provide updates if necessary. I guess everyone is testing SCOM...(read more)

Application Monitoring Architecture in OpsMgr 2012 Beta

For those who already know me, it has been a couple of weeks since I relocated to the Seattle area and started working as a Program Manager on the Operations Manager Application Monitoring team and this is my first post on this blog. For those who don...(read more)

Looking for an Anti-Virus exclusion list? Here’s your one-stop shop


Security is something that is at the top of everyone’s mind, but what if your A/V software actually causes an issue with some of the software you’re running?  If that’s the case then there’s probably an exclusion you need to make to keep things safe, secure and working smoothly.  Luckily, Microsoft’s own Jeff Patterson and Tony Soper have put together a pretty comprehensive list of all the AV exclusions you might want to configure for Windows Server, including AD, OpsMgr, ConfigMgr, Hyper-V, SQL, WSUS, MED-V, DPM, App-V and much more.  You can check it out on our TechNet Wiki below:

Windows Anti-Virus Exclusion List

J.C. Hornbeck | System Center Knowledge Engineer

App-V Team blog: http://blogs.technet.com/appv/
AVIcode Team blog: http://blogs.technet.com/b/avicode
ConfigMgr Support Team blog: http://blogs.technet.com/configurationmgr/
DPM Team blog: http://blogs.technet.com/dpm/
MED-V Team blog: http://blogs.technet.com/medv/
OOB Support Team blog: http://blogs.technet.com/oob/
Opalis Team blog: http://blogs.technet.com/opalis
Orchestrator Support Team blog: http://blogs.technet.com/b/orchestrator/
OpsMgr Support Team blog: http://blogs.technet.com/operationsmgr/
SCMDM Support Team blog: http://blogs.technet.com/mdm/
SCVMM Team blog: http://blogs.technet.com/scvmm
Server App-V Team blog: http://blogs.technet.com/b/serverappv
Service Manager Team blog: http://blogs.technet.com/b/servicemanager
System Center Essentials Team blog: http://blogs.technet.com/b/systemcenteressentials
WSUS Support Team blog: http://blogs.technet.com/sus/


NOW LIVE: The Microsoft TechNet Gallery


As many of you probably already know, the Script Repository (a special-purpose gallery) has long been an engine of great content and community engagement on TechNet.  Starting last week it was upgraded significantly and launched as the new TechNet Gallery, supporting not just scripts but many other technical resources for Microsoft products including App-V, Exchange, and System Center.

Also of note is that in addition to English, the new gallery is available in French, German, Spanish, Japanese, Simplified Chinese, Traditional Chinese, Brazilian Portuguese, Russian, Italian, Korean, Czech, Polish, and Turkish.  Individuals’ contributions and engagement with the TechNet Gallery are tracked in their TechNet profiles and fully integrated with our reputation system which is also fully localized.

If IT resources like scripts, management packs, utilities, and extensions are important for your success, you’ll definitely want to bookmark this one.


TechNet Gallery

J.C. Hornbeck | System Center Knowledge Engineer



Event ID 4625 is logged every 5 minutes when using the Exchange 2010 Management Pack in OpsMgr 2007


Here’s a heads up on a new SCOM 2007 KB article we published this morning:

Symptoms

When using the Exchange 2010 Management Pack in System Center Operations Manager 2007, you may receive a security audit failure event in the Security event log every 5 minutes. An example of the event is below:

Log Name: Security
Source: Microsoft-Windows-Security-Auditing
Date:
Event ID: 4625

Task Category: Logon
Level: Information
Keywords: Audit Failure
User: N/A
Computer: XXX

Description:
An account failed to log on.

Subject:
Security ID: NULL SID
Account Name: -
Account Domain: -
Logon ID: 0x0

Logon Type: 3

Account For Which Logon Failed:
Security ID: NULL SID
Account Name: Aextest_39076b2bb6ec4
Account Domain: XXXXXX

Failure Information:
Failure Reason: Unknown user name or bad password.
Status: 0xc000006d
Sub Status: 0xc0000064

Process Information:
Caller Process ID: 0x0
Caller Process Name: -

Network Information:
Workstation Name: XXXXXX
Source Network Address: XXXXXX
Source Port: 30956

Detailed Authentication Information:
Logon Process: NtLmSsp
Authentication Package: NTLM
Transited Services: -
Package Name (NTLM only): -
Key Length: 0

Note that the account name will have the format Aextest_<GUID>.

Cause

The actual Exchange mailbox account used is extest_<GUID>. The extra “A” is prepended due to an issue with the Exchange Correlation Engine when Outlook Anywhere is OFF (disabled), which is the default on a new installation of Exchange 2010.

Resolution

Two possible workarounds are below:

1. Enable Outlook Anywhere (see http://technet.microsoft.com/en-us/library/cc179036.aspx).

or

2. Disable every rule that uses the Test-OutlookConnectivity Exchange 2010 PowerShell cmdlet. A list of these rules can be found here: http://technet.microsoft.com/en-us/library/ee758035(EXCHG.140).aspx

More Information

This article applies to System Center Operations Manager 2007 RTM, SP1 and R2.

=====

For the most current version of this article please see the following:

2591305 : Event ID 4625 is logged every 5 minutes when using the Exchange 2010 Management Pack in System Center Operations Manager 2007

J.C. Hornbeck | System Center Knowledge Engineer



System Center Operations Manager 2007 Support for SQL 2005 SP4



Here’s another KB article we published today.  This one talks about an issue and associated fix for OpsMgr 2007 running on SQL 2005 SP4:

=====

Summary

This article discusses the support for Microsoft System Center Operations Manager 2007 R2 that runs on a Microsoft SQL Server 2005 SP4 database.

More Information

Installing SQL Server 2005 SP4 on a server where the Operations Manager Reporting role is installed will fail unless you follow the steps below:

1. Open Internet Information Services (IIS) Manager (not the IIS 6.0 version), found under Administrative Tools on the Start menu. Within IIS Manager, complete the following:

a. Expand local machine connection to see App Pools and Sites.

b. Select Application Pools.

c. Find the app pool created by the Reporting Server installation, which has the Identity column’s value set to the domain account used for the DW Reader account.

d. Select that app pool and right click, selecting “Advanced Settings” from the context menu.

e. Under the “Process Model” section, change the value for “Identity” from the domain account to “NetworkService”.

f. Click “OK” to close the Advanced Settings dialog and save the changes.

g. With that app pool still selected, click “Recycle” under the “Application Pool Tasks” section of the Actions area to the right.

2. Run the SQL Server 2005 SP4 setup – it should now complete successfully.

NOTE: At this point, if the Console were opened, Reporting would fail to load.

3. Within IIS Manager, reverse the previous process:

a. Expand local machine connection to see App Pools and Sites.

b. Select Application Pools.

c. Find the app pool created by the Reporting Server installation, which has the Identity column’s value set to “NetworkService”.

d. Select that app pool and right click, selecting “Advanced Settings” from the context menu.

e. Under the “Process Model” section, change the value for “Identity” from “NetworkService” back to the original domain account.

f. Click “OK” to close the Advanced Settings dialog and save the changes.

g. With that app pool still selected, click “Recycle” under the “Application Pool Tasks” section of the Actions area to the right.

4. Open the Console and Reporting should load successfully.

5. Verify that Reports work as expected.

=====

For the most current version of this article please see the following:

2591380 : System Center Operations Manager 2007 Support for SQL 2005 SP4

J.C. Hornbeck | System Center Knowledge Engineer



Guidance, Tuning and Known Issues for the Exchange 2010 Management Pack for System Center Operations Manager 2007



Summary

This article is intended to give some best practice guidance, along with workarounds to known issues, for the Exchange 2010 Management Pack (MP) running on System Center Operations Manager 2007 (SCOM). Please look through this document before calling for support or posting to the forums, as your issue may be covered below. If you find these issues particularly troublesome, or find additional issues that you want fixed, please call Microsoft Support and raise a Request for Hotfix with the Exchange group.

More Information

The Exchange 2010 Management Pack introduced a correlation engine. The Correlation Engine is a stand-alone Windows service that uses the Operations Manager SDK interface to first retrieve the health model (or instance space) and then process state change events. By maintaining the health model in memory and processing state change events, the Correlation Engine can determine when to raise an alert based on the state of the system.

In response to a problem, several monitors change state, and the corresponding state change events are forwarded by the agent to the Root Management Server (RMS). Once received by the RMS, they are processed by the Correlation Engine, which may raise an alert via the RMS’s Software Development Kit (SDK) interface. This alert is then visible on the Operations Manager Console.

Correlation Factors

The actions taken by the Correlation Engine are determined by several factors.

Monitor state change events. Monitors, which watch for specific diagnostics from Exchange such as event log messages, performance counter thresholds, and PowerShell task output events, register state change events when they detect that a problem has occurred or cleared (green to red or red to green), or as agents become unavailable or are placed in maintenance mode (and are subsequently made available or removed from maintenance mode).

Typically, alert rules are configured to fire when green-to-red state changes occur. In the Exchange Server 2010 Management Pack, you’ll find that this is not the case. Specifically, alerts are not automatically raised by monitor state changes; instead, the Correlation Engine determines the best alert to raise.

Health Model. The class hierarchy imported into Operations Manager by the Exchange Server 2010 Management Pack is extensive. The class hierarchy includes class relationships that define component dependencies throughout the system. By defining these component dependencies in the object representation of the product, the Exchange Server 2010 Management Pack is able to better understand the health of the Exchange organization. For example, if the Exchange Server 2010 Management Pack identifies Active Directory as offline, it will also report that Exchange messaging is not fully functional.

Timing. The Correlation Engine works in 90-second intervals. When state change events for multiple monitors arrive at the same time, it waits to see whether anything else potentially related to the failure is detected, so that it can make the most effective determination of the root cause.

Correlation Algorithm

Overview of the Correlation Engine process

1. First, it connects to the Operations Manager SDK service to download the Health Model hierarchy and instance state (on service startup only, or as needed if errors require it).
2. Next, it queries Operations Manager for the latest state change events related to entities in the Exchange Management Pack.
3. If new Non-Service Impacting (NSI) state changes are detected, then it raises alerts for them.
4. Key Health Indicator (KHI) monitors are then evaluated, and "chains" of red KHI monitors are isolated. These "chains" indicate issues where a dependency has failed and is impacting dependent processes. Recognizing these relationships is the key step.
5. Alerts are raised for the root cause monitor in the KHI chain.
6. It then waits 90 seconds, and then starts over at step 2 above.
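As a rough illustration of steps 4 and 5, root-cause isolation can be sketched as a walk over a dependency map of red monitors, alerting only on the monitors whose own dependencies are healthy. This is a sketch, not the actual Correlation Engine code; all names and the data layout here are invented for the example.

```python
# Illustrative sketch only: find the root-cause monitor(s) in a "chain"
# of red (failed) KHI monitors, as steps 4-5 above describe.
# The health model is modeled as a simple monitor -> dependencies map;
# these names are hypothetical, not the real Correlation Engine API.

def find_root_causes(red_monitors, depends_on):
    """Return the red monitors whose own dependencies are all healthy.

    red_monitors: set of monitor names currently in a red state.
    depends_on:   dict mapping a monitor to the monitors it depends on.
    """
    roots = []
    for m in red_monitors:
        deps = depends_on.get(m, [])
        # m is a root cause if none of its dependencies are also red.
        if not any(d in red_monitors for d in deps):
            roots.append(m)
    return sorted(roots)

# Example: mail flow depends on the database, which depends on Active
# Directory. If all three are red, only AD is the root cause alerted on.
deps = {
    "MailFlow": ["Database"],
    "Database": ["ActiveDirectory"],
    "ActiveDirectory": [],
}
red = {"MailFlow", "Database", "ActiveDirectory"}
print(find_root_causes(red, deps))  # ['ActiveDirectory']
```

In the real engine the alert for the chain is raised against that root-cause monitor, with the rest of the chain recorded in the Alert Context field as described below.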

Additional points of interest regarding the correlation engine process

· If the "chain" of KHIs includes both error and warning monitors, then the alert is raised as an error, regardless of the class of the root cause monitor. For example, if a top-level process defines an error monitor to catch failure cases, and if it is correlated to a warning monitor in a dependency, then the alert will be raised against the dependency, but it will be marked as an error instead of a warning.
· Not every class relationship is used for alert correlation. See the Appendix: Class Hierarchy later in this guide for the specific relationships used by the Correlation Engine.
· The KHI chain, including any forensic monitors, is included in the Alert Context field available in the properties of the final alert. This allows inspection of the monitors correlated to the given alert and, in the case of alerts firing from dependency monitors, is required to determine the specific failure referenced by the alert.
· Monitors in maintenance mode are simply skipped when evaluating the health model.

What is and is not Affected by Alert Correlation

A key point to understand about the Exchange Server 2010 Management Pack, and the Correlation Engine in particular, is what the Correlation Engine affects, and what it doesn’t affect.

The following items are different due to the Correlation Engine:
· Monitors are configured not to alert automatically on state change events. This allows the Correlation Engine to determine the best alert to raise (as described above).
· The Exchange Server 2010 Management Pack doesn't raise Exchange alerts that correspond to the health of your environment when the Correlation Engine is stopped. If the Correlation Engine is stopped, a general alert is raised to notify you that the Correlation Engine is not running.

The following items are not different due to the presence of the Correlation Engine:
· Overrides still work as expected; you can change certain values or disable monitors just as you do today.
· Monitors/objects in maintenance mode are skipped by the Correlation Engine. No special consideration is required since the monitors don’t raise state change events for consumption by the Correlation Engine.
· Per-monitor alert rules were added to the Exchange Server 2010 Management Pack. Per-monitor alert rules allow monitoring personnel to enter company-specific notes for a given alert into the Company Knowledge field, even when the alert rules aren’t used to raise alerts for their corresponding monitors.
· Other management packs are not affected by the presence of the Correlation Engine.

In summary, keep in mind that it’s just the "monitor state change to alert" step that is enhanced by correlation.

Operational Notes

Since the Correlation Engine needs to maintain the instance space of the management group in memory to determine related monitors and alerts, its memory footprint is proportional to the number of instances in the management group. In plain terms, the more Exchange servers and databases you have, the more memory it will require.

In environments observed at Microsoft, the Correlation Engine uses roughly 5 megabytes of memory per monitored Exchange server. Various factors can drive this number up or down, but it’s a good starting point for understanding the resource impact on the server hosting the service.
As stated above, the preferred location for the service is on the RMS role given the close SDK interaction and core functionality of raising alerts.
While SCOM 2007 has no hard limit on the number of managed servers, it is limited by the number of managed objects and the relationships between them. SCOM is by design an object-model-based solution, and any managed object defined in a management pack is tracked individually. The more of these unique managed objects and relationships there are, the harder SCOM has to work at tracking their health and processing their workflows.
The maximum tested numbers of objects and relationships per SCOM management group are:

Maximum number of Managed Objects: 250,000
Maximum number of Relationships: 300,000

These are only the maximum tested numbers from the SCOM development team. SCOM 2007 can manage more than this, but performance starts to suffer and monitoring may be impaired if these numbers are exceeded.

Major Note:
The Exchange Correlation Engine may stop processing alerts if there are too many managed objects, too many relationships, or groups containing a large number of objects. The observed limits for relationships and group object members are:

Relationships: 600,000
Group object members: 1,000,000

This is a known hard limit: beyond it, the Correlation Engine takes too long to gather this information, hits a timeout that causes the process to restart, and then times out and restarts again continuously.

The Exchange 2010 MP creates a lot of managed objects because of the Correlation Engine design. The trouble this causes is that the number of managed objects and relationships in SCOM increases rapidly with each Exchange server added to the Management Group. Here are some typical managed object and relationship counts based on the server role added:

Common across all Exchange Servers:
20 Managed Objects
25 Relationships

CAS:
40 Managed Objects
40 Relationships

Transport:
20 Managed Objects
30 Relationships

Unified Messaging:
15 Managed Objects
20 Relationships

Mailbox (per database copy):
40 Managed Objects
65 Relationships

Note: These are approximate numbers and every environment setup will be different.

Using these Numbers you get this many managed objects and relationships for a simple 4 server Exchange 2010 installation:

Total Managed Objects: 340
Total Relationships: 710

Additionally, if more Database copies are added to the environment, these numbers increase rather quickly even without adding any more servers. Let’s say we added 2 more databases requiring a total of 4 database copies. We’ve now added 160 new managed objects and 260 relationships. That’s almost 50% more managed objects than before and a third more relationships without adding any new servers.
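That back-of-envelope math can be sketched as follows, using the approximate per-database-copy counts listed above (added_load is a hypothetical helper for illustration, not a SCOM API):

```python
# Back-of-envelope check of the database-copy growth described above,
# using the approximate per-copy counts from this article.
OBJECTS_PER_DB_COPY = 40
RELATIONSHIPS_PER_DB_COPY = 65

def added_load(new_db_copies):
    """Managed objects and relationships added by new database copies."""
    return (new_db_copies * OBJECTS_PER_DB_COPY,
            new_db_copies * RELATIONSHIPS_PER_DB_COPY)

objs, rels = added_load(4)  # 4 new database copies, no new servers
print(objs, rels)           # 160 260
# Relative to the 4-server baseline above (340 objects, 710 relationships),
# that is roughly 47% more managed objects and 37% more relationships.
```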

Because of this kind of increase, the Management Pack quickly pushes a management group toward the maximum tested numbers. In fact, larger-scale Exchange 2010 installations can effectively manage only 400-500 Exchange 2010 servers in a single SCOM management group, depending on the environment.

Keep this scale in mind when designing the SCOM monitoring environment.

Get SCOM Prepped

Besides the scale of the objects injected into SCOM, this Management Pack has a high dataflow rate: there are not only potentially hundreds of thousands of managed objects, but also health state, performance, and event data flowing for them. To allow SCOM to cope with this data flow, there are a few things to do to prep SCOM:

On the Root Management Server (RMS):

SCOM CU3 or later is highly recommended, as it contains quite a few performance-based fixes, including raising the standard agent queue size from the old 15 MB to 100 MB. This is required for Exchange 2010 agents because the amount of data to submit can grow quickly, and the small queue can cause the agent to drop data or even stop functioning.

Additionally, there are registry values to update to allow the RMS to use the server resources more effectively and reduce unneeded churn:

HKLM\Software\Microsoft\Microsoft Operations Manager\3.0
  GroupCalcPollingIntervalMilliseconds (DWORD) = 000dbba0
  Changes the Group Calculation processing interval to 15 minutes.

HKLM\Software\Microsoft\Microsoft Operations Manager\3.0\Config Service
  Polling Interval Seconds (DWORD) = 00000078
  Changes the Config Service polling interval to 2 minutes.
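If it helps, the two values above can be captured in a .reg file like the following (a sketch built from the values in this article; verify the paths against your own RMS before importing):

```
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0]
; Group Calculation interval: 0xDBBA0 = 900,000 ms = 15 minutes
"GroupCalcPollingIntervalMilliseconds"=dword:000dbba0

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Config Service]
; Config Service polling: 0x78 = 120 seconds = 2 minutes
"Polling Interval Seconds"=dword:00000078
```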

Finally, for the RMS, ensure that no agents report directly to the RMS whenever possible. The Exchange 2010 MP hosts a lot of “non-hosted” managed objects on the RMS, which therefore has to process a large number of health states, and all alerting occurs from the RMS. Adding agent processing and dataflow to the RMS can hinder this work and should be avoided if at all possible.

On all Management Server(s):

For all Management Servers, including the RMS, there are a few more registry values to update to allow for better resource utilization for SCOM processing, all under the following key:

HKLM\System\CurrentControlSet\Services\HealthService\Parameters
  Persistence Cache Maximum (DWORD) = 00019000
  Persistence Version Store Maximum (DWORD) = 00002800
  Persistence Checkpoint Depth Maximum (DWORD) = 06400000
  State Queue Items (DWORD) = 00005000

The first three values allow more memory usage for the Health Service’s data store on the local system; State Queue Items allows more data to be stored in it.
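As with the RMS keys, these four values can be captured in a .reg file like this (a sketch based on the values above; verify against your own management servers before importing):

```
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\HealthService\Parameters]
"Persistence Cache Maximum"=dword:00019000
"Persistence Version Store Maximum"=dword:00002800
"Persistence Checkpoint Depth Maximum"=dword:06400000
"State Queue Items"=dword:00005000
```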

Note: These updates do not apply to gateway servers.
SCOM Data Warehouse (where applicable):

The Exchange 2010 MP adds some new datasets to the SCOM DW for custom reporting. These new datasets have their own set of aggregations that can take more time than normal to complete, so you need to increase the timeout for DW processing to allow these aggregations to finish:
[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Data Warehouse]
"Command Timeout Seconds"=dword:00000384 – updates the DW processing timeout from 5 minutes to 15 minutes.

Note: This needs to be done on every Management Server (including the RMS).

Note: This timeout may need to be as high as 30 minutes, depending on how much data flow there is (mainly event and performance collection).

MP Changes

The sheer amount of monitoring in the Exchange 2010 MP provides a whole host of alerting criteria never really experienced in SCOM previously. This is by far the largest MP to date from Microsoft, and it provides a massive amount of visibility into Exchange issues. However, some things in the Management Pack simply don’t work, and some of them can hinder or even stop all SCOM alert processing due to the inherent nature of the check/alert. These issues become much more visible as you scale up. For this reason, we recommend disabling some of the most common faulty monitoring in the Exchange MP via overrides. These changes are designed to provide more accurate monitoring for highly critical issues and to reduce the chance of impacting SCOM monitoring.

Monitors:

Disable the following monitors:
· KHI: Database dismounted or service degraded.
· KHI: The Database is not mounted on the local server. Another database copy may be mounted on a different server.

Optional Overrides:

All non-reporting-based Performance Collection rules should be disabled; there are 504 of them. After disabling these, each environment should review the performance collections and re-enable the ones it cares about as needed.

Other Considerations

Due to how the MP is designed, the Correlation Engine has to cache the Exchange infrastructure to be able to determine whether everything is healthy. It is very important that all Exchange servers in an Exchange site are in the same SCOM management group. Having only part of a site in a given SCOM management group will cause a lot of noise, because the Correlation Engine expects to see all the servers in the site but does not see them in SCOM. Please plan accordingly to ensure that all active Exchange servers in each site are monitored in the same SCOM management group. (For example, the North America site is managed in one SCOM management group, while the South America site is managed in another.)
Additionally, if any monitoring does not seem to be correct and is causing churn/noise, turn it off by disabling the corresponding monitor. Once it is disabled, review the criteria to determine whether it’s actionable and whether any additional tuning is needed. It’s better to stop the alerting for a short time to ensure SCOM isn’t about to break than to allow a noisy alert that will mask other potential issues.
Finally, make sure that your SCOM agents are healthy, and put a remediation process in place for agents that don’t report in. (You can attach a recovery to “Health Service Heartbeat Failure”.) A lot of the Exchange monitoring is based on the best health of n servers; if the one healthy server is not reporting in, SCOM thinks everything is unhealthy and can go pretty nuts in the process. (This MP most likely has 50+ monitoring criteria associated with that one healthy server.) Keeping the agents reporting in is key to ensuring monitoring is accurate.

Common Errors



Symptom


MicrosoftExchangeServerRoleDisovery.js returns empty discovery data for non-domain servers, so the server will not be discovered as an Exchange server role in the Operations Manager database.

Cause

The MicrosoftExchangeServerRoleDisovery.js script creates a property bag that returns the role of every Exchange server. The script does not return any error or give any indication of why it failed unless an override is placed on the discovery to set VerboseLogging=True.
The script looks for the following parameters to be populated:
· Computer Principal Name
· Computer Netbios name
· Computer Active Directory Site
· Computer DNS name
· Install Path
· Version
If any of these parameters are not populated, the script will return an empty property bag and the server will not be discovered as an Exchange server role in the Operations Manager database.
If an Exchange Edge server is in a workgroup, the ComputerActiveDirectorySite parameter is not populated. Because of this, the server will not be discovered by Operations Manager. Apparently, it is quite common for Edge servers to be workgroup machines, so monitoring them in Operations Manager is not possible without faking some value for this parameter.
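The failure mode can be sketched like this (an illustration of the empty-property-bag behavior described above; the names and data layout are hypothetical, not the actual script logic):

```python
# Illustrative sketch: if any required discovery property is empty, an
# empty property bag is returned and the role is not discovered.
# Property names here are simplified stand-ins for the parameters listed
# above, not the real script's identifiers.
REQUIRED = ["PrincipalName", "NetbiosName", "ActiveDirectorySite",
            "DnsName", "InstallPath", "Version"]

def build_property_bag(server):
    """Return discovery data, or an empty bag if anything is missing."""
    if any(not server.get(key) for key in REQUIRED):
        return {}  # empty bag -> role not discovered
    return {key: server[key] for key in REQUIRED}

# A workgroup Edge server has no AD site, so discovery comes back empty:
edge = {"PrincipalName": "edge01", "NetbiosName": "EDGE01",
        "ActiveDirectorySite": "", "DnsName": "edge01.contoso.com",
        "InstallPath": r"C:\Exchange Server\V14", "Version": "14.0"}
print(build_property_bag(edge))  # {}
```

Supplying any non-empty site value, as the resolution below does via the registry, makes the bag non-empty and lets discovery proceed.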

Resolution
Add a registry value to the server so that it returns a non-NULL value for the Active Directory site:
1. Open Registry Editor.
2. Navigate to HKLM\System\CurrentControlSet\Services\Netlogon\Parameters.
3. Find the SiteName value under this key and populate it with any non-NULL string (such as "Perimeter", "DMZ", or "Edge").
Note: Do not update the DynamicSiteName value, as the NetLogon service can overwrite that data. The SiteName value is not automatically updated.
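The steps above could be captured in a .reg file such as the following (a sketch; "Perimeter" is just an example label, any non-NULL string works):

```
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Netlogon\Parameters]
; Any non-NULL string works; "Perimeter" is only an example label.
"SiteName"="Perimeter"
```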

Symptom


Alerts raised by SCOM have 10 custom fields, which are used for integration with other ticketing systems. The Exchange Management Pack (the Correlation Engine in particular) uses the same fields for its internal needs and therefore overwrites their values. For example, it stores the CorrelatedProblemId in CustomField10; if that field is overwritten, alerts cannot be closed.

Cause
The Exchange MP uses the following custom fields for critical values; connectors that use any of them will experience this issue:
CustomField4
CustomField5
CustomField7
CustomField8
CustomField9
CustomField10

Resolution
Because the Exchange MP uses these fields, the workaround is to change which custom fields the connector uses.

Symptom
AD integration breaks after installing the Exchange 2010 Management Pack. As soon as the Exchange 2010 MP is installed on a server running the SCOM agent, AD integration breaks for that agent. The memberships in primary and secondary groups all show up correctly, but the agent reads everything in AD and tries to connect to all of the management servers, including the RMS (even when the RMS is not configured for AD integration).

Cause
This happens because when Exchange 2010 is installed on a computer, the machine account is added to the following three additional domain groups that Exchange creates:

· Exchange Trusted Subsystem (read and special)
· Exchange Servers (only special)
· Exchange Windows Permissions (only special)

These three groups have permissions at the domain level, so the permissions are inherited by the “OperationsManager” container and the Management Group containers/SCPs beneath it. When the agent’s health service runs under Local System, at startup it is able to read everything in AD under the OperationsManager container.

Resolution

The issue is fixed by removing these three groups from the “OperationsManager” container, thereby stopping the inheritance.

Symptoms
Using System Center Operations Manager 2007, you import the Exchange 2010 Management Pack. Per the Exchange 2010 MP guide, all of the object discovery rules are enabled by default and should automatically discover all Exchange 2010 roles and start monitoring them. However, none of the Exchange 2010 server roles are discovered or monitored, and no error is logged and no alert is raised to indicate that discovery failed.

Causes
· This can occur if you install the 32-bit (x86) agent on a 64-bit (x64) based operating system or platform.
· This can happen if your Exchange 2010 Server roles are clustered. For example, the Mailbox server role or CAS server role is installed on Windows Cluster Server.

Resolutions
· Install the proper agent for the platform or OS hosting the Exchange Server roles.
· Make sure that the OpsMgr 2007 R2 agent is installed on all cluster nodes. Then, in the OpsMgr console under Administration -> Device Management -> Agent Managed, open the properties of each agent computer and, on the Security tab, enable the Agent Proxy check box. Restart the System Center Management service on each agent computer after doing this, and within a few minutes all Exchange 2010 server roles should be discovered and monitored as expected.
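Enabling the Agent Proxy setting node by node is tedious in larger environments. The following Operations Manager 2007 Command Shell sketch flips the setting in bulk; the ProxyingEnabled property and ApplyChanges() method are taken from commonly published community examples rather than from this article, so verify them in your environment before relying on this:

```powershell
# Community-sourced sketch (verify before use): enable proxying for all
# agents that do not already have it enabled, then apply the change.
Get-Agent | Where-Object { $_.ProxyingEnabled.Value -eq $false } | ForEach-Object {
    $_.ProxyingEnabled = $true   # property name assumed from community examples
    $_.ApplyChanges()
}
```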

Symptom

When attempting to override an alert priority on an Exchange 2010 rule with the Exchange 2010 MP the override does not take effect. The default value is defined as $Data/EventData/CorrelatedContext/RootCause/Priority$.

Cause
Alerts are generated by the Correlation Engine, so the overrides do not take effect; the override scope is most likely the issue.

Resolution

The override should be scoped to “All objects of Class: Root Management Server”.

Symptoms

After deploying the Exchange 2010 Management Pack in a System Center Operations Manager environment, the Exchange 2010 MP may put the RMS in a critical state with the following error:

Failed to deploy Data Warehouse component. The operation will be retried. Exception 'DeploymentException': Failed to perform Data Warehouse component deployment operation: Install; Component: Script, Id: '0672dd6a-1e36-2336-b1f0-f701fe67f8a2', Management Pack Version-dependent Id: 'ab06eb14-eaf1-0f0b-04b8-f1cdd33f4acc'; Target: Database, Server name: 'serverName', Database name: 'OperationsManagerDW'. Batch ordinal: 15; Exception: Must declare the scalar variable "@SplitValue". Must declare the scalar variable "@SplitValue". One or more workflows were affected by this. Workflow name: Microsoft.SystemCenter.DataWarehouse.Deployment.Component Instance name: <FQDN> Instance ID: {05432A69-69F6-2B53-2D79-52BD1AC6E289} Management group: groupName

If the Exchange 2010 MP is removed then the health will return to normal (green).

Cause

This can occur if the database collation is case sensitive.

Resolution

Change the DB collation to be case insensitive to resolve this issue.
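Before changing anything, you can confirm the current collation with a quick query against the SQL Server instance hosting the data warehouse (the database name below is the default; adjust if yours differs). Collation names containing _CI_ are case insensitive; _CS_ means case sensitive:

```sql
-- Returns e.g. SQL_Latin1_General_CP1_CI_AS (case insensitive)
SELECT DATABASEPROPERTYEX('OperationsManagerDW', 'Collation') AS DWCollation;
```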

Symptoms

When you drill down to the "Top Alerts" sub-report of the SLA report in the Exchange 2010 SP1 management pack for System Center Operations Manager 2007 R2, you get the error:
An error has occurred during report processing. Query execution failed for dataset 'TopAlerts'.
Cannot find either column 'Exchange2010' or the user-defined function or aggregate
'Exchange2010.GetServerRole', or the name is ambiguous.

Resolution

Don't use this report.

Symptoms

High amounts of config churn are noticed after importing the Exchange 2010 Management Pack.

Cause
There are two discoveries in the Exchange 2010 RTM MP version 14.0.650.8 that cause some significant config churn. The two discoveries target the “Mailbox” class – and are:

· Microsoft.Exchange.2010.Mailbox.MdbOwningServerLocalEntityDiscoveryRule
· Microsoft.Exchange.2010.Mailbox.MdbOwningServerRemoteEntityDiscoveryRule

These run every 14400 seconds by default (4 hours). The problem is they collect some properties that change with every run of the discovery. The commonly churning properties are:

· DatabaseSize
· DbFreeSpace
· LogDriveFreeSpace

Resolution

Upgrade to the latest version of the Exchange 2010 MP for SP1, as it changes these values to NULL. Otherwise, consider overriding these discoveries to run once per day (86400 seconds, the maximum supported interval for this particular discovery) until the condition is resolved with an updated MP. This will not eliminate the config churn; it will simply reduce the amount caused by these specific workflows. If you try to set the interval to more than 86400 seconds, the Scheduler Data Source module will raise a sync time error.

Symptoms


Account lockout: some customers who have enabled account lockout policies in their environment have reported issues with the test user being locked out.

Resolution
If you experience lockout problems in your environment, see Microsoft Knowledge Base article 2022687, Exchange Test CAS Connectivity user gets locked out when using Exchange 2010 MP (http://go.microsoft.com/fwlink/?linkid=3052&kbid=2022687).

Symptoms

Event messages concerning the MSExchange Management event log: if the Exchange Server 2010 Service Pack 1 (SP1) version of the Management Pack is imported before all Exchange servers are upgraded to Exchange Server 2010 SP1, the event log message below may be logged regularly.
Log Name: Operations Manager
Source: Health Service Modules
Event ID: 26004
Level: Error
Description:
The Windows Event Log Provider is still unable to open the MSExchange Management event log on computer 'server'. The Provider has been unable to open the MSExchange Management event log for 565200 seconds.
Most recent error details: The specified channel could not be found. Check channel configuration.
One or more workflows were affected by this.

Cause
The logging of this event is expected behavior when servers that have the RTM version of Exchange 2010 installed use the Exchange 2010 SP1 Management Pack. The Exchange Server 2010 Service Pack 1 (SP1) version of the Management Pack will still monitor Exchange computers that are running Exchange Server 2010 Service Pack 1 (SP1) and Exchange Server 2010 RTM while this event is being logged.

Symptoms
Size of the Operations Database grows out of control after disabling rules in the Exchange 2010 MP.

Cause
The Correlation Engine checks whether alerts already exist before creating a new event in the PendingSDKDatasource table. The basic components are monitors and their matching rules: each monitor has a corresponding rule. When monitors change state, the Correlation Engine picks them up; if it is a new issue, a new event is created via the SDK and the event description contains the whole chain of monitors. The rules then look for the SDK events and create an alert.

If the alert rule is disabled, the Correlation Engine continues to insert events, and depending on the timing of some of the monitors and the type of failure, the number of events continues to grow. As a side effect, the PendingSDKDatasource table grows quite large and the rules have trouble keeping up with the number of events. This may cause the MonitoringHost process running those workflows to consume very large amounts of memory (Private Bytes), which can have a negative overall performance impact on the RMS if that is where the Correlation Engine resides. The PendingSDKDatasource table is groomed once a day, but depending on how unhealthy the Exchange environment is, this may be too large an interval.

The main takeaway: DO NOT disable an Exchange alerting rule unless you also disable the corresponding monitor.

Resolution
Disable the monitors that correspond with the alerting rules.
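To gauge whether you are hitting this condition, checking the size of the table mentioned above is a quick test. The table name is taken from this article; verify it against your OperationsManager database before running:

```sql
-- Row count of the pending Correlation Engine events (OperationsManager database)
SELECT COUNT(*) AS PendingRows
FROM PendingSdkDataSource WITH (NOLOCK);
```

A count that keeps climbing between checks suggests a disabled alerting rule whose corresponding monitor is still enabled.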

=====

For the most current version of this article please see the following:

2592561 : Guidance, Tuning and Known Issues for the Exchange 2010 Management Pack for System Center Operations Manager 2007

J.C. Hornbeck | System Center Knowledge Engineer

App-V Team blog: http://blogs.technet.com/appv/
AVIcode Team blog: http://blogs.technet.com/b/avicode
ConfigMgr Support Team blog: http://blogs.technet.com/configurationmgr/
DPM Team blog: http://blogs.technet.com/dpm/
MED-V Team blog: http://blogs.technet.com/medv/
OOB Support Team blog: http://blogs.technet.com/oob/
Opalis Team blog: http://blogs.technet.com/opalis
Orchestrator Support Team blog: http://blogs.technet.com/b/orchestrator/
OpsMgr Support Team blog: http://blogs.technet.com/operationsmgr/
SCMDM Support Team blog: http://blogs.technet.com/mdm/
SCVMM Team blog: http://blogs.technet.com/scvmm
Server App-V Team blog: http://blogs.technet.com/b/serverappv
Service Manager Team blog: http://blogs.technet.com/b/servicemanager
System Center Essentials Team blog: http://blogs.technet.com/b/systemcenteressentials
WSUS Support Team blog: http://blogs.technet.com/sus/


Meet Sergey Kanzhelev, developer on the Operations Manager Team

Topology changes in System Center 2012 Operations Manager (Overview)

OM Community, In this blog post, I will explain the changes made to the Operations Manager 2012 infrastructure topology.  The purpose of this post is not to do a deep technical explanation on how some of these new features work but more of an overview...(read more)

Application Monitoring – Working with Alerts

Our team has made a few posts around APM with Operations Manager 2012, how to get things running , how it works , and how to simulate errors for testing . Here I’m going to talk about the application centric alerts you will see in OM when you start...(read more)

DNS2008ComponentDiscovery Fails with Event ID 1155 in System Center Operations Manager 2007


Symptoms

When using Microsoft System Center Operations Manager 2007 (SCOM 2007), DNS 2008 component discovery fails on Windows Server 2008, and the events below are logged in the Operations Manager event log on the DNS server.

Event 1

Log Name: Operations Manager
Source: Health Service Script
Date: <Event Time>
Event ID: 1155
Task Category: None
Level: Error
Keywords: Classic
User: N/A
Computer: <Computer Name>
Description:
DNS2008ComponentDiscovery : Unable to open WMI \root\default:StdRegProv.

Event 2

Log Name: Operations Manager
Source: Health Service Modules
Date: <Event Time>
Event ID: 21405
Task Category: None
Level: Warning
Keywords: Classic
User: N/A
Computer: <Computer Name>
Description:
The process started at <Time> failed to create System.Discovery.Data, no errors detected in the output. The process exited with 0
Command executed: "C:\Windows\system32\cscript.exe" /nologo "DNS2008ComponentDiscovery.vbs" <Script Parameters>

Working Directory: <Monitoring Host Temporary Files Location>

One or more workflows were affected by this.

Workflow name: Microsoft.Windows.DNSServer.2008.Discovery.Components
Instance name: <Computer Name>
Instance ID: {<GUID>}
Management group: <Management Group Name>

If you connect to WMI \root\default:StdRegProv directly using the Windows Management Instrumentation Tester (wbemtest.exe) with the required privileges, the connection succeeds.

Cause

The DNS2008ComponentDiscovery.vbs script could not connect to the WMI Class MicrosoftDNS_Zone due to an unavailable Property Value in the class.

Resolution

This can be resolved by recreating the WMI values for DNS by recompiling the DNS mof files:

1. On the DNS server, open an elevated (Administrator) Command Prompt and execute the following commands:

mofcomp C:\Windows\System32\wbem\dnsetw.mof

and

mofcomp C:\Windows\System32\wbem\dnsprov.mof

The above commands recreate the DNS information in the WMI repository. Once this is done, the errors should no longer appear.

More Information

For More Information on DNS Management Pack and DNS Component Discovery, please refer to DNS Management Pack Documentation.

=====

For the most current version of this article please see the following:

2589504 : DNS2008ComponentDiscovery Fails with Event ID 1155 in System Center Operations Manager 2007

J.C. Hornbeck | System Center Knowledge Engineer



Operations Manager 2012 Beta - Usage Survey - Win and Xbox and Kinect

Now that you have had a chance to test SCOM 2012, please provide feedback! Usage Survey Allow about 20 minutes to complete this survey. Complete the survey and qualify to Win an Xbox & Kinect! Sweepstakes Official Rules ...read more...(read more)

Troubleshooting the Installation of the System Center Operations Manager 2007 Agent


If you’re looking for a good resource for troubleshooting OpsMgr 2007 client agent install issues, then this new KB article we published today is for you:

=====

Symptoms

The System Center Operations Manager 2007 (SCOM 2007) agent can be deployed to Windows computers either via "remote push" from a management server or by manual installation on the target computer using MomAgent.msi. If the installation of the agent is not successful, there are a number of troubleshooting steps that can be used, depending on where the error occurs and how the agent is being deployed.

Resolution

Verify the target computer meets the supported configuration
The initial step in troubleshooting installation of the Operations Manager agent on a Windows computer is to verify that the potential agent meets the supported hardware and software configuration. The following article lists the requirements for an Operations Manager 2007 agent:

Operations Manager 2007 R2 Supported Configurations

If the target system is a Unix/Linux computer, verify that the distribution and version are supported. Please note that support for some versions requires post-R2 cumulative updates. The following article has the supported versions of Unix/Linux:

System Center Operations Manager 2007 R2 Cross Platform Monitoring Management Packs

Troubleshooting Agent Deployment via the Discovery Wizard in the Operations Manager Console
If the agent will be deployed by means of discovery from the Operations Manager console, the agent is installed from the management server or gateway server specified in the Discovery Wizard to manage the agent, not from the server the Operations console was connected to when it opened. Any testing should therefore be conducted from the management server or gateway specified when the wizard is run, or a different management server/gateway should be specified during the wizard to see whether the same error occurs.

  • Problem:
    The wizard does not display a list of potential agents to install.
    Cause:
    The credentials specified in the wizard during the initial discovery must have permission to search Active Directory for potential Operations Manager agents. If this account is not able to connect to Active Directory, the Discovery Wizard will fail.
    Typical errors that appear may be:
    • Error Code: 800706BA
      Error Description: The RPC server is unavailable
    • Error Code: 80070079
      The MOM Server failed to perform specified operation on computer "name". The semaphore timeout period has expired.
    • Error Code: 80070643
      The Agent Management Operation Agent Install failed for remote computer "name"

    Possible Resolutions:
    • During discovery, specify an account that has both domain administrator permissions and is a member of the Operations Manager Admins group.
    • If the LDAP query times out, or is not able to resolve the potential agents in Active Directory, discovery can be performed via the Operations Manager Command Shell. See the following section "Troubleshooting Agent Deployment via the Operations Manager Command Shell" for additional information.
  • Problem:
    The intended target computer is not in the list of potential agents after the initial discovery runs.
    Cause:
    • The computer is already identified in the database as part of the management group.
    • The computer is listed under 'Pending Actions' in the Operations Console.

    Possible Resolutions:
    • If the target computer is listed in the 'Pending Actions' node of the 'Administration' space in the Operations Console, the existing action must either be approved or rejected before a new action can be performed. If the existing install settings are sufficient, approve the pending installation from the console. If the existing settings are incorrect, reject the pending action, then run the discovery wizard again.
  • Problem:
    The discovery wizard encounters one of the following errors while trying to install the agent:
    • Operation: Agent Install
      Error Code: 800706D9
    • Error Description: Unknown error 0xC000296E
    • Error Description: Unknown error 0xC0002976
    • Error Code: 80070643
      Error Description: Fatal error during installation.

    Cause:
    • The account previously specified to perform the agent installation in the discovery wizard will need to have permissions to connect remotely to the target computer and install a Windows service. This requires local administrator permissions due to the requirement to write to the registry.
    • Group policy restrictions on the management server computer account, or the account used for agent push, can prevent successful installation. Group Policy Objects in Active Directory that prevent the Management Server computer account, or the user account used by the Discovery Wizard, from remotely accessing the Windows folder, the registry, WMI or administrative shares on the target computer can prevent successful deployment of the Operations Manager agent.
    • The Windows Firewall is blocking ports between the Management Server and the target computer.
    • Required services on the target computer are not running.

    Possible Resolutions:
    • If the credentials specified in the wizard do not have local administrator permissions, add the account to the local Administrators security group on the target computer, or use an account that is already a member of that group.
    • Block group policy inheritance on the target computer, or the user account performing the installation.
    • If an agent install is failing when using a domain account to push the agent from a management server, the use of Windows administrative tools can help identify potential issues. Log onto the Management Server under the credentials in question and attempt the following tasks. If the account does not have permission to log onto the management server, the tools can be run under the credentials to be tested from a command prompt.
      • "RUNAS /user:<username> compmgmt.msc". From the 'Action' menu item, select 'Connect to another computer'. Browse or type in the remote computer name. Try to open Event Viewer and browse any of the event logs.
      • "RUNAS /user:<username> services.msc". From the 'Action' menu item, select 'Connect to another computer'. Browse or type in the remote computer name. Attempt to start or stop the Print Spooler or any other service on the target computer.
      • "RUNAS /user:<username> regedt32.exe". From the 'File' menu item, select 'Connect Network Registry'. Browse or type in the remote computer name. Try to open "HKEY_LOCAL_MACHINE" on the remote machine.
      • "RUNAS /user:<username> Explorer.exe". Type the following in the address bar: \\<computername>\admin$
        If any of these tasks fail, try using a different account known to have Domain Administrator or Local Administrator (on the target computer) permissions. Also try the same tasks from a member server or workstation to see if the tasks fail from multiple machines.
        Failure to connect to the admin$ share may prevent the Management Server from copying setup files to the target. Failure to connect to the Windows Registry on the target can cause the Health Service to not be installed properly. Failure to connect to Service Control Manager will prevent setup from starting the service.
    • The following ports must be open between the Management Server and the target computer:
      • RPC endpoint mapper Port number: 135 Protocol: TCP/UDP
      • *RPC/DCOM High ports (2000/2003 OS) Ports 1024-5000 Protocol: TCP/UDP
      • *RPC/DCOM High ports (2008 OS) Ports 49152-65535 Protocol: TCP/UDP
      • NetBIOS name service Port number: 137 Protocol: TCP/UDP
      • NetBIOS session service Port number: 139 Protocol: TCP/UDP
      • SMB over IP Port number: 445 Protocol: TCP
      • MOM Channel Port number: 5723 Protocol: TCP/UDP
    • The following services must be enabled and running on the target computer:
      • Netlogon
      • Remote Registry
      • Windows Installer
      • Automatic Updates

The following articles provide some good background about deploying the Operations manager agent using discovery from the Management Server:

How to Deploy the Operations Manager 2007 Agent Using the Agent Setup Wizard
How does Computer Discovery Work in OpsMgr 2007?
Agent discovery and push troubleshooting in OpsMgr 2007
Console based Agent Deployment Troubleshooting table

Troubleshooting Agent Deployment via the Operations Manager Command Shell
In some situations, automatic discovery of potential agents may time out due to very large or complex Active Directory environments. Other situations may require that automatic discovery be run with an LDAP query that is more limited than what is available in the UI. In these cases, automatic discovery of computers and remote installation of the Operations Manager agent is possible via the Operations Manager command shell. The following blog posting gives the syntax required to do this:

Discovering Windows Computers via PowerShell

Troubleshooting Agent Deployment via Verbose Windows Installer Logging
If the installation of the agent on a remote computer fails during installation, a verbose Windows Installer log may be created on the management server in the following default location:

C:\Program Files\System Center Operations Manager 2007\AgentManagement\AgentLogs

The log can be used to determine if there was a specific error encountered and may be useful to further troubleshoot installation of the Operations Manager agent on the target computer.

Look for the first entry with the string "Return Value 3" in the log. The preceding few lines will usually indicate the error that Windows Installer encountered. The format will typically be in the form of "function / description of error / error return code", and can indicate permission issues, missing files or other settings that need to be changed. Examples:

  • Error message:
    ConvertStringSecurityDescriptorToSecurityDescriptor failed : 87
    Possible cause:
    The installation account does not have permission to the security log on the target computer
  • Error message:
    ModifyEventLogAccessForNetworkService(): Could not grant read access to SecurityLog: 0x00000057
    Possible cause:
    The installation account does not have permission to the security log on the target computer
  • Error message:
    Cannot open database file. System error -2147024629
    Possible cause:
    The installation account does not have permission to the system TEMP folder

There are many possible errors that can be logged here. Other individual errors can be further researched on TechNet or the Online Knowledge Base.
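Rather than reading the whole log, findstr can jump straight to the failure entry; the path below is the default agent-log location mentioned above:

```
findstr /i /n /c:"Return value 3" "C:\Program Files\System Center Operations Manager 2007\AgentManagement\AgentLogs\*.log"
```

The /c: switch searches for the literal phrase rather than each word, and /n prints line numbers so you can open the log and inspect the lines immediately preceding the match for the actual error.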

Troubleshooting Manual installation of the Operations Manager Agent
In cases where the Operations Manager agent cannot be deployed to a remote computer via the Discovery Wizard, the agent will need to be installed manually. This can be performed via command line using the MomAgent.msi file. The following references describe the various switches and configuration options available to perform a manual installation:

How to Deploy the Operations Manager 2007 Agent Using MOMAgent.msi from the Command Line
Windows Agent Install MSI Use Cases and Commands
Process Manual Agent Installations in Operations Manager 2007
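As a sketch of what a typical manual, silent installation looks like: the property names below are documented MOMAgent.msi public properties, and the management group and server values are placeholders to replace with your own:

```
msiexec.exe /i MOMAgent.msi /qn USE_SETTINGS_FROM_AD=0 MANAGEMENT_GROUP=<ManagementGroupName> MANAGEMENT_SERVER_DNS=<ManagementServerFQDN> ACTIONS_USE_COMPUTER_ACCOUNT=1
```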

If the agent is deployed via manual install, future Service Pack updates or cumulative updates will need to be manually deployed as well. Computers that have been manually installed will not be designated by the System Center Configuration Management service as being remotely manageable, and the option to upgrade them will not be presented in the Operations Console.

Other key considerations to account for during the manual installation of agents:

  • If the installation is being performed by a domain or local user, the account needs to be a member of the local Administrators security group on Vista or later operating systems. On pre-Vista operating systems, members of the "Power Users" security group had the permissions required to install services.
  • If the agent is being deployed via Configuration Manager, the Configuration Manager Agent service account will either need to run as Localsystem (which is the default) or under the context of a local administrator.

Errors that prevent agents from being installed manually can be identified in the Windows Installer setup logs. The following command can be used to enable verbose Windows Installer logging of the Operations Manager agent installation:
msiexec.exe /i "MOMAgent.msi" /l*v "C:\Agent\MOMAgent_install.log"

As an alternative, the following article describes how to enable verbose Windows Installer logging globally on a Windows computer:
How to enable Windows Installer logging

The log can be used to determine if there was a specific error encountered and may be useful to further troubleshoot installation of the Operations Manager agent on the target computer.

Look for the first entry with the string "Return Value 3" in the log. The preceding few lines will usually indicate the error that Windows Installer encountered. The format will typically be in the form of "function / description of error / error return code", and can indicate permission issues, missing files or other settings that need to be changed.

Examples:

  • Error message:
    ConvertStringSecurityDescriptorToSecurityDescriptor failed : 87
    Possible cause:
    The installation account does not have permission to the security log on the target computer
  • Error message:
    ModifyEventLogAccessForNetworkService(): Could not grant read access to SecurityLog: 0x00000057
    Possible cause:
    The installation account does not have permission to the security log on the target computer
  • Error message:
    Cannot open database file. System error -2147024629
    Possible cause:
    The installation account does not have permission to the system TEMP folder

There are many possible errors that can be logged here. Other individual errors can be further researched on TechNet or the Online Knowledge Base.

=====

For the most current version of this article please see the following:

2566152: Troubleshooting the Installation of the System Center Operations Manager 2007 Agent

J.C. Hornbeck | System Center Knowledge Engineer



Standard Dataset Maintenance troubleshooter for System Center Operations Manager 2007


What is Standard Data Set Maintenance?

Standard Data Set Maintenance is a workflow that runs against the data warehouse to aggregate, optimize, and groom data. The workflow runs on the RMS and is triggered every 60 seconds. It has a hard-coded timeout and under many circumstances can fail, resulting in 31552 events in the RMS Operations Manager event log. The process is actually a group of stored procedures (SPs) called by the parent SP, StandardDataSetMaintenance. Many SPs make up this group, but the primary ones are:

StandardDataSetMaintenance – Parent, Calls the other primary sp’s

StandardDataSetAggregate – Called for State and Perf data aggregation

StandardDataSetOptimize – Updates indexes and statistics

StandardDataSetGroom – Grooms old data based on age

The list above shows the StandardDataSet stored procedures, but omits the data set-specific procedures they call, such as the performance-specific ones:

PerformanceAggregate

PerformanceAggregationDelete

PerformanceGroom

PerformanceProcessStaging

Each data set type will have its own specific set of stored procedures similar to the above.

What causes things to go wrong?

There are a few different things that can cause this workflow to fail. Here are the five that will be the focus of this document as they are the most common:

1 – SQL Standard

SQL Standard Edition does not allow online index operations on the database. Every night the maintenance SP triggers the StandardDataSetOptimize stored procedure, which tries to perform online indexing on the data set tables. On Standard Edition this process fails, so every night we see corresponding 31552 events indicating the failure.
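To confirm which edition the SQL Server instance hosting the data warehouse is running:

```sql
-- Returns e.g. 'Standard Edition (64-bit)' or 'Enterprise Edition (64-bit)'
SELECT SERVERPROPERTY('Edition') AS SqlEdition;
```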

2 – SQL Performance

Poor SQL performance can cause 31552 events, but typically not those events alone. When this is the cause you will usually see a wide variety of symptoms, such as write failures, reports running slowly or timing out, scheduled reports failing, and possibly 2115 events. Disk performance is the most common culprit here, which is where we will start in the troubleshooting steps later.

3 – Data Flood (Perf and/or State)

Since the maintenance process has to aggregate, copy, and groom State and Performance data, an influx of data can cause it to fail. A single bad monitor can wreak havoc on the DW database if it is changing state quickly enough. Note that it does not necessarily have to be a "bad" monitor; it can simply be an issue in the environment, for instance an R2 upgrade that caused a new flood of state changes, or a power outage that impacted enough objects to initiate a flood. Performance collections that cause the failure are not necessarily bad either: there may simply be too many collection rules, too many objects that the rules run against, or a collection interval set a little too aggressively. A few quick queries, shown below, will give us our answer here.

4 – Large MEP tables (ManagedEntityProperty)

This is mostly seen in SP1 environments. There was an issue in SP1 where these properties were not groomed out as they should be, so the table just kept growing. This is fixed in R2, but it is listed here because SP1 is still supported and the issue can persist in an upgraded R2 management group. If this table is too large, it can take a long time to process the managed entity objects needed for the maintenance process, which leads to the 31552 event. We can manually groom this table if needed, as shown below, but we need to use caution when performing these steps.
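Before grooming anything, it is worth checking how large the table actually is. A minimal sketch, assuming the ManagedEntityProperty table in the OperationsManagerDW database:

```sql
-- Row count of the ManagedEntityProperty table in the Data Warehouse;
-- a very large count here can slow the maintenance workflow
SELECT COUNT(*) AS MEPRowCount
FROM ManagedEntityProperty WITH (NOLOCK)
```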

5 – SQL Permissions

This is the least likely of the scenarios and will not be discussed in detail here. It is primarily caused by incorrectly configured Run As profiles/accounts and tends to manifest elsewhere as well. We would normally see more failures than just 31552 events in this case, and troubleshooting is simply a matter of verifying the accounts, so I will not spend much time on it here.

So now that we know a little about the event and its primary causes, let’s look at what we would need to do to troubleshoot some of the above scenarios. A key takeaway when troubleshooting these events is in the data set section of the event. Take the event below:

Failed to store data in the Data Warehouse.
Exception 'SqlException': Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
One or more workflows were affected by this.
Workflow name: Microsoft.SystemCenter.DataWarehouse.StandardDataSetMaintenance
Instance name: State data set
Instance ID: {8FBE35E8-41E7-F8F8-2DDC-AFC4A44D7522}
Management group: Operations Management

The Instance name in the event tells you the data set experiencing the issue. This is important because not all data sets apply to every scenario listed above. In the troubleshooting section I will note any restrictions where necessary.

So how do we troubleshoot these events?

The first thing we need to do is isolate the cause. So as stated above we need to know what is impacted. Here is a list of things to check in the beginning:

What data sets are impacted?

If it is only State or Perf (sometimes both), then we can rule out SQL permissions, the MEP table, and the SQL edition. This means we can focus on performance or on noisy rules and monitors. If all data sets are affected, then we can eliminate data flood, as a flood only impacts the data set that is flooding the database. Now that we know the data sets, we move on to the next question.

When are the events happening?

If the events are only happening at night, we want to check the SQL edition first, as mentioned above. In the all-data-sets scenario this should be the first thing checked, since it is the quickest to eliminate. If the events also occur during the day, we can rule this cause out just as quickly.

Are there any correlating events?

Sometimes we will see 31553 events or events indicating permissions issues with SQL. We can also check the SQL logs for login and permission failures. If we see these, they are generally related to the SQL accounts defined in the console.

During the DW/Reporting install these accounts are created with spaces for the account and password. If we know they are not supposed to be defined, a quick fix is to "reset" them by opening each account and entering a single space in both the account and password fields.

Once the above change is made, we can restart the RMS Health Service and see if we still get the events. Correlating events such as 31553s can also indicate a couple of other issues that are outside the scope of this document, so keep that in mind when troubleshooting. A 31553 event indicates that something went wrong during the maintenance procedure, and the event text will generally identify the specific failure. For example:

Log Name:      Operations Manager
Source:        Health Service Modules
Date:         
Event ID:      31553
Task Category: Data Warehouse
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      RMS.corp.scom.com
Description:
Data was written to the Data Warehouse staging area but processing failed on one of the subsequent operations.
Exception 'SqlException': Sql execution failed. Error 777971002, Level 16, State 1, Procedure StandardDatasetAggregate, Line 424, Message: Sql execution failed. Error 777971002, Level 16, State 1, Procedure PerformanceAggregate, Line 149, Message: Sql execution failed. Error 8115, Level 16, State 1, Procedure -, Line 1, Message: Arithmetic overflow error converting IDENTITY to data type int.

The above error indicates an issue where the table needs to be reseeded. Our partition tables have an identity column of type "int", which means they have a maximum row count of 2147483647. We normally never reach this limit; we usually see this error because the identity value is higher than the actual row count. I will not get into how we end up in this situation, but to detect and resolve it we need to do the following:

Check the current ident value:

DBCC CHECKIDENT ('TableName')

If the above verifies the value is too high, run this:

SELECT COUNT(*) FROM TableName -- tells us the value to use in the next query

DBCC CHECKIDENT ('TableName', RESEED, 3) -- the "3" here is the result from the last query

The above steps will get the total number of rows currently in the table and reset our identity so that the next row entered gets the correct value.

Now let’s get back to our problem for this document.

Take a scenario where the Operations Manager event log contains quite a few 31552 events. We are going to assume two situations, one at a time.

All workflows are impacted and we think it is SQL Performance

SQL performance can be very tricky to troubleshoot, and I suggest you get a SQL engineer involved to assist with the diagnosis. Before you do, here are some things you should check yourself.

First, let's check some performance counters. The key counters, along with the thresholds to evaluate them against, are listed below.

When looking at these counters we need to verify where the database files are located. If the database files are on the E and F drives, then we need the counters for those drives in Perfmon. Most of the above counters are open to interpretation, but PLE (Page Life Expectancy) is more specific: it should never drop below 300 seconds. This counter indicates how long a page stays in memory before SQL cycles it out to make room for a new page. If it gets too low, the maintenance process can take too long to process data and time out, resulting in the event. Manoj Parvathaneni discusses SQL performance in his Grey Agent Troubleshooter located here; you can reference that for what to look for in the counters listed in this document. To save time, here is the section from his guide relevant to our investigation:

· MSSQL$<instance>: Buffer Manager: Page Life expectancy – How long pages persist in the buffer pool. If this value is below 300 seconds, it may indicate that the server could use more memory. It could also result from index fragmentation.
· MSSQL$<instance>: Buffer Manager: Lazy Writes/sec – Lazy writer frees space in the buffer by moving pages to disk. Generally, the value should not consistently exceed 20 writes per second. Ideally, it would be close to zero.
· Memory: Available Mbytes - Values below 100 MB may indicate memory pressure. Memory pressure is clearly present when this amount is less than 10 MB.
· Process: Private Bytes: _Total – This is the amount of memory (physical and page) being used by all processes combined.
· Process: Working Set: _Total – This is the amount of physical memory being used by all processes combined. If the value for this counter is significantly below the value for Process: Private Bytes: _Total, it indicates that processes are paging too heavily. A difference of more than 10% is probably significant.
Counters to identify disk pressure: Capture these Physical Disk counters for all drives containing SQL data or log files:
· % Idle Time – How much disk idle time is being reported. Anything below 50% could indicate a disk bottleneck.
· Avg. Disk Queue Length – This value should not exceed 2 times the number of spindles on a LUN. For example, if a LUN has 25 spindles, a value of 50 is acceptable. However, if a LUN has 10 spindles, a value of 25 is too high. You could use the following formulas based on the RAID level and number of disks in the RAID configuration
RAID 0 – All of the disks are doing work in a RAID 0 set
Average Disk Queue Length <= # (Disks in the array) *2
RAID 1 – half the disks are “doing work” so only half of them can be counted toward Disks Queue
Average Disk Queue Length <= # (Disks in the array/2) *2
RAID 10 – half the disks are “doing work” so only half of them can be counted toward Disks Queue
Average Disk Queue Length <= # (Disks in the array/2) *2
RAID 5 – All of the disks are doing work in a RAID 5 set
Average Disk Queue Length <= # Disks in the array *2
· Avg. Disk sec/Transfer – The number of seconds it takes to complete one disk I/O.
· Avg. Disk sec/Read – The average time, in seconds, of a read of data from the disk.
· Avg. Disk sec/Write – The average time, in seconds, of a write of data to the disk.
The above three counters should be around .020 (20 ms) or below consistently and never exceed .050 (50 ms). Here are the thresholds documented in the SQL performance troubleshooting guide:
Less than 10 ms – very good
Between 10 - 20 ms – okay
Between 20 - 50 ms – slow, needs attention
Greater than 50 ms – Serious I/O bottleneck
· Disk Bytes/sec – The number of bytes being transferred to or from the disk per second.
· Disk Transfers/sec – The number of input and output operations per second (IOPS).
When % Idle Time is low (10% or less) – which means that the disk is fully utilized – the above two counters will provide a good indication of the maximum throughput of the drive in bytes and in IOPS, respectively. The throughput of a SAN drive is highly variable, depending on the number of spindles, the speed of the drives and the speed of the channel. The best bet is to check with the SAN vendor to find out how many bytes and IOPS the drive should support. If % Idle Time is low and the values for these two counters do not meet the expected throughput of the drive, engage the SAN vendor to troubleshoot.
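If Perfmon data is not handy, current values for some of these counters can also be pulled from SQL Server's dynamic management views. This is a sketch using sys.dm_os_performance_counters and sys.dm_io_virtual_file_stats, which are available in SQL Server 2005 and later:

```sql
-- Current Page Life Expectancy from SQL Server's own counters
SELECT cntr_value AS PageLifeExpectancySeconds
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Page life expectancy'
  AND object_name LIKE '%Buffer Manager%'

-- Average I/O latency (ms) per database file since the instance last started
SELECT DB_NAME(database_id) AS DatabaseName, file_id,
       io_stall_read_ms  / NULLIF(num_of_reads, 0)  AS AvgReadMs,
       io_stall_write_ms / NULLIF(num_of_writes, 0) AS AvgWriteMs
FROM sys.dm_io_virtual_file_stats(NULL, NULL)
```

Note that the file-stats figures are cumulative averages since SQL Server started, so they smooth over short spikes; Perfmon remains the better tool for observing latency in real time.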
The following links are great resources for getting deeper insight into troubleshooting SQL performance:
Troubleshooting Performance Problems in SQL Server 2005:
http://technet.microsoft.com/en-us/library/cc966540.aspx
Troubleshooting Performance Problems in SQL Server 2008:
http://msdn.microsoft.com/en-us/library/dd672789(SQL.100).aspx

Only the State Data Set is in the 31552 events

A lot of this is covered in my Blank Reports troubleshooter located here. That document covers some additional steps that are not listed here because they are not relevant to this scenario. When only one data set is present in the events, we can take the approach of a data flood or excessive data; both are investigated the same way. The first thing we need to do is find out how much data is being inserted into the DW. To see that, we can run the following:

SELECT datepart(dd,[DateTime]) as day
, count( * ) as Count_of_Events
FROM [State].[vStateRaw] s with (NOLOCK)
WHERE ([DateTime] BETWEEN
CONVERT(datetime, '2011-07-01 00:00:00', 120)
AND CONVERT(datetime, '2011-07-30 00:00:00', 120))
group by datepart(dd,[DateTime])
order by 1 asc

The above query will list out the amount of State data insertions per day, during the timeframe configured in the query. An example of the result is:

day Count_of_Events
1 59336
2 29516
3 17585
4 15396
5 15023
6 14792
7 17538
8 61737
9 2687722
10 2017917
11 1857917
12 20904
13 30034
14 29353
15 30565

……..

As you can see, there was a flood on the 9th, 10th, and 11th of the month, which probably caused our issue. We can diagnose this further, look at insertions per hour, and see which monitor was possibly causing the flood.

--to show what items per hour per day in state are logged

SELECT datepart(hh,[DateTime]) as hr, count( * ) as Count_of_Events FROM [State].[StateRaw_<guid>] s with (NOLOCK) WHERE ([DateTime] BETWEEN
CONVERT(datetime, 'yyyy-mm-dd hh:mm:ss', 120) AND CONVERT(datetime, 'yyyy-mm-dd hh:mm:ss', 120))
group by datepart(hh,[DateTime])
order by 1 asc

--example of the above but with times added

SELECT datepart(hh,[DateTime]) as hr, count( * ) as Count_of_Events FROM [State].[StateRaw_561ADF0271A34D38AAA027F790BEDF82] s with (NOLOCK) WHERE ([DateTime] BETWEEN
CONVERT(datetime, '2011-04-01 00:00:00', 120) AND CONVERT(datetime, '2011-04-01 23:59:59', 120))
group by datepart(hh,[DateTime])
order by 1 asc

-- Noisiest monitors during the hours of interest

select datepart(hh,[DateTime]) as hr, ManagedEntityMonitorRowId,
count(StateRawRowid) as Count_Of_Entries from [State].[StateRaw_561ADF0271A34D38AAA027F790BEDF82] s with (NOLOCK)
WHERE ([DateTime] BETWEEN CONVERT(datetime, '2011-03-31 00:00:00', 120)
AND CONVERT(datetime, '2011-03-31 23:59:59', 120))
group by datepart(hh,[DateTime]),ManagedEntityMonitorRowId
having count(StateRawRowid) > 100 -- for now ignore anything with less than 100 entries
order by 1,3 desc

The above queries are useful if, on the days of the flood, there was an extremely large number of insertions. We may have to flip the DirtyInd rows for specific timeframes to 0 to allow the rest of the data to aggregate. If this is the case, I recommend you work with a senior engineer to make these changes. You can also look into the monitor causing the flood and make any changes necessary.
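For reference, flipping DirtyInd for a timeframe is conceptually an update like the following. This is a sketch only, using the columns from the DirtyInd query later in this document and placeholder dates; do not run it without a senior engineer's guidance, and understand that raw data in the skipped window will never be aggregated:

```sql
-- SKETCH ONLY: mark aggregation history rows in the flood window as clean
-- so maintenance stops trying to aggregate them; raw data in this window
-- will not appear in hourly/daily aggregated reports
DECLARE @DatasetId uniqueidentifier
SELECT @DatasetId = DatasetId
FROM Dataset
WHERE DatasetDefaultName = 'State data set'

UPDATE StandardDatasetAggregationHistory
SET DirtyInd = 0
WHERE DatasetId = @DatasetId
  AND AggregationDateTime BETWEEN '2011-07-09' AND '2011-07-12' -- placeholder flood window
```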

Now that we have this information, we can check how many DirtyInd rows there are for the data set. To check that, run this:

DECLARE @DatasetId uniqueidentifier
SELECT
@DatasetId = DatasetId
FROM Dataset d
WHERE (d.DatasetDefaultName = 'State data set')
Select AggregationDateTime, AggregationTypeId
From StandardDatasetAggregationHistory
Where DatasetId = @DatasetId
And
DirtyInd = 1

This will output the number of DirtyInd rows we have to deal with. We should expect to see at least two entries: one for the current hour and a second for the current day. Anything under 5-10 rows can probably be ignored, but in our example it returned 550 rows. With a result this high, it will take quite some time to get caught up. The amount of data we can move at one time is restricted by the MaxRowsToGroom column in the StandardDataSetAggregation table. For State data, this is set to 50000 rows by default. We can increase it during our tests to, say, 100000 rows by running the following:

Update StandardDatasetAggregation
Set MaxRowsToGroom = 100000
Where GroomStoredProcedureName = 'StateGroom'

This will let our next step move more data at one time. Before running it, we need to make sure there is enough space in the transaction log and in tempdb. We also need to override the Standard Data Set Maintenance rule in the console so the workflow does not run while we run it manually.

Once we have verified this, we can run the below query:

DECLARE @DataSet uniqueidentifier
SET @DataSet = (SELECT DatasetId FROM StandardDataset WHERE SchemaName = 'State')
EXEC standarddatasetmaintenance @DataSet

The above query manually executes maintenance for the specified data set. Sometimes a single run is not enough (if there are too many DirtyInd rows), so we need to run it more than once. You can run the below to execute it in a loop:

declare @i int
set @i=1
while(@i<=500)
begin
DECLARE @DataSet uniqueidentifier
SET @DataSet = (SELECT DatasetId FROM StandardDataset WHERE SchemaName = 'State')
EXEC standarddatasetmaintenance @DataSet
set @i=@i+1
Waitfor delay '00:00:05'
End

After running this query, return to the DirtyInd query to see how many rows are left. We can also force maintenance to keep running until we get below a defined number of rows (not recommended, but useful when the customer does not want to babysit the query) by running this instead:

DECLARE @i int
DECLARE @DataSet uniqueidentifier
SET @DataSet = (SELECT DatasetId FROM StandardDataset WHERE SchemaName = 'State')
While(Select COUNT(*) from StandardDatasetAggregationHistory where DirtyInd = 1 And DatasetId = @DataSet) > 10
Begin
EXEC standarddatasetmaintenance @DataSet
Waitfor delay '00:00:05'
End

Once the above is finished, we should have all the data aggregated. I recommend resetting the MaxRowsToGroom column back to its default and then removing the override on the Standard Data Set Maintenance rule. Then monitor the event log to see if the events return.
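Resetting the value back is the same UPDATE run earlier, using the 50000-row default for State data mentioned above:

```sql
-- Reset MaxRowsToGroom for State data back to the default of 50000
UPDATE StandardDatasetAggregation
SET MaxRowsToGroom = 50000
WHERE GroomStoredProcedureName = 'StateGroom'
```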

Fred Lee | System Center Support Escalation Engineer

App-V Team blog: http://blogs.technet.com/appv/
AVIcode Team blog: http://blogs.technet.com/b/avicode
ConfigMgr Support Team blog: http://blogs.technet.com/configurationmgr/
DPM Team blog: http://blogs.technet.com/dpm/
MED-V Team blog: http://blogs.technet.com/medv/
OOB Support Team blog: http://blogs.technet.com/oob/
Opalis Team blog: http://blogs.technet.com/opalis
Orchestrator Support Team blog: http://blogs.technet.com/b/orchestrator/
OpsMgr Support Team blog: http://blogs.technet.com/operationsmgr/
SCMDM Support Team blog: http://blogs.technet.com/mdm/
SCVMM Team blog: http://blogs.technet.com/scvmm
Server App-V Team blog: http://blogs.technet.com/b/serverappv
Service Manager Team blog: http://blogs.technet.com/b/servicemanager
System Center Essentials Team blog: http://blogs.technet.com/b/systemcenteressentials
WSUS Support Team blog: http://blogs.technet.com/sus/
