5 Ways to Use Log Data to Analyze System Performance--reference

Recently we looked at some of the most common behaviors that our
community of 25,000 users look for in their logs, with a particular focus on
web server logs. Our research identified the top 15 web server tags and
alerts created by our customers. You can read more about these in our community
insights section, and you can easily create tags or alerts based
on those patterns to identify these behaviors in your own systems.

This week we are focusing on performance analysis using log data. Again we
looked across our community of over 25,000 users and identified 5 ways in which
people use log data to analyze system performance. As always, customer data was
anonymized and privacy protected. Over the course of the next week we will be
diving into each of these areas in more detail and will feature
customers' first-hand accounts of how they are using logs to help identify
and resolve such issues in their systems.

Our research looked at more than 200k patterns that our community uses
to identify important events in their log data. With a particular focus on
performance-related issues, we identified the following 5 areas as trending and
common across our user base:

1. Slow Response Times: Response times are one of the most
common and useful performance measures available from your
log data. They give you an immediate understanding of how long a
request takes to be returned. For example, web server logs can give you
insight into how long a request takes to return a response to a client device.
This can include the time taken by the different components behind your
web server (application servers, DBs) to process the request, so it
gives an immediate view of how well your application is
performing. Recording response times from the client device/browser
can give you an even more complete picture, since it also captures page load time
in the app/browser as well as network latency.

A good rule of thumb when measuring response times is to follow
the 3 response time limits outlined by Jakob Nielsen in his 1993
publication ‘Usability Engineering’, which are still relevant
today. In short, 0.1 second is about the limit for having the user feel that the
system is reacting instantaneously, 1.0 second is about the limit for the user’s
flow of thought to stay uninterrupted, and 10 seconds is about the limit for
keeping the user’s attention focused on the dialogue.
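
As a rough illustration, the three limits can be turned into a small classifier for measured response times. This is only a sketch: the function name and the bucket labels are made up for this example, and only the 0.1, 1.0 and 10 second values come from the text above.

```python
# Nielsen's three limits, in seconds, as quoted in the article.
INSTANT_LIMIT = 0.1
FLOW_LIMIT = 1.0
ATTENTION_LIMIT = 10.0


def classify_response_time(seconds: float) -> str:
    """Return an illustrative label for a response time in seconds."""
    if seconds <= INSTANT_LIMIT:
        return "instant"          # feels instantaneous
    if seconds <= FLOW_LIMIT:
        return "flow-preserved"   # flow of thought stays uninterrupted
    if seconds <= ATTENTION_LIMIT:
        return "attention-held"   # user still focused on the dialogue
    return "attention-lost"       # beyond the 10 second limit
```

Bucketing response times this way can make it easier to choose the threshold at which an alert should fire.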

Slow response time patterns almost always take the form below:

  • response_time>X

Here response_time is the field holding the server’s or client’s
response time and ‘X’ is a threshold. When it is exceeded, you want the event to be
highlighted or a notification to be sent so that you and your team are aware
that somebody is having a poor user experience.
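
To make the pattern concrete, here is a minimal Python sketch of the same check applied to raw log lines. It assumes a hypothetical log format in which each line carries a `response_time=<seconds>` field; your own logs may name or format the field differently.

```python
import re

# Assumed (hypothetical) field layout: response_time=<seconds>
RESPONSE_TIME_RE = re.compile(r"response_time=(\d+(?:\.\d+)?)")


def slow_requests(lines, threshold):
    """Yield (line, response_time) for lines whose response_time exceeds threshold."""
    for line in lines:
        match = RESPONSE_TIME_RE.search(line)
        if match is None:
            continue  # the line has no response_time field
        value = float(match.group(1))
        if value > threshold:
            yield line, value


sample = [
    "GET /home status=200 response_time=0.08",
    "GET /search status=200 response_time=2.40",
    "GET /health status=200",
]
slow = list(slow_requests(sample, threshold=1.0))
```

With the sample lines above, only the `/search` request is reported, since it is the only one above the 1.0 second threshold.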

2. Memory Issues and Garbage Collection: Out-of-memory
errors can be pretty catastrophic, as they often result in
the application crashing due to lack of resources. You want to know
about these when they occur, so creating tags and generating notifications via
alerts for these events is always recommended.

However, a leading indicator of out-of-memory issues can be your garbage
collection behavior. Tracking it, and getting notified if heap
used vs. free heap space is over a particular threshold or if garbage collection
is taking a long time, can be particularly useful and can often point you in the
direction of memory leaks. Identifying a memory leak before an out-of-memory
exception can be the difference between a major system outage and a simple
server restart until the issue is patched.
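
A check along these lines might look like the sketch below. The function names and the default thresholds (90% heap used, 1 second of garbage collection) are illustrative assumptions, not recommended values.

```python
def heap_used_fraction(used_bytes: int, free_bytes: int) -> float:
    """Fraction of the heap currently in use (used vs. used + free)."""
    total = used_bytes + free_bytes
    if total <= 0:
        raise ValueError("heap size must be positive")
    return used_bytes / total


def should_alert(used_bytes, free_bytes, gc_seconds=0.0,
                 max_fraction=0.9, max_gc_seconds=1.0):
    """True if heap usage or garbage collection time is over its threshold.

    The default thresholds are placeholders chosen for this example.
    """
    heap_too_full = heap_used_fraction(used_bytes, free_bytes) > max_fraction
    gc_too_slow = gc_seconds > max_gc_seconds
    return heap_too_full or gc_too_slow
```

Either condition on its own triggers the alert, which matches the idea of treating both heap pressure and long collections as early warnings.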

Furthermore, slow or long garbage collection can also be one of the reasons
for users experiencing slow application behavior, because during garbage collection
your system can slow down or, in some situations, block until garbage
collection is complete (e.g. with ‘stop the world’ garbage collection).

Below are some examples of common patterns used to identify some of the
memory related issues outlined above:

  • Out of memory

  • exceeds memory limit

  • memory leak detected

  • java.lang.OutOfMemoryError

  • System.OutOfMemoryException

  • memwatch:leak: Ended heapDiff

  • GC AND stats
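
The patterns above can be tried out against sample log lines with a few lines of Python. This sketch assumes two things that the list itself does not spell out: that matching is a case-insensitive substring search, and that `AND` means every term must appear somewhere in the line.

```python
MEMORY_PATTERNS = [
    "Out of memory",
    "exceeds memory limit",
    "memory leak detected",
    "java.lang.OutOfMemoryError",
    "System.OutOfMemoryException",
    "memwatch:leak: Ended heapDiff",
    "GC AND stats",
]


def matches(pattern: str, line: str) -> bool:
    """Case-insensitive match; 'A AND B' requires both terms (assumed semantics)."""
    lowered = line.lower()
    terms = [term.strip().lower() for term in pattern.split(" AND ")]
    return all(term in lowered for term in terms)


def matching_patterns(line: str) -> list:
    """Return every pattern from MEMORY_PATTERNS that the line matches."""
    return [pattern for pattern in MEMORY_PATTERNS if matches(pattern, line)]
```

A line that matches any of these patterns would be a candidate for a tag or an alert.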

3. Deadlocks and Threading Issues

Deadlocks come in many shapes and sizes and can have pretty bad
effects when they occur, anything from bringing your system to a
complete halt to simply slowing it down. In short, a deadlock is a
situation in which two or more competing actions are each waiting for the other
to finish, and thus neither ever does. For example, we say that a set of
processes or threads is deadlocked when each process or thread in the set is
waiting for an event that only another member of the set can cause.
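
The definition can be illustrated with a small "wait-for" graph, where each thread points at the thread it is waiting on; a cycle in that graph is a deadlock. The sketch below is a simplified model for illustration (each thread waits on at most one other thread, and the thread names are made up), not how any particular runtime detects deadlocks.

```python
def find_deadlock(wait_for):
    """Return the threads forming a wait cycle, or None if there is none.

    wait_for maps a thread name to the thread it is waiting on.
    Threads that are not waiting simply do not appear as keys.
    """
    for start in wait_for:
        seen = []
        current = start
        # Follow the chain of waits until it ends or revisits a thread.
        while current in wait_for and current not in seen:
            seen.append(current)
            current = wait_for[current]
        if current in seen:
            # Everything from the first repeated thread onward is the cycle.
            return seen[seen.index(current):]
    return None


deadlocked = find_deadlock({"A": "B", "B": "A"})
healthy = find_deadlock({"A": "B", "B": "C"})
```

In the first call, A waits on B and B waits on A, so neither can ever proceed. In the second, C is not waiting on anything, so the chain ends and no deadlock is reported.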

Not surprisingly, deadlocks feature as one of the top 5 performance-related
issues that our users write patterns to detect in their systems.

Most deadlock patterns simply contain the keyword ‘deadlock’, but some of the
common patterns have the following structure:

  • ‘deadlock’

  • ‘Deadlock found when trying to get lock’

  • ‘Unexpected error while processing request: deadlock;’

4. High Resource Usage (CPU/Disk/Network)

In many cases a slowdown in system performance is not the result of any
major software flaw, but a simple case of the load on your system
increasing without additional resources being available to deal with it.
Tracking resource usage allows you to see when you require additional
capacity so that you can, for example, kick off more server instances.

Example patterns used when analysing resource usage:

  • metric=/CPUUtilization/ AND minimum>X

  • cpu>X

  • disk>X

  • disk is at or near capacity

  • not enough space on the disk

  • java.io.IOException: No space left on device

  • insufficient bandwidth
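
Threshold patterns such as cpu>X and disk>X can be checked together when a log line carries several numeric fields. The sketch below assumes a hypothetical `key=value` layout (for example `cpu=93.5 disk=71`) and example thresholds; neither comes from a specific tool.

```python
import re

# Assumed (hypothetical) layout: numeric fields written as key=value.
FIELD_RE = re.compile(r"(\w+)=(\d+(?:\.\d+)?)")


def exceeded(line, thresholds):
    """Return {field: value} for every field in the line above its threshold."""
    over = {}
    for name, raw in FIELD_RE.findall(line):
        value = float(raw)
        if name in thresholds and value > thresholds[name]:
            over[name] = value
    return over


# Example thresholds in percent; pick values that suit your own systems.
limits = {"cpu": 90, "disk": 85}
hot = exceeded("host=web1 cpu=93.5 disk=71", limits)
```

Fields without a configured threshold, such as `host` here, are ignored.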

5. Database Issues and Slow Queries

Knowing when a query failed can be useful, as it allows you to identify
situations in which a request may have returned without the relevant data and thus
helps you identify when users are not getting the data they need. A more
subtle issue arises when a user gets the correct results but the results
take a long time to return. While technically the system may be fine
and bug free, a slow user experience may be hurting your top line.

Tracking slow queries shows you how your DB queries are performing.
Setting acceptable thresholds for query time and reporting on anything that
exceeds these thresholds can help you quickly identify when your users’
experience is being affected.

Example patterns:

  • SqlException

  • SQL Timeout

  • Long query

  • Slow query

  • WARNING: Query took longer than X

  • Query_time > X
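
As a sketch of the Query_time > X pattern, the snippet below pulls query times out of log lines and keeps those above a threshold. It assumes lines of the form `Query_time: <seconds>`, similar to what a MySQL slow query log writes; the sample lines are invented for illustration.

```python
import re

# Assumed layout: "Query_time: <seconds>" somewhere in the line.
QUERY_TIME_RE = re.compile(r"Query_time:\s*(\d+(?:\.\d+)?)")


def slow_query_times(lines, threshold_seconds):
    """Return the query times (in seconds) that exceed threshold_seconds."""
    times = []
    for line in lines:
        match = QUERY_TIME_RE.search(line)
        if match is None:
            continue  # not a query timing line
        seconds = float(match.group(1))
        if seconds > threshold_seconds:
            times.append(seconds)
    return times


sample_log = [
    "# Query_time: 0.120  Lock_time: 0.000",
    "SELECT * FROM orders;",
    "# Query_time: 7.500  Lock_time: 0.010",
]
offenders = slow_query_times(sample_log, threshold_seconds=2.0)
```

Reporting the count or the worst value of `offenders` over time gives a simple view of how query performance is trending.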

As always, let us know if you think we have left out any important issues that
you like to track in your logs. To start tracking your own system
performance, create a free account
and include the patterns listed above to automatically create
tags and alerts relevant for your system.

Published at DZone with permission of Trevor Parsons, author and DZone
MVB. (source)

http://java.dzone.com/articles/5-ways-use-log-data-analyze?mz=110215-high-perf


Time: 2024-07-28 18:50:15
