Visible Ops

Link: http://www.wikisummaries.org/Visible_Ops

What is ITIL?

  • ITIL = Information Technology Infrastructure Library
  • A "drastically different approach to IT" (p79)
  • A "maturity path for IT that is not based on technology" (p79)
  • A "collection of best practices codified in seven books by the Office of Government Commerce in the U.K." (p85)
  • A collection "without prioritization or any prescriptive structure" (p18)
  • Used by the Visible Ops authors as a framework "to normalize terminology" and to categorize traits shared across the high-performing organizations they studied (p18-20)

Introduction

(p10-24)

What is Visible Ops?

  • Highest ROI best practices divided into four prioritized and incremental Phases
  • All ideas are mapped to ITIL terminology
  • Intended to be an "on-ramp" to ITIL

Key premises of the Visible Ops rationale

  1. 80% of unplanned outages are due to ill-planned changes made by administrators ("operations staff") or developers
  2. 80% of Mean Time To Repair (MTTR) is spent determining what changed
  3. With the right processes in place, it is easier, better, and more predictable to rebuild infrastructure than to repair it
  4. Concentrating staff time on pre-production efforts is more efficient and less expensive due to the high cost of repairing defects while in production
  5. Without process controls, pieces of infrastructure often become like unique snowflakes or irreplaceable works of art ... only understood by the "rocket scientist" creator whose time is tied to maintaining them (p41)
  6. "You can not manage what you can not measure" (p59)

Phase One: Stabilize the Patient

(p25-40)

Goals

  • Identify most critical IT systems generating the most unplanned work
  • Stabilize infrastructure (prioritizing the most fragile components)
  • Create a "culture of causality" where all changes are viewed as key risks that need to be managed by facts rather than by beliefs
  • Reduce unplanned work to 25% or less (high performers achieve lower than 5%)
  • Maximize change success rates (high performers hit 98%)
  • Minimize Mean Time to Repair (MTTR)
  • Ensure security specialists become part of the decision process
  • Shift staff time from "perpetual firefighting to more proactive work that addresses the root causes of problems"
  • Minimize the IT failures that cause stress and damage IT's reputation
  • Increase the overall level of confidence in IT
  • Collect data that validates the new processes and shows that earlier perceptions of nimbleness and speed did not account for the time spent troubleshooting and doing unplanned work

Recommended key steps (to be implemented on most fragile systems first)

  1. Reduce or eliminate change privileges to fragile infrastructure
    • Why? Every time a change is made you risk breaking functionality
  2. Create scheduled maintenance windows where all changes are made
    • Why? Scheduled changes are more visible, and are more likely to be planned and tested before going into production
  3. Automate daily scans to detect and report changes
    • Why? To automatically verify and log that all scheduled changes were made ... and that no other changes were made
    • Warning: Based on their collected data, the authors strongly recommend that even the most trusted administrators still work under automated detection
    • Disclosure: One of the authors is the CTO at Tripwire, Inc., the maker of the software recommended for these automated scans
  4. When troubleshooting incidents, first analyze the recent changes (approved and detected) to isolate likely causes before recommending additional changes
  5. Schedule a weekly Change Advisory Board (CAB) made up of representatives from operations, networking, security and the service desk
    • Why? To ensure key stakeholders collectively inform and influence change decisions
  6. Create a Change Advisory Board - Emergency Committee (CAB/EC) that can assemble quickly to review emergency change requests
    • Why? "Emergency changes are the most critical to scrutinize"
  7. Create a Change Request Tracking System to document and track requests for changes (RFCs) through authorization, verification, and implementation processes
    • Why? To facilitate the change approval process and to generate reports with metrics
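
The automated scan in step 3 can be pictured as a comparison between a stored baseline and the current state of the system. The sketch below is only an illustration under stated assumptions, not the tool the book recommends: it hashes files with SHA-256, and the function names (`snapshot`, `detect_changes`, `unauthorized`) and the idea of representing approved RFCs as a set of file paths are hypothetical.

```python
import hashlib
from pathlib import Path


def snapshot(root: Path) -> dict[str, str]:
    """Map each file under root (by relative path) to its SHA-256 digest."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def detect_changes(baseline: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Compare two snapshots and report added, removed, and modified paths."""
    return {
        "added": sorted(current.keys() - baseline.keys()),
        "removed": sorted(baseline.keys() - current.keys()),
        "modified": sorted(
            path
            for path in baseline.keys() & current.keys()
            if baseline[path] != current[path]
        ),
    }


def unauthorized(changes: dict[str, list[str]], approved: set[str]) -> list[str]:
    """Return detected changes that no approved change request accounts for.

    `approved` is a hypothetical set of paths covered by authorized RFCs.
    """
    detected = set(changes["added"]) | set(changes["removed"]) | set(changes["modified"])
    return sorted(detected - approved)
```

Run daily, a report like this serves both purposes named in step 3: an empty `unauthorized` list confirms that nothing changed outside the schedule, and an approved path that does not appear among the detected changes points to a scheduled change that was never made.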

Phase Two: Catch & Release and Find Fragile Artifacts

(p41-46)

Goals

  • Prioritize IT's most critical services
  • Identify critical pieces of production infrastructure (hardware and software)
  • Identify interdependencies between components of production infrastructure
  • Foster organizational learning
  • Identify the high-risk "fragile artifacts"

Recommended key steps

  1. Create a prioritized service catalog that documents the most critical services
  2. Create a Configuration Management Database (CMDB) that illustrates mappings between services and infrastructure, and shows the interdependencies between all configuration items (CI)
  3. Freeze all related configurations for an agreed-upon change-free window
    • Why? To ensure an accurate inter-related configuration inventory (see below)
  4. Inventory all equipment and software in the data center, recording the whos, whats, interdependencies and history for each item
    • Why? To facilitate faster problem management and to inform change decisions
    • Note: This inventory should be carried out by the most senior staff so that configuration details and histories are captured by the people who know them best
  5. Identify the "fragile artifacts" that have the worst historical change success rates and/or the least technical mastery by the supporting technicians, and prioritize them by the criticality of the services they provide
    • Why? To create a prioritized list of servers to rebuild in Phase Three
  6. To the extent possible, place fragile artifacts under a permanent configuration freeze until they can be replaced by complete rebuilds in Phase Three
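
Steps 2 and 5 can be sketched as a small data model: configuration items with dependency links, plus a ranking of the fragile ones. This is a minimal illustration, not the book's schema. The `ConfigItem` fields, the 1-to-5 criticality scale, and the 0.8 success-rate threshold are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ConfigItem:
    name: str
    criticality: int  # hypothetical scale: 1 (low) to 5 (high)
    changes_attempted: int = 0
    changes_succeeded: int = 0
    depends_on: list[str] = field(default_factory=list)

    @property
    def change_success_rate(self) -> float:
        if self.changes_attempted == 0:
            return 1.0  # no change history, so no recorded failures
        return self.changes_succeeded / self.changes_attempted


def impacted_by(cmdb: dict[str, ConfigItem], target: str) -> set[str]:
    """Names of all CIs that depend on `target`, directly or transitively."""
    impacted: set[str] = set()
    frontier = [target]
    while frontier:
        current = frontier.pop()
        for ci in cmdb.values():
            if current in ci.depends_on and ci.name not in impacted:
                impacted.add(ci.name)
                frontier.append(ci.name)
    return impacted


def fragile_artifacts(cmdb: dict[str, ConfigItem], threshold: float = 0.8) -> list[ConfigItem]:
    """CIs below the success-rate threshold, most critical first."""
    fragile = [ci for ci in cmdb.values() if ci.change_success_rate < threshold]
    return sorted(fragile, key=lambda ci: (-ci.criticality, ci.change_success_rate))
```

The ordering returned by `fragile_artifacts` corresponds to the prioritized rebuild list that step 5 hands to Phase Three, and `impacted_by` shows which services a change to one component could affect. Technical mastery, the other fragility signal named in step 5, is not modeled here.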

Phase Three: Establish Repeatable Build Library

(p47-58)

Goals

  • Remove processes that encourage heroics by rewarding vigilant firefighters
  • Increase team-level technical mastery of production infrastructure
  • Shift senior staff from firefighting to fire prevention
  • Ensure that critical infrastructure can be easily rebuilt
  • Enable a new troubleshooting process with a short, predictable Mean Time To Repair (MTTR)
  • Ensure perfect configuration synchronization between pre-production and production servers
  • Ensure all configurations and build processes are completely documented

Recommended key steps (to be implemented on most fragile systems first)

  1. Create and maintain a versioned, Definitive Software Library (DSL) for all acquired and custom developed software and patches
    • Note: additions must be approved by the Change Advisory Board (CAB)
    • Exception: at the time of initial creation, all currently used production software will be accepted into the DSL under a one year grace period
  2. Create a team of release management engineers from your most senior operations staff. Only more junior staff will be on the production operations team.
  3. Prevent developers and the release management engineers (previously the senior operations staff) from accessing production infrastructure
    • Reason 1: Policy encourages recommended changes to be error free with bullet-proof installation and back-out processes in place
    • Reason 2: Process verifies completeness and accuracy of documentation for installation and operations procedures
  4. Release management engineers create automated, consolidated, integrated, patched, tested, security scanned, layer-able build packages which will then be provisioned onto production infrastructure by the more junior, production operations staff
    • Reason 1: Consolidates the number of unique configuration counts (and thus increases team mastery of those fewer configurations)
    • Reason 2: Ensures fully integrated quality assurance tests and security verifications
  5. Updates and even non-emergency patches are then rolled into a new "golden build" which is then applied to production hardware as a new build
    • Reason 1: Eliminates the risk of "patch and pray"
    • Reason 2: Otherwise, over time, break/fix cycles tend to encourage configuration variance between production and pre-production servers ... and between similar servers that should be identical
    • Reason 3: Applying new builds allows for highly accurate predictions of downtime, reduces chances of human error, and is typically faster than applying numerous individual patches and updates
  6. As a general rule, erase the production hard drive (or partition) before installing a build package ... the book calls this a "bare-metal build"
    • Why? This process ensures that production servers do not contain any hidden dependencies, and guarantees that the "golden builds" accurately reflect production systems, enabling perfect synchronization with pre-production servers
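
Configuration variance against a golden build can be checked mechanically. The sketch below assumes each build and each server can be described by a manifest mapping package names to versions. That representation and the function names `drift` and `percent_matching` are hypothetical and chosen only for illustration.

```python
from typing import Optional


def drift(golden: dict[str, str], server: dict[str, str]) -> dict[str, tuple[Optional[str], Optional[str]]]:
    """Packages that differ from the golden build: name -> (expected, actual).

    A value of None means the package is absent on that side.
    """
    names = golden.keys() | server.keys()
    return {
        name: (golden.get(name), server.get(name))
        for name in sorted(names)
        if golden.get(name) != server.get(name)
    }


def percent_matching(golden: dict[str, str], fleet: dict[str, dict[str, str]]) -> float:
    """Percent of servers whose manifest is identical to the golden build."""
    if not fleet:
        return 100.0  # an empty fleet has no variance to report
    matching = sum(1 for manifest in fleet.values() if not drift(golden, manifest))
    return 100.0 * matching / len(fleet)
```

A server with any drift is a candidate for a fresh rebuild from the golden build rather than an in-place fix, in line with steps 5 and 6.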

Phase Four: Enable Continuous Improvement

(p59-64)

Goals

  • Continuous increase in technical mastery of production infrastructure by reducing configuration variance
  • Continuous improvement of change success rates
  • Continuous increases in effective rate of change
  • Continuous monitoring to avoid slips in performance

Recommended key steps

  1. Use recommended metrics to hone efforts from the first three Phases. A few selected examples:
    • Percent of systems that match known good builds (higher is better)
    • Time to provision known good builds (lower is better)
    • Percent of builds that have security sign off (higher is better)
    • Number of authorized changes per week (higher is better)
    • Change success rate (higher is better)
  2. Strive to implement additional recommended improvement points. A few selected examples:
    • Segregate the development, test, and production systems to safeguard against any possible unintentional crossovers or hidden dependencies
    • Enforce a standard build across all similar devices
    • Define bullet-proof back out processes to recover from failed or unauthorized changes
    • Internalize the fundamental relationship between Mean Time to Repair (MTTR) and availability. By improving MTTR you also improve overall availability.
    • Track repeat offenders who circumvent change management policies.
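
Two of the metrics above reduce to simple arithmetic. The sketch below uses the common steady-state formula availability = MTBF / (MTBF + MTTR), where MTBF is mean time between failures. That formula is a standard reliability definition rather than something quoted from the book, and the function names are illustrative.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the service is up."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def change_success_rate(outcomes: list[bool]) -> float:
    """Fraction of changes that succeeded; each entry is True for a success."""
    if not outcomes:
        raise ValueError("no changes recorded")
    return sum(outcomes) / len(outcomes)
```

With a failure every 99 hours on average and a 1-hour MTTR, availability is 99 / 100 = 0.99. Halving MTTR to 0.5 hours raises it to 99 / 99.5, roughly 0.995, with no change in how often failures occur.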
Time: 2024-10-11 02:36:57
