Description
The university has established operational incident-handling capabilities designed to reduce the impact of security incidents, including preparation, detection, analysis, containment, recovery, and user response activities. Service availability issues also fall under this incident response procedure.
Scope
This policy applies to everyone who accesses University data or information resources.
Incident Response Procedure
Declaring an Incident
Anyone in LTS can declare an incident, regardless of how small or large its impact is. Users outside of LTS should contact the Help Desk or submit a ticket, and the Help Desk will escalate the ticket as needed. The following steps must be taken to officially declare an incident.
The incident must be officially declared in the Slack #incidentmanagement channel.
Incidents can be declared for applications, systems, services, or security events that occur.
Outages, service degradations, or security incidents affecting Tier 0 and Tier 1 services must be declared as incidents unless only a limited number of users are impacted.
Involvement and communications may vary between declared incidents.
The Incident Owner must be identified in the Slack #incidentmanagement channel.
The Incident Owner must provide a Zoom link in the Slack #incidentmanagement channel for those who want to meet to discuss updates or work on the issue. Technical teams working on the issue can use whatever collaboration methods they want, including this Zoom session.
Use only one Zoom link for the incident. If there is a need for smaller discussions, use the Breakout Room feature in Zoom.
It is highly recommended that the Zoom session be recorded. There are no security, compliance, or insurance concerns over recording incident response Zoom sessions.
Important: Use threads to organize communications and discussions. For larger incidents, it’s acceptable to create a tech working thread and a communications discussion thread. If multiple incidents occur at the same time, multiple threads should be created.
Use the channel's pin-message feature to bookmark important links such as communications documents.
Use the Canvas feature in the channel to track updates to impacted services.
The Incident Owner or Help Desk creates LTS Alert(s) if needed.
Note: In the event that Slack is not available, Google Chat is an acceptable alternative. If Zoom is unavailable, Google Meet can be used.
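The declaration steps above can be sketched as a small checklist. This is only an illustration under stated assumptions: the `IncidentDeclaration` class, its field names, and the helper functions are hypothetical and not part of any LTS tooling. It encodes only the requirements named above (declaration in the channel, an identified Incident Owner, a single Zoom link) and the fallbacks from the note.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class IncidentDeclaration:
    """Hypothetical record of the declaration steps; field names are illustrative."""
    declared_in_channel: bool = False
    incident_owner: Optional[str] = None
    meeting_link: Optional[str] = None


def missing_steps(decl: IncidentDeclaration) -> List[str]:
    """Return the required declaration steps that have not been completed yet."""
    missing = []
    if not decl.declared_in_channel:
        missing.append("declare the incident in #incidentmanagement")
    if not decl.incident_owner:
        missing.append("identify the Incident Owner")
    if not decl.meeting_link:
        missing.append("provide a single Zoom link")
    return missing


def collaboration_tools(slack_available: bool = True,
                        zoom_available: bool = True) -> Dict[str, str]:
    """Pick chat and meeting tools, using the fallbacks named in the note above."""
    return {
        "chat": "Slack" if slack_available else "Google Chat",
        "meeting": "Zoom" if zoom_available else "Google Meet",
    }
```

An empty `IncidentDeclaration()` reports all three steps as missing; once the channel post, owner, and link are recorded, `missing_steps` returns an empty list.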
Incident Management vs Operations Slack Channel
While there are no clear rules for when an incident should be declared in the Slack #incidentmanagement channel, below are some general guidelines for when to use the #incidentmanagement channel versus the #operations channel. Regardless, please make sure all incidents and disruptions are communicated through one of these channels.
Incident Management Channel
Anything posted here will get the attention of staff and operations very quickly.
Degradation or outage of Tier 0 and 1 services.
Incidents posted here will have an Incident Manager and the service incident will be communicated in some form to the campus.
A blameless retrospective will be conducted.
Operations Channel
Non-critical and non-emergency operations issues and discussions will be posted here.
Scheduled downtime notifications will be posted here.
Monitored by Operations.
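The guidelines above can be read as a rough routing rule. The sketch below is an illustration only, not an official rule: the function name `suggest_channel` and its parameters are hypothetical, and because the text does not define what counts as "a limited number of users", that judgment is passed in as a boolean.

```python
def suggest_channel(tier: int, service_impacted: bool, limited_users: bool = False) -> str:
    """Suggest a Slack channel for an issue, following the guidelines above.

    tier: service tier number (Tier 0 and Tier 1 are the tiers named in the policy).
    service_impacted: True for an outage, degradation, or security incident.
    limited_users: a human judgment call; the policy gives no numeric threshold.
    """
    if service_impacted and tier in (0, 1) and not limited_users:
        # Declaration is mandatory for this case.
        return "#incidentmanagement"
    # Non-critical issues, discussions, and scheduled downtime notices.
    return "#operations"
```

The return value is only a default suggestion: anyone in LTS may still declare an incident in #incidentmanagement regardless of tier or impact.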
Communicating the Incident
The LTS Communications Strategist or the Help Desk claims ownership of communicating the incident by notifying the Slack #incidentmanagement channel and providing a link to the Google Document for the communications.
The Incident Owner contributes to the wording of the communications to ensure thoroughness and accuracy.
Note: In the event that Slack is not available, Google Chat is an acceptable alternative.
Closing out an Incident
When an incident has been officially closed out by the Incident Owner, the following steps must be completed.
The Incident Owner declares the incident over in the Slack #incidentmanagement channel thread(s).
The Incident Owner closes out LTS Alert(s).
The Incident Owner must conduct a retrospective to document findings, observations, and corrective/preventive actions.
Retrospective Process
As part of the process to close out an incident, a blameless retrospective must be conducted by the Incident Owner. The retrospective must be run so that no one is blamed for the incident itself or for any activity carried out through its resolution. It must take place in an open, honest, learning-focused environment so that we can capture findings, observations, and actions to continuously improve our services.
All LTS Retrospectives are stored in Google Shared Drive: https://drive.google.com/drive/folders/1cE9od_f0d9cqDqeEG9scYB_x6oIwcRIx
The Retrospective template is located at https://docs.google.com/document/d/1tRgl6V8nNqUtOIadrcGDqkXylmQnZ0GaR8X5_TpTFXk/edit
Key information captured includes:
Incident Summary: Executive summary of the incident
Detection: How were we notified of the incident? Our management tools? Help Desk ticket?
Incident Time Frame: Elapsed time between incident notification, incident declaration, and complete restoration of services.
Root Cause: What was the root cause of the incident, if known?
Resolution: What resolved the incident to return the service to normal operations?
Team Members: Who was involved in resolving the incident, identifying roles such as Incident Owner.
Timeline (Sequence of events from first incident through normal operations): Document key events that occurred during the incident such as notifications, communications, debugging, testing, changes in service state, and final resolution.
Findings and Observations (Positives and Negatives): Open and honest assessment of the incident. What went well, what went wrong, and where did we get lucky?
Actions: Document any corrective, preventive, and process actions. Assign owners and migrate actions into Jira in either team, project, or general tickets for tracking and documentation purposes.
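As an illustration of the Incident Time Frame item, the intervals between notification, declaration, and restoration can be computed from three timestamps. This is a minimal sketch: the `IncidentTimeFrame` class and its method names are hypothetical, and the timestamps in the usage example are made up.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentTimeFrame:
    """Hypothetical container for the three timestamps named under Incident Time Frame."""
    notified_at: datetime
    declared_at: datetime
    restored_at: datetime

    def time_to_declare(self) -> timedelta:
        """Interval from first notification to the incident declaration."""
        return self.declared_at - self.notified_at

    def time_to_restore(self) -> timedelta:
        """Interval from the declaration to complete service restoration."""
        return self.restored_at - self.declared_at

    def total_duration(self) -> timedelta:
        """Interval from first notification to complete service restoration."""
        return self.restored_at - self.notified_at


# Made-up example values, for illustration only.
frame = IncidentTimeFrame(
    notified_at=datetime(2024, 1, 15, 9, 0),
    declared_at=datetime(2024, 1, 15, 9, 20),
    restored_at=datetime(2024, 1, 15, 11, 5),
)
print(frame.time_to_declare())  # 0:20:00
print(frame.total_duration())   # 2:05:00
```

Recording the three timestamps separately, rather than a single duration, keeps the delay before declaration visible, which can inform later discussion of incident decision making.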
Incident Owner Role
The Incident Owner is responsible for communications between the team working on the incident and those who need updates on its progress. This provides a buffer so that those working on the incident can stay focused on troubleshooting and/or restoring service.
Should not be actively working on remediating the incident.
Does not have to be a member of the team that is working on the incident but should have some technical expertise to be a bridge between those working and those needing updates.
Creates and manages the Zoom session that can be used for communications, debugging, or general updates.
Must coordinate or post to LTS Alerts.
Works with LTS Communications Strategist on communications, including feedback on verbiage.
Declares when the incident is over in the Slack #incidentmanagement channel thread(s), and closes out LTS Alert(s).
Responsible for ensuring the order of operations is followed so that our services are online according to our service priorities.
The Incident Owner must conduct a blameless retrospective to document findings, observations, and corrective/preventive actions.
Incident Response Workflow
Order of Operations and Service Tiers
During an incident, especially a large incident impacting multiple services, the Order of Operations and Service Tiers reference shows the priority order in which services should be brought back online. Priorities may differ at certain times of the day or year, such as when classes are in session or during semester breaks. Order of Operations and Service Tiers can be found at https://lehigh.atlassian.net/wiki/x/ywLDAQ
Incident Response Monitoring
At a later date, an outage that was not declared an incident may be required to go through the blameless retrospective process in order to learn from the outage and improve incident decision making.
Revision History
| Date | Version | Description | Approval |
|---|---|---|---|
| | 1.2 | Changed PM to Retrospective | |
| | 1.1 | Added Order of Operations and Service Tiers | Approved |
| | 1.0 | Final Original Document | Approved |
| | 0.1 | Original Document | Draft |