
Functional Overview of ESM

This document explains how ESM operates. It is referenced by the installation and configuration documentation. Anyone looking to install, configure or administer the ESM application should read this document first.

High Level Overview

An ESM deployment consists of one or more instances of the ESM Agent, a single instance of the ESM Server application, and any number of client machines which access the ESM application interface through the thin client browser-based front end.

An ESM Agent runs on each SAS Application Server node and is responsible for collecting process and server events and metrics, and submitting them to the ESM Server Application. The ESM Server Application collects metric data from each instance of an ESM Agent and stores it in the ESM Database. It also serves the web-based front end for the ESM application interface and performs a minimal level of scheduled database maintenance.

The ESM agent continuously monitors the resource usage of any active (SAS) processes that it knows about. Agents submit their data to the ESM Server web application via a periodic Web Service call. The data resolution, disk polling interval, and other agent settings are configurable via the ESM Admin interface, accessed via the Web GUI.

ESM Overview

How ESM integrates with SAS

In order to maximise compatibility and stability and eliminate any possibility of interfering with the user processes being monitored, ESM's integration with traditional SAS Foundation sessions relies on SAS DATA Step and a common filesystem location for inter-process communication.

Each ESM agent instance is configured with its own events directory, which it continually monitors for event files generated by processes looking to communicate with it. As soon as an event triggerfile appears in this location, it is read by the ESM agent, and if the data it contains is valid, the file is removed. The agent then interprets the information or instruction specified in the event file and acts upon it, for example by monitoring a new process or by forwarding the data within the event file on to the ESM Server.

How events are communicated

Deploying the ESM Agent onto a shared filesystem allows for an instance of an Agent to be started on each node in the cluster without requiring multiple installations. To avoid potential conflicts that may arise from non-unique PIDs, each node has its own dedicated events directory. In such a multi-node or GRID installation, where the filesystem is shared across multiple nodes, the layout of the event directories would look something like this:

opt
   ESM
     esm-agent
       events
         node1_hostname
         node2_hostname
           esm_eventfile_1
           esm_eventfile_2
           esm_eventfile_3
         node3_hostname
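
As an illustration of the layout above, the following minimal Python sketch builds the events directory for one node. The function name node_events_dir is hypothetical, and the fallback order (explicit hostname, then the ESMNODENAME environment variable, then the system hostname) is an assumption rather than documented agent behaviour.

```python
import os
import socket
from pathlib import Path


def node_events_dir(agent_root, hostname=None):
    """Return the events directory for one node under a shared agent install."""
    # Assumption: an explicit hostname wins, then ESMNODENAME, then the
    # system hostname. The real agent's precedence may differ.
    name = hostname or os.environ.get("ESMNODENAME") or socket.gethostname()
    return Path(agent_root) / "events" / name


# Example: the directory that node2 would write its event files into.
print(node_events_dir("/opt/ESM/esm-agent", "node2_hostname"))
```

Because every node resolves to a different subdirectory, two processes with the same PID on different nodes cannot overwrite each other's event files.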

Multiple types of events are supported by ESM.

Event Types

new events

A new event tells ESM to begin monitoring a process and communicates relevant attributes of that process. The following properties are typically communicated:

  • pid - the ID of the process to be monitored
  • hostname - the hostname of the machine (must match configured agent hostname / ESMNODENAME environment variable)
  • owner - the name of the user that the process can be attributed to
  • sasUuid - a unique identifier for this 'session'. Session-provided UUIDs are useful when this value needs to be propagated to sub-sessions as an environment variable for the purposes of reconciliation with parent jobs/sessions. If a value is not provided here it will be automatically generated by the agent
  • queue - the name of the queue that this session / job belongs to
  • jobName - the identifier of the session. For jobs this is typically the job name.
  • workFolder - the temporary directory attributed to the session as transient WORK storage (SAS specific). Can be an array of directory locations*.
  • utilFolder - the temporary directory attributed to the session as transient UTIL storage (SAS specific). Can be an array of directory locations*.
  • logFile - the logfile to be attributed to the session and parsed in real time for events.
  • logs - a list of logFiles, where a job generates more than one logfile or requires more than one log to be followed
  • esmType - the 'session type' is an ESM attribute. Typically it is one of WS, PWS, STP, Batch, GRID, LASR, JVM or SYS, but categories can be added dynamically. SYS sessions are not shown by default, and cannot be acted on (terminated) by users.

tag events

A tag event is a basic event which is attributed to a process at a given time, containing contextually relevant information. It is intended to be used by programmers to help identify progress between code blocks or functions, but can be extended for any purpose where overlaying contextual data flags to the timeseries is beneficial.

A basic tag event has the following properties:

  • text - the title of the tag event, searchable from the Tag Search
  • tooltip - detailed information about the event, shown when the user hovers over the flag to display the tooltip
  • color - the colour of the tag flag, in HTML colour notation

highlightStart and highlightEnd events

Highlight events are called highlightStart and highlightEnd for legacy reasons and are better described as jobStart and jobEnd events. They are a special type of tag event used for communicating information specific to jobs, such as job return codes and job flow information.

A highlightStart event requires the following:

  • pid - the process ID of the job in question
  • hostname - the hostname of the machine the job is executing on (must match configured agent hostname / ESMNODENAME environment variable)
  • uuid - the code-generated unique ID for the job in question. The purpose of this ID is to reconcile the data communicated in the highlightEnd tag with the PID of the job

A highlightEnd event requires the following:

  • hostname - the hostname of the machine the job is executing on (must match configured agent hostname / ESMNODENAME environment variable)
  • uuid - the code-generated unique ID for the job in question. The purpose of this ID is to reconcile the data communicated in the highlightEnd tag with the PID of the job
  • text - the identifier for the job, typically the job name matching the jobName identifier in the new event
  • returnCode - the exit code, or completion status, with which the job terminated (i.e. 0 = success, 1 = warning, 2+ = error). Return codes of 3 and 6 (ABORT exits and internal errors) are treated as errors
  • flow - a colon-separated string of identifiers containing the job's position within the LSF flow hierarchy. This expects the verbatim value of the LSB_JOBNAME environment variable, from which superfluous variables such as user name or LSF job ID are stripped
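
The return code convention listed above can be sketched as a small Python function. The name classify_return_code is hypothetical, and treating values outside the documented range (such as negative codes) as errors is an assumption.

```python
def classify_return_code(return_code):
    """Map a job exit code to the status categories described above."""
    if return_code == 0:
        return "success"
    if return_code == 1:
        return "warning"
    # 2 and above count as errors, which covers ABORT exits and internal
    # errors (3 and 6). Assumption: any other value is also an error.
    return "error"


for code in (0, 1, 3, 6):
    print(code, classify_return_code(code))
```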

A note on UUIDs and highlight tags

The highlight tag mechanism may appear convoluted, but it serves to facilitate the reconciliation of job PIDs and return (exit) codes. When a SAS 'Job' is launched, a SAS process is spawned by the parent instance of the executing script (i.e. sasbatch.sh), and when that job finishes, the return code of the SAS job subprocess is collected by that script. In order to ensure a unique relationship between the session being monitored and the exit code returned upon job termination, the uuid must therefore be generated and exported by the parent context of the sasbatch.sh process so that the highlightStart tag (generated by the job process at startup, once the subprocess ID is known), can be linked to the exit code reported back to the parent process.

Triggerfile formats

When communicating with the agent through the triggerfile mechanism, processes can output triggerfiles for consumption by the agent as either a space-separated array of quoted values, or in JSON format. The space-separated array depends on field order and is considered legacy, but remains around 40% faster to parse and therefore continues to be supported for high-stress environments with little available overhead.

The following are examples of each format.

Warning

For legacy reasons, the ESM Agent matches triggerfiles based on hostname and will disregard any files where the hostname element of the filename does not match the agent hostname exactly. This check may be removed in a future release.

new event triggerfile

File name: new_12345_thishostname (where 12345 is the PID of the process to be monitored and thishostname is the agent recognised hostname)

File content (legacy)

"thisUser" "WS" "Lev1_SASAppTest" "/data/saswork/SAS_workC02600003039_thishostname" "/data/sasutil/tempUtil_12345_thishostname" "/data/logs/someWorkspaceLog.log" "defaultQueue"

File content (json)

{
   "owner": "thisUser",
   "sasUuid": "234801532270870",
   "queue": "defaultQueue",
   "jobName": "Lev1_SASAppTest",
   "workFolder": "/data/saswork/SAS_workC02600003039_thishostname",
   "utilFolder": "/data/sasutil/tempUtil_12345_thishostname",
   "logFile": "/data/logs/someWorkspaceLog.log",
   "esmType": "WS"
}
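
To make the difference between the two formats concrete, here is a minimal Python sketch that reads a new event triggerfile in either form. It is not the agent's parser. The legacy field order is inferred from the example above, the function name parse_new_event is hypothetical, and shlex is used only as a convenient way to split quoted values.

```python
import json
import shlex

# Field order inferred from the legacy example above; treat it as an assumption.
LEGACY_NEW_FIELDS = [
    "owner", "esmType", "jobName", "workFolder", "utilFolder", "logFile", "queue",
]


def parse_new_event(filename, content):
    """Parse a new event triggerfile given its file name and text content."""
    # File name pattern: new_<pid>_<hostname>. Splitting at most twice keeps
    # hostnames that contain underscores intact.
    kind, pid, hostname = filename.split("_", 2)
    if kind != "new":
        raise ValueError(f"not a new event triggerfile: {filename}")
    text = content.strip()
    if text.startswith("{"):
        fields = json.loads(text)  # JSON format: keys are self-describing
    else:
        # Legacy format: position determines meaning.
        fields = dict(zip(LEGACY_NEW_FIELDS, shlex.split(text)))
    return {"pid": int(pid), "hostname": hostname, **fields}


legacy = '"thisUser" "WS" "Lev1_SASAppTest" "/data/saswork/w" "/data/sasutil/u" "/data/logs/l.log" "defaultQueue"'
print(parse_new_event("new_12345_thishostname", legacy))
```

The sketch shows why the legacy format is order-dependent: a missing or swapped value silently shifts every later field, whereas the JSON form names each property explicitly.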

tag event triggerfile

File name: tag_12345_thishostname_123 (where 12345 is the PID of the process to be monitored, thishostname is the agent recognised hostname and 123 is an identifier that ensures a unique filename for the triggerfile, often an autoincremented counter)

File content (legacy)

"flagText" "tooltipText with more detail" "#FAAD39"

File content (json)

{
   "text": "flagText",
   "tooltip": "tooltipText with more detail",
   "color": "#FAAD39"
}
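
A process could emit a tag event in JSON format along the following lines. This is a minimal Python sketch, not an official client: the function name write_tag_event and its parameters are hypothetical, and the caller is assumed to supply the node's events directory and a counter that is unique per process.

```python
import json
from pathlib import Path


def write_tag_event(events_dir, pid, hostname, counter, text, tooltip, color="#FAAD39"):
    """Write a JSON tag event triggerfile and return its path."""
    payload = {"text": text, "tooltip": tooltip, "color": color}
    # File name pattern from above: tag_<pid>_<hostname>_<unique identifier>
    path = Path(events_dir) / f"tag_{pid}_{hostname}_{counter}"
    path.write_text(json.dumps(payload), encoding="utf-8")
    return path
```

A call such as write_tag_event(events_dir, os.getpid(), "thishostname", 1, "flagText", "tooltipText with more detail") would produce a file equivalent to the JSON example above, provided os is imported and events_dir points at the directory the agent monitors.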

highlightStart event triggerfile

File name: highlightStart_12345_thishostname (where 12345 is the PID of the process to be monitored and thishostname is the agent recognised hostname)

File content (legacy)

"234801532270870" 

File content (json)

{
   "uuid": "234801532270870"
}

highlightEnd event triggerfile

File name: highlightEnd_thishostname_234801532270870 (where thishostname is the agent recognised hostname and 234801532270870 is the uuid)

File content (legacy)

"234801532270870" "myJobName.sas" "null" "#CCCCCC" "0" "1670:thisUser:some_daily_flow:some_daily_subflow:myJobName" 

File content (json)

{
    "uuid":"234801532270870", 
    "text":"myJobName.sas", 
    "returnCode":0, 
    "flow":"1670:thisUser:some_daily_flow:some_daily_subflow:myJobName"
}
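
The parent-side half of the highlight mechanism can be sketched in Python as follows. This is an illustration under stated assumptions, not the sasbatch.sh implementation: the environment variable name ESM_JOB_UUID and the function name run_job are hypothetical, and a random hex string is assumed to be an acceptable uuid. The job process is expected to read the exported value and write its own highlightStart triggerfile.

```python
import json
import os
import subprocess
import sys
import uuid
from pathlib import Path

UUID_ENV_VAR = "ESM_JOB_UUID"  # hypothetical variable name, not defined by ESM


def run_job(command, events_dir, hostname, job_name):
    """Run a job as a subprocess and report its exit code in a highlightEnd file."""
    # The uuid is generated in the parent context so that it is known both
    # to the job (via the environment) and to the code that sees the exit code.
    job_uuid = uuid.uuid4().hex
    env = dict(os.environ, **{UUID_ENV_VAR: job_uuid})
    completed = subprocess.run(command, env=env)

    payload = {"uuid": job_uuid, "text": job_name, "returnCode": completed.returncode}
    # Pass LSB_JOBNAME through verbatim when running under LSF.
    flow = os.environ.get("LSB_JOBNAME")
    if flow:
        payload["flow"] = flow

    # File name pattern from above: highlightEnd_<hostname>_<uuid>
    path = Path(events_dir) / f"highlightEnd_{hostname}_{job_uuid}"
    path.write_text(json.dumps(payload), encoding="utf-8")
    return completed.returncode, path
```

Because the uuid appears in both the highlightStart content and the highlightEnd file name, the exit code can be matched back to the PID announced at job startup without the parent ever needing to know that PID.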