GrIDS Outline Design Document

Introduction

GrIDS's core component is a graph-based language for analyzing network connection activity in a LAN-MAN sized system. GrIDS contains a language for specifying how network connection activity is represented in graphical form, incorporating information from host-based IDSs, network sniffers and connection analysis components.

The same language allows users to specify graphical structures which require notification of system administrators. This language is suitable for implementing network access control policies as well as identifying attack signatures.

This document attempts to address, in outline form, how the system would meet all the requirements in our requirements document.

Detection of Network Attacks

Idea

The key point of GrIDS is to detect large-scale automated attacks on networked systems. The mechanism that we propose is to build activity graphs. This section first explains somewhat intuitively what an activity graph is, and then describes in more detail the means by which we would build the graphs and assess their significance.

The nodes of an activity graph correspond to hosts in a system, while edges in the graph correspond to network activity between those hosts. Let us take the tracking of a worm as an example of building an activity graph. The worm begins on system A, then it initiates connections to hosts B and C which causes them to become infected. When the aggregation engine receives a report of a connection between host A and B, it starts a new graph with two nodes and one edge between them (represented here as "A->B"). The engine sets an activity time on this graph. If nothing is received within the activity time, the graph will be dropped. Next, the engine receives a report of the B->C connection. Provided that this connection occurs soon enough after the A->B connection, , the engine notices that it already has a graph containing B, and so it adds C and the edge B->C to the graph which previously only contained A->B.

Next the worm spreads to D, E, and so on. The aggregation engine incorporates them too. The graph soon looks like:


          A--> B--> D 
          |    | 
          |    `-> E 
          | 
          `-> C--> F
              | 
              `-> G

At this point, the aggregation engine notices the tree like structure of the graph, and it reports a suspected worm.

In this simple example, we put edges into a graph purely based on time co-incidence. In a more complicated algorithm, we could also use other information to decide if edges were correlated - such as whether they used the same destination ports, whether they went to hosts of similar operating system, etc.

In addition to forming graphs from network connections, we allow the graphs to be "decorated". That is, both nodes and edges in an activity graph may have attributes which supply additional information about the connection/host, or document other significant events which occurred in the same time frame. For example, if an NSM reported that the B->D connection appeared to contain a password file, an annotation to that effect could be added to that link. Similarly, if a tripwire module on C reported a change in a critical system file in the right activity time-frame, that would become an an attribute of the C node in this graph.

In the worm example, we detected the worm based soley on the tree like structure of the graph. In a more complicated algorithm, we could also use the node attributes ("decorations") in detecting a worm. A tree structure may not, in this algorithm, be suspicious until some node in the tree is decorated by some worm-related attribute.

Refinement

A difficulty in the above worm example is that only fast spreading worms will create large graphs (because graphs built at each stage of a slower worm will time-out before the next stage of infection.) Without the large worm-ish graphs, a slow worm won't be detected.

One solution is to allow graphs longer latency periods between connections. This, however, creates a serious problem: legitimate user activity clumps together into one huge graph, obscuring possible worm activity.

A better solution is to refine what kinds of network activity should be associated with a particular network abuse. If a worm is known to spread using rlogin and telnet connections, then a graph can be built that only contains rlogin and telnet connections. Since legitimate rlogin and telnet activity is less profuse than the sum of all legitimate activity, the latency time of this new graph may be increased without creating a huge (obscured) graph of legitimate activity.

Of course, for numerous types of network abuse we may search, so we need to be able to build numerous kinds of graphs. The graph engine is capable of maintaining multiple graph spaces, where each graph space contains only graphs of one type. The "type" of graph is specified by a set of rules which specify how graphs in the graph space are built (including rules specifying which kinds of connections to represent).

Policy

The network access control policy is implemented in the graph-building rules. The policy is enforced by checking for graph structures which correspond to policy violations.

We have assumed that the network organization is hierarchical in nature. The implications of this is that if policies require information from sub-organizations, the sub-organizations must install detection capability and information sources sufficient to implement the policy of the parent organization.

Scalability

Communications

We assume an organization with numerous semi-independent departments. It is assumed that most activity is within a department. Each department has its own module which computes graphs of activity in that department, enforces an intra-departmental policy, etc.

Further, we assume that the departments are part of some organizational hierarchy (the exact hierarchy structure will be dynamically configurable). Records of communications between departments will be passed up to the next level in the hierarchy. As far as that level is concerned, it's sub departments will form single nodes in it's activity graphs. Attacks, etc, detected within the sub-department will become attributes of a node in an activity graph at the higher level. In addition, sub-departments will attempt to inform larger units about when graph edges coming into and out of the sub-department may be causally connected. Policy about interdepartmental communications will naturally be specified at this level.

An important aspect of this system is that it must resist building large graphs out of its own communications. For that reason we must include the option of filtering out messages known to be sent by the various security components of the network.

Note that different departments may have some differences in which attributes were added to graphs, so their graphs were only partially mutually comprehensible. The basic skeleton of the graph would be transferable, but if one department had NIDES report attributes on some nodes of the graph, while the other department did not use NIDES, those would be incomprehensible to that department. Our approach to this is:

Full syntactic agreement. All attributes must be well-formed syntactically.
Agreement on a very basic attribute semantics to be supported by all departments ("root has been compromised", "this connection has suspicious content").
Departments free to add additional semantics themselves as suits them to get the best security they can.
Departments ignore attributes their rules don't understand. Graphs should still make sense when "extra" attributes are deleted.

Data Sources

Host-based IDS with some appropriate interface

TCP wrappers to give host-reports of connections

Network sniffers along the lines of NSM or NID to give network reports of connections and to provide non-TCP connection coverage.

Infrastructure

We perceive a 3-tiered system. At the bottom are protected components which consist of hosts, bridges, and routers arranged in a network topology which is known to the system administrators. Imposed on top of this is a map of the human organizations.

At the second level are GrIDS primitive components which consist of the host-based IDSs and host-based policy enforcement systems, the network sniffer components, tracing and thumbprinting components and honeypot systems. The installation of these systems depends on a thorough understanding of the network topology for reasons discussed above.

Finally, we have the GrIDS detector systems which consist of the graph agggregation units, the network access policy enforcement systems, and the signature detection systems; these are connected to the GrIDS primitives and each other using our SCL-type interface.

For the sake of usability we envisage more abstract policy languages and attack signature generation tools sitting on top of these systems.