SECURITY LAB SEMINAR
April 14, 1999
1131 ENG II
1-2:30 p.m.
Guest Speaker: Tom Goldring from NSA
- Models for User Profiling
- Tom Goldring and Jim Shostak
- Build a model for a user to identify a given session, based on behavior,
not on login/password
- Possible Data Sources
- Command Line
- Process Table
- Syslog, BSM, etc.
- System Calls
- Sample Data from Process Table
- /proc filesystem
- Actual time data is generated
- Scan process table; build list of processes
- See what the user is doing, as he does it
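A minimal sketch of scanning a process table through /proc. It assumes the Linux layout, where each /proc/<pid>/stat file holds the PID, the command name in parentheses, the state, and the parent PID; the talk does not say which operating system or collection tool was used, and the function names here are illustrative only.

```python
import os

def parse_stat_line(line):
    """Parse one Linux-style /proc/<pid>/stat line into (pid, name, ppid)."""
    # The command name sits in parentheses and may itself contain spaces
    # or parentheses, so split around the first "(" and the last ")".
    left = line.index("(")
    right = line.rindex(")")
    pid = int(line[:left])
    name = line[left + 1:right]
    fields = line[right + 1:].split()
    ppid = int(fields[1])  # fields[0] is the state, fields[1] the parent PID
    return pid, name, ppid

def scan_process_table(proc_root="/proc"):
    """Return a list of (pid, name, ppid) for every readable process."""
    processes = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                processes.append(parse_stat_line(f.read()))
        except (OSError, ValueError):
            continue  # process exited mid-scan or the entry is unreadable
    return processes
```

Calling scan_process_table() repeatedly and recording newly seen PIDs would give a running list of what a user starts, as it happens.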
- Viewing It as a Time Series
- Build list of unique process names - now over 650
- Map PID to number → stream of PID files
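A minimal sketch of the time-series encoding, assuming that each unique process name is mapped to an integer and a session becomes a stream of those integers. The function names and the use of -1 for names not seen in training are assumptions made for illustration.

```python
def build_vocabulary(training_names):
    """Assign each unique process name an integer, in order of first appearance."""
    vocab = {}
    for name in training_names:
        if name not in vocab:
            vocab[name] = len(vocab)
    return vocab

def encode(names, vocab, unknown=-1):
    """Turn a list of process names into a stream of integers."""
    return [vocab.get(name, unknown) for name in names]
```

For example, encode(["ps", "gcc", "ls"], vocab) with a vocabulary built from ["ls", "ps", "ls", "emacs"] yields [1, -1, 0].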
- Viewing It as a Tree
- Parent/Child Relationships
- Some Scoring Techniques (time series viewpoint)
- Monographic and Likelihood Function
- Frequency counts for processes
- Probability → Likelihood Function
- (P[ps]*P[ls]) Likelihood Function
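A minimal sketch of the monographic score: per-process probabilities are estimated from frequency counts, and a session's likelihood is the product of those probabilities, for example P[ps]*P[ls]. The log form and the additive smoothing parameter alpha are assumptions added here so that long sessions and unseen processes do not cause numeric trouble; the talk did not specify either.

```python
import math
from collections import Counter

def monographic_model(training_stream, vocab_size, alpha=1.0):
    """Return a function giving the smoothed probability of one process name."""
    counts = Counter(training_stream)
    total = len(training_stream) + alpha * vocab_size
    def prob(name):
        return (counts[name] + alpha) / total
    return prob

def log_likelihood(stream, prob):
    """Log of the product of per-process probabilities over the stream."""
    return sum(math.log(prob(name)) for name in stream)
```

To guess the user, one would build one model per user and pick the user whose model gives the session the highest log-likelihood.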
- Transition Matrix and Likelihood Function
- Frequency counts for pairs
- Each user has own transition matrix
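A minimal sketch of the transition-matrix score: frequency counts for consecutive pairs are turned into row-normalized transition probabilities, one matrix per user, and a session is scored by the product of its transition probabilities. The sparse dictionary representation and the smoothing parameter alpha are assumptions for illustration.

```python
import math
from collections import defaultdict

def transition_model(training_stream, vocab, alpha=1.0):
    """Return prob(prev, cur), a smoothed estimate of P[cur | prev]."""
    pair_counts = defaultdict(lambda: defaultdict(int))
    for prev, cur in zip(training_stream, training_stream[1:]):
        pair_counts[prev][cur] += 1
    size = len(vocab)
    def prob(prev, cur):
        row = pair_counts.get(prev, {})
        return (row.get(cur, 0) + alpha) / (sum(row.values()) + alpha * size)
    return prob

def transition_log_likelihood(stream, prob):
    """Log of the product of transition probabilities along the stream."""
    return sum(math.log(prob(p, c)) for p, c in zip(stream, stream[1:]))
```

Unlike the monographic score, this one depends on the order in which processes appear.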
- Chi-square Test
- KL: Consider order-unimportant
- Computational Immunology
- Stephanie Forrest's work using n-grams of length 6
- Don't do frequency counts; put n-grams into a table of behavior
- If an n-gram is not in the table, it is counted as an anomaly (frame locality
count)
- With an attack, the frame locality count goes through the roof.
- User profiling is a much harder problem than program profiling
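A minimal sketch of the table-lookup approach described above: every n-gram seen in training goes into a table, and n-grams missing from the table are counted as anomalies within a sliding frame. The frame size of 20 and the function names are assumptions for illustration; only the n-gram length of 6 comes from the talk.

```python
def ngrams(stream, n=6):
    """All length-n windows of the stream, as tuples."""
    return [tuple(stream[i:i + n]) for i in range(len(stream) - n + 1)]

def build_table(training_sessions, n=6):
    """Table of normal behavior: the set of n-grams seen in training."""
    table = set()
    for session in training_sessions:
        table.update(ngrams(session, n))
    return table

def max_locality_count(stream, table, n=6, frame=20):
    """Largest number of unseen n-grams in any run of `frame` consecutive n-grams."""
    misses = [0 if g in table else 1 for g in ngrams(stream, n)]
    if not misses:
        return 0
    window = sum(misses[:frame])
    best = window
    for i in range(frame, len(misses)):
        window += misses[i] - misses[i - frame]  # slide the frame by one
        best = max(best, window)
    return best
```

A session would be flagged when this count exceeds a threshold chosen from training data.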
- An Experiment
- Data collected for 4 different users (50 sessions per user); 1 session
equals one day of work
- Data was divided into 20 training sessions and 30 test sessions
per user
- User profiles built for each user
- Guess who the user was
- Results - Percentage Correct
| Scoring Method | Training | Test |
| Monographic | 92% | 79% |
| Transition | 100% | 83% |
| Chi-square | 100% | 92% |
| Immunology | 100% | 88% |
- But this isn't the Whole Story
- Percent correct doesn't show how well the scores are separated
- Risk of high false positive alarm
- Transition Matrix Scores (10 "B" Sessions)
- Data not separated
- Chi-Square - separation of data (length=5)
- Superior scoring technique
- Why isn't this good enough?
- Can't score short pieces of a session, generates false positives
- When user is given a new assignment, behavior looks anomalous
- Insider Threat - legitimate user training the system
- Non-repudiation - User claims someone is impersonating them
- Method to address all problems
- Train Profiler to Accept Attacks
- Scenario: A legitimate user plans a future attack, and wants to
fool the intrusion detection system into thinking it's his normal
behavior
- This is a common objection to statistically based methods - the
user can supposedly, over time, train the system to accept attacks
- How attacker implements it
- Knows how the profiler works
- Runs the attack on another system and generates the same output
as a profiler
- Find non-intrusive ways to generate individual pieces of same
output
- Use these in daily activities over a period of time
- If score is an average over user's output, then it doesn't matter
that you generate the attack piecemeal
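A small illustration of why piecemeal generation works against an averaged score: a frequency profile is the same whether the attack commands run together or are spread through normal work. The command names below are hypothetical placeholders, not taken from the talk.

```python
from collections import Counter

def frequency_profile(stream):
    """Relative frequency of each process name, ignoring order."""
    counts = Counter(stream)
    total = len(stream)
    return {name: c / total for name, c in counts.items()}

normal = ["ls", "emacs", "gcc", "mail"]   # hypothetical daily activity
attack = ["nc", "chmod", "rm"]            # hypothetical attack pieces

# The attack run in one go, after three rounds of normal work.
contiguous = normal * 3 + attack
# The same attack pieces spread across the three rounds.
piecemeal = normal + attack[:1] + normal + attack[1:2] + normal + attack[2:]

same_profile = frequency_profile(contiguous) == frequency_profile(piecemeal)
```

Here same_profile is True, so a profiler that only averages counts cannot tell the two streams apart.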
- Other ways to fool the system
- Chris - Add a Trojan - everyone becomes an attacker; intrusion
detection system gives up
- Steven T. - Process names using other program names; rename dummy
files to attack name
- Jeff - Intersperse attack within inconsequential commands
- How hard is it to do this?
- For a monographic model, it is relatively easy. The profiler only
counts different activities
- It is difficult to generate longer sequences without running parts
of the attack. The downside of using long sequences is that the size
of the sequence space grows exponentially with sequence length.
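A short worked example of the exponential growth, assuming an alphabet of 650 unique process names (the notes say "now over 650"). The function name is illustrative.

```python
UNIQUE_PROCESS_NAMES = 650  # lower bound taken from the notes

def sequence_space_size(length, alphabet=UNIQUE_PROCESS_NAMES):
    """Number of distinct sequences of the given length over the alphabet."""
    return alphabet ** length
```

With these numbers there are 422,500 possible pairs, and roughly 7.5 x 10^16 possible sequences of length 6, far more than any training set can cover.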
- A Divide and Conquer Strategy
- When you work, you do various things
- Output from the process table can therefore be partitioned into
blocks corresponding to different user "modes"
- The user profile is now a collection of models, one per mode, plus
a characteristic pattern of mode changes
- Sample Process Data
- Can see mode transitions - visually separated user modes
- Finding and Interpreting the Modes
- Inside emacs - can read mail, compile programs etc.
- Look at data by hand, can tell a lot
- How can mode partitioning address "new assignment" and "score short
pieces" problems?
- Even with a new topic, still do old processes
- Data will be partitioned accordingly
- Still score old modes against user's model and authenticate that
user
- Look at patterns
- New activity shows up as a new mode, or set of modes
- As far as scoring a short piece of a session, we first see what
mode the piece belongs to, then score it against that mode
- Chi-square test would work well, but look at the timing information
- How can mode partitioning address the "train the system" and "non-repudiation"
problem?
- When a user tries to train the model by generating attack output
piecemeal, the pieces will be associated with the modes he normally
uses
- Running the attack then produces an unlikely sequence of mode transitions
- Masquerader won't look like normal user
- Every attack is a separate mode, so run known attacks and build the
resulting attack modes into the model
- The Advantage of using the Process Table
- Process data has a natural tree structure which can be exploited
to find modes.
- Use training data to build, for each process, a list of its parents
- Most of the time, the child process has same mode as the parent.
- Exploit the tree structure of process table - ID mode for any process
having only a single parent over all training sessions.
- This allows us to write down a set of modes for all of our training data,
and in practice, we can identify modes for more than 99% of all processes
in this way.
- Using the time-series structure, can score each mode individually
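A minimal sketch of using the tree structure to label modes: collect each process's set of parents over the training data, and let a process inherit its parent's mode when it has only a single parent. The mode_roots mapping, which ties a few process names to mode labels, is a hypothetical starting point; the talk did not say how modes were named or seeded.

```python
from collections import defaultdict

def parent_sets(observations):
    """observations: (child_name, parent_name) pairs from all training sessions."""
    parents = defaultdict(set)
    for child, parent in observations:
        parents[child].add(parent)
    return parents

def assign_modes(parents, mode_roots):
    """Label processes that reach a mode root through single-parent links only."""
    modes = dict(mode_roots)  # hypothetical seed: process name -> mode label
    def resolve(name, seen):
        if name in modes:
            return modes[name]
        if name in seen or len(parents.get(name, ())) != 1:
            return None  # cycle, unknown process, or more than one parent
        (parent,) = parents[name]
        mode = resolve(parent, seen | {name})
        if mode is not None:
            modes[name] = mode  # child takes the same mode as its parent
        return mode
    for name in list(parents):
        resolve(name, frozenset())
    return modes
```

Processes with several distinct parents are left unlabeled here and would need another rule.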
- Summary
- Partitioning data into modes allows us to address the following problems:
- Scoring short pieces of sessions
- Legitimate activity never done before
- User training the system in preparation for a future attack
- Non-repudiation
- Possibility of identifying attack modes
- To do these things, we exploit the dual structure (time series and tree)
of the process data
Comments:
ST: When you update the number of users to about 20, there are many problems
that arise. When you assume a session is over a longer period of time,
the profile spreads out.