Computer Science Department
U C Davis
SECURITY LAB SEMINAR
April 14, 1999
1131 ENG II
1-2:30 p.m.

Guest Speaker: Tom Goldring from NSA


  1. Models for User Profiling
    1. Tom Goldring and Jim Shostak
    2. Build a model of each user that identifies who is behind a given session, based on behavior rather than on login/password
  2. Possible Data Sources
    1. Command Line
    2. Process Table
    3. Syslog, BSM, etc.
    4. System Calls
  3. Sample Data from Process Table
    1. /proc filesystem
    2. Actual time data is generated
    3. Scan process table; build list of processes
    4. See what the user is doing, as he does it
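The process-table scan described above can be sketched as follows. This is a minimal illustration, not the speaker's collector: it assumes a Linux-style `/proc` layout in which each `/proc/<pid>/stat` file starts with `pid (name) state ppid ...`, and the function name `scan_process_table` is hypothetical.

```python
import os

def scan_process_table(proc_root="/proc"):
    """Return a list of (pid, ppid, name) for each process under proc_root.

    Assumes the Linux /proc/<pid>/stat text format; other systems differ.
    """
    processes = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():            # only numeric directories are processes
            continue
        stat_path = os.path.join(proc_root, entry, "stat")
        try:
            with open(stat_path) as f:
                stat = f.read()
        except OSError:                    # process exited during the scan
            continue
        # The name is in parentheses and may itself contain spaces or
        # parentheses, so split on the last ')'.
        left, _, right = stat.rpartition(")")
        name = left.split("(", 1)[1]
        ppid = int(right.split()[1])       # fields after the name: state, ppid, ...
        processes.append((int(entry), ppid, name))
    return processes
```

Calling the function repeatedly and recording which entries are new on each pass gives a running view of what the user is doing.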
  4. Viewing It as a Time Series
    1. Build list of unique process names - now over 650
    2. Map PID to number → stream of PID files
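One way to read the time-series view is that each unique process name receives an integer, so a session becomes a stream of integers. The sketch below assumes that reading; `encode_names` and its arguments are hypothetical names, not taken from the talk.

```python
def encode_names(names, vocab=None):
    """Map each process name to an integer, extending vocab as new names appear."""
    vocab = {} if vocab is None else vocab
    stream = []
    for name in names:
        if name not in vocab:
            vocab[name] = len(vocab)       # next unused integer
        stream.append(vocab[name])
    return stream, vocab
```

Passing the same `vocab` dictionary to later calls keeps the numbering consistent across sessions.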
  5. Viewing It as a Tree
    1. Parent/Child Relationships
  6. Some Scoring Techniques (time series viewpoint)
    1. Monographic and Likelihood Function
      1. Frequency counts for processes
      2. Probability → Likelihood Function
        1. (P[ps]*P[ls]) Likelihood Function
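A minimal sketch of the monographic score, assuming per-user frequency counts are turned into probabilities and a session is scored by the product of those probabilities (computed here as a sum of logs). The add-`alpha` smoothing for unseen processes and all function names are assumptions made for the illustration.

```python
import math
from collections import Counter

def train_monographic(sessions, vocab_size, alpha=1.0):
    """Return a function name -> probability built from frequency counts."""
    counts = Counter(name for session in sessions for name in session)
    total = sum(counts.values())

    def prob(name):
        # Smoothing (an assumption) keeps unseen names from scoring zero.
        return (counts[name] + alpha) / (total + alpha * vocab_size)

    return prob

def log_likelihood(prob, session):
    """log(P[x1] * P[x2] * ...) for the process names in a session."""
    return sum(math.log(prob(name)) for name in session)
```

Because the score is a product of single-process probabilities, it does not depend on the order in which the processes appear.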
    2. Transition Matrix and Likelihood Function
      1. Frequency counts for pairs
      2. Each user has own transition matrix
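The transition-matrix score can be sketched in the same style, assuming the pair counts are normalized per row and the session score is the log-likelihood of its consecutive pairs. The smoothing and the function names are again assumptions, not details from the talk.

```python
import math
from collections import defaultdict

def train_transitions(sessions):
    """Count consecutive (previous, current) process pairs for one user."""
    counts = defaultdict(lambda: defaultdict(int))
    for session in sessions:
        for prev, cur in zip(session, session[1:]):
            counts[prev][cur] += 1
    return counts

def transition_log_likelihood(counts, session, vocab_size, alpha=1.0):
    """Sum of log P(current | previous) over the pairs in a session."""
    score = 0.0
    for prev, cur in zip(session, session[1:]):
        row = counts.get(prev, {})
        row_total = sum(row.values())
        # Add-alpha smoothing (an assumption) for pairs never seen in training.
        p = (row.get(cur, 0) + alpha) / (row_total + alpha * vocab_size)
        score += math.log(p)
    return score
```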
    3. Chi-square Test
      1. KL: Consider order-unimportant
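The notes give no formula for the chi-square score, so the following is only one plausible reading: a Pearson chi-square statistic comparing the process counts observed in a session with the counts expected from the user's training profile, ignoring order. The `floor` on expected counts and the function name are assumptions.

```python
from collections import Counter

def chi_square_score(profile_counts, session, floor=0.5):
    """Pearson chi-square distance between a session and a profile.

    Lower means the session looks more like the profile. `floor` (an
    assumption) avoids division by zero for names absent from training.
    """
    observed = Counter(session)
    n = len(session)
    total = sum(profile_counts.values())
    score = 0.0
    for name in set(profile_counts) | set(observed):
        expected = max(n * profile_counts.get(name, 0) / total, floor)
        score += (observed.get(name, 0) - expected) ** 2 / expected
    return score
```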
    4. Computational Immunology
      1. Stephanie Forrest's work using n-grams of length 6
      2. Don't do a frequency count; put the n-grams into a table of behavior
        1. If it's not in the table, it is counted as an anomaly (frame locality count)
        2. With an attack, the frame locality count goes through the roof.
      3. User profiling is a much harder problem than program profiling
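The table-lookup idea can be sketched as below: n-grams seen in training go into a set, and each test n-gram missing from the set counts as an anomaly within a sliding frame. The n-gram length of 6 comes from the notes; the frame size of 20 and the function names are assumptions for the illustration.

```python
def ngrams(stream, n=6):
    """All contiguous windows of length n, as tuples."""
    return [tuple(stream[i:i + n]) for i in range(len(stream) - n + 1)]

def build_table(sessions, n=6):
    """Table of normal behavior: every n-gram seen in training (no counts)."""
    table = set()
    for session in sessions:
        table.update(ngrams(session, n))
    return table

def frame_locality_counts(table, stream, n=6, frame=20):
    """Number of unseen n-grams among the most recent `frame` n-grams."""
    misses = [0 if gram in table else 1 for gram in ngrams(stream, n)]
    return [sum(misses[max(0, i - frame + 1):i + 1]) for i in range(len(misses))]
```

A sustained run of unseen n-grams drives the count up quickly, which matches the behavior described for an attack.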
  7. An Experiment
    1. Data collected for 4 different users (50 sessions per user); 1 session equals one day of work
    2. Data was divided into 20 training sessions and 30 test sessions per user
    3. User profiles built for each user
    4. Guess who the user was
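The experiment above amounts to attributing each test session to the user whose profile scores it best. A minimal sketch follows, assuming each profile is wrapped as a function where a higher score means a better match, and assuming the first 20 sessions are the training ones (the notes do not say how the split was made). All names are hypothetical.

```python
def split_sessions(sessions, n_train=20):
    """Split one user's sessions into training and test parts."""
    return sessions[:n_train], sessions[n_train:]

def guess_user(scorers, session):
    """Return the user whose scorer gives the session the highest score."""
    return max(scorers, key=lambda user: scorers[user](session))

def percent_correct(scorers, labelled_sessions):
    """Percentage of (true_user, session) pairs attributed correctly."""
    hits = sum(1 for user, session in labelled_sessions
               if guess_user(scorers, session) == user)
    return 100.0 * hits / len(labelled_sessions)
```

For a distance-like score such as chi-square, where lower is better, the scorer would return the negated value.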
  8. Results - Percentage Correct
  9. Scoring Method    Training    Test
     Monographic       92%         79%
     Transition        100%        83%
     Chi-square        100%        92%
     Immunology        100%        88%
  10. But This Isn't the Whole Story
    1. The percentages don't show how well the scores are separated
    2. Risk of a high false-positive rate
  11. Transition Matrix Scores (10 "B" Sessions)
    1. Data not well separated
  12. Chi-Square - separation of data (length=5)
    1. Superior scoring technique
  13. Why isn't this good enough?
    1. Can't score short pieces of a session, generates false positives
    2. When user is given a new assignment, behavior looks anomalous
    3. Insider Threat - legitimate user training the system
    4. Non-repudiation - User claims someone is impersonating them
    5. Method to address all problems
  14. Train Profiler to Accept Attacks
    1. Scenario: A legitimate user plans a future attack, and wants to fool the intrusion detection system into thinking it's his normal behavior
    2. This is a common objection to statistically based methods - the user can supposedly, over time, train the system to accept attacks
    3. How attacker implements it
      1. Knows how the profiler works
      2. Runs the attack on another system and generates the same output the profiler would see
      3. Finds non-intrusive ways to generate individual pieces of the same output
      4. Uses these in daily activities over a period of time
    4. If score is an average over user's output, then it doesn't matter that you generate the attack piecemeal
    5. Other ways to fool the system
      1. Chris - Add a Trojan - everyone becomes an attacker; intrusion detection system gives up
      2. Steven T. - Process names using other program names; rename dummy files to attack name
      3. Jeff - Intersperse attack within inconsequential commands
  15. How hard is it to do this?
    1. For a monographic model, it is relatively easy. The profiler only counts different activities
    2. It is difficult to generate longer sequences without running parts of the attack. The downside of using long sequences is that the size of the sequence space grows exponentially with sequence length.
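To put a number on the exponential growth, the sketch below counts the possible sequences of length n over a vocabulary of 650 process names, the figure quoted earlier in these notes. Using 650 as the vocabulary size here is an assumption for illustration; the function name is hypothetical.

```python
def sequence_space_size(vocab_size, length):
    """Number of distinct sequences of the given length over the vocabulary."""
    return vocab_size ** length

for n in (1, 2, 6):
    print(n, sequence_space_size(650, n))
```

With 650 names there are 422,500 possible pairs and about 7.5 × 10^16 possible sequences of length 6, far more than any training set can cover.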
  16. A Divide and Conquer Strategy
    1. When you work, you do various things
    2. Output from the process table can therefore be partitioned into blocks corresponding to different user "modes"
    3. The user profile is now a collection of models, one per mode, plus a characteristic pattern of mode changes
  17. Sample Process Data
    1. Can see mode transitions - visually separated user modes
  18. Finding and Interpreting the Modes
    1. Inside emacs - can read mail, compile programs etc.
    2. Look at data by hand, can tell a lot
  19. How can mode partitioning address "new assignment" and "score short pieces" problems?
    1. Even with a new topic, still do old processes
    2. Data will be partitioned accordingly
    3. Still score old modes against user's model and authenticate that user
    4. Look at patterns
    5. New activity shows up as a new mode, or set of modes
    6. As far as scoring a short piece of a session, we first see what mode the piece belongs to, then score it against that mode
    7. Chi-square test would work well, but look at the timing information
  20. How can mode partitioning address the "train the system" and "non-repudiation" problem?
    1. When the user tries to train the model by generating attack output piecemeal, the pieces will be associated with the modes he normally uses
    2. Running the attack then produces an unlikely sequence of mode transitions
    3. Masquerader won't look like normal user
    4. Every attack is a separate mode, so run attacks, then build the resulting attack modes into the model
  21. The Advantage of using the Process Table
    1. Process data has a natural tree structure which can be exploited to find modes.
    2. Use training data to build a list of processes and, for each process, a list of its parents
    3. Most of the time, the child process has same mode as the parent.
    4. Exploit the tree structure of the process table: identify the mode for any process having only a single parent over all training sessions.
    5. This allows us to write down a set of modes for all of our training data, and in practice, we can identify modes for more than 99% of all processes in this way.
    6. Using the series structure, we can score each mode individually
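The single-parent rule can be sketched as follows, under one reading of the points above: a process that has exactly one parent across all training sessions inherits that parent's mode. The hand-labelled starting modes (`root_modes`), the input format of (child, parent) name pairs, and the function names are all assumptions.

```python
from collections import defaultdict

def parents_by_process(training_sessions):
    """Map each process name to the set of parent names seen in training."""
    parents = defaultdict(set)
    for session in training_sessions:
        for child, parent in session:
            parents[child].add(parent)
    return parents

def assign_modes(parents, root_modes):
    """Propagate modes from labelled processes to single-parent children."""
    modes = dict(root_modes)
    changed = True
    while changed:                          # repeat until no new mode is assigned
        changed = False
        for child, parent_set in parents.items():
            if child in modes or len(parent_set) != 1:
                continue                    # already labelled, or ambiguous parent
            (parent,) = parent_set
            if parent in modes:
                modes[child] = modes[parent]
                changed = True
    return modes
```

Processes with more than one observed parent, such as a utility launched from both a shell and an editor, are left unlabelled by this sketch and would need another rule.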
  22. Summary
    1. Partitioning data into modes allows us to address the following problems:
      1. Scoring short pieces of sessions
      2. Legitimate activity never done before
      3. User training the system in preparation for a future attack
      4. Non-repudiation
    2. Possibility of identifying attack modes
    3. To do these things, we exploit the dual structure of the process data
Comments:

ST: When you increase the number of users to about 20, many problems arise. When you assume a session covers a longer period of time, the profile spreads out.