US8332944B2 - System and method for detecting new malicious executables, based on discovering and monitoring characteristic system call sequences - Google Patents

Info

Publication number
US8332944B2
Authority
US
United States
Prior art keywords
system call
sequences
malicious
dataset
call sequences
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related, expires
Application number
US12/697,559
Other versions
US20100229239A1 (en)
Inventor
Boris Rozenberg
Ehud Gudes
Yuval Elovici
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Deutsche Telekom AG
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Assigned to BEN-GURION UNIVERSITY OF THE NEGEV RESEARCH AND DEVELOPMENT AUTHORITY (assignment of assignors' interest; see document for details). Assignors: ROZENBERG, BORIS; ELOVICI, YUVAL; GUDES, EHUD
Assigned to DEUTSCHE TELEKOM AG (assignment of assignors' interest; see document for details). Assignors: BEN-GURION UNIVERSITY OF THE NEGEV RESEARCH AND DEVELOPMENT AUTHORITY
Publication of US20100229239A1
Application granted
Publication of US8332944B2
Expired - Fee Related

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50: Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55: Detecting local intrusion or implementing counter-measures
    • G06F21/552: Detecting local intrusion or implementing counter-measures involving long-term monitoring or reporting

Definitions

  • the field of the invention relates to systems for detecting malicious executables. More particularly, the present invention relates to a system and method for detecting malicious executables, based on the use of a database of system call sequences that are characteristic only to malicious executables.
  • Malicious executables which propagate through the Internet can be classified into three main categories: (a) worm-related; (b) non-worm related (i.e. virus, Trojan); and (c) probes (i.e. adware, spyware, spam, phishing).
  • the detection of malicious executables that are known beforehand is typically performed using signature-based techniques.
  • Said signature-based techniques typically rely on the prior explicit knowledge of the malicious executable code, which is in turn represented by one or more signatures or rules that are stored in a database.
  • the database is frequently updated with new signatures, based on new observations.
  • the main disadvantage of these techniques is the inability to detect totally new un-encountered malicious executables, (i.e. malicious executables whose signatures are not yet stored in the database).
  • An object of the present invention is to provide a technique which can detect new malicious executables whose signatures are not yet known.
  • the static analysis approach suggests an inspection of the code of executables without actually running them, while the dynamic analysis approach suggests monitoring during the execution phase of the executable in order to detect anomalous behavior.
  • the present invention suggests a new technique of the dynamic analysis approach for the detection of new, unknown malicious executables.
  • anomaly detection techniques that are based on dynamic analysis approach have been used to detect new electronic threats (eThreats). These techniques build models of a normal program behavior during a training phase, and then, using the models the techniques attempt to detect deviations from said normal behavior during a detection phase.
  • S. Forrest “A Sense of Self for UNIX Processes”, Proceedings of the IEEE Symposium on Security and Privacy, Oakland, Calif. 120-128, 1996, introduces a simple anomaly detection technique which is based on monitoring the system calls issued by specific privileged processes.
  • the system of Forrest records short sequences of system calls that represent a normal process behavior into a “normal dictionary”.
  • sequences of actual system calls are compared with said normal dictionary. An alarm is issued if no match is found.
  • the main advantage of said anomaly detection techniques is their ability to detect new, previously un-encountered malicious codes.
  • the main drawback of using these techniques is the necessity to perform a complex and frequent retraining in order to separate “noise” and natural changes to programs from malicious codes. Legitimate program updates may result in false alarms, while malicious code actions that seem to be normal may cause missed detections. Furthermore, most applications that are based on anomaly detection techniques identify malicious behavior of specific processes only.
  • the distance threshold between sequences is defined as the minimum “cost” required in order to transform one sequence of system calls to another sequence of system calls, by applying a set of predefined operations.
  • the process of Lee and Jigar results in a classifier, which includes a plurality of medoids, wherein each medoid is the best representative of its cluster.
  • the classification of new objects is performed using the nearest neighbor classification method as described in K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft, “When is ‘nearest neighbor’ meaningful?”, Proc. 7th Int. Conf. on Database Theory (ICDT'99), pages 217-235, 1999. A new object is compared to all medoids, and receives a class label of the closest one.
  • the technique above can be used to classify a given malicious code instance as belonging to one of the predefined number of classes, but cannot be used for a new malicious code detection in real time.
  • the present invention relates to a method for detecting malicious executables, which comprises the steps of: (a) in an offline training phase, finding a collection of system call sequences that are characteristic only to malicious files, when such malicious files are executed, and storing said sequences in a database; (b) in runtime, for each running executable, continuously monitoring its issued run-time system calls and comparing with the stored sequences of system calls within the database to determine whether there exists a match between a portion of the sequence of the run-time system calls and one or more of the database sequences, and when such a match is found, declaring said executable as malicious.
  • each of said system call sequences that are determined during the training phase includes zero or more wildcards, wherein each wildcard defines the existence of zero or more system calls of any undefined type at the location of the wild card within the sequence.
  • said training phase comprises the steps of: (a) providing an M r dataset which comprises recordings of as many as possible system call sequences of malicious executables, and a B r dataset which comprises recordings of as many as possible system call sequences of benign executables; (b) for a specific support value, and using a SPADE algorithm, finding a set S of system call sequences, each of said sequences is repeated within some group equal or larger than the support value from among all the system call sequences within the malicious recordings in dataset M r ; (c) for each of the sequences found within set S, determining whether it is found within any of the recordings within the dataset B r , and forming a reduced dataset S m which contains only those sequences that are not included within any of the recorded sequences within benign dataset B r ; (d) adding S m into database M, and eliminating from dataset M r all the recordings which have been found to contain any one or more of the sequences of S m ; (e) if, however, dataset S m is found in step (c) to be empty, reducing the support value, and repeating the procedure from step (b); and (f) continuing the procedure from step (b) until either the support value is equal to zero, or the dataset M r is empty, therefore finalizing the procedure with a dataset M containing a group of sequences that each appears within one or more of the run-time sequences of malicious executables, but does not appear within any of the run-time sequences of benign executables.
  • FIG. 1 illustrates the method for detecting malicious executables, as performed in runtime, according to an embodiment of the present invention
  • FIG. 2 illustrates a training procedure which is performed off-line, prior to the performance of the run-time procedure of FIG. 1 , and which determines a set of system call sequences that are characteristic only to malicious executables and not to any benign executable.
  • the present invention introduces a novel technique for the real-time detection of new malicious executables.
  • instead of looking for anomalies, or trying to separate between malicious and benign behavior of executables, the present invention finds “behavior signatures” (i.e. sequences of system calls) that are characteristic to malicious executables and not to benign executables.
  • the invention utilizes the observation by the inventors that specific sequences of system calls are characteristic each to only a group of malicious executables while not characteristic to any benign executable.
  • the present invention determines and assigns sequences of system calls as representing the behavior of a malicious program. This is performed during a learning/training phase.
  • During a detection phase, which is performed in run time (i.e. after said learning/training phase), the invention identifies malicious executables by comparing their own run time sequences of system calls with said stored (in the database) sequences of system calls that are characteristic to only malicious executables.
  • the present invention detects malicious objects by (a) determining during a training phase a group of system call sequences that are characteristic only to malicious executables, and storing all said sequences in a database; (b) monitoring in runtime the system calls relating to each running executable, and comparing the same in real time with said database of malicious sequences of system calls; and (c) if a match is found between a monitored sequence and one or more of the sequences that are stored within the “malicious” database, declaring the monitored executable as malicious.
  • a first aspect of the invention relates to the phase of forming the database M which, as said, includes the sequences that are characteristic only to malicious executables. This phase will be referred to also as the training phase.
  • a second aspect of the invention relates to the run-time phase, which utilizes the database M for determining whether a running executable is malicious or not.
  • FIG. 1 is a flow diagram illustrating the process for detecting malicious executables according to said first aspect of the present invention.
  • Training phase 101 is a preliminary phase, which is performed off-line.
  • an “M determining module” 102 operates to determine as many M-sequences of system calls as possible that are characteristic only to malicious executables, and not to any benign program. It should be noted that each found M-sequence generally relates to a group of existing malicious executables.
  • Said M determining module 102 produces an “M database” 103 which includes the collection of M-sequences, as determined.
  • the M database 103 forms input data to comparator 104, which operates in runtime, or more particularly, is a part of runtime monitoring phase 105.
  • comparator 104 continuously receives inputs relating to the system calls that are issued by the currently running executables. More specifically, comparator 104 receives over input bus 109, for each issued system call, the system call ID and the file ID (i.e. an indication of the executable that issued said specific system call). Comparator 104, which has access to M database 103, compares in real time, separately for each running program, the sequence of system calls 109 that it issues with each of the sequences stored in the M database, which are characteristic only to malicious executables.
  • If, with respect to a specific running program, a match is found with one or more of the M-sequences, comparator 104 outputs such an indication (for example, in the form of Malicious, File ID), and this specific executable is declared as malicious and can be terminated. Otherwise, as long as no such alert signal is issued, this running file is considered as benign.
  • each of the M-sequences of system calls comprises two or more system calls that appear successively or not.
  • Each M-sequence may therefore include wildcards that are indicated by (*).
  • a wildcard that appears within a sequence indicates any number (one or more) of unidentified system calls.
  • the various system calls will be indicated herein by one of the letters a-z.
  • the a-z indications do not represent all of the approximately 1100 existing system calls, but for the sake of the present explanation a reference to only 26 different system calls (as represented by the letters a-z) suffices.
  • the following are only some examples for possible M-sequences of system calls within the M-database 103 :
  • FIG. 2 describes a training phase process for determining the database of M-sequences, according to one aspect of the invention.
  • the process comprises accumulation of as many as possible (for example 50,000) executables that are known to be malicious, and as many as possible (for example 70,000) executables that are known to be benign.
  • each of said benign and malicious executables is activated (i.e. executed), and some selected run-time sequence of system calls is recorded for each of said executables.
  • the length of each of said n and m sequence records (within M r and B r) is relatively long (for example, between 100 and 10,000 system calls).
  • said “raw” sequences of system calls may be recorded during about 5 seconds in which the respective benign or malicious file is run.
  • a running file typically issues between 100 and 10,000 system calls. It should be noted that there is no necessity for having a same sequence length for all the various “raw” recorded sequences within either the M r or the B r dataset.
  • the results of the training phase are a database M of M 1−q sequences that are each characteristic only to some group G of malicious files but not to any of the benign files.
  • the lengths of the various M 1−q sequences are not necessarily identical, and each of said sequences may comprise zero or more wildcards.
  • A flow diagram for finding the M 1−q sequences, i.e., those which are characteristic only to malicious executables, is shown in FIG. 2.
  • In step 200, the M r and the B r datasets of “raw” malicious and benign sequences respectively are provided.
  • each of said datasets includes as many as possible recorded “raw” sequences of system calls of executables that are known to be malicious (in the M r dataset) and benign (in the B r dataset) respectively.
  • an initial support value, CurrSupp, is set to 100%.
  • support relates to the percent of files within the M r dataset in which a certain specific sequence of system calls is present. For example, the use of a support value of 76% indicates that the process looks for specific sequences of system calls that appear in at least 76% of the files whose “raw” sequences appear within the M r dataset.
  • Step 203 may apply the SPADE algorithm as described in M. J. Zaki, “SPADE: An Efficient Algorithm for Mining Frequent Sequences”, Machine Learning, 42, 31-60, 2001, or any other suitable algorithm.
  • SPADE is an algorithm for fast mining of sequential patterns in large databases. Given a database and a minimal support value (in the present case Current_Support), SPADE efficiently generates all sequences that repeat (i.e. frequent) in the database with a support equal to or greater than Current_Support.
  • each s i in the found sequences S may contain one or more wildcards.
  • the process checks whether the sequence s i appears within any of the sequences included within the dataset B r , which as said contains raw sequences of benign executables. If it is found in step 204 that a sequence s i appears within one or more of the raw sequences within the B r dataset, that means that s i is not a suitable sequence for the purpose of determining malicious executables according to the invention, as it is not characteristic only of malicious executables.
  • the output from step 204 is therefore a reduced set S m , which includes only those sequences from S that do not appear in any of the sequences of B r , and therefore are characteristic to only malicious executables.
  • If S m is NULL (i.e. contains no sequence), the process continues to step 209, in which the Current_Support is reduced by 1, and the procedure returns to step 202.
  • If, however, the set S m contains one or more sequences, those sequences are added in step 206 into the database M.
  • In step 207, all the raw sequences for which a match has been found with one or more of the S m sequences are eliminated from dataset M r.
  • In step 201 the Current_Support is again set to 100%; in this case, however, the 100% relates only to those sequences that remained within M r after the sequence elimination of step 207. Therefore, the procedure repeats until it is found in step 202 that the Current_Support is equal to 0, or that M r is empty.
  • the result of the completion of the procedure of FIG. 2 is the database M, which contains a collection of sequences of system calls that are characteristic to only malicious executables and never appear during real time execution of benign files.
  • a database M which includes plurality of sequences s 1 -s y is formed, wherein each of said sequences is characteristic to a corresponding group G of malicious executables, but not to any benign executable.
  • Said database M, including all said found sequences s 1 -s y is used in runtime for detecting malicious executables, in a manner as described above with respect to FIG. 1 .
  • the process of the present invention enables the detection of all, or at least most, malicious executables that are entirely new and not known beforehand, as it is assumed that their behavior introduces one of the sequences of system calls within the database M.

Abstract

The invention relates to a method for detecting malicious executables, which comprises: in an offline training phase, finding a collection of system call sequences that are characteristic only to malicious files, when such malicious files are executed, and storing said sequences in a database; and, in runtime, for each running executable, continuously monitoring its issued run-time system calls and comparing with the stored sequences of system calls within the database to determine whether there exists a match between a portion of the sequence of the run-time system calls and one or more of the database sequences, and when such a match is found, declaring said executable as malicious.

Description

FIELD OF THE INVENTION
The field of the invention relates to systems for detecting malicious executables. More particularly, the present invention relates to a system and method for detecting malicious executables, based on the use of a database of system call sequences that are characteristic only to malicious executables.
BACKGROUND OF THE INVENTION
Malicious executables (or malware) which propagate through the Internet can be classified into three main categories: (a) worm-related; (b) non-worm related (i.e. virus, Trojan); and (c) probes (i.e. adware, spyware, spam, phishing). The detection of malicious executables that are known beforehand is typically performed using signature-based techniques. Said signature-based techniques typically rely on the prior explicit knowledge of the malicious executable code, which is in turn represented by one or more signatures or rules that are stored in a database. According to said prior art techniques, the database is frequently updated with new signatures, based on new observations. The main disadvantage of these techniques is the inability to detect totally new un-encountered malicious executables, (i.e. malicious executables whose signatures are not yet stored in the database).
An object of the present invention is to provide a technique which can detect new malicious executables whose signatures are not yet known. There are two main prior art approaches for performing such a task: (a) static analysis of executables; and (b) dynamic analysis of executables.
The static analysis approach suggests an inspection of the code of executables without actually running them, while the dynamic analysis approach suggests monitoring during the execution phase of the executable in order to detect anomalous behavior.
The present invention suggests a new technique of the dynamic analysis approach for the detection of new, unknown malicious executables.
Traditionally, anomaly detection techniques that are based on dynamic analysis approach have been used to detect new electronic threats (eThreats). These techniques build models of a normal program behavior during a training phase, and then, using the models the techniques attempt to detect deviations from said normal behavior during a detection phase. For example, S. Forrest, “A Sense of Self for UNIX Processes”, Proceedings of the IEEE Symposium on Security and Privacy, Oakland, Calif. 120-128, 1996, introduces a simple anomaly detection technique which is based on monitoring the system calls issued by specific privileged processes. During a training phase, the system of Forrest records short sequences of system calls that represent a normal process behavior into a “normal dictionary”. During a detection phase which is performed later, sequences of actual system calls are compared with said normal dictionary. An alarm is issued if no match is found.
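The normal-dictionary scheme described above can be sketched as follows. This is only an illustration of the idea, not Forrest's implementation: the window length `k`, the function names, and the system call names in the usage lines are assumptions made for the example.

```python
from typing import Iterable, List, Sequence, Set, Tuple

Window = Tuple[str, ...]


def windows(trace: Sequence[str], k: int) -> Iterable[Window]:
    """Yield every contiguous window of k system calls in the trace."""
    for i in range(len(trace) - k + 1):
        yield tuple(trace[i:i + k])


def build_normal_dictionary(normal_traces: Iterable[Sequence[str]], k: int) -> Set[Window]:
    """Training phase: collect all short sequences seen during normal behavior."""
    normal: Set[Window] = set()
    for trace in normal_traces:
        normal.update(windows(trace, k))
    return normal


def find_anomalies(trace: Sequence[str], normal: Set[Window], k: int) -> List[Window]:
    """Detection phase: return the windows that have no match in the dictionary."""
    return [w for w in windows(trace, k) if w not in normal]


# Hypothetical traces; real traces would come from a system call monitor.
normal_dict = build_normal_dictionary([["open", "read", "mmap", "close"]], k=3)
alarms = find_anomalies(["open", "read", "exec", "close"], normal_dict, k=3)
```

Any window returned by `find_anomalies` corresponds to the "no match is found" condition that raises an alarm.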
Several data mining techniques for studying system call sequences have been proposed so far. W. Lee, S. J. Stolfo, and P. K. Chan, “Learning patterns from UNIX process execution traces for intrusion detection”, AAAI Workshop on AI Approaches to Fraud Detection and Risk Management, pages 50-56, AAAI Press, July 1997, and W. Lee and S. J. Stolfo, “Data mining approaches for intrusion detection”, Proceedings of the 7th USENIX Security Symposium, 1998, propose a method for describing “normal” system call sequences by means of a generally small set of rules, wherein the rules cover common elements in those sequences. During real time detection, sequences that are found to violate the rules are considered as anomalies.
The main advantage of said anomaly detection techniques is their ability to detect new, previously un-encountered malicious codes. The main drawback of using these techniques is the necessity to perform a complex and frequent retraining in order to separate “noise” and natural changes to programs from malicious codes. Legitimate program updates may result in false alarms, while malicious code actions that seem to be normal may cause missed detections. Furthermore, most applications that are based on anomaly detection techniques identify malicious behavior of specific processes only.
Another technique which is based on the dynamic analysis approach has been proposed in T. Lee, Jigar J. Mody, “Behavioral Classification”, presented at the EICAR Conference, May 2006. Lee and Jigar propose a malicious code classification technique which is based on clustering of system call sequences. In the technique proposed by Lee and Jigar, malicious programs of various classes are represented as sequences of system calls. A K-medoid clustering algorithm, as described in L. Kaufman and P. J. Rousseeuw, “Finding groups in data: An introduction to cluster analysis”, New York: John Wiley & Sons, 1990, is applied to the sequences in order to map the input into a predefined number of different classes. The distance threshold between sequences is defined as the minimum “cost” required in order to transform one sequence of system calls to another sequence of system calls, by applying a set of predefined operations. The process of Lee and Jigar results in a classifier, which includes a plurality of medoids, wherein each medoid is the best representative of its cluster. The classification of new objects is performed using the nearest neighbor classification method as described in K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft, “When is ‘nearest neighbor’ meaningful?”, Proc. 7th Int. Conf. on Database Theory (ICDT'99), pages 217-235, 1999. A new object is compared to all medoids, and receives a class label of the closest one.
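To make the distance and nearest-medoid steps concrete, the following sketch assumes that the "predefined operations" are unit-cost insertion, deletion and substitution of a single system call (the classic edit distance). That choice, the function names, and the two example medoids are assumptions for illustration only; the cited work may use different operations and costs.

```python
from typing import Dict, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def classify(trace: Sequence[str], medoids: Dict[str, Sequence[str]]) -> str:
    """Nearest neighbor classification: return the label of the closest medoid."""
    return min(medoids, key=lambda label: edit_distance(trace, medoids[label]))


# Hypothetical medoids, one per class, written with one letter per system call.
medoids = {"worm": list("abcabc"), "trojan": list("xyzxyz")}
label = classify(list("abcabd"), medoids)
```

Note that `classify` always returns one of the known labels, which is why such a classifier cannot by itself decide whether an unseen executable is malicious at all.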
The technique above can be used to classify a given malicious code instance as belonging to one of the predefined number of classes, but cannot be used for a new malicious code detection in real time.
It is therefore an object of the present invention to provide a general, real time detection method and system that is more reliable than prior art methods and systems.
It is still another object of the invention to provide a method which can detect a new malicious code in any executable, and not only in specific previously known programs.
Other objects and advantages will become apparent as the description proceeds.
SUMMARY OF THE INVENTION
The present invention relates to a method for detecting malicious executables, which comprises the steps of: (a) in an offline training phase, finding a collection of system call sequences that are characteristic only to malicious files, when such malicious files are executed, and storing said sequences in a database; (b) in runtime, for each running executable, continuously monitoring its issued run-time system calls and comparing with the stored sequences of system calls within the database to determine whether there exists a match between a portion of the sequence of the run-time system calls and one or more of the database sequences, and when such a match is found, declaring said executable as malicious.
Preferably, each of said system call sequences that are determined during the training phase includes zero or more wildcards, wherein each wildcard defines the existence of zero or more system calls of any undefined type at the location of the wild card within the sequence.
In an embodiment of the invention, said training phase comprises the steps of: (a) providing an Mr dataset which comprises recordings of as many as possible system call sequences of malicious executables, and a Br dataset which comprises recordings of as many as possible system call sequences of benign executables; (b) for a specific support value, and using a SPADE algorithm, finding a set S of system call sequences, each of said sequences is repeated within some group equal or larger than the support value from among all the system call sequences within the malicious recordings in dataset Mr; (c) for each of the sequences found within set S, determining whether it is found within any of the recordings within the dataset Br, and forming a reduced dataset Sm which contains only those sequences that are not included within any of the recorded sequences within benign dataset Br; (d) Adding Sm into database M, and eliminating from dataset Mr all the recordings which have been found to contain any one or more of the sequences of Sm; (e) If, however, dataset Sm is found in step (c) to be empty, reducing the support value, and repeating the procedure from step (b); and (f) Continuing the procedure from step (b) until either the support value is equal to zero, or the dataset Mr is empty, therefore finalizing the procedure with a dataset M containing a group of sequences that each appears within one or more of the run-time sequences of malicious executables, but does not appear within any of the run-time sequences of benign executables.
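Steps (a) to (f) can be sketched as the loop below. Two simplifications are assumed and should not be read as the patented procedure: recordings are strings with one letter per system call, and `mine_frequent` is a naive stand-in for SPADE that only considers contiguous patterns (no wildcards) up to a hypothetical `max_len`. Resetting the support to 100% after each successful round follows the description of step 201.

```python
from typing import List, Set


def mine_frequent(traces: List[str], min_support: float, max_len: int) -> Set[str]:
    """Naive stand-in for SPADE: contiguous patterns of length 2..max_len whose
    support (percent of traces containing them) is at least min_support."""
    candidates: Set[str] = set()
    for t in traces:
        for n in range(2, max_len + 1):
            for i in range(len(t) - n + 1):
                candidates.add(t[i:i + n])
    return {
        p for p in candidates
        if 100.0 * sum(p in t for t in traces) / len(traces) >= min_support
    }


def train(malicious: List[str], benign: List[str], max_len: int = 4) -> Set[str]:
    m_db: Set[str] = set()   # database M
    m_r = list(malicious)    # dataset Mr; shrinks as recordings become covered
    support = 100            # current support value, in percent
    while support > 0 and m_r:                                    # step (f)
        s = mine_frequent(m_r, support, max_len)                  # step (b)
        s_m = {p for p in s if not any(p in b for b in benign)}   # step (c)
        if not s_m:
            support -= 1                                          # step (e)
            continue
        m_db |= s_m                                               # step (d)
        m_r = [t for t in m_r if not any(p in t for p in s_m)]
        support = 100  # restart relative to the remaining recordings
    return m_db


# Hypothetical recordings: "ab" is frequent in malware but also occurs in a benign trace.
m_database = train(["abxyq", "abxyr", "mzwn"], ["abcd", "mn"], max_len=2)
```

In this toy run the pattern `ab` is mined as frequent but rejected in step (c) because it also appears in a benign recording.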
BRIEF DESCRIPTION OF THE DRAWINGS
In the drawings:
FIG. 1 illustrates the method for detecting malicious executables, as performed in runtime, according to an embodiment of the present invention; and
FIG. 2 illustrates a training procedure which is performed off-line, prior to the performance of the run-time procedure of FIG. 1, and which determines a set of system call sequences that are characteristic only to malicious executables and not to any benign executable.
DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS
The present invention introduces a novel technique for the real-time detection of new malicious executables. Instead of looking for anomalies, or trying to separate between malicious and benign behavior of executables, the present invention finds “behavior signatures” (i.e. sequences of system calls) that are characteristic to malicious executables and not to benign executables. The invention utilizes the observation by the inventors that specific sequences of system calls are characteristic each to only a group of malicious executables while not characteristic to any benign executable. The present invention determines and assigns sequences of system calls as representing the behavior of a malicious program. This is performed during a learning/training phase. During a detection phase, which is performed in run time (i.e. after said learning/training phase), the invention identifies malicious executables by comparing their own run time sequences of system calls with said stored (in the database) sequences of system calls that are characteristic to only malicious executables. As will be demonstrated hereinafter, the present invention, in a first aspect, detects malicious objects by (a) determining during a training phase a group of system call sequences that are characteristic only to malicious executables, and storing all said sequences in a database; (b) monitoring in runtime the system calls relating to each running executable, and comparing the same in real time with said database of malicious sequences of system calls; and (c) if a match is found between a monitored sequence and one or more of the sequences that are stored within the “malicious” database, declaring the monitored executable as malicious.
A first aspect of the invention relates to the phase of forming the database M which, as said, includes the sequences that are characteristic only to malicious executables. This phase will be referred to also as the training phase. A second aspect of the invention relates to the run-time phase, which utilizes the database M for determining whether a running executable is malicious or not.
FIG. 1 is a flow diagram illustrating the process for detecting malicious executables according to said first aspect of the present invention. Training phase 101 is a preliminary phase, which is performed off-line. During the training phase an “M determining module” 102 operates to determine as many M-sequences of system calls as possible that are characteristic only to malicious executables, and not to any benign program. It should be noted that each found M-sequence generally relates to a group of existing malicious executables. Said M determining module 102 produces an “M database” 103 which includes the collection of M-sequences, as determined. The M database 103 forms input data to comparator 104, which operates in runtime, or more particularly, is a part of runtime monitoring phase 105. During the runtime monitoring phase 105, comparator 104 continuously receives inputs relating to the system calls that are issued by the currently running executables. More specifically, comparator 104 receives over input bus 109, for each issued system call, the system call ID and the file ID (i.e. an indication of the executable that issued said specific system call). Comparator 104, which has access to M database 103, compares in real time, separately for each running program, the sequence of system calls 109 that it issues with each of the sequences stored in the M database, which are characteristic only to malicious executables. If, with respect to a specific running program, a match is found with one or more of the M-sequences, comparator 104 outputs such an indication (for example, in the form of Malicious, File ID), and this specific executable is declared as malicious and can be terminated. Otherwise, as long as no such alert signal is issued, this running file is considered as benign.
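One possible way to organize the comparator is to keep, for every file ID, a small matching state per M-sequence and advance it on each incoming (file ID, system call ID) event. The sketch below is an assumption-laden illustration rather than the patented comparator: the class and method names are hypothetical, each system call ID is a single letter, and `*` is taken to mean "zero or more calls", as in the Summary.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Set, Tuple


def split_segments(m_sequence: str) -> List[str]:
    """Split 'ab*c*dft' into its literal runs ['ab', 'c', 'dft']."""
    return [seg for seg in m_sequence.split("*") if seg]


class Comparator:
    def __init__(self, m_database: List[str]) -> None:
        # Each pattern is a list of literal runs separated by wildcards.
        self._patterns = [s for s in map(split_segments, m_database) if s]
        # Per file ID and per pattern: (index of awaited run, recent calls).
        self._state: Dict[str, List[Tuple[int, str]]] = defaultdict(
            lambda: [(0, "") for _ in self._patterns]
        )
        self.malicious: Set[str] = set()

    def on_system_call(self, file_id: str, call_id: str) -> Optional[Tuple[str, str]]:
        """Consume one event; return ('Malicious', file_id) once a pattern matches."""
        if file_id in self.malicious:
            return ("Malicious", file_id)
        states = self._state[file_id]
        for p, segments in enumerate(self._patterns):
            seg_index, recent = states[p]
            segment = segments[seg_index]
            # Keep only as many recent calls as the awaited run is long.
            recent = (recent + call_id)[-len(segment):]
            if recent == segment:
                seg_index, recent = seg_index + 1, ""
            if seg_index == len(segments):
                self.malicious.add(file_id)
                return ("Malicious", file_id)
            states[p] = (seg_index, recent)
        return None


comparator = Comparator(["ab*c*dft*wsyp", "ajkeub"])
alerts = []
for file_id, trace in (("f1", "xxabqqcdftzzwsyp"), ("f2", "abcdft")):
    for call in trace:
        alert = comparator.on_system_call(file_id, call)
        if alert and alert not in alerts:
            alerts.append(alert)
```

Keeping only a bounded tail of recent calls per pattern avoids rescanning the whole history on every system call, which matters when files issue thousands of calls within seconds.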
As is known, presently there are about 1100 different system calls for WINDOWS operating systems. According to the present invention each of the M-sequences of system calls comprises two or more system calls that appear successively or not. Each M-sequence may therefore include wildcards that are indicated by (*). A wildcard that appears within a sequence indicates any number (one or more) of unidentified system calls. Just for convenience of explanation, the various system calls will be indicated herein by one of the letters a-z. Of course, the a-z indications do not represent all of the approximately 1100 existing system calls, but for the sake of the present explanation a reference to only 26 different system calls (as represented by the letters a-z) suffices. The following are only some examples of possible M-sequences of system calls within the M-database 103:
    • a. ab*c*dft*wsyp;
    • b. fgew*uyojf*qlu;
    • c. fg*rt*y*uopegh*edf*w;
    • d. ajkeub;
    • e. etc.
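As an illustration of how such wildcard M-sequences could be tested against a recorded trace, the following Python sketch translates an M-sequence into a regular expression. The function names and the `min_gap` parameter are hypothetical. The description above defines a wildcard as one or more unidentified system calls, whereas claim 2 recites zero or more, so the sketch exposes both readings. It assumes, as in the a-z notation used here, that every system call is written as a single character.

```python
import re


def compile_m_sequence(pattern, min_gap=1):
    """Translate an M-sequence such as 'ab*c' into a compiled regex.

    min_gap=1: '*' stands for one or more system calls (description).
    min_gap=0: '*' stands for zero or more system calls (claim 2).
    """
    gap = ".+" if min_gap >= 1 else ".*"
    parts = [gap if ch == "*" else re.escape(ch) for ch in pattern]
    return re.compile("".join(parts), re.DOTALL)


def matches(pattern, trace, min_gap=1):
    """Return True if the M-sequence occurs anywhere within the trace."""
    return compile_m_sequence(pattern, min_gap).search(trace) is not None
```

For example, under the one-or-more reading the M-sequence `ab*c` occurs in the trace `xxabzzcxx` but not in `xxabcxx`; under the zero-or-more reading it occurs in both.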
FIG. 2 describes a training phase process for determining the database of M-sequences, according to one aspect of the invention. The process comprises accumulation of as many as possible (for example 50,000) executables that are known to be malicious, and as many as possible (for example 70,000) executables that are known to be benign. At the first stage, each of said benign and malicious executables is activated (i.e. executed), and some selected run-time sequence of system calls is recorded for each of said executables. The results are two datasets: the Mr dataset, which (according to this example) contains about 50,000 different Mr(1−n) records of “raw” sequences relating respectively to the 50,000 (n=50,000) malicious executables, and the Br dataset, which similarly contains about 70,000 different records of “raw” sequences Br(1−m) relating respectively to the 70,000 (m=70,000) benign executables. The length of each of said n and m sequence records (within Mr and Br) is relatively long (for example, between 100 and 10,000 system calls). For example, said “raw” sequences of system calls may be recorded during about 5 seconds in which the respective benign or malicious file is run. During this exemplary 5-second period, a running file typically issues between 100 and 10,000 system calls. It should be noted that there is no necessity for having the same sequence length for all the various “raw” recorded sequences within the Mr and Br datasets. As mentioned, the result of the training phase is a database M of M1-q sequences that are each characteristic only to some group G of malicious files but not to any of the benign files. The lengths of the various M1-q sequences are not necessarily identical, and each of said sequences may comprise zero or more wildcards.
A flow diagram for finding the M1-q sequences, i.e., those which are characteristic only to malicious executables, is shown in FIG. 2.
Initially, in step 200, the Mr and the Br datasets of “raw” malicious and benign sequences respectively are provided. As said, each of said datasets includes as many as possible recorded “raw” sequences of system calls of executables that are known to be malicious (in the Mr dataset) and benign (in the Br dataset) respectively. Next, in step 201, an initial support value, Current_Support (CurrSupp), is set to 100%. The term “support” relates to the percentage of files within the Mr dataset in which a certain specific sequence of system calls is present. For example, the use of a support value of 76% indicates that the process looks for specific sequences of system calls that appear in at least 76% of the files whose “raw” sequences appear within the Mr dataset. The term Current_Support therefore defines the presently used support value. In step 202, a check is made to determine whether the Current_Support is zero, or whether the dataset Mr is empty. If one or both of said two conditions of step 202 are met, the process ends with step 210, in which database M contains a collection of system call sequences that are characteristic only to malicious files (and not to benign files). Otherwise, if neither of the two conditions is met in step 202, the procedure continues to step 203.

In step 203, a set S of all sequences in Mr having a support equal to or greater than Current_Support is determined. For example, if the Current_Support=76%, the procedure finds all the sequences that repeat within 76% or more of the raw sequences of dataset Mr. Step 203 may apply the SPADE algorithm as described in M. J. Zaki, “SPADE: An Efficient Algorithm for Mining Frequent Sequences”, Machine Learning, 42, 31-60, 2001, or any other suitable algorithm. SPADE is an algorithm for fast mining of sequential patterns in large databases. Given a database and a minimal support value (in the present case Current_Support), SPADE efficiently generates all sequences that repeat (i.e. are frequent) in the database with a support equal to or greater than Current_Support. It should be noted that each si in the found sequences S may contain one or more wildcards.

In step 204, for each sequence si in S the process checks whether the sequence si appears within any of the sequences included within the dataset Br, which as said contains raw sequences of benign executables. If it is found in step 204 that a sequence si appears within one or more of the raw sequences within the Br dataset, that means that si is not a suitable sequence for the purpose of determining malicious executables according to the invention, as it is not characteristic only to malicious executables. The output from step 204 is therefore a reduced set Sm, which includes only those sequences from S that do not appear in any of the sequences of Br, and therefore are characteristic only to malicious executables. If it is found in step 205 that Sm is NULL (i.e. contains no sequence), the process continues to step 209, in which the Current_Support is reduced by 1, and the procedure returns to step 202. If, on the other hand, it is found in step 205 that the set Sm contains one or more sequences, those sequences are added in step 206 into the database M. Then, in step 207, all the raw sequences of dataset Mr for which a match has been found with one or more of the Sm sequences (i.e. those raw sequences of Mr which contain one or more of the sequences in Sm) are eliminated from the dataset Mr, and the procedure continues in step 201. In step 201, the Current_Support is again set to 100%; however, in this case the 100% relates only to those sequences that remained within Mr after the elimination of step 207.

The procedure therefore repeats until it is found in step 202 that the Current_Support is equal to 0, or that Mr is empty. The result of the completion of the procedure of FIG. 2 is the database M, which contains a collection of sequences of system calls that are characteristic only to malicious executables and never appear during real time execution of benign files.
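The loop of FIG. 2 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the patented implementation. In particular, the SPADE algorithm of step 203 is replaced here by a naive miner (`mine_frequent`, a hypothetical helper) that enumerates only contiguous runs of system calls and therefore never produces wildcards. The function names, the `min_len`, `max_len` and `step` parameters, and the representation of each raw sequence as a string with one character per system call are all assumptions made for illustration.

```python
from collections import Counter


def mine_frequent(traces, support_pct, min_len=2, max_len=4):
    """Naive stand-in for SPADE (step 203): contiguous runs of system calls
    present in at least support_pct percent of the traces."""
    counts = Counter()
    for trace in traces:
        seen = set()
        for n in range(min_len, max_len + 1):
            for i in range(len(trace) - n + 1):
                seen.add(trace[i:i + n])
        counts.update(seen)  # each trace counts once per distinct run
    return {g for g, c in counts.items() if c * 100 >= support_pct * len(traces)}


def train(mr, br, step=1):
    """Build database M from malicious traces mr and benign traces br."""
    mr = list(mr)                                   # step 200: datasets provided
    m_db = set()
    support = 100                                   # step 201
    while support > 0 and mr:                       # step 202
        s = mine_frequent(mr, support)              # step 203
        # Step 204: keep only sequences absent from every benign trace.
        sm = {g for g in s if not any(g in b for b in br)}
        if not sm:                                  # step 205
            support -= step                         # step 209
            continue
        m_db |= sm                                  # step 206
        # Step 207: drop the malicious traces already covered by Sm.
        mr = [t for t in mr if not any(g in t for g in sm)]
        support = 100                               # back to step 201
    return m_db                                     # step 210
```

Each pass either lowers the support value or removes at least one trace from Mr, so the loop always terminates. For instance, with malicious traces `xabcx`, `yabcy`, `qdefq`, `zdefz` and benign traces `xyzq`, `abd`, `efg`, no run reaches the required support until the value has dropped to 50%; at that point the runs `ab` and `ef` are discarded because they occur in benign traces, and `bc`, `abc`, `de` and `def` are stored in M.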
More specifically, the process of FIG. 2, as described above, repeats until all the executables within the dataset Mr are exhausted. Therefore, at the end of this process, a database M which includes a plurality of sequences s1-sy is formed, wherein each of said sequences is characteristic to a corresponding group G of malicious executables, but not to any benign executable. Said database M, including all said found sequences s1-sy, is used in runtime for detecting malicious executables, in a manner as described above with respect to FIG. 1.
It should be noted that the process of the present invention, as described above, enables the detection of all, or at least most, of the malicious executables that are entirely new and not known beforehand, as it is assumed that their behavior includes one of the sequences of system calls within the database M.
While some embodiments of the invention have been described by way of illustration, it will be apparent that the invention can be carried out with many modifications, variations and adaptations, and with the use of numerous equivalents or alternative solutions that are within the scope of persons skilled in the art, without departing from the spirit of the invention or exceeding the scope of the claims.

Claims (2)

1. A method for detecting malicious executables, which comprises:
(a) creating, in an offline training phase, a database M of system call sequences that are characteristic only to malicious files, when the malicious files are executed, and storing said system call sequences in the database M; and
(b) continuously monitoring in runtime, for each running executable, issued run-time system calls of each running executable and comparing with the system call sequences within the database M to determine whether there exists a match between a portion of a system call sequence of the issued run-time system calls and one or more of the system call sequences in the database M, and when the match is found, declaring said running executable as malicious;
wherein said training phase comprises:
(c) providing an Mr dataset which comprises recordings of system call sequences of malicious executables, and a Br dataset which comprises recordings of system call sequences of benign executables;
(d) finding, using a Sequential Pattern Discovery using Equivalence classes (SPADE) algorithm, a set S of system call sequences for a specific support value, wherein the specific support value corresponds to a percentage value of recordings of system call sequences of malicious executable within the Mr dataset, in which recordings of system call sequences of malicious executables a given sequence of system calls is present, wherein each system call sequence of said set S is repeated within some group of recordings of system call sequences of malicious executables in the Mr dataset equal to or larger than the specific support value from among all the system call sequences of the recordings of system call sequences of malicious executables in dataset Mr;
(e) determining, for each of the system call sequences of set S, whether the system call sequence of set S is found within any of the recordings within the dataset Br, and forming a reduced dataset Sm which contains only system call sequences of set S that are not included within any of the recordings of system call sequences of benign executables in dataset Br;
(f) adding dataset Sm into database M, and eliminating from dataset Mr all recordings which have been found to contain any one or more of the system call sequences of dataset Sm;
(g) when dataset Sm is found in step (e) to be empty, reducing the specific support value, and repeating step (d); and
(h) continuing step (d) until either the specific support value is equal to zero, or the dataset Mr is empty, therefore finalizing the procedure with a database M containing a group of system call sequences that each appears within one or more of the system call sequences of malicious executables, but does not appear within any of the system call sequences of benign executables.
2. A method according to claim 1, wherein each of said system call sequences of the database M, includes zero or more wildcards, wherein each wildcard defines the existence of zero or more system calls of any undefined type at the location of the wild card within the system call sequence.
US12/697,559 2009-03-08 2010-02-01 System and method for detecting new malicious executables, based on discovering and monitoring characteristic system call sequences Expired - Fee Related US8332944B2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
IL197477 2009-03-08
IL197477A IL197477A0 (en) 2009-03-08 2009-03-08 System and method for detecting new malicious executables, based on discovering and monitoring of characteristic system call sequences

Publications (2)

Publication Number Publication Date
US20100229239A1 US20100229239A1 (en) 2010-09-09
US8332944B2 true US8332944B2 (en) 2012-12-11

Family

ID=42112279

Family Applications (1)

Application Number Title Priority Date Filing Date
US12/697,559 Expired - Fee Related US8332944B2 (en) 2009-03-08 2010-02-01 System and method for detecting new malicious executables, based on discovering and monitoring characteristic system call sequences

Country Status (3)

Country Link
US (1) US8332944B2 (en)
EP (1) EP2228743B1 (en)
IL (1) IL197477A0 (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120124667A1 (en) * 2010-11-12 2012-05-17 National Chiao Tung University Machine-implemented method and system for determining whether a to-be-analyzed software is a known malware or a variant of the known malware
US8613080B2 (en) 2007-02-16 2013-12-17 Veracode, Inc. Assessment and analysis of software security flaws in virtual machines
US9286041B2 (en) 2002-12-06 2016-03-15 Veracode, Inc. Software analysis framework
US9286063B2 (en) 2012-02-22 2016-03-15 Veracode, Inc. Methods and systems for providing feedback and suggested programming methods
US9503465B2 (en) 2013-11-14 2016-11-22 At&T Intellectual Property I, L.P. Methods and apparatus to identify malicious activity in a network
US20200394496A1 (en) * 2019-06-11 2020-12-17 International Business Machines Corporation Detecting Non-Anomalous and Anomalous Sequences of Computer-Executed Operations

Families Citing this family (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2008047351A2 (en) 2006-10-19 2008-04-24 Checkmarx Ltd. Locating security vulnerabilities in source code
US9141806B2 (en) * 2010-08-24 2015-09-22 Checkmarx Ltd. Mining source code for violations of programming rules
RU2454714C1 (en) * 2010-12-30 2012-06-27 Закрытое акционерное общество "Лаборатория Касперского" System and method of increasing efficiency of detecting unknown harmful objects
ES2755780T3 (en) 2011-09-16 2020-04-23 Veracode Inc Automated behavior and static analysis using an instrumented sandbox and machine learning classification for mobile security
US9659173B2 (en) * 2012-01-31 2017-05-23 International Business Machines Corporation Method for detecting a malware
TWI461953B (en) * 2012-07-12 2014-11-21 Ind Tech Res Inst Computing environment security method and electronic computing system
US11126720B2 (en) 2012-09-26 2021-09-21 Bluvector, Inc. System and method for automated machine-learning, zero-day malware detection
US9292688B2 (en) 2012-09-26 2016-03-22 Northrop Grumman Systems Corporation System and method for automated machine-learning, zero-day malware detection
US9239922B1 (en) * 2013-03-11 2016-01-19 Trend Micro Inc. Document exploit detection using baseline comparison
US9602528B2 (en) * 2014-05-15 2017-03-21 Nec Corporation Discovering and constraining idle processes
WO2016081346A1 (en) 2014-11-21 2016-05-26 Northrup Grumman Systems Corporation System and method for network data characterization
US9690606B1 (en) * 2015-03-25 2017-06-27 Fireeye, Inc. Selective system call monitoring
US9930186B2 (en) * 2015-10-14 2018-03-27 Pindrop Security, Inc. Call detail record analysis to identify fraudulent activity
KR20170108330A (en) 2016-03-17 2017-09-27 한국전자통신연구원 Apparatus and method for detecting malware code
US10366234B2 (en) * 2016-09-16 2019-07-30 Rapid7, Inc. Identifying web shell applications through file analysis
US10061921B1 (en) * 2017-02-13 2018-08-28 Trend Micro Incorporated Methods and systems for detecting computer security threats
US11087002B2 (en) 2017-05-10 2021-08-10 Checkmarx Ltd. Using the same query language for static and dynamic application security testing tools
CN108021806B (en) * 2017-11-24 2021-10-22 北京奇虎科技有限公司 Malicious installation package identification method and device
EP3716281A1 (en) * 2019-03-25 2020-09-30 Siemens Healthcare GmbH Sequence mining in medical iot data
US11470194B2 (en) 2019-08-19 2022-10-11 Pindrop Security, Inc. Caller verification via carrier metadata
US11296868B1 (en) 2019-09-17 2022-04-05 Trend Micro Incorporated Methods and system for combating cyber threats using a related object sequence hash
US11556649B2 (en) * 2019-12-23 2023-01-17 Mcafee, Llc Methods and apparatus to facilitate malware detection using compressed data
US11836258B2 (en) 2020-07-28 2023-12-05 Checkmarx Ltd. Detecting exploitable paths in application software that uses third-party libraries
US11811802B2 (en) * 2020-08-21 2023-11-07 Microsoft Technology Licensing, Llc. Cloud security monitoring of applications in PaaS services
WO2022148992A1 (en) * 2021-01-08 2022-07-14 Telefonaktiebolaget Lm Ericsson (Publ) Prevention of abnormal operation in a system
US20230343122A1 (en) * 2022-03-23 2023-10-26 Automation Hero, Inc. Performing optical character recognition based on fuzzy pattern search generated using image transformation
US11818148B1 (en) * 2022-05-15 2023-11-14 Uab 360 It Optimized analysis for detecting harmful content
CN115378702B (en) * 2022-08-22 2024-04-02 重庆邮电大学 Attack detection system based on Linux system call

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030065926A1 (en) * 2001-07-30 2003-04-03 Schultz Matthew G. System and methods for detection of new malicious executables
US6775780B1 (en) * 2000-03-16 2004-08-10 Networks Associates Technology, Inc. Detecting malicious software by analyzing patterns of system calls generated during emulation
US20040255163A1 (en) * 2002-06-03 2004-12-16 International Business Machines Corporation Preventing attacks in a data processing system
US20050283838A1 (en) * 2003-02-26 2005-12-22 Secure Ware Inc. Malicious-process-determining method, data processing apparatus and recording medium
US20060037080A1 (en) * 2004-08-13 2006-02-16 Georgetown University System and method for detecting malicious executable code
US7072876B1 (en) * 2000-09-19 2006-07-04 Cigital System and method for mining execution traces with finite automata
US20070239999A1 (en) * 2002-01-25 2007-10-11 Andrew Honig Systems and methods for adaptive model generation for detecting intrusions in computer systems
US20080120720A1 (en) 2006-11-17 2008-05-22 Jinhong Guo Intrusion detection via high dimensional vector matching
US8205256B2 (en) * 2007-01-31 2012-06-19 Samsung Electronics Co., Ltd. Apparatus for detecting intrusion code and method using the same

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6775780B1 (en) * 2000-03-16 2004-08-10 Networks Associates Technology, Inc. Detecting malicious software by analyzing patterns of system calls generated during emulation
US7072876B1 (en) * 2000-09-19 2006-07-04 Cigital System and method for mining execution traces with finite automata
US20030065926A1 (en) * 2001-07-30 2003-04-03 Schultz Matthew G. System and methods for detection of new malicious executables
US20070239999A1 (en) * 2002-01-25 2007-10-11 Andrew Honig Systems and methods for adaptive model generation for detecting intrusions in computer systems
US20040255163A1 (en) * 2002-06-03 2004-12-16 International Business Machines Corporation Preventing attacks in a data processing system
US20050283838A1 (en) * 2003-02-26 2005-12-22 Secure Ware Inc. Malicious-process-determining method, data processing apparatus and recording medium
US20060037080A1 (en) * 2004-08-13 2006-02-16 Georgetown University System and method for detecting malicious executable code
US8037535B2 (en) * 2004-08-13 2011-10-11 Georgetown University System and method for detecting malicious executable code
US20080120720A1 (en) 2006-11-17 2008-05-22 Jinhong Guo Intrusion detection via high dimensional vector matching
US8205256B2 (en) * 2007-01-31 2012-06-19 Samsung Electronics Co., Ltd. Apparatus for detecting intrusion code and method using the same

Non-Patent Citations (12)

* Cited by examiner, † Cited by third party
Title
European Search report for corresponding EP application-8 pages-mailed on Jun. 10, 2010.
Forrest, S. et al: "Evolution of System-Call Monitoring" Computer Security Applications conference, Dec. 8, 2008, pp. 418-430, XP031376795.
K. Beyer et al., Proc. 7th Int. Conf. on Database Theory (ICDT 99), pp. 217-235, 1999.
Lee, W. et al: "Mining audit data . . . models" Proceedings 4th internat'l conf on knowledge discovery & data mining 1998, pp. 66-72, XP007912921.
S. Forrest., "A sense of Self . . . " Proceeding of the IEEE Symposium on Security and Privacy, CA, 9 Pages, 1996.
S. Forrest., Proceeding of the IEEE Symposium on Security and Privacy, CA, pp. 120-128, 1996.
T. Lee et al., "Behavioral Classification" Presented at the EICAR Conference, 2006.
W. Lee et al., "Data Mining . . . Intrusion Detection" Proceedings of the 7th Usenix Security Symposium, 16 Pages 1998.
W. Lee et al., "Learning Patters . . . " AAAI Workshop on AI Approaches to Fraud Detection and Risk Managment, 7 Pages, AAAI Press Jul. 1997.
W. Lee et al., AAAI Workshop on AI Approaches to Fraud Detection and Risk Managment, pp. 50-60, AAAI Press Jul. 1997.
W. Lee et al., Proceedings of the 7th Usenix Security Symposium' 1998.
Zaki, M.: "Spade: An efficient algorithm . . . sequences"; Machine Learning, vol. 42, No. 1-2, Jan. 2001, pp. 31-60, XP002581024.

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9286041B2 (en) 2002-12-06 2016-03-15 Veracode, Inc. Software analysis framework
US8613080B2 (en) 2007-02-16 2013-12-17 Veracode, Inc. Assessment and analysis of software security flaws in virtual machines
US20120124667A1 (en) * 2010-11-12 2012-05-17 National Chiao Tung University Machine-implemented method and system for determining whether a to-be-analyzed software is a known malware or a variant of the known malware
US8505099B2 (en) * 2010-11-12 2013-08-06 National Chiao Tung University Machine-implemented method and system for determining whether a to-be-analyzed software is a known malware or a variant of the known malware
US9286063B2 (en) 2012-02-22 2016-03-15 Veracode, Inc. Methods and systems for providing feedback and suggested programming methods
US9503465B2 (en) 2013-11-14 2016-11-22 At&T Intellectual Property I, L.P. Methods and apparatus to identify malicious activity in a network
US9769190B2 (en) 2013-11-14 2017-09-19 At&T Intellectual Property I, L.P. Methods and apparatus to identify malicious activity in a network
US20200394496A1 (en) * 2019-06-11 2020-12-17 International Business Machines Corporation Detecting Non-Anomalous and Anomalous Sequences of Computer-Executed Operations
US11763132B2 (en) * 2019-06-11 2023-09-19 International Business Machines Corporation Detecting non-anomalous and anomalous sequences of computer-executed operations

Also Published As

Publication number Publication date
US20100229239A1 (en) 2010-09-09
EP2228743A1 (en) 2010-09-15
EP2228743B1 (en) 2013-04-03
IL197477A0 (en) 2009-12-24

Legal Events

Date Code Title Description
AS Assignment

Owner name: DEUTSCHE TELEKOM AG, GERMANY

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:BEN-GURION UNIVERSITY OF THE NEGEV RESEARCH AND DEVELOPMENT AUTHORITY;REEL/FRAME:023878/0536

Effective date: 20091225

Owner name: BEN-GURION UNIVERSITY OF THE NEGEV RESEARCH AND DE

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:ROZENBERG, BORIS;GUDES, EHUD;ELOVICI, YUVAL;SIGNING DATES FROM 20090406 TO 20090421;REEL/FRAME:023878/0505

FEPP Fee payment procedure

Free format text: PAYOR NUMBER ASSIGNED (ORIGINAL EVENT CODE: ASPN); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

REMI Maintenance fee reminder mailed
LAPS Lapse for failure to pay maintenance fees
STCH Information on status: patent discontinuation

Free format text: PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362

FP Lapsed due to failure to pay maintenance fee

Effective date: 20161211