A positional keyword-based approach to inferring fine-grained message formats
Introduction
The automatic inference of protocol message formats from raw network traces is an important problem with widespread applications in domains such as service virtualization and network security. Service virtualization is a critical technology for DevOps and Continuous Delivery [1], as it enables automated testing of software systems against their dependent systems in production-like conditions. Service virtualization [2], [3], [4] requires an understanding of message formats in order to decode request messages and formulate appropriate response messages. In the domain of network security [5], [6], [7], intrusion detection systems (IDS) and firewall systems require knowledge of protocol message formats before they can perform deep packet inspection. For both service virtualization and network security, however, the required message formats are not always available. This situation may arise in scenarios involving legacy systems, proprietary protocols or simply poor documentation [8]. This illustrates the importance of automated inference of the message formats used in various system applications.
Over the past few years, researchers have proposed many methods for protocol message format inference. These methods broadly fall into two categories: (1) those based on reverse-engineering, which extract protocol message formats through reverse engineering the executable code of a software application that implements a given protocol, and (2) those based on analyzing network traces, which extract protocol message formats through analyzing raw network messages of a given protocol. As discussed in [9], reverse engineering protocols typically involves manual effort. To address this problem, many methods for automatic protocol reverse engineering have been proposed. Example methods automating this process include Polyglot [10], Prospex [11], HFSM [12], and AUTOGRAM [13]. A common drawback of these approaches is that they become inapplicable when the executable code of the application is not accessible. With the trends toward cloud computing, Software-as-a-Service and container technology, getting access to the executable application code is becoming less common. In this paper, we focus on extracting protocol message formats by analyzing raw network traces.
Methods based on network trace analysis use statistical learning techniques from Frequent Pattern Mining [14] and Natural Language Processing [15] to mine keyword patterns from raw message traces, group messages into clusters reflecting their types, and consequently infer the message format of each type. Examples of these methods include Discoverer [16], SPI [17], SANTaClass [18], AutoReEngine [19], and ProDecoder [20]. However, a challenge faced by these methods is how to reliably discern which terms are keywords and which terms are part of the message “payload”. For example, AutoReEngine extracts message keywords by splitting messages into n-grams of different lengths and treating frequent n-grams as keywords. In general, keyword identification in existing methods suffers from a number of issues. First, a sub-string or super-string of a keyword may be wrongly identified as a keyword, i.e., keyword under-fitting or keyword over-fitting. Second, certain keyword occurrences may be wrongly treated as payload information, while other occurrences of keyword strings in the payload are wrongly identified as keyword occurrences, i.e., mis-treatment of keyword occurrences. These keyword imprecision and mis-treatment issues are among the main causes of message mis-clustering, in which messages of different types are put into one cluster. Another cause of mis-clustering is the imbalance between different types of messages in the message traces, as clustering is generally based on the frequencies of keyword occurrences. Message mis-clustering leads to over-generalization of the derived message formats, resulting in coarse-grained message formats that accept ill-formed messages.
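The keyword under-fitting and over-fitting issues can be illustrated with a minimal sketch of frequent n-gram mining. This is not AutoReEngine's actual algorithm; the function name `frequent_ngrams`, the support threshold, and the toy trace are all assumptions made for illustration.

```python
from collections import Counter


def frequent_ngrams(messages, n, min_support):
    """Return the character n-grams found in at least min_support (a fraction) of the messages."""
    counts = Counter()
    for msg in messages:
        # Count each n-gram once per message, so the count is a number of messages.
        grams = {msg[i:i + n] for i in range(len(msg) - n + 1)}
        counts.update(grams)
    threshold = min_support * len(messages)
    return {g for g, c in counts.items() if c >= threshold}


# Hypothetical toy trace in which "searchRequest" is the only true keyword.
messages = [
    "searchRequest base=dc=example filter=alice",
    "searchRequest base=dc=example filter=bob",
    "searchRequest base=dc=example filter=carol",
]

short = frequent_ngrams(messages, 6, 1.0)    # n shorter than the keyword
long_ = frequent_ngrams(messages, 18, 1.0)   # n longer than the keyword

print("search" in short)              # a sub-string of the keyword is reported
print("searchRequest base" in long_)  # a super-string of the keyword is reported
```

With a fixed n, frequency alone cannot tell that "search" is only a fragment of the keyword or that "searchRequest base" spills into the neighbouring field, which is the imprecision described above.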
To address the above issues, we propose P-token, a new approach to extract message formats from raw protocol messages. It takes advantage of the properties of protocol messages, particularly those that arise from their template structure, in identifying message keywords. Machine-generated messages are formulated by a computer process or application, according to particular message templates or formats. P-token leverages the positions of keywords in the message template structure to differentiate a keyword occurrence from its string value’s occurrences in message payload and therefore identify message keywords more accurately. P-token has three major steps with corresponding techniques: (1) identifying positional keywords, i.e., keywords together with their positions in the messages, (2) clustering the messages into groups with a two-level hierarchical clustering strategy, such that each group has high homogeneity (representing a particular type of messages), and (3) inferring the message format for each cluster based on the natural positions of keywords in messages. In Step (1), we introduce a new technique called positional token. By associating the position as meta-data with each token, we can more accurately discern which tokens are keywords of the message and which tokens are in the message payload. Hence, positional token can address the aforementioned keyword imprecision and mis-treatment issues faced by existing methods. In Step (2), we present a new two-level clustering technique, based on the extracted positional keywords. After initial clustering, it identifies the clusters that contain messages of different types, and performs a further level of clustering. It, therefore, addresses the mis-clustering and message format over-generalization issue faced by existing methods. In Step (3), the natural positions of keywords in the messages are used to extract the fine-grained format for each cluster. 
Compared to existing methods, which generate message formats by aligning multiple messages or mining keyword patterns [3], our approach produces more accurate representations of the protocol message formats.
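The positional token idea of Step (1) can be sketched as follows. This is a simplified illustration rather than the technique detailed in Section 4: the delimiter set, the use of the absolute token index as the "position", the support threshold, and the names `positional_tokens` and `positional_keywords` are all assumptions.

```python
import re
from collections import Counter

# Assumed delimiter set; a real protocol would need its own.
DELIMITERS = r"[ \t\r\n:;,=<>/]+"


def positional_tokens(message):
    """Split a message on delimiters and pair each token with its index."""
    tokens = [t for t in re.split(DELIMITERS, message) if t]
    return list(enumerate(tokens))


def positional_keywords(messages, min_support=0.5):
    """Return the (position, token) pairs found in at least min_support of the messages."""
    counts = Counter()
    for msg in messages:
        counts.update(set(positional_tokens(msg)))
    threshold = min_support * len(messages)
    return {pt for pt, c in counts.items() if c >= threshold}


# Hypothetical trace: the third message carries the string "user" as payload.
messages = [
    "LOGIN user alice",
    "LOGIN user bob",
    "LOGIN user user",
    "LOGIN user dave",
]

keywords = positional_keywords(messages)
print(sorted(keywords))
```

Because the position travels with the token as meta-data, "user" at position 1 is kept as a keyword while the payload occurrence of "user" at position 2 is not, which a position-free frequency count could not distinguish.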
With P-token, we make the following key contributions in inferring protocol message formats:
- a new positional keyword identification method, which addresses the keyword imprecision and mis-treatment issues;
- a two-level clustering strategy to separate the messages into clusters with high homogeneity, which addresses the mis-clustering and format over-generalization issue;
- a new method to derive fine-grained message formats based on the natural positions of the positional keywords, with high accuracy.
To demonstrate the effectiveness of P-token, we compare it with two state-of-the-art approaches (ProDecoder [20] and AutoReEngine [19]) and two baseline approaches (“vanilla” token and P-token without second-level clustering). Experiments are conducted on real-world software applications using various protocols, including LDAP, SOAP, and IMS, as well as a RESTful service. Our experimental results show that P-token infers more accurate message formats than existing methods.
The rest of the paper proceeds as follows. We analyze the problem of extracting protocol message formats by using a real-world example in Section 2. We give the rationale of P-token in Section 3. We present the detailed techniques involved in P-token in Section 4. Experimental results on real-world protocol traces are reported in Section 5. We discuss related work in Section 7 and conclude this paper in Section 8.
Section snippets
Problem statement
A communication protocol defines the format or structure of messages that the system or service sends and receives. In this paper, we deal with services and applications with tokenized messaging protocols, i.e., the communication messages are tokenized based on delimiters. A message format can be defined as a sequence of message fields (see Fig. 1). The values of some fields are fixed in the messages of the same type, but the values of other fields vary across messages. In the context of this
Rationale
P-token aims at extracting fine-grained message formats based on two key insights: (1) keyword position sensitivity, i.e., message keywords appear in relatively fixed positions across messages of the same type, and (2) intra-cluster homogeneity, i.e., messages of the same type present high homogeneity in terms of their positional keyword-based structure, while messages of different types present high dissimilarity. Meanwhile, as we focus on tokenized messaging protocols, messages can
The proposed approach: P-token
In this section, we present the details of the proposed approach, P-token, aimed at obtaining fine-grained message formats from message traces of unknown protocols. The key novelty of P-token lies in its ability to (1) accurately identify message keywords and (2) accurately group the messages into clusters reflecting their types.
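As a rough illustration of how positional keywords could drive clustering and format derivation, the following sketch groups tokenised messages by the positional keywords they contain, re-mines keywords inside each first-level cluster, and then keeps the tokens that are constant at a position. It is an assumption-laden simplification and not the algorithm of this section: it splits every first-level cluster unconditionally, and the threshold, the `<field>` placeholder and all function names are hypothetical.

```python
from collections import Counter, defaultdict


def frequent_positional_tokens(cluster, min_support):
    """(index, token) pairs present in at least min_support of the cluster's messages."""
    counts = Counter()
    for tokens in cluster:
        counts.update(set(enumerate(tokens)))
    return {pt for pt, c in counts.items() if c >= min_support * len(cluster)}


def cluster_by_signature(cluster, min_support):
    """Group messages that contain exactly the same set of positional keywords."""
    keywords = frequent_positional_tokens(cluster, min_support)
    groups = defaultdict(list)
    for tokens in cluster:
        signature = frozenset(pt for pt in enumerate(tokens) if pt in keywords)
        groups[signature].append(tokens)
    return list(groups.values())


def two_level_clusters(messages, min_support=0.5):
    """First cluster on trace-wide keywords, then again on keywords local to each cluster."""
    result = []
    for first_level in cluster_by_signature(messages, min_support):
        # Keywords of rare message types only become frequent inside a smaller cluster.
        result.extend(cluster_by_signature(first_level, min_support))
    return result


def infer_format(cluster):
    """Keep tokens that are constant at a position; mark the others as variable fields."""
    length = min(len(tokens) for tokens in cluster)
    fmt = []
    for i in range(length):
        values = {tokens[i] for tokens in cluster}
        fmt.append(values.pop() if len(values) == 1 else "<field>")
    return fmt


# Hypothetical, imbalanced trace of already tokenised messages.
trace = [["GET", "user", f"u{i}"] for i in range(6)] + [
    ["DEL", "item", "i1"], ["DEL", "item", "i2"],
    ["PUT", "item", "i3"], ["PUT", "item", "i4"],
]

first_level = cluster_by_signature(trace, 0.5)
formats = sorted(infer_format(c) for c in two_level_clusters(trace))
for fmt in formats:
    print(fmt)
```

In this toy trace the rare DEL and PUT messages share one first-level cluster because none of their tokens is frequent across the whole trace. The second level separates them, so each derived format accepts only one message type.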
Experimental results
In this section, we evaluate our proposed approach, P-token, for extracting protocol message formats on raw message traces collected from real-world services and applications. Here, we first introduce the datasets used to evaluate our approach, then define the evaluation metrics, and finally present the experimental results.
Discussions and limitations
In this section, we further reflect on our approach and discuss limitations of its applicability in the context of different types of protocols. Based on our observations in analyzing message formats, protocols can, in general, be classified into three groups: (i) protocols with fixed message formats, that is, a fixed order of keywords, (ii) protocols with free formats (i.e. keywords can, in principle, appear in arbitrary order) but fixed end-point specific interpretations (i.e. each end-point
Related work
So far, many approaches have been proposed for extracting protocol message formats from raw message traces. Based on the techniques used in extracting message keywords, existing approaches for extracting protocol message formats can be divided into two categories: (1) n-gram based methods and (2) tokenization based methods.
Conclusion and future work
In this paper, we have proposed a novel approach called P-token for extracting protocol message formats from raw message traces. P-token does not assume prior knowledge about message structures, nor does it require access to the executable code of applications implementing the protocols concerned. P-token involves three steps: (i) tokenization-based positional keywords identification, (ii) message clustering based on positional keywords, and (iii) positional keyword-based message format
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work is supported by the Australian Research Council Linkage Project LP150100892, Generating Virtual Deployment Environments for Enterprise Software Systems.
References (42)
- et al., Interaction traces mining for efficient system responses generation, ACM SIGSOFT Softw. Eng. Notes (2015)
- et al., Position-based automatic reverse engineering of network protocols, J. Netw. Comput. Appl. (2013)
- Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation (2010)
- et al., Opaque service virtualisation: a practical tool for emulating endpoint systems
- et al., Mining accurate message formats for service APIs
- et al., Automatic protocol format reverse engineering through context-aware monitored execution
- et al., Inferring protocol state machine from network traces: a probabilistic approach
- et al., Protocol vulnerability detection based on network traffic analysis and binary reverse engineering, PLoS One (2017)
- I. netflow statistics,...
- G. I. M, Client,...
- Polyglot: Automatic extraction of protocol message format using dynamic binary analysis
- Prospex: Protocol specification extraction
- Extracting output formats from executables
- Mining input grammars from dynamic taints
- Frequent Pattern Mining
- Latent Dirichlet allocation, J. Mach. Learn. Res.
- Discoverer: Automatic protocol reverse engineering from network traces
- Stochastic packet inspection for TCP traffic
- SANTaClass: A self adaptive network traffic classification system
- A semantics aware approach to automated reverse engineering unknown protocols
- Enhanced playback of automated service emulation models using entropy analysis
Jiaojiao Jiang received the Ph.D. degree in computer science from Deakin University, Australia, in 2017. She was a Postdoctoral Research Fellow with the School of Software and Electrical Engineering, Swinburne University of Technology, Australia, from Jan 2017 to May 2019. She is currently an Early Career Development Fellow with RMIT University, Melbourne, Australia. Her research interests include service virtualization and cyber security.
Steve Versteeg received the Ph.D. degree in computer science from The University of Melbourne. He is currently an Adjunct Fellow with the Swinburne University of Technology. He has published more than 40 international journals and conference articles and 15 U.S. patents pending. His current research interests include service virtualization, information security, and machine learning.
Jun Han received his Ph.D. degree in computer science from the University of Queensland, Australia. Since 2003, he has been a full professor of software engineering at Swinburne University of Technology, Australia. He has published more than 250 peer-reviewed articles. His current research interests include service and cloud systems engineering, adaptive and context-aware software systems, and software architecture and quality.
Md Arafat Hossain received his B.Sc. and M.Sc. degree from Rajshahi University of Engineering and Technology, Bangladesh in 2007 and 2014, respectively. He is currently working toward the Ph.D. degree in the School of Software and Electrical Engineering, Swinburne University of Technology, Australia. He has been working on virtualizing services to create test-bed environments for systems and services.
Jean-Guy Schneider received the M.Sc. and Ph.D. degrees in computer science and applied mathematics from the University of Bern, Switzerland. He was a Lecturer, a Senior Lecturer, and an Associate Professor in software engineering with the Swinburne University of Technology, from 2000 to 2018. He is currently a Full Professor in software engineering with Deakin University, Burwood, VIC, Australia. His research interests include object-oriented and service-oriented systems, and scripting and composition languages.
- 2 Currently also with “Computer Science and Software Engineering, School of Science, RMIT University, Melbourne, VIC 3000, Australia”.
- 3 Currently also with “School of Information Technology, Deakin University, Burwood, VIC 3125, Australia”.