A positional keyword-based approach to inferring fine-grained message formats

https://doi.org/10.1016/j.future.2019.08.011

Abstract

Message format extraction, the process of revealing the message syntax without access to the protocol specification, is important for a variety of applications such as service virtualization and network security. In this paper, we propose P-token, which mines fine-grained message formats from network traces. The novelty of our approach is twofold: a ‘positional keyword’ identification technique and a two-level hierarchical clustering strategy. Positional keywords are based on the insight that keywords or reserved words usually occur at relatively fixed positions in the messages. By associating positions as meta-information with keywords, we can more accurately distinguish keywords from message payload data. After identification, the positional keywords are used as features to cluster the messages using density peaks clustering. We then perform another level of clustering to refine the clusters with low homogeneity. Finally, the message format of each cluster is extracted based on the observed ordering of keywords. P-token improves on the current state-of-the-art techniques by successfully addressing two challenges that commonly afflict existing keyword-based format extraction methods: message keyword mis-identification and message format over-generalization. We have conducted experiments on services and applications using various protocols, including SOAP, LDAP, and IMS, as well as a RESTful service. Our experimental results show that P-token outperforms existing methods in extracting message formats.

Introduction

The automatic inference of protocol message formats from raw network traces is an important problem with widespread applications, such as in the domains of service virtualization and network security. Service virtualization is a critical technology for DevOps and Continuous Delivery [1], to enable automated testing in production-like conditions of software systems against their dependent systems. Service virtualization [2], [3], [4] requires the understanding of message formats in order to decode request messages and formulate appropriate response messages. In the domain of network security [5], [6], [7], intrusion detection systems (IDS) and firewall systems require the knowledge of protocol message formats before performing deep packet inspection. For both service virtualization and network security, however, the message formats required are not always available. This situation may arise in scenarios involving legacy systems, proprietary protocols or just poor documentation [8]. This illustrates the importance of automated inference of message formats used in various system applications.

Over the past few years, researchers have proposed many methods for protocol message format inference. These methods broadly fall into two categories: (1) those based on reverse-engineering, which extract protocol message formats through reverse engineering the executable code of a software application that implements a given protocol, and (2) those based on analyzing network traces, which extract protocol message formats through analyzing raw network messages of a given protocol. As discussed in [9], reverse engineering protocols typically involves manual effort. To address this problem, many methods for automatic protocol reverse engineering have been proposed. Example methods automating this process include Polyglot [10], Prospex [11], HFSM [12], and AUTOGRAM [13]. A common drawback of these approaches is that they become inapplicable when the executable code of the application is not accessible. With the trends toward cloud computing, Software-as-a-Service and container technology, getting access to the executable application code is becoming less common. In this paper, we focus on extracting protocol message formats by analyzing raw network traces.

Methods based on network trace analysis utilize statistical learning techniques from Frequent Pattern Mining [14] and Natural Language Processing [15] to mine keyword patterns from raw message traces, group messages into clusters reflecting their types, and consequently infer the message format of each type. Examples of these methods include Discoverer [16], SPI [17], SANTaClass [18], AutoReEngine [19], and ProDecoder [20]. However, a challenge faced by these methods is how to reliably discern which terms are keywords and which terms are part of the message “payload”. For example, AutoReEngine extracts message keywords by splitting messages into n-grams of different lengths and treating frequent n-grams as keywords. In general, keyword identification in existing methods suffers from a number of issues. First, a sub-string or super-string of a keyword may be wrongly identified as a keyword, i.e., keyword under-fitting or keyword over-fitting. Second, certain keyword occurrences may be wrongly treated as payload information, while other occurrences of keyword strings in payload are wrongly identified as keyword occurrences, i.e., mis-treatment of keyword occurrence. These keyword imprecision and mis-treatment issues are among the main causes of message mis-clustering, i.e., messages of different types being put into one cluster. Another cause of mis-clustering is the imbalance between different types of messages in the message traces, as clustering is generally based on the frequencies of keyword occurrences. Message mis-clustering leads to over-generalization of the derived message formats, resulting in coarse-grained message formats that accept ill-formed messages.
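The keyword under-fitting and mis-treatment issues can be illustrated with a small sketch. The code below is not AutoReEngine's algorithm; it is a deliberately simplified frequent n-gram miner, and the function name, the sample messages, and the support threshold are all assumptions made for illustration.

```python
from collections import Counter


def frequent_ngrams(messages, n, min_support):
    """Return character n-grams that occur in at least `min_support` messages.

    Hypothetical helper: a simplified stand-in for frequency-based
    keyword mining, not the procedure of any cited tool.
    """
    counts = Counter()
    for msg in messages:
        # Count each n-gram at most once per message (message-level support).
        counts.update({msg[i:i + n] for i in range(len(msg) - n + 1)})
    return {gram for gram, count in counts.items() if count >= min_support}


# Made-up trace: "GET" is a keyword at the start of every message,
# but it also appears as a payload value in the last message.
messages = [
    "GET /user?id=1",
    "GET /user?id=2",
    "GET /item?id=GET",
]

grams = frequent_ngrams(messages, n=3, min_support=3)
# grams == {"GET", "ET ", "T /", "?id", "id="}
```

Besides the genuine keyword "GET", the fragments "ET " and "T /" are reported as frequent, which corresponds to the sub-string problem described above. The n-gram set also carries no information that would separate the leading "GET" from the "GET" that occurs as a payload value in the third message.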

To address the above issues, we propose P-token, a new approach to extract message formats from raw protocol messages. It takes advantage of the properties of protocol messages, particularly those that arise from their template structure, in identifying message keywords. Machine-generated messages are formulated by a computer process or application, according to particular message templates or formats. P-token leverages the positions of keywords in the message template structure to differentiate a keyword occurrence from occurrences of the same string in the message payload, and therefore identifies message keywords more accurately. P-token has three major steps with corresponding techniques: (1) identifying positional keywords, i.e., keywords together with their positions in the messages, (2) clustering the messages into groups with a two-level hierarchical clustering strategy, such that each group has high homogeneity (representing a particular type of message), and (3) inferring the message format for each cluster based on the natural positions of keywords in messages. In Step (1), we introduce a new technique called positional token. By associating the position as meta-data with each token, we can more accurately discern which tokens are keywords of the message and which tokens are in the message payload. Hence, positional token can address the aforementioned keyword imprecision and mis-treatment issues faced by existing methods. In Step (2), we present a new two-level clustering technique, based on the extracted positional keywords. After initial clustering, it identifies the clusters that contain messages of different types, and performs a further level of clustering. It therefore addresses the mis-clustering and message format over-generalization issues faced by existing methods. In Step (3), the natural positions of keywords in the messages are used to extract the fine-grained format for each cluster. Compared to existing methods, which generate message formats by aligning multiple messages or mining keyword patterns [3], our approach produces more accurate representations of the protocol message formats.
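Step (1) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the position of a token is taken to be its index in the tokenized message, the delimiter set is invented, and a (position, token) pair is called a keyword when it occurs in at least a fixed fraction of the messages. The names `tokenize`, `positional_keywords`, and `min_ratio` are hypothetical.

```python
import re
from collections import Counter


def tokenize(message, delimiters=r"[ \t\r\n:=&?/]+"):
    """Split a message on an assumed delimiter set and drop empty tokens."""
    return [token for token in re.split(delimiters, message) if token]


def positional_keywords(messages, min_ratio=0.6):
    """Return (position, token) pairs seen in at least `min_ratio` of messages."""
    counts = Counter()
    for msg in messages:
        # enumerate() attaches the token index as positional meta-data.
        counts.update(enumerate(tokenize(msg)))
    threshold = min_ratio * len(messages)
    return {pair for pair, count in counts.items() if count >= threshold}


# Made-up trace: "GET" is a keyword at position 0 and payload at position 3.
messages = [
    "GET /user?id=1",
    "GET /user?id=2",
    "GET /item?id=GET",
]

keywords = positional_keywords(messages)
# keywords == {(0, "GET"), (1, "user"), (2, "id")}
```

Because the position is part of the feature, the string "GET" at position 0 is retained as a keyword while the same string at position 3 in the third message falls below the threshold and is treated as payload.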

With P-token, we make the following key contributions in inferring protocol message formats:

  • a new positional keyword identification method, which addresses the keyword imprecision and mis-treatment issues;

  • a two-level clustering strategy to separate the messages into clusters with high homogeneity, which addresses the mis-clustering and format over-generalization issue;

  • a new method to derive fine-grained message formats based on the natural positions of the positional keywords, with high accuracy.
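The link between cluster homogeneity and format granularity in the contributions above can be made concrete with a small sketch. It assumes messages are already tokenized and clustered, and it derives a template by keeping a token where one value dominates a position and emitting a wildcard elsewhere. The function `infer_format`, the `min_ratio` threshold, and the wildcard symbol are assumptions for illustration and are not taken from the paper.

```python
from collections import Counter


def infer_format(cluster, min_ratio=0.9, wildcard="<*>"):
    """Derive a per-position template from a non-empty cluster of token lists.

    A position keeps its most frequent token when that token occurs in at
    least `min_ratio` of the cluster's messages; otherwise it becomes a
    wildcard standing for variable payload.
    """
    length = max(len(tokens) for tokens in cluster)
    template = []
    for pos in range(length):
        column = Counter(tokens[pos] for tokens in cluster if pos < len(tokens))
        token, count = column.most_common(1)[0]
        template.append(token if count >= min_ratio * len(cluster) else wildcard)
    return template


# A homogeneous cluster (one message type) yields a fine-grained format.
pure = [
    ["GET", "user", "id", "1"],
    ["GET", "user", "id", "2"],
    ["GET", "user", "id", "GET"],
]
# A mis-clustered group (two message types) yields an over-general format.
mixed = [
    ["GET", "user", "id", "1"],
    ["PUT", "item", "name", "x"],
]

print(infer_format(pure))   # ['GET', 'user', 'id', '<*>']
print(infer_format(mixed))  # ['<*>', '<*>', '<*>', '<*>']
```

The second template accepts any four-token message, which is the kind of coarse-grained format that a further level of clustering is intended to avoid.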

To demonstrate the effectiveness of P-token, we compare it with two state-of-the-art approaches (ProDecoder [20] and AutoReEngine [19]) and two baseline approaches (“vanilla” token and P-token without second-level clustering). Experiments are conducted on real-world software applications using various protocols, including LDAP, SOAP, and IMS, and on a RESTful service. Our experimental results show that P-token extracts more accurate message formats than existing methods.

The rest of the paper proceeds as follows. We analyze the problem of extracting protocol message formats by using a real-world example in Section 2. We give the rationale of P-token in Section 3. We present the detailed techniques involved in P-token in Section 4. Experimental results on real-world protocol traces are reported in Section 5. We discuss the limitations of our approach in Section 6, review related work in Section 7, and conclude this paper in Section 8.

Problem statement

A communication protocol defines the format or structure of messages that the system or service sends and receives. In this paper, we deal with services and applications with tokenized messaging protocols, i.e., the communication messages are tokenized based on delimiters. A message format can be defined as a sequence of message fields (see Fig. 1). The values of some fields are fixed in the messages of the same type, but the values of other fields vary across messages. In the context of this

Rationale

P-token aims at extracting fine-grained message formats based on two key insights: (1) keyword position sensitivity, i.e., message keywords appear in relatively fixed positions across messages of the same type, and (2) intra-cluster homogeneity, i.e., the messages of the same type present high homogeneity in terms of their positional keyword-based structure while messages from different types present high dissimilarity. Meanwhile, as we focus on tokenized messaging protocols, messages can

The proposed approach: P-token

In this section, we present the details of the proposed approach, P-token, aimed at obtaining fine-grained message formats from message traces of unknown protocols. The key novelty of P-token lies in its ability to (1) accurately identify message keywords and (2) accurately group the messages into clusters reflecting their types.
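For the grouping step, the abstract states that positional keywords are used as features for density peaks clustering. The sketch below shows the general density peaks idea (local density plus distance to the nearest denser point) on sets of positional keywords. The Jaccard distance, the cutoff, the fixed number of clusters, and all function names are assumptions for illustration; the paper's actual feature encoding, distance measure, and centre selection are not given in this excerpt.

```python
def jaccard_distance(a, b):
    """Distance between two sets of positional keywords (assumed metric)."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def density_peaks(features, cutoff=0.5, n_clusters=2):
    """Simplified density peaks clustering; returns one label per message."""
    n = len(features)
    dist = [[jaccard_distance(features[i], features[j]) for j in range(n)]
            for i in range(n)]
    # Local density: number of other points closer than the cutoff.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < cutoff)
           for i in range(n)]
    # Visit points from densest to sparsest; ties are broken by index.
    order = sorted(range(n), key=lambda i: (-rho[i], i))
    root = order[0]
    delta = [0.0] * n
    parent = [-1] * n
    delta[root] = max(dist[root])
    for rank in range(1, n):
        i = order[rank]
        # Nearest point among those visited earlier (higher or equal density).
        parent[i] = min(order[:rank], key=lambda j: dist[i][j])
        delta[i] = dist[i][parent[i]]
    # Centres: the densest point plus the points with the largest rho * delta.
    others = sorted((i for i in range(n) if i != root),
                    key=lambda i: (-rho[i] * delta[i], i))
    centres = [root] + others[:n_clusters - 1]
    labels = [-1] * n
    for label, centre in enumerate(centres):
        labels[centre] = label
    # Every remaining point inherits the label of its denser neighbour.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels


# Made-up positional keyword sets for five messages of two types.
features = [
    {(0, "GET"), (1, "user"), (2, "id")},
    {(0, "GET"), (1, "user"), (2, "id")},
    {(0, "GET"), (1, "user")},
    {(0, "PUT"), (1, "item"), (2, "name")},
    {(0, "PUT"), (1, "item")},
]

labels = density_peaks(features)
# labels == [0, 0, 0, 1, 1]
```

In this sketch the number of clusters is fixed by the caller. A second level of clustering, as described above, would re-run a clustering step only on groups whose messages remain dissimilar.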

Experimental results

In this section, we evaluate our proposed approach, P-token, for extracting protocol message formats on raw message traces collected from real-world services and applications. Here, we first introduce the datasets used to evaluate our approach, then define the evaluation metrics, and finally present the experimental results.

Discussions and limitations

In this section, we further reflect on our approach and discuss limitations of its applicability in the context of different types of protocols. Based on our observations in analyzing message formats, protocols can, in general, be classified into three groups: (i) protocols with fixed message formats, that is, a fixed order of keywords, (ii) protocols with free formats (i.e. keywords can, in principle, appear in arbitrary order) but fixed end-point specific interpretations (i.e. each end-point

Related work

So far, many approaches have been proposed for extracting protocol message formats from raw message traces. Based on the techniques used in extracting message keywords, existing approaches for extracting protocol message formats can be divided into two categories: (1) n-gram based methods and (2) tokenization based methods.

Conclusion and future work

In this paper, we have proposed a novel approach called P-token for extracting protocol message formats from raw message traces. P-token does not assume prior knowledge about message structures, nor does it require access to the executable code of applications implementing the protocols concerned. P-token involves three steps: (i) tokenization-based positional keywords identification, (ii) message clustering based on positional keywords, and (iii) positional keyword-based message format

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work is supported by the Australian Research Council Linkage Project LP150100892 Generating Virtual Deployment Environments for Enterprise Software Systems.

Jiaojiao Jiang received the Ph.D. degree in computer science from Deakin University, Australia, in 2017. She was a Postdoctoral Research Fellow with the School of Software and Electrical Engineering, Swinburne University of Technology, Australia, from Jan 2017 to May 2019. She is currently an Early Career Development Fellow with RMIT University, Melbourne, Australia. Her research interests include service virtualization and cyber security.

References (42)

  • Du M. et al. Interaction traces mining for efficient system responses generation. ACM SIGSOFT Softw. Eng. Notes (2015)
  • Luo J.-Z. et al. Position-based automatic reverse engineering of network protocols. J. Netw. Comput. Appl. (2013)
  • Humble J. Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation (2010)
  • Versteeg S. et al. Opaque service virtualisation: a practical tool for emulating endpoint systems
  • Hossain M.A. et al. Mining accurate message formats for service APIs
  • Lin Z. et al. Automatic protocol format reverse engineering through context-aware monitored execution
  • Wang Y. et al. Inferring protocol state machine from network traces: a probabilistic approach
  • Wen S. et al. Protocol vulnerability detection based on network traffic analysis and binary reverse engineering. PLoS One (2017)
  • I. netflow statistics,...
  • G. I. M, Client,...
  • Caballero J. et al. Polyglot: Automatic extraction of protocol message format using dynamic binary analysis
  • Comparetti P.M. et al. Prospex: Protocol specification extraction
  • Lim J. et al. Extracting output formats from executables
  • Höschele M. et al. Mining input grammars from dynamic taints
  • Aggarwal C.C. et al. Frequent Pattern Mining (2014)
  • Blei D.M. et al. Latent Dirichlet allocation. J. Mach. Learn. Res. (2003)
  • Cui W. et al. Discoverer: Automatic protocol reverse engineering from network traces
  • La Mantia G. et al. Stochastic packet inspection for TCP traffic
  • Tongaonkar A. et al. SANTaClass: A self adaptive network traffic classification system
  • Wang Y. et al. A semantics aware approach to automated reverse engineering unknown protocols
  • Versteeg S. et al. Enhanced playback of automated service emulation models using entropy analysis

    Steve Versteeg received the Ph.D. degree in computer science from The University of Melbourne. He is currently an Adjunct Fellow with the Swinburne University of Technology. He has published more than 40 international journal and conference articles and has 15 U.S. patents pending. His current research interests include service virtualization, information security, and machine learning.

    Jun Han received his Ph.D. degree in computer science from the University of Queensland, Australia. Since 2003, he has been a full professor of software engineering at Swinburne University of Technology, Australia. He has published more than 250 peer-reviewed articles. His current research interests include service and cloud systems engineering, adaptive and context-aware software systems, and software architecture and quality.

    Md Arafat Hossain received his B.Sc. and M.Sc. degrees from Rajshahi University of Engineering and Technology, Bangladesh, in 2007 and 2014, respectively. He is currently working toward the Ph.D. degree in the School of Software and Electrical Engineering, Swinburne University of Technology, Australia. He has been working on virtualizing services to create test-bed environments for systems and services.

    Jean-Guy Schneider received the M.Sc. and Ph.D. degrees in computer science and applied mathematics from the University of Bern, Switzerland. He was a Lecturer, a Senior Lecturer, and an Associate Professor in software engineering with the Swinburne University of Technology, from 2000 to 2018. He is currently a Full Professor in software engineering with Deakin University, Burwood, VIC, Australia. His research interests include object-oriented and service-oriented systems, and scripting and composition languages.

    2 Currently also with Computer Science and Software Engineering, School of Science, RMIT University, Melbourne, VIC 3000, Australia.

    3 Currently also with School of Information Technology, Deakin University, Burwood, VIC 3125, Australia.
