
1 Introduction

The service-oriented architecture (SOA) has been widely employed by enterprises to build service-based systems (SBSs) [1, 2]. The component services of an SBS collectively realize its functionality, which is often offered as SaaS (Software-as-a-Service) to internal and external users in the cloud environment. The development and popularity of e-business and e-commerce, and especially the pay-as-you-go business model promoted by cloud computing, have fueled the rapid growth of services and SBSs, as shown by statistics published by programmableweb.com, a web service directory. The process for building an SBS consists of three phases: (1) System Planning: the system engineer empirically identifies and determines the system tasks, e.g., flight ticket booking and hotel booking, as well as the execution order of the tasks. (2) Service Discovery: the system engineer, by querying service repositories or service search engines, discovers multiple sets of composable services, each set offering one of the required system tasks. (3) Service Selection: the system engineer selects one service from each set of candidate services to compose a target system that fulfills the multi-dimensional constraints and the optimization goal for the system quality, e.g., reliability, response time and cost.

The process above is complicated and requires detailed knowledge of sophisticated SOA techniques in different phases. It has become a major obstacle to broader applications of SOA. There has been a rapid increase in the need for an approach that assists system engineers in quickly finding system solutions for their SBSs, including which services to use and in what order they are composed, without going through the above complicated process [3].

We previously presented KS3 to tackle this challenge [4]. KS3 allows system engineers to query for system solutions by entering only a few keywords that represent the required system tasks. KS3 models such a keyword query as a constraint optimization problem and employs integer programming to find system solutions. However, KS3 suffers from extremely poor efficiency in processing queries on large web service repositories. According to [4], it takes up to 100 s to answer queries on a repository with 20,000 web services. To address this issue, this paper proposes KS3+, a new, highly efficient approach for building SBSs that is also based on keyword search techniques.

2 Keyword Search Method

We discuss how KS3+ models keyword queries for system solutions and finds group Steiner trees [4] as answer trees to these keyword queries. We denote the set of keywords in a query Q as \( \mathbf{K} = \{k_{1}, k_{2}, \ldots, k_{l}\} \) and use \( \mathbf{k} \), \( \mathbf{k}_{x} \), and \( \mathbf{k}_{y} \) to denote non-empty subsets of K, i.e., \( \mathbf{k}, \mathbf{k}_{x}, \mathbf{k}_{y} \subseteq \mathbf{K} \). To represent a group Steiner tree that is rooted at node v and covers a set of keywords k, we use T(v, k). Thus, the group Steiner tree we look for in data graph G(V, E) as the answer to Q is T(v, K), where each node \( v \in V \) represents a web service and each edge \( e \in E \) represents the composability of two web services. For more details about G, see [4].

2.1 Dynamic Programming Model

In this research, a group Steiner tree T(v, K) of height h (the length of the longest downward path from the root of the tree to any leaf) can be found by expanding the group Steiner trees of heights 0, 1, … that cover \( \mathbf{k} \subseteq \mathbf{K} \). Let T(v, k) be a state in the dynamic programming model and w(T(v, k)) be the weight of T(v, k), i.e., the total weight of the nodes in T(v, k). The state-transition equation in the dynamic programming model is:

$$ w\left(T(v,\mathbf{k})\right) = \min\left( w\left(T_{g}(v,\mathbf{k})\right),\; w\left(T_{m}(v,\mathbf{k})\right) \right) \tag{1} $$
$$ w\left(T_{g}(v,\mathbf{k})\right) = \min_{u \in N(v)} \left\{ w\left(T(u,\mathbf{k}) + u\right) \right\} \tag{2} $$
$$ w\left(T_{m}(v,\mathbf{k})\right) = \min_{\substack{\mathbf{k} = \mathbf{k}_{1} \cup \mathbf{k}_{2} \\ \mathbf{k}_{1} \cap \mathbf{k}_{2} = \emptyset}} \left\{ w\left(T(v,\mathbf{k}_{1}) + T(v,\mathbf{k}_{2})\right) \right\} \tag{3} $$

where “+” is an operation that merges a node into a tree or merges two trees into a new tree, and N(v) is the set of node v’s neighbors in G, i.e., the nodes \( u \in V \) with \( e(u, v) \in E \). Equation (1) indicates that the weight of a group Steiner tree T(v, k) is obtained from one of two cases, namely tree growth, i.e., Eq. (2), and tree merging, i.e., Eq. (3). As indicated by Eq. (2), in the tree growth case \( T_{g}(v, \mathbf{k}) \) is obtained by growing the minimum-weight tree T(u, k), which is rooted at u (one of v’s neighbors) and covers all keywords in k. Equation (3) shows that, in the tree merging case, \( T_{m}(v, \mathbf{k}) \) is obtained by merging two minimum-weight subtrees, both rooted at v, one covering \( \mathbf{k}_{1} \) and the other covering \( \mathbf{k}_{2} \), such that \( \mathbf{k} = \mathbf{k}_{1} \cup \mathbf{k}_{2} \) and \( \mathbf{k}_{1} \cap \mathbf{k}_{2} = \emptyset \).
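The two transitions can be traced by hand on a very small instance. The sketch below is illustrative only: the three-node path, the unit node weights, the keywords "x" and "y" and the helper names `grow` and `merge` are all assumptions, and `merge` further assumes that the two merged trees share only their root.

```python
# Hypothetical toy instance: a path a - b - c with unit node weights.
# Service a offers keyword "x"; service c offers keyword "y".
WEIGHT = {"a": 1, "b": 1, "c": 1}

def grow(w_tree, u):
    """Tree growth, Eq. (2): attach neighbor u to a tree of weight w_tree."""
    return w_tree + WEIGHT[u]

def merge(w_left, w_right, v):
    """Tree merging, Eq. (3): join two trees rooted at v.

    Assumes the two trees share only the root v, so v is counted once.
    """
    return w_left + w_right - WEIGHT[v]

w_a_x = WEIGHT["a"]                  # T(a, {x}): the single node a
w_c_y = WEIGHT["c"]                  # T(c, {y}): the single node c
w_b_x = grow(w_a_x, "b")             # T(b, {x}) = T(a, {x}) + b
w_b_y = grow(w_c_y, "b")             # T(b, {y}) = T(c, {y}) + b
w_b_xy = merge(w_b_x, w_b_y, "b")    # T(b, {x, y}) covers both keywords
print(w_b_xy)                        # prints 3 (nodes a, b and c)
```

Here node b plays the role of a bridging service: neither keyword is offered by b, but both single-keyword trees grow to b and are merged there.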

2.2 Answering Keyword Queries

A keyword query Q contains a set of keywords, \( \mathbf{K} = \{k_{1}, \ldots, k_{l}\} \). Based on Eqs. (1)–(3), KS3+ employs Algorithm 1 to find the minimum group Steiner tree as the answer to query Q. In line 1, Algorithm 1 initializes a priority queue of trees, \( Q_{T} \), to be empty. The trees in \( Q_{T} \) are always sorted in ascending order by the total number of nodes in the trees, denoted by |T|. In lines 2–6, the algorithm locates the nodes that contain individual keywords in K. For each node \( v \in V \) in G, if v contains any keywords \( \mathbf{k} \subseteq \mathbf{K} \), the algorithm enqueues tree T(v, k) into \( Q_{T} \). At this stage, |T(v, k)| = 1 for each tree in \( Q_{T} \) because each tree consists of a single node. In lines 7–33, the algorithm iteratively dequeues trees from \( Q_{T} \), grows them with Eq. (2) (lines 12–21) or merges them with Eq. (3) (lines 23–32), and enqueues the resulting trees into \( Q_{T} \), until it finds the minimum group Steiner tree T(v, k) with \( v \in V \) and k = K (lines 9–11).

Equation (2) is implemented by lines 12–21. Given a tree T(v, k) just dequeued from \( Q_{T} \) (line 8), the algorithm considers all of v’s neighbors, denoted by u, and checks whether there is a tree T(u, k) in \( Q_{T} \) that can be replaced with T(v, k) + u, which contains the same set of keywords k but has fewer nodes (lines 12–17). If no T(u, k) exists in \( Q_{T} \), T(v, k) + u is enqueued into \( Q_{T} \) (lines 18–19).

Equation (3) is implemented by lines 23–32. Given a tree \( T(v, \mathbf{k}_{x}) \) (line 22) and each existing tree \( T(v, \mathbf{k}_{y}) \) that is also rooted at v, where \( \mathbf{k}_{x} \ne \mathbf{k}_{y} \), the algorithm looks for trees in \( Q_{T} \) that are rooted at v and contain keywords \( \mathbf{k}_{x} \cup \mathbf{k}_{y} \) with more nodes than \( T(v, \mathbf{k}_{x}) + T(v, \mathbf{k}_{y}) \). Any such tree is replaced with \( T(v, \mathbf{k}_{x}) + T(v, \mathbf{k}_{y}) \) in \( Q_{T} \) (lines 24–28). If there is no such tree, \( T(v, \mathbf{k}_{x}) + T(v, \mathbf{k}_{y}) \) is enqueued into \( Q_{T} \) (lines 29–30).
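The grow-and-merge loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the KS3+ prototype: the function name `min_group_steiner_tree`, the dictionary-based graph format and the example services are hypothetical; Python's binary heap (`heapq`) with lazy deletion of stale entries stands in for an updatable priority queue; one single-node tree is seeded per matching keyword; trees are represented by their node sets only; and the cap on solution size used in the experiments is ignored.

```python
import heapq
from collections import defaultdict
from itertools import count

def min_group_steiner_tree(graph, keywords_of, query):
    """Return (size, nodes) of a smallest tree covering `query`, or None.

    graph:       dict node -> iterable of neighbors (undirected, assumed format)
    keywords_of: dict node -> set of keywords offered by that node
    query:       iterable of keywords K
    """
    K = frozenset(query)
    best = defaultdict(dict)   # best[v][k] -> node set of the smallest known T(v, k)
    done = defaultdict(set)    # done[v]    -> keyword sets already dequeued at v
    heap, tick = [], count()   # heap entries: (|T|, tie-breaker, v, k)

    def offer(v, k, nodes):
        # Enqueue T(v, k), or replace a larger tree stored for the same state.
        old = best[v].get(k)
        if old is None or len(nodes) < len(old):
            best[v][k] = nodes
            heapq.heappush(heap, (len(nodes), next(tick), v, k))

    # Seed the queue with single-node trees, one per matching keyword.
    for v in graph:
        for kw in K & set(keywords_of.get(v, ())):
            offer(v, frozenset([kw]), frozenset([v]))

    while heap:
        size, _, v, k = heapq.heappop(heap)
        if k in done[v] or size != len(best[v][k]):
            continue                      # stale entry, superseded earlier
        done[v].add(k)
        nodes = best[v][k]
        if k == K:
            return size, nodes            # first complete tree is a smallest one
        for u in graph[v]:                # tree growth, Eq. (2)
            offer(u, k, nodes | {u})
        for k2 in list(done[v]):          # tree merging, Eq. (3)
            if not (k & k2):              # only disjoint keyword sets
                offer(v, k | k2, nodes | best[v][k2])
    return None

# Hypothetical services: s2 bridges the flight and hotel services.
graph = {"s1": ["s2"], "s2": ["s1", "s3"], "s3": ["s2"]}
keywords_of = {"s1": {"flight"}, "s2": {"payment"}, "s3": {"hotel"}}
result = min_group_steiner_tree(graph, keywords_of, ["flight", "hotel"])
```

For the query {flight, hotel}, `result` is a three-node tree containing s1, s2 and s3. A query with a keyword that no service offers returns `None`.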

We now analyze the worst-case complexity of Algorithm 1 when answering a query Q with a set of keywords \( \mathbf{K} = \{k_{1}, \ldots, k_{l}\} \) on a data graph G = (V, E), where |V| = n and |E| = m. Let T(v, k) be the tree with the minimum number of nodes among all trees rooted at v that contain a subset of keywords \( \mathbf{k} \subseteq \mathbf{K} \). There are three major components in the complexity of Algorithm 1: queue maintenance, tree growth and tree merging.

Queue maintenance. In total, there are \( 2^{l} \) subsets of K. Thus, the maximum length of \( Q_{T} \) is \( 2^{l} n \), which is reached when every tree rooted at any \( v \in V \) containing any \( \mathbf{k} \subseteq \mathbf{K} \) is enqueued into \( Q_{T} \). The complexity of the enqueue/update and dequeue operations depends on the type of the queue. Here, we employ a Fibonacci heap, which has a complexity of O(1) for enqueue/update operations and \( O(\log (2^{l} n)) \) for dequeue operations. Because Algorithm 1 enqueues any T(v, k) into \( Q_{T} \) and dequeues it from \( Q_{T} \) at most once, and \( \log_{2} (2^{l} n) = l + \log_{2} n \), the complexity of enqueuing and dequeuing all \( 2^{l} n \) trees in \( Q_{T} \) is \( O(2^{l} n (l + \log n)) \).

Tree growth. Lines 12–21 handle the tree growth operations implementing Eq. (2). For each dequeued tree T(v, k), the for loop iterates |N(v)| times, trying to find the T(u, k) grown from T(v, k) + u with the minimum number of nodes. Here, |N(v)| is the total number of neighbors of v. Because the node degrees in G sum to 2m, the total time for Algorithm 1 to execute the comparison operations in lines 12–21 is \( O\left( 2^{l} \sum\nolimits_{v \in V} |N(v)| \right) = O\left( 2^{l} m \right) \).

[Figure a: listing of Algorithm 1]

Tree merging. Lines 23–32 handle the tree merging operations implementing Eq. (3). For each \( T(v, \mathbf{k}_{x}) \) dequeued in line 8, the for loop in lines 23–32 enumerates every \( \mathbf{k}_{y} \) that fulfils \( \mathbf{k}_{x} \cap \mathbf{k}_{y} = \emptyset \), where \( \mathbf{k}_{x}, \mathbf{k}_{y} \subseteq \mathbf{K} \). Given |K| = l, the total number of possible \( \mathbf{k}_{y} \) is \( 2^{l - |\mathbf{k}_{x}|} \). Thus, the total time for Algorithm 1 to execute the comparison operations in lines 23–32 is \( n \sum\nolimits_{i = 1}^{l - 1} \binom{l}{i} \times 2^{l - i} = O\left( 3^{l} n \right) \).
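The \( 3^{l} \) bound follows from the binomial theorem: \( \sum\nolimits_{i = 0}^{l} \binom{l}{i} 2^{l - i} = (1 + 2)^{l} = 3^{l} \), so dropping the terms for i = 0 and i = l leaves \( 3^{l} - 2^{l} - 1 \). The sketch below checks this count by brute force, encoding keyword sets as bitmasks. The function name `merge_pairs` and the bitmask encoding are illustrative assumptions, not part of Algorithm 1.

```python
from math import comb

def merge_pairs(l):
    """Count the (k_x, k_y) pairs with k_x and k_y disjoint, for l keywords.

    k_x ranges over the non-empty proper subsets of K; k_y ranges over all
    subsets of the keywords outside k_x (2^(l - |k_x|) of them).
    """
    full = (1 << l) - 1
    total = 0
    for kx in range(1, full):          # non-empty proper subsets of K
        rest = full & ~kx              # keywords not in k_x
        sub = rest
        while True:                    # enumerate every subset of `rest`
            total += 1
            if sub == 0:
                break
            sub = (sub - 1) & rest     # next smaller submask of `rest`
    return total

for l in range(1, 8):
    closed_form = sum(comb(l, i) * 2 ** (l - i) for i in range(1, l))
    assert merge_pairs(l) == closed_form == 3 ** l - 2 ** l - 1
```

For l = 3, for example, the count is 27 − 8 − 1 = 18 pairs per root node.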

Overall, the complexity of Algorithm 1 is \( O(2^{l} n (l + \log n) + 2^{l} m + 3^{l} n) \). This indicates that the running time of Algorithm 1 grows exponentially with the number of query keywords. In real-world problems where l is a small constant, the complexity of Algorithm 1 becomes \( O(n \log n + m) \).

3 Experimental Evaluation

We conducted a series of experiments with a prototype of KS3+ implemented using JDK 1.6.0 to compare the efficiency (computational overhead) and effectiveness (success rate) of KS3+ with those of KS3.

3.1 Experimental Setup

The data graphs and queries used in the experiments are randomly generated from a publicly available and widely used dataset named QWS, which contains the functional information about over 2,500 real-world web services [5]. All experiments were conducted on a machine with an Intel i5-4570 CPU (3.20 GHz) and 8 GB RAM, running Windows 7 x64 Enterprise. Random data graphs are generated based on the Erdős–Rényi model [6]. The relevance between the query keywords determines whether bridging nodes are needed to identify a system solution. In the data graph, directly relevant keywords are composable and hence belong to adjacent nodes. Bridging services are needed when two keywords are not directly relevant. We use the keyword distance, i.e., the number of hops between two query keywords in the data graph, to represent their relevance. We fixed the keyword distance at 2 for all queries, which were also randomly generated. To avoid very large solutions, we limited the maximum number of nodes in a solution to twice the number of query keywords.
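A setup of this kind can be approximated with the standard library alone. The sketch below is an assumption-laden illustration rather than the generator used in the experiments: it assumes the G(n, m) variant of the Erdős–Rényi model (a fixed number of nodes and edges), measures keyword distance as the breadth-first hop count between the two nodes that hold the keywords, and uses the hypothetical names `random_graph` and `hop_distance`.

```python
import random
from collections import deque

def random_graph(n, m, seed=None):
    """Erdős–Rényi G(n, m): n nodes and m distinct undirected edges."""
    if m > n * (n - 1) // 2:
        raise ValueError("too many edges for a simple graph")
    rng = random.Random(seed)
    graph = {v: set() for v in range(n)}
    edges = 0
    while edges < m:
        u, v = rng.sample(range(n), 2)   # two distinct endpoints
        if v not in graph[u]:            # reject duplicate edges
            graph[u].add(v)
            graph[v].add(u)
            edges += 1
    return graph

def hop_distance(graph, src, dst):
    """Hops on a shortest path from src to dst, or None if unreachable."""
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

g = random_graph(1000, 2000, seed=42)    # sizes chosen for illustration only
```

Pairs of nodes with `hop_distance(g, a, b) == 2` could then be picked as keyword nodes for a query with keyword distance 2, so that at least one bridging node is needed.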

To comprehensively study the impacts of different parameters on the efficiency and effectiveness of KS3+, we vary four parameters in the experiments, as presented in Table 1. Note that in experiment set #3, the number of edges increases with the number of nodes to maintain the graph density while changing the graph size. For each set of experiments, we average the results obtained from 100 runs.

Table 1. Experiment configuration

3.2 Evaluation Results

Efficiency. Figure 1 shows the computation times taken by KS3+ and KS3 to answer keyword queries for system solutions under different parameter settings. Overall, KS3+ is multiple orders of magnitude more efficient than KS3: while KS3 often takes seconds to minutes to answer a query, KS3+ takes less than 1 ms in most cases.

Fig. 1. Computation time under different parameter settings (keyword distance = 2)

Figure 1(a) shows the efficiency of KS3+ in identifying the bridging nodes when the keywords in a query are not directly relevant. When the keyword distance increases from 1 to 10, the average computation time of KS3 increases from 16 ms to 2,899 ms. In the meantime, the average computation time of KS3+ increases from 0.08 ms to 0.40 ms. KS3+ outperforms KS3 significantly, and demonstrates much higher tolerance to the increase in keyword distance. The results shown in Fig. 1(a) demonstrate that KS3+ can efficiently find a system solution even if the keywords entered are only remotely relevant, thanks to its excellent ability to identify bridging nodes.

Figure 1(b) demonstrates the outstanding ability of KS3+ to find a system solution when multiple bridging nodes are needed to connect many keyword nodes. KS3+ demonstrates great performance with an increase from 0.42 ms to 319.69 ms in computation time in response to the increase in the number of query keywords (referred to as l hereafter) from 2 to 5. The corresponding increase in the computation time of KS3 is from 1,645 ms to 12,574 ms. Again, KS3+ outperforms KS3 significantly. In particular, when l reaches 6, it takes KS3+ 2,777.92 ms on average to find a system solution, while KS3 cannot even answer the query within a reasonable amount of time. That is why the corresponding data is missing for KS3 in Fig. 1(b). Figure 1(b) shows that KS3+ has a considerably better ability to find bridging nodes than KS3.

Figure 1(c) shows that the computation time of KS3 increases rapidly with the graph size, while the increase in the computation time of KS3+ is almost negligible. On a very large data graph with 20,000 nodes, KS3 takes a significant amount of time (up to 75,000 ms) to answer a query, whereas KS3+ takes only 1.35 ms on average to answer the same query. In a large data graph, the number of group Steiner trees that cover all the keyword nodes is extremely large even when the number of keywords to cover is small. KS3 needs to identify and inspect all those trees, and this extremely large search space inevitably leads to long computation times. KS3+, on the other hand, does not have to inspect all those trees. It prunes invalid trees and grows or merges only the trees that are likely to be part of the final answer tree. Thus, KS3+ can handle queries over large data graphs much more efficiently than KS3.

Figure 1(d) shows that in a dense data graph, where each service has many neighbors, it takes KS3+ much less time than KS3 to find a system solution. The advantage of KS3+ over KS3 is multiple orders of magnitude, similar to the results shown in Fig. 1(a) and (c). As the number of edges increases from 2,000 to 8,000, the average computation time of KS3+ increases from 0.27 ms to 0.64 ms, versus an increase from 2,256 ms to 20,331 ms for KS3. A higher graph density means more neighbors for each node, leading to more group Steiner trees for KS3 to identify and inspect to answer a query. In contrast, given a tree T(v, k) dequeued in line 8 of Algorithm 1, out of all the neighbors of v, Algorithm 1 only grows T(v, k) to include those that result in trees containing the same keywords as T(v, k) but with fewer nodes. This prunes most invalid trees and ensures the high efficiency of KS3+.

Effectiveness. We compared the effectiveness of KS3+ and KS3, measured by success rate, i.e., the percentage of cases where an answer to the keyword query can be found. KS3+ is as effective as KS3, with a consistent success rate of 100% in all experiments under different parameter settings. The experimental results thus demonstrate that the efficiency gain of KS3+ does not compromise the success rate in finding a solution.

4 Related Work

The process for building an SBS consists of three phases: system planning, service discovery and service selection.

System planning. The system engineer identifies the system tasks required for the target SBS, as well as their execution order. Most system planning techniques are based on artificial intelligence techniques [7]. The general idea is to model the task identification problem as a planning problem. For example, in [7], the authors model the task identification problem as a CSTE planning problem to be solved with an SCP solver.

Service discovery. Through service registries or service portals, the system engineer identifies a set of candidate services for each of the identified system tasks based on the functional and semantic information of candidate services. To improve the accuracy of service matching, several semantic web service languages have been proposed based on ontology techniques, e.g., OWLS-MX [8]. It automates the service matching operation that identifies the services that can perform the required system tasks. Many approaches have been proposed to automate the service discovery process, based on ontology techniques such as logical reasoning and temporal planning [9].

Service selection. The system engineer selects one service from the candidate services for each system task to compose the target SBS. The selected services must collectively fulfil the multi-dimensional quality constraints for the SBS [4], e.g., reliability, response time, cost, etc., which is an NP-complete problem. Integer Programming (IP) is the main technique adopted in this phase. AgFlow [2] is one of the most representative approaches. Following the idea of AgFlow, many researchers have been trying to reduce the computation time for quality-aware service selection [10] or to solve the problem in more complex environments [1, 11].

A planning technique was proposed that explores system solutions by looking up services whose tags match the tags describing the SBS [3]. For each query, the engineer needs to enter a source tag and a destination tag. The proposed technique heuristically identifies the possible service compositions with an entry service according to the source tag and an exit service according to the destination tag. A similar approach is proposed in [12]. A major limitation to these approaches is that each query allows only two tags, i.e., a source tag and a destination tag. Multiple tags can only be entered one by one in different queries that are processed individually until a final solution is found. An error made in an early query can easily make it impossible to find the final solution.

KS3 was proposed in [4]. It overcomes the limitations of the approaches proposed in [3, 12]. However, it suffers from extremely poor efficiency in large-scale scenarios. By modelling keyword queries as dynamic programming problems, KS3+ achieves significantly higher efficiency without sacrificing effectiveness.

5 Conclusions and Future Work

In this paper, we propose KS3+, a novel approach that integrates and automates the system planning, service discovery and service selection operations for building service-based systems (SBSs). It assists system engineers without detailed knowledge of SOA techniques in finding system solutions with only a few keywords that describe the required system tasks. KS3+ offers a new paradigm for building SBSs and can significantly reduce the time and effort needed to build them. Without compromising effectiveness, KS3+ significantly outperforms KS3 in efficiency.

In our future work, we will enhance KS3+ to answer queries with quality constraints and quality optimization goals.