Service Frameworks on the Cloud


From SSRG Annoki


The objective of this project is to provide a software framework for supporting distributed query processing on HBase.


Currently, to answer a query that must examine the complete BigTable, a developer must use a scanner (i.e., by reusing and extending the HBase scanner implementation) that scans the table and applies the "where" clause of interest to each BigTable row (a filtering-like operation). This is useful for iterating over the table, but one cannot scan a dynamically generated result set.

HBase BigTables are distributed over multiple nodes. Each node holds a fragment of the BigTable (the fragment is called a Region and the node is called a RegionServer), namely a sequence of rows in ascending order of the BigTable primary key. A meta-table records the start and end primary keys of the rows of each BigTable fragment.
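The meta-table lookup can be illustrated with a small sketch in plain Java (this is not HBase's actual implementation; the class and method names are hypothetical). A sorted map from each region's start key to its region server lets us find the region holding a given row key, and all regions overlapping a scan range:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of the meta-table: maps each region's start key
// to the server holding that region. Regions cover contiguous,
// non-overlapping ranges in ascending order of the primary key.
public class MetaTable {
    private final TreeMap<String, String> regionsByStartKey = new TreeMap<>();

    public void addRegion(String startKey, String server) {
        regionsByStartKey.put(startKey, server);
    }

    // The region holding a row key is the one with the greatest
    // start key less than or equal to that row key.
    public String serverFor(String rowKey) {
        Map.Entry<String, String> e = regionsByStartKey.floorEntry(rowKey);
        return e == null ? null : e.getValue();
    }

    // Servers whose regions overlap a scanner's [startKey, endKey) range:
    // the region containing startKey, plus every region starting before endKey.
    public List<String> serversForRange(String startKey, String endKey) {
        String first = regionsByStartKey.floorKey(startKey);
        String from = (first != null) ? first : startKey;
        return new ArrayList<>(
            regionsByStartKey.subMap(from, true, endKey, false).values());
    }
}
```

The `TreeMap.floorEntry` lookup mirrors how a sorted meta-table resolves a row key to exactly one region in logarithmic time.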

At run time, the HBase infrastructure examines (a) the starting/ending row keys that the scanner should inspect (if the query is to be applied to a subset of the BigTable) and (b) any filter(s) associated with it (a filter can be treated as a WHERE clause). The meta-table is consulted and the scanner is sent off to the nodes that hold BigTable fragments with keys within the scanner's key range. The scanner is executed at each individual node, and the qualifying rows (i.e., the rows that pass the "where" clause) are collected and sent back to the requesting client node, where further aggregate-style processing may take place. For example, to calculate the sum of all values in a specific column C1 where the values of another column C2 fall within a range, one would have to develop a scanner that collects all the rows whose C2 values are within that range and then compute the sum at the client. At run time, this process involves an RPC for each such row, plus the computation on the client side.
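The cost of this client-side approach can be made concrete with a small plain-Java simulation (this is not the HBase API; the row representation is an illustrative assumption). Every qualifying row crosses the network to the client, which then computes the sum:

```java
import java.util.List;

// Plain-Java illustration of the current approach: the scanner's filter
// (C2 within [lo, hi]) selects rows, each qualifying row is shipped to
// the client (one RPC per row), and the client sums C1.
public class ClientSideSum {
    // A row with the two columns of interest; a hypothetical stand-in
    // for an HBase row.
    public record Row(long c1, long c2) {}

    public static long sumC1WhereC2InRange(List<Row> table, long lo, long hi) {
        long rpcs = 0;
        long sum = 0;
        for (Row r : table) {                    // the scanner visits every row
            if (r.c2() >= lo && r.c2() <= hi) {  // the "where" clause (filter)
                rpcs++;                          // each qualifying row costs an RPC
                sum += r.c1();                   // aggregation happens at the client
            }
        }
        System.out.println("qualifying rows shipped to client: " + rpcs);
        return sum;
    }
}
```

The printed RPC count grows with the number of qualifying rows, which is exactly the overhead the proposed framework aims to eliminate.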

The Proposed Service Framework

In this project, we propose an alternative solution, whereby the HBase platform will be extended with a framework (using coprocessors) that supports performing such aggregation computations at each individual node (where the region data resides) and sending a reduced result set to the requesting client; in the case of the example above, the sum of all qualifying column values at each node. Clearly, if more than one region needs to be scanned (depending on the scope of the scanner), the requesting client will still have to perform an aggregation function, but both the computation and the result set that the client receives will be much smaller. Moreover, the regions involved will perform the computation in parallel, which should further reduce the query-execution cost. The proposed software framework will work as follows. At compile time, the developer will implement a scanner (i.e., reuse the HBase scanner implementation) to scan the whole table and examine the "where" clause of interest on each BigTable row.

At run time, the framework

  1. will ship the special-purpose scanner to all the nodes that contain regions of the BigTable;
  2. will perform a map stage, in which each node executes the scanner;
  3. will perform a combine stage, in which each node decides what results to send to the requesting client node by examining the columns of interest to the "select", "condition", and "order by" clauses of the query; and
  4. will perform a final aggregation stage at the requesting client node, where the query was issued, to produce the desired results.
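The run-time stages above can be sketched with a plain-Java simulation of the running example (the names and structure are illustrative assumptions, not the framework's actual API). Each "region" scans its rows and combines the qualifying values into a single partial sum, and the client aggregates one partial result per region:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative simulation of the proposed flow for "sum of C1 where C2
// is within [lo, hi]": the scanner is shipped to each region (step 1),
// each region scans its rows (map, step 2) and reduces them to one
// partial sum (combine, step 3), and the client aggregates the partial
// sums (step 4).
public class PerRegionSum {
    public record Row(long c1, long c2) {}

    // Steps 2-3, run at one region: scan, filter, and pre-aggregate,
    // so only a single number travels back to the client.
    static long mapAndCombine(List<Row> region, long lo, long hi) {
        long partial = 0;
        for (Row r : region) {
            if (r.c2() >= lo && r.c2() <= hi) {
                partial += r.c1();
            }
        }
        return partial;
    }

    // Step 4, run at the client: one result per region instead of one
    // RPC per qualifying row.
    public static long query(List<List<Row>> regions, long lo, long hi) {
        List<Long> partials = new ArrayList<>();
        for (List<Row> region : regions) {   // in HBase these would run in parallel
            partials.add(mapAndCombine(region, lo, hi));
        }
        long total = 0;
        for (long p : partials) total += p;
        return total;
    }
}
```

Note that the client now receives one value per region rather than one row per match, and the per-region work can proceed concurrently.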

The framework will support a restricted set of query types, for which generalized "combiner" and "aggregation" processes will be developed, such that the framework can specialize and reuse them at run time, once the developer has provided declarative instructions on the nature of their specialization.
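One way such generalized combine/aggregate processes might be expressed is as a generic skeleton that the developer specializes with an identity value and an associative operator; the interface below is purely a hypothetical sketch, not the framework's actual design:

```java
import java.util.List;
import java.util.function.BinaryOperator;

// Hypothetical sketch: a query type is specialized by an identity value
// and an associative combine operator. The same operator serves both the
// per-region combine stage and the client-side aggregation stage, so
// SUM, MAX, COUNT, etc. can reuse one generic skeleton.
public class GenericAggregation<T> {
    private final T identity;
    private final BinaryOperator<T> combine;

    public GenericAggregation(T identity, BinaryOperator<T> combine) {
        this.identity = identity;
        this.combine = combine;
    }

    // Combine stage at one region: fold the qualifying values.
    public T combineAtRegion(List<T> qualifyingValues) {
        T acc = identity;
        for (T v : qualifyingValues) acc = combine.apply(acc, v);
        return acc;
    }

    // Aggregation stage at the client: fold the per-region partials.
    public T aggregateAtClient(List<T> partials) {
        return combineAtRegion(partials);
    }
}
```

Under these assumptions, SUM would be declared as `new GenericAggregation<>(0L, Long::sum)` and MAX as `new GenericAggregation<>(Long.MIN_VALUE, Math::max)`; associativity is what allows the per-region and client-side stages to share one operator.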