Relational Query Engine

Introduction

This is a simple SQL query evaluator implemented in JAVA using JSQLParser for parsing the SQL queries and EvalLib for evaluating the expressions in the SQL queries. It is also optimised to work within memory constraints. The source data for the queries is in CSV files where each line corresponds to one row of data and each record is separated by a vertical pipe ‘|’ character. The CREATE TABLE statement is used only to define relation’s schema and the data corresponding to this table is stored in a file named <tablename>.csv on the disk.
Example of CREATE TABLE statement and Source Data:

CREATE TABLE A(X int, Y int, Z date);
This implies that data is stored in file named A.csv and its content looks like:
10|27|2011-03-27
18|24|2014-10-24                    
20|15|2019-07-28
                                    

Functionalities

The query engine supports the following functionalities:

  • Insert
  • Delete
  • Select
  • Nested-Select
  • Project
  • Join
  • Bag Union
  • Aggregate Functions (COUNT, MIN, MAX, AVG, SUM)
  • GROUP BY
  • GROUP BY Aggregates
  • ORDER BY
  • LIMIT

Implementation Techniques

  • Join: It is implemented using Hybid Hash Join Algorithm.
  • Order By: For in-memory implementation that is when whole data can reside in memory it is implemented by Java's Collections.sort() method but when data is large and the whole data cannot reside in memory, External Sort, viz., sort on disk is used.

Efficiency

In order to improve the efficiency of the query engine following techniques are implemented:

  • Selection Pushdown
  • Indexing*

Performance of Query Engine

  • Basic Queries
  • $> CREATE TABLE BEIGE(LACE int, SOLVE int, PERMANENT string, NAG decimal);
    
    $> SELECT BEIGE.LACE, BEIGE.PERMANENT, BEIGE.SOLVE, BEIGE.NAG FROM BEIGE;
    ... took 0.025933215 seconds
    
    $> SELECT BEIGE.LACE, BEIGE.PERMANENT FROM BEIGE;
    ... took 0.009938593 seconds
    
    $> SELECT BEIGE.NAG + BEIGE.NAG AS OUTPUT_1, BEIGE.SOLVE + BEIGE.NAG + BEIGE.SOLVE + BEIGE.LACE * BEIGE.NAG + BEIGE.SOLVE AS OUTPUT_2, BEIGE.NAG 
    + BEIGE.SOLVE + BEIGE.NAG AS OUTPUT_3 FROM BEIGE WHERE BEIGE.NAG < 5.066984425478237 AND BEIGE.PERMANENT = 'run dogged besides purple';
    ... took 0.010051404 seconds
    
    $> SELECT BEIGE.LACE + BEIGE.NAG + BEIGE.NAG + BEIGE.SOLVE AS OUTPUT_1, BEIGE.LACE + BEIGE.NAG + BEIGE.LACE + BEIGE.SOLVE AS OUTPUT_2, 
    2.1122694783129226 + BEIGE.SOLVE + BEIGE.NAG + BEIGE.NAG AS OUTPUT_3, BEIGE.LACE + BEIGE.NAG * BEIGE.LACE + BEIGE.SOLVE + BEIGE.LACE AS OUTPUT_4 FROM BEIGE;
    ... took 0.025803686 seconds
    
    $> SELECT BEIGE.SOLVE, BEIGE.LACE FROM BEIGE WHERE BEIGE.LACE <> 43 AND BEIGE.SOLVE > 45 AND BEIGE.SOLVE > 50;
    ... took 0.006737012 seconds
                                                 
    $> CREATE TABLE BLANCHED(CHOCOLATE int, PLUM int, UNWIND string, INSTEADOF decimal);
    
    $> SELECT BLANCHED.PLUM, BLANCHED.UNWIND, BLANCHED.CHOCOLATE FROM BLANCHED WHERE BLANCHED.INSTEADOF <= 4.61483491132217;
    ... took 0.041360969 seconds
    
    $> SELECT BLANCHED.PLUM, BLANCHED.INSTEADOF, BLANCHED.CHOCOLATE, BLANCHED.UNWIND FROM BLANCHED;
    ... took 0.049047762 seconds
    
    $> SELECT BLANCHED.INSTEADOF, BLANCHED.UNWIND, BLANCHED.CHOCOLATE, BLANCHED.PLUM FROM BLANCHED WHERE BLANCHED.UNWIND = 'steel burnished tan' 
    AND BLANCHED.CHOCOLATE <> 44 AND BLANCHED.PLUM <> 46;
    ... took 0.009381359 seconds
    
    $> SELECT BLANCHED.PLUM, BLANCHED.INSTEADOF FROM BLANCHED;
    ... took 0.011077018 seconds
    
    $> SELECT BLANCHED.CHOCOLATE + BLANCHED.CHOCOLATE + BLANCHED.INSTEADOF AS OUTPUT_1, BLANCHED.PLUM + BLANCHED.PLUM + BLANCHED.INSTEADOF AS OUTPUT_2 FROM BLANCHED 
    WHERE BLANCHED.PLUM <= 47 AND BLANCHED.UNWIND = 'peach regularly engage';
    ... took 0.006137591 seconds
    
    $> SELECT BLANCHED.CHOCOLATE + BLANCHED.INSTEADOF + BLANCHED.PLUM + 7.498820738270613 + 4.479660262006677 + BLANCHED.PLUM AS OUTPUT_1, 
    BLANCHED.CHOCOLATE + BLANCHED.CHOCOLATE + BLANCHED.PLUM + BLANCHED.CHOCOLATE + BLANCHED.CHOCOLATE + 1.4274832875746424 AS OUTPUT_2 
    FROM BLANCHED WHERE BLANCHED.UNWIND = 'wake cajole among' AND BLANCHED.UNWIND = 'quickly blush blanched tithes' AND BLANCHED.UNWIND = 'wheat quietly';
    ... took 0.007720231 seconds
    
    $> SELECT 6.6446841030518 + 9.450172465908263 + BLANCHED.INSTEADOF + BLANCHED.CHOCOLATE AS OUTPUT_1, BLANCHED.CHOCOLATE + BLANCHED.PLUM + BLANCHED.CHOCOLATE AS OUTPUT_2 
    FROM BLANCHED WHERE BLANCHED.INSTEADOF <> 5.1120352198730945 AND BLANCHED.PLUM > 47;
    ... took 0.01515984 seconds
    
    $> SELECT BLANCHED.CHOCOLATE + BLANCHED.CHOCOLATE + BLANCHED.INSTEADOF * BLANCHED.INSTEADOF + BLANCHED.PLUM + 4.7508124625203765 AS OUTPUT_1, 
    BLANCHED.INSTEADOF + BLANCHED.PLUM + BLANCHED.PLUM + BLANCHED.PLUM AS OUTPUT_2, BLANCHED.PLUM + BLANCHED.CHOCOLATE + BLANCHED.PLUM AS OUTPUT_3, 
    BLANCHED.CHOCOLATE + BLANCHED.PLUM + 9.843901994450404 + BLANCHED.CHOCOLATE + BLANCHED.CHOCOLATE AS OUTPUT_4 
    FROM BLANCHED WHERE BLANCHED.PLUM >= 41 AND BLANCHED.INSTEADOF <> 5.3397033229358435 AND BLANCHED.UNWIND = 'doggedly silently';
    ... took 0.008991207 seconds
                                                    
    $> CREATE TABLE LOSE(BOLDLY int, BESIDE int, ATOP string, GREY decimal);
    
    $> SELECT LOSE.GREY, LOSE.ATOP, LOSE.BESIDE, LOSE.BOLDLY FROM LOSE WHERE LOSE.ATOP = 'detect realms' 
    AND LOSE.BESIDE >= 41;
    ... took 0.144992789 seconds
    
    $> SELECT LOSE.BOLDLY, LOSE.BESIDE FROM LOSE WHERE LOSE.BOLDLY < 47;
    ... took 0.032149556 seconds
    
    $> SELECT LOSE.BESIDE, LOSE.GREY FROM LOSE;
    ... took 0.040376916 seconds
    
    $> SELECT LOSE.BOLDLY, LOSE.GREY, LOSE.ATOP, LOSE.BESIDE FROM LOSE WHERE LOSE.ATOP = 'pink lime stealthily pending';
    ... took 0.008899229 seconds
    
    $> SELECT LOSE.GREY, LOSE.BOLDLY, LOSE.ATOP, LOSE.BESIDE FROM LOSE WHERE LOSE.BOLDLY < 40 AND LOSE.BOLDLY >= 50 
    AND LOSE.ATOP = 'brown slyly upon';
    ... took 0.010964571 seconds
    
    $> SELECT LOSE.BOLDLY + LOSE.BESIDE * LOSE.BOLDLY + LOSE.BESIDE + LOSE.BOLDLY AS OUTPUT_1, LOSE.GREY + 7.117299694632929 + LOSE.BESIDE + LOSE.BESIDE 
    + 1.028946836169421 + LOSE.GREY + 8.050529699179807 + LOSE.GREY AS OUTPUT_2, LOSE.BOLDLY + LOSE.BESIDE * LOSE.BOLDLY AS OUTPUT_3 FROM LOSE;
    ... took 0.038250794 seconds
    
    
    $> CREATE TABLE WHITE(THISTLE int, PALE int, CORNFLOWER string, NOD decimal);
    
    $> SELECT WHITE.CORNFLOWER, WHITE.THISTLE, WHITE.PALE, WHITE.NOD FROM WHITE WHERE WHITE.THISTLE >= 47 AND WHITE.NOD <= 5.060042952283277;
    ... took 0.021470763 seconds
    
    $> SELECT WHITE.THISTLE, WHITE.NOD, WHITE.CORNFLOWER FROM WHITE;
    ... took 0.028728029 seconds
    
    $> SELECT WHITE.THISTLE, WHITE.NOD, WHITE.PALE FROM WHITE WHERE WHITE.THISTLE > 44 AND WHITE.CORNFLOWER = 'affix doze finally accounts' AND WHITE.NOD < 4.838894193976446;
    ... took 0.008941208 seconds
    
    $> SELECT WHITE.PALE, WHITE.THISTLE, WHITE.CORNFLOWER FROM WHITE WHERE WHITE.CORNFLOWER = 'navajo sly ironic' AND WHITE.THISTLE > 50;
    ... took 0.006027645 seconds
    
    $> SELECT WHITE.PALE + WHITE.PALE * WHITE.THISTLE AS OUTPUT_1, WHITE.NOD + WHITE.PALE * WHITE.PALE AS OUTPUT_2 FROM WHITE;
    .. took 0.017839759 seconds
    
    $> SELECT WHITE.NOD, WHITE.THISTLE, WHITE.CORNFLOWER FROM WHITE WHERE WHITE.NOD < 4.21140636455908 AND WHITE.NOD <= 4.629423335111538 AND WHITE.THISTLE > 45;
    ... took 0.009306568 seconds
    
    $> SELECT 6.022950182012349 + WHITE.THISTLE + 3.1132767269755766 + WHITE.THISTLE + WHITE.THISTLE + WHITE.THISTLE AS OUTPUT_1, WHITE.THISTLE + WHITE.NOD + 4.804549905764231 + 2.023663069290813 + WHITE.THISTLE + WHITE.NOD + WHITE.THISTLE AS OUTPUT_2 FROM WHITE;
    ... took 0.031829143 seconds
                                                    
  • NBA Benchmark Queries
  • $> CREATE TABLE PLAYERS( ID string, FIRSTNAME string, LASTNAME string, FIRSTSEASON int, LASTSEASON int, WEIGHT int, BIRTHDATE date );
    
    NBA-01
    $> SELECT FIRSTNAME, LASTNAME, WEIGHT, BIRTHDATE FROM PLAYERS WHERE WEIGHT>200;
    ... took 0.457499482 seconds
    
    NBA-03
    $> SELECT FIRSTNAME, LASTNAME, FIRSTSEASON, LASTSEASON FROM PLAYERS WHERE LASTSEASON-FIRSTSEASON>5;
    ... took 0.281330305 seconds
                                                    
    $> CREATE TABLE PLAYERS_SMALL( ID string, FIRSTNAME string, LASTNAME string, FIRSTSEASON int, LASTSEASON int, WEIGHT int, BIRTHDATE date );
    
    NBA-06
    $> SELECT P1.FIRSTNAME, P1.LASTNAME, P2.FIRSTNAME, P2.LASTNAME
    FROM PLAYERS_SMALL P1, PLAYERS_SMALL P2 
    WHERE (P1.FIRSTSEASON=P2.FIRSTSEASON) AND (P1.FIRSTNAME != P2.FIRSTNAME) AND (P1.LASTNAME != P2.LASTNAME);
    ... took 8.829009442 seconds
    
    NBA-07
    $> SELECT P1.*, P2.* 
    FROM PLAYERS_SMALL P1, PLAYERS_SMALL P2 
    WHERE (P1.FIRSTSEASONP2.LASTSEASON);
    ... took 7.613429073 seconds
                                                    
  • TPC-H Benchmark Queries (Source: tpc.org)
  • The performance measures shown below are for the 1000MB (SF 1) TPC-H Dataset.

    TABLE LINEITEM - 753.9MB
    $> CREATE TABLE LINEITEM(ORDERKEY INT,PARTKEY INT,SUPPKEY INT,LINENUMBER INT,QUANTITY DECIMAL,
    EXTENDEDPRICE DECIMAL,DISCOUNT DECIMAL,TAX DECIMAL,RETURNFLAG CHAR(1),LINESTATUS CHAR(1),SHIPDATE DATE,
    COMMITDATE DATE,RECEIPTDATE DATE,SHIPINSTRUCT CHAR(25),SHIPMODE CHAR(10),COMMENT VARCHAR(44),PRIMARY KEY (ORDERKEY,LINENUMBER));
    
    TABLE ORDERS - 170.5MB
    $> CREATE TABLE ORDERS(ORDERKEY INT,CUSTKEY INT,ORDERSTATUS CHAR(1),TOTALPRICE DECIMAL,ORDERDATE DATE,ORDERPRIORITY CHAR(15),
    CLERK CHAR(15),SHIPPRIORITY INT,COMMENT VARCHAR(79),PRIMARY KEY (ORDERKEY));
    
    TABLE CUSTOMER - 24.2MB
    $> CREATE TABLE CUSTOMER(CUSTKEY INT,NAME VARCHAR(25),ADDRESS VARCHAR(40),NATIONKEY INT,PHONE CHAR(15),ACCTBAL DECIMAL,
    MKTSEGMENT CHAR(10),COMMENT VARCHAR(117),PRIMARY KEY (CUSTKEY));
    
    TABLE SUPPLIER - 1.4MB
    $> CREATE TABLE SUPPLIER(SUPPKEY INT,NAME CHAR(25),ADDRESS VARCHAR(40),NATIONKEY INT,PHONE CHAR(15),ACCTBAL DECIMAL,COMMENT VARCHAR(101),PRIMARY KEY (SUPPKEY));
    
    TABLE NATION - 2KB
    $> CREATE TABLE NATION(NATIONKEY INT,NAME CHAR(25),REGIONKEY INT,COMMENT VARCHAR(152),PRIMARY KEY (NATIONKEY));
    
    TABLE REGION - 384B
    $> CREATE TABLE REGION(REGIONKEY INT,NAME CHAR(25),COMMENT VARCHAR(152),PRIMARY KEY (REGIONKEY));
    
    
    TPC-H Query 1
    $> select
    lineitem.returnflag,
    lineitem.linestatus,
    sum(lineitem.quantity) as sum_qty,
    sum(lineitem.extendedprice) as sum_base_price,
    sum(lineitem.extendedprice*(1-lineitem.discount)) as sum_disc_price,
    sum(lineitem.extendedprice*(1-lineitem.discount)*(1+lineitem.tax)) as sum_charge,
    avg(lineitem.quantity) as avg_qty,
    avg(lineitem.extendedprice) as avg_price,
    avg(lineitem.discount) as avg_disc,
    count(*) as count_order
    from
    lineitem
    where
    lineitem.shipdate <= DATE('1999-02-25')
    group by
    lineitem.returnflag,
    lineitem.linestatus
    order by
    lineitem.returnflag,
    lineitem.linestatus;
    ... took 32.48 seconds
    

    Note: As 1998-12-01 is the highest possible ship date as defined in the database population 100% of the rows in the table will be scanned for the DATE in the above query.

    TPC-H Query 3
    $> select
    lineitem.orderkey,
    sum(lineitem.extendedprice * (1 - lineitem.discount)) as revenue,
    orders.orderdate,
    orders.shippriority
    from
    customer,
    orders,
    lineitem
    where
    customer.mktsegment = 'BUILDING'
    and customer.custkey = orders.custkey
    and lineitem.orderkey = orders.orderkey
    and orders.orderdate < DATE('1995-03-18')
    and lineitem.shipdate > DATE('1995-03-18')
    group by
    lineitem.orderkey,
    orders.orderdate,
    orders.shippriority
    order by
    revenue desc,
    orders.orderdate,
    lineitem.orderkey
    limit 10;
    ... took 8.88 seconds
    
    TPC-H Query 5
    $> select
    n_name,
    sum(lineitem.extendedprice * (1 - lineitem.discount)) as revenue
    from
    customer,
    orders,
    lineitem,
    supplier,
    nation,
    region
    where
    customer.custkey = orders.custkey
    and lineitem.orderkey = orders.orderkey
    and lineitem.suppkey = supplier.suppkey
    and customer.nationkey = supplier.nationkey
    and supplier.nationkey = nation.nationkey
    and nation.regionkey = region.regionkey
    and region.name = 'ASIA'
    and orders.orderdate >= '1994-01-01'
    and orders.orderdate < '1995-01-01'
    group by
    nation.name
    order by
    revenue desc;
    ... took 7.63 seconds
    
    TPC-H Query 6
    $> select
    sum(lineitem.extendedprice*lineitem.discount) as revenue
    from
    lineitem
    where
    lineitem.shipdate >= date('1995-01-01')
    and lineitem.shipdate < date ('1996-01-01')
    and lineitem.discount > 0.04 and lineitem.discount < 0.06 
    and lineitem.quantity < 25;
    ... took 6.53 seconds
    
    TPC-H Query 12
    $> select
    lineitem.shipmode,
    sum(case
      when orders.orderpriority = '1-URGENT'
        or orders.orderpriority = '2-HIGH'
      then 1
      else 0
    end) as high_line_count,
    sum(case
      when orders.orderpriority <> '1-URGENT'
        and orders.orderpriority <> '2-HIGH'
      then 1
      else 0
    end) as low_line_count
    from
    orders,
    lineitem
    where
    orders.orderkey = lineitem.orderkey
    and lineitem.shipmode in ('MAIL', 'SHIP')
    and lineitem.commitdate < lineitem.receiptdate
    and lineitem.shipdate < lineitem.commitdate
    and lineitem.receiptdate >= '1994-01-01'
    and lineitem.receiptdate < '1995-01-01'
    group by
    lineitem.shipmode
    order by
    lineitem.shipmode;
    ... took 6.73 seconds
    
    

* Query Engine supports Indexing for :
  • TPC-H query 1 on LINEITEM.SHIPDATE
  • TPC-H query 3 on LINEITEM.SHIPDATE, ORDERS.ORDERDATE and CUSTOMER.MKTSEGMENT
  • TPC-H query 5 on ORDERS.ORDERDATE
  • TPC-H query 6 on LINEITEM.SHIPDATE
  • TPC-H query 12 on LINEITEM.RECEIPTDATE