Relational Query Engine
Introduction
This is a simple SQL query evaluator implemented in JAVA using JSQLParser for parsing the SQL queries and EvalLib for evaluating the expressions in the SQL queries. It is also optimised to work within memory constraints. The source data for the queries is in CSV files where each line corresponds to one row of data and each record is separated by a vertical pipe ‘|’ character. The CREATE TABLE statement is used only to define relation’s schema and the data corresponding to this table is stored in a file named <tablename>.csv on the disk.
Example of CREATE TABLE statement and Source Data:
CREATE TABLE A(X int, Y int, Z date);
This implies that data is stored in file named A.csv and its content looks like:
10|27|2011-03-27
18|24|2014-10-24
20|15|2019-07-28
Functionalities
The query engine supports the following functionalities:
- Insert
- Delete
- Select
- Nested-Select
- Project
- Join
- Bag Union
- Aggregate Functions (COUNT, MIN, MAX, AVG, SUM)
- GROUP BY
- GROUP BY Aggregates
- ORDER BY
- LIMIT
Implementation Techniques
- Join: It is implemented using Hybid Hash Join Algorithm.
- Order By: For in-memory implementation that is when whole data can reside in memory it is implemented by Java's Collections.sort() method but when data is large and the whole data cannot reside in memory, External Sort, viz., sort on disk is used.
Efficiency
In order to improve the efficiency of the query engine following techniques are implemented:
- Selection Pushdown
- Indexing*
Performance of Query Engine
- Basic Queries
$> CREATE TABLE BEIGE(LACE int, SOLVE int, PERMANENT string, NAG decimal);
$> SELECT BEIGE.LACE, BEIGE.PERMANENT, BEIGE.SOLVE, BEIGE.NAG FROM BEIGE;
... took 0.025933215 seconds
$> SELECT BEIGE.LACE, BEIGE.PERMANENT FROM BEIGE;
... took 0.009938593 seconds
$> SELECT BEIGE.NAG + BEIGE.NAG AS OUTPUT_1, BEIGE.SOLVE + BEIGE.NAG + BEIGE.SOLVE + BEIGE.LACE * BEIGE.NAG + BEIGE.SOLVE AS OUTPUT_2, BEIGE.NAG
+ BEIGE.SOLVE + BEIGE.NAG AS OUTPUT_3 FROM BEIGE WHERE BEIGE.NAG < 5.066984425478237 AND BEIGE.PERMANENT = 'run dogged besides purple';
... took 0.010051404 seconds
$> SELECT BEIGE.LACE + BEIGE.NAG + BEIGE.NAG + BEIGE.SOLVE AS OUTPUT_1, BEIGE.LACE + BEIGE.NAG + BEIGE.LACE + BEIGE.SOLVE AS OUTPUT_2,
2.1122694783129226 + BEIGE.SOLVE + BEIGE.NAG + BEIGE.NAG AS OUTPUT_3, BEIGE.LACE + BEIGE.NAG * BEIGE.LACE + BEIGE.SOLVE + BEIGE.LACE AS OUTPUT_4 FROM BEIGE;
... took 0.025803686 seconds
$> SELECT BEIGE.SOLVE, BEIGE.LACE FROM BEIGE WHERE BEIGE.LACE <> 43 AND BEIGE.SOLVE > 45 AND BEIGE.SOLVE > 50;
... took 0.006737012 seconds
$> CREATE TABLE BLANCHED(CHOCOLATE int, PLUM int, UNWIND string, INSTEADOF decimal);
$> SELECT BLANCHED.PLUM, BLANCHED.UNWIND, BLANCHED.CHOCOLATE FROM BLANCHED WHERE BLANCHED.INSTEADOF <= 4.61483491132217;
... took 0.041360969 seconds
$> SELECT BLANCHED.PLUM, BLANCHED.INSTEADOF, BLANCHED.CHOCOLATE, BLANCHED.UNWIND FROM BLANCHED;
... took 0.049047762 seconds
$> SELECT BLANCHED.INSTEADOF, BLANCHED.UNWIND, BLANCHED.CHOCOLATE, BLANCHED.PLUM FROM BLANCHED WHERE BLANCHED.UNWIND = 'steel burnished tan'
AND BLANCHED.CHOCOLATE <> 44 AND BLANCHED.PLUM <> 46;
... took 0.009381359 seconds
$> SELECT BLANCHED.PLUM, BLANCHED.INSTEADOF FROM BLANCHED;
... took 0.011077018 seconds
$> SELECT BLANCHED.CHOCOLATE + BLANCHED.CHOCOLATE + BLANCHED.INSTEADOF AS OUTPUT_1, BLANCHED.PLUM + BLANCHED.PLUM + BLANCHED.INSTEADOF AS OUTPUT_2 FROM BLANCHED
WHERE BLANCHED.PLUM <= 47 AND BLANCHED.UNWIND = 'peach regularly engage';
... took 0.006137591 seconds
$> SELECT BLANCHED.CHOCOLATE + BLANCHED.INSTEADOF + BLANCHED.PLUM + 7.498820738270613 + 4.479660262006677 + BLANCHED.PLUM AS OUTPUT_1,
BLANCHED.CHOCOLATE + BLANCHED.CHOCOLATE + BLANCHED.PLUM + BLANCHED.CHOCOLATE + BLANCHED.CHOCOLATE + 1.4274832875746424 AS OUTPUT_2
FROM BLANCHED WHERE BLANCHED.UNWIND = 'wake cajole among' AND BLANCHED.UNWIND = 'quickly blush blanched tithes' AND BLANCHED.UNWIND = 'wheat quietly';
... took 0.007720231 seconds
$> SELECT 6.6446841030518 + 9.450172465908263 + BLANCHED.INSTEADOF + BLANCHED.CHOCOLATE AS OUTPUT_1, BLANCHED.CHOCOLATE + BLANCHED.PLUM + BLANCHED.CHOCOLATE AS OUTPUT_2
FROM BLANCHED WHERE BLANCHED.INSTEADOF <> 5.1120352198730945 AND BLANCHED.PLUM > 47;
... took 0.01515984 seconds
$> SELECT BLANCHED.CHOCOLATE + BLANCHED.CHOCOLATE + BLANCHED.INSTEADOF * BLANCHED.INSTEADOF + BLANCHED.PLUM + 4.7508124625203765 AS OUTPUT_1,
BLANCHED.INSTEADOF + BLANCHED.PLUM + BLANCHED.PLUM + BLANCHED.PLUM AS OUTPUT_2, BLANCHED.PLUM + BLANCHED.CHOCOLATE + BLANCHED.PLUM AS OUTPUT_3,
BLANCHED.CHOCOLATE + BLANCHED.PLUM + 9.843901994450404 + BLANCHED.CHOCOLATE + BLANCHED.CHOCOLATE AS OUTPUT_4
FROM BLANCHED WHERE BLANCHED.PLUM >= 41 AND BLANCHED.INSTEADOF <> 5.3397033229358435 AND BLANCHED.UNWIND = 'doggedly silently';
... took 0.008991207 seconds
$> CREATE TABLE LOSE(BOLDLY int, BESIDE int, ATOP string, GREY decimal);
$> SELECT LOSE.GREY, LOSE.ATOP, LOSE.BESIDE, LOSE.BOLDLY FROM LOSE WHERE LOSE.ATOP = 'detect realms'
AND LOSE.BESIDE >= 41;
... took 0.144992789 seconds
$> SELECT LOSE.BOLDLY, LOSE.BESIDE FROM LOSE WHERE LOSE.BOLDLY < 47;
... took 0.032149556 seconds
$> SELECT LOSE.BESIDE, LOSE.GREY FROM LOSE;
... took 0.040376916 seconds
$> SELECT LOSE.BOLDLY, LOSE.GREY, LOSE.ATOP, LOSE.BESIDE FROM LOSE WHERE LOSE.ATOP = 'pink lime stealthily pending';
... took 0.008899229 seconds
$> SELECT LOSE.GREY, LOSE.BOLDLY, LOSE.ATOP, LOSE.BESIDE FROM LOSE WHERE LOSE.BOLDLY < 40 AND LOSE.BOLDLY >= 50
AND LOSE.ATOP = 'brown slyly upon';
... took 0.010964571 seconds
$> SELECT LOSE.BOLDLY + LOSE.BESIDE * LOSE.BOLDLY + LOSE.BESIDE + LOSE.BOLDLY AS OUTPUT_1, LOSE.GREY + 7.117299694632929 + LOSE.BESIDE + LOSE.BESIDE
+ 1.028946836169421 + LOSE.GREY + 8.050529699179807 + LOSE.GREY AS OUTPUT_2, LOSE.BOLDLY + LOSE.BESIDE * LOSE.BOLDLY AS OUTPUT_3 FROM LOSE;
... took 0.038250794 seconds
$> CREATE TABLE WHITE(THISTLE int, PALE int, CORNFLOWER string, NOD decimal);
$> SELECT WHITE.CORNFLOWER, WHITE.THISTLE, WHITE.PALE, WHITE.NOD FROM WHITE WHERE WHITE.THISTLE >= 47 AND WHITE.NOD <= 5.060042952283277;
... took 0.021470763 seconds
$> SELECT WHITE.THISTLE, WHITE.NOD, WHITE.CORNFLOWER FROM WHITE;
... took 0.028728029 seconds
$> SELECT WHITE.THISTLE, WHITE.NOD, WHITE.PALE FROM WHITE WHERE WHITE.THISTLE > 44 AND WHITE.CORNFLOWER = 'affix doze finally accounts' AND WHITE.NOD < 4.838894193976446;
... took 0.008941208 seconds
$> SELECT WHITE.PALE, WHITE.THISTLE, WHITE.CORNFLOWER FROM WHITE WHERE WHITE.CORNFLOWER = 'navajo sly ironic' AND WHITE.THISTLE > 50;
... took 0.006027645 seconds
$> SELECT WHITE.PALE + WHITE.PALE * WHITE.THISTLE AS OUTPUT_1, WHITE.NOD + WHITE.PALE * WHITE.PALE AS OUTPUT_2 FROM WHITE;
.. took 0.017839759 seconds
$> SELECT WHITE.NOD, WHITE.THISTLE, WHITE.CORNFLOWER FROM WHITE WHERE WHITE.NOD < 4.21140636455908 AND WHITE.NOD <= 4.629423335111538 AND WHITE.THISTLE > 45;
... took 0.009306568 seconds
$> SELECT 6.022950182012349 + WHITE.THISTLE + 3.1132767269755766 + WHITE.THISTLE + WHITE.THISTLE + WHITE.THISTLE AS OUTPUT_1, WHITE.THISTLE + WHITE.NOD + 4.804549905764231 + 2.023663069290813 + WHITE.THISTLE + WHITE.NOD + WHITE.THISTLE AS OUTPUT_2 FROM WHITE;
... took 0.031829143 seconds
$> CREATE TABLE PLAYERS( ID string, FIRSTNAME string, LASTNAME string, FIRSTSEASON int, LASTSEASON int, WEIGHT int, BIRTHDATE date );
NBA-01
$> SELECT FIRSTNAME, LASTNAME, WEIGHT, BIRTHDATE FROM PLAYERS WHERE WEIGHT>200;
... took 0.457499482 seconds
NBA-03
$> SELECT FIRSTNAME, LASTNAME, FIRSTSEASON, LASTSEASON FROM PLAYERS WHERE LASTSEASON-FIRSTSEASON>5;
... took 0.281330305 seconds
$> CREATE TABLE PLAYERS_SMALL( ID string, FIRSTNAME string, LASTNAME string, FIRSTSEASON int, LASTSEASON int, WEIGHT int, BIRTHDATE date );
NBA-06
$> SELECT P1.FIRSTNAME, P1.LASTNAME, P2.FIRSTNAME, P2.LASTNAME
FROM PLAYERS_SMALL P1, PLAYERS_SMALL P2
WHERE (P1.FIRSTSEASON=P2.FIRSTSEASON) AND (P1.FIRSTNAME != P2.FIRSTNAME) AND (P1.LASTNAME != P2.LASTNAME);
... took 8.829009442 seconds
NBA-07
$> SELECT P1.*, P2.*
FROM PLAYERS_SMALL P1, PLAYERS_SMALL P2
WHERE (P1.FIRSTSEASONP2.LASTSEASON);
... took 7.613429073 seconds
The performance measures shown below are for the 1000MB (SF 1) TPC-H Dataset.
TABLE LINEITEM - 753.9MB
$> CREATE TABLE LINEITEM(ORDERKEY INT,PARTKEY INT,SUPPKEY INT,LINENUMBER INT,QUANTITY DECIMAL,
EXTENDEDPRICE DECIMAL,DISCOUNT DECIMAL,TAX DECIMAL,RETURNFLAG CHAR(1),LINESTATUS CHAR(1),SHIPDATE DATE,
COMMITDATE DATE,RECEIPTDATE DATE,SHIPINSTRUCT CHAR(25),SHIPMODE CHAR(10),COMMENT VARCHAR(44),PRIMARY KEY (ORDERKEY,LINENUMBER));
TABLE ORDERS - 170.5MB
$> CREATE TABLE ORDERS(ORDERKEY INT,CUSTKEY INT,ORDERSTATUS CHAR(1),TOTALPRICE DECIMAL,ORDERDATE DATE,ORDERPRIORITY CHAR(15),
CLERK CHAR(15),SHIPPRIORITY INT,COMMENT VARCHAR(79),PRIMARY KEY (ORDERKEY));
TABLE CUSTOMER - 24.2MB
$> CREATE TABLE CUSTOMER(CUSTKEY INT,NAME VARCHAR(25),ADDRESS VARCHAR(40),NATIONKEY INT,PHONE CHAR(15),ACCTBAL DECIMAL,
MKTSEGMENT CHAR(10),COMMENT VARCHAR(117),PRIMARY KEY (CUSTKEY));
TABLE SUPPLIER - 1.4MB
$> CREATE TABLE SUPPLIER(SUPPKEY INT,NAME CHAR(25),ADDRESS VARCHAR(40),NATIONKEY INT,PHONE CHAR(15),ACCTBAL DECIMAL,COMMENT VARCHAR(101),PRIMARY KEY (SUPPKEY));
TABLE NATION - 2KB
$> CREATE TABLE NATION(NATIONKEY INT,NAME CHAR(25),REGIONKEY INT,COMMENT VARCHAR(152),PRIMARY KEY (NATIONKEY));
TABLE REGION - 384B
$> CREATE TABLE REGION(REGIONKEY INT,NAME CHAR(25),COMMENT VARCHAR(152),PRIMARY KEY (REGIONKEY));
TPC-H Query 1
$> select
lineitem.returnflag,
lineitem.linestatus,
sum(lineitem.quantity) as sum_qty,
sum(lineitem.extendedprice) as sum_base_price,
sum(lineitem.extendedprice*(1-lineitem.discount)) as sum_disc_price,
sum(lineitem.extendedprice*(1-lineitem.discount)*(1+lineitem.tax)) as sum_charge,
avg(lineitem.quantity) as avg_qty,
avg(lineitem.extendedprice) as avg_price,
avg(lineitem.discount) as avg_disc,
count(*) as count_order
from
lineitem
where
lineitem.shipdate <= DATE('1999-02-25')
group by
lineitem.returnflag,
lineitem.linestatus
order by
lineitem.returnflag,
lineitem.linestatus;
... took 32.48 seconds
Note: As 1998-12-01 is the highest possible ship date as defined in the database population 100% of the rows in the table will be scanned for the DATE in the above query.
TPC-H Query 3
$> select
lineitem.orderkey,
sum(lineitem.extendedprice * (1 - lineitem.discount)) as revenue,
orders.orderdate,
orders.shippriority
from
customer,
orders,
lineitem
where
customer.mktsegment = 'BUILDING'
and customer.custkey = orders.custkey
and lineitem.orderkey = orders.orderkey
and orders.orderdate < DATE('1995-03-18')
and lineitem.shipdate > DATE('1995-03-18')
group by
lineitem.orderkey,
orders.orderdate,
orders.shippriority
order by
revenue desc,
orders.orderdate,
lineitem.orderkey
limit 10;
... took 8.88 seconds
TPC-H Query 5
$> select
n_name,
sum(lineitem.extendedprice * (1 - lineitem.discount)) as revenue
from
customer,
orders,
lineitem,
supplier,
nation,
region
where
customer.custkey = orders.custkey
and lineitem.orderkey = orders.orderkey
and lineitem.suppkey = supplier.suppkey
and customer.nationkey = supplier.nationkey
and supplier.nationkey = nation.nationkey
and nation.regionkey = region.regionkey
and region.name = 'ASIA'
and orders.orderdate >= '1994-01-01'
and orders.orderdate < '1995-01-01'
group by
nation.name
order by
revenue desc;
... took 7.63 seconds
TPC-H Query 6
$> select
sum(lineitem.extendedprice*lineitem.discount) as revenue
from
lineitem
where
lineitem.shipdate >= date('1995-01-01')
and lineitem.shipdate < date ('1996-01-01')
and lineitem.discount > 0.04 and lineitem.discount < 0.06
and lineitem.quantity < 25;
... took 6.53 seconds
TPC-H Query 12
$> select
lineitem.shipmode,
sum(case
when orders.orderpriority = '1-URGENT'
or orders.orderpriority = '2-HIGH'
then 1
else 0
end) as high_line_count,
sum(case
when orders.orderpriority <> '1-URGENT'
and orders.orderpriority <> '2-HIGH'
then 1
else 0
end) as low_line_count
from
orders,
lineitem
where
orders.orderkey = lineitem.orderkey
and lineitem.shipmode in ('MAIL', 'SHIP')
and lineitem.commitdate < lineitem.receiptdate
and lineitem.shipdate < lineitem.commitdate
and lineitem.receiptdate >= '1994-01-01'
and lineitem.receiptdate < '1995-01-01'
group by
lineitem.shipmode
order by
lineitem.shipmode;
... took 6.73 seconds
* Query Engine supports Indexing for :
- TPC-H query 1 on LINEITEM.SHIPDATE
- TPC-H query 3 on LINEITEM.SHIPDATE, ORDERS.ORDERDATE and CUSTOMER.MKTSEGMENT
- TPC-H query 5 on ORDERS.ORDERDATE
- TPC-H query 6 on LINEITEM.SHIPDATE
- TPC-H query 12 on LINEITEM.RECEIPTDATE