Zetaris Documentation¶
Contents:¶
Lightning¶
Concepts¶
What is a Data Fabric
A data fabric is the Zetaris way of stitching together diverse, disparate data sources and datasets across different platforms (cloud, on-premise, edge, etc.). The main components in a fabric are:
Physical Datasources
Non-relational data sources based on physical files such as CSV or JSON, or on external RESTful APIs.
Logical Datasources
Logical datasources are traditional relational datasources. Any remote database can be connected as a logical datasource.
Data Fabric Builder
This is where all data sources are registered and the data fabric is built.
What is a Schema store
It is the repository of all relational schemas within data sources. It also enables you to build relational logical views and data marts on top of existing objects.
Schema Browser
Helps you navigate the schema store. The different objects within a schema store are:
Databases
Databases built in Lightning under the physical data sources in data fabrics.
Datasources
The registered logical data sources in the data fabric.
Data Marts
Data marts are the virtualised equivalent of physical data marts; no physical copy of the data is made when setting them up.
Views
Just like normal views, with the addition that you can create views across different data sources and cache them.
Lightning Quick-Start Guide¶
Welcome to the Zetaris Lightning Quick-Start Guide! The traditional approaches to data integration, where ETL (Extract-Transform-Load) or ELT (Extract-Load-Transform) are the principal data movement methods, are no longer sustainable. Data comes in all shapes, at all velocities and with different levels of structure, and the platform used to manage this level of activity needs to cope with the modern business demands placed on it. Traditional methods are simply not fast enough to keep pace with the business and the market that drives it. This document will provide you with a fundamental functional understanding of Zetaris Lightning so you can explore the capabilities of the platform and begin to understand the potential business benefits that can be quickly implemented, tested and productionised.
Accessing the Lightning User Interface: All users access their own Lightning instance through http://<host name>:9001/lightning-gui/login, where you can enter your specific username and password.
Lightning User Interface Tabs: Once successfully logged into the user interface, you will see your options at the top of the page. Here is a breakdown of the functions in the top menu.
Listed from left to right they are:
- Data Fabric Builder
- Schema Store View
- Data Catalog
- Query Builder
- Virtual Data Mart
- Topological View
- Data Quality Management
- Advanced Analytics
- Access Control
- User Management
Data Fabric Builder¶
The Data Fabric Builder lets you create data fabrics, which are combinations of data sources consisting of different databases or files. In the below example:
- The left panel displays all the available databases and data sources that have been defined.
- Databases can be added by simply clicking on the small green button, as shown in the image.
Schema Store View¶
The Schema Store is where queries can be constructed and run using different virtual data source tables to compose unique databases, data marts and views which are defined within the cloud data fabric.
- The left panel of the User interface is known as the Schema Browser and displays all the databases, data sources, data marts and views that are available for use in any query. Any new databases or data marts created will be added to these panels.
- The main panel to the right, the SQL Editor, is where all SQL queries are written, with the results displayed immediately below in the Results panel.
- The Result panel has two tabs: Result, where the output of a query is shown, and History, where the query history run on a particular query sheet is shown.
- The top right corner of the interface has the following buttons, from left to right:
- Plus Button - adds an additional User Interface tab
- File Open Button - opens any previously saved SQL queries
- Save Button - allows Save or Save-As of any new or edited SQL queries
- Max Result - sets the maximum number of results to be returned by a query
- Play Button - executes any SQL code that has been highlighted in the SQL Editor
- Stop Button - stops SQL query execution
- Format Button - arranges SQL Code in the Editor window into a human-readable format
The user panel, also located in the top right corner of the user interface, allows you to change your password as well as log out.
- The bottom of the results window will show a fail result message when a query has been unsuccessful.
Clicking the red triangle in the bottom right corner of the Result panel will display any errors relevant to the current failure result.
Data Catalog¶
The Data Catalog tab allows you to browse the existing defined data sources and look at the tables associated with them.
Adding a Physical Data Source: Database tables can be created from Azure BLOB storage, Amazon S3 buckets or an API end point. There are two mechanisms available:
- Data Fabric Builder, or
- SQL Editor on the Schema Store View tab
Data Fabric Builder¶
A database needs to be created before any tables can be assigned to that database. A name and description are required along with any additional parameters.
The created database appears under the Physical Data Sources and Data Bases section:
Once the database to which the tables are to be assigned has been created, the “+” addition sign can be pressed to create the table. The table can either be a File Store table (Azure BLOB or AWS S3 bucket) or an API end point.
The necessary credentials need to be supplied including:
- Name which the table will be called, which does not need to be the same as that of the file within Azure or AWS
- Path of the file
- Access Key Name
- Access Key Value
The default parameters can be altered if a header is not available
Before the table is created a preview of the records is provided.
The tables are available as soon as they are created.
SQL Editor on the Schema Store View tab¶
The same process is required when creating the tables and database using the command line interface.
CREATE LIGHTNING DATABASE TEST_DATABASE DESCRIBE BY "TEST_DATABASE";
CREATE LIGHTNING FILESTORE TABLE customer FROM TEST_DATABASE FORMAT CSV OPTIONS (
PATH "wasb://zettest-storage-container@zettesstorage.blob.core.windows.net/customer.csv",
inferSchema "true",
fs.azure.account.key.zettesstorage.blob.core.windows.net "XXXXXXXXXXXXXXXXXXXXXXCVCHJHVJLBJCGJHCVJLKNKJVKJHBLKMLKBHBNKLJMKL");
- Data sources can be added by clicking on the small green button next to Data Source, as shown in the image below
- A pop-up will appear to guide you through the data source setup process
- In step 1, Pull type is the default data source type; click Next
- Complete the Data Source Information page, including Database Type, Data Source Name,
Description (optional), JDBC Driver Class, JDBC URL, Username/Password and Schema type, followed by clicking Next
- Now select all the tables in the data source which you would like included as part of the virtual data source and click next
- A summary of all the database and driver parameters along with the selected tables will be shown.
To accept these changes and create the virtual data source, click the register button
Query Builder¶
This tab is used for interactive query building through a drag-and-drop method, which auto-generates the SQL statements.
- The left-most panel is the Schema Browser, which displays all available databases, data sources, data marts and views which can be used by the Query Builder.
- The main panel is the Query Builder canvas, where tables and views can be dragged and joined.
- The right-most panel displays the selected columns along with any predicates and group by / order by clauses
- Selected Columns will contain any columns that are dragged from the Schema Browser that you wish to use as part of your query.
- Predicates define the columns on which you want to apply where clauses or filters, e.g. “where employee_num < 1000”
- Group By lets you group results on the basis of a particular column, allowing you to filter the data. It is very similar to a where clause but also allows comparisons on aggregates such as counts.
- Order By allows you to order your data in descending or ascending order for a particular column you selected.
- The SQL and Result panel below the Schema Browser canvas is where the SQL is shown along with results of your query
- SQL Tab displays a dynamic view of the auto-generated SQL built from your selections
- Result Tab displays the query result once the execute button has been clicked
- Queries can be saved at any time by clicking the Save Icon described in section 2
Virtual Datamart¶
The Data Mart is a subset of the Virtual Data Warehouse and is usually oriented to a specific business line or team. Whereas the data warehouse may contain a wide set of data, the Virtual Data Mart can be quickly created for a single department or user requirement. Virtual Data Mart lets you create different data marts within a data warehouse by combining several tables, views, etc.
- On the left-hand side panel you can see all the available databases, data sources and all the data marts already created.
- The middle is where you can drag different tables and views to create a data mart. You will be allowed to drag tables and views only after you click the “Create” button on the right, which will ask you for a data mart name and description. To complete, click the Create button.
Topological View¶
The topological view shows the data that the Cloud Data Fabric has access to. The data can be queried by the customer administrator and allocated to a warehouse for end users to query. The below topological view indicates that the Cloud Data Fabric has access to Zetaris FusionDB, Postgres, Oracle, Teradata, SAFC and Cassandra.
Data Quality Management¶
Data Quality Management provides the ability to automatically execute data quality management and exception management through the implementation of business logic.
Data can be ingested and target business schemas loaded, with exceptions flagged. The types of exceptions and the offending records are identified.
Two tabs are available:
Zetaris Data Hub
Files used to populate the target business schema can be uploaded from this location
Data Quality Management Process
The business logic defined in a TFS (Technical Functional Specification), the exception types defined, the exceptions detected and the processes executing the TFS can be viewed.
The detected exceptions can be drilled down to the offending source record for the business logic violation.
Advanced Analytics¶
Advanced Analytics is a feature to be made available shortly.
Access Control and User Management¶
User Management allows the management of different users and roles.
Users and Roles can be added using the left hand side panel
The top right-hand side panel allows the addition or removal of either:
- a user to/from a role
- a role to/from a user
Lightning SQL Manual¶
This guide provides a reference for the SQL language and a set of example use cases. For information on Spark SQL, see the Apache Spark SQL, DataFrames, and Datasets Guide. For information on the Hive language, see the Hive Language Manual.
1. SELECT¶
SELECT [hints, ...] [* |DISTINCT] named_expression[, named_expression, ...]
FROM relation[, relation, ...]
[lateral_view[, lateral_view, ...]]
[WHERE boolean_expression]
[aggregation [HAVING boolean_expression]]
[WINDOW named_window[, WINDOW named_window, ...]]
[LIMIT num_rows]
named_expression:
: expression [AS alias]
relation:
| join_relation
| (table_name|query|relation) [sample] [AS alias]
: VALUES (expressions)[, (expressions), ...]
[AS (column_name[, column_name, ...])]
expressions:
: expression[, expression, ...]
Output data from one or more relations. A relation refers to any source of input data. It could be the contents of an existing table (or view), the joined result of two existing tables, or a subquery (the result of another SELECT statement).
1.1 * (ALL)¶
Select all matching rows from the relation. Enabled by default.
1.2 DISTINCT¶
Select all matching rows from the relation, then remove duplicates.
1.3 WHERE¶
Filter rows by predicate.
Like Operator: used to match text values against a pattern using wildcards. If the search expression can be matched to the pattern expression, the LIKE operator will return true, which is 1.
There are two wildcards used in conjunction with the LIKE operator:
- The percent sign (%)
- The underscore (_)
SELECT * FROM table_name
WHERE column LIKE 'XXXX%'
SELECT * FROM table_name
WHERE column LIKE '%XXXX%'
SELECT * FROM table_name
WHERE column LIKE 'XXXX_'
1.4 GROUP BY (Aggregate Functions)¶
It is used in collaboration with the SELECT statement to group together those rows in a table that have identical data. This is done to eliminate redundancy in the output and/or compute aggregates that apply to these groups. The GROUP BY clause follows the WHERE clause in a SELECT statement and precedes the ORDER BY clause.
SELECT column-list
FROM table_name
WHERE [ conditions ]
GROUP BY column1, column2....columnN
ORDER BY column1, column2....columnN
COUNT() - Returns the number of rows containing non-NULL values in the specified field.
SUM() - Returns the sum of the non-NULL values in the specified field.
AVG() - Returns the average of the non-NULL values in the specified field.
MAX() - Returns the maximum of the non-NULL values in the specified field.
MIN() - Returns the minimum of the non-NULL values in the specified field.
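For example, the following query (a sketch assuming the venue and batsman_runs columns of bbl.bbl01cs used in later examples) computes several aggregates per venue:
SELECT
venue,
COUNT(*) AS deliveries,
SUM(batsman_runs) AS total_runs,
AVG(batsman_runs) AS avg_runs
FROM
bbl.bbl01cs
GROUP BY venue
ORDER BY total_runs DESC;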
1.5 HAVING¶
Filter grouped result by predicate.
1.6 ORDER BY¶
Impose total ordering on a set of expressions. Default sort direction is ascending.
1.7 WINDOW¶
Assign an identifier to a window specification. Refer to section 6, for more information on Window Function.
1.8 LIMIT¶
Limit the number of rows returned.
1.9 VALUES¶
Explicitly specify values instead of reading them from a relation.
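For example, a small inline relation can be defined with VALUES and aliased like a table (a minimal sketch; the column names are illustrative):
SELECT * FROM VALUES (1, 'one'), (2, 'two'), (3, 'three') AS t(id, name);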
1.10 COMMON TABLE EXPRESSIONS (CTE)¶
A common table expression is a temporary result set which you can reference within another SQL statement, including SELECT, INSERT, UPDATE or DELETE. Common Table Expressions are temporary in the sense that they only exist during the execution of the query. The following shows the syntax for creating a CTE:
WITH cte_name (column_list) AS (
CTE_query_definition
)
[SELECT STATEMENT]:[INSERT STATEMENT]:[UPDATE STATEMENT]:[DELETE STATEMENT];
For Example:
with cte as
(SELECT
*
from
ipl.individual)
SELECT
count(*)
from
cte
In this syntax:
- First, specify the name of the CTE followed by an optional column list.
- Second, inside the body of WITH clause, specify a query that returns a result set. If you do not explicitly specify the column list after the CTE name, the select list of the cte query definition will become the column list of the CTE.
- Third, use the CTE like a table or view in the statement which can be a SELECT, INSERT, UPDATE OR DELETE
Common Table Expressions or CTEs are typically used to simplify complex joins and subqueries.
1.11 EXAMPLES:¶
SELECT * FROM ipl.iplmatch;
SELECT matchnumber,hometeam from ipl.iplmatch;
SELECT DISTINCT matchnumber,hometeam from ipl.iplmatch;
SELECT count(*) from ipl.iplmatch;
SELECT * FROM ipl.iplmatch where Location='Bangalore' and IndexNo=419112;
SELECT * FROM ipl.iplmatch LIMIT 10;
SELECT batsman_name as batsman ,max(batsman_runs) as max_runs FROM bbl.bbl01cs
group by 1
order by max_runs desc ;
SELECT batsman_name as batsman ,max(batsman_runs) as max_runs FROM bbl.bbl01cs
where batsman_name like '%Ma%'
group by 1
having max_runs >=4
order by max_runs desc;
2. SAMPLING¶
sample:
| TABLESAMPLE ((integer_expression | decimal_expression) PERCENT): TABLESAMPLE (integer_expression ROWS)
Sample the input data. This can be expressed in terms of either a percentage (must be between 0 and 100) or a fixed number of input rows.
2.1 EXAMPLES:¶
SELECT
*
FROM
ipl.individual
TABLESAMPLE (3 ROWS);
SELECT
*
FROM
ipl.individual
TABLESAMPLE (25 PERCENT);
3. JOIN¶
join_relation:
| relation join_type JOIN relation (ON boolean_expression | USING (column_name[, column_name, ...]))
: relation NATURAL join_type JOIN relation
join_type:
| INNER
| (LEFT|RIGHT) SEMI
| (LEFT|RIGHT|FULL) [OUTER]
: [LEFT] ANTI
3.1 INNER JOIN¶
Select all rows from both relations where there is a match.
3.2 OUTER JOIN¶
Select all rows from both relations, filling with null values on the side that does not have a match.
3.3 RIGHT JOIN¶
Select ALL rows from the right side and the corresponding matching values from the left.
3.4 LEFT JOIN¶
Select ALL rows from the left side and the corresponding matching values from the right.
3.5 EXAMPLES :¶
SELECT
a.game_id,
a.venue,
b.game_id,
b.venue
from
bbl.bbl01cs a
inner join
bbl.bbl02cs b
on
a.competition=b.competition;
SELECT
a.game_id,
a.venue,
b.game_id,
b.venue
from
bbl.bbl01cs a
left join
bbl.bbl02cs b
on a.venue=b.venue
SELECT
a.game_id as wbbl_game_id
, a.competition
, b.game_id
, b.competition
FROM
wbbl.wbbl01c a
inner join
bbl.bbl01cs b
on
a.game_id=b.game_id
group by 1,2,3,4
limit 10
4. LATERAL VIEW¶
lateral_view:
: LATERAL VIEW [OUTER] function_name (expressions)
table_name [AS (column_name[, column_name, ...])]
Generate zero or more output rows for each input row using a table-generating function. The most common built-in function used with LATERAL VIEW is explode.
4.1 LATERAL VIEW OUTER¶
Generate a row with null values even when the function returned zero rows.
4.2 EXAMPLES:¶
SELECT
*
FROM
bbl.bbl01cs
LATERAL VIEW explode(Array(1, 2, 3)) my_view
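The OUTER variant can be illustrated with an empty array: without OUTER the row would be dropped, whereas with OUTER the row is kept and the generated column is null (a sketch on the same table):
SELECT
*
FROM
bbl.bbl01cs
LATERAL VIEW OUTER explode(Array()) my_view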
5. AGGREGATION¶
aggregation:
: GROUP BY expressions [(WITH ROLLUP | WITH CUBE | GROUPING SETS (expressions))]
Group by a set of expressions using one or more aggregate functions. Common built-in aggregate functions include count, avg, min, max, and sum.
5.1 ROLLUP¶
Create a grouping set at each hierarchical level of the specified expressions.
For instance,
GROUP BY a, b, c WITH ROLLUP is equivalent to GROUP BY a, b, c GROUPING SETS ((a, b, c), (a, b), (a), ()).
The total number of grouping sets will be N + 1, where N is the number of group expressions.
5.2 CUBE¶
Create a grouping set for each possible combination of the specified expressions.
For instance,
GROUP BY a, b, c WITH CUBE is equivalent to GROUP BY a, b, c GROUPING SETS ((a, b, c), (a, b), (b, c), (a, c), (a), (b), (c), ()).
The total number of grouping sets will be 2^N, where N is the number of group expressions.
5.3 GROUPING SETS¶
Perform a group by for each subset of the group expressions specified in the grouping sets.
For instance,
GROUP BY x, y GROUPING SETS (x, y) is equivalent to the result of GROUP BY x unioned with that of GROUP BY y.
5.4 EXAMPLES:¶
SELECT
venue,
COUNT(*) AS num_matches
FROM
wbbl.wbbl01c
GROUP BY venue
SELECT
venue,
AVG(batsman_runs) AS avg_runs
FROM
wbbl.wbbl01c
GROUP BY venue
SELECT
venue,
fixture,
batsman_name
FROM
wbbl.wbbl01c
GROUP BY venue, fixture, batsman_name WITH ROLLUP
SELECT
venue,
fixture,
AVG(batsman_runs)
FROM
wbbl.wbbl01c
GROUP BY venue, fixture GROUPING SETS (venue, fixture)
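A WITH CUBE example in the same style, producing a grouping set for every combination of venue and fixture:
SELECT
venue,
fixture,
COUNT(*) AS num_rows
FROM
wbbl.wbbl01c
GROUP BY venue, fixture WITH CUBE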
6. WINDOW FUNCTION¶
window_expression:
: expression OVER window_spec
named_window:
: window_identifier AS window_spec
window_spec:
| window_identifier
: ((PARTITION|DISTRIBUTE) BY expressions
[(ORDER|SORT) BY sort_expressions] [window_frame])
window_frame:
| (RANGE|ROWS) frame_bound
: (RANGE|ROWS) BETWEEN frame_bound AND frame_bound
frame_bound:
| CURRENT ROW
| UNBOUNDED (PRECEDING|FOLLOWING)
: expression (PRECEDING|FOLLOWING)
Compute a result over a range of input rows. A windowed expression is specified using the OVER keyword, which is followed by either an identifier to the window (defined using the WINDOW keyword) or the specification of a window.
6.1 PARTITION BY¶
Specify which rows will be in the same partition; DISTRIBUTE BY is an alias.
6.2 ORDER BY¶
Specify how rows within a window partition are ordered; SORT BY is an alias.
6.3 RANGE BOUND¶
Express the size of the window in terms of a value range for the expression.
6.4 ROWS bound¶
Express the size of the window in terms of the number of rows before and/or after the current row.
6.5 CURRENT ROW¶
Use the current row as a bound.
6.6 UNBOUNDED¶
Use negative infinity as the lower bound or infinity as the upper bound.
6.7 PRECEDING¶
If used with a RANGE bound, this defines the lower bound of the value range. If used with a ROWS bound, this determines the number of rows before the current row to keep in the window.
6.8 FOLLOWING¶
If used with a RANGE bound, this defines the upper bound of the value range. If used with a ROWS bound, this determines the number of rows after the current row to keep in the window.
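As an illustrative example (a sketch assuming the bbl.bbl01cs columns used in earlier sections), the following ranks batsmen by runs within each venue and also computes a running total using an explicit ROWS frame:
SELECT
batsman_name,
batsman_runs,
RANK() OVER (PARTITION BY venue ORDER BY batsman_runs DESC) AS run_rank,
SUM(batsman_runs) OVER (PARTITION BY venue ORDER BY batsman_runs
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
FROM
bbl.bbl01cs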
7. HINTS¶
hints:
: /*+ hint[, hint, ...] */
hint:
: hintName [(expression[, expression, ...])]
Hints can be used to help the optimizer execute a query more efficiently.
For example, you can hint that a table is small enough to be broadcast, which would speed up joins.
You add one or more hints to a SELECT statement inside /*+ … */ comment blocks.
Multiple hints can be specified inside the same comment block, in which case the hints are separated by commas, and there can be multiple such comment blocks. A hint has a name (for example, BROADCAST) and accepts 0 or more parameters.
7.1 EXAMPLES¶
SELECT /*+ BROADCAST(customers) */
a.*,
b.*
FROM
TPCH_AZBLB.customer a,
TPCH_AZBLB.orders b
WHERE a.c_custkey = b.o_custkey
SELECT /*+ SKEW('orders') */
a.*,
b.*
FROM
TPCH_AZBLB.customer a,
TPCH_AZBLB.orders b
WHERE a.c_custkey = b.o_custkey
8. DATA SOURCE VIEW¶
A view is a virtual table based on the result set of a SQL statement.
A view contains rows and columns, just like a real table.
The fields in a view are fields from one or more real tables in the database.
Using the Zetaris Data Fabric, tables/views from multiple data sources can be used to create a new view.
You can add SQL functions, WHERE, and JOIN statements to a view and present the data as if the data were coming from one single table.
CREATE VIEW [OR ALTER] schema_name.view_name [(column_list)]
AS
select_statement;
In this syntax:
- First, specify the name of the view after the CREATE VIEW keywords. The schema_name is the name of the schema to which the view belongs.
- Second, specify a SELECT statement (select_statement) that defines the view after the AS keyword. The SELECT statement can refer to one or more tables.
If you don’t explicitly specify a list of columns for the view, the column list derived from the SELECT statement is used.
In case you want to redefine the view e.g., adding more columns to it or removing some columns from it, you can use the OR ALTER keywords after the CREATE VIEW keywords.
CREATE DATASOURCE VIEW cust_ordr_count AS
select
b.c_name,
count(distinct(a.o_orderkey))
FROM
TPCH_AZBLB.orders a
INNER JOIN
TPCH_AZBLB.customer b
on
a.o_custkey=b.c_custkey
group by 1
9. CAST OPERATOR¶
There are many cases where you want to convert a value of one data type into another. Lightning supports the standard CAST operator, which allows you to do this.
The following illustrates the syntax of type CAST:
CAST ( expression AS target_type );
In this syntax:
- First, specify an expression, which can be a constant, a table column or an expression that evaluates to a value.
- Then, specify the target data type to which you want to convert the result of the expression.
9.1 EXAMPLES¶
SELECT
CAST ('100' AS INTEGER);
If the expression cannot be converted to the target type, an error will be raised. See the following example:
SELECT
CAST ('10C' AS INTEGER);
[Err] ERROR: invalid input syntax for integer:
SELECT
CAST ('2015-01-01' AS DATE),
CAST ('01-OCT-2015' AS DATE);
10. Other String Functions¶
10.1 CONCAT( string str1, string str2… )¶
The CONCAT function concatenates all the strings.
Example: CONCAT('hadoop','-','hive') returns 'hadoop-hive'
CONCAT_WS( string delimiter, string str1, string str2... )
The CONCAT_WS function is similar to the CONCAT function. Here you can also provide the delimiter, which is used between the strings being concatenated.
Example: CONCAT_WS('-','hadoop','hive') returns 'hadoop-hive'
10.2 FIND_IN_SET( string search_string, string source_string_list )¶
The FIND_IN_SET function searches for the search string in source_string_list and returns the position of its first occurrence. The source string list should be a comma-delimited string. It returns 0 if the first argument contains a comma.
Example: FIND_IN_SET('ha','hao,mn,hc,ha,hef') returns 4
10.3 LENGTH( string str )¶
The LENGTH function returns the number of characters in a string.
Example: LENGTH('hive') returns 4
10.4 LOWER( string str ), LCASE( string str )¶
The LOWER or LCASE function converts the string into lower case letters.
Example: LOWER('HiVe') returns 'hive'
10.5 LPAD( string str, int len, string pad )¶
The LPAD function returns the string with a length of len characters left-padded with pad.
Example: LPAD('hive',6,'v') returns 'vvhive'
10.6 LTRIM( string str )¶
The LTRIM function removes all the leading spaces from the string.
Example: LTRIM(' hive') returns 'hive'
10.7 REPEAT( string str, int n )¶
The REPEAT function repeats the specified string n times.
Example: REPEAT('hive',2) returns 'hivehive'
10.8 RPAD( string str, int len, string pad )¶
The RPAD function returns the string with a length of len characters right-padded with pad.
Example: RPAD('hive',6,'v') returns 'hivevv'
10.9 REVERSE( string str )¶
The REVERSE function gives the reversed string
Example: REVERSE('hive') returns 'evih'
10.10 RTRIM( string str )¶
The RTRIM function removes all the trailing spaces from the string.
Example: RTRIM('hive ') returns 'hive'
10.11 SPACE( int number_of_spaces )¶
The SPACE function returns the specified number of spaces.
Example: SPACE(4) returns ' '
10.12 SPLIT( string str, string pat )¶
The SPLIT function splits the string around the pattern pat and returns an array of strings. You can specify regular expressions as patterns.
Example: SPLIT('hive:hadoop',':') returns ["hive","hadoop"]
10.13 SUBSTR( string source_str, int start_position [,int length] )¶
The SUBSTR or SUBSTRING function returns a part of the source string from the start position with the specified length of characters. If the length is not given, then it returns from the start position to the end of the string.
Example1: SUBSTR('hadoop',4) returns 'oop'
Example2: SUBSTR('hadoop',4,2) returns 'oo'
10.14 TRIM( string str )¶
The TRIM function removes both the trailing and leading spaces from the string.
Example: TRIM(' hive ') returns 'hive'
10.15 UPPER( string str ), UCASE( string str )¶
The UPPER or UCASE function converts the string into upper case letters.
Example: UPPER('HiVe') returns 'HIVE'
11. Other Useful Functions - Quick Reference¶
11.1 Mathematical Functions¶
The following built-in mathematical functions are supported in Lightning; most return NULL when the argument(s) are NULL:
abs(a) - Returns the absolute value. Returns double.
acos(a) - Returns the arc cosine of a if -1<=a<=1, or null otherwise. Returns double.
asin(a) - Returns the arc sine of a if -1<=a<=1, or null otherwise. Returns double.
atan(a) - Returns the arctangent of a. Returns double.
bin(a) - Returns the number in binary format. Returns string.
ceil(a) - Returns the minimum BIGINT value that is equal to or greater than the double. Returns bigint.
conv(num, from_base, to_base) - Converts a number from a given base to another. Returns string.
cos(a) - Returns the cosine of a (a is in radians). Returns double.
degrees(a) - Converts the value of a from radians to degrees. Returns double.
exp(a) - Returns e^a, where e is the base of the natural logarithm. Returns double.
floor(a) - Returns the maximum BIGINT value that is equal to or less than the double. Returns bigint.
hex(a) - If the argument is an int, hex returns the number as a string in hex format. Otherwise, if the argument is a string, it converts each character into its hex representation and returns the resulting string. Returns string.
ln(a) - Returns the natural logarithm of the argument. Returns double.
log(base, a) - Returns the base "base" logarithm of the argument. Returns double.
log10(a) - Returns the base-10 logarithm of the argument. Returns double.
log2(a) - Returns the base-2 logarithm of the argument. Returns double.
negative(a) - Returns -a. Returns int or double.
pmod(a, b) - Returns the positive value of a mod b. Returns integer.
positive(a) - Returns a. Returns int or double.
pow(a, p) - Returns a^p. Returns double.
radians(a) - Converts the value of a from degrees to radians. Returns double.
rand([seed]) - Returns a random number (that changes from row to row) that is distributed uniformly from 0 to 1. Specifying the seed will make sure the generated random number sequence is deterministic. Returns double.
round(a) - Returns the rounded BIGINT value of the double. Returns bigint.
round(a, d) - Returns the double rounded to d decimal places. Returns double.
sign(a) - Returns the sign of a as '1.0' or '-1.0'. Returns float.
sin(a) - Returns the sine of a (a is in radians). Returns double.
sqrt(a) - Returns the square root of a. Returns double.
tan(a) - Returns the tangent of a (a is in radians). Returns double.
unhex(str) - Inverse of hex. Interprets each pair of characters as a hexadecimal number and converts it to the character represented by the number. Returns string.
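A small illustrative query combining a few of these functions (the literal values are arbitrary):
SELECT round(3.14159, 2), floor(3.7), sqrt(16), pmod(-7, 3);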
11.2 The following are built-in String functions¶
ascii(str) - Returns the numeric value of the first character of str. Returns int.
base64(bin) - Converts the argument from binary to a base-64 string (as of 0.12.0). Returns string.
chr(A) - Returns the ASCII character having the binary equivalent to A (as of Hive 1.3.0 and 2.1.0). If A is larger than 256 the result is equivalent to chr(A % 256). Example: select chr(88); returns "X". Returns string.
concat(str1, str2, ...) - Returns the string or bytes resulting from concatenating the strings or bytes passed in as parameters in order, e.g. concat('foo', 'bar') results in 'foobar'. Note that this function can take any number of input strings. Returns string.
context_ngrams(sentences, context, k) - Returns the top-k contextual N-grams from a set of tokenized sentences, given a string of "context". See StatisticsAndDataMining for more information. Returns array<struct<string,double>>.
concat_ws(SEP, str1, str2, ...) - Like concat() above, but with custom separator SEP. Returns string.
concat_ws(SEP, array) - Like concat_ws() above, but taking an array of strings (as of Hive 0.9.0). Returns string.
decode(binary, charset) - Decodes the first argument into a string using the provided character set (one of 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16'). If either argument is null, the result will also be null (as of Hive 0.12.0). Returns string.
elt(N, str1, str2, ...) - Returns the string at index number N. For example elt(2,'hello','world') returns 'world'. Returns NULL if N is less than 1 or greater than the number of arguments. (See https://dev.mysql.com/doc/refman/5.7/en/string-functions.html#function_elt) Returns string.
encode(str, charset) - Encodes the first argument into a BINARY using the provided character set (one of 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16'). If either argument is null, the result will also be null (as of Hive 0.12.0). Returns binary.
field(val, val1, val2, ...) - Returns the index of val in the val1,val2,val3,... list, or 0 if not found. For example field('world','say','hello','world') returns 3. All primitive types are supported; arguments are compared using str.equals(x). If val is NULL, the return value is 0. (See https://dev.mysql.com/doc/refman/5.7/en/string-functions.html#function_field) Returns int.
find_in_set(str, strList) - Returns the first occurrence of str in strList, where strList is a comma-delimited string. Returns null if either argument is null. Returns 0 if the first argument contains any commas, e.g. find_in_set('ab', 'abc,b,ab,c,def') returns 3. Returns int.
format_number(X, D) - Formats the number X to a format like '#,###,###.##', rounded to D decimal places, and returns the result as a string. If D is 0, the result has no decimal point or fractional part (as of Hive 0.10.0). Returns string.
get_json_object(json_string, path) - Extracts a json object from a json string based on the json path specified, and returns the json string of the extracted json object. It will return null if the input json string is invalid. NOTE: the json path can only have the characters [0-9a-z], i.e., no upper-case or special characters. Also, the keys cannot start with numbers. This is due to restrictions on Hive column names. Returns string.
in_file(str, filename) - Returns true if the string str appears as an entire line in filename. Returns boolean.
initcap(str) - Returns the string with the first letter of each word in uppercase and all other letters in lowercase. Words are delimited by whitespace (as of Hive 1.1.0). Returns string.
instr(str, substr) - Returns the position of the first occurrence of substr in str. Returns int.
length(str) - Returns the length of the string. Returns int.
levenshtein(str1, str2) - Returns the Levenshtein distance between two strings (as of Hive 1.2.0). For example, levenshtein('kitten', 'sitting') results in 3. Returns int.
locate(substr, str, pos) - Returns the position of the first occurrence of substr in str after position pos. Returns int.
lower(A) - Returns the string resulting from converting all characters of A to lower case. For example, lower('fOoBaR') results in 'foobar'. Returns string.
lpad(str, len, pad) - Returns str, left-padded with pad to a length of len. If str is longer than len, the return value is shortened to len characters. In case of an empty pad string, the return value is null. Returns string.
ltrim(A) - Returns the string resulting from trimming spaces from the beginning (left-hand side) of A, e.g. ltrim(' foobar ') results in 'foobar '. Returns string.
ngrams(sentences, n, k) - Returns the top-k N-grams from a set of tokenized sentences, such as those returned by the sentences() UDAF. See StatisticsAndDataMining for more information. Returns array<struct<string,double>>.
parse_url(url, partToExtract [, keyToExtract]) - Returns the specified part from the URL. Valid values for partToExtract include HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, and USERINFO, e.g. parse_url('https://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1', 'HOST') returns 'facebook.com'. Also, the value of a particular key in QUERY can be extracted by providing the key as the third argument, e.g. parse_url('https://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1', 'QUERY', 'k1') returns 'v1'. Returns string.
printf(format, ...) - Returns the input formatted according to printf-style format strings (as of Hive 0.9.0). Returns string.
regexp_extract(str, pattern, index) - Returns the string extracted using the pattern, e.g. regexp_extract('foothebar', 'foo(.*?)(bar)', 2) returns 'bar'. Note that some care is necessary in using predefined character classes: using '\s' as the second argument will match the letter s; '\\s' is necessary to match whitespace, etc. The 'index' parameter is the Java regex Matcher group() method index. Returns string.
regexp_replace(INITIAL_STRING, PATTERN, REPLACEMENT) - Returns the string resulting from replacing all substrings in INITIAL_STRING that match the Java regular expression syntax defined in PATTERN with instances of REPLACEMENT, e.g. regexp_replace("foobar", "oo|ar", "") returns 'fb'. Note that some care is necessary in using predefined character classes: using '\s' as the second argument will match the letter s; '\\s' is necessary to match whitespace, etc. Returns string.
repeat(str, n) - Repeats str n times. Returns string.
replace(A, OLD, NEW) - Returns the string A with all non-overlapping occurrences of OLD replaced with NEW (as of Hive 1.3.0 and 2.1.0). Example: select replace("ababab", "abab", "Z"); returns "Zab". Returns string.
reverse(str) - Returns the reversed string. Returns string.
rpad(str, len, pad) - Returns str, right-padded with pad to a length of len. If str is longer than len, the return value is shortened to len characters. In case of an empty pad string, the return value is null. Returns string.
rtrim(A) - Returns the string resulting from trimming spaces from the end (right-hand side) of A, e.g. rtrim(' foobar ') results in ' foobar'. Returns string.
sentences(str, lang, locale) - Tokenizes a string of natural language text into words and sentences, where each sentence is broken at the appropriate sentence boundary and returned as an array of words. The 'lang' and 'locale' arguments are optional, e.g. sentences('Hello there! How are you?') returns ( ("Hello", "there"), ("How", "are", "you") ). Returns array<array<string>>.
soundex(str) - Returns the soundex code of the string (as of Hive 1.2.0). For example, soundex('Miller') results in M460. Returns string.
space(n) - Returns a string of n spaces. Returns string.
split(str, pat) - Splits str around pat (pat is a regular expression). Returns array<string>.
str_to_map(text [, delimiter1, delimiter2]) - Splits text into key-value pairs using two delimiters. Delimiter1 separates the text into K-V pairs, and Delimiter2 splits each K-V pair. Default delimiters are ',' for delimiter1 and '=' for delimiter2. Returns map<string,string>.
substr(A, start) - Returns the substring or slice of the byte array of A starting from the start position till the end of string A, e.g. substr('foobar', 4) results in 'bar'. Returns string.
substr(A, start, len) - Returns the substring or slice of the byte array of A starting from the start position with length len, e.g. substr('foobar', 4, 1) results in 'b'. Returns string.
translate(input, from, to) - Translates the input string by replacing the characters present in the from string with the corresponding characters in the to string. This is similar to the translate function in PostgreSQL. If any of the parameters to this UDF are NULL, the result is NULL as well (available as of Hive 0.10.0; char/varchar support added as of Hive 0.14.0). Returns string.
trim(A) - Returns the string resulting from trimming spaces from both ends of A, e.g. trim(' foobar ') results in 'foobar'. Returns string.
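A small illustrative query combining a few of these functions:
SELECT concat_ws('-', 'zetaris', 'lightning'), upper('hive'), lpad('42', 5, '0'), split('hive:hadoop', ':');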
12 Register Logical Datasources¶
To build a virtual data lake, a user needs to identify the databases to be connected. Lightning supports all JDBC-compliant databases, all well-known NoSQL databases, REST APIs, and CSV and JSON files.
Register Clustered Databases¶
A user needs to provide the JDBC driver class, URL and connectivity credentials, including any extra parameters the database needs.
Syntax:
CREATE DATASOURCE ORCL [DESCRIBE BY "Oracle for Product Master"] OPTIONS (
jdbcdriver "oracle.jdbc.OracleDriver",
jdbcurl "jdbc:oracle:thin:@oracle-master:1521:orcl",
username "scott",
password "tiger",
[key "value"]*)
Real-world example showing a connection to Oracle on port 1521 of localhost, with a database name of orcl, username scott and password tiger:
CREATE DATASOURCE ORACLE DESCRIBE BY "Oracle for Product Master" OPTIONS (
jdbcdriver "oracle.jdbc.OracleDriver",
jdbcurl "jdbc:oracle:thin:@localhost:1521:orcl",
username "scott",
password "tiger")
If the registered database supports cluster-based computing such as MPP, then a user can register slave nodes so that Lightning can query the slave nodes directly rather than routing everything through the master node.
ADD SLAVE DATASOURCE TO ORCL OPTIONS (
jdbcdriver "oracle.jdbc.OracleDriver",
jdbcurl "jdbc:oracle:thin:@oracle-slave1:1521:orcl",
username "scott",
password "tiger",
[key "value"])*
ADD SLAVE DATASOURCE TO ORCL OPTIONS (
jdbcdriver "oracle.jdbc.OracleDriver",
jdbcurl "jdbc:oracle:thin:@oracle-slave2:1521:orcl",
username "scott",
password "tiger",
[key "value"])*
ADD SLAVE DATASOURCE TO ORCL OPTIONS (
jdbcdriver "oracle.jdbc.OracleDriver",
jdbcurl "jdbc:oracle:thin:@oracle-slave3:1521:orcl",
username "scott",
password "tiger",
[key "value"])*
The above example shows registering 1 master node (oracle-master) and 3 slave nodes (oracle-slave1, oracle-slave2, and oracle-slave3).
- ORCL is the alias for the target data source, chosen by the user when creating the data source.
- If a schema is provided, Lightning will only ingest metadata from that schema.
- If schema_prepended_table is set to true, the schema will be prepended to the table name, as tables with the same name may exist across different schemas. For example, the role table in the zetaris_bi schema will be named zetaris_bi__role.
Register RDBMS Datasource¶
CREATE DATASOURCE MSSQL DESCRIBE BY "MSSQL-2017-linux " OPTIONS (
jdbcdriver "com.microsoft.sqlserver.jdbc.SQLServerDriver",
jdbcurl "jdbc:sqlserver://localhost:1433 ",
databaseName "DemoData",
username "scott" ,
password "tiger",
schema "dbo"
)
CREATE DATASOURCE MY_SQL DESCRIBE BY "MySQL " OPTIONS (
jdbcdriver "com.mysql.jdbc.Driver",
jdbcurl "jdbc:mysql://127.0.0.1/test_db",
username "scott" ,
password "tiger"
)
CREATE DATASOURCE DB2_DB2INST1 DESCRIBE BY "DB2 Sample DB Schema " OPTIONS (
jdbcdriver "com.ibm.db2.jcc.DB2Driver",
jdbcurl "jdbc:db2://127.0.0.1:50000/db_name",
username "db2inst1" ,
password "db2inst1-pwd",
schema "DB2INST1",
schema_prepended_table "true"
)
CREATE DATASOURCE GREEN_PLUM DESCRIBE BY "GREEN_PLUM " OPTIONS (
jdbcdriver "org.postgresql.Driver",
jdbcurl "jdbc:postgresql://localhost:5432/postgres",
username "gpadmin" ,
password "pivotal",
schema "public"
)
CREATE DATASOURCE TERA_DATA DESCRIBE BY "TERA_DATA " OPTIONS (
jdbcdriver "com.teradata.jdbc.TeraDriver",
jdbcurl "jdbc:teradata://10.128.87.16/DBS_PORT=1025",
username "dbc" ,
password "dbc",
schema "dbcmngr"
)
CREATE DATASOURCE AWS_AURORA DESCRIBE BY "AWS_AURORA " OPTIONS (
jdbcdriver "com.mysql.jdbc.Driver",
jdbcurl "jdbc:mysql://zet-aurora-cluster.cluster-ckh4ncwbhsty.ap-southeast-2.rds.amazonaws.com/your_db?",
username "your_db_account_name" ,
password "your_db_account_password"
)
CREATE DATASOURCE REDSHIFT DESCRIBE BY "AWS RedShift" OPTIONS (
jdbcdriver "com.amazon.redshift.jdbc.Driver",
jdbcurl "jdbc:redshift://zetaris.cyzoanxzdpje.ap-southeast-2.redshift.amazonaws.com:5439/your_db_name",
username "your_db_account_name",
password "your_db_account_password"
)
Register NOSQL Datasource¶
For MongoDB, the below 5 parameters (host, port, database name, username and password) must be provided.
CREATE DATASOURCE MONGO DESCRIBE BY "MongoDB" OPTIONS (
lightning.datasource.mongodb.host "localhost",
lightning.datasource.mongodb.port "27017",
lightning.datasource.mongodb.database "lightning-demo",
lightning.datasource.mongodb.username "",
lightning.datasource.mongodb.password ""
)
For Cassandra, there is only one Lightning-specific parameter, which is the key space for the connection. The other parameters start with the "spark.cassandra" prefix and are passed to the Spark Cassandra connector (https://github.com/datastax/spark-cassandra-connector).
CREATE DATASOURCE CSNDR DESCRIBE BY "Cassandra" OPTIONS (
spark.cassandra.connection.host "localhost",
spark.cassandra.connection.port "9042",
spark.cassandra.auth.username "cassandra",
spark.cassandra.auth.password "cassandra",
lightning.datasource.cassandra.keyspace "lightning_demo"
)
CREATE DATASOURCE AWS_DYNAMODB DESCRIBE BY "AWS DynamoDB" OPTIONS (
accessKeyId "Your_aws_accessKeyId",
secretKey "Your_aws_SecretAccessKey" ,
region "ap-southeast-2"
)
13 Create Physical Datasources¶
These are data sources based on physical files residing in an on-prem filestore (local filesystem, NFS filesystem) or cloud filestores like S3 and Azure Blob. One needs to first create a Lightning database and then register the respective files under this namespace.
CREATE LIGHTNING DATABASE AWS_S3 DESCRIBE BY "AWS S3 bucket" OPTIONS (
[key "value"]
)
Ingest file from local filesystem¶
CREATE LIGHTNING FILESTORE TABLE pref FROM HR FORMAT (CSV | JSON)
OPTIONS (path "file path", header "true", inferSchema "true", [key value pair]);
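For example, a minimal sketch registering a CSV file from a hypothetical local path under the HR database:
CREATE LIGHTNING FILESTORE TABLE pref FROM HR FORMAT CSV
OPTIONS (path "/data/hr/pref.csv", header "true", inferSchema "true");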
Ingest RESTful Service¶
For a RESTful service that returns JSON, a user needs to provide the end point, HTTP method and encoding type as well as the schema:
REGISTER REST DATASOURCE TABLE SAFC_USERS FROM SAFC SCHEMA (
uid Long,
gender String,
age Integer ,
job String,
ts String)
OPTIONS (
endpoint "/example/users",
method "GET",
requesttype "URLENCODED"
);
Other parameters for the API call, such as a security key, can be provided in the OPTIONS field.
Ingest file from S3¶
CREATE LIGHTNING FILESTORE TABLE customer FROM TPCH_S3 FORMAT CSV(JSON) OPTIONS (
PATH "s3n://zetaris-lightning-test/csv-data/tpc-h/customer.csv",
inferSchema "true",
AWSACCESSKEYID "AKIAITGIWHBIPE3NU5GA",
AWSSECRETACCESSKEY "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"
)
Ingest file from Azure Blob¶
CREATE LIGHTNING FILESTORE TABLE customer FROM TPCH_AZBLB FORMAT CSV(JSON) OPTIONS (
PATH "wasb://zettest-storage-container@zettesstorage.blob.core.windows.net/customer.csv",
inferSchema "true",
fs.azure.account.key.zettesstorage.blob.core.windows.net "bHLzau36KlZ6cYnSrvPzSJVniBDtu819nHTR/+hRyDZEVScQ3wuesst9P5/I7vqG+4czeimuHSrPe2ZtK+b+BQ=="
)
The key name for the security key depends on the Azure Blob container; refer to the Azure Blob service documentation.
14 Ingest Metadata¶
Once a data source is registered in Lightning, all table, column and constraint metadata can be ingested.
Ingest all tables from the data source¶
REGISTER DATASOURCE TABLES FROM ORCL
This command will connect to the ORCL database and ingest all metadata (tables, columns, foreign keys, indexes and all other constraints) into the Schema Store.
Ingest a table from the data source¶
REGISTER DATASOURCE TABLE "USER" [USER_ALIAS] FROM ORCL
This will register the "USER" table, as USER_ALIAS if an alias is provided.
Update Schema¶
When changes are made to the target data source, a user can reflect them using the update schema command:
UPDATE DATASOURCE SCHEMA ORCL
Update description and materialised table for each relation in a data source¶
15 Manage Schema Store¶
Lightning provides various commands to manage metadata in the schema store. These will also be provided via a RESTful service.
Datasource¶
This command shows the data sources registered in the schema store:
SHOW DATASOURCES
This command drops the registered data source as well as all tables under it:
DROP DATASOURCE ORCL
DESCRIBE DATASOURCE ORCL
DESCRIBE SLAVE DATASOURCE ORCL
Table¶
DESC ORCL.USERS
SHOW TABLES
SHOW DATASOURCE TABLES ORCL
DROP TABLE ORCL.USERS
View¶
Lightning supports the view capability with a query definition over a single data source or across multiple data sources.
CREATE DATASOURCE VIEW TEEN_AGER FROM ORCL AS
SELECT * FROM USERS WHERE AGE >= 13 AND AGE < 20
The TEEN_AGER view belongs to the ORCL data source. With this capability a user can create a view with a DBMS-native query, which is really handy:
CREATE DATASOURCE VIEW SALARY_RANK FROM ORCL AS
SELECT department_id, last_name, salary, RANK() OVER (PARTITION BY department_id ORDER BY salary) RANK
FROM employees
WHERE department_id = 60
ORDER BY RANK, last_name
SELECT * FROM ORCL.SALARY_RANK will produce :
DEPARTMENT_ID LAST_NAME SALARY RANK
------------- ------------------------- ---------- ----------
60 Lorentz 4200 1
60 Austin 4800 2
60 Pataballa 4800 2
60 Ernst 6000 4
60 Hunold 9000 5
Also, those views can be joined with other tables in other data sources.
A view can also span different data sources:
CREATE DATASOURCE VIEW TOP10_MOVIES_FOR_TEENS AS
SELECT movies_from_oracle.title, user_rating.count, user_rating.min, user_rating.max, user_rating.avg
FROM(
SELECT iid, count(*) count, min(pref) min, max(pref) max, avg(pref) avg
FROM TRDT.ratings ratings_from_teradata, PGRS.users users_from_postgres
WHERE users_from_postgres.age >= 13 AND users_from_postgres.age < 20
AND ratings_from_teradata.uid = users_from_postgres.uid
GROUP BY ratings_from_teradata.iid
ORDER BY avg DESC
LIMIT 20
) AS user_rating, ORCL.movies movies_from_oracle
WHERE movies_from_oracle.iid = user_rating.iid
This view can be queried like a normal table:
SELECT * FROM TOP10_MOVIES_FOR_TEENS
DROP VIEW ORCL.TEEN_AGER;
16 Run Query¶
Lightning supports SQL:2003. Also, it can run all 99 TPC-DS queries. As long as a data source is registered in the schema store, a query can span all data sources.
For example, the following query joins data across three different data sources (Teradata ↔ Oracle ↔ Cassandra):
SELECT users_from_cassandra.age, users_from_cassandra.gender, movies_from_oracle.title title, ratings_from_teradata.pref, ratings_from_teradata.ts
FROM TRDT.ratings ratings_from_teradata, ORCL.movies movies_from_oracle, CSNDR.users users_from_cassandra
WHERE users_from_cassandra.gender = 'F'
AND ratings_from_teradata.uid = users_from_cassandra.uid
AND movies_from_oracle.iid = ratings_from_teradata.iid
17 Materialization and Cache¶
For various reasons, for example query performance, any data source table or view can be materialised by leveraging Zetaris Fusion DB. Also, Lightning supports cache capabilities, where a user can load all data into main memory.
Materialization¶
For example, the following query materialises all data from the RESTful service into the USERS_FOR_COPY table in Fusion DB:
INSERT INTO FUSIONDB.USERS_FOR_COPY
SELECT uid, gender, age, job , ts FROM SAFC.SAFC_USERS
Cache/Uncache¶
A user can load all data into main memory by leveraging the cache capability and can also uncache it at any time.
CACHE TABLE pref;
CACHE TABLE ORCL.movies;
The pref and ORCL.movies tables are now cached, and the following query performs a lot better:
SELECT movies_from_oracle.title, hdfs_pref.count, hdfs_pref.min, hdfs_pref.max, hdfs_pref.avg
FROM (
SELECT iid, count(*) count, min(pref) min, max(pref) max, avg(pref) avg
FROM pref
GROUP BY iid
) AS hdfs_pref, ORCL.movies movies_from_oracle
WHERE movies_from_oracle.iid = hdfs_pref.iid
These tables can be uncached at any time:
UNCACHE TABLE pref
UNCACHE TABLE ORCL.movies
18 Statistics¶
Lightning comes with a CBO (Cost-Based Optimizer) to reduce data shuffling across the cluster. To do this, Lightning keeps statistics for the data source. There are two types of statistics: table-level statistics and column-level statistics.
Table level statistics¶
ANALYZE DATASOURCE TABLE ORCL.MOVIES
This command generates statistics such as size in bytes and cardinality for the table; these can be browsed with the following command:
SHOW DATASOURCE TABLE STATISTICS ORCL.MOVIES
Column level statistics¶
ANALYZE DATASOURCE TABLE ORCL.MOVIES COMPUTE STATISTICS FOR COLUMNS (IID, TITLE)
This command generates statistics such as cardinality, number of nulls, min, max and average value; these can be browsed with the following command:
SHOW DATASOURCE COLUMN STATISTICS ORCL.MOVIES;
19 Partitioning¶
Query performance can be improved by partitioning a table. Partitioning here means that all records are split into multiple partitions which are processed independently on each worker node.
CREATE DATASOURCE PARTITION ON ORCL.USERS OPTIONS (
COLUMN "UID",
COUNT "2",
LOWERBOUND "1",
UPPERBOUND "6040")
This command makes two partitions based on the "UID" column. The lower/upper bounds provide the boundary values for the partitioning. The partitioning can be removed with:
DROP DATASOURCE PARTITION ON ORCL.USERS
20 Import CSV file¶
Lightning supports CSV files and running queries on top of them. A CSV file can be imported using either:
Hive syntax¶
CREATE EXTERNAL TABLE pref (uid INT, iid INT, pref FLOAT, ts STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'csv directory'
With this syntax, a user can import file formats other than CSV (TSV, for example). However, it does not support:
- Headers: the CSV header must be removed.
- LOCATION must be a directory. Create a directory and place the CSV file there.
- For a directory in an S3 bucket, the CSV directory looks like "s3n://mys3bucket/perf/".
OR
Lightning Syntax¶
CREATE TABLE pref
USING com.databricks.spark.csv
OPTIONS (path "file path", header "true", inferSchema "true")
With this syntax, a user can:
- infer the schema
- use a header row
- load a single file
A file path in Amazon S3 looks like s3n://mys3bucket/perf/pref.csv.
21 Access Control¶
User Management¶
Only the admin user or users in the "admin" role can add/drop users.
- level : admin | general
- password does not allow white spaces
ADD USER WITH (
email 'someone@zetaris.com',
name 'someone',
level 'general',
password '1234567'
)
UPDATE USER user_id SET PASSWORD 'new_password'
DESCRIBE USER user_id
DROP USER user_id
SHOW USERS
Role Based Access(RBA) control¶
Lightning provides Role Based Access, which limits users to accessing a specific set of data. This is applied at the data source level or at the table level in each data source. Only admin users or equivalents can run these commands.
Privileges
- SELECT privilege - Give read access to the data source or relation
- INSERT privilege - Give insert access to the data source or relation
- CACHE privileges - Give cache access to a relation : (UN)CACHE DATASOURCE TABLE
Predefined Roles
Roles are case insensitive. The predefined roles are: admin, none, all, default.
CREATE ROLE role_name [DESCRIBE BY "this is blah~~~"]
DROP ROLE role_name;
SHOW ROLES
ASSIGN USER user_name [, user_name] ... TO ROLE role_name
REVOKE USER user_name [, user_name] ... FROM ROLE role_name
A user granted a privilege WITH GRANT OPTION can grant the same privilege on the table.
GRANT (SELECT | INSERT | CACHE) ON table_or_view_name TO principal_spec [, principal_spec] ... [WITH GRANT OPTION]
REVOKE (SELECT | INSERT | CACHE) ON table_or_view_name FROM principal_spec [, principal_spec] ...
- A wildcard can only be used in the table field. For example, ORCL.* is allowed but *.* or *.movies are not allowed.
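A minimal sketch combining these commands (the role name, user name and table are illustrative):
CREATE ROLE analyst DESCRIBE BY "Read-only analyst role";
ASSIGN USER someone TO ROLE analyst;
GRANT SELECT ON ORCL.movies TO ROLE analyst WITH GRANT OPTION;
REVOKE SELECT ON ORCL.movies FROM ROLE analyst;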
SHOW GRANT [principal_specification] ON (ALL | [TABLE] table_or_view_name)
principal_specification: USER user | ROLE role
It will display:
table_identifier | principal_name | principal_type | privilege | grant_option | grant_time | grantor
+------------------------+----------------------+--------------------+-------------+------------------+----------------------------------+----------+
ORCL.movies | ashutosh | USER | DELETE | false | 2018-05-07 11:44:12.301 | thejas
ORCL.movies | ashutosh | USER | INSERT | false | 2018-05-07 11:44:12.301 | thejas
ORCL.ratings | ashutosh | ROLE | SELECT | false | 2018-05-07 11:44:12.301 | thejas
Cloud DataFabric¶
Cloud Data Fabric Quick-Start Guide¶
Welcome to the Zetaris Cloud Data Fabric Quick-Start Guide! The traditional approaches to data integration, where ETL (Extract-Transform-Load) or ELT (Extract-Load-Transform) are the principal data movement methods, are no longer sustainable. Data comes in all shapes, at all velocities and with different levels of structure, and the platform used to manage this level of activity needs to cope with the modern business demands placed on it. Traditional methods are simply not fast enough to keep pace with the business and the market that drives it. This document will provide you with a fundamental functional understanding of the Zetaris Cloud Data Fabric so you can explore the capabilities of the platform and begin to understand the potential business benefits that can be quickly implemented, tested and productionised.
Accessing the Cloud Data Fabric User Interface: All users access their own Cloud Data Fabric through http://ui.datafabric.zetaris.com/lightning-gui/login, where you can enter your specific username and password.
Cloud Data Fabric User Interface Tabs: Once successfully logged into the user interface, you will see your options at the top of the page. Here is a breakdown of the functions in the top menu.
Listed from left to right they are:
- Data Fabric Builder
- Schema Store View
- Data Catalog
- Query Builder
- Virtual Data Mart
- Warehouse Builder
- Topological View
- User Management
Data Fabric Builder¶
The Data Fabric Builder lets you create data fabrics, which are combinations of data sources consisting of different databases or files. In the below example:
- The left panel displays all the available databases and data sources that have been defined.
- Databases can be added by simply clicking on the small green button, as shown in the image.
Schema Store View¶
The Schema Store is where queries can be constructed and run using different virtual data source tables to compose unique databases, data marts and views which are defined within the cloud data fabric.
- The left panel of the User interface is known as the Schema Browser and displays all the databases, data sources, data marts and views that are available for use in any query. Any new databases or data marts created will be added to these panels.
- The main panel to the right, the SQL Editor, is where all SQL queries are written, with the results displayed immediately below in the Results panel.
- The Result panel has two tabs: Result, where the output of a query is shown, and History, where the query history run on a particular query sheet is shown.
- The top right corner of the interface has the following buttons, from left to right:
- Plus Button - adds an additional User Interface tab
- File Open Button - opens any previously saved SQL queries
- Save Button - allows Save or Save-As of any new or edited SQL queries
- Max Result - sets the maximum number of results to be returned by a query
- Play Button - executes any SQL code that has been highlighted in the SQL Editor
- Stop Button - stops SQL query execution
- Format Button - arranges SQL Code in the Editor window into a human-readable format
The user panel, also located in the top right corner of the user interface, allows you to change your password as well as log out.
- The bottom of the results window will show a fail result message when a query has been unsuccessful.
Clicking the red triangle in the bottom right corner of the Result panel will display any errors relevant to the current failure result.
Data Catalog¶
The Data Catalog tab allows you to browse the existing defined data sources and look at the tables associated with them.
Adding a Physical Data Source: Database tables can be created from Azure BLOB storage, Amazon S3 buckets or an API end point. There are two mechanisms available:
- Data Fabric Builder, or
- SQL Editor on the Schema Store View tab
Data Fabric Builder¶
A database needs to be created before any tables can be assigned to that database. A name and description are required along with any additional parameters.
The created database appears under the Physical Data Sources and Data Bases section:
Once the database to which the tables are to be assigned has been created, the “+” addition sign can be pressed to create the table. The table can either be a File Store table (Azure BLOB or AWS S3 bucket) or an API end point.
The necessary credentials need to be supplied including:
- Name which the table will be called, which does not need to be the same as that of the file within Azure or AWS
- Path of the file
- Access Key Name
- Access Key Value
The default parameters can be altered if a header is not available
Before the table is created a preview of the records is provided.
The tables are available as soon as they are created.
SQL Editor on the Schema Store View tab¶
The same process is required when creating the tables and database using the command line interface.
CREATE LIGHTNING DATABASE TEST_DATABASE DESCRIBE BY " TEST_DATABASE";
CREATE LIGHTNING FILESTORE TABLE customer FROM TEST_DATABASE FORMAT CSV OPTIONS (
PATH "wasb://zettest-storage-container@zettesstorage.blob.core.windows.net/customer.csv",
inferSchema "true",
fs.azure.account.key.zettesstorage.blob.core.windows.net "XXXXXXXXXXXXXXXXXXXXXXCVCHJHVJLBJCGJHCVJLKNKJVKJHBLKMLKBHBNKLJMKL");
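Once created, the table can be queried immediately from the same editor, for example (using the customer table registered above):
SELECT * FROM TEST_DATABASE.customer LIMIT 10;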
- Data sources can be added by clicking on the small green button next to data source as shown in image below
- A pop-up will appear to guide you through the data source setup process
- In step 1, Pull type is the default data source type, click next
- Complete the Data Source Information page including Database Type, Data Source Name,
Description (optional), JDBC Driver Class, JDBC URL, Username/Password and Schema type, followed by clicking next
- Now select all the tables in the data source which you would like included as part of the virtual data source and click next
- A summary of all the database and driver parameters along with the selected tables will be shown.
To accept these changes and create the virtual data source, click the register button. The same registration can also be done in SQL, as sketched below.
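The equivalent commands can be run from the SQL Editor on the Schema Store View tab; a minimal sketch with placeholder driver, URL and credentials (MY_SOURCE is an illustrative name, substitute your own connection details):
CREATE DATASOURCE MY_SOURCE DESCRIBE BY "Example logical data source" OPTIONS (
jdbcdriver "org.postgresql.Driver",
jdbcurl "jdbc:postgresql://your-host:5432/your_db",
username "your_username",
password "your_password"
);
REGISTER DATASOURCE TABLES FROM MY_SOURCE;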
Query Builder¶
This tab is used for interactive query building through a drag and drop method, which auto-generates the SQL statements.
- The left-most panel is the Schema Browser which displays all available databases, data sources, data mart and views which can be used by query builder.
- The main panel is the Query Builder canvas where tables and views can be dragged and joined.
- The right-most panel displays the selected columns along with any predicates and group by / order by clauses
- Selected Columns will contain any columns that are dragged from the Schema Browser that you wish to use as part of your query.
- Predicates defines the columns on which you want to apply where clauses or filters, e.g. “where employee_num < 1000”
- Group By lets you group results on the basis of a particular column; it is very similar to a where clause but also allows comparison of aggregate counts
- Order By allows you to order your data in descending or ascending order for a particular column you selected.
- The SQL and Result panel below the Schema Browser canvas is where the SQL is shown along with results of your query
- SQL Tab displays a dynamic view of the auto-generated SQL built from your selections (a sketch of such generated SQL follows this list)
- Result Tab displays the query result once the execute button has been clicked
- Queries can be saved at any time by clicking the Save Icon described in section 2
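As a rough illustration, a query built with a selected column, a predicate and a group by would auto-generate SQL along the following lines (the HR_SOURCE.employees table and its columns are hypothetical):
SELECT department, count(*) AS employee_count
FROM HR_SOURCE.employees
WHERE employee_num < 1000
GROUP BY department
ORDER BY employee_count DESC;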
Virtual Datamart¶
The Data Mart is a subset of the Virtual Data Warehouse and is usually oriented to a specific business line or team. Whereas the data warehouse may contain a wide set of data, the Virtual Data Mart can be quickly created for a single department or user requirement. Virtual Data Mart lets you create different data marts within a data warehouse by combining several tables, views etc.
- On the left hand side panel you can see all the available databases, data sources and the data marts already created.
- The middle panel is where you can drag different tables and views to create a data mart. You will be allowed to drag tables and views only after you click the “create” button on the right, which will ask you for a data mart name and description. To complete, click the create button.
Warehouse Builder¶
To allow end users to gain access to the data, a warehouse needs to be created and configured. The warehouse builder provides the flexibility to create the following types of warehouses:
- small
- medium, and
- large
Each of these warehouses is assigned different amounts of CPU and memory to enable queries to be executed. The type of warehouse used will depend upon the data to be processed and the execution response time required. In order to create a warehouse and allocate the necessary data, the Create icon should be used:
A name comprising alphanumeric characters only must be provided and the size of the Warehouse (small, medium or large) needs to be selected:
Users and roles must be associated with the warehouse:
S3 or Blob data (represented as databases) and data sources can be associated with the warehouse:
Data Marts can also be associated with the warehouse:
A review of the configuration of the warehouse can be undertaken, before the “Create” icon is selected:
Once the warehouse is created it will be placed into the PENDING state, before changing to a RUNNING state.
After being created the warehouse can be:
- Stopped,
- Resumed, or
- Deleted
If changes are required to the warehouse this can be done using the tabs at the bottom of the screen to Assign new or Revoke existing data, users or roles:
Once the warehouse is configured there are two methods of accessing the warehouse:
- Web GUI, or
- JDBC access from a third party business intelligence tool such as DBeaver, DBVisualiser, Tableau 2018+
Clicking on the icon next to the warehouse will open another window, enabling the user to login, if they have been provided with access to that warehouse.
To access using a third party requires:
- Downloading the Cloud Data Fabric JDBC driver and configuring the tool
- Copying the URL from the second icon next to the warehouse GUI
An example of the URL for this betatestmedium warehouse is: jdbc:zetaris:clouddatafabric@betatestmedium.6049b280cf1be6f1afc01bea29cacf61.datafabric.zetaris.com/RestClient=http
Topological View¶
The topological view shows the data that the Cloud Data Fabric has access to. This data can be queried by the customer administrator and allocated to a warehouse for end users to query. The below topological view indicates that the Cloud Data Fabric has access to Zetaris FusionDB, Postgres, Oracle, Teradata, SAFC and Cassandra.
User Management¶
The User Management tab allows the management of different users and roles.
Users and Roles can be added using the left hand side panel
The top right hand side panel allows the addition or removal of either:
- a user to/from a user
- a role to/from a role
The bottom right hand side panel allows the addition or removal of:
- data warehouses to/from the user or role
Frequently Asked Questions¶
Where is my data stored?¶
The only data stored in the Cloud Data Fabric is data that is cached by the user. The Cloud Data Fabric provides the ability to query the data whilst leaving the data in place. Only query-specific data is ever transferred to the Cloud Data Fabric to generate a result. Data that is stored in Azure or Amazon storage or databases is readily accessible. Any database that is accessible over the internet can also be accessed. Data can be cached in memory to improve performance if the response of the source database or storage is slow. Reference on how to cache and uncache data can be found in the Zetaris Cloud Data Fabric SQL Guide - ver 2.1.0.
My query stops running after 2 seconds¶
The Zetaris Cloud Data Fabric allows users to run a short query on data sources as a way of validating connectivity. Any SQL operation which is more involved needs to occur from within a Virtual Data Warehouse or Virtual DataMart - a construct that is specifically designed for complex or involved data operations.
Information on setting up your virtual data source can be found in Cloud Data Fabric Quick-Start Guide¶
Connecting to test data sources¶
To allow the testing of the Cloud Data Fabric a number of data sets have been created.
Three different databases in Azure and AWS, along with BLOB and s3 bucket files have been created.
Access to these requires the definition of the metadata within the cloud data fabric which is achieved by:
- Creating the data source, and Registering the data source for databases.
- Creating Databases for s3 and BLOB files
- Once the data sources have been defined the data can then be queried using the SQL editor.
The following commands should be executed within the SQL Editor on the Schema Store View page of the GUI to define and register the desired data sources;
AWS_RDSH Datasource:
CREATE DATASOURCE AWS_RDSH DESCRIBE BY "AWS_RDSH" OPTIONS (
jdbcdriver "com.amazon.redshift.jdbc.Driver",
jdbcurl "jdbc:postgresql://zetaris.cyzoanxzdpje.ap-southeast-2.redshift.amazonaws.com:5439/zetredshift",
username "aws_test_data" ,
password "Tcph_Data_1"
);
REGISTER DATASOURCE TABLES FROM AWS_RDSH;
AWS_ORCL Datasource:
CREATE DATASOURCE AWS_ORCL DESCRIBE BY "AWS_ORCL" OPTIONS (
jdbcdriver "com.amazon.redshift.jdbc.Driver",
jdbcurl "jdbc:postgresql://zetaris.cyzoanxzdpje.ap-southeast-2.redshift.amazonaws.com:5439/zetredshift",
username "aws_test_data" ,
password "Tcph_Data_1"
);
REGISTER DATASOURCE TABLES FROM AWS_ORCL;
AZURE_MSSQL Datasource:
CREATE DATASOURCE AZURE_MSSQL DESCRIBE BY "AZURE_MSSQL" OPTIONS (
jdbcdriver "com.microsoft.sqlserver.jdbc.SQLServerDriver",
jdbcurl "jdbc:sqlserver://microsoftsqlserver.database.windows.net:1433 ",
databaseName "DemoData",
username "adminnew" ,
password "WarehouseZet123",
schema "tpch1"
);
REGISTER DATASOURCE TABLES FROM AZURE_MSSQL;
AZURE_POSTGRES Datasource:
CREATE DATASOURCE AZURE_POSTGRES DESCRIBE BY "TPCH set in Azure PostgreSQL" OPTIONS (
jdbcdriver "org.postgresql.Driver",
jdbcurl "jdbc:postgresql://postgresql-testdata-server.postgres.database.azure.com:5432/testdb",
username "testdb_admin@postgresql-testdata-server" ,
password "aGVsbG90ZXN0ZGIK",
schema "tpch_db"
);
REGISTER DATASOURCE TABLES FROM AZURE_POSTGRES;
TPCH_AZBLB Databases:
CREATE LIGHTNING DATABASE TPCH_AZBLB DESCRIBE BY "TPCH Data set in Azure blob";
CREATE LIGHTNING FILESTORE TABLE customer FROM TPCH_AZBLB FORMAT CSV OPTIONS (
PATH "wasb://zettest-storage-container@zettesstorage.blob.core.windows.net/customer.csv",
inferSchema "true",
fs.azure.account.key.zettesstorage.blob.core.windows.net "bHLzau36KlZ6cYnSrvPzSJVniBDtu819nHTR/+hRyDZEVScQ3wuesst9P5/I7vqG+4czeimuHSrPe2ZtK+b+BQ=="
);
CREATE LIGHTNING FILESTORE TABLE line FROM TPCH_AZBLB FORMAT CSV OPTIONS (
PATH "wasb://zettest-storage-container@zettesstorage.blob.core.windows.net/line.csv",
inferSchema "true",
fs.azure.account.key.zettesstorage.blob.core.windows.net "bHLzau36KlZ6cYnSrvPzSJVniBDtu819nHTR/+hRyDZEVScQ3wuesst9P5/I7vqG+4czeimuHSrPe2ZtK+b+BQ=="
);
CREATE LIGHTNING FILESTORE TABLE nation FROM TPCH_AZBLB FORMAT CSV OPTIONS (
PATH "wasb://zettest-storage-container@zettesstorage.blob.core.windows.net/nation.csv",
inferSchema "true",
fs.azure.account.key.zettesstorage.blob.core.windows.net "bHLzau36KlZ6cYnSrvPzSJVniBDtu819nHTR/+hRyDZEVScQ3wuesst9P5/I7vqG+4czeimuHSrPe2ZtK+b+BQ=="
);
CREATE LIGHTNING FILESTORE TABLE orders FROM TPCH_AZBLB FORMAT CSV OPTIONS (
PATH "wasb://zettest-storage-container@zettesstorage.blob.core.windows.net/orders.csv",
inferSchema "true",
fs.azure.account.key.zettesstorage.blob.core.windows.net "bHLzau36KlZ6cYnSrvPzSJVniBDtu819nHTR/+hRyDZEVScQ3wuesst9P5/I7vqG+4czeimuHSrPe2ZtK+b+BQ=="
);
CREATE LIGHTNING FILESTORE TABLE partsupp FROM TPCH_AZBLB FORMAT CSV OPTIONS (
PATH "wasb://zettest-storage-container@zettesstorage.blob.core.windows.net/partsupp.csv",
inferSchema "true",
fs.azure.account.key.zettesstorage.blob.core.windows.net "bHLzau36KlZ6cYnSrvPzSJVniBDtu819nHTR/+hRyDZEVScQ3wuesst9P5/I7vqG+4czeimuHSrPe2ZtK+b+BQ=="
);
CREATE LIGHTNING FILESTORE TABLE part FROM TPCH_AZBLB FORMAT CSV OPTIONS (
PATH "wasb://zettest-storage-container@zettesstorage.blob.core.windows.net/part.csv",
inferSchema "true",
fs.azure.account.key.zettesstorage.blob.core.windows.net "bHLzau36KlZ6cYnSrvPzSJVniBDtu819nHTR/+hRyDZEVScQ3wuesst9P5/I7vqG+4czeimuHSrPe2ZtK+b+BQ=="
);
CREATE LIGHTNING FILESTORE TABLE region FROM TPCH_AZBLB FORMAT CSV OPTIONS (
PATH "wasb://zettest-storage-container@zettesstorage.blob.core.windows.net/region.csv",
inferSchema "true",
fs.azure.account.key.zettesstorage.blob.core.windows.net "bHLzau36KlZ6cYnSrvPzSJVniBDtu819nHTR/+hRyDZEVScQ3wuesst9P5/I7vqG+4czeimuHSrPe2ZtK+b+BQ=="
);
CREATE LIGHTNING FILESTORE TABLE supplier FROM TPCH_AZBLB FORMAT CSV OPTIONS (
PATH "wasb://zettest-storage-container@zettesstorage.blob.core.windows.net/supplier.csv",
inferSchema "true",
fs.azure.account.key.zettesstorage.blob.core.windows.net "bHLzau36KlZ6cYnSrvPzSJVniBDtu819nHTR/+hRyDZEVScQ3wuesst9P5/I7vqG+4czeimuHSrPe2ZtK+b+BQ=="
);
TPCH_S3 Databases:
CREATE LIGHTNING FILESTORE TABLE customer FROM TPCH_S3 FORMAT CSV OPTIONS (
PATH "s3n://zetaris-lightning-test/csv-data/tpc-h/customer.csv",
inferSchema "true",
AWSACCESSKEYID "AKIAITGIWHBIPE3NU5GA",
AWSSECRETACCESSKEY "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"
);
CREATE LIGHTNING FILESTORE TABLE line FROM TPCH_S3 FORMAT CSV OPTIONS (
PATH "s3n://zetaris-lightning-test/csv-data/tpc-h/line.csv",
inferSchema "true",
AWSACCESSKEYID "AKIAITGIWHBIPE3NU5GA",
AWSSECRETACCESSKEY "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"
);
CREATE LIGHTNING FILESTORE TABLE nation FROM TPCH_S3 FORMAT CSV OPTIONS (
PATH "s3n://zetaris-lightning-test/csv-data/tpc-h/nation.csv",
inferSchema "true",
AWSACCESSKEYID "AKIAITGIWHBIPE3NU5GA",
AWSSECRETACCESSKEY "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"
);
CREATE LIGHTNING FILESTORE TABLE orders FROM TPCH_S3 FORMAT CSV OPTIONS (
PATH "s3n://zetaris-lightning-test/csv-data/tpc-h/orders.csv",
inferSchema "true",
AWSACCESSKEYID "AKIAITGIWHBIPE3NU5GA",
AWSSECRETACCESSKEY "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"
);
CREATE LIGHTNING FILESTORE TABLE partsupp FROM TPCH_S3 FORMAT CSV OPTIONS (
PATH "s3n://zetaris-lightning-test/csv-data/tpc-h/partsupp.csv",
inferSchema "true",
AWSACCESSKEYID "AKIAITGIWHBIPE3NU5GA",
AWSSECRETACCESSKEY "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"
);
CREATE LIGHTNING FILESTORE TABLE part FROM TPCH_S3 FORMAT CSV OPTIONS (
PATH "s3n://zetaris-lightning-test/csv-data/tpc-h/part.csv",
inferSchema "true",
AWSACCESSKEYID "AKIAITGIWHBIPE3NU5GA",
AWSSECRETACCESSKEY "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"
);
CREATE LIGHTNING FILESTORE TABLE region FROM TPCH_S3 FORMAT CSV OPTIONS (
PATH "s3n://zetaris-lightning-test/csv-data/tpc-h/region.csv",
inferSchema "true",
AWSACCESSKEYID "AKIAITGIWHBIPE3NU5GA",
AWSSECRETACCESSKEY "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"
);
Can I increase the performance of my queries?¶
No data is stored within the Cloud Data Fabric; every query interrogates the source systems to generate a result. In some cases one or more source systems can have a very slow response time, causing the Cloud Data Fabric to wait for sufficient data to be received before other data source information can be referenced. For slow responding source systems it is useful to cache the data so that the response is much quicker: instead of referencing the table, view or file in the source systems, the cached copy is used. Reference on how to cache and uncache data can be found in the SQL Guide: Zetaris Cloud Data Fabric SQL Guide - ver 2.1.0.
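For example, the cache and uncache commands described in the SQL Guide can be run from the SQL Editor (the table name here is illustrative):
CACHE TABLE ORCL.movies;
UNCACHE TABLE ORCL.movies;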
Cloud Data Fabric GUI vs Connecting using JDBC¶
The Cloud Data Fabric GUI is accessed using a URL of the form: http://betatestmedium-ui.6049b280cf1be6f1afc01bea29cacf61.datafabric.zetaris.com/lightning-gui/login Only the username and password are required to gain access to the necessary warehouse.
Accessing the Cloud Data Fabric with JDBC requires the JDBC driver to be downloaded and configured within the tool being used. Instructions on downloading and configuring the driver can be found here
Once the driver is installed the URL must be provided along with the username and password. The URL is of the form: jdbc:zetaris:clouddatafabric@betatestsmall.6049b280cf1be6f1afc01bea29cacf61.datafabric.zetaris.com/RestClient=http
The system has stopped running, what do I do?¶
Send an email to support (support@zetaris.com) if any of the following scenarios occur:
- You are unable to get a login screen in your browser when you access the login URL http://ui.datafabric.zetaris.com/lightning-gui/login
- You are unable to connect to any of your virtual data sources through our tools that were previously available
- The User Interface seems hung due to a long running query; try stopping the query by clicking the stop button, and if this does not rectify the issue contact support
I’ve asked the chat-bot a question how long does it take to respond?¶
Typically Zetaris will respond to your enquiry within one hour
If the chat-bot is not working how do I contact support?¶
Zetaris provide a support email as a backup process to our chat-bot. Please direct any questions to support@zetaris.com and maybe let us know that we have a problem (smile)
I built a data fabric but I can’t see it, where do I find it?¶
After creating a new data source, it will appear in the Logical Data Sources panel of the Fabric Builder tab of the Fabric UI. If the data source which has been created is not visible, refresh the display by clicking the reload icon next to the Data Sources. Please refer to the Quick Start Guide for more detailed information.
How do I get back to my Data Warehouse once I have built it?¶
Once you have built a Data Warehouse in the Warehouse Builder tab of the Cloud Data Fabric UI, you can refresh the view by clicking the Data Warehouse Builder icon to see your newly created data warehouse running. Refer to Section 6 of the Cloud Data Fabric -Quick Start Guide for detailed information on connecting to your new data warehouse
How do I connect my tools to a Virtual Data Warehouse or Data Mart?¶
Business Intelligence tools that support JDBC can be connected to the Cloud Data Fabric. The following steps should be followed:
1. The JDBC driver must be downloaded from here
2. The BI tool must be configured to use the JDBC driver; full instructions are available here
Can I connect to my data without using the data fabric builder?¶
The unique features of the Cloud Data Fabric enable data within different databases or storage to be queried and accessed concurrently, either through a direct SQL query defined in the fabric or through the creation of a virtual Data Warehouse or Data Mart built to service the needs of the users. Existing tools are unable to query data from different systems in a single query; they first require the data to be collocated (through some transformation or copy process) in one environment or database to achieve the same functionality that the Cloud Data Fabric can provide very quickly and securely.
Is my data safe and secure?¶
The Zetaris Cloud Data Fabric does not hold any of your data but instead uses the metadata (information about the data) from connected data sources, ensuring that any security defined in the data sources is maintained. All data source metadata is kept in the Data Fabric Schema Store, which is highly secure in terms of both data at rest and data in motion. The Cloud Data Fabric uses the Schema Store to perform highly secure queries that are executed at the source, with only the query results returned to the Data Fabric. All queries executed from within the Cloud Data Fabric use connections that are secured by SSL and HTTPS protocols and benefit from this level of secure communications.
How do I secure my data in the fabric?¶
The following article explains how data is secured by the data fabric through the inherent lack of data movement associated with the technology
Is my data safe and secure?¶
Further role based security can be applied to every Virtual Data Warehouse or Data Mart that is created from the data sources defined in the Cloud Data Fabric, ensuring that a narrow band of access to any information is maintained.
What is the difference between beta and post beta?¶
With your feedback we expect that:
- Every feature will be more stable and more polished
- We will have more product features
- Our billing system will be up and running and we will switch to a production billing mode.
- If you have been a Beta tester there are significant discounts over the published retail price.
What will happen to my data once the beta finishes?¶
Zetaris will inform you as the Beta close date approaches and will provide you with some options for your continued use of the platform. If the data is important to you, Zetaris expects that you make arrangements to transfer this data. If you have been using test data that is not important, Zetaris expects that you delete this data prior to the deadline date. Should you choose to continue using our service, your data will be saved and available for you to access in your account.
What should I expect from any functions labelled as Beta?¶
There are a few functions within the Cloud Data Fabric labelled as Beta. This simply means they are our newer functionalities and haven’t been as rigorously tested. We expect all our functions to be working proficiently however we encourage our users to give us feedback should you find any problems.
When is the Cloud Data Fabric Beta platform available?¶
At present, access to the platform is between 9am and 6pm, Monday to Friday, excluding public holidays.
What are Push Type and Pull Type Data Sources?¶
A Push type data source is an implementation of the Cloud Data Fabric within an organisation's environment, typically behind its firewall. Event Hubs are employed to send data from within the organisation's firewall to the Cloud Data Fabric, so that it may be combined with the other data sources.
A Pull type data source defines the connection between the source storage or database and the Cloud Data Fabric, registering the metadata of the objects within that source. The source is either available within the cloud or a database open to the internet.
Can I work with other users?¶
Yes. You can set up users after you set up an account for them. As the administrator, you can create their role within your account and give them access to any virtual Warehouses and DataMarts. Refer to item 8 in the Quick Start Guide for more information on how to add users.
Do users I’ve added have to be a part of Beta?¶
Users that you add to your account do not need to be a part of the Beta program. They will however need a username and password in order to access the Zetaris Cloud Data Fabric which you will have to create for them as the administrator. Refer to item 8 in the Quick Start Guide for further information on how to add a user.
Do I need an Azure account?¶
You do not need an Azure account to use Zetaris Cloud Data Fabric. However you will need an account to store any data. You can create a free account here.
How do I create an Azure account?¶
You can create a free Azure account here
Can I change my password?¶
Administrators can change their password in the top right corner of any screen where their user ID is shown. Select from the drop-down menu “Change Password”.
Refer to section 2 of the Quick Start Guide for further information
What do I do if I forget my password?¶
If you forget your password send us an email at support@zetaris.com. We will send you an email with a password reset link. The link will be valid for 24 hours.
SQL Manual¶
This guide provides a reference for the SQL language together with a set of example use cases. For information on Spark SQL, see the Apache Spark “Spark SQL, DataFrames, and Datasets Guide”. For information on the Hive Language Manual, see the Hive Manual.
1. SELECT¶
SELECT [hints, ...] [* |DISTINCT] named_expression[, named_expression, ...]
FROM relation[, relation, ...]
[lateral_view[, lateral_view, ...]]
[WHERE boolean_expression]
[aggregation [HAVING boolean_expression]]
[WINDOW named_window[, WINDOW named_window, ...]]
[LIMIT num_rows]
named_expression:
: expression [AS alias]
relation:
| join_relation
| (table_name|query|relation) [sample] [AS alias]
: VALUES (expressions)[, (expressions), ...]
[AS (column_name[, column_name, ...])]
expressions:
: expression[, expression, ...]
Output data from one or more relations. A relation refers to any source of input data. It could be the contents of an existing table (or view), the joined result of two existing tables, or a subquery (the result of another SELECT statement).
1.1 * (ALL)¶
Select all matching rows from the relation. Enabled by default.
1.2 DISTINCT¶
Select all matching rows from the relation, then remove duplicate rows.
1.3 WHERE¶
Filter rows by predicate.
Like Operator: used to match text values against a pattern using wildcards. If the search expression can be matched to the pattern expression, the LIKE operator will return true, which is 1.
There are two wildcards used in conjunction with the LIKE operator −
- The percent sign (%)
- The underscore (_)
SELECT * FROM table_name
WHERE column LIKE 'XXXX%'
SELECT * FROM table_name
WHERE column LIKE '%XXXX%'
SELECT * FROM table_name
WHERE column LIKE 'XXXX_'
1.4 GROUP BY (Aggregate Functions)¶
It is used in collaboration with the SELECT statement to group together those rows in a table that have identical data. This is done to eliminate redundancy in the output and/or compute aggregates that apply to these groups. The GROUP BY clause follows the WHERE clause in a SELECT statement and precedes the ORDER BY clause.
SELECT column-list
FROM table_name
WHERE [ conditions ]
GROUP BY column1, column2....columnN
ORDER BY column1, column2....columnN
- COUNT(): Returns the number of rows containing non-NULL values in the specified field.
- SUM(): Returns the sum of the non-NULL values in the specified field.
- AVG(): Returns the average of the non-NULL values in the specified field.
- MAX(): Returns the maximum of the non-NULL values in the specified field.
- MIN(): Returns the minimum of the non-NULL values in the specified field.
1.5 HAVING¶
Filter grouped result by predicate.
1.6 ORDER BY¶
Impose total ordering on a set of expressions. Default sort direction is ascending.
1.7 WINDOW¶
Assign an identifier to a window specification. Refer to section 6, for more information on Window Function.
1.8 LIMIT¶
Limit the number of rows returned.
1.9 VALUES¶
Explicitly specify values instead of reading them from a relation.
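For example, a minimal inline relation (a sketch using the VALUES syntax described above):
SELECT * FROM VALUES (1, 'Melbourne'), (2, 'Sydney') AS cities(id, name);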
1.10 COMMON TABLE EXPRESSIONS (CTE)¶
A common table expression (CTE) is a temporary result set which you can reference within another SQL statement, including SELECT, INSERT, UPDATE or DELETE. Common Table Expressions are temporary in the sense that they only exist during the execution of the query. The following shows the syntax for creating a CTE:
WITH cte_name (column_list) AS (
CTE_query_definition
)
[SELECT STATEMENT]:[INSERT STATEMENT]:[UPDATE STATEMENT]:[DELETE STATEMENT];
For Example:
with cte as
(SELECT
*
from
ipl.individual)
SELECT
count(*)
from
cte
In this syntax
- First, specify the name of the CTE followed by an optional column list.
- Second, inside the body of WITH clause, specify a query that returns a result set. If you do not explicitly specify the column list after the CTE name, the select list of the cte query definition will become the column list of the CTE.
- Third, use the CTE like a table or view in the statement which can be a SELECT, INSERT, UPDATE OR DELETE
Common Table Expressions or CTEs are typically used to simplify complex joins and subqueries.
1.11 EXAMPLES:¶
SELECT * FROM ipl.iplmatch;
SELECT matchnumber,hometeam from ipl.iplmatch;
SELECT DISTINCT matchnumber,hometeam from ipl.iplmatch;
SELECT count(*) from ipl.iplmatch;
SELECT * FROM ipl.iplmatch where Location='Bangalore' and IndexNo=419112;
SELECT * FROM ipl.iplmatch LIMIT 10;
SELECT batsman_name as batsman ,max(batsman_runs) as max_runs FROM bbl.bbl01cs
group by 1
order by max_runs desc ;
SELECT batsman_name as batsman ,max(batsman_runs) as max_runs FROM bbl.bbl01cs
where batsman_name like '%Ma%'
group by 1
having max_runs >=4
order by max_runs desc;
2. SAMPLING¶
sample:
| TABLESAMPLE ((integer_expression | decimal_expression) PERCENT)
: TABLESAMPLE (integer_expression ROWS)
Sample the input data. This can be expressed in terms of either a percentage (must be between 0 and 100) or a fixed number of input rows.
2.1 EXAMPLES:¶
SELECT
*
FROM
ipl.individual
TABLESAMPLE (3 ROWS);
SELECT
*
FROM
ipl.individual
TABLESAMPLE (25 PERCENT);
3. JOIN¶
join_relation:
| relation join_type JOIN relation (ON boolean_expression | USING (column_name[, column_name, ...]))
: relation NATURAL join_type JOIN relation
join_type:
| INNER
| (LEFT|RIGHT) SEMI
| (LEFT|RIGHT|FULL) [OUTER]
: [LEFT] ANTI
3.1 INNER JOIN¶
Select all rows from both relations where there is a match.
3.2 OUTER JOIN¶
Select all rows from both relations, filling with null values on the side that does not have a match.
3.3 RIGHT JOIN¶
Select ALL rows from the RIGHT side and corresponding matching values from left.
3.4 LEFT JOIN¶
Select ALL rows from the left side and corresponding matching values from right.
3.5 EXAMPLES :¶
SELECT
a.game_id,
a.venue,
b.game_id,
b.venue
from
bbl.bbl01cs a
inner join
bbl.bbl02cs b
on
a.competition=b.competition;
SELECT
a.game_id,
a.venue,
b.game_id,
b.venue
from
bbl.bbl01cs a
left join
bbl.bbl02cs b
on a.venue=b.venue
SELECT
a.game_id as wbbl_game_id
, a.competition
, b.game_id
, b.competition
FROM
wbbl.wbbl01c a
inner join
bbl.bbl01cs b
on
a.game_id=b.game_id
group by 1,2,3,4
limit 10
4. LATERAL VIEW¶
lateral_view:
: LATERAL VIEW [OUTER] function_name (expressions)
table_name [AS (column_name[, column_name, ...])]
Generate zero or more output rows for each input row using a table-generating function. The most common built-in function used with LATERAL VIEW is explode.
4.1 LATERAL VIEW OUTER¶
Generate a row with null values even when the function returned zero rows.
4.2 EXAMPLES:¶
SELECT
*
FROM
bbl.bbl01cs
LATERAL VIEW explode(Array(1, 2, 3)) my_view
5. AGGREGATION¶
aggregation:
: GROUP BY expressions [(WITH ROLLUP | WITH CUBE | GROUPING SETS (expressions))]
Group by a set of expressions using one or more aggregate functions. Common built-in aggregate functions include count, avg, min, max, and sum.
5.1 ROLLUP¶
Create a grouping set at each hierarchical level of the specified expressions.
For instance,
GROUP BY a, b, c WITH ROLLUP is equivalent to GROUP BY a, b, c GROUPING SETS ((a, b, c), (a, b), (a), ()).
The total number of grouping sets will be N + 1, where N is the number of group expressions.
5.2 CUBE¶
Create a grouping set for each possible combination of set of the specified expressions.
For instance,
GROUP BY a, b, c WITH CUBE is equivalent to GROUP BY a, b, c GROUPING SETS ((a, b, c), (a, b), (b, c), (a, c), (a), (b), (c), ()).
The total number of grouping sets will be 2^N, where N is the number of group expressions.
5.3 GROUPING SETS¶
Perform a group by for each subset of the group expressions specified in the grouping sets.
For instance,
GROUP BY x, y GROUPING SETS (x, y) is equivalent to the result of GROUP BY x unioned with that of GROUP BY y.
5.4 EXAMPLES:¶
SELECT
venue,
COUNT(*) AS num_matches
FROM
wbbl.wbbl01c
GROUP BY venue
SELECT
venue,
AVG(batsman_runs) AS avg_runs
FROM
wbbl.wbbl01c
GROUP BY venue
SELECT
venue,
fixture,
batsman_name
FROM
wbbl.wbbl01c
GROUP BY venue, fixture, batsman_name WITH ROLLUP
SELECT
venue,
fixture,
AVG(batsman_runs)
FROM
wbbl.wbbl01c
GROUP BY venue, fixture GROUPING SETS (venue, fixture)
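A CUBE example in the same style, reusing the wbbl.wbbl01c columns from the queries above:
SELECT
venue,
fixture,
AVG(batsman_runs)
FROM
wbbl.wbbl01c
GROUP BY venue, fixture WITH CUBE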
6. WINDOW FUNCTION¶
window_expression:
: expression OVER window_spec
named_window:
: window_identifier AS window_spec
window_spec:
| window_identifier
: ((PARTITION|DISTRIBUTE) BY expressions
[(ORDER|SORT) BY sort_expressions] [window_frame])
window_frame:
| (RANGE|ROWS) frame_bound
: (RANGE|ROWS) BETWEEN frame_bound AND frame_bound
frame_bound:
| CURRENT ROW
| UNBOUNDED (PRECEDING|FOLLOWING)
: expression (PRECEDING|FOLLOWING)
Compute a result over a range of input rows. A windowed expression is specified using the OVER keyword, which is followed by either an identifier to the window (defined using the WINDOW keyword) or the specification of a window.
6.1 PARTITION BY¶
Specify which rows will be in the same partition, aliased by DISTRIBUTE BY.
6.2 ORDER BY¶
Specify how rows within a window partition are ordered, aliased by SORT BY.
6.3 RANGE BOUND¶
Express the size of the window in terms of a value range for the expression.
6.4 ROWS bound¶
Express the size of the window in terms of the number of rows before and/or after the current row.
6.5 CURRENT ROW¶
Use the current row as a bound.
6.6 UNBOUNDED¶
Use negative infinity as the lower bound or infinity as the upper bound.
6.7 PRECEDING¶
If used with a RANGE bound, this defines the lower bound of the value range. If used with a ROWS bound, this determines the number of rows before the current row to keep in the window.
6.8 FOLLOWING¶
If used with a RANGE bound, this defines the upper bound of the value range. If used with a ROWS bound, this determines the number of rows after the current row to keep in the window.
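For example, ranking batsmen by runs within each venue (a sketch reusing the bbl.bbl01cs columns from the earlier examples):
SELECT
venue,
batsman_name,
batsman_runs,
RANK() OVER (PARTITION BY venue ORDER BY batsman_runs DESC) AS runs_rank
FROM
bbl.bbl01cs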
7. HINTS¶
hints:
: /*+ hint[, hint, ...] */
hint:
: hintName [(expression[, expression, ...])]
Hints can be used to help execute a query better.
For example, you can hint that a table is small enough to be broadcast, which would speed up joins.
You add one or more hints to a SELECT statement inside /*+ … */ comment blocks.
Multiple hints can be specified inside the same comment block, in which case the hints are separated by commas, and there can be multiple such comment blocks. A hint has a name (for example, BROADCAST) and accepts 0 or more parameters.
7.1 EXAMPLES¶
SELECT /*+ BROADCAST(customers) */
a.*,
b.*
FROM
TPCH_AZBLB.customer a,
TPCH_AZBLB.orders b
WHERE a.c_custkey = b.o_custkey
SELECT /*+ SKEW('orders') */
a.*,
b.*
FROM
TPCH_AZBLB.customer a,
TPCH_AZBLB.orders b
WHERE a.c_custkey = b.o_custkey
8. DATA SOURCE VIEW¶
A view is a virtual table based on the result set of a SQL statement.
A view contains rows and columns, just like a real table.
The fields in a view are fields from one or more real tables in the database.
Using the Zetaris Data Fabric, tables/views from multiple data sources can be used to create a new view.
You can add SQL functions, WHERE, and JOIN statements to a view and present the data as if the data were coming from one single table.
CREATE VIEW [OR ALTER] schema_name.view_name [(column_list)]
AS
select_statement;
In this syntax:
- First, specify the name of the view after the CREATE VIEW keywords. The schema_name is the name of the schema to which the view belongs.
- Second, specify a SELECT statement (select_statement) that defines the view after the AS keyword. The SELECT statement can refer to one or more tables.
If you don’t explicitly specify a list of columns for the view, the column list derived from the SELECT statement will be used.
In case you want to redefine the view e.g., adding more columns to it or removing some columns from it, you can use the OR ALTER keywords after the CREATE VIEW keywords.
CREATE DATASOURCE VIEW cust_ordr_count AS
select
b.c_name,
count(distinct(a.o_orderkey))
FROM
TPCH_AZBLB.orders a
INNER JOIN
TPCH_AZBLB.customer b
on
a.o_custkey=b.c_custkey
group by 1
9. CAST OPERATOR¶
There are many cases where you want to convert a value of one data type into another; the CAST operator allows you to do this.
The following illustrates the syntax of type CAST:
CAST ( expression AS target_type );
In this syntax:
- First, specify an expression that can be a constant, a table column, an expression that evaluates to a value.
- Then, specify the target data type to which you want to convert the result of the expression.
9.1 EXAMPLES¶
SELECT
CAST ('100' AS INTEGER);
If the expression cannot be converted to the target type, an error will be raised. See the following example:
SELECT
CAST ('10C' AS INTEGER);
[Err] ERROR: invalid input syntax for integer:
SELECT
CAST ('2015-01-01' AS DATE),
CAST ('01-OCT-2015' AS DATE);
10. Other String Functions¶
10.1 CONCAT( string str1, string str2… )¶
The CONCAT function concatenates all the strings.
Example: CONCAT('hadoop','-','hive') returns 'hadoop-hive'
CONCAT_WS( string delimiter, string str1, string str2... )
The CONCAT_WS function is similar to the CONCAT function. Here you can also provide the delimiter, which can be used in between the strings to concat.
Example: CONCAT_WS('-','hadoop','hive') returns 'hadoop-hive'
10.2 FIND_IN_SET( string search_string, string source_string_list )¶
The FIND_IN_SET function searches for the search string in the source_string_list and returns the position of the first occurrence in the source string list. Here the source string list should be comma delimited one. It returns 0 if the first argument contains comma.
Example: FIND_IN_SET('ha','hao,mn,hc,ha,hef') returns 4
10.3 LENGTH( string str )¶
The LENGTH function returns the number of characters in a string.
Example: LENGTH('hive') returns 4
10.4 LOWER( string str ), LCASE( string str )¶
The LOWER or LCASE function converts the string into lower case letters.
Example: LOWER('HiVe') returns 'hive'
10.5 LPAD( string str, int len, string pad )¶
The LPAD function returns the string with a length of len characters left-padded with pad.
Example: LPAD('hive',6,'v') returns 'vvhive'
10.6 LTRIM( string str )¶
The LTRIM function removes all the leading spaces from the string.
Example: LTRIM(' hive') returns 'hive'
10.7 REPEAT( string str, int n )¶
The REPEAT function repeats the specified string n times.
Example: REPEAT('hive',2) returns 'hivehive'
10.8 RPAD( string str, int len, string pad )¶
The RPAD function returns the string with a length of len characters right-padded with pad.
Example: RPAD('hive',6,'v') returns 'hivevv'
10.9 REVERSE( string str )¶
The REVERSE function gives the reversed string
Example: REVERSE('hive') returns 'evih'
10.10 RTRIM( string str )¶
The RTRIM function removes all the trailing spaces from the string.
Example: RTRIM('hive ') returns 'hive'
10.11 SPACE( int number_of_spaces )¶
The SPACE function returns the specified number of spaces.
Example: SPACE(4) returns ' '
10.12 SPLIT( string str, string pat )¶
The SPLIT function splits the string around the pattern pat and returns an array of strings. You can specify regular expressions as patterns.
Example: SPLIT('hive:hadoop',':') returns ["hive","hadoop"]
10.13 SUBSTR( string source_str, int start_position [,int length] )¶
The SUBSTR or SUBSTRING function returns a part of the source string from the start position with the specified length of characters. If the length is not given, then it returns from the start position to the end of the string.
Example1: SUBSTR('hadoop',4) returns 'oop'
Example2: SUBSTR('hadoop',4,2) returns 'oo'
10.14 TRIM( string str )¶
The TRIM function removes both the trailing and leading spaces from the string.
Example: TRIM(' hive ') returns 'hive'
10.15 UPPER( string str ), UCASE( string str )¶
The UPPER or UCASE function converts the string into upper case letters.
Example: UPPER('HiVe') returns 'HIVE'
11. Other Useful Functions - Quick Reference¶
11.1 Mathematical Functions¶
The following built-in mathematical functions are supported in Lightning (names and behaviour follow the standard Hive built-in functions); most return NULL when the argument(s) are NULL:
- abs(a): Returns the absolute value. Returns double.
- acos(a): Returns the arc cosine of a if -1<=a<=1 or null otherwise. Returns double.
- asin(a): Returns the arc sine of a if -1<=a<=1 or null otherwise. Returns double.
- atan(a): Returns the arctangent of a. Returns double.
- bin(a): Returns the number in binary format. Returns string.
- ceil(a): Returns the minimum BIGINT value that is equal to or greater than the double. Returns bigint.
- conv(num, from_base, to_base): Converts a number from a given base to another. Returns string.
- cos(a): Returns the cosine of a (a is in radians). Returns double.
- degrees(a): Converts the value of a from radians to degrees. Returns double.
- exp(a): Returns e^a, where e is the base of the natural logarithm. Returns double.
- floor(a): Returns the maximum BIGINT value that is equal to or less than the double. Returns bigint.
- hex(a): If the argument is an int, returns the number as a string in hex format; if the argument is a string, converts each character into its hex representation and returns the resulting string. Returns string.
- ln(a): Returns the natural logarithm of the argument. Returns double.
- log(base, a): Returns the base "base" logarithm of the argument. Returns double.
- log10(a): Returns the base-10 logarithm of the argument. Returns double.
- log2(a): Returns the base-2 logarithm of the argument. Returns double.
- negative(a): Returns -a. Returns int or double.
- pmod(a, b): Returns the positive value of a mod b. Returns integer.
- positive(a): Returns a. Returns int or double.
- pow(a, p): Returns a^p. Returns double.
- radians(a): Converts the value of a from degrees to radians. Returns double.
- rand(), rand(seed): Returns a random number (that changes from row to row) that is distributed uniformly from 0 to 1. Specifying the seed will make the generated random number sequence deterministic. Returns double.
- round(a): Returns the rounded BIGINT value of the double. Returns bigint.
- round(a, d): Returns the double rounded to d decimal places. Returns double.
- sign(a): Returns the sign of a as '1.0' or '-1.0'. Returns float.
- sin(a): Returns the sine of a (a is in radians). Returns double.
- sqrt(a): Returns the square root of a. Returns double.
- tan(a): Returns the tangent of a (a is in radians). Returns double.
- unhex(a): Inverse of hex; interprets each pair of characters as a hexadecimal number and converts to the character represented by the number. Returns string.
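For example, a few of these functions can be called directly from the SQL Editor with literal arguments (no FROM clause is needed):
SELECT abs(-10), ceil(4.2), floor(4.8), round(3.14159, 2), pow(2, 10), sqrt(144);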
11.2 The following are built-in String functions¶
The names and behaviour again follow the standard Hive built-in string functions:
- ascii(str): Returns the numeric value of the first character of str. Returns int.
- base64(bin): Converts the argument from binary to a base-64 string (as of Hive 0.12.0). Returns string.
- chr(A): Returns the ASCII character having the binary equivalent to A (as of Hive 1.3.0 and 2.1.0). If A is larger than 256 the result is equivalent to chr(A % 256). Example: select chr(88); returns "X". Returns string.
- concat(str1, str2, ...): Returns the string or bytes resulting from concatenating the strings or bytes passed in as parameters in order, e.g. concat('foo', 'bar') results in 'foobar'. This function can take any number of input strings. Returns string.
- context_ngrams(...): Returns the top-k contextual N-grams from a set of tokenized sentences, given a string of "context". See StatisticsAndDataMining for more information. Returns array<struct<string,double>>.
- concat_ws(SEP, str1, str2, ...): Like concat() above, but with custom separator SEP. Returns string.
- concat_ws(SEP, array<string>): Like concat_ws() above, but taking an array of strings (as of Hive 0.9.0). Returns string.
- decode(binary, charset): Decodes the first argument into a string using the provided character set (one of 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16'). If either argument is null, the result will also be null (as of Hive 0.12.0). Returns string.
- elt(N, str1, str2, ...): Returns the string at index number N, for example elt(2, 'hello', 'world') returns 'world'. Returns NULL if N is less than 1 or greater than the number of arguments (see https://dev.mysql.com/doc/refman/5.7/en/string-functions.html#function_elt). Returns string.
- encode(str, charset): Encodes the first argument into a BINARY using the provided character set (one of 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16'). If either argument is null, the result will also be null (as of Hive 0.12.0). Returns binary.
- field(val, val1, val2, ...): Returns the index of val in the val1, val2, val3, ... list, or 0 if not found. For example field('world', 'say', 'hello', 'world') returns 3. All primitive types are supported; arguments are compared using str.equals(x). If val is NULL, the return value is 0 (see https://dev.mysql.com/doc/refman/5.7/en/string-functions.html#function_field). Returns int.
- find_in_set(str, strList): Returns the first occurrence of str in strList, where strList is a comma-delimited string. Returns null if either argument is null. Returns 0 if the first argument contains any commas, e.g. find_in_set('ab', 'abc,b,ab,c,def') returns 3. Returns int.
- format_number(X, D): Formats the number X to a format like '#,###,###.##', rounded to D decimal places, and returns the result as a string. If D is 0, the result has no decimal point or fractional part (as of Hive 0.10.0). Returns string.
- get_json_object(json_string, path): Extracts a json object from a json string based on the json path specified, and returns the json string of the extracted json object. It will return null if the input json string is invalid. NOTE: the json path can only have the characters [0-9a-z], i.e. no upper-case or special characters, and the keys cannot start with numbers. This is due to restrictions on Hive column names. Returns string.
- in_file(str, filename): Returns true if the string str appears as an entire line in filename. Returns boolean.
- initcap(str): Returns the string with the first letter of each word in uppercase and all other letters in lowercase; words are delimited by whitespace (as of Hive 1.1.0). Returns string.
- instr(str, substr): Returns the position of the first occurrence of substr in str. Returns int.
- length(str): Returns the length of the string. Returns int.
- levenshtein(str1, str2): Returns the Levenshtein distance between two strings (as of Hive 1.2.0). For example, levenshtein('kitten', 'sitting') results in 3. Returns int.
- locate(substr, str, pos): Returns the position of the first occurrence of substr in str after position pos. Returns int.
- lower(str): Returns the string resulting from converting all characters to lower case. For example, lower('fOoBaR') results in 'foobar'. Returns string.
- lpad(str, len, pad): Returns str, left-padded with pad to a length of len. If str is longer than len, the return value is shortened to len characters. In case of an empty pad string, the return value is null. Returns string.
- ltrim(str): Returns the string resulting from trimming spaces from the beginning (left hand side) of the string, e.g. ltrim(' foobar ') results in 'foobar '. Returns string.
- ngrams(...): Returns the top-k N-grams from a set of tokenized sentences, such as those returned by the sentences() UDAF. See StatisticsAndDataMining for more information. Returns array<struct<string,double>>.
- parse_url(url, partToExtract [, key]): Returns the specified part from the URL. Valid values for partToExtract include HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE and USERINFO, e.g. parse_url('https://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1', 'HOST') returns 'facebook.com'. The value of a particular key in QUERY can also be extracted by providing the key as the third argument, e.g. parse_url('https://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1', 'QUERY', 'k1') returns 'v1'. Returns string.
- printf(fmt, ...): Returns the input formatted according to printf-style format strings (as of Hive 0.9.0). Returns string.
- regexp_extract(str, pattern, index): Returns the string extracted using the pattern, e.g. regexp_extract('foothebar', 'foo(.*?)(bar)', 2) returns 'bar'. Note that some care is necessary in using predefined character classes: using '\s' as the second argument will match the letter s; '\\s' is necessary to match whitespace, etc. The 'index' parameter is the Java regex Matcher group() method index. Returns string.
- regexp_replace(initial_string, pattern, replacement): Returns the string resulting from replacing all substrings in INITIAL_STRING that match the Java regular expression syntax defined in PATTERN with instances of REPLACEMENT, e.g. regexp_replace("foobar", "oo|ar", "") returns 'fb'. The same care with predefined character classes applies as above. Returns string.
- repeat(str, n): Repeats str n times. Returns string.
- replace(A, OLD, NEW): Returns the string A with all non-overlapping occurrences of OLD replaced with NEW (as of Hive 1.3.0 and 2.1.0). Example: select replace("ababab", "abab", "Z"); returns "Zab". Returns string.
- reverse(str): Returns the reversed string. Returns string.
- rpad(str, len, pad): Returns str, right-padded with pad to a length of len. If str is longer than len, the return value is shortened to len characters. In case of an empty pad string, the return value is null. Returns string.
- rtrim(str): Returns the string resulting from trimming spaces from the end (right hand side) of the string, e.g. rtrim(' foobar ') results in ' foobar'. Returns string.
- sentences(str [, lang, locale]): Tokenizes a string of natural language text into words and sentences, where each sentence is broken at the appropriate sentence boundary and returned as an array of words. The 'lang' and 'locale' arguments are optional, e.g. sentences('Hello there! How are you?') returns ( ("Hello", "there"), ("How", "are", "you") ). Returns array<array<string>>.
- soundex(str): Returns the soundex code of the string (as of Hive 1.2.0). For example, soundex('Miller') results in M460. Returns string.
- space(n): Returns a string of n spaces. Returns string.
- split(str, pat): Splits str around pat (pat is a regular expression). Returns array<string>.
- str_to_map(text [, delimiter1, delimiter2]): Splits text into key-value pairs using two delimiters. Delimiter1 separates the text into K-V pairs, and delimiter2 splits each K-V pair. Default delimiters are ',' for delimiter1 and '=' for delimiter2. Returns map<string,string>.
- substr(A, start): Returns the substring or slice of the byte array of A starting from the start position until the end of string A, e.g. substr('foobar', 4) results in 'bar'. Returns string.
- substr(A, start, len): Returns the substring or slice of the byte array of A starting from the start position with length len, e.g. substr('foobar', 4, 1) results in 'b'. Returns string.
- translate(input, from, to): Translates the input string by replacing the characters present in the from string with the corresponding characters in the to string. This is similar to the translate function in PostgreSQL. If any of the parameters to this UDF are NULL, the result is NULL as well (available as of Hive 0.10.0; char/varchar support added as of Hive 0.14.0). Returns string.
- trim(A): Returns the string resulting from trimming spaces from both ends of A, e.g. trim(' foobar ') results in 'foobar'. Returns string.
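Similarly, the string functions can be tried with literal arguments, for example:
SELECT concat_ws('-', 'data', 'fabric'), lpad('hive', 6, 'v'), initcap('zetaris lightning'), trim('  hive  ');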
12 Register Logical Datasources¶
To build a virtual data lake, a user needs to identify the databases to be connected. Lightning supports all JDBC-compliant databases, all well-known NoSQL stores, RESTful APIs, and CSV and JSON files.
Register Clustered Databases¶
A user needs to provide the JDBC driver class, URL and connectivity credentials, including any extra parameters the database needs.
Syntax:
CREATE DATASOURCE ORCL [DESCRIBE BY "Oracle for Product Master"] OPTIONS (
jdbcdriver "com.oracle.OracleDriver",
jdbcurl "jdbc:oracle:thin@oracle-master:1521:orcl",
username "scott",
password "tiger",
[key "value"]*)
Real-world example,
showing a connection to Oracle on port 1521 of localhost with a database name of orcl, user name scott and password tiger:
CREATE DATASOURCE ORACLE DESCRIBE BY "Oracle for Product Master" OPTIONS (
jdbcdriver "com.oracle.OracleDriver",
jdbcurl "jdbc:oracle:thin@localhost:1521:orcl",
username "scott",
password "tiger")
If the registered database supports cluster-based computing such as MPP, then a user can register slave nodes so that Lightning can query the slave nodes directly rather than running everything through the master node.
ADD SLAVE DATASOURCE TO ORCL OPTIONS (
jdbcdriver "com.oracle.OracleDriver",
jdbcurl "jdbc:oracle:thin@oracle-slave1;1521:orcl",
username "scott",
password "tiger",
[key "value"])*
ADD SLAVE DATASOURCE TO ORCL OPTIONS (
jdbcdriver "com.oracle.OracleDriver",
jdbcurl "jdbc:oracle:thin@oracle-slave2;1521:orcl",
username "scott",
password "tiger",
[key "value"])*
ADD SLAVE DATASOURCE TO ORCL OPTIONS (
jdbcdriver "com.oracle.OracleDriver",
jdbcurl "jdbc:oracle:thin@oracle-slave3;1521:orcl",
username "scott",
password "tiger",
[key "value"])*
The above example shows registering 1 master node (oracle-master) and 3 slave nodes (oracle-slave1, oracle-slave2, and oracle-slave3).
- ORCL is an alias for the target data source, decided by the user when creating the data source.
- If schema is provided, Lightning will only ingest metadata from that schema.
- If schema_prepended_table is set to true, the schema will be prepended to the table name, as tables may use the same name across different schemas. For example, the role table in the zetaris_bi schema will be named zetaris_bi__role.
Register RDBMS Datasource¶
CREATE DATASOURCE MSSQL DESCRIBE BY "MSSQL-2017-linux " OPTIONS (
jdbcdriver "com.microsoft.sqlserver.jdbc.SQLServerDriver",
jdbcurl "jdbc:sqlserver://localhost:1433 ",
databaseName "DemoData",
username "scott" ,
password "tiger",
schema "dbo"
)
CREATE DATASOURCE MY_SQL DESCRIBE BY "MySQL " OPTIONS (
jdbcdriver "com.mysql.jdbc.Driver",
jdbcurl "jdbc:mysql://127.0.0.1/test_db",
username "scott" ,
password "tiger
)
CREATE DATASOURCE DB2_DB2INST1 DESCRIBE BY "DB2 Sample DB Schema " OPTIONS (
jdbcdriver "com.ibm.db2.jcc.DB2Driver",
jdbcurl "jdbc:db2://127.0.0.1:50000/db_name",
username "db2inst1" ,
password "db2inst1-pwd",
schema "DB2INST1",
schema_prepended_table "true"
)
CREATE DATASOURCE GREEN_PLUM DESCRIBE BY "GREEN_PLUM " OPTIONS (
jdbcdriver "org.postgresql.Driver",
jdbcurl "jdbc:postgresql://localhost:5432/postgres",
username "gpadmin" ,
password "pivotal",
schema "public"
)
CREATE DATASOURCE TERA_DATA DESCRIBE BY "TERA_DATA " OPTIONS (
jdbcdriver "com.teradata.jdbc.TeraDriver",
jdbcurl "jdbc:teradata://10.128.87.16/DBS_PORT=1025",
username "dbc" ,
password "dbc",
schema "dbcmngr"
)
CREATE DATASOURCE AWS_AURORA DESCRIBE BY "AWS_AURORA " OPTIONS (
jdbcdriver "com.mysql.jdbc.Driver",
jdbcurl "jdbc:mysql://zet-aurora-cluster.cluster-ckh4ncwbhsty.ap-southeast-2.rds.amazonaws.com/your_db?",
username "your_db_account_name" ,
password "your_db_account_password""
)
CREATE DATASOURCE REDSHIFT DESCRIBE BY "AWS RedShift" OPTIONS (
jdbcdriver "com.amazon.redshift.jdbc.Driver",
jdbcurl "jdbc:redshift://zetaris.cyzoanxzdpje.ap-southeast-2.redshift.amazonaws.com:5439/your_db_name",
username "your_db_account_name",
password "your_db_account_password"
)
Register NOSQL Datasource¶
For MongoDB, the below 5 parameters (host, port, database name, user name and password) must be provided.
CREATE DATASOURCE MONGO DESCRIBE BY "MongoDB" OPTIONS (
lightning.datasource.mongodb.host "localhost",
lightning.datasource.mongodb.port "27017",
lightning.datasource.mongodb.database "lightning-demo",
lightning.datasource.mongodb.username "",
lightning.datasource.mongodb.password ""
)
For Cassandra, there is only one parameter for Lightning, which is the key space for the connection. The other parameters start with the “spark.cassandra” prefix and are passed through to the Spark Cassandra connector (https://github.com/datastax/spark-cassandra-connector).
CREATE DATASOURCE CSNDR DESCRIBE BY "Cassandra" OPTIONS (
spark.cassandra.connection.host "localhost",
spark.cassandra.connection.port "9042",
spark.cassandra.auth.username "cassandra",
spark.cassandra.auth.password "cassandra",
lightning.datasource.cassandra.keyspace "lightning_demo"
)
CREATE DATASOURCE AWS_DYNAMODB DESCRIBE BY "AWS DynamoDB" OPTIONS (
accessKeyId "Your_aws_accessKeyId",
secretKey "Your_aws_SecretAccessKey" ,
region "ap-southeast-2"
)
13 Create Physical Datasources¶
These are datasources based on physical files residing on an on-premise filestore (local filesystem, NFS filesystem) or cloud filestores like S3 and Azure Blob. One needs to first create a Lightning database and then register the respective files under this namespace.
CREATE LIGHTNING DATABASE AWS_S3 DESCRIBE BY "AWS S3 bucket" OPTIONS (
[key "value"]
)
Ingest RESTful Service¶
For a RESTful service that returns JSON format, a user needs to provide the end point, HTTP method and encoding type as well as the schema:
REGISTER REST DATASOURCE TABLE SAFC_USERS FROM SAFC SCHEMA (
uid Long,
gender String,
age Integer ,
job String,
ts String)
OPTIONS (
endpoint "/example/users",
method "GET",
requesttype "URLENCODED"
);
Other parameters for the API call, such as a security key, can be provided in the OPTIONS field.
Ingest file from S3¶
CREATE LIGHTNING FILESTORE TABLE customer FROM TPCH_S3 FORMAT CSV(JSON) OPTIONS (
PATH "s3n://zetaris-lightning-test/csv-data/tpc-h/customer.csv",
inferSchema "true",
AWSACCESSKEYID "AKIAITGIWHBIPE3NU5GA",
AWSSECRETACCESSKEY "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"
)
Ingest file from Azure Blob¶
CREATE LIGHTNING FILESTORE TABLE customer FROM TPCH_AZBLB FORMAT CSV(JSON) OPTIONS (
PATH "wasb://zettest-storage-container@zettesstorage.blob.core.windows.net/customer.csv",
inferSchema "true",
fs.azure.account.key.zettesstorage.blob.core.windows.net "bHLzau36KlZ6cYnSrvPzSJVniBDtu819nHTR/+hRyDZEVScQ3wuesst9P5/I7vqG+4czeimuHSrPe2ZtK+b+BQ=="
)
The key name for the security key depends on the Azure Blob container; refer to the Azure Blob service documentation.
14 Ingest Metadata¶
Once a data source is registered in Lightning, it will ingest all table, column and constraint metadata.
Ingest all tables from the data source¶
REGISTER DATASOURCE TABLES FROM ORCL
This command will connect to the ORCL database and ingest all metadata (tables, columns, foreign keys, indexes and all other constraints) into the Schema Store.
Ingest a table from the data source¶
REGISTER DATASOURCE TABLE "USER" [USER_ALIAS] FROM ORCL
This will register the “USER” table, as USER_ALIAS if an alias is provided.
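For example, registering the USER table under an illustrative alias:
REGISTER DATASOURCE TABLE "USER" ORCL_USERS FROM ORCL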
Update Schema¶
When changes are made to the target data source, a user can reflect them using the update schema command:
UPDATE DATASOURCE SCHEMA ORCL
Update description and materialised table for each relation in a data source¶
15 Manage Schema Store¶
Lightning provides various commands to manage metadata in the schema store. These will also be provided via a RESTful service.
Datasource¶
This command shows the data sources registered in the schema store:
SHOW DATASOURCES
This command drops the registered data source as well as all tables under it:
DROP DATASOURCE ORCL
DESCRIBE DATASOURCE ORCL
DESCRIBE SLAVE DATASOURCE ORCL
Table¶
DESC ORCL.USERS
SHOW TABLES
SHOW DATASOURCE TABLES ORCL
DROP TABLE ORCL.USERS
View¶
Lightning supports the view capability with a query definition on a single data source or across multiple data sources.
CREATE DATASOURCE VIEW TEEN_AGER FROM ORCL AS
SELECT * FROM USERS WHERE AGE >= 13 AND AGE < 20
The TEEN_AGER view belongs to the ORCL data source. With this capability a user can create a view with a DBMS-native query, which is handy:
CREATE DATASOURCE VIEW SALARY_RANK FROM ORCL AS
SELECT department_id, last_name, salary, RANK() OVER (PARTITION BY department_id ORDER BY salary) RANK
FROM employees
WHERE department_id = 60
ORDER BY RANK, last_name
SELECT * FROM ORCL.SALARY_RANK will produce:
DEPARTMENT_ID LAST_NAME SALARY RANK
------------- ------------------------- ---------- ----------
60 Lorentz 4200 1
60 Austin 4800 2
60 Pataballa 4800 2
60 Ernst 6000 4
60 Hunold 9000 5
These views can also be joined with other tables in other data sources, and a view can itself span different data sources:
CREATE DATASOURCE VIEW TOP10_MOVIES_FOR_TEENS AS
SELECT movies_from_oracle.title, user_rating.count, user_rating.min, user_rating.max, user_rating.avg
FROM (
SELECT iid, count(*) count, min(pref) min, max(pref) max, avg(pref) avg
FROM TRDT.ratings ratings_from_teradata, PGRS.users users_from_postgres
WHERE users_from_postgres.age >= 13 AND users_from_postgres.age < 20
AND ratings_from_teradata.uid = users_from_postgres.uid
GROUP BY ratings_from_teradata.iid
ORDER BY avg DESC
LIMIT 20
) AS user_rating, ORCL.movies movies_from_oracle
WHERE movies_from_oracle.iid = user_rating.iid
This view can be queried like a normal table:
SELECT * FROM TOP10_MOVIES_FOR_TEENS
DROP VIEW ORCL.TEEN_AGER;
16 Run Query¶
Lightning supports SQL:2003 and can run all 99 TPC-DS queries. As long as a data source is registered in the schema store, a query can span any of the registered data sources.
For example, the following query runs a join across three different data sources (Teradata ↔ Oracle ↔ Cassandra):
SELECT users_from_cassandra.age, users_from_cassandra.gender, movies_from_oracle.title title, ratings_from_teradata.pref, ratings_from_teradata.ts
FROM TRDT.ratings ratings_from_teradata, ORCL.movies movies_from_oracle, CSNDR.users users_from_cassandra
WHERE users_from_cassandra.gender = 'F'
AND ratings_from_teradata.uid = users_from_cassandra.uid
AND movies_from_oracle.iid = ratings_from_teradata.iid
17 Materialization and Cache¶
For reasons such as query performance, any data source table or view can be materialized by leveraging Zetaris Fusion DB. Lightning also supports a cache capability where a user can load all data into main memory.
Materialization¶
For example, the following query materializes all data from the RESTful service into the USERS_FOR_COPY table in Fusion DB:
INSERT INTO FUSIONDB.USERS_FOR_COPY
SELECT uid, gender, age, job , ts FROM SAFC.SAFC_USERS
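Once populated, the materialised copy can be queried like any other table; a minimal sketch, assuming the Fusion DB table is addressed with the FUSIONDB prefix as above:
SELECT gender, count(*) FROM FUSIONDB.USERS_FOR_COPY GROUP BY gender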
Cache/Uncache¶
A user can load all data into main memory by leveraging the cache capability, and can uncache it at any time.
CACHE TABLE pref;
CACHE TABLE ORCL.movies;
The pref and ORCL.movies tables are now cached, and the following query performs much better:
SELECT movies_from_oracle.title, hdfs_pref.count, hdfs_pref.min, hdfs_pref.max, hdfs_pref.avg
FROM (
SELECT iid, count(*) count, min(pref) min, max(pref) max, avg(pref) avg
FROM pref
GROUP BY iid
) AS hdfs_pref, ORCL.movies movies_from_oracle
WHERE movies_from_oracle.iid = hdfs_pref.iid
These tables can be uncached at any time:
UNCACHE TABLE pref
UNCACHE TABLE ORCL.movies
18 Statistics¶
Lightning comes with a CBO (Cost Based Optimizer) to reduce data shuffling across the cluster. To do this, Lightning keeps statistics for each data source. There are two types of statistics: table level statistics and column level statistics.
Table level statistics¶
ANALYZE DATASOURCE TABLE ORCL.MOVIES
This command generates statistics such as size in bytes and cardinality for the table, which can be browsed with the following command:
SHOW DATASOURCE TABLE STATISTICS ORCL.MOVIES
Column level statistics¶
ANALYZE DATASOURCE TABLE ORCL.MOVIES COMPUTE STATISTICS FOR COLUMNS (IID, TITLE)
This command generates statistics such as cardinality, number of nulls, min, max and average value, which can be browsed with the following command:
SHOW DATASOURCE COLUMN STATISTICS ORCL.MOVIES;
19 Partitioning¶
Query performance can be improved by partitioning a table. Partitioning here means that all records are split into multiple partitions which are processed independently on each worker node.
CREATE DATASOURCE PARTITION ON ORCL.USERS OPTIONS (
COLUMN "UID",
COUNT "2",
LOWERBOUND "1",
UPPERBOUND "6040")
This command makes two partitions based on the "UID" column; LOWERBOUND and UPPERBOUND provide the boundary values for the partitioning column. The partitioning can be removed with:
DROP DATASOURCE PARTITION ON ORCL.USERS
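A note on the boundary values: assuming Lightning splits the column range the same way Spark's JDBC partitioning does (Lightning runs on Spark), COUNT "2" with LOWERBOUND "1" and UPPERBOUND "6040" would yield one partition reading rows with UID below roughly 3020 and a second partition reading the rest, with any rows outside the bounds falling into the edge partitions.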
20 Import CSV file¶
Lightning supports importing CSV files and running queries on top of them. A CSV file can be imported by either of the following:
Hive syntax¶
CREATE EXTERNAL TABLE pref (uid INT, iid INT, pref FLOAT, ts STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'csv directory'
With this syntax, a user can import file formats other than CSV (TSV, for example), but it has the following limitations:
- Headers are not supported; the CSV header row must be removed.
- LOCATION must be a directory; create a directory and place the CSV file in it.
- For a directory in an S3 bucket, the location looks like “s3n://mys3bucket/perf/” (see the sketch after this list).
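For instance, a sketch of the Hive syntax reading such an S3 directory (the bucket name is the illustrative one from the list above):
CREATE EXTERNAL TABLE pref (uid INT, iid INT, pref FLOAT, ts STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3n://mys3bucket/perf/'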
OR
Lightning Syntax¶
CREATE TABLE pref
USING com.databricks.spark.csv
OPTIONS (path "file path", header "true", inferSchema "true")
With this syntax, a user can:
- infer the schema
- keep the header row
- point at a single file
A file path in Amazon S3 looks like s3n://mys3bucket/perf/pref.csv.
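Putting that together, a minimal sketch of the Lightning syntax reading that S3 object directly (bucket and file names are the illustrative ones above; S3 credentials are assumed to be configured for the cluster):
CREATE TABLE pref
USING com.databricks.spark.csv
OPTIONS (path "s3n://mys3bucket/perf/pref.csv", header "true", inferSchema "true")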
21 Access Control¶
User Management¶
Only the admin user or users in the “admin” role can add/drop users.
- level: admin | general
- the password does not allow white spaces
ADD USER WITH (
email 'someone@zetaris.com',
name 'someone',
level 'general',
password '1234567'
)
UPDATE USER user_id SET PASSWORD 'new_password'
DESCRIBE USER user_id
DROP USER user_id
SHOW USERS
Role Based Access (RBA) Control¶
Lightning provides Role Based Access, which limits a user's access to a specific set of data. It is applied at the data source level or at the table level within each data source. Only admin users or equivalents can run these commands.
Privileges
- SELECT privilege - gives read access to the data source or relation
- INSERT privilege - gives insert access to the data source or relation
- CACHE privilege - gives cache access to a relation: (UN)CACHE DATASOURCE TABLE
Predefined Roles
Role names are case insensitive. The predefined roles are: admin, none, all, default.
CREATE ROLE role_name [DESCRIBE BY "this is blah~~~"]
DROP ROLE role_name;
SHOW ROLES
ASSIGN USER user_name [, user_name] ... TO ROLE role_name
REVOKE USER user_name [, user_name] ... FROM ROLE role_name
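For example, with an illustrative role name (analyst) and the user created earlier (whether user_name is the name or the email given in ADD USER depends on the deployment; the name is used here):
CREATE ROLE analyst DESCRIBE BY "users with read-only analytical access"
ASSIGN USER someone TO ROLE analyst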
A user granted a privilege WITH GRANT OPTION can grant the same privilege on the table.
GRANT SELECT | INSERT | CACHE ON table_or_view_name TO principal_spec [, principal_spec] ... [WITH GRANT OPTION]
REVOKE SELECT | INSERT | CACHE ON table_or_view_name FROM principal_spec [, principal_spec] ...
- Wildcards can only be used in the table field; for example, ORCL.* is allowed but *.* or *.movies are not.
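A concrete sketch, using the table and principal names that appear in the sample output below:
GRANT SELECT ON ORCL.movies TO USER ashutosh WITH GRANT OPTION
REVOKE SELECT ON ORCL.movies FROM USER ashutosh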
SHOW GRANT [principal_specification] ON (ALL | [TABLE] table_or_view_name)
principal_specification: USER user | ROLE role
It will display:
table_identifier | principal_name | principal_type | privilege | grant_option | grant_time              | grantor
-----------------+----------------+----------------+-----------+--------------+-------------------------+--------
ORCL.movies      | ashutosh       | USER           | DELETE    | false        | 2018-05-07 11:44:12.301 | thejas
ORCL.movies      | ashutosh       | USER           | INSERT    | false        | 2018-05-07 11:44:12.301 | thejas
ORCL.ratings     | ashutosh       | ROLE           | SELECT    | false        | 2018-05-07 11:44:12.301 | thejas