Hive Crash Course


Data Units

  • Database: a namespace that groups tables.
  • Table: a homogeneous set of rows that share the same schema.
  • Partition: a slice of a table, determined by the values of one or more partition columns and stored as its own directory.
  • Bucket: a further subdivision of a partition (or table), produced by hashing a column into a fixed number of files.
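
A minimal sketch of how these units nest (the database, table, and column names below are made up for illustration): a database holds tables, a table can be split into partitions by the value of a partition column, and each partition can be bucketed by hashing a column.

-- Create a database (a namespace for tables).
CREATE DATABASE IF NOT EXISTS demo_db;
-- Create a table in that database, partitioned by date and bucketed by user_id.
-- Each partition value becomes a directory; each partition's rows are hashed into 4 bucket files.
CREATE TABLE demo_db.page_views (
  user_id INT,
  url     STRING
)
PARTITIONED BY (view_date STRING)
CLUSTERED BY (user_id) INTO 4 BUCKETS;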

Data Types

Reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types

Primitive Data Types

Category               | Data Type                      | Description
Integers               | TINYINT                        | 1-byte signed integer
                       | SMALLINT                       | 2-byte signed integer
                       | INT                            | 4-byte signed integer
                       | BIGINT                         | 8-byte signed integer
Boolean Type           | BOOLEAN                        | TRUE/FALSE
Floating Point Numbers | FLOAT                          | Single precision
                       | DOUBLE                         | Double precision
Fixed Point Numbers    | DECIMAL                        | Fixed-point value with user-defined precision and scale
String Types           | STRING                         | Sequence of characters in a specified character set
                       | VARCHAR                        | Sequence of characters in a specified character set, with a maximum length
                       | CHAR                           | Sequence of characters in a specified character set, with a fixed length
Date and Time Types    | TIMESTAMP                      | Date and time without a time zone ("LocalDateTime" semantics)
                       | TIMESTAMP WITH LOCAL TIME ZONE | A point in time measured down to nanoseconds ("Instant" semantics)
                       | DATE                           | A calendar date
Binary Types           | BINARY                         | A sequence of bytes
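
As a sketch, a single hypothetical table can exercise most of these types (the table and column names, lengths, and precision below are made up):

CREATE TABLE products (
  id        BIGINT,
  name      VARCHAR(100),    -- at most 100 characters
  country   CHAR(2),         -- fixed length of 2 characters
  price     DECIMAL(10, 2),  -- fixed point: precision 10, scale 2
  in_stock  BOOLEAN,
  weight_kg DOUBLE,
  added_at  TIMESTAMP,       -- "LocalDateTime" semantics
  added_on  DATE,
  raw_bytes BINARY
);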

Timestamp

If the application needs to handle globally consistent points in time, use TIMESTAMP WITH LOCAL TIME ZONE ("Instant" semantics). If you only need to record a date and time without dealing with cross-time-zone issues, a plain TIMESTAMP ("LocalDateTime" semantics) is enough.
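
For example, an event table might keep both kinds of column side by side. The table and column names below are illustrative, and TIMESTAMP WITH LOCAL TIME ZONE assumes a Hive version that supports it (Hive 3.x or later):

CREATE TABLE events_log (
  event_id   BIGINT,
  local_time TIMESTAMP,                      -- wall-clock date and time, no time zone attached
  event_time TIMESTAMP WITH LOCAL TIME ZONE  -- a global instant, displayed in the session time zone
);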

Complex Data Types

Complex types are built up from primitive types and other composite types.

  • Structs: the elements within the type are accessed using DOT (.) notation. For example, for a column c of type STRUCT {a INT; b INT}, the a field is accessed by the expression c.a.
  • Maps (key-value tuples): elements are accessed using ['element name'] notation. For example, in a map M comprising a mapping from 'group' -> gid, the gid value can be accessed with M['group'].
  • Arrays (indexable lists): the elements in an array must all be of the same type. Elements are accessed using [n] notation, where n is a zero-based index into the array. For example, for an array A with the elements ['a', 'b', 'c'], A[1] returns 'b'.

Examples:

  • Structs
-- Create
CREATE TABLE employees (
  id INT,
  name STRING,
  address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
);
-- Insert (complex-type literals are not supported in a VALUES clause, so use INSERT ... SELECT)
INSERT INTO employees SELECT 1, 'John Doe', NAMED_STRUCT('street', '123 Main St', 'city', 'Springfield', 'state', 'IL', 'zip', 62701);
INSERT INTO employees SELECT 2, 'Jane Smith', NAMED_STRUCT('street', '456 Elm St', 'city', 'Metropolis', 'state', 'NY', 'zip', 10001);
-- Select & Filter
SELECT id, name, address.city AS city FROM employees;
SELECT name FROM employees WHERE address.state = 'NY';
  • Maps
-- Create
CREATE TABLE departments (
  dept_name STRING,
  employees MAP<STRING, INT>
);
-- Insert (complex-type literals are not supported in a VALUES clause, so use INSERT ... SELECT)
INSERT INTO departments SELECT 'HR', MAP('Alice', 101, 'Bob', 102);
INSERT INTO departments SELECT 'Engineering', MAP('Charlie', 201, 'David', 202);
-- Select & Filter
SELECT dept_name, employees['Alice'] AS alice_id FROM departments;
SELECT dept_name FROM departments WHERE employees['Charlie'] IS NOT NULL;
  • Arrays
-- Create
CREATE TABLE students (
  name STRING,
  scores ARRAY<INT>
);
-- Insert (complex-type literals are not supported in a VALUES clause, so use INSERT ... SELECT)
INSERT INTO students SELECT 'Alice', ARRAY(85, 90, 95);
INSERT INTO students SELECT 'Bob', ARRAY(78, 82, 88);
-- Select & Filter
SELECT name, scores[0] AS first_score FROM students;
SELECT name FROM students WHERE scores[1] > 80;

Statement Execution Order/Priority

Step          | Description
FROM          | Determine the data source (table or view) and load the data.
JOIN          | Perform join operations between tables.
WHERE         | Apply filtering conditions to exclude rows that do not meet the criteria.
GROUP BY      | Group rows based on the specified columns.
HAVING        | Apply filtering conditions to the grouped results.
SELECT        | Select the required columns, compute expressions, and apply DISTINCT to remove duplicates at this stage.
DISTRIBUTE BY | Specify the column(s) by which to distribute the data, affecting data partitioning.
SORT BY       | Sort the data within each partition.
CLUSTER BY    | Partition and sort the data (equivalent to DISTRIBUTE BY + SORT BY on the same columns).
ORDER BY      | Perform global sorting on the entire result set.
LIMIT         | Limit the number of rows returned.
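
As a sketch, the query below touches most of these stages (staff and depts are hypothetical tables); conceptually Hive evaluates FROM and JOIN first, then WHERE, GROUP BY, HAVING, and SELECT, and finally ORDER BY and LIMIT:

SELECT   d.dept_name, count(*) AS headcount        -- SELECT: computed after grouping
FROM     staff s                                   -- FROM: load the data source
JOIN     depts d ON (s.dept_id = d.dept_id)        -- JOIN: combine the tables
WHERE    s.salary > 1000                           -- WHERE: filter individual rows
GROUP BY d.dept_name                               -- GROUP BY: aggregate per department
HAVING   count(*) > 5                              -- HAVING: filter the groups
ORDER BY headcount DESC                            -- ORDER BY: global sort
LIMIT    10;                                       -- LIMIT: cap the number of rows returned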

Operations

DDL: Data Definition Language

Reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL

Create Tables

-- Create a table with two columns, partitioned by the string column ds.
CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);

Browse Through Tables

-- List all tables.
show tables;
-- List tables whose names match the given pattern.
show tables '.*end';
-- Show the columns of a table.
desc invites;

Alter & Drop Tables

-- Add new columns (existing data is unchanged).
alter table invites add columns (new_col INT COMMENT 'new col comm');
-- Replace the entire column list with a new one.
alter table invites replace columns (first_col INT, second_col STRING);
-- Rename the table, then drop it.
alter table invites rename to invites_new;
drop table invites_new;

Metastore

Reference: https://cwiki.apache.org/confluence/display/Hive/AdminManual+Metastore+Administration

By default, metadata is stored in an embedded Derby database whose on-disk location is controlled by the Hive configuration variable javax.jdo.option.ConnectionURL. The default location is ./metastore_db (see conf/hive-default.xml).
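
To see which connection URL the current session is actually using, you can print the variable from the Hive shell:

-- Prints the current value; with the embedded Derby default it points at ./metastore_db.
set javax.jdo.option.ConnectionURL;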

DML: Data Manipulation Language

Reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML

-- Load data from the local file system into Hive.
LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
-- Load data from an HDFS file/directory into Hive.
LOAD DATA INPATH '/user/name/kv3.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');

SQL

SELECT & FILTER

Reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select

-- Select one column from a single partition.
select a.foo from invites a where a.ds='2008-08-15';
-- Write query results to an HDFS directory.
insert overwrite directory '/tmp/hdfs_out' select xxx;
-- Write query results to a local directory.
insert overwrite local directory '/tmp/hdfs_out' select xxx;
-- Write query results into another table.
insert overwrite table invites select xxx;

GROUP BY

GROUP BY is used to aggregate data across multiple rows based on one or more specified columns.

INSERT OVERWRITE TABLE events SELECT a.bar, count(*) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;
-- Equivalent form with the FROM clause written first.
FROM invites a INSERT OVERWRITE TABLE events SELECT a.bar, count(*) WHERE a.foo > 0 GROUP BY a.bar;

JOIN

JOIN is used to combine rows from two or more tables based on a related column between them.

FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar) INSERT OVERWRITE TABLE events SELECT t1.bar, t1.foo, t2.foo;

MULTITABLE INSERT

FROM src
INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100
INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 and src.key < 200
INSERT OVERWRITE TABLE dest3 PARTITION(ds='2008-04-08', hr='12') SELECT src.key WHERE src.key >= 200 and src.key < 300
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/dest4.out' SELECT src.value WHERE src.key >= 300;

STREAMING

-- TRANSFORM streams rows through an external program; '/bin/cat' simply passes them through unchanged.
FROM invites a INSERT OVERWRITE TABLE events SELECT TRANSFORM(a.foo, a.bar) AS (oof, rab) USING '/bin/cat' WHERE a.ds > '2008-08-09';

Operators & Functions

Reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

-- List all built-in functions.
show functions;
-- Show a short description of a function.
desc function <func_name>;
-- Show the extended description, including usage examples.
desc function extended <func_name>;

Built-in Operators

Relational

Description                   | Syntax
Equality comparison           | =
Inequality comparison         | <>, !=
Other comparisons             | <, <=, >, >=
Null comparison               | IS NULL
Non-null comparison           | IS NOT NULL
LIKE comparison               | LIKE ('_' matches any single character, '%' matches any number of characters)
Java regular expression match | RLIKE, REGEXP
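
A small sketch of the pattern-matching operators, assuming a hypothetical users table with a name column:

-- LIKE: '_' matches exactly one character, '%' matches any sequence of characters.
SELECT name FROM users WHERE name LIKE 'J_n%';
-- RLIKE / REGEXP: Java regular expression syntax.
SELECT name FROM users WHERE name RLIKE '^[A-Z][a-z]+$';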

Arithmetic

+, -, *, /, &, |, ^, ~

Logical

AND (&&), OR (||), NOT (!)

Operators on Complex Types

  • A[n]: returns the nth (zero-based) element of an array A of type ARRAY<T>.
  • M[k]: returns the value associated with key k in a map M of type MAP<K, V>.
  • S.x: returns the x field of a struct S of type STRUCT {x INT, ...}.

Built-in Functions

Time

Description                  | Function
UNIX timestamp to date       | from_unixtime
Current UNIX timestamp       | unix_timestamp
Date to UNIX timestamp       | unix_timestamp
Datetime to date             | to_date
Date to year                 | year
Date to week of year         | weekofyear
Difference between two dates | datediff
Date addition                | date_add
Date subtraction             | date_sub
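
A quick sketch of these functions in one query (this assumes a Hive version that allows SELECT without a FROM clause; the literal dates are arbitrary):

SELECT
  unix_timestamp()                      AS now_epoch,        -- current UNIX timestamp
  from_unixtime(1218153600)             AS epoch_as_string,  -- 'yyyy-MM-dd HH:mm:ss' string
  unix_timestamp('2008-08-15 10:30:00') AS epoch_from_date,
  to_date('2008-08-15 10:30:00')        AS just_the_date,
  year('2008-08-15')                    AS y,
  weekofyear('2008-08-15')              AS w,
  datediff('2008-08-15', '2008-08-09')  AS days_between,     -- 6
  date_add('2008-08-15', 7)             AS one_week_later,
  date_sub('2008-08-15', 7)             AS one_week_earlier;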

Condition

Description              | Function
If function              | if
First non-null value     | COALESCE
Conditional expression   | CASE WHEN a THEN b [WHEN c THEN d]* [ELSE e] END
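
A short sketch, assuming a hypothetical exam_results table with name, nickname, and score columns:

SELECT
  if(score >= 60, 'pass', 'fail')     AS result,
  COALESCE(nickname, name, 'unknown') AS display_name,  -- first non-null argument
  CASE WHEN score >= 90 THEN 'A'
       WHEN score >= 75 THEN 'B'
       ELSE 'C'
  END                                 AS grade
FROM exam_results;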

String

Description                               | Function
String length                             | length
String reverse                            | reverse
String concatenation                      | concat
String concatenation with separator       | concat_ws
Substring                                 | substr, substring
Convert to uppercase                      | upper, ucase
Convert to lowercase                      | lower, lcase
Trim whitespace from both ends            | trim
Left trim                                 | ltrim
Right trim                                | rtrim
Regular expression replace                | regexp_replace
Regular expression extract                | regexp_extract
URL parsing                               | parse_url
JSON parsing                              | get_json_object
String of n spaces                        | space
Repeat a string                           | repeat
ASCII code of the first character         | ascii
Left pad                                  | lpad
Right pad                                 | rpad
Split a string into an array              | split
Position of a string in a comma-set list  | find_in_set
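
A few of these in one query (again assuming a Hive version that allows SELECT without a FROM clause; the expected results are shown as comments):

SELECT
  length('hive')                                   AS len,          -- 4
  reverse('hive')                                  AS rev,          -- 'evih'
  concat('foo', 'bar')                             AS joined,       -- 'foobar'
  concat_ws('-', '2008', '08', '15')               AS dashed,       -- '2008-08-15'
  substr('facebook', 5)                            AS tail,         -- 'book' (1-based index)
  upper('hive')                                    AS up,           -- 'HIVE'
  trim('  hive  ')                                 AS trimmed,      -- 'hive'
  regexp_replace('foobar', 'oo|ar', '')            AS stripped,     -- 'fb'
  regexp_extract('foothebar', 'foo(.*?)(bar)', 1)  AS extracted,    -- 'the'
  parse_url('http://example.com/path?k=v', 'HOST') AS host,         -- 'example.com'
  get_json_object('{"a": 1}', '$.a')               AS json_a,       -- '1'
  split('a,b,c', ',')[1]                           AS second_item,  -- 'b'
  find_in_set('b', 'a,b,c')                        AS pos;          -- 2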

Statistic

Description                        | Function
Row count                          | count
Count of non-null values           | count(expr)
Count of distinct non-null values  | count(DISTINCT expr[, expr_...])
Sum                                | sum
Average                            | avg
Minimum value                      | min
Maximum value                      | max
Percentile (p = 0.5 is the median) | percentile(BIGINT col, p)
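
One grouped query covering most of the aggregates, assuming a hypothetical exam_scores table with class, name, and an integer score column (percentile expects an integral column):

SELECT
  class,
  count(*)               AS n_rows,
  count(DISTINCT name)   AS n_students,
  sum(score)             AS total,
  avg(score)             AS mean,
  min(score)             AS lowest,
  max(score)             AS highest,
  percentile(score, 0.5) AS median      -- p = 0.5 gives the median
FROM exam_scores
GROUP BY class;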