Hive Crash Course


Data Units

  • Database: a namespace that groups tables.
  • Table: a homogeneous set of rows that share the same schema.
  • Partition: a slice of a table, determined by the values of one or more partition columns and stored as its own directory.
  • Bucket: a further subdivision of a partition (or table), produced by hashing a column into a fixed number of files.
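
A minimal sketch of how these units nest (the database, table, and column names below are made up for illustration): a database holds tables, a table can be split into partitions by the value of a partition column, and each partition can be bucketed by hashing a column.

-- Create a database (a namespace for tables).
CREATE DATABASE IF NOT EXISTS demo_db;
-- Create a table in that database, partitioned by date and bucketed by user_id.
-- Each partition value becomes a directory; each partition's rows are hashed into 4 bucket files.
CREATE TABLE demo_db.page_views (
  user_id INT,
  url     STRING
)
PARTITIONED BY (view_date STRING)
CLUSTERED BY (user_id) INTO 4 BUCKETS;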

Data Types

Reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types

Primitive Data Types

Category               | Data Type                      | Description
Integers               | TINYINT                        | 1-byte signed integer
                       | SMALLINT                       | 2-byte signed integer
                       | INT                            | 4-byte signed integer
                       | BIGINT                         | 8-byte signed integer
Boolean Type           | BOOLEAN                        | TRUE/FALSE
Floating Point Numbers | FLOAT                          | Single precision
                       | DOUBLE                         | Double precision
Fixed Point Numbers    | DECIMAL                        | Fixed-point value with user-defined precision and scale
String Types           | STRING                         | Sequence of characters in a specified character set
                       | VARCHAR                        | Sequence of characters in a specified character set, with a maximum length
                       | CHAR                           | Sequence of characters in a specified character set, with a fixed length
Date and Time Types    | TIMESTAMP                      | Date and time without a time zone ("LocalDateTime" semantics)
                       | TIMESTAMP WITH LOCAL TIME ZONE | A point in time measured down to nanoseconds ("Instant" semantics)
                       | DATE                           | A calendar date
Binary Types           | BINARY                         | A sequence of bytes
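
As a sketch, a single hypothetical table can exercise most of these types (the table and column names, lengths, and precision below are made up):

CREATE TABLE products (
  id        BIGINT,
  name      VARCHAR(100),    -- at most 100 characters
  country   CHAR(2),         -- fixed length of 2 characters
  price     DECIMAL(10, 2),  -- fixed point: precision 10, scale 2
  in_stock  BOOLEAN,
  weight_kg DOUBLE,
  added_at  TIMESTAMP,       -- "LocalDateTime" semantics
  added_on  DATE,
  raw_bytes BINARY
);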

Timestamp

If the application needs to handle globally consistent points in time, use TIMESTAMP WITH LOCAL TIME ZONE ("Instant" semantics). If you only need to record a date and time without dealing with cross-time-zone issues, a plain TIMESTAMP ("LocalDateTime" semantics) is enough.
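
For example, an event table might keep both kinds of column side by side. The table and column names below are illustrative, and TIMESTAMP WITH LOCAL TIME ZONE assumes a Hive version that supports it (Hive 3.x or later):

CREATE TABLE events_log (
  event_id   BIGINT,
  local_time TIMESTAMP,                      -- wall-clock date and time, no time zone attached
  event_time TIMESTAMP WITH LOCAL TIME ZONE  -- a global instant, displayed in the session time zone
);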

Complex Data Types

Complex types are built up from primitive types and other composite types.

  • Structs: the elements within the type are accessed using DOT (.) notation. For example, for a column c of type STRUCT {a INT; b INT}, the a field is accessed by the expression c.a.
  • Maps (key-value tuples): elements are accessed using ['element name'] notation. For example, in a map M comprising a mapping from 'group' -> gid, the gid value can be accessed with M['group'].
  • Arrays (indexable lists): the elements in an array must all be of the same type. Elements are accessed using [n] notation, where n is a zero-based index into the array. For example, for an array A with the elements ['a', 'b', 'c'], A[1] returns 'b'.

Examples:

  • Structs
-- Create
CREATE TABLE employees (
  id INT,
  name STRING,
  address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
);
-- Insert (complex-type literals are not supported in a VALUES clause, so use INSERT ... SELECT)
INSERT INTO employees SELECT 1, 'John Doe', NAMED_STRUCT('street', '123 Main St', 'city', 'Springfield', 'state', 'IL', 'zip', 62701);
INSERT INTO employees SELECT 2, 'Jane Smith', NAMED_STRUCT('street', '456 Elm St', 'city', 'Metropolis', 'state', 'NY', 'zip', 10001);
-- Select & Filter
SELECT id, name, address.city AS city FROM employees;
SELECT name FROM employees WHERE address.state = 'NY';
  • Maps
-- Create
CREATE TABLE departments (
  dept_name STRING,
  employees MAP<STRING, INT>
);
-- Insert (complex-type literals are not supported in a VALUES clause, so use INSERT ... SELECT)
INSERT INTO departments SELECT 'HR', MAP('Alice', 101, 'Bob', 102);
INSERT INTO departments SELECT 'Engineering', MAP('Charlie', 201, 'David', 202);
-- Select & Filter
SELECT dept_name, employees['Alice'] AS alice_id FROM departments;
SELECT dept_name FROM departments WHERE employees['Charlie'] IS NOT NULL;
  • Arrays
-- Create
CREATE TABLE students (
  name STRING,
  scores ARRAY<INT>
);
-- Insert (complex-type literals are not supported in a VALUES clause, so use INSERT ... SELECT)
INSERT INTO students SELECT 'Alice', ARRAY(85, 90, 95);
INSERT INTO students SELECT 'Bob', ARRAY(78, 82, 88);
-- Select & Filter
SELECT name, scores[0] AS first_score FROM students;
SELECT name FROM students WHERE scores[1] > 80;

Statement Execution Order/Priority

Step          | Description
FROM          | Determine the data source (table or view) and load the data.
JOIN          | Perform join operations between tables.
WHERE         | Apply filtering conditions to exclude rows that do not meet the criteria.
GROUP BY      | Group rows based on the specified columns.
HAVING        | Apply filtering conditions to the grouped results.
SELECT        | Select the required columns, compute expressions, and apply DISTINCT to remove duplicates at this stage.
DISTRIBUTE BY | Specify the column(s) by which to distribute the data, affecting data partitioning.
SORT BY       | Sort the data within each partition.
CLUSTER BY    | Partition and sort the data (equivalent to DISTRIBUTE BY + SORT BY on the same columns).
ORDER BY      | Perform global sorting on the entire result set.
LIMIT         | Limit the number of rows returned.
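
As a sketch, the query below touches most of these stages (staff and depts are hypothetical tables); conceptually Hive evaluates FROM and JOIN first, then WHERE, GROUP BY, HAVING, and SELECT, and finally ORDER BY and LIMIT:

SELECT   d.dept_name, count(*) AS headcount        -- SELECT: computed after grouping
FROM     staff s                                   -- FROM: load the data source
JOIN     depts d ON (s.dept_id = d.dept_id)        -- JOIN: combine the tables
WHERE    s.salary > 1000                           -- WHERE: filter individual rows
GROUP BY d.dept_name                               -- GROUP BY: aggregate per department
HAVING   count(*) > 5                              -- HAVING: filter the groups
ORDER BY headcount DESC                            -- ORDER BY: global sort
LIMIT    10;                                       -- LIMIT: cap the number of rows returned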

Operations

DDL: Data Definition Language

Reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL

Create Tables

-- Create a table with two columns, partitioned by the string column ds.
CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);

Browse Through Tables

-- List all tables.
show tables;
-- List tables whose names match the given pattern.
show tables '.*end';
-- Show the columns of a table.
desc invites;

Alter & Drop Tables

-- Add new columns (existing data is unchanged).
alter table invites add columns (new_col INT COMMENT 'new col comm');
-- Replace the entire column list with a new one.
alter table invites replace columns (first_col INT, second_col STRING);
-- Rename the table, then drop it.
alter table invites rename to invites_new;
drop table invites_new;

Metastore

Reference: https://cwiki.apache.org/confluence/display/Hive/AdminManual+Metastore+Administration

By default, metadata is stored in an embedded Derby database whose on-disk location is controlled by the Hive configuration variable javax.jdo.option.ConnectionURL. The default location is ./metastore_db (see conf/hive-default.xml).
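
To see which connection URL the current session is actually using, you can print the variable from the Hive shell:

-- Prints the current value; with the embedded Derby default it points at ./metastore_db.
set javax.jdo.option.ConnectionURL;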

DML: Data Manipulation Language

Reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML

-- Load data from the local file system into Hive.
LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
-- Load data from an HDFS file/directory into Hive.
LOAD DATA INPATH '/user/name/kv3.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');

SQL

SELECT & FILTER

Reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select

-- Select one column from a single partition.
select a.foo from invites a where a.ds='2008-08-15';
-- Write query results to an HDFS directory.
insert overwrite directory '/tmp/hdfs_out' select xxx;
-- Write query results to a local directory.
insert overwrite local directory '/tmp/hdfs_out' select xxx;
-- Write query results into another table.
insert overwrite table invites select xxx;

GROUP BY

GROUP BY is used to aggregate data across multiple rows based on one or more specified columns.

INSERT OVERWRITE TABLE events SELECT a.bar, count(*) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;
-- Equivalent form with the FROM clause written first.
FROM invites a INSERT OVERWRITE TABLE events SELECT a.bar, count(*) WHERE a.foo > 0 GROUP BY a.bar;

JOIN

JOIN is used to combine rows from two or more tables based on a related column between them.

FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar) INSERT OVERWRITE TABLE events SELECT t1.bar, t1.foo, t2.foo;

MULTITABLE INSERT

FROM src
INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100
INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 and src.key < 200
INSERT OVERWRITE TABLE dest3 PARTITION(ds='2008-04-08', hr='12') SELECT src.key WHERE src.key >= 200 and src.key < 300
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/dest4.out' SELECT src.value WHERE src.key >= 300;

STREAMING

-- TRANSFORM streams rows through an external program; '/bin/cat' simply passes them through unchanged.
FROM invites a INSERT OVERWRITE TABLE events SELECT TRANSFORM(a.foo, a.bar) AS (oof, rab) USING '/bin/cat' WHERE a.ds > '2008-08-09';

Operators & Functions

Reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

-- List all built-in functions.
show functions;
-- Show a short description of a function.
desc function <func_name>;
-- Show the extended description, including usage examples.
desc function extended <func_name>;

Built-in Operators

Relational

Description                   | Syntax
Equality comparison           | =
Inequality comparison         | <>, !=
Other comparisons             | <, <=, >, >=
Null comparison               | IS NULL
Non-null comparison           | IS NOT NULL
LIKE comparison               | LIKE ('_' matches any single character, '%' matches any number of characters)
Java regular expression match | RLIKE, REGEXP
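
A small sketch of the pattern-matching operators, assuming a hypothetical users table with a name column:

-- LIKE: '_' matches exactly one character, '%' matches any sequence of characters.
SELECT name FROM users WHERE name LIKE 'J_n%';
-- RLIKE / REGEXP: Java regular expression syntax.
SELECT name FROM users WHERE name RLIKE '^[A-Z][a-z]+$';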

Arithmetic

+, -, *, /, &, |, ^, ~

Logical

AND (&&), OR (||), NOT (!)

Operators on Complex Types

  • A[n]: returns the nth (zero-based) element of an array A of type ARRAY<T>.
  • M[k]: returns the value associated with key k in a map M of type MAP<K, V>.
  • S.x: returns the x field of a struct S of type STRUCT {x INT, ...}.

Built-in Functions

Time

Description                  | Function
UNIX timestamp to date       | from_unixtime
Current UNIX timestamp       | unix_timestamp
Date to UNIX timestamp       | unix_timestamp
Datetime to date             | to_date
Date to year                 | year
Date to week of year         | weekofyear
Difference between two dates | datediff
Date addition                | date_add
Date subtraction             | date_sub
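
A quick sketch of these functions in one query (this assumes a Hive version that allows SELECT without a FROM clause; the literal dates are arbitrary):

SELECT
  unix_timestamp()                      AS now_epoch,        -- current UNIX timestamp
  from_unixtime(1218153600)             AS epoch_as_string,  -- 'yyyy-MM-dd HH:mm:ss' string
  unix_timestamp('2008-08-15 10:30:00') AS epoch_from_date,
  to_date('2008-08-15 10:30:00')        AS just_the_date,
  year('2008-08-15')                    AS y,
  weekofyear('2008-08-15')              AS w,
  datediff('2008-08-15', '2008-08-09')  AS days_between,     -- 6
  date_add('2008-08-15', 7)             AS one_week_later,
  date_sub('2008-08-15', 7)             AS one_week_earlier;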

Condition

Description              | Function
If function              | if
First non-null value     | COALESCE
Conditional expression   | CASE WHEN a THEN b [WHEN c THEN d]* [ELSE e] END
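
A short sketch, assuming a hypothetical exam_results table with name, nickname, and score columns:

SELECT
  if(score >= 60, 'pass', 'fail')     AS result,
  COALESCE(nickname, name, 'unknown') AS display_name,  -- first non-null argument
  CASE WHEN score >= 90 THEN 'A'
       WHEN score >= 75 THEN 'B'
       ELSE 'C'
  END                                 AS grade
FROM exam_results;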

String

Description                               | Function
String length                             | length
String reverse                            | reverse
String concatenation                      | concat
String concatenation with separator       | concat_ws
Substring                                 | substr, substring
Convert to uppercase                      | upper, ucase
Convert to lowercase                      | lower, lcase
Trim whitespace from both ends            | trim
Left trim                                 | ltrim
Right trim                                | rtrim
Regular expression replace                | regexp_replace
Regular expression extract                | regexp_extract
URL parsing                               | parse_url
JSON parsing                              | get_json_object
String of n spaces                        | space
Repeat a string                           | repeat
ASCII code of the first character         | ascii
Left pad                                  | lpad
Right pad                                 | rpad
Split a string into an array              | split
Position of a string in a comma-set list  | find_in_set
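
A few of these in one query (again assuming a Hive version that allows SELECT without a FROM clause; the expected results are shown as comments):

SELECT
  length('hive')                                   AS len,          -- 4
  reverse('hive')                                  AS rev,          -- 'evih'
  concat('foo', 'bar')                             AS joined,       -- 'foobar'
  concat_ws('-', '2008', '08', '15')               AS dashed,       -- '2008-08-15'
  substr('facebook', 5)                            AS tail,         -- 'book' (1-based index)
  upper('hive')                                    AS up,           -- 'HIVE'
  trim('  hive  ')                                 AS trimmed,      -- 'hive'
  regexp_replace('foobar', 'oo|ar', '')            AS stripped,     -- 'fb'
  regexp_extract('foothebar', 'foo(.*?)(bar)', 1)  AS extracted,    -- 'the'
  parse_url('http://example.com/path?k=v', 'HOST') AS host,         -- 'example.com'
  get_json_object('{"a": 1}', '$.a')               AS json_a,       -- '1'
  split('a,b,c', ',')[1]                           AS second_item,  -- 'b'
  find_in_set('b', 'a,b,c')                        AS pos;          -- 2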

Statistic

Description                        | Function
Row count                          | count
Count of non-null values           | count(expr)
Count of distinct non-null values  | count(DISTINCT expr[, expr_...])
Sum                                | sum
Average                            | avg
Minimum value                      | min
Maximum value                      | max
Percentile (p = 0.5 is the median) | percentile(BIGINT col, p)
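
One grouped query covering most of the aggregates, assuming a hypothetical exam_scores table with class, name, and an integer score column (percentile expects an integral column):

SELECT
  class,
  count(*)               AS n_rows,
  count(DISTINCT name)   AS n_students,
  sum(score)             AS total,
  avg(score)             AS mean,
  min(score)             AS lowest,
  max(score)             AS highest,
  percentile(score, 0.5) AS median      -- p = 0.5 gives the median
FROM exam_scores
GROUP BY class;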