Data Units
Database, Table, Partition, Bucket
Data Types
Reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types
Primitive Data Types
Category | Data Type | Description |
---|---|---|
Integers | TINYINT | 1-byte integer |
Integers | SMALLINT | 2-byte integer |
Integers | INT | 4-byte integer |
Integers | BIGINT | 8-byte integer |
Boolean Type | BOOLEAN | TRUE/FALSE |
Floating Point Numbers | FLOAT | Single precision |
Floating Point Numbers | DOUBLE | Double precision |
Fixed Point Numbers | DECIMAL | A fixed-point value of user-defined scale and precision |
String Types | STRING | Sequence of characters in a specified character set |
String Types | VARCHAR | Sequence of characters in a specified character set with a maximum length |
String Types | CHAR | Sequence of characters in a specified character set with a defined length |
Date and Time Types | TIMESTAMP | A date and time without a timezone ("LocalDateTime" semantics) |
Date and Time Types | TIMESTAMP WITH LOCAL TIME ZONE | A point in time measured down to nanoseconds ("Instant" semantics) |
Date and Time Types | DATE | A date |
Binary Types | BINARY | A sequence of bytes |
Timestamp
If the application needs to handle globally consistent points in time, you should use timestamps with local time zone ("Instant" semantics). If you only need to record a combination of date and time without dealing with cross-time zone issues, you can use "LocalDateTime" semantics timestamps.
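The two semantics can sit side by side in one table; a minimal sketch (the table and column names here are illustrative, not from the examples below):

```sql
CREATE TABLE event_log (
  event_id   BIGINT,
  -- "LocalDateTime" semantics: stored and read back as-is,
  -- with no time zone conversion.
  local_ts   TIMESTAMP,
  -- "Instant" semantics: normalized to UTC on write and rendered
  -- in the session's local time zone on read.
  instant_ts TIMESTAMP WITH LOCAL TIME ZONE
);
```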
Complex Data Types
Complex types are built up from primitive types and other composite types.
Category | Description |
---|---|
Structs | The elements within the type can be accessed using the DOT (.) notation. For example, for a column c of type STRUCT {a INT; b INT} , the a field is accessed by the expression c.a . |
Maps (key-value tuples) | The elements are accessed using ['element name'] notation. For example, in a map M consisting of a mapping from 'group' -> gid , the gid value can be accessed using M['group'] . |
Arrays (indexable lists) | The elements in the array have to be of the same type. Elements can be accessed using the [n] notation where n is an index (zero-based) into the array. For example, for an array A having the elements ['a', 'b', 'c'] , A[1] returns 'b' . |
Examples:
- Structs
```sql
-- Create
CREATE TABLE employees (
  id INT,
  name STRING,
  address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
);

-- Insert
INSERT INTO employees VALUES (1, 'John Doe', NAMED_STRUCT('street', '123 Main St', 'city', 'Springfield', 'state', 'IL', 'zip', 62701));
INSERT INTO employees VALUES (2, 'Jane Smith', NAMED_STRUCT('street', '456 Elm St', 'city', 'Metropolis', 'state', 'NY', 'zip', 10001));

-- Select & Filter
SELECT id, name, address.city AS city FROM employees;
SELECT name FROM employees WHERE address.state = 'NY';
```
- Maps
```sql
-- Create
CREATE TABLE departments (
  dept_name STRING,
  employees MAP<STRING, INT>
);

-- Insert
INSERT INTO departments VALUES ('HR', MAP('Alice', 101, 'Bob', 102));
INSERT INTO departments VALUES ('Engineering', MAP('Charlie', 201, 'David', 202));

-- Select & Filter
SELECT dept_name, employees['Alice'] AS alice_id FROM departments;
SELECT dept_name FROM departments WHERE employees['Charlie'] IS NOT NULL;
```
- Arrays
```sql
-- Create
CREATE TABLE students (
  name STRING,
  scores ARRAY<INT>
);

-- Insert
INSERT INTO students VALUES ('Alice', ARRAY(85, 90, 95));
INSERT INTO students VALUES ('Bob', ARRAY(78, 82, 88));

-- Select & Filter
SELECT name, scores[0] AS first_score FROM students;
SELECT name FROM students WHERE scores[1] > 80;
```
Statement Execution Order/Priority
Step | Description |
---|---|
FROM | Determine the data source (table or view) and load the data. |
JOIN | Perform join operations between tables. |
WHERE | Apply filtering conditions to exclude rows that do not meet the criteria. |
GROUP BY | Group rows based on specified columns. |
HAVING | Apply filtering conditions to the grouped results. |
SELECT | Select the required columns, compute expressions, and apply DISTINCT to remove duplicates at this stage. |
DISTRIBUTE BY | Specify the column(s) by which to distribute the data, affecting data partitioning. |
SORT BY | Sort the data within each partition. |
CLUSTER BY | Partition and sort the data (equivalent to DISTRIBUTE BY + SORT BY ). |
ORDER BY | Perform global sorting on the entire result set. |
LIMIT | Limit the number of rows returned. |
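The order above can be annotated on a concrete query; a sketch using made-up table and column names:

```sql
SELECT   dept, count(*) AS cnt     -- 6. project, compute, deduplicate
FROM     emp                       -- 1. load the source table
JOIN     bonus b ON emp.id = b.id  -- 2. join
WHERE    emp.salary > 0            -- 3. filter rows
GROUP BY dept                      -- 4. group
HAVING   count(*) > 1              -- 5. filter groups
ORDER BY cnt DESC                  -- 7. global sort
LIMIT 10;                          -- 8. limit rows returned
```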
Operations
DDL: Data Definition Language
Reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
Create Tables
```sql
CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);
```
Browse Through Tables
```sql
show tables;
show tables '.*end';
desc invites;
```
Alter & Drop Tables
```sql
alter table invites rename to invites_new;
alter table invites add columns (new_col INT COMMENT 'new col comm');
alter table invites replace columns (first_col INT, second_col STRING);
drop table invites;
```
Metastore
Reference: https://cwiki.apache.org/confluence/display/Hive/AdminManual+Metastore+Administration
Metadata is in an embedded Derby database whose disk storage location is determined by the Hive configuration variable named javax.jdo.option.ConnectionURL. By default this location is ./metastore_db (see conf/hive-default.xml).
DML: Data Manipulation Language
Reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML
```sql
-- Load data from the local file system into Hive.
LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
-- Load data from an HDFS file/directory into Hive.
LOAD DATA INPATH '/user/name/kv3.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
```
SQL
SELECT & FILTER
Reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select
```sql
select a.foo from invites a where a.ds='2008-08-15';
insert overwrite directory '/tmp/hdfs_out' select xxx;
insert overwrite local directory '/tmp/hdfs_out' select xxx;
insert overwrite table invites select xxx;
```
GROUP BY
GROUP BY is used to aggregate data across multiple rows based on one or more specified columns.
```sql
INSERT OVERWRITE TABLE events SELECT a.bar, count(*) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;
FROM invites a INSERT OVERWRITE TABLE events SELECT a.bar, count(*) WHERE a.foo > 0 GROUP BY a.bar;
```
JOIN
JOIN is used to combine rows from two or more tables based on a related column between them.
```sql
FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar) INSERT OVERWRITE TABLE events SELECT t1.bar, t1.foo, t2.foo;
```
MULTITABLE INSERT
```sql
FROM src
INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100
INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 and src.key < 200
INSERT OVERWRITE TABLE dest3 PARTITION(ds='2008-04-08', hr='12') SELECT src.key WHERE src.key >= 200 and src.key < 300
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/dest4.out' SELECT src.value WHERE src.key >= 300;
```
STREAMING
```sql
FROM invites a INSERT OVERWRITE TABLE events SELECT TRANSFORM(a.foo, a.bar) AS (oof, rab) USING '/bin/cat' WHERE a.ds > '2008-08-09';
```
Operators & Functions
Reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
```sql
show functions;
desc function <func_name>;
desc function extended <func_name>;
```
Built-in Operators
Relational
Description | Syntax |
---|---|
Equality Comparison | = |
Inequality Comparison | <> , != |
Other Comparison | < , <= , > , >= |
Null Comparison | IS NULL |
Non-null Comparison | IS NOT NULL |
LIKE Comparison | LIKE ( _ matches any single character, % matches any sequence of characters) |
Java regular expression match | RLIKE , REGEXP |
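A few pattern-matching examples against the invites table defined in the DDL section above (results depend on the data loaded):

```sql
-- bar is exactly 'a' followed by one arbitrary character.
SELECT * FROM invites WHERE bar LIKE 'a_';
-- bar contains the substring 'end' anywhere.
SELECT * FROM invites WHERE bar LIKE '%end%';
-- Java regular expression: bar consists only of digits.
SELECT * FROM invites WHERE bar RLIKE '^[0-9]+$';
```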
Arithmetic
+ , - , * , / , & , | , ^ , ~
Logical
AND ( && ), OR ( || ), NOT ( ! )
Operators on Complex Types
Operator | Operand Type |
---|---|
A[n] | A : ARRAY<INT> |
M[k] | M : MAP<K, V> |
S.x | S : STRUCT {int x} |
Built-in Functions
Time
Description | Function |
---|---|
UNIX timestamp to date | from_unixtime |
Get current UNIX timestamp | unix_timestamp |
Date to UNIX timestamp | unix_timestamp |
Datetime to date | to_date |
Date to year | year |
Date to week | weekofyear |
Date comparison | datediff |
Date addition | date_add |
Date subtraction | date_sub |
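A few of these in action (from_unixtime renders in the session time zone, so its exact output varies):

```sql
SELECT unix_timestamp();                        -- current UNIX timestamp in seconds
SELECT from_unixtime(1218153600, 'yyyy-MM-dd'); -- UNIX timestamp -> formatted date string
SELECT to_date('2008-08-09 10:00:00');          -- '2008-08-09'
SELECT datediff('2008-08-15', '2008-08-09');    -- 6
SELECT date_add('2008-08-09', 1);               -- '2008-08-10'
```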
Condition
Description | Function |
---|---|
If function | if |
Non-null lookup function | COALESCE |
Conditional function | CASE WHEN a = b THEN c [WHEN d THEN e]* [ELSE f] END |
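For example (all three evaluate on literals, so no table is needed):

```sql
SELECT if(1 < 2, 'yes', 'no');                           -- 'yes'
SELECT COALESCE(NULL, NULL, 'fallback');                 -- first non-null argument: 'fallback'
SELECT CASE WHEN 2 > 1 THEN 'bigger' ELSE 'smaller' END; -- 'bigger'
```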
String
Description | Function |
---|---|
String length function | length |
String reverse function | reverse |
String concatenation function | concat |
String concatenation with separator | concat_ws |
String substring function | substr , substring |
String to uppercase function | upper , ucase |
String to lowercase function | lower , lcase |
Trim whitespace function | trim |
Left trim function | ltrim |
Right trim function | rtrim |
Regular expression replace function | regexp_replace |
Regular expression extract function | regexp_extract |
URL parsing function | parse_url |
JSON parsing function | get_json_object |
Space string function | space |
Repeat string function | repeat |
ASCII of first character function | ascii |
Left pad function | lpad |
Right pad function | rpad |
Split string function | split |
Set lookup function | find_in_set |
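A sampler of the string functions on literal inputs:

```sql
SELECT length('hive');                           -- 4
SELECT reverse('hive');                          -- 'evih'
SELECT concat_ws('-', '2008', '08', '15');       -- '2008-08-15'
SELECT substr('hive rocks', 1, 4);               -- 'hive' (indexing is 1-based)
SELECT regexp_extract('kv2.txt', '([0-9]+)', 1); -- '2'
SELECT split('a,b,c', ',')[1];                   -- 'b' (array indexing is 0-based)
```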
Statistic
Description | Function |
---|---|
Count all rows | count(*) |
Count non-null values | count(expr) |
Count distinct non-null values | count(DISTINCT expr[, expr_...]) |
Sum function | sum |
Average function | avg |
Minimum value function | min |
Maximum value function | max |
Percentile function (the median at p = 0.5) | percentile(BIGINT col, p) |
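Applied to the invites table from the DDL examples above (foo INT, bar STRING):

```sql
SELECT count(*)             AS total_rows,
       count(DISTINCT bar)  AS distinct_bars,
       sum(foo)             AS foo_sum,
       avg(foo)             AS foo_avg,
       min(foo)             AS foo_min,
       max(foo)             AS foo_max,
       percentile(foo, 0.5) AS foo_median  -- percentile at p = 0.5 is the median
FROM invites;
```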