Importing Data Into Splice Machine
Dive into importing and upserting data into Splice Machine. This guide covers the process as well as handling challenges like timestamps and special characters.
This topic describes how to import data into your Splice Machine database using built-in procedures for importing and upserting data.
About Bulk Loading of Data Into Splice Machine
There are several ways of inserting data into your Splice Machine database. You can:
- Use an ETL tool that can connect with Splice Machine via ODBC or JDBC
- Insert data into existing tables using SQL statements such as Insert
Those methods work, but they do not leverage the parallel processing power of your cluster. Many of our customers need to import large amounts of data in a bulk load operation: for example, when moving data from an existing database into Splice Machine after an initial installation, or when a financial trading firm wants to update its database with the stock trading transactions from the past 24 hours.
The Splice Machine SYSCS_UTIL.IMPORT_DATA and SYSCS_UTIL.UPSERT_DATA_FROM_FILE built-in system procedures allow you to perform bulk imports that utilize the parallel processing power of Hadoop’s scale-out architecture while distributing the data evenly into different region servers to prevent “hot spots.” We utilize a custom-built, asynchronous write pipeline, which is an extension to the native HBase pipeline, to accomplish this effort.
Our bulk import process provides a number of significant advantages in comparison to traditional Hadoop ETL processes:
- Error and consistency checking of incoming data allows incorrect data to be rejected; for example, primary key violations and invalid data such as a wrong data type.
- You can monitor progress using the Splice Machine Management Console.
- Provides a consistent view of data even when numerous concurrent updates are occurring.
- Our write processes do not block read processes.
- Incremental updates and upserts are allowed.
Importing a Subset of Data From a File
When you import data from a file into a table, not all of the data in the file is necessarily imported. This can happen in either of these circumstances:
- If the table into which you’re importing contains fewer columns than does the data file, the “extra” columns of data at the end of each line are ignored. For example, if your table contains columns (a, b, c) and your file contains columns (a, b, c, d, e), then the data in your file’s d and e columns will be ignored. See Example 2: Importing Data Into a Table With Fewer Columns Than in the File at the end of this topic.
- If the insertColumnList in your import call specifies only a subset of the columns in the data file, then only those columns are imported; however, intervening columns in your table are then overwritten with default values. For example, if your table contains columns (a, b, c) and you only want to import columns (a, c), then the data in the table’s b column will be replaced with the default value for that column. See Example 3: Importing a Subset of Data From a File Into a Table at the end of this topic.
Note that this means that you cannot import multiple subsets into different tables from the same input file.
Using Our Import and Upsert Procedures
The SYSCS_UTIL.IMPORT_DATA and SYSCS_UTIL.UPSERT_DATA_FROM_FILE built-in system procedures are nearly identical in syntax and function. The difference is that the SYSCS_UTIL.UPSERT_DATA_FROM_FILE procedure first determines whether the database already contains a record that matches an incoming record:
- If a matching record is found in the database, that record is updated with column values from the incoming record.
- If no matching record is found in the database, the incoming record is added to the database as a new record, exactly as it would be if you had called SYSCS_UTIL.IMPORT_DATA.
The SYSCS_UTIL.IMPORT_DATA and SYSCS_UTIL.UPSERT_DATA_FROM_FILE built-in system procedures can both import data into a subset of the columns in a table. You choose the subset of columns by specifying an insertColumnList.
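Because the two procedures share the same parameter list, switching between them is a matter of changing the procedure name. As a sketch (the table and file names below are hypothetical), an upsert call looks like this:

```sql
-- Hypothetical example: UPSERT_DATA_FROM_FILE takes the same arguments
-- as IMPORT_DATA. Matching records are updated; all others are inserted.
call SYSCS_UTIL.UPSERT_DATA_FROM_FILE('SPLICE', 'myTable', null,
    '/data/myData.csv',
    null, null, null, null, null, 0, 'importErrsDir', true, null);
```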
Note: The SYSCS_UTIL.SYSCS_IMPORT_DATA system procedure, which operates similarly to SYSCS_UTIL.IMPORT_DATA and was available in earlier versions of Splice Machine, is targeted for eventual deprecation and is no longer documented.
After a successful import completes, a simple report shows how many files were imported or checked, and how many record imports or checks succeeded or failed. (Note that you can ignore the numtasks value shown in this report).
IMPORTANT: On a cluster, the files to be imported MUST be on HDFS (or MapR-FS), and must be readable by the HBase user.
Syntax
call SYSCS_UTIL.IMPORT_DATA (
schemaName,
tableName,
insertColumnList | null,
fileOrDirectoryName,
columnDelimiter | null,
characterDelimiter | null,
timestampFormat | null,
dateFormat | null,
timeFormat | null,
badRecordsAllowed,
badRecordDirectory | null,
oneLineRecords | null,
charset | null
);
call SYSCS_UTIL.UPSERT_DATA_FROM_FILE (
schemaName,
tableName,
insertColumnList | null,
fileOrDirectoryName,
columnDelimiter | null,
characterDelimiter | null,
timestampFormat | null,
dateFormat | null,
timeFormat | null,
badRecordsAllowed,
badRecordDirectory | null,
oneLineRecords | null,
charset | null
);
schemaName
The name of the schema of the table in which to import.
tableName
The name of the table in which to import.
insertColumnList
The names, in single quotes, of the columns to import. If this is null, all columns are imported.
If you don’t specify an insertColumnList and your input file contains more columns than are in the table, then the extra columns at the end of each line in the input file are ignored. For example, if your table contains columns (a, b, c) and your file contains columns (a, b, c, d, e), then the data in your file’s d and e columns will be ignored.
If you do specify an insertColumnList and the list doesn’t include every column in your table, then each omitted table column is filled with its default value (or NULL if the column has no default). For example, if your table contains columns (a, b, c) and you only want to import columns (a, c), then the data in the table’s b column will be replaced with the default value for that column.
fileOrDirectoryName
Either a single file or a directory. If this is a single file, that file is imported; if this is a directory, all of the files in that directory are imported. Note that files can be compressed or uncompressed.
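For example (with a hypothetical HDFS path), pointing the procedure at a directory imports every file it contains:

```sql
-- Import all files (compressed or uncompressed) in the directory.
call SYSCS_UTIL.IMPORT_DATA('SPLICE', 'myTable', null,
    '/data/imports/',
    null, null, null, null, null, 0, 'importErrsDir', true, null);
```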
columnDelimiter
The character used to separate columns. Specify null if you are using the comma (,) character as your delimiter.
In addition to using single characters, you can specify the following special characters as delimiters:
Special character | Display |
---|---|
\t | Tab |
\f | Formfeed |
\b | Backspace |
\\ | Backslash |
^a (or ^A) | Control-a |

If you are using a script file from the splice> command line, your script can contain the actual Control-a character as the value of this parameter.
characterDelimiter
Specifies which character is used to delimit strings in the imported data. You can specify null or the empty string to use the default string delimiter, which is the double quote (").
In addition to using single characters, you can specify the following special characters as delimiters:
Special character | Display |
---|---|
\t | Tab |
\f | Formfeed |
\b | Backspace |
\\ | Backslash |
^a (or ^A) | Control-a |

If you are using a script file from the splice> command line, your script can contain the actual Control-a character as the value of this parameter.
If your input contains control characters such as newline characters, make sure that those characters are embedded within delimited strings.
To use the single quote (') character as your string delimiter, you need to escape that character. This means that you specify four single quotes ('''') as the value of this parameter. This is standard SQL syntax.
The Examples section below contains an example that uses the single quote as the string delimiter character.
timestampFormat
The format of timestamps stored in the file. You can set this to null if there are no timestamps in the file, or if the format of all timestamps in the file matches the java.sql.Timestamp default format, which is: "yyyy-MM-dd HH:mm:ss". See the About Timestamp Formats section below for more information about timestamps.
All of the timestamps in the file you are importing must use the same format.
dateFormat
The format of dates stored in the file. You can set this to null if there are no date columns in the file, or if the format of all dates in the file matches the pattern "yyyy-MM-dd".
timeFormat
The format of time values stored in the file. You can set this to null if there are no time columns in the file, or if the format of all times in the file matches the pattern "HH:mm:ss".
badRecordsAllowed
The number of rejected (bad) records that are tolerated before the import fails. If this count of rejected records is reached, the import fails, and any successful record imports are rolled back.
- If you specify -1 as the value of this parameter, all record import failures are tolerated and logged.
- If you specify 0 as the value of this parameter, the import will fail if even one record is bad.
In previous releases of Splice Machine, this parameter was named failBadRecordCount, and a value of 0 meant that all record import failures were tolerated. This has been changed, and you now must specify a value of -1 for badRecordsAllowed to tolerate all bad records.
badRecordDirectory
The directory in which bad record information is logged. Splice Machine logs this information to the <import_file_name>.bad file in this directory; for example, bad records in an input file named foo.csv would be logged to a file named badRecordDirectory/foo.csv.bad.
The default value is the directory in which the import files are found.
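Putting these two parameters together, a call like the following (table and paths hypothetical) tolerates any number of bad records and logs them for later inspection:

```sql
-- badRecordsAllowed = -1: tolerate and log all bad records.
-- Rejected rows from myData.csv are written to /BAD/myData.csv.bad.
call SYSCS_UTIL.IMPORT_DATA('SPLICE', 'myTable', null,
    '/data/myData.csv',
    null, null, null, null, null, -1, '/BAD', true, null);
```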
oneLineRecords
A Boolean value that specifies whether each line in the import file contains one complete record:
- If you specify true or null, then each record is expected to be found on a single line in the file.
- If you specify false, records can span multiple lines in the file.
charset
The character encoding of the import file. The default value is UTF-8. Currently, any other value is ignored and UTF-8 is used.
Handling Generated Column Values in Imported Files
If you’re importing data into a table with generated columns (see generated-column-spec in the SQL Reference Manual), you should know that imported records are handled in exactly the same manner as are records inserted using the INSERT statement.
Here’s a simple summary of what happens for generated columns, including DEFAULT values, in imported records:
- If your importColumnList includes the column name and the imported column value is empty, NULL is inserted into the database table column.
- If your importColumnList includes the column name and the imported column value is not empty, the column value is imported unless the value is not valid in the table.
- If your importColumnList does not include the column name, the generated value is inserted into the database table column.
Generated Column Import Examples
To illustrate what happens with generated column values in imported records, we’ll use this simple database table created with this statement:
CREATE TABLE myTable (
colA INT,
colB CHAR(12) DEFAULT 'myDefaultVal',
colC INT);
insertColumnList | Values in import record | Values inserted into database | Notes |
---|---|---|---|
"A,B,C" | 1,,2 | [1,NULL,2] | |
"A,B,C" | 3,de,4 | [3,de,4] | |
"A,B,C" | 1,2 | Error: column B wrong type | |
"A,B,C" | 1,DEFAULT,2 | [1,"DEFAULT",2] | DEFAULT is imported as a literal value |
Empty | 1,,2 | [1,NULL,2] | |
Empty | 3,de,4 | [3,de,4] | |
Empty | 1,2 | Error: column B wrong type | |
"A,C" | 1,2 | [1,myDefaultVal,2] | |
"A,C" | 3,4 | [3,myDefaultVal,4] | |
Note that the value DEFAULT in the imported file is not interpreted to mean that the default value should be applied to that column; instead:
- If the target column in your database has a string data type, such as CHAR or VARCHAR, the literal value "DEFAULT" is inserted into your database.
- If the target column is not a string data type, an error will occur.
How to Use Generated Values for a Column in Some (But Not All) Imported Records
If you are importing a file into a table with a generated column, and you want to import some records with actual values and apply generated values to other records, you need to split your import file into two files and import each:
- Import the file containing records with non-default values with the column name included in the insertColumnList.
- Import the file containing records with default values with the column name excluded from the insertColumnList.
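As a sketch (file names and column names are hypothetical, using the myTable definition from the previous section), the two imports would look like this:

```sql
-- File 1 contains real values for colB: include COLB in the column list.
call SYSCS_UTIL.IMPORT_DATA('SPLICE', 'myTable', 'COLA,COLB,COLC',
    '/data/withValues.csv',
    null, null, null, null, null, 0, 'importErrsDir', true, null);

-- File 2 should receive the default for colB: omit COLB from the list,
-- so the generated/default value is inserted for every record.
call SYSCS_UTIL.IMPORT_DATA('SPLICE', 'myTable', 'COLA,COLC',
    '/data/useDefaults.csv',
    null, null, null, null, null, 0, 'importErrsDir', true, null);
```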
When you export a table with generated columns to a file, the actual column values are exported, so importing that same file into a different database will accurately replicate the original table values.
Record Import Failure Reasons
When upserting data from a file, the input file you generate must contain:
- the columns to be changed
- all NOT NULL columns
Typical reasons for a row (record) import to fail include:
- Improper data expected for a column.
- Improper number of columns of data.
- A primary key violation: SYSCS_UTIL.IMPORT_DATA and SYSCS_UTIL.UPSERT_DATA_FROM_FILE will only work correctly if the table into which you are inserting/updating has primary keys.
Compacting Tables After Bulk Imports
We recommend that you run a full compaction on tables into which you have imported a large amount of data, using the SYSCS_UTIL.SYSCS_PERFORM_MAJOR_COMPACTION_ON_TABLE system procedure.
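For example, to compact the hypothetical myTable after a large import:

```sql
-- Run a full (major) compaction on the table we just loaded.
call SYSCS_UTIL.SYSCS_PERFORM_MAJOR_COMPACTION_ON_TABLE('SPLICE', 'myTable');
```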
About Timestamp Formats
Splice Machine uses the following Java date and time pattern letters to construct timestamps:
Pattern Letter | Description | Format(s) |
---|---|---|
y | year | yy or yyyy |
M | month | MM |
d | day in month | dd |
h | hour (0-12) | hh |
H | hour (0-23) | HH |
m | minute in hour | mm |
s | seconds | ss |
S | fractional seconds | SSS (up to 6 decimal digits: SSSSSS) |
z | time zone text | e.g. Pacific Standard time |
Z | time zone, time offset | e.g. -0800 |
The default timestamp format for Splice Machine imports is: yyyy-MM-dd HH:mm:ss, which uses a 24-hour clock, does not allow for decimal digits of seconds, and does not allow for time zone specification.
Note: The standard Java library does not support microsecond precision, so although you can specify additional fractional-second (S) digits in a custom timestamp format, such values cannot be imported with full microsecond precision.
Timestamps and Importing Data at Different Locations
Note that timestamp values are relative to the geographic location at which they are imported, or more specifically, relative to the timezone setting and daylight saving time status where the data is imported.
This means that timestamp values from the same data file may appear differently after being imported in different time zones.
Examples
The following table shows valid examples of timestamps and their corresponding format (parsing) patterns:
Timestamp value | Format Pattern | Notes |
---|---|---|
2013-03-23 09:45:00 | yyyy-MM-dd HH:mm:ss | This is the default pattern. |
2013-03-23 19:45:00.98-05 | yyyy-MM-dd HH:mm:ss.SSZ | This pattern allows up to 2 decimal digits of seconds, and requires a time zone specification. |
2013-03-23 09:45:00-07 | yyyy-MM-dd HH:mm:ssZ | This pattern requires a time zone specification, but does not allow for decimal digits of seconds. |
2013-03-23 19:45:00.98-0530 | yyyy-MM-dd HH:mm:ss.SSZ | This pattern allows up to 2 decimal digits of seconds, and requires a time zone specification. |
2013-03-23 19:45:00.123 or 2013-03-23 19:45:00.12 | yyyy-MM-dd HH:mm:ss.SSS | This pattern allows up to 3 decimal digits of seconds, but does not allow a time zone specification. Note that if your data specifies more than 3 decimal digits of seconds, an error occurs. |
2013-03-23 19:45:00.1298 | yyyy-MM-dd HH:mm:ss.SSSS | This pattern allows up to 4 decimal digits of seconds, but does not allow a time zone specification. |
Please see Working With Date and Time Values in the Splice Machine Documentation for information about working with timestamps, dates, and times.
Examples
The examples in this section illustrate using different timestamp formats and different string delimiter characters.
Note that these examples work for either importing or upserting data: you can simply substitute UPSERT_DATA_FROM_FILE in place of IMPORT_DATA in any of the system procedure calls below.
Example 1: Using the Default Timestamp Formats
You can use up to four default timestamp formats on any column in the same table without specifying the format string explicitly in the import command.
For example, given the CSV file below, which contains different timestamp formats across different columns:
Mike,2013-04-21 09:21:24.98-05,2013-03-23 16:24:37.651-05
Mike,2013-04-21 09:15:32,2013-04-21 09:15:32.005-05
Mike,2013-03-23 09:45:00-05,2013-03-23 09:45:00+07
Create your sample table with the following statement:
create table tabx (c1 char(30),c2 timestamp, c3 timestamp,
primary key(c2));
You can then import the data with the following call:
call SYSCS_UTIL.IMPORT_DATA('splice', 'tabx', 'c1,c2,c3',
'/path/to/ts.csv',
',', '''',
null, null, null, 0, null, true, null);
In this example, we specify null for the timestamp formats. Then, when each row is imported, the format of any timestamp in the data is automatically detected, since they all conform to the four default formats.
Example 2: Importing Data Into a Table With Fewer Columns Than in the File
If the table into which you’re importing data has fewer columns than the data file that you’re importing, the “extra” data columns in the file are ignored. For example, if you create a table with this statement:
CREATE TABLE playerTeams(ID int primary key, Team VARCHAR(32));
And your data file looks like this:
1,Cards,Molina,Catcher
2,Giants,Posey,Catcher
3,Royals,Perez,Catcher
When you import the file into playerTeams, only the first two columns are imported:
call SYSCS_UTIL.IMPORT_DATA('SPLICE','playerTeams',null, 'myData.csv',
null, null, null, null, null, 0, 'importErrsDir', true, null);
SELECT * FROM playerTeams ORDER by ID;
ID |TEAM
--------------
1 |Cards
2 |Giants
3 |Royals
3 rows selected
Example 3: Importing a Subset of Data From a File Into a Table
This example uses the same table and import file as does the previous example, and it produces the same results. The difference between the two examples is that this one explicitly imports only the first two columns of the file (which are named ID and TEAM):
call SYSCS_UTIL.IMPORT_DATA('SPLICE','playerTeams', 'ID,TEAM', 'myData.csv',
null, null, null, null, null, 0, 'importErrsDir', true, null);
SELECT * FROM playerTeams ORDER by ID;
ID |TEAM
--------------
1 |Cards
2 |Giants
3 |Royals
3 rows selected
Example 4: Specifying a Timestamp Format for an Entire Table
You can use a single timestamp format for the entire table by explicitly specifying a single timestampFormat. For example, given this CSV file:
Mike,2013-04-21 09:21:24.98-05
Mike,2013-04-21 09:15:32.78-04
Mike,2013-03-23 09:45:00.68-05
You can then import the data with the following call:
call SYSCS_UTIL.IMPORT_DATA('app','tabx','c1,c2',
'/path/to/ts3.csv',
',', '''',
'yyyy-MM-dd HH:mm:ss.SSZ',
null, null, 0, null, true, null);
Note that for any of the import use cases shown above, the time shown in the imported table depends on the timezone setting where the data is imported. In other words, the same CSV file, imported at locations with different time zones, will show different values in the table. Additionally, daylight saving time may account for a 1-hour difference if a time zone is specified.
Example 5: Importing Strings With Embedded Special Characters
This example imports a CSV file in which some of the input strings contain embedded newline characters. We use the default double-quote as our character delimiter to import data such as the following:
1,This field is one line,Able
2,"This field has two lines
This is the second line of the field",Baker
3,This field is also just one line,Charlie
We then use the following call to import the data:
call SYSCS_UTIL.IMPORT_DATA('SPLICE', 'MYTABLE', null, 'data.csv', null, null, null, null, null, 0, 'importErrsDir', false, null);
We can also explicitly specify double quotes (or any other character) as our delimiter character for strings:
call SYSCS_UTIL.IMPORT_DATA('SPLICE', 'MYTABLE', null, 'data.csv', null, '"', null, null, null, 0, 'importErrsDir', false, null);
Example 6: Using Single Quotes to Delimit Strings
This example performs the same import as the previous example, simply substituting single quotes for double quotes as the character delimiter in the input:
1,This field is one line,Able
2,'This field has two lines
This is the second line of the field',Baker
3,This field is also just one line,Charlie
Note that you must escape single quotes in SQL, which means that you actually define the character delimiter parameter with four single quotes, as follows:
call SYSCS_UTIL.IMPORT_DATA('SPLICE', 'MYTABLE', null, 'data.csv', null, '''', null, null, null, 0, 'importErrsDir', false, null);
Published at DZone with permission of Erin Driggers, DZone MVB. See the original article here.