Loading Data Into Azure SQL Data Warehouse
Let’s start with the level set. I’m not an ETL expert. In fact, I haven’t done any professional ETL work for several years. My skills are, at best, rusty. With this in mind, I knew I’d have a hard time extracting data from a local database in order to move it up to Azure SQL Data Warehouse. I expected to be fumbling and slow and to constantly feel more than a little stupid as I made one mistake after another. All of this came to pass.
Yet, oh my god, was it difficult!
Here’s how I started. I defined a bcp command for the tables I was interested in. I ensured it was working correctly, then wrote a little PowerShell script so I could supply a list of tables and get them all exported at once. Here’s an example:
$tables | foreach-object {
$filename = $_.Substring($_.IndexOf(".") + 1,($_.Length - ($_.IndexOf(".") + 1)))
bcp $_ out "c:\data\bcp\$filename.rtf" -S'WIN-3SRG45GBF97\dojo' -d'ODS' -T -w
}
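For context, this loop (and the AzCopy one below) assumes $tables is already populated with a list of table names. I’m not reproducing my real list, but it was just a plain array along these lines, with illustrative names:
$tables = @("Account", "Contact", "SalesOrderHeader")
The Substring gymnastics exist to strip a schema prefix, if one is present, so each output file is named for the table alone.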
That worked fine. Next up, I used AzCopy to push the files I’d created up to blob storage. Again, a little PowerShell so I could move all the files at once:
$tables | foreach-object {
.\AzCopy.exe /Source:c:\data\bcp /Dest:"https://xxx.blob.core.windows.net/sales/data/$_" /DestKey:xx /Pattern:"$_.rtf" /Z:"C:\data\bcp" /Y
}
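As an aside, if you want to sanity-check that the blobs actually landed, a quick listing does the job. This is a hedged sketch using the classic Azure.Storage PowerShell cmdlets that were current at the time; the account name and key are placeholders matching the AzCopy call above:
$ctx = New-AzureStorageContext -StorageAccountName "xxx" -StorageAccountKey "xx"
Get-AzureStorageBlob -Container "sales" -Context $ctx | Select-Object Name, Length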
No issues. Well, OK, I screwed up both these commands like twenty times until I got them right and then did the quick automation. However, the screw-ups were mine, not external issues. Next up, setting up external tables. There’s a good article from Microsoft on how to do this. Up until I tried to create the table, everything worked just fine. Then I saw this:
Msg 104051, Level 16, State 1, Line 11
Unsupported data type error. Statement references a data type that is unsupported in Parallel Data Warehouse, or there is an expression that yields an unsupported data type. Modify the statement and re-execute it.
Oh boy.
Luckily, there’s another blog post from Microsoft on getting the schema right. NUMERIC out and DECIMAL in, I got my next error:
Msg 105001, Level 16, State 1, Line 11
CREATE EXTERNAL TABLE failed because the location URI could not be reached. Revise the URI and try again.
My fault. I tried to get fancy and did this:
create external data source AzureStoreageGF
with (type = Hadoop,
location='wasbs://sales/data@xxx.blob.core.windows.net',
credential=AzureStorageCredential);
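As an aside, the AzureStorageCredential referenced there has to exist before the data source does, along with a database master key. Roughly like this, with placeholder values for the identity and secret (the secret is the storage account access key):
CREATE MASTER KEY;
CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH IDENTITY = 'user',
SECRET = 'xx';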
Changing the location to just sales@xxx.blob.core.windows.net and then making my WITH statement on the CREATE EXTERNAL TABLE command read like this:
WITH (
LOCATION='data/Account/',
DATA_SOURCE=AzureStorage,
FILE_FORMAT=TextFile);
Worked! I’m a winner. All I have to do now is query the PolyBase table I just created, starting with a simple row count.
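Something like this, with an illustrative name for the external table I’d pointed at the Account export:
SELECT COUNT(*)
FROM dbo.Account_External;
Instead of a row count, I got this: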
Msg 107090, Level 16, State 1, Line 132
Query aborted– the maximum reject threshold (0 rows) was reached while reading from an external source: 1 rows rejected out of total 1 rows processed.
(/data/Account/Account.rtf)Column ordinal: 1, Expected data type: NVARCHAR(1300) collate SQL_Latin1_General_CP1_CI_AS, Offending value: xxx (Tokenization failed), Error: Not enough columns in this line.
RASSIN FRASSIN… except I didn’t use those words; I substituted some other Anglo-Saxon terms. Anyway, what does it mean? It means, probably (although it’s hard to know), that I messed up the data export in some fashion, or that there’s something in the data that’s breaking the parsing... only I can’t find it. So, old school, I opened the file up in Excel and found two columns that had carriage returns in them. Fixed that. Going again:
Query aborted– the maximum reject threshold (0 rows) was reached while reading from an external source: 1 rows rejected out of total 33 rows processed.
(/data/Account/Account.rtf)Column ordinal: 0, Expected data type: DECIMAL(18,2), Offending value: , Error: UTF8 decode failed.
A win! What? You didn’t see the win? Look at the error message again: 33 rows processed. Yeah, that’s out of 96,000 rows, but it’s more than 1, which means 32 rows processed successfully. Now, about that “UTF8 decode failed.” The last time I got that, it was because I had bcp’d out to a non-character format. That suggests something in the data is giving PolyBase more problems. Back to SQL Server to see what’s up with this data (because everything worked perfectly in Excel this time). And... found it. This is the problem:
‘å’
I’m using bcp. It writes out a UTF-16 file. PolyBase supports the Hadoop standard of UTF-8. That must be what’s blocking it… Great. There it is in black & white:
bcp does not support exporting to UTF-8. This may prevent using PolyBase on bcp exported data
Fine. PowerShell to the rescue:
Get-ChildItem -Recurse -Filter *.rtf | ForEach-Object { (Get-Content $_.FullName -Encoding Unicode) | Set-Content $_.FullName -Encoding UTF8 }
This time I processed 197 rows before I found another problem:
Query aborted– the maximum reject threshold (0 rows) was reached while reading from an external source: 1 rows rejected out of total 197 rows processed.
(/data/Account/Account.rtf)Column ordinal: 110, Expected data type: INT NOT NULL, Offending value: |45178|1043866 (Tokenization failed), Error: Too many columns in the line.
What follows is a tedious and dull march through various & sundry text columns that, for some strange reason, have a pipe (|) character in them (why the phone number, by Odin’s Eye Patch, why the phone number?). Suddenly… I have 96,000 rows returning from my COUNT(*) query (run from Visual Studio; SSMS was twitchy when connected to ADW). Run a CREATE TABLE AS SELECT (CTAS) query, roughly like the sketch below, and I have a new table. Ta-da.
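For the record, CTAS in Azure SQL Data Warehouse wants a distribution option spelled out. What I ran looked roughly like this, with illustrative table names and ROUND_ROBIN chosen purely because it’s the simplest option, not because it’s the right one for this data:
CREATE TABLE dbo.Account
WITH (DISTRIBUTION = ROUND_ROBIN)
AS
SELECT *
FROM dbo.Account_External;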
Let’s bottom line this. Data is dirty. Any ETL process is likely to expose the trashy data, so what are the real issues here?
- First up, PowerShell is my best-est buddy. Scripting everything through Posh right out of the gate made a huge difference in my ability to constantly reset.
- The fact that our most common processes spit out UTF-16 isn’t a problem in itself, but it is something you need to get on top of (see Posh above).
- AzCopy works really well and the command line is easy to implement, but you’ll need to go through the process of setting up all the appropriate connections.
- PolyBase is wicked cool. Yes, I had to hop through hoops to get the data clean, but note that at no point did I need to recreate my EXTERNAL TABLE in Azure SQL Data Warehouse. I just kept updating the files through my scripts, and as soon as they were loaded up to blob storage, I ran a query against them. No additional work required. That’s nice.
- You’re going to want to go through your data types before you start the migration (a starter query follows this list).
- You’re going to have to plan for a lot of time cleaning your data if you haven’t already.
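Purely as a starting point for that data type review, and assuming you check Microsoft’s documentation for the current list of unsupported types rather than trusting my memory, a quick pass over sys.columns in the source database will flag the obvious offenders (the type list here is illustrative, not exhaustive):
SELECT s.name AS schema_name, t.name AS table_name, c.name AS column_name, ty.name AS data_type
FROM sys.columns AS c
JOIN sys.tables AS t ON t.object_id = c.object_id
JOIN sys.schemas AS s ON s.schema_id = t.schema_id
JOIN sys.types AS ty ON ty.user_type_id = c.user_type_id
WHERE ty.name IN ('text', 'ntext', 'image', 'xml', 'sql_variant', 'geography', 'geometry', 'hierarchyid');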
In summary, this process is well documented, but it has a lot of moving parts and lots of places for things to break down. If you’re like me and ETL is a distant memory, plan for a very steep learning curve. If you’re good at ETL but not so hot at automation, your mouse-clicking finger is going to get tired. Microsoft has a decent overview of the whole process in its documentation.
Now, I need to get started cleaning the data on the next table… pity me.