I have talked about human filters and my plan for digital curation. These items are the fruits of those ideas, the items I deemed worthy from my Google Reader feeds. These items are a combination of tech business news, development news and programming tools and techniques.

- Getting Visual: Your Secret Weapon For Storytelling & Persuasion (The Future Buzz)
- My Clojure Workflow, Reloaded (Hacker News)
- Replacing Clever Code with Unremarkable Code in Go (Hacker News)
- Unit Test like a Secret Agent with Sinon.js (Web Dev .NET)
- Bliki: EmbeddedDocument (Martin Fowler)
- How we use ZFS to back up 5TB of MySQL data every day (Royal Pingdom)
- IBM to acquire SoftLayer for a rumored $2-2.5 billion (Hacker News)
- Cloud SQL API: YOU get a database! And YOU get a database! And YOU get a database! (Cloud Platform Blog)
- You Should Write Ugly Code (Hacker News)
- How many lights can you turn on? (The Endeavour)
- Python Big Picture — What's the "roadmap"? (S.Lott-Software Architect)
- Salesforce announces deal to buy digital marketing firm ExactTarget for $2.5 billion (The Next Web)
- Dew Drop – June 4, 2013 (#1,560) (Alvin Ashcraft's Morning Dew)
- New Technologies Change the Way we Engage with Culture (Conversation Agent)
- Free Python ebook: Bayesian Methods for Hackers (Hacker News)
- How Go uses Go to build itself (Hacker News)
- Sustainable Automated Testing (Javalobby – The heart of the Java developer community)
- Breaking Down IBM's Definition of DevOps (Javalobby – The heart of the Java developer community)
- Big Data is More than Correlation and Causality (Javalobby – The heart of the Java developer community)
- So, What's in a Story? (Agile Zone – Software Methodologies for Development Managers)
- The Real Lessons of Lego (for Software) (Agile Zone – Software Methodologies for Development Managers)
- The Daily Six Pack: June 4, 2013 (Dirk Strauss)
- Get your mobile application backed by the cloud with the Mobile Backend Starter (Cloud Platform Blog)
- Open for Big Data: When Mule Meets the Elephant (Javalobby – The heart of the Java developer community)

I hope you enjoy today's items, and please participate in the discussions on those sites.
I have talked about human filters and my plan for digital curation. These items are the fruits of those ideas, the items I deemed worthy from my Google Reader feeds. These items are a combination of tech business news, development news and programming tools and techniques.

- Real-Time Ad Impression Bids Using DynamoDB (Amazon Web Services Blog)
- The mother of all M&A rumors: AT&T, Verizon to jointly buy Vodafone (GigaOM)
- Is this the future of memory? A Hybrid Memory Cube spec makes its debut. (GigaOM)
- Dew Drop – April 2, 2013 (#1,518) (Alvin Ashcraft's Morning Dew)
- Rosetta Stone acquires Livemocha for $8.5m to move its language learning platform into the cloud (The Next Web)
- Double Shot #1098 (A Fresh Cup)
- Extending git (Atlassian Blogs)
- A Thorough Introduction To Backbone.Marionette (Part 2) (Smashing Magazine Feed)
- 60 Problem Solving Strategies (Javalobby – The heart of the Java developer community)
- Why asm.js is a big deal for game developers (HTML5 Zone)
- Implementing DAL in Play 2.x (Scala), Slick, ScalaTest (Javalobby – The heart of the Java developer community)
- "It's Open Source, So the Source is, You Know, Open." (Javalobby – The heart of the Java developer community)
- How to Design a Good, Regular API (Javalobby – The heart of the Java developer community)
- Scalding: Finding K Nearest Neighbors for Fun and Profit (Javalobby – The heart of the Java developer community)
- The Daily Six Pack: April 2, 2013 (Dirk Strauss)
- Usually When Developers Are Mean, It Is About Power (Agile Zone – Software Methodologies for Development Managers)
- Do Predictive Modelers Need to Know Math? (Data Mining and Predictive Analytics)
- Heroku Forces Customer Upgrade To Fix Critical PostgreSQL Security Hole (TechCrunch)
- DYNAMO (Lambda the Ultimate – Programming Languages Weblog)
- FitNesse your ScalaTest with custom Scala DSL (Java Code Geeks)
- LinkBench: A database benchmark for the social graph (Facebook Engineering's Facebook Notes)
- Khan Academy Checkbook Scaling to 6 Million Users a Month on GAE (High Scalability)
- Famo.us, The Framework For Fast And Beautiful HTML5 Apps, Will Be Free Thanks To "Huge Hardware Vendor Interest" (TechCrunch)
- Why We Need Lambda Expressions in Java – Part 2 (Javalobby – The heart of the Java developer community)

I hope you enjoy today's items, and please participate in the discussions on those sites.
In my last post, I described a new feature in TokuMX 1.5—partitioned collections—that's aimed at making it easier and faster to work with time series data. Feedback from that post made me realize that some users may not immediately understand the differences between partitioning a collection and sharding a collection. In this post, I hope to clear that up.

On the surface, partitioning a collection and sharding a collection seem similar. Both actions take a collection and break it into smaller pieces for some performance benefit, and the terms are sometimes used interchangeably when discussing other technologies. But in TokuMX, the two features are very different in purpose and implementation. By describing each feature's purpose and implementation, I hope to clarify the differences between them.

Let's address sharding first. The purpose of sharding is to distribute a collection across several machines (i.e., "scale out") so that writes and queries on the collection will be distributed. The main idea is that for big data, a single machine can only do so much. No matter how powerful your one machine is, it will still be limited by some resource, be it IOPS, CPU, or disk space. So, to get better performance for a collection, one can use sharding to distribute the collection across several machines, and thereby improve performance by adding hardware. To accomplish this, a sharded collection ought to have a relatively even distribution across shards. Therefore, it should have the following properties:

- Users' writes ought to be distributed amongst machines (or shards). After all, if all writes are targeted at a single shard, then they are not distributed and we are not scaling.
- To keep the data distribution relatively even, background processes migrate data between shards if a shard is found to have too much or too little data.

Because of these properties, each shard contains a random subset of the collection's data.

Now let's address partitioning. The purpose of partitioning is to break the collection into smaller collections so that large chunks of data may be removed very efficiently. A typical example is keeping a rolling period of 6 months of log data for a website. Another example is keeping the last 14 days of oplog data, as we do via partitioning in TokuMX 1.4. In such examples, typically only one partition (the latest one) is receiving new data. Periodically, but infrequently, we drop the oldest partition to reclaim space. For the log data example, once a month we may drop a month's worth of data. For the oplog, once a day we drop a day's worth of data. For these tasks, we are not concerned with load distribution, as nearly all writes typically go to the last partition, and we are not spreading partitions across machines. With partitioning, each partition holds a contiguous range of the data (e.g., all data from the month of February), whereas with sharding, each shard holds small random chunks of data from across the key space.

With all this being said, there are still similarities when thinking about schema design for a partitioned collection and a sharded collection. As I touched on in my last post, designing a partition key has similarities to designing a shard key as far as queries are concerned. Queries on a sharded collection perform better if they target single shards. Similarly, queries on a partitioned collection perform better if they target a single partition.
Queries that don’t can be thought of as “scatter/gather” for both sharded and partitioned collections. Hopefully this illuminates the difference between a partitioned collection and a sharded collection.
See also:

- Part I: When to Build Your Data Warehouse
- Part II: Building a New Schema
- Part III: Location of Your Data Warehouse
- Part IV: Extraction, Transformation, and Load

In Part I we looked at the advantages of building a data warehouse independent of cubes/a BI system, and in Part II we looked at how to architect a data warehouse's table schema. In Part III we looked at where to put the data warehouse tables, and in Part IV we looked at how to populate those tables and keep them in sync with your OLTP system. Today, in the last part of this series, we will take a quick look at the benefits of building the data warehouse before we need it for cubes and BI by exploring our reporting and other options.

As I said in Part I, you should plan on building your data warehouse when you architect your system up front. Doing so gives you a platform for building reports, or even applications such as web sites, off the aggregated data. As I mentioned in Part II, it is much easier to build a query and a report against the rolled-up table than against the OLTP tables.

To demonstrate, I will make a quick pivot table using SQL Server 2008 R2 PowerPivot for Excel (or just PowerPivot for short!). I have shown how to use PowerPivot before on this blog; however, I was usually going against a SQL Server table, a SQL Azure table, or an OData feed. Today we will use a SQL Server table, but rather than build a PowerPivot against the OLTP data of Northwind, we will use our new rolled-up fact table.

To get started, I will open up PowerPivot and import data from the data warehouse I created in Part II. I will pull in the time, employee, and product dimension tables as well as the fact table. Once the data is loaded into PowerPivot, I am going to launch a new PivotTable. PowerPivot understands the relationships between the dimension and fact tables and places the tables in the designer shown below.

I am going to drag some fields into the boxes on the PowerPivot designer to build a powerful and interactive pivot table. For rows I will choose the category and product hierarchy and sum on the total sales. I'll make the columns (or pivot on this field) the month from the time dimension to get a sum of sales by category/product by month. I will also drag year and quarter into my vertical and horizontal slicers for interactive filtering. Lastly, I will place the employee field in the report filter pane, giving the user the ability to filter by employee. The results look like this: I am dynamically filtering by 1997, third quarter, and employee name Janet Leverling.

This is a pretty powerful interactive report built in PowerPivot using the four data warehouse tables. If there were no data warehouse, this pivot table would have been very hard for an end user to build. Either they or a developer would have to perform joins to get the category and product hierarchy, as well as more joins to get the order details and the sum of the sales. In addition, the breakout and dynamic filtering by year and quarter, and the display by month, are only possible because of the DimTime table, so if there were no data warehouse tables, the user would have had to parse out those dateparts. Just about the only thing the end user could have done without assistance from a developer or a sophisticated query is the employee filter (and even that would have taken some PowerPivot magic to display the employee name, unless the user did a join).
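For readers who would rather see the same rollup outside of Excel, here is a small pandas sketch of the pivot that PowerPivot is building above. The table and column names (fact_sales, dim_product, dim_time, and their fields) are invented stand-ins rather than the actual Part II schema.

```python
import pandas as pd

# Hypothetical rolled-up tables standing in for the Part II schema; the real
# table and column names there may differ.
fact_sales = pd.DataFrame({
    "product_key": [1, 1, 2, 2],
    "time_key":    [199707, 199708, 199707, 199708],
    "total_sales": [1200.0, 900.0, 450.0, 600.0],
})
dim_product = pd.DataFrame({
    "product_key": [1, 2],
    "category":    ["Beverages", "Condiments"],
    "product":     ["Chai", "Aniseed Syrup"],
})
dim_time = pd.DataFrame({
    "time_key": [199707, 199708],
    "year":     [1997, 1997],
    "quarter":  [3, 3],
    "month":    ["Jul", "Aug"],
})

# Join the fact table to its dimensions, then pivot: rows are the
# category/product hierarchy, columns are months, values are summed sales.
flat = fact_sales.merge(dim_product, on="product_key").merge(dim_time, on="time_key")
report = pd.pivot_table(
    flat[flat["quarter"] == 3],            # the "slicer": Q3 1997 only
    index=["category", "product"],
    columns="month",
    values="total_sales",
    aggfunc="sum",
)
print(report)
```

Because the heavy lifting (the joins and the sum) was already done when the fact and dimension tables were loaded, the reporting step reduces to a single pivot, which is exactly the point of building the warehouse up front.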
Of course, pivot tables are not the only thing you can create from the data warehouse tables: you can create reports, ad hoc query builders, web pages, and even an Amazon-style browse application. (Amazon uses its data warehouse to display inventory and OLTP to take your order.) I hope you have enjoyed this series. Enjoy your data warehousing.
In Part I we looked at the advantages of building a data warehouse independent of cubes/a BI system, and in Part II we looked at how to architect a data warehouse's table schema. Today we are going to look at where to put your data warehouse tables.

Let's look at the location of your data warehouse. Usually, as your system matures, it follows this pattern:

- Segmenting your data warehouse tables into their own isolated schema inside of the OLTP database
- Moving the data warehouse tables to their own physical database
- Moving the data warehouse database to its own hardware

When you bring a new system online, or start a new BI effort, you can keep things simple by putting your data warehouse tables inside of your OLTP database, just segregated from the other tables. You can do this in a variety of ways, most easily by using a database schema (i.e., dbo); I usually use dwh as the schema. This way it is easy for your application to access these tables, as well as to fill them and keep them in sync. The advantage of this is that your data warehouse and OLTP system are self-contained and it is easy to keep the systems in sync.

As your data warehouse grows, you may want to isolate it further and move it to its own database. This will add a small amount of complexity to the load and synchronization; however, moving the data warehouse tables to their own database brings some benefits that make the move worth it. The benefits include implementing a separate security scheme. This is also very helpful if your OLTP database schema locks down all of the tables and will not allow SELECT access, and you don't want to create new users and roles just for the data warehouse. In addition, you can implement a separate backup and maintenance plan, so your data warehouse tables, which tend to be larger, don't slow down your OLTP backup (and potential restore!). If you only load data at night, you can even make the data warehouse database read-only. Lastly, while minor, you will have less table clutter, making it easier to work with.

Once your system grows even further, you can isolate the data warehouse onto its own hardware. The benefits of this are huge: you have less I/O contention with the OLTP system on the database server. Depending on your network topology, you can reduce network traffic. You can also load up on more RAM and CPUs. In addition, you can consider different RAID array techniques for the OLTP and data warehouse servers (OLTP would be better with RAID 5, data warehouse RAID 1).

Once you move your data warehouse to its own database or its own database server, you can also start to replicate the data warehouse. For example, let's say that you have an OLTP system that is used worldwide, but you have management in offices in different parts of the world. You can reduce network traffic by having all reporting (and what else do managers do??) run on a local network against a local data warehouse. This only works if you don't have to update the data warehouse more than a few times a day.

Where you put your data warehouse is important; I suggest that you start small and work your way up as your needs dictate.
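As a hedged illustration of the first stage of that progression, the Python/pyodbc sketch below creates a dwh schema inside the OLTP database, puts one warehouse table in it, and grants read-only access through a separate role. The connection string, table definition, and role name are placeholders for illustration, not part of the original series.

```python
import pyodbc

# Connection string is a placeholder; point it at your own SQL Server instance.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;"
    "DATABASE=Northwind;Trusted_Connection=yes;",
    autocommit=True,
)
cur = conn.cursor()

# Carve the warehouse tables out into their own schema inside the OLTP database.
cur.execute("CREATE SCHEMA dwh")

# Example warehouse table living in that schema (columns are illustrative).
cur.execute("""
    CREATE TABLE dwh.FactSales (
        ProductKey INT NOT NULL,
        TimeKey    INT NOT NULL,
        TotalSales DECIMAL(18, 2) NOT NULL
    )
""")

# A separate, read-only security scheme for reporting users: they can SELECT
# anything in dwh without being granted rights on the OLTP tables.
cur.execute("CREATE ROLE reporting_reader")
cur.execute("GRANT SELECT ON SCHEMA::dwh TO reporting_reader")
```

Later, when the warehouse moves to its own database or server, the application code only needs a new connection string; the schema-qualified names stay the same.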
Functions are one of the most important aspects of JavaScript. This article will explore the top nine commonly used JavaScript functions with examples.
Learn how to build a modern data stack with cloud-native technologies, such as data warehouses, data lakes, and data streaming, to solve business problems.
The Decorator pattern is a great fit for modifying the behaviour of a microservice. Native language support can help with applying it quickly and modularly.
Discover the key differences between Databricks and Snowflake around architecture, pricing, security, compliance, data support, data protection, performance, and more.
Learn what data ingestion is, why it matters, and how you can use it to power your analytics and activate your data as an essential part of the modern data stack.
The widely adopted SCM tools we use today, GitHub and GitLab, are built on the dated architecture and design of git, and that design has some security gaps we'll explore.