How Much Is One Terabyte of Data?
Assuming each record takes up 100 bytes, or 0.1 KB, one terabyte of storage can hold about ten billion records.
A one-mile distance doesn't seem very long, and a cubic mile doesn't seem that big compared with the size of the earth. Yet you may be surprised to learn that the entire world's population could fit into a cubic mile of space. The claim is not mine; Hendrik Willem van Loon, a Dutch-American writer, once made it in one of his books.
Teradata is a famous data warehouse product. When the brand was named over 30 years ago, it was meant to impress people with the product's ability to handle massive amounts of data. Today, TB is already the smallest unit many database vendors use when talking about the amount of data they can handle, and PB, or even ZB, comes up often. It can seem as if TB is no longer a big unit, and that hundreds of terabytes, or even a petabyte, of data is not intimidating at all.
In fact, one TB, like one cubic mile, is rather large. Since many people have little intuitive grasp of its size, let's take a different angle and examine what 1 TB of data means to a database.
Databases mainly process structured data, among which ever-growing transaction data takes up the most space. Each piece of transaction data isn't big, from dozens of bytes to about one hundred bytes when only the key information is stored. For example, a banking transaction record includes only the account, date, and amount, and a telecom company's call record contains only the phone numbers, time, and duration. Suppose each record occupies 100 bytes, or 0.1 KB; a terabyte of storage can then accommodate about ten billion records.
What does this mean? There are about 30 million seconds in a year. To accumulate one terabyte of data in a year, roughly 300 records must be generated per second, around the clock!
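Here is a minimal back-of-envelope sketch in Python to verify those figures; the 100-byte record size and the decimal definition of a terabyte are the assumptions stated above:

```python
# Back-of-envelope check: how many 100-byte records fit in 1 TB,
# and how many records per second are needed to accumulate that in a year.

RECORD_SIZE_BYTES = 100              # assumed size of one transaction record
TERABYTE = 10**12                    # 1 TB in bytes (decimal convention)
SECONDS_PER_YEAR = 365 * 24 * 3600   # roughly 31.5 million seconds

records_per_tb = TERABYTE // RECORD_SIZE_BYTES
records_per_second = records_per_tb / SECONDS_PER_YEAR

print(f"Records in 1 TB: {records_per_tb:,}")                                   # 10,000,000,000
print(f"Records per second to fill 1 TB in a year: ~{records_per_second:.0f}")  # ~317
```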
That isn't a ridiculously large number. In a large country like the U.S., the businesses of national telecom operators, national banks, and internet giants easily reach that scale. But for city-wide, or even some state-wide, institutions, it really is a big number. It is improbable that the tax information collected by a local tax bureau, the purchase data of a local chain store, or the transaction data of a city commercial bank grows by 300 records per second. Besides, many organizations generate data only during business hours or on weekdays. To accumulate dozens, or even one hundred, terabytes of data, the business volume would have to be one or two orders of magnitude bigger.
Talking about data on a TB scale in the abstract is hard to make sense of. But by translating it into the corresponding business volume per second, we get a clear idea of what it means.
On the other hand, some not-so-large organizations also report data volumes ranging from hundreds of terabytes up to a petabyte. How does that happen?
A single piece of unstructured audio or video data can be several, or even dozens of, megabytes in size. With such data it is easy to reach the PB level, but the database won't compute over it.
The different information systems of an organization may together accumulate a huge volume of data over N years, with one contributing 200 GB per year, another 50 GB per year, and so on. On top of that, there are redundant intermediate computing results. Put them all together, and the total can reach hundreds of terabytes, or even a petabyte. The data may all be stored in the database, but generally it won't all be used at once in the same computing task.
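To see how storage can pile up to that scale without any single task ever touching it all, here is a purely illustrative accumulation model; every figure in it is a hypothetical assumption, not a measurement:

```python
# Illustrative accumulation model: many systems, each growing by some amount
# per year, kept for N years, plus redundant copies and intermediate results.
# All figures are hypothetical.

GB = 10**9
TB = 10**12

num_systems = 50              # assumed number of information systems
avg_growth_gb_per_year = 200  # assumed average yearly growth per system
years = 15                    # assumed retention period
redundancy_factor = 3         # assumed multiplier for copies / intermediate results

raw_total = num_systems * avg_growth_gb_per_year * years * GB
with_redundancy = raw_total * redundancy_factor

print(f"Raw accumulation:  {raw_total / TB:.0f} TB")        # 150 TB
print(f"With redundancy:   {with_redundancy / TB:.0f} TB")  # 450 TB
```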
If a machine collects data automatically, or if the data records user behavior, it is normal to generate hundreds or even tens of thousands of records per second, and the total volume may reach hundreds of terabytes or even the PB level. In that case, the database does need to handle TB-level data or more. Yet this type of trivial data is of little use and has very simple computing logic; basically, we just need to find and retrieve the relevant records.
Now let's look at what a database that can process TB-level data looks like. Some database vendors claim that their products can handle TB-level, or even PB-level, data in seconds, and that is what users often expect. But is it true?
To process data, we need to read it through at least once. A high-speed SSD reads about 300 megabytes of data per second (the figures hard disk manufacturers quote cannot be fully achieved under the operating system). At that speed it takes over 3,000 seconds, nearly an hour, just to retrieve one terabyte of data without performing any other operations. So how can one TB of data be processed in seconds? Simply by putting roughly 1,000 hard disks to work in parallel, so that one TB of data can be retrieved in about three seconds.
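The same estimate, sketched in Python; the 300 MB/s figure is the assumed effective sequential read speed from above, and random I/O, network, and writeback are ignored:

```python
# Scan-time estimate for a full read of 1 TB, assuming 300 MB/s per disk.

MB = 10**6
TERABYTE = 10**12
READ_SPEED = 300 * MB   # assumed effective sequential read speed in bytes/second

single_disk_seconds = TERABYTE / READ_SPEED
print(f"One disk, full 1 TB scan: ~{single_disk_seconds:.0f} s "
      f"(~{single_disk_seconds / 3600:.1f} h)")                                 # ~3333 s, ~0.9 h

target_seconds = 3
disks_needed = TERABYTE / (READ_SPEED * target_seconds)
print(f"Disks needed to scan 1 TB in {target_seconds} s: ~{disks_needed:.0f}")  # ~1111
```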
That is an idealized estimate. In reality, data is unlikely to be stored in neat order, and performance becomes terrible when data is read from disk discontinuously. Obviously, 1,000 hard disks won't fit in one machine, and a cluster brings network latency. Some computations, such as sorting and join operations, involve writeback, and instant query requests often arrive concurrently. Considering all these factors, it is not surprising that processing becomes several times slower.
Now we know that one terabyte of data means several hours, or 1,000 hard disks. And as we said, this is just one terabyte; you can imagine what dozens or one hundred terabytes of data will bring.
You can also appreciate how hard it is to move one TB of data if you have ever transferred large files over a network. Often the quickest way is to carry the hard disks away physically. This, too, gives us a feel for the size of one TB of data.
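For a rough sense of why, here is a small sketch of the transfer time for 1 TB over typical network links; the link speeds are assumptions for illustration, and real-world throughput is usually lower than the nominal bandwidth:

```python
# Rough transfer-time estimate for moving 1 TB over a network link.

TERABYTE = 10**12  # bytes

for label, mbit_per_s in [("100 Mbps broadband", 100), ("1 Gbps LAN", 1000)]:
    bytes_per_s = mbit_per_s * 10**6 / 8   # convert megabits/s to bytes/s
    hours = TERABYTE / bytes_per_s / 3600
    print(f"{label}: ~{hours:.0f} hours")  # ~22 h and ~2 h respectively
```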
In practice, most computing tasks of most users involve data volumes ranging from dozens to hundreds of gigabytes at most; they rarely reach the TB level. Yet a distributed database can still take several hours to process even this amount of data. A review of the very slow tasks you have handled may confirm it. The computing logic may be complex, and repeated traversals and writebacks are not uncommon. A typical production distributed database has only several to a dozen nodes, so it is almost impossible to build an environment of thousands of hard disks. In this light, it is not surprising at all that a computation takes several hours, which is almost normal for batch-processing tasks in the finance industry.
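An illustrative estimate of such a batch job is sketched below; every parameter is an assumption chosen to show the order of magnitude, not a benchmark result:

```python
# Illustrative runtime estimate for a batch job that traverses a few hundred GB
# several times on a small cluster. All parameters are assumptions.

GB = 10**9

data_size = 500 * GB               # assumed working set of the task
passes = 10                        # assumed traversals, including writebacks for sort/join
nodes = 8                          # assumed cluster size
per_node_throughput = 30 * 10**6   # assumed effective bytes/s per node after
                                   # random I/O, shuffle, and concurrency overhead

seconds = data_size * passes / (nodes * per_node_throughput)
print(f"Estimated runtime: ~{seconds / 3600:.1f} hours")   # ~5.8 hours
```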
Even in the large, top-tier organizations that hold N petabytes in total and run thousands, or even tens of thousands, of nodes in their computing centers, most computing tasks still involve only dozens to hundreds of gigabytes of data, and perhaps about ten virtual machines will be allocated to any given task. A large organization has too many workloads to take care of; it cannot devote all of its resources to one task.
PB-level data does exist in many organizations, but it is mainly a storage concept rather than a computing requirement. A PB-level total data volume forces the database to be capable of handling PB-level data, but that is a consequence of the database's closedness. Existence does not imply justification. In fact, it is a terrible arrangement, which we will discuss later.
One TB of data is a huge volume for a database used for data analysis and computing; the name Teradata has not become outdated even today. A tool is significant if it can process TB-level data smoothly, for example by cutting processing time from several hours to several minutes and thus improving the user experience, or by simplifying a small-scale distributed environment down to a single machine so that operation and maintenance costs drop sharply. esProc SPL is such a tool.
For most user scenarios, pursuing the ability to process PB-level data is both unnecessary and impractical.