Best Practices for Picking PostgreSQL Data Types
In this article, learn which data types best suit your application when storing massive data volumes in PostgreSQL and Timescale.
Join the DZone community and get the full member experience.
Join For FreeWhen creating applications that store and analyze large amounts of data, such as time series, log data, or event-storing ones, developing a good and future-proof data model can be a difficult task. Determining the most appropriate data types to store the information depends on various factors, including the required precision of float-point values, the content of the values (such as text), compressibility, and query speed.
Choosing the right data types in PostgreSQL can significantly impact your database's performance and efficiency. In this article, we'll explore best practices for selecting data types that align with PostgreSQL's recommendations. Whether you're designing tables for a new project or optimizing an existing database, these insights will help you make informed decisions.
Basic Data Types
PostgreSQL offers a range of basic data types that cater to various use cases. While they are versatile, PostgreSQL's best practices advise some considerations:
Nullable Columns
Don't shy away from using nullable columns. PostgreSQL efficiently stores NULL values, making them almost cost-free. This is especially useful in medium and wide table layouts with numerous nullable columns. The same is true for Timescale’s custom compression. Due to storing data in a columnar format, empty row values are almost free when compressed
Boolean Values
A boolean value is a logical data type with one of two possible values: TRUE or FALSE. It is usually used to record decisions or states. You can choose between the boolean data type or using an integer (1 or 0). Both options are valid, with no significant compression advantages for booleans.
Floating-Point Values
Floating-point data types represent real numbers, most often decimal floating points of base ten. They store all kinds of information, such as percentages, measurements like temperature or CPU usage, or statistical values. There are two groups of floating-point numbers in PostgreSQL: float4 (a.k.a., real), float8 (double precision), and numeric. Float4 and float8 are recommended for floating-point numbers, and if you're using Timescale, they are optimized for compression. If precision is known, consider storing them as integers for better optimization. On the other hand, numeric, as an arbitrary precision-sized data type, isn’t optimized at all. Numeric isn’t recommended.
Integer Values
Integer data types, including serial data types, are well-suited for event counts and similar use cases. All integer types (int2, SmallInt, int4, Integer, int8, BigInt) are recommended data types. TimescaleDB excels in compressing these types. What is true for integers is also true for all serial data types (serial2, SmallSerial, serial4, Serial, serial8, BigSerial), as those are magical “aliases” for their integer counterparts, incorporating the automatic creation of sequences to fill in their values on insert. That said, they use their corresponding integer data types as column data types. Anyhow, the PostgreSQL best practices advise against using them and recommend using identity columns instead for anything PostgreSQL from version 10 onwards.
Timestamp, Time, and Date Values
Timestamps and time and date data types represent a specific point in time, some with more and some with less explicit information. All these data types have versions with and without timezone information attached (except date). When dealing with time-related data, consider these guidelines:
To store dates, you should consider using timestamptz with the time portion set to midnight in the necessary timezone (2023-01-01T00:00:00+00
) to prevent casting overhead when querying.
Likewise, you can use timestamptz to store a time value only. In this case, you encode the time portion to a specific date (such as 1970-01-01T15:01:44+04
) and cast the final value back into a time value. Alternatively, you can store the value as an integer by encoding the time into the (nano)seconds since midnight or any other encoding you can come up with.
Text Values
Text values are used to store textual values of arbitrary size. Those values can include detailed descriptions, log messages, and metric names or tags. The available data types include text, char(n), and varchar(n). PostgreSQL’s best practices advise against using char(n), as it will pad values shorter than n to that size and waste storage. It recommends using text instead.
For text values, keep it simple:
- Prefer text over char(n) or varchar(n) to prevent unnecessary padding.
- Deduplicate long text values by storing references in a separate table.
- TimescaleDB uses dictionary compression for text fields.
Byte Array (bytea) Values
Byte arrays (in PostgreSQL are represented by the data type bytea) store arbitrary large sequences of bytes, which may represent anything, from encoded machine state to binary data packets.
Byte arrays are less common but still have their place:
- TimescaleDB applies lz4-based compression to bytea values.
- For recurring large bytea values, consider deduplication and storage outside the hypertable.
Complex and Extension Data Types
When dealing with complex data, follow these guidelines:
- Structural Types (JSON and JSONB): Use JSONB for efficient storage and querying of complex data. Consider extracting large amounts of data into separate tables for performance.
- UUID: Be cautious when using UUIDs, as they can create locality issues with BTree indexes.
- Ltree: Use ltree for deduplicating recurring data.
- Hstore: hstore provides key-value pairs but is less commonly used than JSONB.
- PostGIS Data: PostGIS data types work well with TimescaleDB but don't receive specific optimizations.
Conclusion
Selecting the right data types in PostgreSQL is a critical aspect of database design and optimization. While specific recommendations may vary based on your project's requirements, this guide has provided valuable insights into which data types to embrace, which to be cautious of, and which to avoid. By following these best practices, you can ensure your PostgreSQL database performs efficiently and effectively.
Data Type Recommendations
- Great: Nullable Columns, Boolean Values (boolean or integer), Float4 and Float8, Integer Values, Text Values (text), Byte Array (bytea), JSONB, Ltree, PostGIS Data.
- Okay, but be wary of some: Timestamp, Time, and Date Values (without time zones), Text Values (char(n) and varchar(n)), and UUID.
- Avoid at any cost: Timestamp, Time, and Date Values (without time zones).
- Unknown (feel free to provide experience/feedback): XML, Hstore, PostGIS Data Optimization.
Published at DZone with permission of kelvin steve. See the original article here.
Opinions expressed by DZone contributors are their own.
Comments