Programming languages allow us to communicate with computers, and they operate like sets of instructions. There are numerous types of languages, including procedural, functional, object-oriented, and more. Whether you’re looking to learn a new language or trying to find some tips or tricks, the resources in the Languages Zone will give you all the information you need and more.
In this post, I'll explain how to provide a default value when querying an absent key in a hash map in different programming languages.

Java

Let's start with Java, my first professional programming language. In older versions, retrieving a value from a map required using the get() method:

Java
Map map = new HashMap();               //1
Object value = map.get(new Object());  //2
if (value == null) {
    value = "default";                 //3
}

1. Initialize an empty map.
2. Attempt to retrieve a non-existent key.
3. Assign a default value if the key is absent.

With Java 1.8, the Map interface introduced a more concise way to handle absent keys:

Java
var map = new HashMap<Object, String>();
var value = map.getOrDefault(new Object(), "default"); //1

1. Retrieve the value with a default in one step.

Kotlin

Kotlin provides several approaches to retrieve values from a map:

- get() and getOrDefault() function just like their Java counterparts.
- getValue() throws an exception if the key is missing.
- getOrElse() accepts a lambda to provide a default value lazily.

Kotlin
val map = mapOf<Any, String>()
val default = map.getOrDefault("absent", "default")      //1
val lazyDefault = map.getOrElse("absent") { "default" }  //2

1. Retrieve the default value.
2. Lazily evaluate the default value.

Python

Python is less forgiving than Java when handling absent keys — it raises a KeyError:

Python
map = {}
value = map['absent']  #1

1. Raises a KeyError.

To avoid this, Python offers the get() method:

Python
map = {}
value = map.get('absent', 'default')  #1

Alternatively, Python's collections.defaultdict allows setting a default for all absent keys:

Python
from collections import defaultdict
map = defaultdict(lambda: 'default')  #1
value = map['absent']

1. Automatically provide a default value for any absent key.

Ruby

Ruby's default behavior returns nil for absent keys:

Ruby
map = {}
value = map['absent']

For a default value, use the fetch method:

Ruby
map = {}
value = map.fetch('absent', 'default')  #1

1. Provide a default value for the absent key.

Ruby also supports a more flexible approach with closures:

Ruby
map = {}
value = map.fetch('absent') { |key| key }  #1

1. Return the queried key instead of a constant.

Lua

My experience with Lua is relatively new, having picked it up for Apache APISIX. Let's start with Lua's map syntax:

Lua
map = {}                    --1
map["a"] = "A"
map["b"] = "B"
map["c"] = "C"
for k, v in pairs(map) do   --2
    print(k, v)             --3
end

1. Initialize a new map.
2. Iterate over key-value pairs.
3. Print each key-value pair.

Fun fact: the syntax for tables is the same as for maps:

Lua
table = {}                     --1
table[0] = "zero"
table[1] = "one"
table[2] = "two"
for k, v in ipairs(table) do   --2
    print(k, v)                --3
end

1. Initialize a new map.
2. Loop over the pairs of key values.
3. Print the following:

1 one
2 two

Lua arrays start at index 1! ipairs begins iterating at index 1, so the entry stored at index 0 is silently skipped. We can mix and match indices and keys. The syntax is similar, but there's no difference between a table and a map. Indeed, Lua calls the data structure a table:

Lua
something = {}
something["a"] = "A"
something[1] = "one"
something["b"] = "B"
for k, v in pairs(something) do
    print(k, v)
end

The result is the following:

1 one
a A
b B

In Lua, absent keys return nil by default:

Lua
map = {}
value = map['absent']

To provide a default, Lua uses metatables and the __index metamethod:

Metatables allow us to change the behavior of a table. For instance, using metatables, we can define how Lua computes the expression a+b, where a and b are tables. Whenever Lua tries to add two tables, it checks whether either of them has a metatable and whether that metatable has an __add field.
If Lua finds this field, it calls the corresponding value (the so-called metamethod, which should be a function) to compute the sum. - Metatables and Metamethods

Each table in Lua may have its own metatable. As I said earlier, when we access an absent field in a table, the result is nil. This is true, but it is not the whole truth. Such access triggers the interpreter to look for an __index metamethod: if there is no such method, as usually happens, then the access results in nil; otherwise, the metamethod will provide the result. - The __index Metamethod

Here's how to use it:

Lua
table = {}                          --1
mt = {}                             --2
setmetatable(table, mt)             --3
mt.__index = function (table, key)  --4
    return key
end
default = table['absent']           --5

1. Create the table.
2. Create a metatable.
3. Associate the metatable with the table.
4. Define the __index function to return the absent key.
5. The __index function is called because the key is absent.

Summary

This post explored how to provide default values when querying absent keys across various programming languages. Here's a quick summary:

| Programming language | Scope: per call | Scope: per map | Value: static | Value: lazy |
|----------------------|-----------------|----------------|---------------|-------------|
| Java                 | ❎              | ❌             | ❎            | ❌          |
| Kotlin               | ❎              | ❌             | ❎            | ❎          |
| Python               | ❎              | ❎             | ❌            | ❎          |
| Ruby                 | ❎              | ❌             | ❎            | ❎          |
| Lua                  | ❌              | ❎             | ❎            | ❌          |
Rust is known for its robust type system and powerful trait-based abstractions, which allow developers to write safe, efficient, and expressive code. BTreeSet in Rust is a powerful data structure for maintaining a sorted collection of unique elements. It provides the guarantees of log(n) insertion, deletion, and lookup times while keeping the elements in a well-defined order. However, when the Ord and PartialOrd trait implementations for a type differ, it can lead to unpredictable and chaotic behavior. This article explores this subtle pitfall using a practical example. Understanding Ord and PartialOrd The Ord Trait The Ord trait in Rust enforces a total order on elements. It’s used by collections like BTreeSet to maintain a consistent ordering. When you implement Ord for a type, you’re defining a complete ordering, which ensures that any two elements can be compared, and the ordering will always make sense. The PartialOrd Trait PartialOrd allows for partial ordering, meaning that not all pairs of elements need to be comparable. It’s less strict than Ord, but in practice, many types that implement PartialOrd also implement Ord. Problems arise when these two implementations do not align, especially in data structures that rely on consistent ordering. The Chaos Example To demonstrate the issue, let’s consider a custom struct Chaos and implement both Ord and PartialOrd for it, but with different logic: #[derive(Debug, Eq, Hash, Copy, Clone)] struct Chaos(i32); impl PartialOrd for Chaos { fn partial_cmp(&self, other: &Self) -> Option<std::cmp::Ordering> { Some(self.0.cmp(&other.0).reverse()) // Reverse order for PartialOrd } } impl Ord for Chaos { fn cmp(&self, other: &Self) -> std::cmp::Ordering { self.0.cmp(&other.0) // Normal order for Ord } } impl PartialEq for Chaos { fn eq(&self, other: &Self) -> bool { self.0 == other.0 } } use std::collections::BTreeSet; fn main() { let mut set = BTreeSet::from([Chaos(1), Chaos(2), Chaos(3), Chaos(4)]); println!("Before insertion {:?}", set); set.insert(Chaos(0)); set.insert(Chaos(5)); println!("After insertion {:?}", set); } In this code, the Chaos struct has a simple integer as its sole field. However, the PartialOrd and Ord implementations are deliberately different: PartialOrd sorts the elements in descending order (reversed).Ord sorts the elements in ascending order (normal). Analyzing the Output When running the above code, the output is as follows: ❯ cargo run . Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.00s Running `target/debug/chaos .` Before insertion {Chaos(4), Chaos(3), Chaos(2), Chaos(1)} After insertion {Chaos(0), Chaos(4), Chaos(3), Chaos(2), Chaos(1), Chaos(5)} Initial State Before inserting any new elements, the set is initialized with the elements {Chaos(1), Chaos(2), Chaos(3), Chaos(4)}. Because the initialization uses PartialOrd, the elements are sorted in descending order: {Chaos(4), Chaos(3), Chaos(2), Chaos(1)} After Insertion When new elements (Chaos(0) and Chaos(5)) are inserted, the BTreeSet uses the Ord trait to maintain the order. Since Ord sorts in ascending order, the set is now partially sorted in descending order (from initialization) and partially in ascending order (from insertion): {Chaos(0), Chaos(4), Chaos(3), Chaos(2), Chaos(1), Chaos(5)} This is clearly chaotic and defies the expectations one might have for the behavior of a BTreeSet. Why This Matters: Real-World Implications In a real-world scenario, this mismatch between Ord and PartialOrd can lead to bugs that are hard to diagnose. 
For example, if your type’s sorting logic is critical for the correctness of your program, this inconsistency can lead to subtle errors that are only discovered much later, perhaps even in production. Best Practices When implementing Ord and PartialOrd for a type in Rust, it's essential to ensure consistency and avoid unnecessary complexity. By following these best practices, you can reduce the risk of bugs and maintain clean, maintainable code. 1. DRY: Reuse Logic to Ensure Consistency To avoid duplicating logic and ensure consistency between Ord and PartialOrd, implement cmp using the partial_cmp method. This approach not only adheres to the DRY principle but also guarantees that both traits share the same underlying comparison logic. impl PartialOrd for Chaos { fn partial_cmp(&self, other: &Self) -> Option<std::cmp::Ordering> { Some(self.0.cmp(&other.0).reverse()) // Reverse order for PartialOrd } } impl Ord for Chaos { fn cmp(&self, other: &Self) -> std::cmp::Ordering { match self.partial_cmp(&other) { Some(v)=>v, None=>std::cmp::Ordering::Greater } } } By centralizing the comparison logic, you reduce the likelihood of introducing discrepancies between Ord and PartialOrd, leading to more predictable and reliable behavior. 2. Test for Consistency After implementing Ord and PartialOrd, thoroughly test your type to ensure that it behaves consistently in all contexts. Write tests that specifically check whether the ordering is maintained correctly when using both traits in data structures like BTreeSet. Conclusion The interplay between Ord and PartialOrd is a subtle aspect of Rust’s type system, but one that can have significant consequences when not handled correctly. By understanding the potential pitfalls and following best practices, you can avoid the chaos that mismatched implementations can cause. Always ensure your ordering logic is consistent, and you’ll be able to harness the full power of Rust’s sorted collections without fear.
Roy Fielding created REST as his doctorate dissertation. After reading it, I would boil it down to three basic elements:

- A document that describes object state
- A transport mechanism to transmit the object state back and forth between systems
- A set of operations to perform on the state

While Roy was focused solely on HTTP, I don't see why another transport could not be used. Here are some examples:

- Mount a WebDAV share (WebDAV is an HTTP extension, so is still using HTTP). Copy a spreadsheet (.xls, .xlsx, .csv, .ods) into the mounted folder, where each row is the new/updated state. The act of copying into the share indicates the operation of upserting, the name of the file indicates the type of data, and the columns are the fields. The server responds with (document name)-status.(document suffix), which provides a key for each row, a status, and possibly an error message. In this case, it does not really make sense to request data.
- Use gRPC. The object transmitted is the document, HTTP is the transport, and the name of the remote method is the operation. Data can be both provided and requested.
- Use FTP. Similar to WebDAV, it is file-based. The PUT command is upserting, and the GET command is requesting. GET only provides a filename, so it generally provides all data of the specified type. It is possible to allow for special filenames that indicate a hard-coded filter to GET a subset of data.

Whenever I see REST implementations in the wild, they often do not follow basic HTTP semantics, and I have never seen any explanation given for this, just a bunch of varying opinions. None of those I found referenced the RFC. Most seem to figure that:

- POST = Create
- PUT = Update the whole document
- PATCH = Update a portion of a document
- GET = Retrieve the whole document

This is counter to what HTTP states regarding POST and PUT:

- PUT is "create" or "update". GET generally returns whatever was last PUT. If PUT creates, it MUST return 201 Created. If PUT updates, it MUST return 200 OK or 204 No Content. The RFC suggests the content for 200 OK of a PUT should be the status of the action. I think it would be OK, in the case of SQL, to return the new row from a select statement. This has the advantage that any generated columns are returned to the caller without having to perform a separate GET.
- POST processes a resource according to its own semantics. Older RFCs said POST is for subordinates of a resource. All versions give the example of posting an article to a mailing list; all versions say that if a resource is created, 201 Created SHOULD be returned.

I would argue that effectively what POST really means is:

- Any data manipulation except create, full/partial update, or delete
- Any operation that is not data manipulation, such as performing a full-text search for rows that match a phrase, or generating a GIS object to display on a map

The word MUST means your implementation is only HTTP compliant if you do what is stated. Using PUT only for updates obviously won't break anything, just because it isn't RFC compliant. If you provide clients that handle all the details of sending/receiving data, then what verbs get used won't matter much to the user of the client. I'm the kind of guy who wants a reason for not following the RFC. I have never understood the importance of separating create from update in REST APIs, any more than in web apps.
Think about cell phone apps like calendar appointments, notes, contacts, etc.:

- "Create" is hitting the plus icon, which displays a new form with empty or default values.
- "Update" is selecting an object and hitting the pencil icon, which displays an entry form with current values.
- Once the entry form appears, it works exactly the same in terms of field validations.

So why should REST APIs and web front ends be any different than cell phone apps? If it is helpful for phone users to get the same data entry form for "create" and "update," wouldn't it be just as helpful to API and web users?

If you decide to use PUT as "create" or "update", and you're using SQL as a store, most vendors have an upsert query of some sort. Unfortunately, that does not help to decide when to return 200 OK or 201 Created. You'd have to look at the information your driver provides when a DML query executes to find a way to distinguish insert from update for an upsert, or use another query strategy. A simple example would be to perform an update set ... where pk column = pk value. If one row was affected, then the row exists and was updated; otherwise, the row does not exist and an insert is needed. On Postgres, you can take advantage of the RETURNING clause, which can actually return anything, not just row data, as follows:

SQL
INSERT INTO <table> VALUES (...)
ON CONFLICT(<pk column>) DO UPDATE SET (...)
RETURNING (SELECT COUNT(<pk column>) FROM <table> WHERE <pk column> = <pk value>) exists

The genius of this is that:

- The subselect in the RETURNING clause is executed first, so it determines if the row exists before the INSERT ON CONFLICT UPDATE query executes.
- The result of the query is one column named "exists", which is 1 if the row existed before the query executed, 0 if it did not.
- The RETURNING clause can also return the columns of the row, including anything generated that was not provided.

You only have to figure out once how to detect whether an insert or an update occurred, and wrap that in a simple abstraction that all your PUTs can call to return 200 OK or 201 Created (a rough sketch of such an abstraction follows below). One nice benefit of using PUT as intended is that as soon as you see a POST you know for certain it is not retrieval or persistence, and conversely, you know to search for POST to find the code for any operation that is not retrieval or persistence. I think the benefits of using PUT and POST as described in the RFC outweigh whatever reasons people have for using them in a way that is not RFC-compliant.
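To make that concrete, here is a minimal, hypothetical sketch in Python of what such a PUT abstraction could look like, assuming a FastAPI service backed by asyncpg and a placeholder items table; it simply applies the RETURNING trick described above to choose between 200 OK and 201 Created, and is an illustration of the idea rather than a definitive implementation.

Python
# Hypothetical sketch: a PUT handler that upserts a row and returns 200 or 201.
# The "items" table, its columns, and the connection string are placeholders.
import asyncpg
from fastapi import FastAPI, Response, status
from pydantic import BaseModel

app = FastAPI()

UPSERT_SQL = """
INSERT INTO items (id, name) VALUES ($1, $2)
ON CONFLICT (id) DO UPDATE SET name = EXCLUDED.name
RETURNING id, name,
          (SELECT count(*) FROM items AS old WHERE old.id = $1) AS existed
"""

class Item(BaseModel):
    name: str

@app.on_event("startup")
async def startup() -> None:
    # Placeholder DSN; a real service would read this from configuration.
    app.state.pool = await asyncpg.create_pool("postgresql://user:pass@localhost/db")

@app.put("/items/{item_id}")
async def put_item(item_id: int, item: Item, response: Response) -> dict:
    row = await app.state.pool.fetchrow(UPSERT_SQL, item_id, item.name)
    # "existed" reflects the table state before the statement (per the trick above),
    # so it tells us whether this PUT updated (200) or created (201) the row.
    response.status_code = (
        status.HTTP_200_OK if row["existed"] else status.HTTP_201_CREATED
    )
    return {"id": row["id"], "name": row["name"]}

If the RETURNING trick is not available, the same status-code decision could instead be driven by the update-then-insert strategy mentioned earlier.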
Go, also known as Golang, has become a popular language for developing concurrent systems due to its simple yet powerful concurrency model. Concurrency is a first-class citizen in Go, making it easier to write programs that efficiently use multicore processors. This article explores essential concurrency patterns in Go, demonstrating how to leverage goroutines and channels to build efficient and maintainable concurrent applications. The Basics of Concurrency in Go Goroutines A goroutine is a lightweight thread managed by the Go runtime. Goroutines are cheap to create and have a small memory footprint, allowing you to run thousands of them concurrently. Go package main import ( "fmt" "time" ) func sayHello() { fmt.Println("Hello, Go!") } func main() { go sayHello() // Start a new goroutine time.Sleep(1 * time.Second) // Wait for the goroutine to finish } Channels Channels are Go's way of allowing goroutines to communicate with each other and synchronize their execution. You can send values from one goroutine to another through channels. Go package main import "fmt" func main() { ch := make(chan string) go func() { ch <- "Hello from goroutine" }() msg := <-ch fmt.Println(msg) } Don't communicate by sharing memory; share memory by communicating. (R. Pike) Common Concurrency Patterns Worker Pool Purpose To manage a fixed number of worker units (goroutines) that handle a potentially large number of tasks, optimizing resource usage and processing efficiency. Use Cases Task processing: Handling a large number of tasks (e.g., file processing, web requests) with a controlled number of worker threads to avoid overwhelming the system.Concurrency management: Limiting the number of concurrent operations to prevent excessive resource consumption.Job scheduling: Distributing and balancing workloads across a set of worker threads to maintain efficient processing. Example Go package main import ( "fmt" "sync" "time" ) // Worker function processes jobs from the jobs channel and sends results to the results channel func worker(id int, jobs <-chan int, results chan<- int, wg *sync.WaitGroup) { defer wg.Done() for job := range jobs { // Simulate processing the job fmt.Printf("Worker %d processing job %d\n", id, job) time.Sleep(time.Second) // Simulate a time-consuming task results <- job * 2 } } func main() { const numJobs = 15 const numWorkers = 3 jobs := make(chan int, numJobs) results := make(chan int, numJobs) var wg sync.WaitGroup // Start workers for w := 1; w <= numWorkers; w++ { wg.Add(1) go worker(w, jobs, results, &wg) } // Send jobs to the jobs channel for j := 1; j <= numJobs; j++ { jobs <- j } close(jobs) // Wait for all workers to finish go func() { wg.Wait() close(results) }() // Collect and print results for result := range results { fmt.Println("Result:", result) } } Fan-In Purpose To merge multiple input channels or data streams into a single output channel, consolidating results from various sources. Use Cases Log aggregation: Combining log entries from multiple sources into a single logging system for centralized analysis.Data merging: Aggregating data from various producers into a single stream for further processing or analysis.Event collection: Collecting events from multiple sources into one channel for unified handling. 
Example Go package main import ( "fmt" "sync" ) // Function to merge multiple channels into one func merge(channels ...<-chan int) <-chan int { var wg sync.WaitGroup merged := make(chan int) output := func(c <-chan int) { defer wg.Done() for n := range c { merged <- n } } wg.Add(len(channels)) for _, c := range channels { go output(c) } go func() { wg.Wait() close(merged) }() return merged } func worker(id int, jobs <-chan int) <-chan int { results := make(chan int) go func() { defer close(results) for job := range jobs { // Simulate processing fmt.Printf("Worker %d processing job %d\n", id, job) results <- job * 2 } }() return results } func main() { const numJobs = 5 jobs := make(chan int, numJobs) // Start workers and collect their result channels workerChannels := make([]<-chan int, 0, 3) for w := 1; w <= 3; w++ { workerChannels = append(workerChannels, worker(w, jobs)) } // Send jobs for j := 1; j <= numJobs; j++ { jobs <- j } close(jobs) // Merge results results := merge(workerChannels...) // Collect and print results for result := range results { fmt.Println("Result:", result) } } Fan-Out Purpose To distribute data or messages from a single source to multiple consumers, allowing each consumer to process the same data independently. Use Cases Broadcasting notifications: Sending notifications or updates to multiple subscribers or services simultaneously. Data distribution: Delivering data to multiple components or services that each needs to process or act upon the same information. Event handling: Emitting events to various handlers that perform different actions based on the event. Example Go package main import ( "fmt" "sync" "time" ) // Subscriber function simulates a subscriber receiving a notification func subscriber(id int, notification string, wg *sync.WaitGroup) { defer wg.Done() // Simulate processing the notification time.Sleep(time.Millisecond * 100) // Simulate some delay fmt.Printf("Subscriber %d received notification: %s\n", id, notification) } func main() { // List of subscribers (represented by IDs) subscribers := []int{1, 2, 3, 4, 5} notification := "Important update available!" var wg sync.WaitGroup // Broadcast notification to all subscribers concurrently for _, sub := range subscribers { wg.Add(1) go subscriber(sub, notification, &wg) } // Wait for all subscribers to receive the notification wg.Wait() fmt.Println("All subscribers have received the notification.") } Generator Purpose To produce a sequence of data or events that can be consumed by other parts of a system. Use Cases Data streams: Generating a stream of data items, such as log entries or sensor readings, that are processed by other components. Event emission: Emitting a series of events or notifications to be handled by event listeners or subscribers. Data simulation: Creating simulated data for testing or demonstration purposes. Example Go package main import ( "fmt" ) // Generator function that produces integers func generator(start, end int) <-chan int { out := make(chan int) go func() { for i := start; i <= end; i++ { out <- i } close(out) }() return out } func main() { // Start the generator gen := generator(1, 10) // Consume the generated values for value := range gen { fmt.Println("Received:", value) } } Pipeline Purpose To process data through a series of stages, where each stage transforms or processes the data before passing it to the next stage.
Use Cases Data transformation: Applying a sequence of transformations to data, such as filtering, mapping, and reducing.Stream processing: Handling data streams in a step-by-step manner, where each step performs a specific operation on the data.Complex processing workflows: Breaking down complex processing tasks into manageable stages, such as data ingestion, transformation, and output. Example Go package main import ( "fmt" ) func generator(nums ...int) <-chan int { out := make(chan int) go func() { for _, n := range nums { out <- n } close(out) }() return out } func sq(in <-chan int) <-chan int { out := make(chan int) go func() { for n := range in { out <- n * n } close(out) }() return out } func main() { c := generator(2, 3, 4) out := sq(c) for n := range out { fmt.Println(n) } } Conclusion Understanding and utilizing concurrency patterns in Go can significantly enhance the performance and efficiency of your applications. The language's built-in support for goroutines and channels simplifies the process of managing concurrent execution, making it an excellent choice for developing high-performance systems. You can fully utilize Go's concurrency model to build robust, scalable applications by mastering these patterns.
In this article, I will discuss how you can apply the Pareto principle to quickly learn a new programming language and start solving real-world problems while you develop a deeper understanding of the programming language. What Is the Pareto Principle? The Pareto principle, also known as the 80/20 rule, states that for many outcomes, roughly 80% of consequences come from 20% of causes. Applying this to a personal level, 80% of your work-related output could come from only 20% of your time. I first came to know about this principle after reading the book "The 80/20 Principle: The Secret to Achieving More with Less" written by Richard Koch. How to Apply the Pareto Principle to Quickly Learn a New Programming Language When I initially started to learn programming, I used inefficient methods to learn it. I was watching hours and hours of video courses and reading books trying to master all the concepts that ever existed in the programming language before attempting to solve any real-world problems. By doing this, I was losing motivation to continue to learn. Over time, I realized that this is not an efficient way to learn a new skill. Learning about the 80/20 rule made me realize that by learning around 20% of the concepts in a programming language I could solve 80% of the problems. I needed to learn a new programming language in a short period of time a couple of times. The first time, I was using a programming language at work that was not easy to use for attending interviews, and I wanted to switch to a new programming language for solving problems in technical interviews. The second time, I was in a new team that used a completely new programming language that I had never used in the past. I used the following 4-step approach which made it efficient to learn the new programming language while keeping me motivated to increase my skill level with the programming language. Step 1: Identify key concepts of the programming language. Identify key concepts such as data structures, flow control statements, functions, classes, etc.Step 2: Spend 20% of your effort to learn these key concepts. Pick up a book or a course, and focus on learning only the key concepts identified in Step 1.Step 3: Solve some real-life problems using these concepts. Depending on the purpose of learning, pick some real-life problems and try to solve them using the concepts that you learned in the 2 steps above. For example, if you are planning to do technical interviews, try to solve some problems from websites like LeetCode or HackerRank.Step 4: Learn additional concepts as you encounter them. If you are stuck solving the problem, search for how to solve this problem and learn the additional advanced concepts as you encounter them. What Are Some Important Programming Concepts? As an example, let's look at some of the core concepts of Python that can be quickly learned before attempting to solve some problems using Python: Data structures: Review important available data structures such as strings, lists, tuples, dictionaries, and sets.Loops: Python offers two types of loops - the "for" loop and the "while" loop. Also, understand how to use continue and break statements within the loops.Conditional statements: Understand how to use conditional statements such as if, else, and elif.Logical operators: Learn logical operators such as and, or, not, etc. 
Functions: Learn how to define functions, pass arguments to the functions, and return values from the functions.Classes: Learn how to create and use Classes.Important built-in functions: Try to learn important built-in functions such as range(), format(), max(), min(), len(), type(), sorted(), print(), round(), etc.Other concepts: Lambdas, list comprehensions Conclusion Learning a new programming language may look daunting but leveraging the Pareto principle will make it easier to learn it quickly by spending 20% of the time mastering important concepts such as data structures, loops, conditional statements, functions, and classes and applying the knowledge to solve 80% of real-life problems.
Anchors ^ $ \b \A \Z

Anchors in regular expressions allow you to specify the context in a string where your pattern should be matched. There are several types of anchors:

- ^ matches the start of a line (in multiline mode) or the start of the string (by default).
- $ matches the end of a line (in multiline mode) or the end of the string (by default).
- \A matches the start of the string.
- \Z or \z matches the end of the string.
- \b matches a word boundary (before the first letter of a word or after the last letter of a word).
- \B matches a position that is not a word boundary (between two letters or between two non-letter characters).

These anchors are supported in Java, PHP, Python, Ruby, C#, and Go. In JavaScript, \A and \Z are not supported, but you can use ^ and $ instead of them; just remember to keep the multiline mode disabled. For example, the regular expression ^abc will match the start of a string that contains the letters "abc". In multiline mode, the same regex will match these letters at the beginning of a line. You can use anchors in combination with other regular expression elements to create more complex matches. For example, ^From: (.*) matches a line starting with From:

The difference between \Z and \z is that \Z matches at the end of the string but also skips a possible newline character at the end. In contrast, \z is more strict and matches only at the end of the string.

If you have read the previous article, you may wonder if the anchors add any additional capabilities that are not supported by the three primitives (alternation, parentheses, and the star for repetition). The answer is that they do not, but they change what is captured by the regular expression. You can match a line starting with abc by explicitly adding the newline character: \nabc, but in this case, you will also match the newline character itself. When you use ^abc, the newline character is not consumed. In a similar way, ing\b matches all words ending with ing. You can replace the anchor with a character class containing non-letter characters (such as spaces or punctuation): ing\W, but in this case, the regular expression will also consume the space or punctuation character.

If the regular expression starts with ^ so that it only matches at the start of the string, it's called anchored. In some programming languages, you can do an anchored match instead of a non-anchored search without using ^. For example, in PHP (PCRE), you can use the A modifier. So the anchors don't add any new capabilities to the regular expressions, but they allow you to manage which characters will be included in the match or to match only at the beginning or end of the string. The matched language is still regular.

Zero-Width Assertions (?= ) (?! ) (?<= ) (?<! )

Zero-width assertions (also called lookahead and lookbehind assertions) allow you to check that a pattern occurs in the subject string without capturing any of the characters. This can be useful when you want to check for a pattern without moving the match pointer forward. There are four types of lookaround assertions:

| Assertion | Meaning |
|-----------|---------|
| (?=abc)   | The next characters are “abc” (a positive lookahead) |
| (?!abc)   | The next characters are not “abc” (a negative lookahead) |
| (?<=abc)  | The previous characters are “abc” (a positive lookbehind) |
| (?<!abc)  | The previous characters are not “abc” (a negative lookbehind) |

Zero-width assertions are generalized anchors. Just like anchors, they don't consume any character from the input string.
Unlike anchors, they allow you to check anything, not only line boundaries or word boundaries. So you can replace an anchor with a zero-width assertion, but not vice versa. For example, ing\b could be rewritten as ing(?=\W|$). Zero-width lookahead and lookbehind are supported in PHP, JavaScript, Python, Java, and Ruby. Unfortunately, they are not supported in Go. Just like anchors, zero-width assertions still match a regular language, so from a theoretical point of view, they don't add anything new to the capabilities of regular expressions. They just make it possible to skip certain things from the captured string, so you only check for their presence but don't consume them.

Checking Strings After and Before the Expression

The positive lookahead checks that there is a subexpression after the current position. For example, you need to find all div selectors with the footer ID and remove the div part:

| Search for | Replace to | Explanation |
|---|---|---|
| div(?=#footer) | | “div” followed by “#footer” |

(?=#footer) checks that there is the #footer string here, but does not consume it. In div#footer, only div will match. A lookahead is zero-width, just like the anchors. In div#header, nothing will match, because the lookahead assertion fails. Of course, this can be solved without any lookahead:

| Search for | Replace to | Explanation |
|---|---|---|
| div#footer | #footer | A simpler equivalent |

Generally, any lookahead after the expression can be rewritten by copying the lookahead text into a replacement or by using backreferences. In a similar way, a positive lookbehind checks that there is a subexpression before the current position:

| Search for | Replace to | Explanation |
|---|---|---|
| (?<=<a href=")news/ | blog/ | Replace “news/” preceded by “<a href="” with “blog/” |
| <a href="news/ | <a href="blog/ | The same replacement without lookbehind |

The positive lookahead and lookbehind lead to a shorter regex, but you can do without them in this case. However, these were just basic examples. In some of the following regular expressions, the lookaround will be indispensable.

Testing the Same Characters for Multiple Conditions

Sometimes you need to test a string for several conditions. For example, you want to find a consonant without listing all of them. It may seem simple at first: [^aeiouy] However, this regular expression also finds spaces and punctuation marks, because it matches anything except a vowel. And you want to match any letter except a vowel. So you also need to check that the character is a letter.

| Regex | Explanation |
|---|---|
| (?=[a-z])[^aeiouy] | A consonant |
| [bcdfghjklmnpqrstvwxz] | Without lookahead |

There are two conditions applied to the same character here: After (?=[a-z]) is checked, the current position is moved back because a lookahead has a width of zero: it does not consume characters, but only checks them. Then, [^aeiouy] matches (and consumes) one character that is not a vowel. For example, it could be H in HTML. The order is important: the regex [^aeiouy](?=[a-z]) will match a character that is not a vowel, followed by any letter. Clearly, it's not what is needed.

This technique is not limited to testing one character for two conditions; there can be any number of conditions of different lengths:

| Regex | Explanation |
|---|---|
| border:(?=[^;}]*\<solid\>)(?=[^;}]*\<red\>)(?=[^;}]*\<1px\>)[^;}]* | Find a CSS declaration that contains the words solid, red, and 1px in any order. |

This regex has three lookahead conditions. In each of them, [^;}]* skips any number of any characters except ; and } before the word. After the first lookahead, the current position is moved back and the second word is checked, etc.
The anchors \< and \> check that the whole word matches. Without them, 1px would match in 21px. The last [^;}]* consumes the CSS declaration (the previous lookaheads only checked the presence of words, but didn't consume anything). This regular expression matches {border: 1px solid red}, {border: red 1px solid;}, and {border:solid green 1px red} (different order of words; green is inserted), but doesn't match {border:red solid} (1px is missing).

Simulating Overlapped Matches

If you need to remove repeating words (e.g., replace the the with just the), you can do it in two ways, with and without lookahead:

| Search for | Replace to | Explanation |
|---|---|---|
| \<(\w+)\s+(?=\1\>) | | Replace the first of repeating words with an empty string |
| \<(\w+)\s+\1\> | \1 | Replace two repeating words with the first word |

The regex with lookahead works like this: the first parentheses capture the first word; the lookahead checks that the next word is the same as the first one. The two regular expressions look similar, but there is an important difference. When replacing 3 or more repeating words, only the regex with lookahead works correctly. The regex without lookahead replaces every two words. After replacing the first two words, it moves to the next two words because the matches cannot overlap. However, you can simulate overlapped matches with lookaround. The lookahead will check that the second word is the same as the first one. Then, the second word will be matched against the third one, etc. Every word that has the same word after it will be replaced with an empty string. The correct regex without lookahead is \<(\w+)(\s+\1)+\> It matches any number of repeating words (not just two of them).

Checking Negative Conditions

The negative lookahead checks that the next characters do NOT match the expression in parentheses. Just like a positive lookahead, it does not consume the characters. For example, (?!toves) checks that the next characters are not “toves” without including them in the match.

| Regex | Explanation |
|---|---|
| <\?(?!php) | “<?” without “php” after it |

This pattern will match <? in <?echo 'text'?> or in <?xml. Another example is an anagram search. To find anagrams for “mate”, check that the first character is one of M, A, T, or E. Then, check that the second character is one of these letters and is not equal to the first character. After that, check the third character, which has to be different from the first and the second one, etc.

| Regex | Explanation |
|---|---|
| \<([mate])(?!\1)([mate])(?!\1)(?!\2)([mate])(?!\1)(?!\2)(?!\3)([mate])\> | Anagram for “mate” |

The sequence (?!\1)(?!\2) checks that the next character is not equal to the first subexpression and is not equal to the second subexpression. The anagrams for “mate” are: meat, team, and tame. Certainly, there are special tools for anagram search, which are faster and easier to use. A lookbehind can be negative, too, so it's possible to check that the previous characters do NOT match some expression:

| Regex | Explanation |
|---|---|
| \w+(?<!ing)\b | A word that does not end with “ing” (the negative lookbehind) |

In most regex engines, a lookbehind must have a fixed length: you can use character lists and classes ([a-z] or \w), but not repetitions such as * or +. Aba is free from this limitation. You can go back by any number of characters; for example, you can find files not containing a word and insert some text at the end of such files.
| Search for | Replace to | Explanation |
|---|---|---|
| (?<!Table of contents.*)$ | <a href="/toc">Contents</a> | Insert the link to the end of each file not containing the words “Table of contents” |
| ^(?!.*Table of contents) | <a href="/toc">Contents</a> | Insert it to the beginning of each file not containing the words |

However, you should be careful with this feature because an unlimited-length lookbehind can be slow.

Controlling Backtracking

A lookahead and a lookbehind do not backtrack; that is, when they have found a match and another part of the regular expression fails, they don't try to find another match. It's usually not important, because lookaround expressions are zero-width. They consume nothing and don't move the current position, so you cannot see which part of the string they match. However, you can extract the matching text if you use a subexpression inside the lookaround. For example:

| Search for | Replace to | Explanation |
|---|---|---|
| (?=\<(\w+)) | \1 | Repeat each word |

Since lookarounds don't backtrack, this regular expression never matches:

| Regex | Explanation |
|---|---|
| (?=(\N*))\1\N | A regex that doesn't backtrack and always fails |
| \N*\N | A regex that backtracks and succeeds on non-empty lines |

The subexpression (\N*) matches the whole line. \1 consumes the previously matched subexpression and \N tries to match the next character. It always fails because the next character is a newline. A similar regex without lookahead succeeds because when the engine finds that the next character is a newline, \N* backtracks. At first, it has consumed the whole line (“greedy” match), but now it tries to match less characters. And it succeeds when \N* matches all but the last character of the line and \N matches the last character. It's possible to prevent excessive backtracking with a lookaround, but it's easier to use atomic groups for that.

In a negative lookaround, subexpressions are meaningless because if a regex succeeds, negative lookarounds in it must fail. So, the subexpressions are always equal to an empty string. It's recommended to use a non-capturing group instead of the usual parentheses in a negative lookaround.

| Regex | Explanation |
|---|---|
| (?!(a))\1 | A regex that always fails: (not A) and A |
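Since Python is one of the engines listed above that supports lookarounds, here is a small illustrative snippet (not from the original article) showing a few of these patterns with the re module; note that Python uses \b for word boundaries where the examples above use \< and \>.

Python
import re

# Positive lookahead: a consonant is "a letter" (checked, not consumed)
# that is also "not a vowel" (consumed).
print(re.findall(r"(?=[a-z])[^aeiouy]", "HTML", re.IGNORECASE))
# ['H', 'T', 'M', 'L']

# Lookahead for overlapped matches: drop every word repeated right after itself.
print(re.sub(r"\b(\w+)\s+(?=\1\b)", "", "the the the cat"))
# 'the cat'

# Negative lookahead: "<?" not followed by "php".
print(re.findall(r"<\?(?!php)", "<?xml version='1.0'?> <?php echo 1; ?>"))
# ['<?']

# Negative lookbehind: words that do not end with "ing".
print(re.findall(r"\w+(?<!ing)\b", "testing words"))
# ['words']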
Linting and Its Importance

Q: Can linting make my code better?
A: No. If your logic is not good enough, it cannot help you, but it can surely make it look prettier.

Linting is the process of analyzing code to identify potential errors, code quality issues, and deviations from coding standards. It is a crucial part of modern software development for several reasons:

- Error detection: Linting helps catch bugs and errors early in the development process.
- Code quality: It enforces coding standards, making code more readable and maintainable.
- Consistency: Ensures a uniform coding style across the codebase, which is particularly important in collaborative projects.
- Efficiency: Reduces the time spent on code reviews by automatically checking for common issues.

Available Tools for Linting and Formatting

Several tools are available for linting and formatting Python code. Among them, the most popular are Black, Ruff, isort, PyLint, and Flake8, to name a few. Each tool has unique strengths and weaknesses, and each is used for a specific purpose. In this article, we will look at Black, Ruff, and isort.

A Glorious Example of How Not to Code

Before diving into the comparison, let's take a look at a sample of poorly written Python code. This will help us illustrate the differences and capabilities of Black, Ruff, and isort.

Python import datetime from io import BytesIO from datetime import datetime from __future__ import unicode_literals import os, sys, time from base64 import b64encode from PIL import Image, ImageDraw, Image from flask import Flask, request, redirect, url_for, send_file from werkzeug.utils import secure_filename numbers = [1, 2, 4,5,6, ] MyClass.function(arg1, arg2, arg3, flag, option) def my_func(some_data: list, *args, path: os.PathLike, name: str, verbosity: bool = True, quiet: bool = False): """Processes `data` using `args` and saves to `path`.""" with open(path, 'a') as file: ... if first_condititon \ and second_condition: ...

Black

Features

Black performs in-place code style changes with a prime focus on the following:

- Opinionated (e.g., spaces over tabs)
- PEP8 compliance [see Pragmatism]
- Smallest possible diff
- Stability: Black has minimal to no configuration parameters, to ensure code style consistency.
- Post-processing AST checks to ensure no change in logic. Optionally, you can turn these off by using the --fast option.

Installation

Install Black by running this command: pip install black

Example Usage

black [options] <SOURCE_FOLDER-or-FILE> See black --help for more details.

How Did It Perform?

Python import datetime from io import BytesIO from datetime import datetime from __future__ import unicode_literals import os, sys, time from base64 import b64encode from PIL import Image, ImageDraw, Image from flask import Flask, request, redirect, url_for, send_file from werkzeug.utils import secure_filename numbers = [ 1, 2, 4, 5, 6, ] MyClass.function(arg1, arg2, arg3, flag, option) def my_func( some_data: list, *args, path: os.PathLike, name: str, verbosity: bool = True, quiet: bool = False ): """Processes `data` using `args` and saves to `path`.""" with open(path, "a") as file: ... if first_condititon and second_condition: ...

P.S. Notice how it did not sort/format the imports.
isort

Features

isort prioritizes import organization with a primary focus on:

- Sorting: Sorts the imports alphabetically.
- Sections: Groups the imports into sections and by type.
- Multi-line imports: Arranges the multi-line imports into a balanced grid.
- Add/Remove imports: isort can be run or configured to add/remove imports automatically.

Installation

Install isort by running this command: pip install isort

Example Usage

isort [OPTIONS] <SOURCE_FOLDER-or-FILE> See isort --help for more details.

How Did It Perform?

Python from __future__ import unicode_literals import datetime import os import sys import time from base64 import b64encode from datetime import datetime from io import BytesIO from flask import Flask, redirect, request, send_file, url_for from PIL import Image, ImageDraw from werkzeug.utils import secure_filename numbers = [1, 2, 4,5,6, ] MyClass.function(arg1, arg2, arg3, flag, option) def my_func(some_data: list, *args, path: os.PathLike, name: str, verbosity: bool = True, quiet: bool = False): """Processes `data` using `args` and saves to `path`.""" with open(path, 'a') as file: ... if first_condititon \ and second_condition: ...

P.S. Notice how the code was not formatted.

Ruff

Features

Ruff performs comprehensive linting and autofixes, adding type hints, and ensuring code quality and consistency.

- Linting: Performs a wide range of linting checks.
- Autofix: Can automatically fix many issues.
- Integration: Easy to integrate with other tools such as isort.
- Configuration: Supports configuration via pyproject.toml or command-line flags.

Installation

Install Ruff by running this command: pip install ruff

Example Usage

For linting: ruff check [OPTIONS] <SOURCE_FOLDER-or-FILE> For formatting: ruff format [OPTIONS] <SOURCE_FOLDER-or-FILE> See ruff --help for more details.

Note: Ruff does not automatically sort imports. In order to do this, run the following:

Shell ruff check --select I --fix ruff format

How Did It Perform?

Python from __future__ import unicode_literals import datetime import os import sys import time from base64 import b64encode from datetime import datetime from io import BytesIO from flask import Flask, redirect, request, send_file, url_for from PIL import Image, ImageDraw from werkzeug.utils import secure_filename numbers = [ 1, 2, 4, 5, 6, ] MyClass.function(arg1, arg2, arg3, flag, option) def my_func( some_data: list, *args, path: os.PathLike, name: str, verbosity: bool = True, quiet: bool = False, ): """Processes `data` using `args` and saves to `path`.""" with open(path, "a") as file: ... if first_condititon and second_condition: ...

Where Do They Stand?

| | black | isort | ruff |
|---|---|---|---|
| Purpose | Code formatter | Import sorter and formatter | Linter and formatter |
| Speed | Fast | Fast | Extremely fast |
| Primary Functionality | Formats Python code to a consistent style | Sorts and formats Python imports | Lints Python code and applies autofixes |
| Configuration | pyproject.toml | pyproject.toml, .isort.cfg, setup.cfg | pyproject.toml or command-line flags |
| Ease of Use | High | High | High |
| Popularity | Very high | High | Increasing |
| Pros | Extensive, opinionated styling | Import grouping and sectioning for improved readability | Faster than most linters; developed in Rust |
| Cons | May not have extensive styling rules like pylint | - | Supports all F rules from Flake8, but is missing a majority of E rules |

Conclusion

Black, Ruff, and isort are powerful tools that help maintain high code quality in Python projects.
Each tool has its specific strengths, making them suitable for different aspects of code quality:

- Black: Best for automatic code formatting and ensuring a consistent style
- isort: Perfect for organizing and formatting import statements
- Ruff: Ideal for comprehensive linting and fixing code quality issues quickly

By understanding the unique features and benefits of each tool, developers can choose the right combination to fit their workflow and improve the readability, maintainability, and overall quality of their codebase.
Contexts in Go provide a standard way to pass metadata and control signals between goroutines. They are mainly used to manage task execution time, data passing, and operation cancellation. This article covers different types of contexts in Go and examples of how to use them. Introduction to Contexts Contexts in Go are represented by the context.Context interface, which includes methods for getting deadlines, cancellation, values, and done channels. The primary package for working with contexts is context. Go package context type Context interface { Deadline() (deadline time.Time, ok bool) Done() <-chan struct{} Err() error Value(key interface{}) interface{} } Context Types There are six main functions to create contexts: context.Background(): Returns an empty context; It is usually used as the root context for the entire application.context.TODO(): Returns a context that can be used when a context is required but not yet defined; It signals that the context needs further work.context.WithCancel(parent Context): Returns a derived context that can be canceled by calling the cancel functioncontext.WithDeadline(parent Context, d time.Time): Returns a derived context that automatically cancels at a specified time (deadline)context.WithTimeout(parent Context, timeout time.Duration): Similar to the WithDeadline, but the deadline is set by a durationcontext.WithValue(parent Context, key, val interface{}): Returns a derived context that contains a key-value pair Examples of Using Contexts Context With Cancelation A context with cancelation is useful when you need to stop a goroutine based on an event. Go package main import ( "context" "fmt" "time" ) func main() { ctx, cancel := context.WithCancel(context.Background()) go func() { select { case <-time.After(2 * time.Second): fmt.Println("Operation completed") case <-ctx.Done(): fmt.Println("Operation canceled") } }() // try to change this value to 3 and execute again time.Sleep(1 * time.Second) cancel() time.Sleep(2 * time.Second) } Context With Timeout This context automatically cancels after a specified duration. Go package main import ( "context" "fmt" "time" ) func main() { ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second) defer cancel() go func() { select { case <-time.After(3 * time.Second): fmt.Println("Operation completed") case <-ctx.Done(): fmt.Println("Operation timed out") } }() // try to change this value to 2 and execute again time.Sleep(4 * time.Second) } Context With Deadline A context with a deadline is similar to a context with a timeout, but the time is set as a specific value. Go package main import ( "context" "fmt" "time" ) func main() { // try to change this value to 3 and execute again deadline := time.Now().Add(2 * time.Second) ctx, cancel := context.WithDeadline(context.Background(), deadline) defer cancel() go func() { select { case <-time.After(3 * time.Second): fmt.Println("Operation completed") case <-ctx.Done(): fmt.Println("Operation reached deadline") } }() time.Sleep(4 * time.Second) } Context With Values Contexts can store arbitrary data as key-value pairs. This is useful for passing parameters and settings to handlers. 
Go package main import ( "context" "fmt" "time" ) func main() { ctx := context.WithValue(context.Background(), "key", "value") go func(ctx context.Context) { if v := ctx.Value("key"); v != nil { fmt.Println("Value found:", v) } else { fmt.Println("No value found") } }(ctx) time.Sleep(1 * time.Second) } Applying Contexts Contexts are widely used in various parts of Go applications, including network servers, databases, and client requests. They help properly manage task execution time, cancel unnecessary operations, and pass data between goroutines. Using in HTTP Servers Go package main import ( "context" "fmt" "net/http" "time" ) func handler(w http.ResponseWriter, r *http.Request) { ctx := r.Context() select { case <-time.After(5 * time.Second): fmt.Fprintf(w, "Request processed") case <-ctx.Done(): fmt.Fprintf(w, "Request canceled") } } func main() { http.HandleFunc("/", handler) http.ListenAndServe(":8080", nil) } This code sets up an HTTP server that handles requests with a context-aware handler. It either completes after 5 seconds or responds if the request is canceled. Using in Databases Go package main import ( "context" "database/sql" "fmt" "time" _ "github.com/go-sql-driver/mysql" ) func queryDatabase(ctx context.Context, db *sql.DB) { query := "SELECT sleep(5)" rows, err := db.QueryContext(ctx, query) if err != nil { fmt.Println("Query error:", err) return } defer rows.Close() for rows.Next() { var result string if err := rows.Scan(&result); err != nil { fmt.Println("Scan error:", err) return } fmt.Println("Result:", result) } } func main() { db, err := sql.Open("mysql", "user:password@tcp(localhost:3306)/dbname") if err != nil { fmt.Println("Database connection error:", err) return } defer db.Close() ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second) defer cancel() queryDatabase(ctx, db) } Here, we connect to a MySQL database and execute a query with a context timeout of 3 seconds. If the query takes longer, it is canceled, and an error message is printed. Using in Goroutines Go package main import ( "context" "fmt" "time" ) func worker(ctx context.Context, id int) { for { select { case <-ctx.Done(): fmt.Printf("Worker %d stopped\n", id) return case <-time.After(1 * time.Second): fmt.Printf("Worker %d working\n", id) } } } func main() { ctx, cancel := context.WithCancel(context.Background()) for i := 1; i <= 3; i++ { go worker(ctx, i) } time.Sleep(3 * time.Second) cancel() time.Sleep(1 * time.Second) } In this example, the code spawns three worker goroutines that print status messages every second. The workers stop when the main function cancels the context after 3 seconds. Using in an API Request With a Deadline Go package main import ( "context" "fmt" "net/http" "time" ) func fetchAPI(ctx context.Context, url string) { req, err := http.NewRequestWithContext(ctx, "GET", url, nil) if err != nil { fmt.Println("Request creation error:", err) return } client := &http.Client{} resp, err := client.Do(req) if err != nil { fmt.Println("Request error:", err) return } defer resp.Body.Close() if resp.StatusCode == http.StatusOK { fmt.Println("API request succeeded") } else { fmt.Println("API request failed with status:", resp.StatusCode) } } func main() { ctx, cancel := context.WithDeadline(context.Background(), time.Now().Add(2*time.Second)) defer cancel() fetchAPI(ctx, "http://example.com/api") } This example demonstrates making an API request with a 2-second deadline. 
If the request is not completed within this timeframe, it is canceled, ensuring that the program does not wait indefinitely. Conclusion Contexts in Go are a powerful tool for managing execution time, cancelation, and data passing between goroutines. Using contexts correctly helps avoid resource leaks, ensures timely task completion, and improves code structure and readability. Various types of contexts, such as those with cancellation, timeout, deadline, and values, provide flexible task management in Go applications.
The first lie detector that relied on eye movement appeared in 2014. The Converus team, together with Dr. John C. Kircher, Dr. David C. Raskin, and Dr. Anne Cook, launched EyeDetect — a brand-new solution to detect deception quickly and accurately. This event became a turning point in the polygraph industry. In 2021, we finished working on a contactless lie detection technology based on eye-tracking and presented it at the International Scientific and Practical Conference. As I was part of the developers’ team, in this article, I would like to share some insights into how we worked on the creation of the new system, particularly how we chose our backend stack.

What Is a Contactless Lie Detector and How Does It Work?

We created a multifunctional hardware and software system for contactless lie detection. This is how it works: the system tracks a person's psychophysiological reactions by monitoring eye movements and pupil dynamics and automatically calculates the final test results. Its software consists of 3 applications:

- Administrator application: Allows the creation of tests and the administration of processes
- Operator application: Enables scheduling test dates and times, assigning tests, and monitoring the testing process
- Respondent application: Allows users to take tests using a special code

On the computer screen, along with simultaneous audio (either synthesized or pre-recorded by a specialist), the respondent is given instructions on how to take the test. This is followed by written true/false statements based on developed testing methodologies. The respondent reads each statement and presses the "true" or "false" key according to their assessment of the statement's relevance. After half a second, the computer displays the next statement. Then, the lie detector measures response time and error frequency, extracts characteristics from recordings of eye position and pupil size, and calculates the significance of the statement or the "probability of deception." To make it more visual, here is a comparison of the traditionally used polygraph and the contactless lie detector.

| Criteria | Classic Polygraph | Contactless Lie Detector |
|---|---|---|
| Working Principle | Registers changes in GSR, cardiovascular, and respiratory activity to measure emotional arousal | Registers involuntary changes in eye movements and pupil diameter to measure cognitive effort |
| Duration | Tests take from 1.5 to 5 hours, depending on the type of examination | Tests take from 15 to 40 minutes |
| Report Time | From 5 minutes to several hours; written reports can take several days | Test results and reports in less than 5 minutes, automatically |
| Accuracy | Screening test: 85%; Investigation: 89% | Screening test: 86-90%; Investigation: 89% |
| Sensor Contact | Sensors are placed on the body, some of which cause discomfort, particularly the two pneumatic tubes around the chest and the blood pressure cuff | No sensors are attached to the person |
| Objectivity | Specialists interpret changes in responses. The specialist can influence the result. Manual evaluation of polygraphs requires training and is a potential source of errors. | Automated testing process ensuring maximum reliability and objectivity. AI evaluates responses and generates a report. |
| Training | Specialists undergo 2 to 10 weeks of training. Regular advanced training courses. | Standard operator training takes less than 4 hours; administrator training for creating tests takes 8 hours. Remote training with a qualification exam. |
As you can see, our lie detector made the process more comfortable and convenient compared to traditional polygraphs. First of all, the tests take less time, from 15 to 40 minutes. Besides, the results are available almost immediately: they are generated automatically within minutes. Another advantage is that there are no physically attached sensors, which can make an already stressful situation even more uncomfortable. Operator training is also less time-consuming. Most importantly, the credibility of the results remains very high.
Backend Stack Choice
Our team had experience with Python and asyncio, and we had previously developed projects using Tornado. At that time, however, FastAPI was gaining popularity, so we decided to use Python with FastAPI and SQLAlchemy (with asynchronous support). To complement this backend stack, we hosted our infrastructure on virtual machines using Docker.
Avoiding Celery
Given the nature of our lie detector, several mathematical operations take time to complete, making real-time execution during HTTP requests impractical, so we developed multiple background tasks. Although Celery is a popular framework for such tasks, we opted to implement our own task manager. This decision stemmed from our use of CI/CD, where we restart various services independently; services could occasionally lose their connection to Redis during these restarts. Our custom task manager, built on top of the aioredis library, ensures reconnection if a connection is lost (a simplified sketch of this reconnection pattern is shown after the full-text search discussion below).
Background Tasks Architecture
At the project's outset we had only a few background tasks, but their number grew as functionality expanded. Some tasks were interdependent and required sequential execution. Initially, we used a queue manager in which each task, upon completion, triggered the next task via a message queue. However, asynchronous execution could lead to data issues because related tasks ran at different speeds. We then replaced this with a task manager that uses gRPC to call related tasks, ensuring execution order and resolving the data dependencies between tasks.
Logging
We couldn't use popular bug-tracking systems like Sentry for a few reasons. First, we didn't want to rely on any third-party services managed and deployed outside our infrastructure, so we were limited to a self-hosted Sentry. At that time, we had only one dedicated server divided into multiple virtual servers, and there weren't enough resources for Sentry. Additionally, we needed to store not only bugs but also all information about requests and responses, which called for Elasticsearch. Thus, we chose to store logs in Elasticsearch. However, memory leak issues led us to switch to Prometheus and Typesense. Maintaining backward compatibility between Elasticsearch and Typesense was a priority, as we were still determining whether the new setup would meet our needs. This decision worked quite well, and we saw improvements in resource usage. The main reason for switching from Elasticsearch to Typesense was resource usage: Elasticsearch often requires a huge amount of memory, which is never sufficient, a problem commonly discussed in various forums. Since Typesense is written in C++, it requires considerably fewer resources.
Full-Text Search (FTS)
Since PostgreSQL is our main database, we needed an efficient FTS mechanism. In our previous experience, PostgreSQL's built-in tsvector/tsquery search had not performed well with Cyrillic text, so we decided to synchronize PostgreSQL with Elasticsearch. While not the fastest solution, it provided enough speed and flexibility for our needs.
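To make the synchronization idea more concrete, here is a minimal, hypothetical sketch of pushing rows from PostgreSQL into an Elasticsearch index for full-text search. The table and index names, connection settings, and the choice of the psycopg2 and elasticsearch client libraries are illustrative assumptions, not the project's actual code.
Python
# Hypothetical sketch: mirror a PostgreSQL table into an Elasticsearch index for FTS.
# Table/index names and connection settings are illustrative assumptions.
import psycopg2
from elasticsearch import Elasticsearch

def sync_reports_to_elasticsearch():
    es = Elasticsearch("http://localhost:9200")
    conn = psycopg2.connect("dbname=detector user=app password=secret host=localhost")
    try:
        with conn.cursor() as cur:
            # Pull the columns we want to make searchable.
            cur.execute("SELECT id, title, body FROM reports")
            for report_id, title, body in cur:
                # Index each row; Elasticsearch handles Cyrillic text well,
                # which is what the built-in PostgreSQL FTS struggled with for us.
                es.index(
                    index="reports",
                    id=report_id,
                    document={"title": title, "body": body},
                )
    finally:
        conn.close()

if __name__ == "__main__":
    sync_reports_to_elasticsearch()
And, returning to the custom task manager mentioned under "Avoiding Celery" above, here is a simplified sketch of the reconnection behavior: a background worker that keeps consuming tasks from a Redis list and re-establishes the connection when it is dropped (for example, during a service restart). It uses redis.asyncio (the successor to aioredis); the queue name and handler are hypothetical, and this is an illustration rather than our actual task manager.
Python
# Simplified sketch of a self-reconnecting background worker (not the actual task manager).
# Queue name, URL, and handler are illustrative assumptions.
import asyncio
import json

from redis import asyncio as aioredis
from redis.exceptions import ConnectionError as RedisConnectionError

REDIS_URL = "redis://localhost:6379/0"
TASK_QUEUE = "background_tasks"

async def handle_task(payload: dict) -> None:
    # Placeholder for the real work (e.g., long-running mathematical processing).
    print("processing", payload)

async def worker() -> None:
    redis = aioredis.from_url(REDIS_URL)
    while True:
        try:
            # Block for up to 5 seconds waiting for the next task.
            item = await redis.blpop(TASK_QUEUE, timeout=5)
            if item is not None:
                _queue, raw = item
                await handle_task(json.loads(raw))
        except RedisConnectionError:
            # Connection lost (e.g., Redis restarted during a deploy): wait and reconnect.
            await asyncio.sleep(1)
            redis = aioredis.from_url(REDIS_URL)

if __name__ == "__main__":
    asyncio.run(worker())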
PDF Report Generation
As you may know, generating PDFs in Python can be quite complicated. The issue is common: the typical approach is to render an HTML file first and only then convert it to PDF, much as in other languages, and that conversion step can produce unpredictable artifacts that are difficult to debug. Generating PDFs with JavaScript, on the other hand, is much easier. We used Puppeteer to build an HTML page and save it as a PDF exactly as a browser would, avoiding these problems altogether.
To Conclude
I would like to stress that this project turned out to be demanding in terms of choosing the right solutions, but at the same time it was more than rewarding. We received numerous unconventional customer requests that often challenged standard rules and best practices. The most exciting part of the journey was integrating mathematical models developed by another team into the backend architecture and designing a database architecture able to handle a vast amount of unique data. It made me realize once again that popular technologies and tools are not always the best option for every case. We always need to explore different methodologies and remain open to unconventional solutions for common tasks.
In today's data-driven world, real-time data processing and analytics have become crucial for businesses to stay competitive. Apache Hudi (Hadoop Upserts and Incrementals) is an open-source data management framework that provides efficient data ingestion and real-time analytics on large-scale datasets stored in data lakes. In this blog, we'll explore Apache Hudi with a technical deep dive and Python code examples, using a business example for better clarity.
Table of Contents:
1. Introduction to Apache Hudi
   Key Features of Apache Hudi
2. Business Use Case
3. Setting Up Apache Hudi
4. Ingesting Data With Apache Hudi
5. Querying Data With Apache Hudi
6. Security and Other Aspects
   Security
   Performance Optimization
   Monitoring and Management
7. Conclusion
1. Introduction to Apache Hudi
Apache Hudi is designed to address the challenges associated with managing large-scale data lakes, such as data ingestion, updating, and querying. Hudi enables efficient data ingestion and supports both batch and real-time data processing.
Key Features of Apache Hudi
Upserts (Insert/Update): Efficiently handle data updates and inserts with minimal overhead. Traditional data lakes struggle with updates, but Hudi's upsert capability ensures that the latest data is always available without requiring full rewrites of entire datasets.
Incremental Pulls: Retrieve only the data changed since the last pull, which significantly optimizes data processing pipelines by reducing the amount of data that needs to be processed.
Data Versioning: Manage different versions of data, allowing easy rollback and temporal queries. This versioning is critical for ensuring data consistency and supporting use cases such as time-travel queries.
ACID Transactions: Ensure data consistency and reliability by providing atomic, consistent, isolated, and durable transactions on data lakes. This makes Hudi a robust choice for enterprise-grade applications.
Compaction: Hudi offers a compaction mechanism that optimizes storage and query performance by merging smaller data files into larger ones, reducing the overhead of managing numerous small files.
Schema Evolution: Handle changes in the data schema gracefully without disrupting existing pipelines. This feature is particularly useful in dynamic environments where data models evolve over time.
Integration With the Big Data Ecosystem: Hudi integrates seamlessly with Apache Spark, Apache Hive, Apache Flink, and other big data tools, making it a versatile choice for diverse data engineering needs.
2. Business Use Case
Let's consider an e-commerce platform that needs to manage and analyze user order data in real time. The platform receives a high volume of orders every day, and it is essential to keep the data up to date and perform real-time analytics to track sales trends, inventory levels, and customer behavior.
3. Setting Up Apache Hudi
Before we dive into the code, let's set up the environment. We'll use PySpark together with the Hudi Spark bundle.
Shell
# Install PySpark
pip install pyspark==3.1.2
# Note: the Hudi Spark bundle is a JVM artifact rather than a pip package.
# Add it as a Spark package when creating the session or submitting the job, e.g.:
# spark-submit --packages org.apache.hudi:hudi-spark3.1-bundle_2.12:<hudi-version> ...
4. Ingesting Data With Apache Hudi
Let's start by ingesting some order data into Apache Hudi. We'll create a DataFrame with sample order data and write it to a Hudi table.
Python
from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder \
    .appName("HudiExample") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.sql.hive.convertMetastoreParquet", "false") \
    .getOrCreate()

# Sample order data
order_data = [
    (1, "2023-10-01", "user_1", 100.0),
    (2, "2023-10-01", "user_2", 150.0),
    (3, "2023-10-02", "user_1", 200.0)
]

# Create DataFrame
columns = ["order_id", "order_date", "user_id", "amount"]
df = spark.createDataFrame(order_data, columns)

# Define Hudi options
hudi_options = {
    'hoodie.table.name': 'orders',
    'hoodie.datasource.write.storage.type': 'COPY_ON_WRITE',
    'hoodie.datasource.write.recordkey.field': 'order_id',
    'hoodie.datasource.write.partitionpath.field': 'order_date',
    'hoodie.datasource.write.precombine.field': 'order_date',
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.database': 'default',
    'hoodie.datasource.hive_sync.table': 'orders',
    'hoodie.datasource.hive_sync.partition_fields': 'order_date'
}

# Write DataFrame to Hudi table
df.write.format("hudi").options(**hudi_options).mode("overwrite").save("/path/to/hudi/orders")

print("Data ingested successfully.")
5. Querying Data With Apache Hudi
Now that we have ingested the order data, let's query the data to perform some analytics. We'll use the Hudi DataSource API to read the data.
Python
# Read data from Hudi table
orders_df = spark.read.format("hudi").load("/path/to/hudi/orders/*")

# Show the ingested data
orders_df.show()

# Calculate total sales per day
total_sales = orders_df.groupBy("order_date").sum("amount") \
    .withColumnRenamed("sum(amount)", "total_sales")
total_sales.show()

# Calculate sales by user
sales_by_user = orders_df.groupBy("user_id").sum("amount") \
    .withColumnRenamed("sum(amount)", "total_sales")
sales_by_user.show()
6. Security and Other Aspects
When working with large-scale data lakes, security and data governance are paramount. Apache Hudi provides several features to ensure your data is secure and compliant with regulatory requirements.
Security
Data Encryption: Hudi supports data encryption at rest to protect sensitive information from unauthorized access. By leveraging Hadoop's native encryption support, you can ensure that your data is encrypted before it is written to disk.
Access Control: Integrate Hudi with Apache Ranger or Apache Sentry to manage fine-grained access control policies. This ensures that only authorized users and applications can access or modify the data.
Audit Logging: Hudi can be integrated with log aggregation tools like Apache Kafka or Elasticsearch to maintain an audit trail of all data operations. This is crucial for compliance and forensic investigations.
Data Masking: Implement data masking techniques to obfuscate sensitive information in datasets, ensuring that only authorized users can see the actual data.
Performance Optimization
Compaction: As mentioned earlier, Hudi's compaction feature merges smaller data files into larger ones, optimizing storage and query performance. You can schedule compaction jobs based on your workload patterns.
Indexing: Hudi supports various indexing techniques to speed up query performance. Bloom filters and columnar indexing are commonly used to reduce the amount of data scanned during queries.
Caching: Leverage Spark's in-memory caching to speed up repeated queries on Hudi datasets. This can significantly reduce query latency for interactive analytics (a short combined example follows below).
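To tie these querying-side optimizations back to the earlier example, here is a small, hypothetical follow-up that combines a Hudi incremental read (pulling only records committed after a given instant) with Spark's in-memory caching. The begin instant time and path are placeholder values, and the snippet assumes the spark session and orders table from the previous sections.
Python
# Hypothetical follow-up: incremental pull plus caching.
# The begin instant time below is a placeholder; real instant times come from
# the table's Hudi timeline (the commit times recorded under .hoodie/).
incremental_options = {
    'hoodie.datasource.query.type': 'incremental',
    'hoodie.datasource.read.begin.instanttime': '20231001000000',
}

# Read only the records committed after the given instant.
incremental_df = spark.read.format("hudi") \
    .options(**incremental_options) \
    .load("/path/to/hudi/orders")

# Cache the (usually small) incremental slice so repeated aggregations
# do not re-read the underlying files.
incremental_df.cache()
incremental_df.groupBy("order_date").sum("amount").show()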
Monitoring and Management
Metrics: Hudi provides a rich set of metrics that can be integrated with monitoring tools like Prometheus or Grafana. These metrics help you monitor the health and performance of your Hudi tables.
Data Quality: Implement data quality checks using Apache Griffin or Deequ to ensure that the ingested data meets your quality standards. This helps maintain the reliability of your analytics.
Schema Evolution: Hudi's support for schema evolution allows you to handle changes in the data schema without disrupting existing pipelines. This is particularly useful in dynamic environments where data models evolve over time.
7. Conclusion
In this blog, we have explored Apache Hudi and its capabilities for managing large-scale data lakes efficiently. We set up a Spark environment, ingested sample order data into a Hudi table, and performed some basic analytics. We also discussed the security aspects and performance optimizations that Apache Hudi offers. Apache Hudi's ability to handle upserts, provide incremental pulls, and ensure data security makes it a powerful tool for real-time data processing and analytics. By leveraging Apache Hudi, businesses can keep their data lakes up-to-date, secure, and ready for real-time analytics, enabling them to make data-driven decisions quickly and effectively. Feel free to dive deeper into Apache Hudi's documentation and explore more advanced features to further enhance your data engineering workflows. If you have any questions or need further clarification, please let me know in the comments below!