Python Memo 2: Dictionary vs. Set

A software architect takes a deep dive into the Python language by exploring dictionaries and sets and how these two elements of the language work.

Eason YIN

Apr. 15, 21 · Tutorial


Basis of Dictionaries and Sets

A dictionary is a collection of key-value pairs. Since Python 3.7, dictionaries are guaranteed to preserve insertion order (in CPython 3.6 this was an implementation detail, and before 3.6 dictionaries were unordered). A dictionary's length is mutable, and its elements can be added, deleted, and changed freely. Compared with lists and tuples, dictionaries perform better for lookup, insert, and delete operations, each of which takes constant time on average. A set is essentially the same as a dictionary; the only difference is that a set has no key-value pairs and is instead an unordered collection of unique elements.
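As a minimal sketch of the two properties described above (assuming Python 3.7 or later; the variable names are purely illustrative), a dictionary remembers the order in which keys were inserted, while a set keeps each element only once:

```python
# Dictionaries keep insertion order (guaranteed from Python 3.7 on).
d = {}
d['banana'] = 3
d['apple'] = 1
d['cherry'] = 2
keys_in_order = list(d)      # ['banana', 'apple', 'cherry']

# Sets store each element once; duplicates are dropped.
s = set([1, 2, 2, 3, 3, 3])
unique_count = len(s)        # 3

print(keys_in_order, unique_count)
```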

First, let's look at how dictionaries and sets are created. There are usually the following ways:

    Python
   
d1 = {'name': 'jason', 'age': 20, 'gender': 'male'}
d2 = dict({'name': 'jason', 'age': 20, 'gender': 'male'})
d3 = dict([('name', 'jason'), ('age', 20), ('gender', 'male')])
d4 = dict(name='jason', age=20, gender='male')
d1 == d2 == d3 == d4
Out: True
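Sets can be created in a similar way. The following is a small sketch rather than part of the original listing; the variable names are illustrative. One detail worth knowing is that empty braces create a dictionary, so an empty set has to be written as set():

```python
s1 = {1, 2, 3}
s2 = set([1, 2, 3])
print(s1 == s2)      # True

# Note: {} creates an empty dictionary, not an empty set.
empty_dict = {}
empty_set = set()
print(type(empty_dict).__name__, type(empty_set).__name__)  # dict set
```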

Note that the keys and values of a dictionary, and the elements of a set, can be of mixed types. For example, the following set contains the elements 1, 'hello', and 5.0:

    Python
   
s = {1, 'hello', 5.0}

Next, let's look at element access. A dictionary value can be accessed directly by its key; if the key does not exist, a KeyError is raised:

    Python
   
d = {'name': 'jason', 'age': 20}
d['name']
Out: 'jason'
d['location']
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-10-46d978634ca1> in <module>()
----> 1 d['location']
KeyError: 'location'

We can also use the get(key, default) method. If the key does not exist, get() returns the default value instead of raising an exception. For example, the following call returns 'null':

d = {'name': 'jason', 'age': 20}
d.get('name')
Out: 'jason'
d.get('location', 'null')
Out: 'null'

Now that we have covered dictionary access, let's look at sets.

First of all, note that a set does not support index operations, because a set is essentially a hash table, unlike a list. Therefore, the following operation is invalid and Python will raise an exception.

    Python
   
s = {1, 2, 3}
s[0]
TypeError                                 Traceback (most recent call last)
<ipython-input-15-c9c96910e542> in <module>()
----> 1 s[0]
TypeError: 'set' object does not support indexing

To check whether an element is in a dictionary or set, we can use value in dict/set. For a dictionary, this tests the keys, not the values.

    Python
   
s = {1, 2, 3}
1 in s
Out: True
10 in s
Out: False
d = {'name': 'jason', 'age': 20}
'name' in d
Out: True
'location' in d
Out: False
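A short sketch of what the in operator actually tests on a dictionary (variable names are illustrative): it looks at the keys only. Values can be checked through d.values(), but that is a linear scan rather than a hash lookup.

```python
d = {'name': 'jason', 'age': 20}

key_present = 'name' in d                # True: 'name' is a key
value_as_key = 'jason' in d              # False: 'jason' is a value, not a key
value_present = 'jason' in d.values()    # True, but found by scanning the values

print(key_present, value_as_key, value_present)
```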

In addition to creation and access, dictionaries and sets also support operations such as adding, deleting, and updating elements.

    Python
   
d = {'name': 'jason', 'age': 20}
d['gender'] = 'male'
d['dob'] = '1999-02-01'
d
Out: {'name': 'jason', 'age': 20, 'gender': 'male', 'dob': '1999-02-01'}
d['dob'] = '1998-01-01'
d.pop('dob')
Out: '1998-01-01'
d
Out: {'name': 'jason', 'age': 20, 'gender': 'male'}
s = {1, 2, 3}
s.add(4)
s
Out: {1, 2, 3, 4}
  
s.remove(4)
s
Out: {1, 2, 3}

Note, however, that the pop() operation of a set removes and returns an arbitrary element; because the set is unordered, you have no way of knowing in advance which element will be removed. So this operation must be used with caution.
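The following sketch illustrates the point about pop(), and also contrasts remove() with discard(), a related method not covered above. Which element pop() returns is deliberately not shown, since it is not something to rely on:

```python
s = {1, 2, 3}
removed = s.pop()          # removes and returns an arbitrary element
print(removed in s)        # False
print(len(s))              # 2

s.discard(99)              # no error if the element is absent
try:
    s.remove(99)           # remove() raises KeyError for a missing element
except KeyError:
    print('99 is not in the set')
```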
In practice, we often need to sort a dictionary or a set, for example, to take out the 50 pairs with the largest values.

For dictionaries, we usually sort in ascending or descending order according to the key or value:

    Python
   
d = {'b': 1, 'a': 2, 'c': 10}
d_sorted_by_key = sorted(d.items(), key=lambda x: x[0])
d_sorted_by_value = sorted(d.items(), key=lambda x: x[1])
d_sorted_by_key
Out: [('a', 2), ('b', 1), ('c', 10)]
d_sorted_by_value
Out: [('b', 1), ('a', 2), ('c', 10)]

A list is returned here. Each element in the list is a tuple composed of the keys and values of the original dictionary.
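For the "largest values" case mentioned earlier, a possible sketch is shown below. It uses a small made-up dictionary and picks the top 2 pairs instead of the top 50. heapq.nlargest is one way to avoid sorting everything, and dict() turns the sorted pairs back into a dictionary whose order is preserved in Python 3.7+.

```python
import heapq

d = {'b': 1, 'a': 2, 'c': 10, 'e': 7, 'f': 4}

# Sort all pairs by value, largest first.
by_value_desc = sorted(d.items(), key=lambda kv: kv[1], reverse=True)

# Take only the n pairs with the largest values.
top2 = heapq.nlargest(2, d.items(), key=lambda kv: kv[1])

# Rebuild a dictionary; insertion order follows the sorted pairs.
d_sorted = dict(by_value_desc)

print(top2)              # [('c', 10), ('e', 7)]
print(list(d_sorted))    # ['c', 'e', 'f', 'a', 'b']
```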

As for sets, sorting works much as it does for lists and tuples. Just call sorted(set) directly, and the result is a sorted list.

    Python
   
s = {3, 4, 2, 1}
sorted(s)
Out: [1, 2, 3, 4]

Performance of Dictionaries and Sets

As mentioned at the beginning of the article, dictionaries and sets are data structures that have been highly optimized for performance, especially for lookup, insert, and delete operations. Let's now look at how they perform in specific scenarios and how they compare with other data structures such as lists.

For example, suppose the back end of an e-commerce company stores the ID, name, and price of each product. The requirement is: given the ID of a product, find its price.
If we store this data in a list and search it, the corresponding code is as follows:

    Python
   
def find_product_price(products, product_id):
    # Linear scan: check each (id, price) pair in turn
    for pid, price in products:
        if pid == product_id:
            return price
    return None
     
products = [
    (143121312, 100), 
    (432314553, 30),
    (32421912367, 150) 
]
print('The price of product 432314553 is {}'.format(find_product_price(products, 432314553)))
##Out: The price of product 432314553 is 30

Assuming the list has n elements, the search has to traverse the list, so its time complexity is O(n). Even if we sort the list first and then use binary search, each lookup still takes O(log n) time, and sorting the list takes O(n log n) time on top of that.
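For comparison, the sorted-list approach could look roughly like the sketch below. It uses the standard bisect module; the function name find_product_price_sorted and the helper list sorted_ids are made up for this illustration.

```python
from bisect import bisect_left

products = [(32421912367, 150), (143121312, 100), (432314553, 30)]

sorted_products = sorted(products)                   # O(n log n), done once
sorted_ids = [pid for pid, _ in sorted_products]     # IDs in ascending order

def find_product_price_sorted(product_id):
    """Binary search over the sorted IDs: O(log n) per lookup."""
    i = bisect_left(sorted_ids, product_id)
    if i < len(sorted_ids) and sorted_ids[i] == product_id:
        return sorted_products[i][1]
    return None

print(find_product_price_sorted(432314553))   # 30
print(find_product_price_sorted(1))           # None
```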

But if we use a dictionary to store this data, the lookup becomes convenient and efficient: it completes in O(1) time on average. The reason is simple. A dictionary is implemented as a hash table, so the value can be found directly from the hash of its key.

    Python
   
products = {
  143121312: 100,
  432314553: 30,
  32421912367: 150
}
print('The price of product 432314553 is {}'.format(products[432314553])) 
##Out: The price of product 432314553 is 30
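If the data already exists as a list of (id, price) tuples, one possible way to get the dictionary is to convert it once and then reuse it for every lookup. This is only a sketch; product_list and price_by_id are illustrative names.

```python
product_list = [(143121312, 100), (432314553, 30), (32421912367, 150)]

# Building the dictionary is O(n), but it only has to be done once.
price_by_id = dict(product_list)

print(price_by_id[432314553])     # 30
print(price_by_id.get(999))       # None: unknown ID, no KeyError raised
```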

Similarly, suppose we now need to find out how many different prices these products have. Let's compare the approaches in the same way.

If we still choose to use a list, the corresponding code is as follows, where A and B mark the two levels of looping (the membership test at B scans the list). Assuming the original list has n elements, the worst case requires O(n^2) time.

    Python
   
def find_unique_price_using_list(products):
    unique_price_list = []
    for _, price in products:  # A
        if price not in unique_price_list:  # B
            unique_price_list.append(price)
    return len(unique_price_list)
products = [ 
    (143121312, 100), 
    (432314553, 30), 
    (32421912367, 150), 
    (937153201, 30)]
print('number of unique price is: {}'.format(find_unique_price_using_list(products)))
##Out: number of unique price is: 3

But if we choose a set to store the data, the picture changes. A set is a highly optimized hash table: its elements cannot repeat, and its add and lookup operations take only O(1) time on average, so the total time complexity is O(n).

    Python
   
def find_unique_price_using_set(products):
    unique_price_set = set()
    for _, price in products:
        unique_price_set.add(price)
    return len(unique_price_set)
products = [ 
     (143121312, 100),
     (432314553, 30),
     (32421912367, 150),
     (937153201, 30)
]
print('number of unique price is: {}'.format(find_unique_price_using_set(products)))
##Out: number of unique price is: 3
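The same count can also be written more compactly with a set comprehension. The function name count_unique_prices is made up for this sketch; the behavior is intended to match the loop version above.

```python
def count_unique_prices(products):
    # Build a set of prices in one pass; duplicates collapse automatically.
    return len({price for _, price in products})

products = [
    (143121312, 100),
    (432314553, 30),
    (32421912367, 150),
    (937153201, 30),
]
print(count_unique_prices(products))   # 3
```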

The following code builds a product list of 100,000 elements and measures the time needed to count the unique prices using a list and using a set:

    Python
   
import time
id = [x for x in range(0, 100000)]
price = [x for x in range(200000, 300000)]
products = list(zip(id, price))
start_using_list = time.perf_counter()
find_unique_price_using_list(products)
end_using_list = time.perf_counter()
print("time elapse using list: {}".format(end_using_list - start_using_list))
##Out: time elapse using list: 41.61519479751587
start_using_set = time.perf_counter()
find_unique_price_using_set(products)
end_using_set = time.perf_counter()
print("time elapse using set: {}".format(end_using_set - start_using_set))
##Out: time elapse using set: 0.008238077163696289
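To reproduce the comparison quickly, a smaller-scale sketch using the standard timeit module might look like this. The helper names and the input size of 2,000 are choices made for this example, and the measured times will vary from machine to machine, so no output is shown.

```python
import timeit

def unique_with_list(prices):
    seen = []
    for p in prices:
        if p not in seen:       # linear scan of the list on every iteration
            seen.append(p)
    return len(seen)

def unique_with_set(prices):
    return len(set(prices))     # hash-based de-duplication

prices = list(range(2000))      # all distinct: worst case for the list version

t_list = timeit.timeit(lambda: unique_with_list(prices), number=3)
t_set = timeit.timeit(lambda: unique_with_set(prices), number=3)
print('list: {:.4f}s, set: {:.4f}s'.format(t_list, t_set))
```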

As you can see, with only one hundred thousand records, the speed difference between the two is already this large. In fact, the back-end data in large enterprises is often on the order of hundreds of millions or even billions of records. If an inappropriate data structure is used, it is easy to cause the server to crash, which not only affects the user experience but also brings huge losses to the company.

How Dictionaries and Sets Work

We have seen the efficiency of dictionaries and sets through the above examples. But, why are dictionaries and sets so highly efficient, especially for lookup, insert, and delete operations? This is, of course, inseparable from the data structure of dictionaries and sets. Unlike other data structures, the internal structure of dictionaries and sets is a hash table.

For a dictionary, hash tables store three elements: hash, key, and value.
For a set, the difference is that there is no key-value pairing in the hash table, only a single element.

Let's take a look at the hash table structure of the old version of Python:

--+-------------------------------+
  |    hash       key     value
--+-------------------------------+
0 |    hash0      key0    value0
--+-------------------------------+
1 |    hash1      key1    value1
--+-------------------------------+
2 |    hash2      key2    value2
--+-------------------------------+
. |           ...
__+_______________________________+

It is not difficult to imagine that as the hash table expands, it will become more and more sparse. For example, if I have the following dictionary:

    Python
   
{'name': 'mike', 'dob': '1999-01-01', 'gender': 'male'}

Then it will be stored in a form similar to the following:

    Python
   
entries = [
  ['--', '--', '--'],
  [-230273521, 'dob', '1999-01-01'],
  ['--', '--', '--'],
  ['--', '--', '--'],
  [1231236123, 'name', 'mike'],
  ['--', '--', '--'],
  [9371539127, 'gender', 'male']
]

Such a design is obviously a waste of storage space. To improve space utilization, the current hash table stores the indices separately from the hash, key, and value entries, which gives us the following new structure:


Indices
----------------------------------------------------
None | index | None | None | index | None | index ...
----------------------------------------------------

Entries
--------------------
hash0   key0  value0
---------------------
hash1   key1  value1
---------------------
hash2   key2  value2
---------------------
        ...
---------------------

Then, the storage form used under the new hash table structure will look like:

    Python
   
indices = [None, 1, None, None, 0, None, 2]
entries = [
  [1231236123, 'name', 'mike'],
  [-230273521, 'dob', '1999-01-01'],
  [9371539127, 'gender', 'male']
]

We can clearly see that space utilization has been greatly improved. Now that we're clear on the specific design structure, let's look at the working principles of these operations.
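To make the split layout more concrete, here is a toy model of it. This is a sketch under simplifying assumptions, not CPython's actual implementation: the table size is fixed at 8, keys are assumed to be distinct, collisions are resolved by plain linear probing, and the names build_compact_table and lookup are invented for the example.

```python
def build_compact_table(pairs, size=8):
    """Toy model: a sparse index array plus a dense, ordered entries list."""
    indices = [None] * size
    entries = []
    for key, value in pairs:
        h = hash(key)
        slot = h & (size - 1)
        while indices[slot] is not None:      # occupied: try the next slot
            slot = (slot + 1) & (size - 1)
        indices[slot] = len(entries)          # position in the dense list
        entries.append([h, key, value])
    return indices, entries

def lookup(indices, entries, key):
    size = len(indices)
    h = hash(key)
    slot = h & (size - 1)
    while indices[slot] is not None:
        entry_hash, entry_key, entry_value = entries[indices[slot]]
        if entry_hash == h and entry_key == key:
            return entry_value
        slot = (slot + 1) & (size - 1)
    raise KeyError(key)                       # reached an empty slot

pairs = [('name', 'mike'), ('dob', '1999-01-01'), ('gender', 'male')]
indices, entries = build_compact_table(pairs)
print(lookup(indices, entries, 'dob'))        # 1999-01-01
print([e[1] for e in entries])                # ['name', 'dob', 'gender']
```

Only the small integers in indices are spread out sparsely; the entries list stays dense and keeps the insertion order, which is also what makes ordered dictionaries cheap to provide.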

Insert Operation

Every time an element is inserted into a dictionary or set, Python first calculates the hash value of the key (hash(key)) and then performs a bitwise AND with a mask to calculate the position where the element should be inserted into the hash table (index = hash(key) & mask). The mask is the table size minus one; for a newly created dictionary in CPython this is PyDict_MINSIZE - 1. If this position in the hash table is empty, the element is inserted there.

If this position is already occupied, Python compares the hash value and key of the two elements to see if they are equal.

If the two are equal, the element already exists; if the value is different, the value is updated.
If the two are not equal, this situation is usually called a hash collision: the keys of the two elements are not equal, but they map to the same position. In this case, Python continues to look for a free position in the table until it finds one.

It is worth mentioning that the simplest way to handle this situation is to search linearly: start from this position and look for a vacancy one slot at a time. Python optimizes this step internally with a more elaborate probing sequence (the details are not essential here; the CPython source code has them if you are interested).
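The insert steps above can be sketched with the simple linear search just described. This is an illustration only: the function name insert is hypothetical, the table has a fixed power-of-two size and is assumed never to fill up, and CPython's real probing sequence is different.

```python
def insert(table, key, value):
    """Insert or update (key, value) in a fixed-size open-addressing table."""
    mask = len(table) - 1
    h = hash(key)
    i = h & mask                              # index = hash(key) & mask
    while table[i] is not None:
        stored_hash, stored_key, _ = table[i]
        if stored_hash == h and stored_key == key:
            table[i] = (h, key, value)        # same key: update the value
            return i
        i = (i + 1) & mask                    # collision: try the next slot
    table[i] = (h, key, value)                # empty slot: store the new entry
    return i

table = [None] * 8
first = insert(table, 1, 'a')     # hash(1) is 1, so the entry goes to slot 1
second = insert(table, 9, 'b')    # 9 & 7 is also 1: collision, moves to slot 2
third = insert(table, 1, 'c')     # key 1 already exists: value updated in slot 1
print(first, second, third)       # 1 2 1
```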

Find Operation

Similar to the insert operation, Python finds the position where the element should be based on its hash value, and then compares the hash value and key of the element stored there with those of the element being looked up. If they are equal, the value is returned directly; if not, the search continues until the element is found, or until an empty slot is reached and an exception is raised.

Delete Operation

For the delete operation, Python temporarily assigns a special value to the element at this position and then actually removes it when the hash table is resized.

It is not difficult to see that hash collisions tend to slow down dictionary and set operations. Therefore, to ensure efficiency, the hash tables inside dictionaries and sets are usually kept with at least 1/3 of their space free. As elements are inserted, when the free space drops below 1/3, Python allocates a larger block of memory and expands the hash table. In this case, however, all the elements in the table are re-inserted at new positions.
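One way to observe this growth from the outside is to watch the memory reported for a dictionary as elements are added. This is a rough, CPython-oriented sketch: sys.getsizeof reports the size of the container itself, and the exact sizes and resize points depend on the Python version, so no output is shown.

```python
import sys

d = {}
last_size = sys.getsizeof(d)
resize_points = []                 # number of items at which the size changed

for i in range(100):
    d[i] = None
    size = sys.getsizeof(d)
    if size != last_size:          # the underlying table was reallocated
        resize_points.append(len(d))
        last_size = size

print(resize_points)
```

The size stays flat across many insertions and then jumps, which is consistent with the table being expanded in steps rather than on every insert.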

Although hash collisions and resizing of the hash table slow things down, they happen rarely. Therefore, on average, the time complexity of insert, find, and delete is still O(1).

Conclusion

In this lesson, we have learned the basic operations of dictionaries and sets together, and have explained their high performance and internal storage structure.

The dictionary preserves insertion order in Python 3.7+, while the set is unordered. Their internal hash table storage structure ensures the efficiency of find, insert, and delete operations. Therefore, dictionaries and sets are usually used in scenarios such as fast lookup and de-duplication of elements.
