Pablo Galindo Salgado
## Abstract

The main garbage collection algorithm used by CPython is reference counting. The basic idea is that CPython counts how many different places there are that hold a reference to an object. Such a place could be another object, a global (or static) C variable, or a local variable in some C function. When an object's reference count becomes zero, the object is deallocated. If it contains references to other objects, their reference counts are decremented. Those other objects may be deallocated in turn, if this decrement makes their reference count become zero, and so on. The reference count field can be examined using the `sys.getrefcount()` function (notice that the value returned is always one more than you might expect, because it also counts the temporary reference held by the function's argument):

```python
>>> x = object()
>>> sys.getrefcount(x)
2
>>> y = x
>>> sys.getrefcount(x)
3
>>> del y
>>> sys.getrefcount(x)
2
```

The main problem with the reference counting scheme is that it does not handle reference cycles. For instance, consider this code:

```python
>>> container = []
>>> container.append(container)
>>> sys.getrefcount(container)
3
>>> del container
```

In this example, `container` holds a reference to itself, so even when we remove our reference to it (the variable `container`), its reference count never falls to zero, and it is never cleaned up by reference counting alone. For this reason, some additional machinery is needed to clean up reference cycles once the objects involved become unreachable. This is the cyclic garbage collector, usually called just "the garbage collector" (GC), even though reference counting is also a form of garbage collection.

## Memory layout and object structure

Normally the C structure supporting a regular Python object looks as follows:

```
object -----> +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+ \
              | ob_refcnt                                      | |
              +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+ | PyObject_HEAD
              | *ob_type                                       | |
              +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+ /
              |                      ...                       |
```

In order to support the garbage collector, the memory layout of objects is altered to accommodate extra information before the normal layout:

```
              +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+ \
              | *_gc_next                                      | |
              +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+ | PyGC_Head
              | *_gc_prev                                      | |
object -----> +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+ /
              | ob_refcnt                                      | \
              +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+ | PyObject_HEAD
              | *ob_type                                       | |
              +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+ /
              |                      ...                       |
```

In this way the object can be treated as a normal Python object and, when the extra information associated with the GC is needed, the preceding fields can be accessed with a simple type cast from the original object: `((PyGC_Head *)(op)-1)`.

As is explained later in the Optimization: reusing fields to save memory section, these two extra fields are normally used to keep doubly linked lists of all the objects tracked by the garbage collector (these lists are the GC generations, more on that in the Optimization: generations section), but they are also reused to fulfill other purposes when the full doubly linked list structure is not needed, as a memory optimization.

Doubly linked lists are used because they efficiently support the most frequently required operations. In general, the collection of all objects tracked by the GC is partitioned into disjoint sets, each in its own doubly linked list. Between collections, objects are partitioned into "generations", reflecting how often they've survived collection attempts. During collections, the generation(s) being collected are further partitioned into, e.g., sets of reachable and unreachable objects. Doubly linked lists support moving an object from one partition to another, adding a new object, removing an object entirely (objects tracked by the GC are most often reclaimed by the refcounting system when the GC isn't running at all!), and merging partitions, all with a small constant number of pointer updates. With care, they also support iterating over a partition while objects are being added to - and removed from - it, which is frequently required while the GC is running.

Specific APIs are offered to allocate, deallocate, initialize, track, and untrack objects with GC support. These APIs can be found in the Garbage Collector C API documentation.

Apart from this object structure, the type object for objects supporting garbage collection must include the `Py_TPFLAGS_HAVE_GC` flag in its `tp_flags` slot and provide an implementation of the `tp_traverse` handler (and, in most cases, a `tp_clear` handler as well).

## Identifying reference cycles

The algorithm that CPython uses to detect those reference cycles is implemented in the `gc` module. The garbage collector only focuses on cleaning container objects (i.e. objects that can contain a reference to one or more objects).
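Before walking through the algorithm, it helps to see concretely why the doubly linked lists described above make the collector's bookkeeping cheap. The following is a minimal pure-Python sketch (illustrative only; the names `Node` and `Partition` are invented for this example and CPython's real lists are implemented in C via `PyGC_Head`):

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.prev = self.next = None

class Partition:
    """Circular doubly linked list with a sentinel head node."""
    def __init__(self):
        self.head = Node("<head>")
        self.head.prev = self.head.next = self.head

    def append(self, node):
        # O(1) insertion before the sentinel (i.e. at the tail)
        tail = self.head.prev
        tail.next = node
        node.prev = tail
        node.next = self.head
        self.head.prev = node

    @staticmethod
    def unlink(node):
        # O(1) removal: no search required, the node knows its neighbors
        node.prev.next = node.next
        node.next.prev = node.prev
        node.prev = node.next = None

    def move_to(self, node, other):
        # O(1) move between partitions: a constant number of pointer updates
        self.unlink(node)
        other.append(node)

    def __iter__(self):
        n = self.head.next
        while n is not self.head:
            nxt = n.next  # grab next first, so iteration survives moves
            yield n
            n = nxt

to_scan, unreachable = Partition(), Partition()
a, b = Node("a"), Node("b")
to_scan.append(a)
to_scan.append(b)
to_scan.move_to(a, unreachable)       # no traversal needed
print([n.name for n in to_scan])      # ['b']
print([n.name for n in unreachable])  # ['a']
```

The same constant-cost moves are what the real collector performs when it shuffles objects between the "objects to scan" and "tentatively unreachable" lists described below.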
To correctly dispose of these objects once they become unreachable, they need to be identified first. Inside the function that identifies cycles, two doubly linked lists are maintained: one list contains all objects to be scanned, and the other will contain all objects "tentatively" unreachable.

To understand how the algorithm works, let's take the case of a circular linked list which has one link referenced by a variable `A`, plus one self-referencing object which is completely unreachable:

```python
>>> import gc

>>> class Link:
...     def __init__(self, next_link=None):
...         self.next_link = next_link

>>> link_3 = Link()
>>> link_2 = Link(link_3)
>>> link_1 = Link(link_2)
>>> link_3.next_link = link_1
>>> A = link_1
>>> del link_1, link_2, link_3

>>> link_4 = Link()
>>> link_4.next_link = link_4
>>> del link_4

# Collect the unreachable Link object (and its .__dict__ dict).
>>> gc.collect()
2
```

When the GC starts, it has all the container objects it wants to scan on the first linked list. The objective is to move all the unreachable objects to the second list. Since most objects turn out to be reachable, it is much more efficient to move the unreachable objects instead, as this involves fewer pointer updates.

Every object that supports garbage collection has an extra reference count field, initialized to the object's real reference count (called `gc_ref` below) when the algorithm starts. The algorithm performs its computations on this copy, so the interpreter never sees the real reference count field modified.

The GC then iterates over all containers in the first list and decrements by one the `gc_ref` field of any other object that container is referencing. After this pass, only the objects that have references coming from outside the "objects to scan" list keep `gc_ref > 0`.

Notice that having `gc_ref == 0` does not imply that the object is unreachable: an object that is reachable from the outside (`gc_ref > 0`) can still hold references to it. To obtain the set of objects that are really unreachable, the GC re-scans the container objects: objects found with `gc_ref == 0` are marked as "tentatively unreachable" and moved to the second list.

Then the GC scans the next object in the first list. When the GC encounters an object which is reachable (`gc_ref > 0`), it traverses its references to find all the objects reachable from it, moving them back to the end of the list of reachable objects and marking them so they are not treated as unreachable.

Notice that an object that was marked as "tentatively unreachable" and was later moved back to the reachable list will be visited again by the garbage collector, as now all the references that that object has need to be processed as well. This process is really a breadth-first search over the object graph. Once all the objects are scanned, the GC knows that all container objects left in the tentatively unreachable list are really unreachable and can thus be garbage collected.
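The two-phase computation above can be simulated in a few lines of Python. This is a simplified model, not CPython's implementation: objects are represented as strings, the per-object `__dict__` containers are ignored, and `find_unreachable` is an invented name for this sketch.

```python
def find_unreachable(refcounts, references):
    """refcounts: {obj: real refcount}; references: {obj: list of objs it refers to}."""
    # Phase 1: copy refcounts into gc_ref and subtract every reference
    # that originates inside the scanned set.
    gc_ref = dict(refcounts)
    for obj, refs in references.items():
        for target in refs:
            gc_ref[target] -= 1
    # Phase 2: objects left with gc_ref > 0 have references from outside;
    # everything transitively reachable from them is rescued (a BFS/DFS).
    pending = [o for o, n in gc_ref.items() if n > 0]
    reachable = set(pending)
    while pending:
        obj = pending.pop()
        for target in references[obj]:
            if target not in reachable:
                reachable.add(target)
                pending.append(target)
    return set(refcounts) - reachable

# The Link example: A -> link_1 -> link_2 -> link_3 -> link_1, plus a
# completely isolated self-referencing link_4.
refcounts = {"link_1": 2, "link_2": 1, "link_3": 1, "link_4": 1}
refs = {"link_1": ["link_2"], "link_2": ["link_3"],
        "link_3": ["link_1"], "link_4": ["link_4"]}
print(find_unreachable(refcounts, refs))  # {'link_4'}
```

Note how `link_2` and `link_3` end the first phase with a count of zero yet survive, because they are rescued through `link_1`, which is still referenced by the external variable `A`.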
Pragmatically, it's important to note that no recursion is required by any of this, and neither does it in any other way require additional memory proportional to the number of objects, number of pointers, or the lengths of pointer chains. Apart from O(1) storage for internal C needs, the objects themselves contain all the memory the GC algorithms require.

## Why moving unreachable objects is better

It sounds logical to move the unreachable objects under the premise that most objects are usually reachable, until you think about it: the reason it pays isn't actually obvious.

Suppose we create objects A, B, C in that order. They appear in the young generation in the same order. If B points to A, and C to B, and C is reachable from outside, then the adjusted refcounts after the first step of the algorithm runs will be 0, 0, and 1 respectively, because the only object reachable from the outside is C.

When the next step of the algorithm finds A, A is moved to the unreachable list. The same happens to B when it's first encountered. Then C is traversed, and B is moved back to the reachable list. B is eventually traversed again, and then A is moved back to the reachable list. So instead of not moving at all, the reachable objects B and A are each moved twice.

Why is this a win? A straightforward algorithm that moved the reachable objects instead would move A, B, and C once each. The key is that this dance leaves the objects in order C, B, A - reversed from the original order. On all subsequent scans, none of them will move. Since most objects aren't in cycles, this can save an unbounded number of moves across an unbounded number of later collections. The only time the cost can be higher is the first time the chain is scanned.

## Destroying unreachable objects

Once the GC knows the list of unreachable objects, a very delicate process starts with the objective of completely destroying these objects. Roughly, the process follows these steps in order:

1. Handle and clear weak references (if any): weak references to the unreachable objects are cleared, and any associated callbacks that are safe to run are called.
2. If an object has a legacy finalizer (the `tp_del` slot), it is moved to the `gc.garbage` list, since it cannot be destroyed safely.
3. Call the finalizers (the `tp_finalize` slot, e.g. `__del__`) and mark the objects as already finalized so their finalizers do not run twice.
4. Deal with resurrected objects: if any finalizer made an object reachable again, recompute which objects remain unreachable and continue only with those.
5. Call the `tp_clear` slot of every remaining object so all internal links are broken and the reference counts fall to zero, triggering the destruction of all unreachable objects.
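The finalization step is observable from Python code. The sketch below (relying on the behavior specified by PEP 442 for CPython) shows `__del__` being called for objects trapped in a reference cycle when the collector destroys them; the `Resource` class and the `finalized` list are invented for the example:

```python
import gc

finalized = []

class Resource:
    def __del__(self):
        # The finalization step: the collector calls this even though
        # the object is trapped in a reference cycle.
        finalized.append("finalized")

a, b = Resource(), Resource()
a.partner, b.partner = b, a   # a two-object reference cycle
del a, b                      # now unreachable, but refcounts never hit zero
gc.collect()                  # the cycle collector runs both finalizers
print(finalized)              # ['finalized', 'finalized']
```

Before PEP 442 (Python 3.4), cyclic objects with `__del__` methods were not destroyed at all and ended up in `gc.garbage` instead.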
## Optimization: generations

In order to limit the time each garbage collection takes, the GC uses a popular optimization: generations. The main idea behind this concept is the assumption that most objects have a very short lifespan and can thus be collected soon after their creation. This has proven to be very close to the reality of many Python programs, as many temporary objects are created and destroyed very quickly. The older an object is, the less likely it is to become unreachable.

To take advantage of this fact, all container objects are segregated into three spaces/generations. Every new object starts in the first generation (generation 0). The previous algorithm is executed only over the objects of a particular generation, and if an object survives a collection of its generation it is moved to the next one (generation 1), where it will be surveyed for collection less often. If the same object survives another GC round in this new generation (generation 1), it is moved to the last generation (generation 2), where it is surveyed the least often.

Generations are collected when the number of objects that they contain reaches some predefined threshold, which is unique for each generation and is lower the older the generation is. These thresholds can be examined using the `gc.get_threshold()` function:

```python
>>> import gc
>>> gc.get_threshold()
(700, 10, 10)
```

The content of these generations can be examined using the `gc.get_objects(generation=NUM)` function, and collections can be triggered specifically in a generation by calling `gc.collect(generation=NUM)`:

```python
>>> import gc
>>> class MyObj:
...     pass
...

# Move everything to the last generation so it's easier to inspect
# the younger generations.

>>> gc.collect()
0

# Create a reference cycle.

>>> x = MyObj()
>>> x.self = x

# Initially the object is in the youngest generation.

>>> gc.get_objects(generation=0)
[..., <__main__.MyObj object at 0x7fbcc12a3400>, ...]

# After a collection of the youngest generation the object
# moves to the next generation.

>>> gc.collect(generation=0)
0
>>> gc.get_objects(generation=0)
[]
>>> gc.get_objects(generation=1)
[..., <__main__.MyObj object at 0x7fbcc12a3400>, ...]
```

## Collecting the oldest generation

In addition to the various
configurable thresholds, the GC only triggers a full collection of the oldest generation if the ratio `long_lived_pending / long_lived_total` is above a given value (hardwired to 25%). The reason is that, while "non-full" collections (that is, collections of the young and middle generations) always examine roughly the same number of objects (determined by the aforementioned thresholds), the cost of a full collection is proportional to the total number of long-lived objects, which is virtually unbounded.

## Optimization: reusing fields to save memory

In order to save memory, the two linked list pointers in every object with GC support are reused for
several purposes. This is a common optimization known as "fat pointers" or "tagged pointers": pointers that carry additional data "folded" into the pointer, that is, stored inline in the data representing the address, taking advantage of certain properties of memory addressing. This is possible because most architectures align certain types of data to the size of the data, often a word or a multiple thereof. This discrepancy leaves a few of the least significant bits of the pointer unused, which can be
used for tags or to keep other information – most often as a bit field (each bit a separate tag) – as long as code that uses the pointer masks out these bits before accessing memory. E.g., on a 32-bit architecture (for both addresses and word size), a word is 32 bits = 4 bytes, so word-aligned addresses are always a multiple of 4, hence end in `00` in binary, leaving the last two bits available for tags.

The CPython GC makes use of two fat pointers that correspond to the extra fields of `PyGC_Head` discussed in the Memory layout and object structure section: the `_gc_prev` field normally acts as the "previous" pointer of the doubly linked list, with its lowest bits reused during collection to keep per-object flags (such as whether the object has already been finalized), while the `_gc_next` field acts as the "next" pointer and is likewise reused for bookkeeping while a collection is running.
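The tagged-pointer trick itself is easy to demonstrate with plain integers. The following toy sketch uses invented names (`FINALIZED`, `COLLECTING`, `TAG_MASK`) and a made-up address; CPython's real flags live in the C implementation:

```python
FINALIZED  = 0b01  # example tag: "object was already finalized"
COLLECTING = 0b10  # example tag: "object belongs to the generation being collected"
TAG_MASK   = 0b11  # the two low bits freed up by 4-byte alignment

address = 0x7F3A9C00          # a word-aligned address: a multiple of 4
assert address % 4 == 0       # so its two least significant bits are zero

tagged = address | FINALIZED  # fold a tag into the pointer
assert tagged & FINALIZED     # the tag can be tested cheaply...
untagged = tagged & ~TAG_MASK # ...and must be masked off before the value
assert untagged == address    # is used as a real address again
print(hex(untagged))
```

The essential invariant is the last assertion: masking out the tag bits must always recover the original, properly aligned address.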
## Optimization: delay tracking containers

Certain types of containers cannot participate in a reference cycle, and so do not need to be tracked by the garbage collector. Untracking these objects reduces the cost of garbage collection. However, determining which objects may be untracked is not free, and the costs must be weighed against the benefits for garbage collection. There are two possible strategies for when to untrack a container:

1. When the container is created.
2. When the container is examined by the garbage collector.
As a general rule, instances of atomic types aren't tracked and instances of non-atomic types (containers, user-defined objects...) are. However, some type-specific optimizations can be present in order to suppress the garbage collector footprint of simple instances. Some examples of native types that benefit from delayed tracking:

- Tuples containing only immutable objects (integers, strings, etc., and, recursively, tuples of immutable objects) do not need to be tracked. Since a tuple's contents cannot change after creation, tuples are examined for untracking when they are first seen by the garbage collector.
- Dictionaries containing only immutable objects also do not need to be tracked. Dictionaries are untracked when created; if a tracked item is later inserted, the dictionary becomes tracked.
The garbage collector module provides the Python function `gc.is_tracked(obj)`, which returns whether the object is currently tracked by the garbage collector:

```python
>>> gc.is_tracked(0)
False
>>> gc.is_tracked("a")
False
>>> gc.is_tracked([])
True
>>> gc.is_tracked({})
False
>>> gc.is_tracked({"a": 1})
False
>>> gc.is_tracked({"a": []})
True
```
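The delayed untracking of tuples can be observed directly (this sketch relies on CPython's behavior; the tuple is built at runtime with `tuple()` so it is a fresh object rather than a compile-time constant):

```python
import gc

# A tuple of immutable objects starts out tracked; the first collection
# that examines it notices it cannot be part of a cycle and untracks it.
t = tuple([1, 2, 3])
tracked_before = gc.is_tracked(t)
gc.collect()
tracked_after = gc.is_tracked(t)
print(tracked_before, tracked_after)  # True False
```

Dictionaries take the opposite strategy: they start out untracked at creation and only become tracked when a tracked item is inserted, as the `gc.is_tracked({"a": []})` example above shows.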