
Living The Dream Blog

T Dodge Consulting LLC


Significant Garbage Collection Improvement For Emacs

Reducing Wall Clock Latency For sweep_conses By 50%.

This commit reduces the total wall clock duration of sweep_conses execution by approximately 50%. It does so by reducing branch mispredictions from dereferencing storage blocks while sweeping the cons blocks. Parsing the output of subprocesses such as LSP servers creates huge numbers of conses, so this commit significantly increases the responsiveness of modes such as eglot or company-mode.

Caveats

I have only tested this on macOS on a processor with a 256 KB L2 cache. I am uncertain whether the behavior is the same on other operating systems. There may also be constraints I'm unaware of that make this incompatible with other setups.

The Commit

This commit changes BLOCK_ALIGN from 1 << 10 to 1 << 15.
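Assuming the macro lives where it does in current Emacs sources (src/alloc.c; the exact context may differ between versions), the change is a one-liner along these lines:

```
-#define BLOCK_ALIGN (1 << 10)
+#define BLOCK_ALIGN (1 << 15)
```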

Testing Procedure

I ran the following program, which creates a long chain of conses and then nils them out to ensure that they get collected in the next call to garbage-collect. The total was picked arbitrarily. The experiment is repeatable, but the results are prone to noise; the effect is large enough to be visible despite these inconsistencies.

  (defun test-garbage-collect-large-cons-allocation ()
    (garbage-collect)
    (setq long-chain (cl-loop for i from 0 below 105910837 collect i))
    (garbage-collect)
    (redisplay)
    (message "%S" (garbage-collect))
    (setq gc-cons-threshold most-positive-fixnum)
    (let ((head long-chain))
      (while head
        (let ((tail (cdr head)))
          (setcdr head nil)
          (setcar head nil)
          (setq head tail))))
    (setq long-chain nil)
    (sleep-for 10)
    (let ((start-time (time-to-seconds (current-time))))
      (message "Start Garbage Collection")
      (setq gc-cons-threshold (car (get 'gc-cons-threshold 'standard-value)))
      (message "%S" (garbage-collect))
      (let ((duration (- (time-to-seconds (current-time)) start-time)))
        (prog1 duration (message "Garbage Collection: %f" duration)))))

Observed Results

The above graph shows that latency decreases as BLOCK_ALIGN increases. It stops at 1 << 15 because my instance of Emacs crashes on startup above that value.

Why Does This Work?

In sweep_conses, Emacs traverses the cons block linked list. With the current value of BLOCK_ALIGN, each block contains fewer than 1024 conses. Since there are no guarantees of cache locality between blocks, each dereference risks a branch misprediction. Increasing the block size reduces the total number of blocks to traverse, and with it the total number of branch mispredictions, so the total amount of work decreases.

Screenshot showing more branch mispredictions without the change.
Branch Mispredictions with BLOCK_ALIGN at 1 << 10

Picking a representative sample from a run without the change, using Instruments.app's performance counter view, shows 14316 sampled branch mispredictions.

Screenshot showing fewer branch mispredictions with the change.
Branch Mispredictions with BLOCK_ALIGN at 1 << 15

With the change, a representative sample shows 6511 sampled branch mispredictions, which roughly correlates with the difference in latency.

How Was This Found?

I had been profiling Emacs with Instruments a while ago and observed that, in my use cases, the majority of garbage collection time tended to be spent in sweep_conses. I increased BLOCK_ALIGN in response, since 1024 is a pretty small number for modern processors, and observed that my garbage collection time perceptibly decreased. I suspected branch mispredictions from iterating the cons block linked list were the cause, but I was not confident in that assertion without direct evidence.

Recently, I learned that CPU performance counters are available in Instruments without recompiling Emacs with additional flags, and that one of the counters counts branch mispredictions. With that newly acquired knowledge, I ran this experiment to confirm the conjecture.

Future Work

  1. Identify why 1 << 15 is the current limit.

Living the Eshell Dream

A Reduction in Latency From 70 Seconds to 3 Seconds

Eshell is an excellent package for Emacs. However, it has the appearance of being slow in comparison to other shells. It turns out this is not eshell's fault, but a limitation in how Emacs processes subprocess output, in conjunction with the default STDOUT buffer size for PTY processes (on macOS this maxes out at 1024 bytes). This becomes more of an issue because Emacs handles subprocess output in the same event loop in which it handles all other user input.

In a change I made to my fork of Emacs, I added a background thread that continuously buffers subprocess output. This ensures that the output is consumed as soon as it is available on STDOUT, which minimizes the time the subprocess spends blocked waiting for Emacs to consume its output. It also allows the strings passed to the process filter to be larger than 1024 bytes, because multiple reads can happen between event loop iterations.

For reference, on my machine, running UnrealHeaderTool without the change takes around 70 seconds to complete because of the 1024-byte buffer bottleneck.

Running it with the change takes 3 seconds, approximately a 95.7% reduction in latency.

Similarly, running sourcekitten complete goes from 7.334 seconds to 0.344 seconds, approximately a 95.3% reduction in latency.

Without This Change

[Timeline diagram: rows for the subprocess and Emacs, with total wasted latency marked.]

RED: The subprocess is blocked waiting for its STDOUT to be read.

YELLOW: The subprocess is writing to STDOUT.

BLUE: Emacs is doing something other than reading subprocess output.

GREEN: Emacs is processing subprocess output, including output from other processes.

With This Change

[Timeline diagram: rows for the subprocess, the background thread, and Emacs, with total wasted latency marked.]

Background

An Emacs subprocess can use either a pipe or a PTY to communicate with Emacs. If you're only consuming STDOUT, you want to use a pipe, because the STDOUT buffer can be much larger. However, if you also need to write to STDIN, you will need to use a PTY. All processes that run inside eshell use a PTY so that the process can accept user input.

On macOS, this has the effect of routing all subprocess output through a 1024-byte bottleneck. Each time Emacs consumes the 1024-byte buffer, it decodes the contents according to the subprocess's locale and passes them to the subprocess's filter, which handles displaying them in the eshell buffer. This causes a number of issues:

  1. The subprocess is not preparing its next STDOUT write while it is blocked. This means any time spent blocked on Emacs is pure additional latency for the subprocess.
  2. Process filters are arbitrary elisp code and can take arbitrarily long to run. This in turn delays the next read of the subprocess's STDOUT, causing the subprocess to block again waiting for Emacs to read from it.
  3. Eshell and font-lock now have to spend much more processing power rendering intermediate, broken states prior to rendering the final correct state. It is much cheaper computationally to render the full buffer once than to thrash and re-render repeatedly.

Commit Overview

The commit adds a background thread that continually processes output from subprocesses. Its interface tries to behave like read() in order to minimize the number of changes necessary in process.c. This abstraction has a few leaks, in that waiting_for_process_output can no longer rely on the ready states of the fds when computing the fd mask.

To mitigate this, an internal file descriptor is added so that the background thread can notify the main thread when process output is available. Similarly, another file descriptor is used to communicate back to the background producer thread when it stops because the output buffers are full.

Conclusion

With this change, using eshell is a lot closer to indistinguishable from other terminals, which is impressive considering how much more extensible it is with respect to consuming output. In conjunction with my package org-runbook.el, eshell is the best terminal experience available today.

Appendix

Buffer Size Message Form

  (defun eshell-output-filter-debug-advice (process string)
    "Remove me by evaluating the following form with C-x C-e:
  `(advice-remove 'eshell-output-filter
                  #'eshell-output-filter-debug-advice)'"
    (message "Output Filter Buffer Size: %d"
             (length string)))

  (advice-add 'eshell-output-filter
              :before #'eshell-output-filter-debug-advice)

sourcekitten complete Output With Change

Output Filter Buffer Size: 3565
Output Filter Buffer Size: 182778
Output Filter Buffer Size: 4

sourcekitten complete Output Without Change

Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1020
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1020
Output Filter Buffer Size: 1022
Output Filter Buffer Size: 1020
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1020
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1022
Output Filter Buffer Size: 1022
Output Filter Buffer Size: 1022
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1022
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1020
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1020
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1022
Output Filter Buffer Size: 1022
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1022
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1020
Output Filter Buffer Size: 1022
Output Filter Buffer Size: 1022
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1022
Output Filter Buffer Size: 1016
Output Filter Buffer Size: 1022
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1016
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1022
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1016
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1022
Output Filter Buffer Size: 1020
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1020
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1022
Output Filter Buffer Size: 1022
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1018
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1022
Output Filter Buffer Size: 1020
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1022
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1020
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1022
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 87
Output Filter Buffer Size: 11
Output Filter Buffer Size: 14

Portfolio

iOS App Development

  1. Swift
  2. SwiftUI
  3. Objective-C
  4. ARKit
  5. SceneKit
  6. UIKit
  7. Stripe
  8. Push Notifications
  9. Game Center
  10. Apple Pay

Web Development

  1. React
  2. JavaScript
  3. TypeScript
  4. Redux
  5. Java
  6. Scala
  7. AWS
  8. NodeJS
  9. OAuth2
  10. Ecommerce
  11. Live Video Streaming
  12. Python
  13. Django
  14. SQL

Design

  1. UI/UX
  2. Graphic
  3. Product
  4. Game
  5. Sound
  6. Prototype
  7. Digital
  8. Illustration
  9. Adobe Creative
  10. Unreal Engine
  11. Figma
  12. Blender
  13. TouchDesigner
  14. Rive

Apps

AR Pong

An Augmented Reality Beer Pong iOS Game Released in October 2019.

Esale Rugs Search Page

A custom search page built on top of OpenSearch.

Craps (Source)

A sample react implementation of the game of Craps.

Emacs Packages

Counsel Edit Mode

A mode that allows editing files in place for use with counsel-ag, counsel-rg, counsel-git-grep, counsel-ack, and counsel-grep.

Org Runbook

A library for executing runbooks in org format.

Compile Queue

A queue for scheduling and running shell commands in Emacs sequentially.

Company Tailwindcss

A company backend for autocompleting tailwindcss classes.

Eglot Copilot

A transformer for using Copilot to sort autocomplete results from other company backends.

Company Eglot

A company backend for getting results from eglot servers asynchronously.

Endorsements

Tyler is my long-time friend and colleague, from our time consulting together while attending NCSU through working together afterwards at Amazon. Simply put: his work is meticulous and fast, and he's one of the best developers I know in the industry. If you are considering his consultancy, starting your code base with best practices and avoiding troublesome vendor lock-in is a must-have. I cannot recommend T Dodge Consulting LLC enough for your consulting needs.

- Brandon Prime