
Living The Dream Blog

T Dodge Consulting LLC


Significant Garbage Collection Improvement For Emacs

Reducing Wall Clock Latency For sweep_conses By 50%.

This commit reduces the total wall clock duration of sweep_conses execution by approximately 50%. It does so by reducing branch mispredictions from dereferencing storage blocks while sweeping the cons blocks. Parsing the output of subprocesses such as LSP servers creates huge numbers of conses, so this commit significantly increases the responsiveness of modes such as eglot or company-mode.

Caveats

I have only tested this on macOS on a processor with a 256 KB L2 cache. I am uncertain whether the behavior is the same on other operating systems. There may also be constraints I'm unaware of that make this incompatible with other setups.

The Commit

This commit changes BLOCK_ALIGN from 1 << 10 to 1 << 15.
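Assuming the macro lives where it does in current Emacs sources (src/alloc.c; the exact context may differ between versions), the change is a one-liner along these lines:

```
-#define BLOCK_ALIGN (1 << 10)
+#define BLOCK_ALIGN (1 << 15)
```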

Testing Procedure

I ran the following program, which creates a long chain of conses and then nils them out to ensure that they get collected in the next call to garbage-collect. The total was picked arbitrarily. The experiment is repeatable, but the results are prone to noise; the effect is large enough to be visible despite these inconsistencies.

  (defun test-garbage-collect-large-cons-allocation ()
    (garbage-collect)
    (setq long-chain (cl-loop for i from 0 below 105910837 collect i))
    (garbage-collect)
    (redisplay)
    (message "%S" (garbage-collect))
    (setq gc-cons-threshold most-positive-fixnum)
    (let ((head long-chain))
      (while head
        (let ((tail (cdr head)))
          (setcdr head nil)
          (setcar head nil)
          (setq head tail))))
    (setq long-chain nil)
    (sleep-for 10)
    (let ((start-time (time-to-seconds (current-time))))
      (message "Start Garbage Collection")
      (setq gc-cons-threshold (car (get 'gc-cons-threshold 'standard-value)))
      (message "%S" (garbage-collect))
      (let ((duration (- (time-to-seconds (current-time)) start-time)))
        (prog1 duration (message "Garbage Collection: %f" duration)))))

Observed Results

The above graph shows that latency decreases as BLOCK_ALIGN increases. It stops at 1 << 15 because my instance of Emacs crashes on startup above that value.

Why Does This Work?

In sweep_conses, Emacs traverses the cons block linked list. With the current value of BLOCK_ALIGN, each block contains fewer than 1024 conses. Since there are no guarantees of cache locality between blocks, each dereference risks a branch misprediction. Increasing the block size reduces the total number of blocks to traverse, and with it the total number of branch mispredictions, so the total amount of work decreases.

Screenshot showing more branch mispredictions without the change.
Branch Mispredictions with BLOCK_ALIGN at 1 << 10

Picking a representative sample from a run without the change, using Instruments.app's performance counter view, shows 14316 sampled branch mispredictions.

Screenshot showing fewer branch mispredictions with the change.
Branch Mispredictions with BLOCK_ALIGN at 1 << 15

With the change, a representative sample shows 6511 sampled branch mispredictions, which roughly correlates with the difference in latency.

How Was This Found?

I had been profiling Emacs with Instruments a while ago and observed that, in my use cases, the majority of garbage collection time tended to be spent in sweep_conses. I increased BLOCK_ALIGN in response, since 1024 is a pretty small number for modern processors, and observed that my garbage collection time perceptibly decreased. I suspected branch mispredictions from iterating the cons block linked list were the cause, but I was not confident in that assertion without direct evidence.

Recently, I learned that CPU performance counters are available in Instruments without recompiling Emacs with additional flags, and that one of the counters counts branch mispredictions. With that newly acquired knowledge, I ran this experiment to confirm the conjecture.

Future Work

  1. Identify why 1 << 15 is the current limit.

Living the Eshell Dream

A Reduction in Latency From 70 Seconds to 3 Seconds

Eshell is an excellent package for Emacs. However, it has the appearance of being slow in comparison to other shells. It turns out this is not eshell's fault, but a limitation in how Emacs processes subprocess output, in conjunction with the default STDOUT buffer size for PTY processes (on macOS this maxes out at 1024 bytes). This becomes more of an issue because Emacs handles subprocess output in the same event loop in which it handles all other user input.

In a change I made to my fork of Emacs, I added a background thread that continuously buffers subprocess output. This ensures that the output is consumed as soon as it is available on STDOUT, which minimizes the time the subprocess spends blocked waiting for Emacs to consume its output. It also allows the strings passed to the process filter to be larger than 1024 bytes, because multiple reads can happen between event loop iterations.

For reference, on my machine, running UnrealHeaderTool without the change takes around 70 seconds to complete because of the 1024-byte buffer bottleneck.

Running it with the change takes 3 seconds, approximately a 95.7% reduction in latency.

Similarly, running sourcekitten complete goes from 7.334 seconds to 0.344 seconds, approximately a 95.3% reduction in latency.

Without This Change

[Timeline diagram: rows for the subprocess and Emacs, with total wasted latency marked.]

RED: The subprocess is blocked waiting for its STDOUT to be read.

YELLOW: The subprocess is writing to STDOUT.

BLUE: Emacs is doing something other than reading subprocess output.

GREEN: Emacs is processing subprocess output, including output from other processes.

With This Change

[Timeline diagram: rows for the subprocess, the background thread, and Emacs, with total wasted latency marked.]

Background

An Emacs subprocess can use either a pipe or a PTY to communicate with Emacs. If you're only consuming STDOUT, you want to use a pipe, because the STDOUT buffer can be much larger. However, if you also need to write to STDIN, you will need to use a PTY. All processes that run inside eshell use a PTY so that the process can accept user input.

On macOS, this has the effect of routing all subprocess output through a 1024-byte bottleneck. Each time Emacs consumes the 1024-byte buffer, it decodes the contents according to the subprocess's locale and passes them to the subprocess's filter, which handles displaying them in the eshell buffer. This causes a number of issues:

  1. The subprocess is not preparing its next STDOUT write while it is blocked. This means any time spent blocked on Emacs is pure additional latency for the subprocess.
  2. Process filters are arbitrary elisp code and can take arbitrarily long to run. This in turn delays the next read of the subprocess's STDOUT, causing the subprocess to block again waiting for Emacs to read from it.
  3. Eshell and font-lock now have to spend much more processing power rendering intermediate, broken states prior to rendering the final correct state. It is much cheaper computationally to render the full buffer once than to thrash and re-render repeatedly.

Commit Overview

The commit adds a background thread that continually processes output from subprocesses. Its interface tries to behave like read() in order to minimize the number of changes necessary in process.c. This abstraction has a few leaks, in that waiting_for_process_output can no longer rely on the ready states of the fds when computing the fd mask.

To mitigate this, an internal file descriptor is added so that the background thread can notify the main thread when process output is available. Similarly, another file descriptor is used to communicate back to the background producer thread when it stops because the output buffers are full.

Conclusion

With this change, using eshell is a lot closer to indistinguishable from other terminals, which is impressive considering how much more extensible it is with respect to consuming output. In conjunction with my package org-runbook.el, eshell is the best terminal experience available today.

Appendix

Buffer Size Message Form

  (defun eshell-output-filter-debug-advice (process string)
    "Remove me by evaluating the following form with C-x C-e:
  `(advice-remove 'eshell-output-filter
                  #'eshell-output-filter-debug-advice)'"
    (message "Output Filter Buffer Size: %d"
             (length string)))

  (advice-add 'eshell-output-filter
              :before #'eshell-output-filter-debug-advice)

sourcekitten complete Output With Change

Output Filter Buffer Size: 3565
Output Filter Buffer Size: 182778
Output Filter Buffer Size: 4

sourcekitten complete Output Without Change

Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1020
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1020
Output Filter Buffer Size: 1022
Output Filter Buffer Size: 1020
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1020
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1022
Output Filter Buffer Size: 1022
Output Filter Buffer Size: 1022
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1022
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1020
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1020
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1022
Output Filter Buffer Size: 1022
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1022
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1020
Output Filter Buffer Size: 1022
Output Filter Buffer Size: 1022
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1022
Output Filter Buffer Size: 1016
Output Filter Buffer Size: 1022
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1016
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1022
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1016
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1022
Output Filter Buffer Size: 1020
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1020
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1022
Output Filter Buffer Size: 1022
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1018
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1022
Output Filter Buffer Size: 1020
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1022
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1020
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1022
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 1024
Output Filter Buffer Size: 87
Output Filter Buffer Size: 11
Output Filter Buffer Size: 14

Portfolio

iOS App Development

  1. Swift
  2. SwiftUI
  3. Objective-C
  4. ARKit
  5. SceneKit
  6. UIKit
  7. Stripe
  8. Push Notifications
  9. Game Center
  10. Apple Pay

Web Development

  1. React
  2. JavaScript
  3. TypeScript
  4. Redux
  5. Java
  6. Scala
  7. AWS
  8. NodeJS
  9. OAuth2
  10. Ecommerce
  11. Live Video Streaming
  12. Python
  13. Django
  14. SQL

Design

  1. UI/UX
  2. Graphic
  3. Product
  4. Game
  5. Sound
  6. Prototype
  7. Digital
  8. Illustration
  9. Adobe Creative
  10. Unreal Engine
  11. Figma
  12. Blender
  13. TouchDesigner
  14. Rive

Apps

AR Pong

An Augmented Reality Beer Pong iOS Game Released in October 2019.

Esale Rugs Search Page

A custom search page built on top of OpenSearch.

Craps (Source)

A sample react implementation of the game of Craps.

Emacs Packages

Counsel Edit Mode

A mode that allows editing files in place for use with counsel-ag, counsel-rg, counsel-git-grep, counsel-ack, and counsel-grep.

Org Runbook

A library for executing runbooks in org format.

Compile Queue

A queue for scheduling and running shell commands in Emacs sequentially.

Company Tailwindcss

A company backend for autocompleting tailwindcss classes.

Eglot Copilot

A transformer for using Copilot to sort autocomplete results from other company backends.

Company Eglot

A company backend for getting results from eglot servers asynchronously.

Endorsements

Tyler is my long-time friend and colleague, from our time consulting together while attending NCSU through working together afterwards at Amazon. Simply put: his work is meticulous and fast, and he's one of the best developers I know in the industry. If you are considering his consultancy, starting your code base with best practices and avoiding troublesome vendor lock-in is a must-have. I cannot recommend T Dodge Consulting LLC enough for your consulting needs.

- Brandon Prime