Mimicking MIME HTML · Andy Chen

In an iOS app I’m currently building, I wanted to give the user an option to save webpages for offline use. To do this, I wanted a program that would turn an open web page into a self-contained copy that can be read later with the network off. Most modern Chromium-based browsers have something like this: Save Page As > Webpage, Single File (.mhtml).

Apple doesn’t have MHTML

Apple’s App Store rules have long required third-party browser apps on iOS and iPadOS to use the WebKit engine (macOS has no such restriction). WebKit does not support MHTML, so we can neither save webpages in that format nor render them later.

What WebKit does provide is its own single-file archive format, .webarchive, which serves a very similar purpose:

Feature	Chromium `.mhtml`	WebKit `.webarchive`
Stores an entire webpage in one file	✓	✓
Includes images, stylesheets, and other assets	✓	✓
Can be loaded later without re-downloading resources	✓	✓
Cross-browser support	✓	✗ (Apple only)

The biggest advantage for us is that WebKit can generate a .webarchive for a webpage with a single API call.

webView.createWebArchiveData { result in
    // result is Result<Data, Error> containing the generated .webarchive data
}

Unfortunately, doing so introduces some new problems.

Live pages do not archive cleanly

createWebArchiveData() snapshots whatever the web view currently holds. If you point it at a live, JavaScript-heavy page, the archive you get back may not behave like a clean offline snapshot:

JavaScript runs again. Saved scripts re-execute when the archive is reopened. For a single-page app, that can mean the app tries to boot again, re-hydrate, or call APIs that are no longer reachable offline.
Lazy-loaded content is missing. Content that was never rendered or inserted into the DOM before archiving may not make it into the archive. This includes things loaded only after user interaction or client-side API calls.
HTML metadata can interfere. Tags like <base> may break inlined content or redirect URLs back to the live site.
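
To make the <base> problem concrete, here is a small sketch using the standard URL constructor, which follows the same relative-URL resolution rules a browser applies. The example.com addresses are made up purely for illustration.

```javascript
// Hypothetical URLs, chosen only to illustrate the resolution rules.
const pageURL = "https://example.com/articles/post.html";
const baseHref = "https://cdn.example.com/assets/";

// Resolve a reference the way a browser would, against a given base.
function resolve(reference, base) {
    return new URL(reference, base).href;
}

// Without a <base> tag, relative references resolve against the page URL.
const withoutBase = resolve("img/logo.png", pageURL);

// With <base href="https://cdn.example.com/assets/">, the same reference
// points somewhere else entirely.
const withBase = resolve("img/logo.png", baseHref);

console.log(withoutBase); // https://example.com/articles/img/logo.png
console.log(withBase);    // https://cdn.example.com/assets/img/logo.png
```

A leftover <base> therefore changes where every relative reference in the saved page points, which is why it is worth removing before archiving.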

Chrome’s MHTML export is closer to a frozen snapshot of the page after it has loaded. To get closer to that behavior with WebKit, I need to clean up the DOM before archiving.

The recipe

The archive pipeline runs in three steps: serialize the rendered page into static HTML, reload that HTML offscreen with JavaScript disabled, then package the result into a .webarchive.

1. Serialize the rendered DOM and strip JavaScript

Before creating the archive, a script runs inside the live page using callAsyncJavaScript().

The first thing it does is scroll through the page. Many sites lazy-load images only when they come into view, so walking from top to bottom gives those images a chance to load before the snapshot is taken.

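// Helpers assumed by the loop; the cap of 50 steps is an illustrative value.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
const docHeight = () => document.documentElement.scrollHeight;
const maxSteps = 50;

const step = window.innerHeight; // advance one viewport at a time
let y = 0;

// callAsyncJavaScript() runs this as an async function body, so await is allowed.
for (let i = 0; i < maxSteps && y < docHeight(); i++) {
    window.scrollTo(0, y);
    await sleep(60); // give lazy loaders a moment to react
    y += step;
}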
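
The loop has two stopping conditions for a reason: the document height is re-measured on every pass, so a page that grows while it is scrolled gets followed, and the step cap keeps an endless feed from scrolling forever. The sketch below simulates that behavior outside a browser. planScroll and heightAfter are hypothetical names introduced for illustration, and the sizes are arbitrary.

```javascript
// Simulate the scroll loop without a browser: return the offsets it would
// visit. `heightAfter(y)` reports the document height once the page has been
// scrolled to offset y, which lets a fake page grow as it is scrolled.
function planScroll(viewportHeight, heightAfter, maxSteps) {
    const offsets = [];
    let y = 0;
    let height = heightAfter(0); // height before any scrolling
    for (let i = 0; i < maxSteps && y < height; i++) {
        offsets.push(y);
        height = heightAfter(y); // re-measure after each scroll
        y += viewportHeight;
    }
    return offsets;
}

// A fixed-height page: the loop stops once it passes the bottom.
const fixedPage = planScroll(800, () => 2000, 50);

// An endless feed that always has two more viewports of content:
// only the step cap stops the loop.
const endlessFeed = planScroll(800, (y) => y + 1600, 5);

console.log(fixedPage);   // [ 0, 800, 1600 ]
console.log(endlessFeed); // [ 0, 800, 1600, 2400, 3200 ]
```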

Once everything has had a chance to render, the script clones the DOM. All further changes happen on the clone so the visible page is left untouched.

const doc = document.documentElement.cloneNode(true);

Next, the page styles are copied into a single inline <style> element. This captures the CSS that is currently applied to the page and keeps WebKit aware of any referenced fonts or background images.

let css = "";

for (const sheet of document.styleSheets) {
    try {
        for (const rule of sheet.cssRules) {
            css += rule.cssText + "\n";
        }
    } catch {
        // Reading cssRules on a cross-origin sheet throws a SecurityError; skip it.
    }
}
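
The try/catch matters because a stylesheet served from another origin without CORS headers refuses access to its rules. As a sketch, the same loop can be written as a function over any iterable of sheet-like objects, which makes the skip behavior easy to see outside a browser. collectCss and the two stand-in sheets are illustrative names, not part of the original script.

```javascript
// Concatenate the text of every readable rule. `sheets` is any iterable of
// objects shaped like CSSStyleSheet (in the page: document.styleSheets).
function collectCss(sheets) {
    let css = "";
    for (const sheet of sheets) {
        try {
            for (const rule of sheet.cssRules) {
                css += rule.cssText + "\n";
            }
        } catch {
            // This sheet's rules are not readable; move on to the next one.
        }
    }
    return css;
}

// Stand-ins for real stylesheets so the function can run outside a browser.
const readable = { cssRules: [{ cssText: "body { margin: 0px; }" }] };
const blocked = {
    get cssRules() {
        throw new Error("SecurityError: cannot access rules");
    },
};

const collected = collectCss([readable, blocked]);
console.log(JSON.stringify(collected)); // "body { margin: 0px; }\n"
```

Rules from a skipped sheet are simply absent from the inlined CSS. If the sheet's <link> element is still present in the serialized HTML, the offscreen reload in step 2 may get a chance to fetch it.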

Images are then rewritten to use the exact resource that is already loaded. Responsive image candidates are removed so the archived page does not try to fetch a different version later.

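// Assumed setup: pair each cloned <img> with its live counterpart by document order.
const live = document.querySelectorAll("img");

doc.querySelectorAll("img").forEach((img, i) => {
    // Prefer the resource the live page actually loaded.
    const url = (live[i] && (live[i].currentSrc || live[i].src))
        || img.getAttribute("data-src")
        || img.src;

    if (url) img.setAttribute("src", url);

    // Drop responsive candidates so no other variant is fetched later.
    img.removeAttribute("srcset");
});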
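
The fallback order is easier to follow when the choice is pulled out into a function. This is a minimal sketch, assuming the live and cloned images are paired one to one. pickImageUrl and fakeImg are hypothetical helpers, and the fake objects carry only the members the function reads.

```javascript
// Choose the URL to freeze into the archived <img>.
// `liveImg` is the element in the visible page (may be undefined),
// `clonedImg` is its counterpart in the cloned DOM.
function pickImageUrl(liveImg, clonedImg) {
    return (liveImg && (liveImg.currentSrc || liveImg.src))
        || clonedImg.getAttribute("data-src")
        || clonedImg.src;
}

// Minimal stand-in for an <img> element, usable outside a browser.
function fakeImg({ currentSrc = "", src = "", dataSrc = null } = {}) {
    return {
        currentSrc,
        src,
        getAttribute: (name) => (name === "data-src" ? dataSrc : null),
    };
}

// A responsive image: the candidate the browser actually picked wins.
const responsive = pickImageUrl(
    fakeImg({
        currentSrc: "https://example.com/hero-2x.jpg",
        src: "https://example.com/hero.jpg",
    }),
    fakeImg({ src: "https://example.com/hero.jpg" })
);

// A lazy image that never loaded: fall back to its data-src attribute.
const lazy = pickImageUrl(
    fakeImg(),
    fakeImg({ dataSrc: "https://example.com/below-fold.jpg" })
);

console.log(responsive); // https://example.com/hero-2x.jpg
console.log(lazy);       // https://example.com/below-fold.jpg
```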

Finally, anything that could change the page after it is reopened is removed.

doc.querySelectorAll(
    "script, noscript, base, meta[http-equiv='Content-Security-Policy']"
).forEach(el => el.remove());

At this point, the script returns a complete HTML document representing the current state of the page.

return "<!DOCTYPE html>\n" + doc.outerHTML;

2. Load the static page

The generated HTML is loaded into a temporary WKWebView with JavaScript disabled.

let config = WKWebViewConfiguration()
config.defaultWebpagePreferences.allowsContentJavaScript = false

The HTML is then loaded using the original page URL as the base URL.

let loader = StaticLoader(configuration: config)

try await loader.load(
    html: html,
    baseURL: pageURL
)

This gives WebKit one chance to download any remaining images, fonts, or stylesheets referenced by the page. Since JavaScript is disabled, the page cannot modify itself while this happens. The loader then pauses for two seconds so that slower subresources have time to arrive.

try? await Task.sleep(for: .seconds(2))

3. Create the archive

Once the page has finished loading, WebKit can package everything into a single .webarchive file.

return try await loader.webArchiveData()

The resulting data is written to disk, along with a small manifest containing metadata such as the title, URL, size, and save date.

Loading it back

WebKit recognizes .webarchive files automatically, so the saved file can be loaded directly into a web view.

webView.loadFileURL(
    fileURL,
    allowingReadAccessTo: fileURL.deletingLastPathComponent()
)

Because the archive already contains the page and its resources, it renders without a network connection. The end result is very close to Chrome’s MHTML export, but built entirely on top of WebKit’s .webarchive support.

The full implementation of this is available in the mime-mhtml repository, which includes all the pieces needed to integrate this into an iOS app.