Developing AI Products For Local Inference

June 08, 2024

At Quill, we believe in running useful AI on people's own devices as much as possible. This improves both privacy and dependability (Quill works offline!).

However, this does introduce constraints: we work hard to make sure Quill runs on devices as old as a 2019 Microsoft Surface or a 2019 MacBook Air with an Intel processor. At the same time, the results must be high quality or they aren't useful to users, and we have a pretty high bar for quality. Honestly, we strive for the best of all worlds: we aim for the highest-quality results in the industry while running where people already are.

There are a few tricks to doing so, but today I just want to talk about combining orthogonal approaches in probabilistic domains.

[Image: a lost robot hiker. Caption: "90% certain I should have turned left 3 days ago..."]

Implications of Probabilistic Programming

AI programming is probabilistic. For most product developers, this is the biggest change from traditional SaaS programming. There's no way to prove code correct, no "complete" test suite, and no proof by induction that works. Every inference has an X% chance of generating good results and a (1-X)% chance of generating bad results.

Moreover, X compounds.

In the worst case, you design a pipeline where each step depends on getting good inputs from the previous step. Lots of "agents" work this way. In that case, a 2-step process where each step has an X% chance of a good result gives your overall pipeline an X% * X% chance of a good result. If X is 90%, your overall chance of a good result is 81%. Every serial dependent step you add compounds your chance of failure.
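
To make the compounding concrete, here is a minimal sketch in Python. The per-step numbers are illustrative, not Quill's actual rates:

```python
# Success rates multiply across serial dependent steps.
from math import prod

def serial_success(step_probs: list[float]) -> float:
    """Probability that every dependent step in a serial pipeline succeeds."""
    return prod(step_probs)

# Two dependent steps at 90% each -> 81% end-to-end.
print(serial_success([0.9, 0.9]))  # 0.81
# Five steps at 90% each already drops below 60%.
print(serial_success([0.9] * 5))   # ~0.59
```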

The two approaches we take to make the probabilistic nature of AI work for us are running multiple passes at each step and designing steps that can compensate for errors in their inputs.

Parallel Processing

When we run multiple independent versions of a single step in parallel, we end up multiplying (1-X) instead of X, much like the Swiss cheese model of security. If we identify the speaker of a sentence in 2 different ways that are unrelated to each other, and we can figure out which answer is most likely to be correct, our chance of a good result becomes 1 - ((1-X) * (1-X)) = 2X - X^2. If X is again 90%, this comes out to 99%. The thing to be cautious of here is correlation between your different ways of running each step; losing independence means the approaches make the same sorts of mistakes rather than different ones, and the benefit evaporates. The other practical consideration is performance.
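
Here's a sketch of both the redundancy math and the correlation caveat. The 90% pass rates and the "10% of inputs are hard" failure model are made-up illustrations, not measurements from Quill:

```python
import random

def parallel_success(p_a: float, p_b: float) -> float:
    """Chance that at least one of two *independent* passes succeeds."""
    return 1 - (1 - p_a) * (1 - p_b)

print(parallel_success(0.9, 0.9))  # 0.99

def simulate(correlated: bool, trials: int = 100_000) -> float:
    """Monte Carlo: success rate of two 90% passes, with and without
    a shared failure mode (the same 10% of inputs are hard for both)."""
    wins = 0
    for _ in range(trials):
        if correlated:
            # Both passes fail on exactly the same hard inputs.
            hard = random.random() < 0.1
            a_ok = b_ok = not hard
        else:
            # Each pass fails independently 10% of the time.
            a_ok = random.random() < 0.9
            b_ok = random.random() < 0.9
        wins += a_ok or b_ok
    return wins / trials

print(simulate(correlated=False))  # ~0.99
print(simulate(correlated=True))   # ~0.90 -- the second pass bought nothing
```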

Robust Pipeline Steps

For the second approach, we like to think about the robustness principle: be conservative in what you produce and liberal in what you accept. I first learned about this at MIT, in the context of converting analog voltages into digital signals, where each logic gate restores a noisy input voltage to a clean output level. In essence, steps in our pipeline should take messy inputs and create cleaner outputs rather than compounding errors.

In a two-stage pipeline (Step A -> Step B), it can be the case that Step A creates outputs that would be acceptable to a human user of Quill only X% of the time. However, Step B takes messy inputs and generates human-quality output Y% of the time, where Y > X. By being judicious about which steps you include in your pipeline, what order they run in, and how their faults trade off, you can increase the quality of the final result without needing a state-of-the-art, resource-intensive algorithm for each individual step.
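
Here's a deliberately toy illustration of that principle, not anything resembling Quill's actual steps: Step A is a naive sentence splitter that leaves debris behind, and Step B is written to accept that mess and still emit clean output:

```python
def step_a_split(transcript: str) -> list[str]:
    """A naive splitter: leaves stray whitespace, empty chunks,
    and missing capitalization in its output."""
    return transcript.replace("?", "?|").replace(".", ".|").split("|")

def step_b_clean(chunks: list[str]) -> list[str]:
    """Liberal in what it accepts: drops empty chunks, trims whitespace,
    and restores capitalization -- compensating for Step A's faults
    instead of compounding them."""
    cleaned = []
    for chunk in chunks:
        chunk = chunk.strip()
        if not chunk:
            continue
        cleaned.append(chunk[0].upper() + chunk[1:])
    return cleaned

raw = "hi there.  how are you?i'm fine. "
print(step_b_clean(step_a_split(raw)))
# ['Hi there.', 'How are you?', "I'm fine."]
```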

Building robust pipelines out of multiple simpler techniques also means we can scale performance up and down depending on the capabilities of a given user's machine and its current load, providing the best experience possible to every user. Related practical tips we've learned: don't run 100 tabs in your browser, and turn off low-power mode on M2 and M3 machines.
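
As a hedged sketch of what that scaling could look like (the tiers, thresholds, and the idea of varying the number of parallel passes are illustrative assumptions, not Quill's real scheduler):

```python
import os

def choose_passes(cpu_count: int, load_avg: float) -> int:
    """Pick how many parallel passes to run per pipeline step."""
    headroom = cpu_count - load_avg
    if headroom >= 6:
        return 3  # plenty of idle cores: run three orthogonal passes
    if headroom >= 2:
        return 2
    return 1      # a busy 2019-era laptop: fall back to a single pass

cpu = os.cpu_count() or 2
# getloadavg is Unix-only; assume zero load where it's unavailable.
load = os.getloadavg()[0] if hasattr(os, "getloadavg") else 0.0
print(f"Running {choose_passes(cpu, load)} passes per step")
```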

I wish I could go into more detail about our actual implementations, but we have to have a few secrets :)

Not every step in a pipeline has to be ultra-advanced; because orthogonal approaches compensate for each other, even extremely simple heuristics can go quite far when combined with sophisticated techniques elsewhere in the pipeline.

In conclusion, don’t underestimate the effectiveness of combining multiple simple approaches in creating AI software that performs well in the real world.



m [at] mpdaugherty.com