A Simple Search With Vector Derivatives

The following image is creepy and has almost nothing to to, except maybe in spirit, to the article below. I just like asking stable diffusion for images. I asked for a book transforming into a vector. It didn’t like that. I can tell from the eyes.

A floating journal of secrets being pulled into a dense mist at the center and transformed into a floating uuid string on a white background

Part 1: Imagine Identity Data Filling A Void

Every instance of a login, every identity check or touchpoint is simply a portion of a persons full identity signal being used to produce a token or key. Which portions of a persons identity space are used is largely up to the application developer what to use. Here, we show that regardless of the portion of a persons identity signal a developer or organization cares about, there’s no need to store the raw identity information. The analysts will need to transform and vectorize it anyway. We can make it searchable, build proofs, and go to production leaner, without string identity data clogging our systems. Unless you’re selling mailing lists for side money, there’s no need to store even emails anymore - or to allow them into your system creating a liability on the organization. We can embed at the edge and use stable hashes of a known identity space (and masked vectors therein) for most common use cases.

Part 2: What Is An Embedding and Why

ChatGPT can probably say it better than I can:

“An embedding is a way of representing data, such as text or images, as numerical values that a computer can understand and work with. The process of turning data into an embedding is known as embedding the data. This is often done by using mathematical algorithms to convert the data into a numerical format that can be easily processed and analyzed by machine learning models. By embedding the data, it becomes possible to perform tasks such as search, classification, and clustering on the data. In addition, embeddings can also be used to create visualizations of the data, making it easier for humans to understand and interpret the results.”

One trick we can employ to help index lots of different data sources is to map all the identity information into a big fixed vector. That’s what ivec is doing in the background. There’s an identity model I that’s versioned. Every file indexed with model ‘I’ can be searched against each other.

Part 3: Building a Simple, Fast, Index and De-Identified Data Pool

Without getting into partial matches and nearest neighbors, we can jump right to the exact match case - the case most user databases require. Some user data is entered into a frontend and a lookup is required. Here we demonstrate a UUID derived from the vector embedding stable enough for search over repeated indexing.

import pandas as pd

# Create some simple rows of data
df = pd.DataFrame({'id': [1, 1, 3, 4, 4], 'name': ['a', 'a', 'c', 'd', 'e'], 'result': [10, 20, 30, 40, 50]})
    
    # Note the disjunction of the id and name columns.  We'll see those signals in the output

# Define the mapping
map = {'institutional_id_1': 'id', 'first_name': 'name'}

Actual output when indexed:

di_data
[{'result': 10, 'uuid': UUID('18b2c95f-5905-54f4-b828-6d7dab47aad7')}, {'result': 20, 'uuid': UUID('18b2c95f-5905-54f4-b828-6d7dab47aad7')}, {'result': 30, 'uuid': UUID('f8606c06-feef-5cba-b53a-30a2bbaa767e')}, {'result': 40, 'uuid': UUID('8ffcc23a-283d-5850-87c1-ed5ab8b41391')}, {'result': 50, 'uuid': UUID('00a62f45-d5e5-5450-a3c6-d019615b93e6')}]
indexes
{UUID('18b2c95f-5905-54f4-b828-6d7dab47aad7'): [0, 1], UUID('f8606c06-feef-5cba-b53a-30a2bbaa767e'): [2], UUID('8ffcc23a-283d-5850-87c1-ed5ab8b41391'): [3], UUID('00a62f45-d5e5-5450-a3c6-d019615b93e6'): [4]}

Now, when a lookup is required with senstive information, the identity information is first embedded and it’s UUID representation is checked against existing records. If nothing exists (UUID not found in index), then the application developer can decide how to proceed - notifying of no data found, flagging session, whatever’s required.

Future Coolness

Because the vector has room for lots of different identity signals, we can use nearest neighbor and some partial exactness to create immediate, flexible, de-identified user databases. More on that after the nearest neighbor quick look. For that, we’ll need a vector database or a some virtual index machinery.

References

Python Notebook