nlp posts - idmontie's Portfolio





While working on the Clarity Hub NLP API, we had a common use-case where we would create embeddings from text and use those embeddings to determine cosine similarity with other embeddings. Doing this required loading all of the embeddings in-memory and then computing cosine similarity against the entire dataset. As the dataset grew, this operation would get incredibly slow.

We worked on a fast way to do these lookups using range queries that can be performed in any database. This approach was never implemented, but we built multiple proofs of concept to test out our ideas. In one use-case, the goal was to take an input text, compute an embedding, load the entire embedding dataset into an AWS Lambda, find the most similar set of vectors, and return the top N. To tackle that, we came up with the following idea.

Given a vector A, compute its similarity to a unit vector U of the same dimension as A. So:

```cpp
dim(U) = dim(A)
```

And

```cpp
S_u = cos(θ) = (A · U) / (||A|| × ||U||)
```

Where S_u is the similarity with the unit vector. The unit vector just needs to be the same across all samples.

For each embedding, store the calculated S_u.
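The precompute step can be sketched in a few lines. This is a minimal illustration, not the Clarity Hub code: the helper names and the choice of U (all components equal, scaled to length 1) are assumptions, since any fixed unit vector would do.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def reference_vector(dim):
    # Hypothetical choice of U: every component equal, scaled to unit length.
    value = 1.0 / math.sqrt(dim)
    return [value] * dim

def precompute_scores(embeddings, u):
    # One S_u per stored embedding; this is the value saved next to each row.
    return [cosine_similarity(vec, u) for vec in embeddings]

embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
u = reference_vector(3)
scores = precompute_scores(embeddings, u)
```

The same `u` has to be reused for every stored embedding and for every query, otherwise the stored scores are not comparable.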

If we want to find similar vectors for a new vector B, then we compute its similarity to the unit vector.

Then, we can query the database for vectors within the interval `[S_u - ε, S_u + ε]`. This gives us a subset of the dataset whose similarity to the unit vector is close to that of B.

We can re-query increasing or decreasing ε until the top N results are found.
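As a rough sketch of the range query, here is one way it might look against an in-memory SQLite database. The table name, column names, starting ε, and the doubling strategy are all assumptions made for illustration, and this version only widens ε; it never narrows it.

```python
import sqlite3

def find_candidates(conn, s_u, top_n, eps=0.01, max_eps=2.0):
    """Widen the interval around s_u until at least top_n rows match."""
    while True:
        rows = conn.execute(
            "SELECT id, s_u FROM embeddings WHERE s_u BETWEEN ? AND ?",
            (s_u - eps, s_u + eps),
        ).fetchall()
        # Stop once we have enough rows, or once the interval covers [-1, 1].
        if len(rows) >= top_n or eps >= max_eps:
            return rows
        eps *= 2

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE embeddings (id INTEGER PRIMARY KEY, s_u REAL)")
# An index on s_u lets the database answer the range query without a full scan.
conn.execute("CREATE INDEX idx_embeddings_s_u ON embeddings (s_u)")
conn.executemany(
    "INSERT INTO embeddings (id, s_u) VALUES (?, ?)",
    [(1, 0.105), (2, 0.112), (3, 0.30), (4, -0.25), (5, 0.11)],
)
conn.commit()

candidates = find_candidates(conn, s_u=0.11, top_n=3)
```

Only the scalar `s_u` column is needed for the lookup, so the full vectors can live elsewhere and be fetched by `id` for the matching rows.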

To further improve accuracy, we can also re-compute the similarity score using cosine similarity with the subset of vectors, which is still much faster than computing the similarity against the entire dataset.
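The re-ranking step might look like the following. It is a sketch under the assumption that the candidates come back as `(id, vector)` pairs; the function names are illustrative.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rerank(query, candidates, top_n):
    """Score each (id, vector) candidate against the query and keep the best top_n."""
    scored = [(cid, cosine_similarity(query, vec)) for cid, vec in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Hypothetical candidates returned by the range query.
candidates = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
top = rerank([1.0, 0.1], candidates, top_n=2)
```

Because the exact cosine similarity is only computed for the candidate subset, the cost scales with the size of the interval rather than the size of the dataset.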

This approach begins to break down as the cosine similarity to the chosen unit vector gets very large (`> 0.4`). We end up with the possibility of matching against vectors that point in the opposite direction – the least similar vectors to the original input vector.

One way to work around this could be to pre-compute the similarity of each vector against the unit vector for every dimension of the input vector. But this could mean 512 or more cosine similarity calculations per vector for modern embeddings just to precompute the data. Once all unit vector similarities are calculated and stored, the range query against the database would be made against the column for which the input vector’s similarity is closest to 0.
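If the per-dimension unit vectors are taken to be the standard basis vectors (an assumption; the post does not say which ones were used), the cosine similarity against the i-th one reduces to the i-th component divided by the vector's norm. A small sketch of computing those columns and picking the one to query:

```python
import math

def axis_similarities(vec):
    """Cosine similarity of vec against each standard basis vector."""
    norm = math.sqrt(sum(x * x for x in vec))
    # cos(vec, e_i) = vec[i] / ||vec||, because e_i has length 1.
    return [x / norm for x in vec]

def pick_query_column(vec):
    """Index and value of the column whose similarity is closest to 0."""
    sims = axis_similarities(vec)
    index = min(range(len(sims)), key=lambda i: abs(sims[i]))
    return index, sims[index]

column, value = pick_query_column([3.0, -0.5, 4.0])
```

Under that assumption the stored columns are just the normalized embedding itself, so the precompute is one normalization per vector rather than hundreds of separate dot products.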

There are a lot of real solutions to this problem, but this was a fun exercise to think about and work on.

## Further reading

Vector similarity search is becoming increasingly popular and integrated into databases. Here are some resources to learn more: [Vector Similarity Search](https://zilliz.com/blog/vector-similarity-search).


Fast Similar Embedding Lookup





![Screen Shot 2023-01-07 at 3.57.30 PM.png](/media/2023-01-07-clarity-hub-infer/Screen_Shot_2023-01-07_at_3.57.30_PM.png)

While working on Clarity Hub, we created a Clarity Hub Infer API along with a developer portal that would let anyone create infer models.

The Clarity Hub Infer API provides a fast and intuitive way to create, manage, and deploy NLP models based on labelling utterances.

At the most basic level, the Infer API would let users send utterances via an API and get toxicity analysis, sentiment scores, and simple NLP data like nouns and topics from the utterances.

The power of the Infer API is that consumers can supply a set of pre-labelled utterances to the API, and the API will create a model from this, even if there are only a few utterances used for training. Then the consumer can send a new utterance and get a label using that model.

The NLP APIs at Clarity Hub were a set of APIs:

```mermaid
graph RL
  NLP(Clarity Hub NLP API) --> API(Clarity Hub Infer API) --> Consumer
```

The Consumer would use the Infer API, which provided APIs for training and labeling datasets and getting toxicity and sentiment analyses. The Clarity Hub NLP API contained trained TensorFlow models for creating embeddings via the Universal Sentence Encoder (USE).

An **embedding** is a vector that represents an utterance - a sentence, sentence fragment, or paragraph of text.

Training would involve a consumer sending a payload of utterances with labels to the Infer API, which would call the NLP API internally to create embeddings. We then clustered these embeddings and re-labelled the clusters using the given labels. If no label was found for an utterance cluster, we attempted to pull a topic out of the utterances to re-label it.

The clusters with labels were then stored into S3.

```mermaid
graph TD
  Train -->|Utterances with labels| USE
  USE -->|Embeddings with labels| Clustering
  Clustering -->|Embedding Clusters| Labeller
  Labeller -->|Clusters with Labels| S3
```
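The re-labelling step could be sketched as a majority vote over the labels in each cluster. The vote is an assumption made for illustration; the post does not describe how a cluster's label was actually chosen, and the topic-extraction fallback is left as `None` here.

```python
from collections import Counter

def label_clusters(assignments, labels):
    """Pick the most common given label per cluster.

    assignments[i] is the cluster id of utterance i; labels[i] is its label,
    or None when the utterance was sent without one.
    """
    votes = {}
    for cluster_id, label in zip(assignments, labels):
        counter = votes.setdefault(cluster_id, Counter())
        if label is not None:
            counter[label] += 1
    # A cluster with no labelled utterances maps to None (topic fallback not shown).
    return {
        cluster_id: (counter.most_common(1)[0][0] if counter else None)
        for cluster_id, counter in votes.items()
    }

cluster_labels = label_clusters(
    [0, 0, 1, 1, 2],
    ["billing", "billing", "support", None, None],
)
```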

To classify a new utterance, we created an embedding from it, loaded the existing dataset in, then ran cosine similarity to find the most probable matches:

```mermaid
graph TD
  Classify --> |Utterance| USE
  USE --> |Embedding| Classifier(Classifier)
  Classifier --> |Embedding + Clusters from S3| Similarity(Cosine Similarity)
  Similarity --> |Labels with Probability| Response
```
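A minimal sketch of the classification step, assuming each stored cluster is reduced to a `(label, centroid)` pair. The data layout and names are hypothetical, and the raw cosine similarity is returned as the score because the post does not say how similarities were turned into probabilities.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def classify(embedding, clusters):
    """Return (label, score) pairs sorted from most to least similar."""
    scored = [
        (label, cosine_similarity(embedding, centroid))
        for label, centroid in clusters
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical labelled clusters, standing in for what would be loaded from S3.
clusters = [
    ("greeting", [0.9, 0.1, 0.0]),
    ("complaint", [0.0, 0.2, 0.9]),
]
result = classify([0.8, 0.3, 0.1], clusters)
```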

### What it looked like

![Screen Shot 2023-01-07 at 3.57.56 PM.png](/media/2023-01-07-clarity-hub-infer/Screen_Shot_2023-01-07_at_3.57.56_PM.png)

![Screen Shot 2023-01-07 at 3.58.08 PM.png](/media/2023-01-07-clarity-hub-infer/Screen_Shot_2023-01-07_at_3.58.08_PM.png)

![Screen Shot 2023-01-07 at 3.58.28 PM.png](/media/2023-01-07-clarity-hub-infer/Screen_Shot_2023-01-07_at_3.58.28_PM.png)

### Conclusion

With ChatGPT and other NLP models coming out lately, this seems fairly basic, but the following processes are still very useful to understand:

- Convert language to a representation that is easier to work with, like a vector.
- Clustering vectors is a great way to find representative vectors, reducing the number of vectors you need to work with.
- Cosine similarity can be used to find how similar vectors are. If the vectors are labelled with metadata, it also tells you how similar the metadata of those vectors is.

You can see [my project page](/projects/2020-05-18-clarity-hub-infer) for more details and links to the GitHub repos.