Agenda
I. The Tasks, Models and Compute Thought Process
II. Tasks: Predictive vs Generative Tasks
III. Models: Evolution of LLMs - from a size-centric view
IV. Compute: Cloud Deployment Landscape for LLMs
- The Mental Model for Deploying in Cloud
- The Compute Environments in AWS
- Marriage between Models and Compute
V. Conclusion
I. The Thought Process for Using LLMs in Applications - Task, Model & Compute
- The thought process diagram below is inspired by the GenAI Project Life Cycle diagram from the Deeplearning.ai course on Coursera
Source: Author’s take | Open for debate | Inspired by the GenAI Project Life Cycle diagram from the Deeplearning.ai course on Coursera
Out of Scope:
- I have kept the following thoughts out of scope for now. For a reader, though, they may be important enough to factor into the Thought Process:
- Do you need domain-specific models (relevant if your application runs in healthcare, finance, law, etc.)?
- Why does it have to be just one of the two options - `Purpose-built/Customized LLM` and `Prompt-based General Purpose LLM`? Why not a `Prompt-based General Purpose LLM customized with RAG or Fine-tuning`?
II. The Discussion on Tasks
Source: Inspired by my favorite NLP researcher and spaCy founder - Ines Montani - in a QCon’24 London talk | Refer Slide 50
Some examples when Generative AI Is Not Needed (Traditional ML Suffices):
- Predictive Analytics: Forecasting future values, like stock prices or equipment failure.
- Classification: Assigning labels to data, such as spam detection or image classification.
- Recommendation Systems: Suggesting products or content based on user behavior.
- Anomaly Detection: Identifying outliers, like fraud detection or quality control.
- Optimization and Causal Inference: Solving problems like route optimization or understanding cause-and-effect relationships.
Some examples when Generative AI Is Needed (Traditional ML won’t be enough):
- Content Creation: Generating new text, images, music, or other creative outputs.
- Data Augmentation: Creating synthetic data to enhance training datasets.
- Personalization: Customizing content or interactions based on user preferences.
- Simulations and Scenario Generation: Creating dynamic and realistic training or testing environments.
- Creative Problem Solving and Design: Exploring innovative solutions, designs, or artistic ideas.
Source: Reply from GPT-4o for the prompt - “Can you summarize in 5 bullet points when is Gen AI needed and when it is not”
[1] Food for thought: Would you use a bulldozer to mow a lawn? That is just a waste of resources, honestly.
[2] Food for thought: A slide from Ines Montani’s presentation at the Data Hack Summit 2024 on Generative vs Predictive Tasks. Refer here
III. The Discussion on Models
Evolution of LLMs - A Memory-Footprint-centric View
Source: Author’s take | Open for debate
Key Interpretations:
- As we go right in the above diagram, the models become bigger and the tasks become more generic and complex.
- As models become bigger, they are more likely to be Closed Source than Open Source (where LLM weights and training methodology are shared).
First, how is the model memory size computed?
\[\text{Parameters in billions} \times \text{Floating Point (FP) precision of each parameter in bytes} = \text{Model Memory Size in GB}\]
\[\text{Total Memory taken by the model} = \text{Model Memory Size in GB} + \text{Memory for other components of the model}\]
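To sanity-check the formula, here is a minimal Python sketch (the parameter counts match the Ollama table further below; the bytes-per-parameter values are the standard widths for each precision):

```python
# Back-of-the-envelope calculator for the weights-only memory formula above.
def model_memory_gb(params_in_billions: float, bytes_per_param: float) -> float:
    """Parameters (in billions) x bytes per parameter ~= weights-only size in GB."""
    return params_in_billions * bytes_per_param

# FP32 = 4 bytes, FP16/BF16 = 2 bytes, INT8 = 1 byte, INT4 = 0.5 byte
for name, params in [("Llama 3 8B", 8.0), ("Llama 3 70B", 70.0), ("Phi 3 Mini", 3.8)]:
    print(f"{name}: FP16 ~ {model_memory_gb(params, 2):.1f} GB, "
          f"INT4 ~ {model_memory_gb(params, 0.5):.1f} GB")
```

Remember that this covers only the weights; as the second equation says, the total footprint also includes the other components of the model.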
The Model Sizes can be reduced by Quantization
Source: Author’s simplified take | Open for debate
How is a model quantized?
Let us look at a Post-training Quantization technique called Weight Quantization, used by Ollama:
- In weight-based quantization, the weights of a trained model are converted to lower precision without requiring any retraining. While this approach is simple, it can result in model performance degradation.
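To make the idea concrete, below is a toy, hedged sketch of symmetric INT8 weight quantization in NumPy. It is not Ollama’s actual implementation - Ollama’s default `Q4_0` performs block-wise 4-bit quantization inside GGUF files - but it illustrates the core step of mapping trained weights to lower precision without retraining:

```python
# Toy post-training weight quantization: symmetric, per-tensor INT8 (illustrative only).
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Map FP32 weights to INT8 with a single scale; no retraining involved."""
    scale = np.abs(weights).max() / 127.0                # largest magnitude maps to 127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate FP32 weights at inference time."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)             # stand-in for trained weights
q, scale = quantize_int8(w)
print("max reconstruction error:", np.abs(w - dequantize(q, scale)).max())
```

The reconstruction error printed at the end is exactly the "performance degradation" risk mentioned above: the coarser the precision, the larger that error tends to be.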
Below is a list of the model options that can be served by Ollama - one of the popular LLM inference frameworks.
| Model | Parameters | Size | Download |
|---|---|---|---|
| Llama 3 | 8B | 4.7GB | `ollama run llama3` |
| Llama 3 | 70B | 40GB | `ollama run llama3:70b` |
| Phi 3 Mini | 3.8B | 2.3GB | `ollama run phi3` |
| Phi 3 Medium | 14B | 7.9GB | `ollama run phi3:medium` |
- The default quantization offered by Ollama is INT4 (specifically `Q4_0`)¹.
- For example, for Phi 3 Mini consisting of 3.8B parameters, the math comes to \(\text{Memory} = 3.8\text{B} \times 0.5 \, \text{bytes} = 1.9 \, \text{GB}\) for storage.
- In the case of Phi 3 Mini in Ollama’s implementation, there could be additional memory occupied by the tokenizer, embeddings, metadata, and any additional layers or features included in the model - making it 2.3 GB when used inside Ollama.
Sources:
1. There are other quantization options offered by Ollama as well. Refer to the GitHub Ollama issue discussion here for an interesting debate.
2. Another quantization method: Dense-and-Sparse Quantization. Refer to the GitHub link and paper link.
3. For a better read on the math behind quantization, refer to the Introduction to Post Training Quantization Medium article.
4. Do try the Deeplearning.ai short courses centered around Quantization if you are interested in learning more.
IV. The Debate on the Compute Power Needed for the Models
Architecting LLM Applications in the Cloud
A Cloud-Agnostic View
Source: Author’s take | Open for debate
Key Interpretations:
- Typically, `Purpose-built Models` are deployed in dedicated instances.
- The Cloud providers have access to Foundation Models (some of which are even Open Source) which they provide to users as pre-built APIs.
- Typically, Serverless APIs scale with demand, but their performance cannot be guaranteed during sudden spikes.
- In those cases, `Provisioned Throughput` can guarantee that those higher demands are met without degradation in performance. Provisioned Throughput comes with a higher baseline cost compared to truly Serverless options.
- One nuance to note: fine-tuned LLMs mostly need `Provisioned Throughput`, even if the foundation model is Serverless. Imagine as if one of the nodes/instances saves your fine-tuned model and the `Provisioned Throughput` scales the contents of that node when demand increases.
Cloud Providers offering LLMs as APIs:
- Azure AI Foundation Models
- AWS Bedrock Foundation Models
- GCP Model Garden
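As an illustration of the "LLMs as pre-built APIs" option, here is a minimal sketch that calls a Bedrock foundation model through boto3’s Converse API. The model id, region, and prompt are assumptions for the example; Azure AI and GCP Model Garden expose comparable SDK/REST endpoints:

```python
# Minimal sketch: invoking a pre-built foundation-model API on AWS Bedrock.
# Assumes boto3 is installed, AWS credentials are configured, and the account
# has been granted access to the (example) model id below.
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")  # assumed region

response = bedrock.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # example model id
    messages=[{"role": "user", "content": [{"text": "Summarize RAG in one sentence."}]}],
    inferenceConfig={"maxTokens": 128, "temperature": 0.2},
)
print(response["output"]["message"]["content"][0]["text"])
```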
How are the Compute Environments stacked in AWS for LLM Inferencing
An AWS-specific View
Source: Author’s take | Open for debate
Key Interpretations:
- More DevOps/MLOps skills are needed for Dedicated Instances. Serverless eases that skill-set burden by helping with `AWS Managed Scaling`.
- Serverless options like `Lambda` and `Fargate` can work for Task-specific (Purpose-built Small models)^ and Small Language Models^ - refer to my Serverless attempts (a minimal Lambda sketch follows below).
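For the `Lambda` route mentioned above, a handler for a small quantized model could look roughly like the sketch below. Everything here is an assumption for illustration: the function is packaged as a container image bundling the llama-cpp-python package and a GGUF model at a hypothetical /opt/model.gguf path, with the Lambda memory setting sized to hold it:

```python
# Minimal sketch of a Lambda handler serving a small quantized (GGUF) model.
import json
from llama_cpp import Llama  # assumes llama-cpp-python is bundled in the image

# Load once per container, outside the handler, so warm invocations reuse the model.
llm = Llama(model_path="/opt/model.gguf", n_ctx=2048)  # hypothetical model path

def handler(event, context):
    prompt = event.get("prompt", "Hello")
    output = llm(prompt, max_tokens=128)
    return {
        "statusCode": 200,
        "body": json.dumps({"completion": output["choices"][0]["text"]}),
    }
```

Cold starts are the obvious trade-off here: the first invocation pays for loading the model into memory, which is why this route only makes sense for genuinely small models.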
The Marriage between Compute Environments and LLMs
Different Sized LLMs and their Compute Options in AWS
Source: Author’s take | Open for debate
Key Interpretations:
- The upper half of the diagram, comprising the different Cloud Deployment options, is mapped to the lower half, consisting of different-sized models.
- For smaller models:
  - Task-specific Models can work with Serverless in-house options like Serverless SageMaker Inference (there are limitations such as CPU-only and at most 6 GB memory | refer) and AWS Lambda (with up to 10 GB memory possible); a deployment sketch follows below.
- On the other end of the size spectrum:
  - Very Large Language Models are currently possible only as Serverless APIs across Cloud providers. In other words, we cannot host a GPT-4 model in our own cloud environment.
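As a concrete illustration of the Serverless SageMaker Inference option referenced above (the one constrained to CPU and at most 6 GB of memory), a deployment with the SageMaker Python SDK might look roughly like this sketch; the container image URI, model artifact path, and execution role are placeholders:

```python
# Minimal sketch: deploying a small, task-specific model to a Serverless
# SageMaker Inference endpoint. All <...> values are placeholders.
import sagemaker
from sagemaker.model import Model
from sagemaker.serverless import ServerlessInferenceConfig

model = Model(
    image_uri="<inference-container-image-uri>",
    model_data="s3://<bucket>/<prefix>/model.tar.gz",
    role="<sagemaker-execution-role-arn>",
    sagemaker_session=sagemaker.Session(),
)

predictor = model.deploy(
    serverless_inference_config=ServerlessInferenceConfig(
        memory_size_in_mb=6144,  # 6 GB is the current ceiling for serverless endpoints
        max_concurrency=5,
    ),
)
print(predictor.endpoint_name)
```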
V. Conclusion
- In this blog, we have seen the different types of tasks that a model addresses, and which of those tasks actually need Generative AI.
- We have also covered how LLMs, which come in various sizes, can be deployed in the Cloud.
Potential Next Steps for the Author (and even for the reader):
- It would be good to focus on the right-sized EC2/SageMaker instances for the different LLMs discussed above.
- For example, what are the minimum and recommended compute instance sizes for
- It would also be a good continuation of this blog to focus on `Efficient LLM Inferencing` options like the ones below:
# some popular LLM inference frameworks
- llama.cpp
- ollama
- mistral.rs
- vLLM
# some popular technologies to make machine learning models more portable and efficient
# across different hardware and software requirements
- ONNX (Open Neural Network Exchange) - an open format designed to represent machine learning models that provides interoperability between different ML frameworks like PyTorch and TensorFlow.
- GGUF (GPT-Generated Unified Format) - a binary file format for storing models used by GGML-based inference engines such as llama.cpp, particularly useful for smaller language models that can run effectively on CPUs with 4-8 bit quantization.
# some popular Efficient Machine learning frameworks or libraries
# designed to run ML models efficiently on mobile and edge devices
- PyTorch Mobile
- TensorFlow Lite
- Apple Core ML
- Windows DirectML
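As a small taste of the portability point above, here is a hedged sketch of exporting a toy PyTorch model to ONNX. The tiny network is a stand-in; exporting a real LLM would normally go through dedicated tooling (for example, Hugging Face Optimum), so treat this purely as an illustration of the format:

```python
# Minimal sketch: exporting a toy PyTorch model to ONNX for cross-framework use.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2)).eval()
dummy_input = torch.randn(1, 16)  # example input used to trace the graph

torch.onnx.export(
    model,
    dummy_input,
    "tiny_model.onnx",                      # output file
    input_names=["input"],
    output_names=["logits"],
    dynamic_axes={"input": {0: "batch"}},   # allow variable batch size at inference
)
print("exported tiny_model.onnx")
```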