# Basic Demo
In this demo, you will experience how to use the GLM-4-9B open-source model to perform basic tasks.

Please follow the steps in this document closely to avoid unnecessary errors.
## Device and dependency check
### Related inference test data
**The data in this document were measured in the hardware environment below. Actual runtime requirements and GPU memory usage in your own environment may differ slightly; refer to your actual operating environment.**
Test hardware information:
+ OS: Ubuntu 22.04
+ Memory: 512GB
+ Python: 3.10.12 (recommended); 3.12.3 has also been tested
+ CUDA Version: 12.3
+ GPU Driver: 535.104.05
+ GPU: NVIDIA A100-SXM4-80GB * 8
The inference stress test results are as follows:

**All tests were performed on a single GPU; all GPU memory figures are peak values.**
### GLM-4-9B-Chat
| Dtype | GPU Memory | Prefilling | Decode Speed | Remarks |
|-------|------------|------------|---------------|------------------------|
| BF16 | 19 GB | 0.2s | 27.8 tokens/s | Input length is 1000 |
| BF16 | 21 GB | 0.8s | 31.8 tokens/s | Input length is 8000 |
| BF16 | 28 GB | 4.3s | 14.4 tokens/s | Input length is 32000 |
| BF16  | 58 GB      | 38.1s      | 3.4 tokens/s  | Input length is 128000 |

| Dtype | GPU Memory | Prefilling | Decode Speed  | Remarks               |
|-------|------------|------------|---------------|-----------------------|
| INT4 | 8 GB | 0.2s | 23.3 tokens/s | Input length is 1000 |
| INT4 | 10 GB | 0.8s | 23.4 tokens/s | Input length is 8000 |
| INT4  | 17 GB      | 4.3s       | 14.6 tokens/s | Input length is 32000 |

### GLM-4-9B-Chat-1M

| Dtype | GPU Memory | Prefilling | Decode Speed    | Remarks                |
|-------|------------|------------|-----------------|------------------------|
| BF16  | 74497 MiB  | 98.4s      | 2.3653 tokens/s | Input length is 200000 |
If your input exceeds 200K tokens, we recommend using the vLLM backend with multiple GPUs for inference to get better performance.

### GLM-4V-9B
| Dtype | GPU Memory | Prefilling | Decode Speed | Remarks |
|-------|------------|------------|---------------|----------------------|
| BF16 | 28 GB | 0.1s | 33.4 tokens/s | Input length is 1000 |
| BF16  | 33 GB      | 0.7s       | 39.2 tokens/s | Input length is 8000 |

| Dtype | GPU Memory | Prefilling | Decode Speed  | Remarks              |
|-------|------------|------------|---------------|----------------------|
| INT4 | 10 GB | 0.1s | 28.7 tokens/s | Input length is 1000 |
| INT4  | 15 GB      | 0.8s       | 24.2 tokens/s | Input length is 8000 |

### Minimum hardware requirements
If you want to run the most basic official code (transformers backend), you need:

+ Python >= 3.10
+ At least 32 GB of memory

If you want to run all the official code in this folder, you also need:

+ A Linux operating system (Debian-based distributions work best)
+ A GPU with more than 8 GB of memory that supports CUDA or ROCm and `BF16` inference (`FP16` precision cannot be fine-tuned, and there is a small probability of problems during inference)
Install dependencies:
```shell
pip install -r requirements.txt
```
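After installing the dependencies, you can quickly check whether your device matches the requirements above. A minimal sketch using PyTorch (assuming `torch` is available in your environment, e.g. via `requirements.txt`):

```python
import torch

# Report the CUDA setup and per-GPU memory so the numbers can be compared
# against the requirements and stress-test tables above.
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"GPU {i}: {props.name}, {props.total_memory / 1024 ** 3:.1f} GB total memory")
```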
## Basic function calls
**Unless otherwise specified, the demos in this folder do not support advanced usage such as Function Call and All Tools.**
### Use the transformers backend
+ Use the command line to communicate with the GLM-4-9B model.
```shell
python trans_cli_demo.py         # GLM-4-9B-Chat
python trans_cli_vision_demo.py  # GLM-4V-9B
```
+ Use the Gradio web client to communicate with the GLM-4-9B model.
```shell
python trans_web_demo.py         # GLM-4-9B-Chat
python trans_web_vision_demo.py  # GLM-4V-9B
```
+ Use batch inference.
```shell
python trans_batch_demo.py
```
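All of the demos above load the model through the standard `transformers` API. As a rough sketch of the underlying pattern (the model path and generation settings here are illustrative and may differ from what the demo scripts actually use):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "THUDM/glm-4-9b-chat"  # or a local checkpoint directory

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,  # BF16, matching the tables above
    trust_remote_code=True,
    device_map="auto",
).eval()

# Build a chat-formatted prompt and generate a reply.
messages = [{"role": "user", "content": "Hello, who are you?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))
```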
### Use the vLLM backend
+ Use the command line to communicate with the GLM-4-9B-Chat model.
```shell
python vllm_cli_demo.py
```
+ Use LoRA adapters with vLLM on the GLM-4-9B-Chat model.
```python
# In vllm_cli_demo.py, add or modify the LoRA adapter path:
LORA_PATH = ''  # set this to the directory of your LoRA adapter
```
+ Build the server yourself and use the `OpenAI API` request format to communicate with the GLM-4-9B model. This demo supports Function Call and All Tools.
+ Modify `MODEL_PATH` in `openai_api_server.py` to choose whether to serve GLM-4-9B-Chat or GLM-4V-9B.
Start the server:
```shell
python openai_api_server.py
```
Client request:
```shell
python openai_api_request.py
```
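If you want to write your own client instead of using `openai_api_request.py`, a minimal sketch with the official `openai` Python package could look like this (the base URL, API key, and model name below are assumptions; check the values actually used in `openai_api_server.py` and `openai_api_request.py`):

```python
from openai import OpenAI

# Point the client at the locally started OpenAI-compatible server.
# Adjust base_url and model to match your openai_api_server.py configuration.
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="glm-4",
    messages=[{"role": "user", "content": "Hello, please introduce yourself."}],
    max_tokens=256,
    temperature=0.8,
)
print(response.choices[0].message.content)
```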
## Stress test
Users can run this script to test the generation speed of the model with the transformers backend on their own devices:
```shell
python trans_stress_test.py
```
## Use Ascend cards to run the code

Users can also run the above code on Ascend hardware. They only need to replace `transformers` with `openmind` and change the `cuda` device to `npu`:
```python
# from transformers import AutoModelForCausalLM, AutoTokenizer
from openmind import AutoModelForCausalLM, AutoTokenizer

# device = 'cuda'
device = 'npu'
```