LLM – Neohope的网络笔记

大模型公司的技术能力分级

Posted on 2024/06/30 by neohope — No Comments ↓

1、研发大模型算法，要有强大的科研团队
比如Transformer、StableDiffusion

2、在上述理论上，改进并模型结构，并提供预训练模型，要有海量算力和海量优质数据
比如ChatGPT、Llamma、千问、Kimi，也包括一些采用“知识蒸馏”技术的公司

3、自有大模型，简单问题自有模型解决，复杂问题集成外部模型功能
比如苹果

4、在预训练大模型上进行调优，并辅助RAG技术，要有算力和大量行业优质数据
比如保险行业大模型、健康行业大模型，华为盘古大模型做的就是这个生意

5、直接使用多个外部大模型，进行能力整合
比如Perplexity

6、直接使用预训练大模型，并进行RAG调优，需要有行业数据积累
各类行业垂直“大模型”

7、直接使用外部大模型，优化提示词，声称自己有大模型能力
比如各类套壳公司

8、直接使用国外大模型，进行转发
比如各类转发网站

9、根本没用大模型技术，直接包装原有功能，四处忽悠
比如各类噱头公司

PS：
其实还有几类公司，类似于美国淘金时代，卖铲子、卖水、卖牛仔裤的公司：
1、提供硬件的公司，尤其是GPU制造厂商
2、提供GPU算力的公司
3、主要从事大模型培训，不管上面几类公司是否赚钱了，这些培训公司可真赚钱了

Apple Intelligence三层模型结构

Posted on 2024/06/17 by neohope — No Comments ↓

苹果在AI上很久没有实质性进展了：
Siri多年没有进步，停止了造车项目，解散了部分AI团队。
虽然陆续低调的进行了一些AI公司收购，但没有什么可称道的成果，实在算不上有什么进展。

今年WWDC上，终于发布了AI相关的内容，一如既往的“重新定义”了AI的概念：发明了一个新词Apple Intelligence，缩写还是AI。

咱们仔细看一下这个Apple Intelligence，还是动了一些脑筋的，整体架构分了三层：
1、首先是在移动设备端，运行了一个30亿参数的小模型，处理一些简单的任务（苹果自研芯片，让小模型可以在功耗可控的情况下，及时响应这些请求）
2、如果本地模型无法处理，就将请求发送到是云端，通过苹果自己的大模型，响应用户请求
3、如果任务太复杂，苹果自家模型处理不好，则将请求发送到合作伙伴提供的大模型，比如GPT-4o等，合作伙伴会不断增加
当然，对于用户的授权，和数据隐私保护，还是做了不少工作的

这样乍一看，好像没有什么吗，就是集成了多个模型。但咱们加上一个事实后，这个事情就不这么简单了：
苹果对自己的操作系统完全可控，就让本地模型可以获取比竞争对手高的多的权限。
苹果自家模型，可以读邮件、可以看日程、可以访问通讯记录、可以查看网页浏览记录，可以搜集全部图像。。。
也就是说，苹果的自家模型，可以高效收集客户设备上所有信息。
同样的，苹果自家模型，可以调用用户设备全部的功能，包括第三方APP的功能。
通过整合这些信息，就可以让苹果自家模型，吊打全部竞争对手。

细思极恐，在移动小模型上，在IOS设备上，几乎已经没有了任何生存空间。
如果Google也在安卓上，部署自己的小模型，那安卓设备上的机会，也就不存在了。
无论Google如何选择，国内厂商必然快速跟进，那手机小模型这个赛道很快就不存在了。
而第三方的移动小模型和应用，无论如何努力，由于无法控制操作系统底层，几乎不可能形成任何竞争优势，几乎必然出局。

可以看下，现在国内大模型赛道整体太卷了，小厂商几乎没有机会：
1、大模型的研发、训练，需要大量的资金、人员、算力、数据的投入，小厂玩不起，大厂不赚钱
2、开源大模型的性能，比闭源大模型并不差太多，而且也在疯狂迭代，没有商业模式，更没有资本愿意长期投入，小厂更玩不起
3、小厂在垂直赛道可能会有些机会，但如果市场足够大，被大厂嗅到，没有赚钱途径的大厂一定会下场卷死你
4、移动端小模型，上面也说了，没有操作系统权限，小厂几乎没有机会了
5、在APP创新上，国内互联网流量过于集中，应用开发出来只能依附于几个大流量平台。这些平台不会允许某几个应用过热，而且在有了热度后，大厂还会无良的抄小厂的作业，让某类APP瞬间消失

所以很可惜，虽然大家都知道大模型是个好东西。但国内环境太卷了：
没有给小厂的生态位，没有好的生态
就不会有大量的创新，后面难以出现百花齐放的场景
到头来，还是要等别人创新后，大厂去抄？
大家都懂，但停不下来。
卷来卷去，难有赢家。

好像扯远了。。。
其实，对于苹果，其实还有两个事情做的挺到位的
1、将prompt屏蔽了，让普通人可以更便捷的使用AI
2、再次发挥，强大的整合能力，提前抢占了移动AI的入口

当然，对于个人来说，用好大模型，提高自己获取知识的速度，提升自己的认知圈，扩展自己的能力边界，还是很重要的。

将被大模型+机器人严重冲击的行业

Posted on 2024/05/26 by neohope — No Comments ↓

这里说的冲击严重，指的是可能导致从业人员大规模失业，而不是单纯的提升工作效率。
现在看起来，下面的部分行业从业人员，会受到较大冲击：

文字处理
1、客服人员（聊天机器人、语音机器人）
2、翻译人员（普通文件翻译）
3、文员（部分工作机会会被替代）
4、内容审核人员
5、内容创作人员（新闻转发、内容创作）
6、部分开发人员（部分代码编写人员）
7、部分法律从业者（文档整理、案例分析、合同审查）
8、部分保险从业者（部分业务员、部分核保任务）
9、部分财务人员（部分财务审计任务）

自动驾驶
1、网约车驾驶员
2、长途运输司机
3、物流人员（自动配送）

产业自动化
1、流水线工人（机器人）
2、仓库管理（无人仓储）
3、养殖人员
4、农业人员

NEOHOPE大模型发展趋势预测2405

Posted on 2024/05/26 by neohope — No Comments ↓

NEOHOPE大模型发展趋势预测2405
1、规模化遇到瓶颈，资源陷阱效应愈发明显，GPT5未能按时发布，估计遇到不小的技术问题
2、垂直化趋势明显，完全通用的大模型投产比不高，而垂直化的大模型可以在一定领域内保障效果的情况下，有效降低模型训练及推理成本，
3、移动化趋势明显，以苹果为首的各厂商在努力缩减模型规模，努力提升设备推理性能，通过大模型赋能移动终端
4、具身化初现效果，无论是人形机器人，还是机器人训练，效果显著
5、多模态大模型投产低，远不如多个模态的模型整合
6、部分整合类应用已经可以赚钱，比如Perplexity等
7、下半年没有盈利能力的大模型厂商财务压力会很大
8、美国对外大模型技术封锁会更加严格

一线厂商【主观】：
1、国外闭源：ChatGPT、Gemini、Claude、Mistral
2、国外开源：Llama3
3、国内闭源：月之暗面Kimi、质谱清言ChatGLM
4、国内开源：阿里通义千问

PS：
补充其他几个不错的模型
1、绘画方向，Midjourney，SD
2、视频生成，Sora
3、文字转音频，ChatTTS

英伟达也有几个不错的模型平台
1、药物研发，BioNeMo
2、基因分析，Parabricks
3、医学影像，MONAI

qwen.cpp简明教程

Posted on 2024/02/25 by neohope — No Comments ↓

1、下载并编译qwen.cpp

git clone --recursive https://github.com/QwenLM/qwen.cpp
cd qwen.cpp
cmake -B build
cmake -B build -DGGML_OPENBLAS=ON
cmake -B build -DGGML_CUBLAS=ON
cmake --build build -j --config Release

2、下载模型，转化为ggml格式

#从hf下载模型，下载完成后，本地地址为 ~/.cache/huggingface/hub/模型名称
#部分代码文件会有缺失，可以到hf上对比下载
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat",trust_remote_code=True)

#模型转化为ggml格式
#同时进行量化，降低资源需求
python3 qwen_cpp/convert.py -i PATH_TO_MODEL -t q4_0 -o qwen7b-q40-ggml.bin

3、运行模型

./build/bin/main -m qwen7b-q40-ggml.bin --tiktoken PATH_TO_MODEL/qwen.tiktoken -i

chatglm.cpp简明教程

Posted on 2024/02/25 by neohope — No Comments ↓

1、下载并编译chatglm.cpp

git clone --recursive https://github.com/li-plus/chatglm.cpp.git
cd chatglm.cpp
git submodule update --init --recursive
#cmake -B build
cmake -B build -DGGML_OPENBLAS=ON
#cmake -B build -DGGML_CUBLAS=ON
cmake --build build -j --config Release

2、下载模型，转化为ggml格式

#从hf下载模型，下载完成后，本地地址为 ~/.cache/huggingface/hub/模型名称
#部分代码文件会有缺失，可以到hf上对比下载
from transformers import AutoModel
model = AutoModel.from_pretrained("THUDM/chatglm-6b",trust_remote_code=True)

#模型转化为ggml格式
#同时进行量化，降低资源需求
pip install torch tabulate tqdm transformers accelerate sentencepiece
python3 chatglm_cpp/convert.py -i PATH_TO_MODEL -t q4_0 -o chatglm-6b-q40-ggml.bin

3、运行模型

./build/bin/main -m chatglm-6b-q40-ggml.bin -i

4、常见问题

#下面的错误，是transformers版本太高导致
AttributeError: 'ChatGLMTokenizer' object has no attribute 'sp_tokenizer'. Did you mean: '_tokenize'?
#需要降低transformers版本
pip uninstall transformers
pip install transformers==4.33.2

大语言模型资料汇总

Posted on 2024/02/16 by neohope — No Comments ↓

一、之前整理了一些大模型的Demo，汇总如下
1、ChatGPT
https://github.com/neohope/NeoDemosChatGPT

2、Llama2
https://github.com/neohope/NeoDemosLlama2
可同步看一下中文版Llama2
https://github.com/ymcui/Chinese-LLaMA-Alpaca-2

3、阿里千问
https://github.com/neohope/NeoDemosQwen

4、清华ChatGLM
https://github.com/neohope/NeoDemosChatGLM

二、建议看一下llama.cpp
1、llama.cpp
https://github.com/ggerganov/llama.cpp

2、python的llama.cpp封装
https://github.com/abetlen/llama-cpp-python

3、千问的qwen.cpp实现
https://github.com/QwenLM/qwen.cpp

4、ChatGLM的chatglm.cpp实现
https://github.com/li-plus/chatglm.cpp

三、还有量化
https://github.com/AutoGPTQ/AutoGPTQ

四、当然还有langchain
https://github.com/langchain-ai/langchain

五、如果有余力，看一下Transformer实现
https://github.com/huggingface/transformers

llama.cpp简要教程

Posted on 2024/02/14 by neohope — No Comments ↓

1、下载并编译llama.cpp

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

2、下载llama-2-7b-chat
a、可以从fb或hf下载
b、可以使用脚本下载工具，比如llama-dl
c、可以使用Chinese-LLaMA-2-7B
d、可以使用其他三方源

3、模型转换为ggml格式

python3 convert.py ../llama/llama-2-7b-chat/ 
Loading model file ../llama/llama-2-7b-chat/consolidated.00.pth
params = Params(n_vocab=32000, n_embd=4096, n_layer=32, n_ctx=2048, n_ff=11008, n_head=32, n_head_kv=32, n_experts=None, n_experts_used=None, f_norm_eps=1e-06, rope_scaling_type=None, f_rope_freq_base=None, f_rope_scale=None, n_orig_ctx=None, rope_finetuned=None, ftype=None, path_model=PosixPath('../llama/llama-2-7b-chat'))
Found vocab files: {'tokenizer.model': PosixPath('../llama/tokenizer.model'), 'vocab.json': None, 'tokenizer.json': None}
Loading vocab file '../llama/tokenizer.model', type 'spm'
Vocab info: <SentencePieceVocab with 32000 base tokens and 0 added tokens>
Special vocab info: <SpecialVocab with 0 merges, special tokens unset, add special tokens unset>
tok_embeddings.weight                            -> token_embd.weight                        | BF16   | [32000, 4096]
norm.weight                                      -> output_norm.weight                       | BF16   | [4096]
output.weight                                    -> output.weight                            | BF16   | [32000, 4096]
layers.0.attention.wq.weight                     -> blk.0.attn_q.weight                      | BF16   | [4096, 4096]
...
layers.31.ffn_norm.weight                        -> blk.31.ffn_norm.weight                   | BF16   | [4096]
skipping tensor rope_freqs
Writing ../llama/llama-2-7b-chat/ggml-model-f16.gguf, format 1
Ignoring added_tokens.json since model matches vocab size without it.
gguf: This GGUF file is for Little Endian only
[  1/291] Writing tensor token_embd.weight                      | size  32000 x   4096  | type F16  | T+   3
...
[291/291] Writing tensor blk.31.ffn_norm.weight                 | size   4096           | type F32  | T+ 314
Wrote ../llama/llama-2-7b-chat/ggml-model-f16.gguf

4、模型量化，减少资源使用

./quantize ../llama/llama-2-7b-chat/ggml-model-f16.gguf  ../llama/llama-2-7b-chat/ggml-model-f16-q4_0.gguf q4_0 
main: build = 2060 (5ed26e1f)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: quantizing '../llama/llama-2-7b-chat/ggml-model-f16.gguf' to '../llama/llama-2-7b-chat/ggml-model-f16-q4_0.gguf' as Q4_0
llama_model_loader: loaded meta data with 15 key-value pairs and 291 tensors from ../llama/llama-2-7b-chat/ggml-model-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = llama
llama_model_loader: - kv   2:                       llama.context_length u32              = 2048
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  10:                          general.file_type u32              = 1
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type  f16:  226 tensors
llama_model_quantize_internal: meta size = 740928 bytes
[   1/ 291]                    token_embd.weight - [ 4096, 32000,     1,     1], type =    f16, quantizing to q4_0 .. size =   250.00 MiB ->    70.31 MiB | hist: 0.037 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.116 0.111 0.096 0.077 0.057 0.039 0.025 0.021 
...   
[ 291/ 291]               blk.31.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
llama_model_quantize_internal: model size  = 12853.02 MB
llama_model_quantize_internal: quant size  =  3647.87 MB
llama_model_quantize_internal: hist: 0.036 0.015 0.025 0.039 0.056 0.076 0.096 0.112 0.118 0.112 0.096 0.077 0.056 0.039 0.025 0.021 
main: quantize time = 323302.84 ms
main:    total time = 323302.84 ms

5、使用模型

./main -m ../llama/llama-2-7b-chat/ggml-model-f16-q4_0.gguf -n 256 --repeat_penalty 1.0 --color -ins