Fine-tuning an OpenAI model for a classification task roughly involves the following steps:
- Prepare the data: first preprocess the text yourself, then run it through the OpenAI data-preparation tool for a second pass
- Fine-tune the model: upload the prepared data and specify a pretrained base model to fine-tune
- Use the model: run sentiment classification with the fine-tuned model
Doc: https://platform.openai.com/docs/api-reference/fine-tunes
- Prepare the Data {#title-0} ==================
import pandas as pd

def prepare_data():
    data = pd.read_csv('data/comments.csv')
    # Take the first 10 examples of each class to build a small demo set
    class_0_data = data[data['label'] == 0]
    class_1_data = data[data['label'] == 1]
    new_data = pd.concat([class_0_data[:10], class_1_data[:10]])
    new_data.index = pd.Series(list(range(20)))
    # Replace the numeric labels with the text the model should generate
    # (消极 = negative, 积极 = positive)
    new_data['label'] = pd.Series(['消极'] * 10 + ['积极'] * 10)
    # Rename the columns to the completion/prompt names fine-tuning expects
    new_data.columns = ['completion', 'prompt']
    new_data.to_json('data/fine_tuning.json', orient='records', lines=True)

if __name__ == '__main__':
    prepare_data()
The original data (data/comments.csv) looks like this:
label review
0 1 距离川沙公路较近,但是公交指示不对,如果是"蔡陆线"的话,会非常麻烦.建议用别的路线.房间较...
1 1 商务大床房,房间很大,床有2M宽,整体感觉经济实惠不错!
2 1 早餐太差,无论去多少人,那边也不加食品的。酒店应该重视一下这个问题了。房间本身很好。
3 1 宾馆在小街道上,不大好找,但还好北京热心同胞很多~宾馆设施跟介绍的差不多,房间很小,确实挺小...
4 1 CBD中心,周围没什么店铺,说5星有点勉强.不知道为什么卫生间没有电吹风
... ... ...
7761 0 尼斯酒店的几大特点:噪音大、环境差、配置低、服务效率低。如:1、隔壁歌厅的声音闹至午夜3点许...
7762 0 盐城来了很多次,第一次住盐阜宾馆,我的确很失望整个墙壁黑咕隆咚的,好像被烟熏过一样家具非常的...
7763 0 看照片觉得还挺不错的,又是4星级的,但入住以后除了后悔没有别的,房间挺大但空空的,早餐是有但...
7764 0 我们去盐城的时候那里的最低气温只有4度,晚上冷得要死,居然还不开空调,投诉到酒店客房部,得到...
7765 0 说实在的我很失望,之前看了其他人的点评后觉得还可以才去的,结果让我们大跌眼镜。我想这家酒店以...
[7766 rows x 2 columns]
After processing with the prepare_data function, the content becomes:
completion prompt
0 消极 标准间太差房间还不如3星的而且设施非常陈旧.建议酒店把老的标准间从新改善.
1 消极 服务态度极其差,前台接待好象没有受过培训,连基本的礼貌都不懂,竟然同时接待几个客人;大堂副理...
2 消极 地理位置还不错,到哪里都比较方便,但是服务不象是豪生集团管理的,比较差。下午睡了一觉并洗了一...
3 消极 1。我住的是靠马路的标准间。房间内设施简陋,并且的房间玻璃窗户外还有一层幕墙玻璃,而且不能打...
4 消极 我这次是第5次住在长春的雁鸣湖大酒店。昨晚夜里停电。深夜我睡着了。我的钱包被内贼进入我的房间...
5 消极 前台checkin花了20分钟,checkout25分钟,这是服务态度和没有做到位。信用卡刷...
6 消极 有或者很少房!梯部不吸,但是有一些吸者仍然有服!我是不抽的人,成二手的受害者!(中13人口中...
7 消极 酒店服务态度极差,设施很差,建议还是不要到那儿去。
8 消极 我3.6预定好的180的标间,当我到的时候竟然说有会议房间满了,我订的房间没有了,太不讲信誉...
9 消极 房间的环境非常差,而且房间还不隔音,住的不舒服。
10 积极 距离川沙公路较近,但是公交指示不对,如果是"蔡陆线"的话,会非常麻烦.建议用别的路线.房间较...
11 积极 商务大床房,房间很大,床有2M宽,整体感觉经济实惠不错!
12 积极 早餐太差,无论去多少人,那边也不加食品的。酒店应该重视一下这个问题了。房间本身很好。
13 积极 宾馆在小街道上,不大好找,但还好北京热心同胞很多~宾馆设施跟介绍的差不多,房间很小,确实挺小...
14 积极 CBD中心,周围没什么店铺,说5星有点勉强.不知道为什么卫生间没有电吹风
15 积极 总的来说,这样的酒店配这样的价格还算可以,希望他赶快装修,给我的客人留些好的印象
16 积极 价格比比较不错的酒店。这次免费升级了,感谢前台服务员。房子还好,地毯是新的,比上次的好些。早...
17 积极 不错,在同等档次酒店中应该是值得推荐的!
18 积极 入住丽晶,感觉很好。因为是新酒店,的确有淡淡的油漆味,房间内较新。房间大小合适,卫生间设备齐...
19 积极 1。酒店比较新,装潢和设施还不错,只是房间有些油漆味。2。早餐还可以,只是品种不是很多。3。...
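The fine_tuning.json written by prepare_data is in JSON Lines format: one {"completion": ..., "prompt": ...} object per line. As a minimal, self-contained sketch of that format, the snippet below round-trips two illustrative rows through a throwaway file name (the file name and rows are only examples, not part of the pipeline above):

```python
import pandas as pd

# Two illustrative rows in the same shape prepare_data produces.
df = pd.DataFrame({
    'completion': ['消极', '积极'],
    'prompt': ['房间的环境非常差,而且房间还不隔音,住的不舒服。',
               '不错,在同等档次酒店中应该是值得推荐的!'],
})
# orient='records', lines=True writes one JSON object per line (JSON Lines);
# force_ascii=False keeps the Chinese text readable instead of \u escapes.
df.to_json('fine_tuning_sample.json', orient='records', lines=True, force_ascii=False)

with open('fine_tuning_sample.json', encoding='utf-8') as f:
    lines = f.read().splitlines()
print(lines[0])
```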
Next, use OpenAI's data-preparation tool to convert the data above into a format suitable for fine-tuning. cd into the directory where prepare_data wrote the data and run:
openai tools fine_tunes.prepare_data -f fine_tuning.json -q
This produces the following output:
Analyzing...
- Your JSON file appears to be in a JSONL format. Your file will be converted to JSONL format
- Your file contains 20 prompt-completion pairs. In general, we recommend having at least a few hundred examples. We've found that performance tends to linearly increase for every doubling of the number of examples
- Based on your data it seems like you're trying to fine-tune a model for classification
- For classification, we recommend you try one of the faster and cheaper models, such as `ada`
- For classification, you can estimate the expected model performance by keeping a held out dataset, which is not used for training
- Your data does not contain a common separator at the end of your prompts. Having a separator string appended to the end of the prompt makes it clearer to the fine-tuned model where the completion should begin. See https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset for more detail and examples. If you intend to do open-ended generation, then you should leave the prompts empty
- The completion should start with a whitespace character (` `). This tends to produce better results due to the tokenization we use. See https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset for more details
Based on the analysis we will perform the following actions:
- [Necessary] Your format `JSON` will be converted to `JSONL`
- [Recommended] Add a suffix separator ` ->` to all prompts [Y/n]: Y
- [Recommended] Add a whitespace character to the beginning of the completion [Y/n]: Y
- [Recommended] Would you like to split into training and validation set? [Y/n]: Y
Your data will be written to a new JSONL file. Proceed [Y/n]: Y
Wrote modified files to `fine_tuning_prepared_train.jsonl` and `fine_tuning_prepared_valid.jsonl`
Feel free to take a look!
Now use that file when fine-tuning:
> openai api fine_tunes.create -t "fine_tuning_prepared_train.jsonl" -v "fine_tuning_prepared_valid.jsonl" --compute_classification_metrics --classification_positive_class " 消极"
After you’ve fine-tuned a model, remember that your prompt has to end with the indicator string ` ->` for the model to start generating completions, rather than continuing with the prompt. Make sure to include `stop=["极"]` so that the generated texts ends at the expected place.
Once your model starts training, it'll approximately take 2.81 minutes to train a `curie` model, and less for `ada` and `babbage`. Queue will approximately take half an hour per job ahead of you.
The output above makes a few key points:
- "Your file contains 20 prompt-completion pairs. In general, we recommend having at least a few hundred examples. We've found that performance tends to linearly increase for every doubling of the number of examples." Our fine-tuning set is too small: the log recommends at least a few hundred examples, and notes that performance improves as the number of training examples grows.
- "Based on your data it seems like you're trying to fine-tune a model for classification. For classification, we recommend you try one of the faster and cheaper models, such as `ada`." The tool has inferred from our data that this is a classification task, and recommends using the faster and cheaper ada model as the base model for fine-tuning.
- "Your data does not contain a common separator at the end of your prompts. Having a separator string appended to the end of the prompt makes it clearer to the fine-tuned model where the completion should begin." Our training data has no separator after each prompt; such a separator tells the model where the completion should begin. The tool will therefore append a special marker to each prompt. The exact marker may differ between openai tool versions, so just check the generated dataset to see what was used.
Finally, the openai data-preparation tool performs the following actions:
- [Necessary] Your format `JSON` will be converted to `JSONL`
- [Recommended] Add a suffix separator ` ->` to all prompts [Y/n]: Y (append " ->" after each prompt as the separator)
- [Recommended] Add a whitespace character to the beginning of the completion [Y/n]: Y (prepend a space to each completion to set it apart from the prompt)
- [Recommended] Would you like to split into training and validation set? [Y/n]: Y (split the data into a training set and a validation set)
In the end, two files are generated in the data directory:
- fine_tuning_prepared_train.jsonl: the training set
- fine_tuning_prepared_valid.jsonl: the validation set
Let's look at the contents of the validation set:
{"prompt":"有或者很少房!梯部不吸,但是有一些吸者仍然有服!我是不抽的人,成二手的受害者!(中13人口中,民只有3.2.不到1\/4!!!)看到的民,自好?. ->","completion":" 消极"}
{"prompt":"酒店服务态度极差,设施很差,建议还是不要到那儿去。 ->","completion":" 消极"}
{"prompt":"距离川沙公路较近,但是公交指示不对,如果是\"蔡陆线\"的话,会非常麻烦.建议用别的路线.房间较为简单. ->","completion":" 积极"}
{"prompt":"CBD中心,周围没什么店铺,说5星有点勉强.不知道为什么卫生间没有电吹风 ->","completion":" 积极"}
As you can see, a "->" marker has been appended to each prompt: it marks the end of the prompt and tells the model where to start generating the completion. A space has also been prepended to each completion to separate it from the prompt. Note that when we later use this model for classification, we must remember to append the "->" separator to each new input prompt.
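These two formatting rules (the " ->" suffix on prompts, the leading space on completions) can be verified with a few lines of Python. The records below are copied from the validation file shown above:

```python
import json

# Records copied from the prepared validation file.
lines = [
    '{"prompt":"酒店服务态度极差,设施很差,建议还是不要到那儿去。 ->","completion":" 消极"}',
    '{"prompt":"CBD中心,周围没什么店铺,说5星有点勉强.不知道为什么卫生间没有电吹风 ->","completion":" 积极"}',
]

for line in lines:
    record = json.loads(line)
    # Every prompt must end with the " ->" separator added by the tool.
    assert record['prompt'].endswith(' ->')
    # Every completion must start with a space, also added by the tool.
    assert record['completion'].startswith(' ')

print('all records formatted correctly')
```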
- Fine-tune the Model {#title-1} ==================
The output above already tells us which command to use for fine-tuning:
openai api fine_tunes.create \
    -t "fine_tuning_prepared_train.jsonl" \
    -v "fine_tuning_prepared_valid.jsonl" \
    --compute_classification_metrics \
    --classification_positive_class " 消极"
- -t: specifies the training dataset
- -v: specifies the validation dataset
- --compute_classification_metrics: compute classification metrics on the validation set
- --classification_positive_class: the label of the positive class in the dataset; here we change the generated " 消极" (negative) to " 积极" (positive)
The final command is as follows (this approach requires the OPENAI_API_KEY environment variable to be set):
openai api fine_tunes.create -t "fine_tuning_prepared_train.jsonl" -v "fine_tuning_prepared_valid.jsonl" --compute_classification_metrics --classification_positive_class " 积极"
Some other hyperparameters:
- model: the name of the base model to fine-tune. You can choose "ada", "babbage", "curie", or "davinci".
- n_epochs: defaults to 4.
- batch_size: defaults to ~0.2% of the number of examples in the training set, capped at 256. Larger batch sizes tend to work better for larger datasets.
- learning_rate_multiplier: defaults to 0.05, 0.1, or 0.2.
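As a rough illustration of the batch_size default, the sketch below assumes simple rounding with a floor of 1 (the exact rounding rule is not documented, so treat this as an approximation, not OpenAI's implementation):

```python
# Sketch of the documented default: ~0.2% of the training set size,
# capped at 256. The round() and the max(1, ...) floor are assumptions.
def default_batch_size(n_train: int) -> int:
    return max(1, min(256, round(0.002 * n_train)))

print(default_batch_size(16))      # our 16-example training set
print(default_batch_size(200000))  # a large dataset hits the 256 cap
```

For our 16-example training set this yields 1, which is consistent with the "batch_size": 1 shown in the fine-tune job output below.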
After running the command, the output is as follows:
Upload progress: 100%|███████████████████████████████████████████████████████████████████████████████████| 6.39k/6.39k [00:00<00:00, 7.65Mit/s]
Uploaded file from fine_tuning_prepared_train.jsonl: file-Euf0oLwGX1hx4nbHZu9o8LgC
Upload progress: 100%|███████████████████████████████████████████████████████████████████████████████████████| 636/636 [00:00<00:00, 1.06Mit/s]
Uploaded file from fine_tuning_prepared_valid.jsonl: file-gu3kbiOr7yCKGgN6YLsYg7fY
Created fine-tune: ft-7xhbzqbkIxUiCRBvsWEVjQyE
Streaming events until fine-tuning is complete...
(Ctrl-C will interrupt the stream, but not cancel the fine-tune)
[2023-04-09 17:55:48] Created fine-tune: ft-7xhbzqbkIxUiCRBvsWEVjQyE
[2023-04-09 17:55:59] Fine-tune costs $0.05
[2023-04-09 17:55:59] Fine-tune enqueued. Queue number: 0
[2023-04-09 17:56:01] Fine-tune started
Fine-tuning a model costs money; the output above shows that this training run cost $0.05. Use the following command to view our fine-tuned models:
openai api fine_tunes.list
The output is as follows:
{
"data": [
{
"created_at": 1681034148,
"fine_tuned_model": "curie:ft-personal-2023-04-09-09-57-44",
"hyperparams": {
"batch_size": 1,
"classification_positive_class": " \u79ef\u6781",
"compute_classification_metrics": true,
"learning_rate_multiplier": 0.1,
"n_epochs": 4,
"prompt_loss_weight": 0.01
},
"id": "ft-7xhbzqbkIxUiCRBvsWEVjQyE",
"model": "curie",
"object": "fine-tune",
"organization_id": "org-CYfr2zckOKWqMlBgQHaL8rzp",
"result_files": [
{
"bytes": 4214,
"created_at": 1681034264,
"filename": "compiled_results.csv",
"id": "file-9djdWGJeysmtpT7uYRpzuvkm",
"object": "file",
"purpose": "fine-tune-results",
"status": "processed",
"status_details": null
}
],
"status": "succeeded",
"training_files": [
{
"bytes": 6393,
"created_at": 1681034146,
"filename": "fine_tuning_prepared_train.jsonl",
"id": "file-Euf0oLwGX1hx4nbHZu9o8LgC",
"object": "file",
"purpose": "fine-tune",
"status": "processed",
"status_details": null
}
],
"updated_at": 1681034265,
"validation_files": [
{
"bytes": 636,
"created_at": 1681034148,
"filename": "fine_tuning_prepared_valid.jsonl",
"id": "file-gu3kbiOr7yCKGgN6YLsYg7fY",
"object": "file",
"purpose": "fine-tune",
"status": "processed",
"status_details": null
}
]
}
],
"object": "list"
}
From this output we can extract the following information:
- "fine_tuned_model": "curie:ft-personal-2023-04-09-09-57-44": the ID of our fine-tuned model
- "hyperparams": the hyperparameters used for fine-tuning
- "id": "ft-7xhbzqbkIxUiCRBvsWEVjQyE": the fine-tune job ID, used to retrieve information about the job
- "model": "curie": the base model we fine-tuned; curie is the default, and we could also have specified ada with -m when fine-tuning
- "status": "succeeded": the status of the fine-tune, here showing it completed successfully
- ...and so on
Use the following command to download the training results:
# -i specifies the fine-tune job id
openai api fine_tunes.results -i ft-7xhbzqbkIxUiCRBvsWEVjQyE > result.csv
This generates a result.csv file locally. Let's look at its contents; you can save the rows below to a csv file to view them:
step,elapsed_tokens,elapsed_examples,training_loss,training_sequence_accuracy,training_token_accuracy,validation_loss,validation_sequence_accuracy,validation_token_accuracy,classification/accuracy,classification/precision,classification/recall,classification/auroc,classification/auprc,classification/f1.0
1,137,1,0.11297712909037737,0.0,0.5,0.1965679380938286,0.0,0.5,,,,,,
2,1138,2,0.026734490502419668,0.0,0.5,,,,,,,,,
3,1547,3,0.041407643495439544,0.0,0.6666666666666666,,,,,,,,,
4,1604,4,0.20218953214144544,0.0,0.3333333333333333,,,,,,,,,
5,1701,5,0.11237559589742213,0.0,0.5,,,,,,,,,
6,1846,6,0.07346868902813872,0.0,0.5,,,,,,,,,
7,1975,7,0.07653379087575103,0.0,0.8333333333333334,,,,,,,,,
8,2992,8,0.021339937947021315,0.0,0.8333333333333334,,,,,,,,,
9,3241,9,0.03946435204677875,0.0,0.8333333333333334,0.11924629365356243,0.0,0.6666666666666666,,,,,,
10,3330,10,0.07196011489376213,0.0,0.6666666666666666,,,,,,,,,
11,3395,11,0.07033659532143474,0.0,0.8333333333333334,,,,,,,,,
12,3612,12,0.03595350411543043,0.0,0.8333333333333334,,,,,,,,,
13,3741,13,0.02119638757410091,1.0,1.0,,,,,,,,,
14,3830,14,0.05899414582442825,0.0,0.8333333333333334,,,,,,,,,
15,3887,15,0.05293363639834362,0.0,0.8333333333333334,,,,,,,,,
16,4272,16,0.01709485127953574,1.0,1.0,,,,,,,,,
17,5289,17,0.015280593367766031,1.0,1.0,0.03453893841861048,1.0,1.0,0.5,0.0,0.0,0.25,0.3333333333333333,0.0
18,6290,18,0.01659894965089464,0.0,0.8333333333333334,,,,,,,,,
19,6387,19,0.08227947945942217,0.0,0.8333333333333334,,,,,,,,,
20,6452,20,0.0673786170370725,0.0,0.8333333333333334,,,,,,,,,
21,6589,21,0.01852450909944652,1.0,1.0,,,,,,,,,
22,6838,22,0.01672595546186104,1.0,1.0,,,,,,,,,
23,6967,23,0.01842574403242967,1.0,1.0,,,,,,,,,
24,7352,24,0.018070253951559655,1.0,1.0,,,,,,,,,
25,7497,25,0.018238978561170335,0.0,0.8333333333333334,0.026812056690901857,0.0,0.8333333333333334,,,,,,
26,7554,26,0.018804501375033807,1.0,1.0,,,,,,,,,
27,7611,27,0.018643728179655778,1.0,1.0,,,,,,,,,
28,8020,28,0.021471951293555423,0.0,0.8333333333333334,,,,,,,,,
29,8149,29,0.01589545032287536,1.0,1.0,,,,,,,,,
30,8238,30,0.02560983169629191,1.0,1.0,,,,,,,,,
31,8327,31,0.016086113855492766,1.0,1.0,,,,,,,,,
32,8544,32,0.025426828228896272,0.0,0.8333333333333334,,,,,,,,,
33,8641,33,0.016373053608218033,1.0,1.0,0.02478175366484695,1.0,1.0,,,,,,
34,8698,34,0.012001360776212056,1.0,1.0,,,,,,,,,
35,8763,35,0.018112453049790214,1.0,1.0,,,,,,,,,
36,8980,36,0.02059483739298285,0.0,0.8333333333333334,,,,,,,,,
37,9109,37,0.020320957661593295,1.0,1.0,,,,,,,,,
38,10110,38,0.01485486638983113,1.0,1.0,,,,0.75,0.6666666666666666,1.0,0.75,0.7916666666666666,0.8
39,10167,39,0.016590197144024585,1.0,1.0,,,,,,,,,
40,10256,40,0.019145875603147972,1.0,1.0,,,,,,,,,
41,10393,41,0.01974262404774302,1.0,1.0,0.03868247642918245,1.0,1.0,,,,,,
42,10482,42,0.017745215463898456,1.0,1.0,,,,,,,,,
43,10867,43,0.01643800483749426,1.0,1.0,,,,,,,,,
44,11276,44,0.016559487510540653,1.0,1.0,,,,,,,,,
45,12293,45,0.014681880749237009,1.0,1.0,,,,,,,,,
46,12542,46,0.015384798510621219,1.0,1.0,,,,,,,,,
47,12687,47,0.013482770004832167,1.0,1.0,,,,,,,,,
48,12816,48,0.016251353808475387,1.0,1.0,,,,,,,,,
49,12945,49,0.015636948876930602,1.0,1.0,0.011417988476832727,1.0,1.0,,,,,,
50,13194,50,0.015154518923057117,1.0,1.0,,,,,,,,,
51,13603,51,0.01619545711472625,1.0,1.0,,,,,,,,,
52,13668,52,0.01764821404665394,1.0,1.0,,,,,,,,,
53,13757,53,0.017546420559014668,1.0,1.0,,,,,,,,,
54,14142,54,0.01629021499741482,1.0,1.0,,,,,,,,,
55,15159,55,0.014517653008841998,1.0,1.0,,,,0.5,0.0,0.0,0.75,0.7916666666666666,0.0
56,15376,56,0.015640235622392317,1.0,1.0,,,,,,,,,
57,15473,57,0.017093179406629962,1.0,1.0,0.037967130073333646,0.0,0.8333333333333334,,,,,,
58,15602,58,0.015698925190150044,1.0,1.0,,,,,,,,,
59,16603,59,0.014752162144404994,1.0,1.0,,,,,,,,,
60,16660,60,0.012103026345412493,1.0,1.0,,,,,,,,,
61,16805,61,0.013015271683916614,1.0,1.0,,,,,,,,,
62,16894,62,0.01824016018865998,1.0,1.0,,,,,,,,,
63,17031,63,0.01693082119428943,1.0,1.0,,,,,,,,,
64,17088,64,0.014343387357669275,1.0,1.0,,,,,,,,,
65,17145,65,0.014343387357669275,1.0,1.0,0.037712673740637276,0.0,0.8333333333333334,,,,,,
66,17362,66,0.015630679206655895,1.0,1.0,,,,0.5,0.0,0.0,0.75,0.7916666666666666,0.0
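The result.csv can also be inspected with pandas. The sketch below uses three rows copied from the output above so it is self-contained; with the real file you would simply call pd.read_csv('result.csv'). Note that the classification metrics are only filled in at certain validation steps, so those columns are mostly empty:

```python
import io
import pandas as pd

# Three rows copied from the result.csv shown above; with the real file
# you would use pd.read_csv('result.csv') instead.
csv_text = """step,elapsed_tokens,elapsed_examples,training_loss,training_sequence_accuracy,training_token_accuracy,validation_loss,validation_sequence_accuracy,validation_token_accuracy,classification/accuracy,classification/precision,classification/recall,classification/auroc,classification/auprc,classification/f1.0
1,137,1,0.11297712909037737,0.0,0.5,0.1965679380938286,0.0,0.5,,,,,,
38,10110,38,0.01485486638983113,1.0,1.0,,,,0.75,0.6666666666666666,1.0,0.75,0.7916666666666666,0.8
66,17362,66,0.015630679206655895,1.0,1.0,,,,0.5,0.0,0.0,0.75,0.7916666666666666,0.0
"""
df = pd.read_csv(io.StringIO(csv_text))

# Classification metrics are only logged every few steps; drop the
# empty rows before inspecting them.
metrics = df.dropna(subset=['classification/accuracy'])
print(metrics[['step', 'classification/accuracy', 'classification/f1.0']])
```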
- Use the Model {#title-2} ==================
Note: when submitting new input, don't forget to append "->" to the prompt. To call the fine-tuned model, you can either use the openai API directly or send requests with the requests module. Example code:
import requests
import json
import openai

def use_fine_tune_model_01():
    # Call the completions endpoint directly over HTTP with requests
    model = "curie:ft-personal-2023-04-09-09-57-44"
    sentence = '酒店服务态度极差,设施很差,建议还是不要到那儿去。 ->'
    request_url = 'https://api.openai.com/v1/completions'
    headers = {'Content-Type': 'application/json',
               'Authorization': 'Bearer ' + open('openai_api_key').read()}
    # temperature=0 makes the classification deterministic
    data = json.dumps({'model': model, 'prompt': sentence, 'max_tokens': 6, 'temperature': 0})
    response = requests.post(request_url, headers=headers, data=data)
    response = json.loads(response.text)
    print(response['choices'][0]['text'])

def use_fine_tune_model_02():
    # Call the same endpoint through the openai SDK
    openai.api_key = open('openai_api_key').read()
    model = "curie:ft-personal-2023-04-09-09-57-44"
    sentence = '酒店服务态度极差,设施很差,建议还是不要到那儿去。 ->'
    outputs = openai.Completion.create(model=model, prompt=sentence, max_tokens=6, temperature=0)
    print(outputs['choices'][0]['text'])

if __name__ == '__main__':
    use_fine_tune_model_01()
    use_fine_tune_model_02()
Program output:
消极
消极
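Since the completion comes back with the leading space that was added during data preparation, it is worth normalizing it before using the result. The parse_label helper below is a hypothetical convenience function, not part of the openai library:

```python
# Hypothetical helper: normalize the raw completion text into a label.
def parse_label(completion_text: str) -> str:
    label = completion_text.strip()
    if label not in ('消极', '积极'):
        raise ValueError(f'unexpected label: {label!r}')
    return label

# The model returns the label with a leading space, e.g. " 消极".
print(parse_label(' 消极'))
```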