OpenAI GPT For Python Developers



[TOC]

Preface

What are ChatGPT, GPT, GPT-3, DALL-E, and Codex?

Story of OpenAI

OpenAI is an artificial intelligence research laboratory consisting of the for-profit corporation OpenAI LP and its parent company, the non-profit OpenAI Inc.

  • Founded in 2015, with the founding vision of promoting and developing friendly AI in a way that benefits humanity as a whole.

  • Sam Altman, Elon Musk, Greg Brockman, Reid Hoffman, Jessica Livingston, Peter Thiel, Amazon

    Web Services (AWS), Infosys, and YC Research announced the formation of OpenAI and pledged

    over US$1 billion to the venture.

  • Y2016, released “Universe”, a software platform for measuring and training an AI’s general intelligence across the world’s supply of games, websites, and other applications.

  • Y2018, Elon Musk resigned from his board seat, but remained a donor.

  • Y2019, transitioned from non-profit to capped-profit (profit cap set to 100 times on any investment), The company distributed equity to its employees and partnered with Microsoft, which announced an investment package of US$1 billion into the company.

  • Y2020, GPT-3, a language model trained on hundreds of billions of words from the Internet.

  • Y2021, DALL-E, a deep-learning model that can generate digital images from natural language descriptions.

  • Y2022, ChatGPT was released, and it quickly became a global phenomenon.

ChatGPT

GPT stands for Generative Pre-trained Transformer. ChatGPT is built on top of OpenAI’s GPT-3 family of large language models.

Other projects using GPT-3 are:

  • GitHub Copilot (using the OpenAI Codex model, a descendant of GPT-3, fine-tuned for generating code)

  • Copy.ai and Jasper.ai (content generation for marketing purposes)

  • Drexel University (detection of early signs of Alzheimer’s disease)

  • Algolia (enhancing their search engine capabilities)

What can we do with GPT?

  1. By the end of your learning journey, you will have built applications such as:

    • A fine-tuned medical chatbot assistant

    • An intelligent coffee recommendation system

    • An intelligent conversational system with memory and context

    • An AI voice assistant like Alexa but smarter

    • A Chatbot assistant to help with Linux commands

    • A semantic search engine

    • A news category prediction system

    • An image recognition intelligent system (image to text)

    • An image generator (text to image)

    • and more!

  2. By reading this guide and following the examples, you will be able to:

    • Understand the different models available, and how and when to use each one.

    • Generate human-like text for various purposes, such as answering questions, creating content, and other creative uses.

    • Control the creativity of GPT models and adopt the best practices to generate high-quality text.

    • Transform and edit the text to perform translation, formatting, and other useful tasks.

    • Optimize the performance of GPT models using the various parameters and options such as suffix, max_tokens, temperature, top_p, n, stream, logprobs, echo, stop, presence_penalty, frequency_penalty, best_of, and others.

    • Stem, lemmatize, and reduce your bills when using the API

    • Understand context stuffing and chaining, and practice using advanced techniques

    • Understand text embedding and how companies such as Tesla and Notion are using it

    • Understand and implement semantic search and other advanced tools and concepts.

    • Create prediction algorithms and zero-shot techniques, and evaluate their accuracy.

    • Understand, practice, and improve few-shot learning.

    • Understand fine-tuning and leveraging its power to create your own models.

    • Understand and use the best practices to create your own models.

    • Practice training and classification techniques using GPT.

    • Create advanced fine-tuned models.

    • Use OpenAI Whisper and other tools to create intelligent voice assistants.

    • Implement image classification using OpenAI CLIP.

    • Generate and edit images using OpenAI DALL-E.

    • Draw inspiration from other images to create yours.

    • Reverse engineer images’ prompts from Stable Diffusion (image to text)

How Does GPT Work?

Deep Learning, Machine Learning and AI

Relationship diagram: AI, machine learning, and deep learning

  • Deep learning is a subset of machine learning that’s based on artificial neural networks. The learning process is deep because the structure of artificial neural networks consists of multiple input, output, and hidden layers. Each layer contains units that transform the input data into information that the next layer can use for a certain predictive task. Thanks to this structure, a machine can learn through its own data processing.
  • Generative AI is a subset of artificial intelligence that uses techniques (such as deep learning) to generate new content. For example, you can use generative AI to create images, text, or audio. These models leverage massive pre-trained knowledge to generate this content.

Artificial neural networks

Artificial neural networks are formed by layers of connected nodes. Deep learning models use neural networks that have a large number of layers.

The following are the most popular artificial neural network topologies and models (a minimal forward-pass sketch follows the list):

  • Feedforward neural network

  • Recurrent neural network (RNN)

  • Convolutional neural network (CNN)

  • Generative adversarial network (GAN)

  • Transformers
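
As a minimal illustration of the layered structure described above, here is a sketch of a single forward pass through a tiny feedforward network with one hidden layer. It assumes NumPy, and the weights are random placeholders (a real network would learn them through training):

import numpy as np

def relu(x):
    return np.maximum(0, x)

# A tiny feedforward network: 3 inputs -> 4 hidden units -> 2 outputs.
# The weights are random placeholders; a real network learns them.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 2)), np.zeros(2)

x = np.array([0.5, -1.2, 3.0])      # input features
hidden = relu(x @ W1 + b1)          # hidden layer transforms the input
output = hidden @ W2 + b2           # output layer produces the prediction
print(output)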

Transformers

Transformers are an artificial neural network architecture suited to problems involving sequences, such as text or time-series data. They consist of encoder and decoder layers. The encoder takes an input and maps it to a numerical representation containing information such as context. The decoder uses information from the encoder to produce an output, such as translated text. What makes transformers different from other architectures containing encoders and decoders are the attention sub-layers. Attention is the idea of focusing on specific parts of an input based on the importance of their context in relation to other inputs in a sequence. For example, when summarizing a news article, not all sentences are relevant to the main idea. By focusing on key words throughout the article, summarization can be done in a single sentence: the headline.

Transformers have been used to solve natural language processing problems such as translation, text generation, question answering, and text summarization.

Some well-known implementations of transformers are:

  • Bidirectional Encoder Representations from Transformers (BERT)
  • Generative Pre-trained Transformer 2 (GPT-2)
  • Generative Pre-trained Transformer 3 (GPT-3)

GPT (Generative Pre-trained Transformer)

GPT is a type of neural network called a transformer, which is specifically designed for natural language processing tasks. The architecture of a transformer is based on a series of self-attention mechanisms that allow the model to process input text in parallel and weigh the importance of each word or token based on its context.

Self-attention

Self-attention is a mechanism used in deep learning models for natural language processing (NLP) that allows a model to weigh the importance of different parts of a sentence or a number of sentences when making predictions. Part of the Transformer architecture, it enables a neural network to achieve a satisfactory degree of performance when it comes to NLP tasks.
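
As a rough illustration of the idea, here is a minimal sketch of scaled dot-product self-attention in NumPy. The dimensions and projection matrices are toy placeholders, not GPT's actual weights:

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8                  # 4 tokens, 8-dimensional embeddings
X = rng.normal(size=(seq_len, d_model))  # token embeddings (toy values)

# Learned projections in a real model; random placeholders here
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv

# Each token attends to every token; scores are scaled dot products
scores = Q @ K.T / np.sqrt(d_model)
weights = softmax(scores, axis=-1)       # rows sum to 1: attention weights
attended = weights @ V                   # weighted mix of value vectors
print(weights.round(2))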

How GPT generates text differently from other models

An example of using the Hugging Face transformers library to run GPT-2 inference:

from transformers import pipeline

generator = pipeline('text-generation', model='gpt2')
generator("Hello, I'm a language model", max_length=30, num_return_sequences=3)
## [{'generated_text': "Hello, I'm a language modeler. So while writing this, when I went out to meet my wife or come home she told me that my"},
## {'generated_text': "Hello, I'm a language modeler. I write and maintain software in Python. I love to code, and that includes coding things that require writing"},
## ...

By default, a model has no memory, this means that each input is processed independently, without any information being carried over from previous inputs. When GPT generates text, it doesn’t have any preconceived notions about what should come next based on previous inputs. Instead, it generates each word based on the probability of it being the next likely word given the previous input. This results in text that can be surprising and creative.


This is another example of code that uses a GPT model to generate text based on user input.

# Import the necessary libraries
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load the pre-trained GPT-2 tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Set the model to evaluation mode
model.eval()

# Define a prompt for the model to complete
prompt = input("You: ")

# Tokenize the prompt and generate text
input_ids = tokenizer.encode(prompt, return_tensors='pt')
output = model.generate(input_ids, max_length=50, do_sample=True)

# Decode the generated text and print it to the console
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print("AI: " + generated_text)

Install (or upgrade) the OpenAI Python library:

pip install openai
pip install --upgrade openai

List the models available through the API:

import os
import openai

# Placeholder; never hard-code a real API key in your source code
openai.api_key = 'sk-<YOUR_API_KEY>'

models = openai.Model.list()
datas = models['data']
for data in datas:
    print(data['id'])

A first chat completion request:

import os
import openai

openai.api_key = 'sk-<YOUR_API_KEY>'  # placeholder

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "Assistant is a large language model trained by OpenAI."},
        {"role": "user", "content": "Who were the founders of Microsoft?"}
    ]
)

print(response['choices'][0]['message']['content'])

To run GPT-2 locally, install the Hugging Face transformers library:

pip install transformers

from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

model.eval()

You can set the API key directly and verify it:

openai.api_key = 'sk-<YOUR_API_KEY>'  # placeholder
print(openai.api_key)

A better approach is to keep the key in a .env file and load it at startup:

import os
import openai

def init_api():
    with open('.env') as env:
        for line in env:
            key, value = line.strip().split('=')
            os.environ[key] = value

    openai.api_key = os.environ.get("API_KEY")

init_api()

models = openai.Model.list()
datas = models['data']
for data in datas:
    print(data['id'])

A basic completion request:

import os
import openai

def init_api():
    with open('.env') as env:
        for line in env:
            key, value = line.strip().split('=')
            os.environ[key] = value

    openai.api_key = os.environ.get("API_KEY")

init_api()

next = openai.Completion.create(
    model="text-davinci-003",
    prompt="Once upon a time",
    max_tokens=15,
    temperature=0.5
)

print(next)

Using GPT Text Completions

Logprobs

To inspect the most likely alternatives, we can use the logprobs parameter. For example, setting logprobs to 2 will return the log probabilities of the two most likely tokens at each position.

import os
import openai

def init_api():
    with open('.env') as env:
        for line in env:
            key, value = line.strip().split('=')
            os.environ[key] = value

    openai.api_key = os.environ.get("API_KEY")

init_api()

next = openai.Completion.create(
    model="text-davinci-003",
    prompt="Once upon a time",
    max_tokens=15,
    temperature=0,
    logprobs=3,
)

print(next)

Streaming the Results

Another common parameter is stream. It is possible to instruct the API to return a stream of tokens instead of a block containing all of them. In this case, the API returns a generator that yields tokens in the order they were generated.

import os
import openai

def init_api():
    with open('.env') as env:
        for line in env:
            key, value = line.strip().split('=')
            os.environ[key] = value

    openai.api_key = os.environ.get("API_KEY")

init_api()

next = openai.Completion.create(
    model="text-davinci-003",
    prompt="Once upon a time",
    max_tokens=7,
    stream=True,
)

print(type(next))

# # * will unpack the generator
# print(*next, sep='\n')

# Read the generator text elements one by one
for i in next:
    print(i["choices"][0]["text"])

Controlling Repetitivity: Frequency and Presence Penalties

The completions API has two features that can be used to stop the same words from being suggested too often. These features change the chances of certain words being suggested by adding a bonus or penalty to the logits (the numbers that show how likely a word is to be suggested).

import os
import openai

def init_api():
    with open('.env') as env:
        for line in env:
            key, value = line.strip().split('=')
            os.environ[key] = value

    openai.api_key = os.environ.get("API_KEY")

init_api()

next = openai.Completion.create(
    model="text-davinci-003",
    prompt="Once upon a time",
    max_tokens=100,
    frequency_penalty=2.0,
    presence_penalty=2.0,
)

print("=== Frequency and presence penalty 2.0 ===")
print(next["choices"][0]["text"])

next = openai.Completion.create(
    model="text-davinci-003",
    prompt="Once upon a time",
    max_tokens=100,
    frequency_penalty=-2.0,
    presence_penalty=-2.0,
)

print("=== Frequency and presence penalty -2.0 ===")
print(next["choices"][0]["text"])

The n parameter lets you request multiple candidate completions for the same prompt:

import os
import openai

def init_api():
    with open('.env') as env:
        for line in env:
            key, value = line.strip().split('=')
            os.environ[key] = value

    openai.api_key = os.environ.get("API_KEY")

init_api()

next = openai.Completion.create(
    model="text-davinci-003",
    prompt="Once upon a time",
    max_tokens=5,
    n=2
)

print(next)

Getting the “best of”

It is possible to ask the AI models to generate possible completions for a given task on the server side and select the one with the highest probability of being correct. This can be done using the best_of parameter.
When using best_of, you need to specify two numbers: n and best_of
As seen previously, n is the number of candidate completions you want to see.
Note: Make sure that best_of is greater than n.

import os
import openai

def init_api():
    with open('.env') as env:
        for line in env:
            key, value = line.strip().split('=')
            os.environ[key] = value

    openai.api_key = os.environ.get("API_KEY")

init_api()

next = openai.Completion.create(
    model="text-davinci-003",
    prompt="Once upon a time",
    max_tokens=5,
    n=1,
    best_of=2,
)

print(next)

Controlling when the completion stops

In most cases, it is useful to stop the API from generating more text at a certain point.

Let's say we want to generate a single paragraph and no more. In this case, we can ask the API to stop completing the text when there is a new line (\n). This can be done with code similar to the following:

import os
import openai

def init_api():
    with open('.env') as env:
        for line in env:
            key, value = line.strip().split('=')
            os.environ[key] = value

    openai.api_key = os.environ.get("API_KEY")

init_api()

next = openai.Completion.create(
    model="text-davinci-003",
    prompt="Once upon a time",
    max_tokens=5,
    stop=["\n",],
)

print(next)

The stop parameter can contain up to four stop sequences. Note that the completion will not include the stop sequence in the result.

import os
import openai

def init_api():
    with open('.env') as env:
        for line in env:
            key, value = line.strip().split('=')
            os.environ[key] = value

    openai.api_key = os.environ.get("API_KEY")

init_api()

next = openai.Completion.create(
    model="text-davinci-003",
    prompt="Once upon a time",
    max_tokens=5,
    stop=["\n", "Story", "End", "Once upon a time"],
)

print(next)

Using a Suffix After Text Completion

The suffix parameter specifies text that comes after the inserted completion.

Imagine we want to create a Python dict containing the list of prime numbers between 0 and 9:

{
    "primes": [2, 3, 5, 7]
}

The API, in this case, is supposed to return 2, 3, 5, 7. We can use the suffix parameter for that.

Example-1:

import os
import openai

def init_api():
    with open('.env') as env:
        for line in env:
            key, value = line.strip().split('=')
            os.environ[key] = value

    openai.api_key = os.environ.get("API_KEY")

init_api()

next = openai.Completion.create(
    model="text-davinci-003",
    prompt="Write a JSON containing prime numbers between 0 and 9 \n\n{\n\t\"primes\": [",
)

# print(next["choices"][0]["text"])
print(next)

Example-2:

import os
import openai

def init_api():
    with open('.env') as env:
        for line in env:
            key, value = line.strip().split('=')
            os.environ[key] = value

    openai.api_key = os.environ.get("API_KEY")

init_api()

next = openai.Completion.create(
    model="text-davinci-003",
    prompt="Write a JSON containing prime numbers between 0 and 9 \n\n{\n\t\"primes\": [",
    suffix="]\n}"
)

# print(next["choices"][0]["text"])
print(next)

Extracting keywords

By appending "Keywords:" at the end of the prompt, the model will recognize that we want keywords, and the output should look something like this:
Plankalkül, Fortran, ALGOL 58, Lisp

import os
import openai

def init_api():
    with open('.env') as env:
        for line in env:
            key, value = line.strip().split('=')
            os.environ[key] = value

    openai.api_key = os.environ.get("API_KEY")

init_api()

prompt = ('The first programming language to be invented was Plankalkül, which was designed by '
          'Konrad Zuse in the 1940s, but not publicly known until 1972 (and not implemented until '
          '1998). The first widely known and successful high-level programming language was '
          'Fortran, developed from 1954 to 1957 by a team of IBM researchers led by John Backus. '
          'The success of FORTRAN led to the formation of a committee of scientists to develop a '
          '"universal" computer language; the result of their effort was ALGOL 58. Separately, '
          'John McCarthy of MIT developed Lisp, the first language with origins in academia '
          'to be successful. With the success of these initial efforts, programming languages '
          'became an active topic of research in the 1960s and beyond.\n\nKeywords:')

tweet = openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt,
    temperature=0.5,
    max_tokens=300,
)

print(tweet["choices"][0]["text"])

You can play with the prompt and try different things such as: Keywords:\n-

import os
import openai

def init_api():
    with open('.env') as env:
        for line in env:
            key, value = line.strip().split('=')
            os.environ[key] = value

    openai.api_key = os.environ.get("API_KEY")

init_api()

prompt = ('The first programming language to be invented was Plankalkül, which was designed by '
          'Konrad Zuse in the 1940s, but not publicly known until 1972 (and not implemented until '
          '1998). The first widely known and successful high-level programming language was '
          'Fortran, developed from 1954 to 1957 by a team of IBM researchers led by John Backus. '
          'The success of FORTRAN led to the formation of a committee of scientists to develop a '
          '"universal" computer language; the result of their effort was ALGOL 58. Separately, '
          'John McCarthy of MIT developed Lisp, the first language with origins in academia '
          'to be successful. With the success of these initial efforts, programming languages '
          'became an active topic of research in the 1960s and beyond.\n\nKeywords:\n-')

tweet = openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt,
    temperature=0.5,
    max_tokens=300,
)

print(tweet.choices[0].text)

Generating Tweets

We can append “Tweet:” instead of “Keywords:” to prompt the model to return a tweet-style message.

import os
import openai

def init_api():
    with open('.env') as env:
        for line in env:
            key, value = line.strip().split('=')
            os.environ[key] = value

    openai.api_key = os.environ.get("API_KEY")

init_api()

prompt = ('The first programming language to be invented was Plankalkül, which was designed by '
          'Konrad Zuse in the 1940s, but not publicly known until 1972 (and not implemented until '
          '1998). The first widely known and successful high-level programming language was '
          'Fortran, developed from 1954 to 1957 by a team of IBM researchers led by John Backus. '
          'The success of FORTRAN led to the formation of a committee of scientists to develop a '
          '"universal" computer language; the result of their effort was ALGOL 58. Separately, '
          'John McCarthy of MIT developed Lisp, the first language with origins in academia '
          'to be successful. With the success of these initial efforts, programming languages '
          'became an active topic of research in the 1960s and beyond.\n\nTweet:')

tweet = openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt,
    temperature=0.5,
    max_tokens=300,
)

print(tweet.choices[0].text)

We can also ask for hashtags using “Tweet with hashtags:”. Here's the code:

import os
import openai

def init_api():
    with open('.env') as env:
        for line in env:
            key, value = line.strip().split('=')
            os.environ[key] = value

    openai.api_key = os.environ.get("API_KEY")

init_api()

prompt = ('The first programming language to be invented was Plankalkül, which was designed by '
          'Konrad Zuse in the 1940s, but not publicly known until 1972 (and not implemented until '
          '1998). The first widely known and successful high-level programming language was '
          'Fortran, developed from 1954 to 1957 by a team of IBM researchers led by John Backus. '
          'The success of FORTRAN led to the formation of a committee of scientists to develop a '
          '"universal" computer language; the result of their effort was ALGOL 58. Separately, '
          'John McCarthy of MIT developed Lisp, the first language with origins in academia '
          'to be successful. With the success of these initial efforts, programming languages '
          'became an active topic of research in the 1960s and beyond.\n\nTweet with hashtags:')

tweet = openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt,
    temperature=0.5,
    max_tokens=300,
)

print(tweet.choices[0].text)

Generating a Rap Song

  • Example-1 with text-davinci-002

import os
import openai

def init_api():
    with open('.env') as env:
        for line in env:
            key, value = line.strip().split('=')
            os.environ[key] = value

    openai.api_key = os.environ.get("API_KEY")

init_api()

prompt = 'Write a rap song:\n\n'

tweet = openai.Completion.create(
    model="text-davinci-002",
    prompt=prompt,
    temperature=0.5,
    max_tokens=200,
)

print(tweet.choices[0].text.strip())

  • The same example with text-davinci-003

import os
import openai

def init_api():
    with open('.env') as env:
        for line in env:
            key, value = line.strip().split('=')
            os.environ[key] = value

    openai.api_key = os.environ.get("API_KEY")

init_api()

prompt = 'Write a rap song:\n\n'

tweet = openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt,
    temperature=0.5,
    max_tokens=200,
)

print(tweet.choices[0].text.strip())

Generating a Todo List

In this example, we are asking the model to generate a to-do list for creating a company in the US. We want five items on the list.

import os
import openai

def init_api():
    with open('.env') as env:
        for line in env:
            key, value = line.strip().split('=')
            os.environ[key] = value

    openai.api_key = os.environ.get("API_KEY")

init_api()

next = openai.Completion.create(
    model="text-davinci-002",
    prompt="Todo list to create a company in US\n\n1.",
    temperature=0.3,
    max_tokens=64,
    top_p=0.1,
    frequency_penalty=0,
    presence_penalty=0.5,
    stop=["6."],
)

print(next.choices[0].text)
print(next)

model:

specifies the model that the API should use for generating the text completion. In this case,
it is using “text-davinci-002”.

prompt:

is the text that the API uses as a starting point for generating the completion. In our case,
we used a prompt that is a to-do list for creating a company in the US. The first item should start
with “1.”, since we asked for output in this format:

1. <1st item>
2. <2nd item>
3. <3rd item>
4. <4th item>
5. <5th item>

temperature

controls the “creativity” of the text generated by the model. The higher the temperature, the
more creative and diverse the completions will be. On the other hand, a lower temperature will result
in more “conservative” and predictable completions. In this case, the temperature is set to 0.3.

max_tokens

limits the maximum number of tokens that the API will generate. In our case, the
maximum number of tokens is 64. You can increase this value but keep in mind that the more
tokens you will generate, the more credits you will be charged. When learning and testing, keeping
a lower value will help you avoid overspending.

top_p

controls the proportion of the probability mass that the API considers when generating
the next token. A lower value restricts sampling to the most likely tokens, producing more conservative
completions, while a higher value allows more diverse completions. In this case, top_p is set to 0.1.
It is not recommended to use both this and temperature at the same time, but doing so is not a blocking issue.

frequency_penalty

is used to adjust the model’s preference for generating frequent or rare words.
A positive value will decrease the chances of frequent words, while a negative value will increase
them. In this case, the frequency_penalty is set to 0.

presence_penalty

is used to adjust the model’s preference for generating words that are already present in the
text so far. A positive value will decrease the chances of words that are present in the
prompt, and a negative value will increase them. The presence_penalty is set to 0.5 in our example.

stop

is used to specify a sequence of tokens after which the API should stop generating. In
our example, since we only want five items, we stop generating once the token “6.” appears.

Conclusion

The OpenAI Completions API is a powerful tool for generating text in various contexts. With the
right parameters and settings, it can produce natural-sounding text that is pertinent to the task.
By configuring the right values for some parameters such as frequency and presence penalties, the
results can be tailored to produce desired outcomes.
With the ability to control when the completion stops, the user can also control the length of the
generated text. This can also help reduce the number of tokens generated and, indirectly,
reduce costs.

Editing Text Using GPT

After being given a prompt and a set of instructions, the GPT model you are using will take the prompt and then use its algorithms to generate a modified version of the original prompt.

The modified version can be longer and/or more detailed than the initial prompt depending on your instructions.

A GPT model is able to understand the context of the prompt and the instructions given, allowing it to determine which additional details would be most beneficial to include in the output.

Translating Text

This is an example:

  • Use the API endpoint: openai.Edit.create
  • Use instruction and input (optional)

import os
import openai

def init_api():
    with open('.env') as env:
        for line in env:
            key, value = line.strip().split('=')
            os.environ[key] = value

    openai.api_key = os.environ.get("API_KEY")

init_api()

response = openai.Edit.create(
    model="text-davinci-edit-001",
    input='Hallo Welt',
    instruction='Translate to English',
)

print(response.choices[0].text)
print(response)

Another example, with instruction only and no input:

import os
import openai

def init_api():
    with open('.env') as env:
        for line in env:
            key, value = line.strip().split('=')
            os.environ[key] = value

    openai.api_key = os.environ.get("API_KEY")

init_api()

response = openai.Edit.create(
    model="text-davinci-edit-001",
    instruction="Translate the following sentence to English: 'Hallo Welt'",
)

print(response.choices[0].text)
print(response)

Editing using the completions endpoint and vice versa

Some tasks you can execute using the edits endpoint can also be done using the completions endpoint. It is up to you to choose which one best fits your needs.
Here's an example of a translation task using the edits endpoint:

# Example using the edits endpoint
import os
import openai

def init_api():
    with open('.env') as env:
        for line in env:
            key, value = line.strip().split('=')
            os.environ[key] = value

    openai.api_key = os.environ.get("API_KEY")

init_api()

response = openai.Edit.create(
    model="text-davinci-edit-001",
    instruction="Translate from English to Japanese, French, Arabic, and Spanish.\n1: Japanese: ",
    input="The cat sat on the mat"
)

print(response.choices[0].text)
print(response)

# Example using the chat completions endpoint
import os
import openai

def init_api():
    with open('.env') as env:
        for line in env:
            key, value = line.strip().split('=')
            os.environ[key] = value

    openai.api_key = os.environ.get("API_KEY")

init_api()

content = """
Translate the following sentence from English to Japanese, French, Arabic, and Spanish.
The cat sat on the mat.

"""
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": content},
    ],
)

print(response.choices[0].message.content)
# print(response)

Formatting the output

# Example: ask the model to explain a Golang snippet
import os
import openai

def init_api():
    with open('.env') as env:
        for line in env:
            key, value = line.strip().split('=')
            os.environ[key] = value

    openai.api_key = os.environ.get("API_KEY")

init_api()

message = '''
package main

import (
    "io/ioutil"
    "log"
    "net/http"
)

func main() {
    resp, err := http.Get("https://website.com")

    if err != nil {
        log.Fatalln(err)
    }

    body, err := ioutil.ReadAll(resp.Body)
    if err != nil {
        log.Fatalln(err)
    }

    sb := string(body)
    log.Printf(sb)
}
'''

response = openai.Edit.create(
    model="text-davinci-edit-001",
    instruction="Explain the following Golang code",
    input=message,
    temperature=0.5
)

print(response.choices[0].text)

Creativity vs. Well-defined answers

As with the completions endpoint, we can control the creativity of the result using the temperature parameter.
You can try this example with two different temperatures to see the difference in the output:

import os
import openai

def init_api():
    with open('.env') as env:
        for line in env:
            key, value = line.strip().split('=')
            os.environ[key] = value

    openai.api_key = os.environ.get("API_KEY")

init_api()

response_1 = openai.Edit.create(
    model="text-davinci-edit-001",
    instruction="correct the spelling mistakes:",
    input="The kuick brown fox jumps over the lazy dog and",
    temperature=0,
)

response_2 = openai.Edit.create(
    model="text-davinci-edit-001",
    instruction="correct the spelling mistakes:",
    input="The kuick brown fox jumps over the lazy dog and",
    temperature=0.9,
)

print("Temperature 0:")
print(response_1.choices[0].text)
print("Temperature 0.9:")
print(response_2.choices[0].text)

Generally, after running the code multiple times, you may observe that the first output is consistent, while the second one changes from one execution to the next. For a use case such as fixing typos, we usually don't need creativity, so setting the temperature parameter to 0 is enough.
We can also use top_p to control creativity; it is similar to temperature. It means only the tokens comprising the top p probability mass are considered in the result.
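
For instance, a sketch reusing the spelling-correction example above with top_p instead of temperature (as noted, combining both parameters is not recommended):

response = openai.Edit.create(
    model="text-davinci-edit-001",
    instruction="correct the spelling mistakes:",
    input="The kuick brown fox jumps over the lazy dog and",
    top_p=0.1,  # only tokens in the top 10% of probability mass are considered
)
print(response.choices[0].text)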

Generating multiple edits

In all of the previous examples, we got a single edit. However, using the parameter n, it is possible to get more: just set it to the number of edits you want.

import os
import openai

def init_api():
    with open('.env') as env:
        for line in env:
            key, value = line.strip().split('=')
            os.environ[key] = value

    openai.api_key = os.environ.get("API_KEY")

init_api()

response = openai.Edit.create(
    model="text-davinci-edit-001",
    instruction="Edit the text to make it longer.",
    input="Exercise is good for your health.",
    top_p=0.2,
    n=2
)

print(response.choices[0].text)
print(response.choices[1].text)

Advanced Text Manipulation

Until now, we have seen how to use two endpoints: edits and completions. Let's work through more examples to understand the different possibilities the model offers.

Chaining completions and edits

  • Use the completions endpoint to generate a tweet with hashtags.
  • Use the generated tweet as the input, with an instruction to translate it.

import os
import openai

def init_api():
    with open('.env') as env:
        for line in env:
            key, value = line.strip().split('=')
            os.environ[key] = value

    openai.api_key = os.environ.get("API_KEY")

init_api()

prompt = ('The first programming language to be invented was Plankalkül, which was designed by '
          'Konrad Zuse in the 1940s, but not publicly known until 1972 (and not implemented until '
          '1998). The first widely known and successful high-level programming language was '
          'Fortran, developed from 1954 to 1957 by a team of IBM researchers led by John Backus. '
          'The success of FORTRAN led to the formation of a committee of scientists to develop a '
          '"universal" computer language; the result of their effort was ALGOL 58. Separately, '
          'John McCarthy of MIT developed Lisp, the first language with origins in academia '
          'to be successful. With the success of these initial efforts, programming languages '
          'became an active topic of research in the 1960s and beyond.\n\nTweet with hashtags:')

english_tweet = openai.Completion.create(
    model="text-davinci-002",
    prompt=prompt,
    temperature=0.5,
    max_tokens=20,
)

english_tweet_text = english_tweet.choices[0].text.strip()

print("English tweet:")
print(english_tweet_text)

spanish_tweet = openai.Edit.create(
    model="text-davinci-edit-001",
    input=english_tweet_text,
    instruction="Translate to Spanish",
    temperature=0.5,
)

spanish_tweet_text = spanish_tweet.choices[0].text.strip()

print("Spanish tweet:")
print(spanish_tweet_text)

Apple the Company vs. Apple the Fruit (Context Stuffing)

prompt = "Determine the part of speech of the word 'light'.\n\n"

result = openai.Completion.create(
    model="text-davinci-002",
    prompt=prompt,
    max_tokens=20,
    temperature=1,
)

print(result.choices[0].text.strip())

prompt_a = "The light is red. Determine the part of speech of the word 'light'.\n\n"
prompt_b = "This desk is very light. Determine the part of speech of the word 'light'.\n\n"
prompt_c = "You light up my life. Determine the part of speech of the word 'light'.\n\n"

for prompt in [prompt_a, prompt_b, prompt_c]:
    result = openai.Completion.create(
        model="text-davinci-002",
        prompt=prompt,
        max_tokens=20,
        temperature=0,
    )
    print(result.choices[0].text.strip())

prompt_1 = "Huawei:\ncompany\n\nGoogle:\ncompany\n\nMicrosoft:\ncompany\n\nApple:\n"
prompt_2 = "Huawei:\ncompany\n\nGoogle:\ncompany\n\nMicrosoft:\ncompany\n\nApricot:\nFruit\n\nApple:\n"

for prompt in [prompt_1, prompt_2]:
    result = openai.Completion.create(
        model="text-davinci-002",
        prompt=prompt,
        max_tokens=20,
        temperature=0,
        stop=["\n", " "],
    )

    print(result.choices[0].text.strip())

Getting cryptocurrency information based on a user-defined schema (context stuffing)

  • define a schema or template

prompt = """Input: Bitcoin
Output:
BTC was created in 2008, you can learn more about it here: https://bitcoin.org/en/ and get the latest price here: https://www.coingecko.com/en/coins/bitcoin.
It's all-time high is $64,895.00 and it's all-time low is $67.81.

Input: Ethereum
Output:
ETH was created in 2015, you can learn more about it here: https://ethereum.org/en/ and get the latest price here: https://www.coingecko.com/en/coins/ethereum
It's all-time high is $4,379.00 and it's all-time low is $0.43.

Input: Dogecoin
Output:
DOGE was created in 2013, you can learn more about it here: https://dogecoin.com/ and get the latest price here: https://www.coingecko.com/en/coins/dogecoin
It's all-time high is $0.73 and it's all-time low is $0.000002.

Input: Cardano
Output:
"""

result = openai.Completion.create(
    model="text-davinci-002",
    prompt=prompt,
    max_tokens=200,
    temperature=0,
)

print(result.choices[0].text.strip())

Creating a Chatbot assistant to help with Linux commands

Disclaimer: This part was inspired by an old demo of OpenAI from 2020.
Our goal is to develop a command-line tool that can assist us with Linux commands through
conversation.
Let’s start with this example:

pip install click==8.1.3

import os
import openai
import click

def init_api():
    with open('.env') as env:
        for line in env:
            key, value = line.strip().split('=')
            os.environ[key] = value

    openai.api_key = os.environ.get("API_KEY")

init_api()

_prompt = """
Input: List all the files in the current directory
Output: ls -l

Input: List all the files in the current directory, including hidden files
Output: ls -la

Input: Delete all the files in the current directory
Output: rm *

Input: Count the number of occurrences of the word "sun" in the file "test.txt"
Output: grep -o "sun" test.txt | wc -l

Input: {}
Output:"""

while True:
    request = input(click.style("Input (type 'exit' to quit): ", fg="green"))
    if request == "exit":
        break

    prompt = _prompt.format(request)

    try:
        result = openai.Completion.create(
            model="text-davinci-002",
            prompt=prompt,
            max_tokens=100,
            temperature=0,
            stop=["\n"],
        )
        command = result.choices[0].text.strip()
        click.echo(click.style("Output: ", fg="yellow") + command)

        click.echo(click.style("Execute? (y/n): ", fg="yellow"), nl=False)
        choice = input()
        if choice == "y":
            os.system(command)
        elif choice == "n":
            continue
        else:
            click.echo(click.style("Invalid choice. Please enter 'y' or 'n'.", fg="red"))
    except Exception as e:
        click.echo(click.style("The command could not be executed. {}".format(e), fg="red"))

    click.echo()

Embedding

  • measure how similar two text strings are to each other
  • used for tasks like:
    • finding the most relevant results to a search query
    • grouping text strings together based on how similar they are
    • recommending items with similar text strings
    • finding text strings that are very different from the others
    • analyzing how different text strings are from each other
    • labeling text strings based on what they are most like.

Here are some practical uses of embeddings in industry:

  • Tesla
  • Kalendar AI
  • Notion
  • DALL-E 2

To work with embeddings, you should install datalib using the following command:

pip install datalib

Later in this guide, we will also need Matplotlib and other libraries:

pip install matplotlib plotly scipy scikit-learn

This package will also install tools like pandas and NumPy.
These libraries are some of the most used in AI and data science in general.


Understanding Text Embedding

import os
import openai

def init_api():
    with open('.env') as env:
        for line in env:
            key, value = line.strip().split('=')
            os.environ[key] = value

    openai.api_key = os.environ.get("API_KEY")

init_api()

response = openai.Embedding.create(
    model="text-embedding-ada-002",
    input="I am a programmer",
)

print(response)

# These floating points represent the embedding of the input text “I am a programmer” generated by the OpenAI “text-embedding-ada-002” model.
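
The vector itself can be read from the response object. For example, the following prints its dimensionality, which is 1536 for text-embedding-ada-002:

# Extract the embedding vector from the response
embedding = response['data'][0]['embedding']
print(len(embedding))  # 1536 for text-embedding-ada-002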

Embeddings for Multiple Inputs

We can pass multiple inputs to get several embeddings at once; here's an example:

response = openai.Embedding.create(
    model="text-embedding-ada-002",
    input=["I am a programmer", "I am a writer"],
)

for data in response.data:
    print(data.embedding)

Semantic Search

We are going to implement a semantic search using OpenAI embeddings:

import os
import openai
import pandas as pd
import numpy as np
from openai.embeddings_utils import get_embedding, cosine_similarity


def init_api():
    with open('.env') as env:
        for line in env:
            key, value = line.strip().split('=')
            os.environ[key] = value

    openai.api_key = os.environ.get("API_KEY")

init_api()

# words.csv is a csv file with a column named 'text' containing words
df = pd.read_csv('words.csv')

# get the embeddings for each word in the dataframe
df['embedding'] = df['text'].apply(lambda x: get_embedding(x, engine='text-embedding-ada-002'))

# save the dataframe to a csv file
df.to_csv('embedding.csv')

# read the csv file
df = pd.read_csv('embedding.csv')

# convert the embedding column back to a numpy array
df['embedding'] = df['embedding'].apply(eval).apply(np.array)

# get the search term from the user
user_search = input('Enter a search term: ')

# get the embedding for the search term
search_term_embedding = get_embedding(user_search, engine='text-embedding-ada-002')

# calculate the cosine similarity between the search term and each word in the dataframe
df['similarity'] = df['embedding'].apply(lambda x: cosine_similarity(x, search_term_embedding))

# sort the dataframe by the similarity column
df = df.sort_values(by='similarity', ascending=False)

# print the top 10 results
print(df.head(10))

Cosine Similarity

Cosine similarity is a way of measuring how similar two vectors are. It looks at the angle between
two vectors and compares them: cosine similarity is the cosine of the angle between the
vectors. The result is a number between -1 and 1. If the vectors point in the same direction, the result is 1. If the
vectors point in opposite directions, the result is -1. If the vectors are at a 90-degree angle, the result is 0. In mathematical terms, this is the equation:
$$
\text{similarity} = \frac{A \cdot B}{\|A\| \, \|B\|}
$$

  • A and B are vectors
  • A · B is the dot product of the two vectors. It is computed by multiplying each pair of corresponding elements and adding all of those products together.
  • ||A|| is the length (norm) of the vector A. It is calculated by taking the square root of the sum of the squares of each element of the vector A.

# import numpy and norm from numpy.linalg
import numpy as np
from numpy.linalg import norm

# define two vectors
A = np.array([2, 3, 5, 2, 6, 7, 9, 2, 3, 4])
B = np.array([3, 6, 3, 1, 0, 9, 2, 3, 4, 5])

# print the vectors
print("Vector A: {}".format(A))
print("Vector B: {}".format(B))

# calculate the cosine similarity
cosine = np.dot(A, B) / (norm(A) * norm(B))

# print the cosine similarity
print("Cosine Similarity between A and B: {}".format(cosine))

Advanced Embedding Examples

Predicting your preferred Coffee

pip install nltk

import os
import pandas as pd
import numpy as np
import nltk
import openai
from openai.embeddings_utils import get_embedding, cosine_similarity

def init_api():
    with open('.env') as env:
        for line in env:
            key, value = line.strip().split('=')
            os.environ[key] = value

    openai.api_key = os.environ.get("API_KEY")

def download_nltk_data():
    try:
        nltk.data.find('tokenizers/punkt')
    except LookupError:
        nltk.download('punkt')
    try:
        nltk.data.find('corpora/stopwords')
    except LookupError:
        nltk.download('stopwords')

def preprocess_review(review):
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer
    stopwords = set(stopwords.words('english'))
    stemmer = PorterStemmer()
    tokens = nltk.word_tokenize(review.lower())
    tokens = [token for token in tokens if token not in stopwords]
    tokens = [stemmer.stem(token) for token in tokens]
    return ' '.join(tokens)

init_api()

download_nltk_data()

# Read user input
input_coffee_name = input("Enter a coffee name: ")

# Load the CSV file into a Pandas DataFrame
# (only the first 50 rows for now to speed up the demo and avoid paying for too many API calls)
df = pd.read_csv('simplified_coffee.csv', nrows=50)

# Preprocess the review text: lowercase, tokenize, remove stopwords, and stem
df['preprocessed_review'] = df['review'].apply(preprocess_review)

# Get the embeddings for each review
review_embeddings = []
for review in df['preprocessed_review']:
    review_embeddings.append(get_embedding(review, engine='text-embedding-ada-002'))

# Get the index of the input coffee name
try:
    input_coffee_index = df[df['name'] == input_coffee_name].index[0]
except:
    print("Sorry, we don't have that coffee in our database. Please try again.")
    exit()

# Calculate the cosine similarity between the input coffee's review and all other reviews
similarities = []
input_review_embedding = review_embeddings[input_coffee_index]
for review_embedding in review_embeddings:
    similarity = cosine_similarity(input_review_embedding, review_embedding)
    similarities.append(similarity)

# Get the indices of the most similar reviews (excluding the input coffee's review itself)
most_similar_indices = np.argsort(similarities)[-6:-1]
# why -1? because the last one is the input coffee itself

# Get the names of the most similar coffees
similar_coffee_names = df.iloc[most_similar_indices]['name'].tolist()

# Print the results
print("The most similar coffees to {} are:".format(input_coffee_name))
for coffee_name in similar_coffee_names:
    print(coffee_name)

  • The previous code has a limitation: it requires an exact match between the user input and the existing coffee names.
  • A “fuzzier” search can leverage fuzzy-matching techniques such as the Levenshtein distance, or a cosine similarity search between the user input and the coffee names. The version below uses the latter; a Levenshtein sketch follows the code.

import os
import pandas as pd
import numpy as np
import nltk
import openai
from openai.embeddings_utils import get_embedding, cosine_similarity

def init_api():
    with open('.env') as env:
        for line in env:
            key, value = line.strip().split('=')
            os.environ[key] = value

    openai.api_key = os.environ.get("API_KEY")

def download_nltk_data():
    try:
        nltk.data.find('tokenizers/punkt')
    except LookupError:
        nltk.download('punkt')
    try:
        nltk.data.find('corpora/stopwords')
    except LookupError:
        nltk.download('stopwords')

def preprocess_review(review):
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer
    stopwords = set(stopwords.words('english'))
    stemmer = PorterStemmer()
    tokens = nltk.word_tokenize(review.lower())
    tokens = [token for token in tokens if token not in stopwords]
    tokens = [stemmer.stem(token) for token in tokens]
    return ' '.join(tokens)

init_api()

download_nltk_data()

# Read user input
input_coffee_name = input("Enter a coffee name: ")

# Load the CSV file into a Pandas DataFrame
# (only the first 50 rows for now to speed up the demo and avoid paying for too many API calls)
df = pd.read_csv('simplified_coffee.csv', nrows=50)

# Preprocess the review text: lowercase, tokenize, remove stopwords, and stem
df['preprocessed_review'] = df['review'].apply(preprocess_review)

# Get the embeddings for each review
review_embeddings = []
for review in df['preprocessed_review']:
    review_embeddings.append(get_embedding(review, engine='text-embedding-ada-002'))

# Get the index of the input coffee name, with "fuzzy search"
try:
    input_coffee_index = df[df['name'] == input_coffee_name].index[0]
except IndexError:
    # get the embeddings for each name
    print("Sorry, we don't have that coffee in our database. We will try to find the closest match.")
    name_embeddings = []
    for name in df['name']:
        name_embeddings.append(get_embedding(name, engine='text-embedding-ada-002'))
    # perform a cosine similarity search on the input coffee name
    input_coffee_embedding = get_embedding(input_coffee_name, engine='text-embedding-ada-002')
    _similarities = []
    for name_embedding in name_embeddings:
        _similarities.append(cosine_similarity(input_coffee_embedding, name_embedding))

    input_coffee_index = _similarities.index(max(_similarities))
except:
    print("Sorry, we don't have that coffee in our database. Please try again.")
    exit()

# Calculate the cosine similarity between the input coffee's review and all other reviews
similarities = []
input_review_embedding = review_embeddings[input_coffee_index]
for review_embedding in review_embeddings:
    similarity = cosine_similarity(input_review_embedding, review_embedding)
    similarities.append(similarity)

# Get the indices of the most similar reviews (excluding the input coffee's review itself)
most_similar_indices = np.argsort(similarities)[-6:-1]
# why -1? because the last one is the input coffee itself

# Get the names of the most similar coffees
similar_coffee_names = df.iloc[most_similar_indices]['name'].tolist()

# Print the results
print("The most similar coffees to {} are:".format(input_coffee_name))
for coffee_name in similar_coffee_names:
    print(coffee_name)
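
For reference, here is a minimal, pure-Python sketch of the Levenshtein distance mentioned above. It has no API cost and could be used instead of (or before) the embedding-based name match:

def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits (insert, delete,
    substitute) needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

# Pick the coffee name closest to the user's input, e.g.:
# closest = min(df['name'], key=lambda n: levenshtein(input_coffee_name.lower(), n.lower()))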

Predicting News Categories Using Embeddings (Zero-Shot)

This example will introduce a zero-shot news classifier that predicts the category of a news article.

import os
import pandas as pd
import numpy as np
import nltk
import openai
from openai.embeddings_utils import get_embedding, cosine_similarity

def init_api():
    with open('.env') as env:
        for line in env:
            key, value = line.strip().split('=')
            os.environ[key] = value

    openai.api_key = os.environ.get("API_KEY")

init_api()

categories = [
    "POLITICS",
    "WELLNESS",
    "ENTERTAINMENT",
    "TRAVEL",
    "STYLE & BEAUTY",
    "PARENTING",
    "HEALTHY LIVING",
    "QUEER VOICES",
    "FOOD & DRINK",
    "BUSINESS",
    "COMEDY",
    "SPORTS",
    "BLACK VOICES",
    "HOME & LIVING",
    "PARENTS",
]

# define a function to classify a sentence
def classify_sentence(sentence):
    # Get the embedding of the sentence
    sentence_embedding = get_embedding(sentence, engine="text-embedding-ada-002")

    # Calculate the similarity score between the sentence and each category
    similarity_scores = {}
    for category in categories:
        category_embeddings = get_embedding(category, engine="text-embedding-ada-002")
        similarity_scores[category] = cosine_similarity(sentence_embedding, category_embeddings)

    # Return the category with the highest similarity score
    return max(similarity_scores, key=similarity_scores.get)

sentences = [
    "1 dead and 3 injured in El Paso, Texas, mall shooting",
    "Director Owen Kline Calls Funny Pages His ‘Self-Critical’ Debut",
    "15 spring break ideas for families that want to get away",
    "The US is preparing to send more troops to the Middle East",
    "Bruce Willis' 'condition has progressed' to frontotemporal dementia, his family says",
    "Get an inside look at Universal’s new Super Nintendo World",
    "Barcelona 2-2 Manchester United: Marcus Rashford shines but Raphinha salvages draw for hosts",
    "Chicago bulls win the NBA championship",
    "The new iPhone 12 is now available",
    "Scientists discover a new dinosaur species",
    "The new coronavirus vaccine is now available",
    "The new Star Wars movie is now available",
    "Amazon stock hits a new record high",
]

for sentence in sentences:
    print("{:50} category is {}".format(sentence, classify_sentence(sentence)))
    print()

Evaluating the Accuracy of a Zero-Shot Classifier

It looks like the previous classifier is almost perfect, but there is a way to check whether it is really accurate and to generate an accuracy score.

We are going to start by downloading this dataset from Kaggle and saving it as News_Category_Dataset_v3.json.

This dataset contains around 210k news headlines from 2012 to 2022 from HuffPost. The dataset classifies the headline of each article into a category.

import os
import openai
import pandas as pd
import numpy as np
from openai.embeddings_utils import get_embedding, cosine_similarity
from sklearn.metrics import precision_score

def init_api():
    with open('.env') as env:
        for line in env:
            key, value = line.strip().split('=')
            os.environ[key] = value

    openai.api_key = os.environ.get("API_KEY")

init_api()

categories = [
    "POLITICS",
    "WELLNESS",
    "ENTERTAINMENT",
    "TRAVEL",
    "STYLE & BEAUTY",
    "PARENTING",
    "HEALTHY LIVING",
    "QUEER VOICES",
    "FOOD & DRINK",
    "BUSINESS",
    "COMEDY",
    "SPORTS",
    "BLACK VOICES",
    "HOME & LIVING",
    "PARENTS",
]

# Define a function to classify a sentence
def classify_sentence(sentence):
    # Get the embedding of the sentence
    sentence_embedding = get_embedding(sentence, engine="text-embedding-ada-002")
    # Calculate the similarity score between the sentence and each category
    similarity_scores = {}
    for category in categories:
        category_embeddings = get_embedding(category, engine="text-embedding-ada-002")
        similarity_scores[category] = cosine_similarity(sentence_embedding, category_embeddings)
    # Return the category with the highest similarity score
    return max(similarity_scores, key=similarity_scores.get)

def evaluate_precision(categories):
    # Load the dataset
    df = pd.read_json("News_Category_Dataset_v3.json", lines=True).head(20)
    y_true = []
    y_pred = []

    # Classify each sentence
    for _, row in df.iterrows():
        true_category = row['category']
        predicted_category = classify_sentence(row['headline'])

        y_true.append(true_category)
        y_pred.append(predicted_category)

    # Calculate the precision score
    return precision_score(y_true, y_pred, average='micro', labels=categories)

precision_evaluated = evaluate_precision(categories)
print("Precision: {:.2f}".format(precision_evaluated))

Fine Tuning & Best Practices

Few-shot learning and an example of fine-tuning in practice

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
# 1. create a json file with below test prompts
[
{
"prompt":"When do I have to start the heater?",
"completion":"Every day in the morning at 7AM. You should stop it at 2PM"
},

{
"prompt":"Where is the garage remote control?",
"completion":"Next to the yellow door, on the key ring"
},

{
"prompt":"Is it necessary to program the scent diffuser every day?",
"completion":"The scent diffuser is already programmed, you just need to recharge it when its battery is low"
}
]

# 2. generate the jasonl file based on the previous json file
root@c1ws-test:~/jupyter_workspace# cat data.json
root@c1ws-test:~/jupyter_workspace# openai tools fine_tunes.prepare_data -f data.json


# 3. Fine-tune a new model based on the base model: davinci
root@c1ws-test:~/jupyter_workspace# export OPENAI_API_KEY="YOUR_API_KEY"

### Optionally, add a suffix to the new model's name as below:
# root@c1ws-test:~/jupyter_workspace# openai api fine_tunes.create -t "data_prepared.jsonl" -m davinci --suffix "felix_yang"

root@c1ws-test:~/jupyter_workspace# openai api fine_tunes.create -t "data_prepared.jsonl" -m davinci
Upload progress: 100%|████████████████████████████████████████████████████████████████████| 417/417 [00:00<00:00, 540kit/s]
Uploaded file from data_prepared.jsonl: file-WIRS3kIX67OFCmGdBKlYhawT
Created fine-tune: ft-vs2MUECcW4S92WGroV4xtfuc
Streaming events until fine-tuning is complete...

(Ctrl-C will interrupt the stream, but not cancel the fine-tune)
[2023-06-21 03:28:41] Created fine-tune: ft-vs2MUECcW4S92WGroV4xtfuc

Stream interrupted (client disconnected).
To resume the stream, run:

openai api fine_tunes.follow -i ft-vs2MUECcW4S92WGroV4xtfuc

root@c1ws-test:~/jupyter_workspace# openai api fine_tunes.follow -i ft-vs2MUECcW4S92WGroV4xtfuc
[2023-06-21 03:28:41] Created fine-tune: ft-vs2MUECcW4S92WGroV4xtfuc
[2023-06-21 03:29:52] Fine-tune costs $0.01
[2023-06-21 03:29:53] Fine-tune enqueued. Queue number: 0
[2023-06-21 03:30:05] Fine-tune started

Stream interrupted (client disconnected).
To resume the stream, run:

openai api fine_tunes.follow -i ft-vs2MUECcW4S92WGroV4xtfuc

root@c1ws-test:~/jupyter_workspace# openai api fine_tunes.follow -i ft-vs2MUECcW4S92WGroV4xtfuc
[2023-06-21 03:28:41] Created fine-tune: ft-vs2MUECcW4S92WGroV4xtfuc
[2023-06-21 03:29:52] Fine-tune costs $0.01
[2023-06-21 03:29:53] Fine-tune enqueued. Queue number: 0
[2023-06-21 03:30:05] Fine-tune started
[2023-06-21 03:32:56] Completed epoch 1/4
[2023-06-21 03:32:57] Completed epoch 2/4
[2023-06-21 03:32:58] Completed epoch 3/4
[2023-06-21 03:32:59] Completed epoch 4/4


# 4. Check the new model id
root@c1ws-test:~/jupyter_workspace# openai api fine_tunes.list
{
  "object": "list",
  "data": [
    {
      "object": "fine-tune",
      "id": "ft-vs2MUECcW4S92WGroV4xtfuc",  ## Fine_tune_job_id
      "hyperparams": {
        "n_epochs": 4,
        "batch_size": 1,
        "prompt_loss_weight": 0.01,
        "learning_rate_multiplier": 0.1
      },
      "organization_id": "org-1kpSRIJZFLe1LzYA6NIMCm20",
      "model": "davinci",
      "training_files": [
        {
          "object": "file",
          "id": "file-WIRS3kIX67OFCmGdBKlYhawT",
          "purpose": "fine-tune",
          "filename": "data_prepared.jsonl",
          "bytes": 417,
          "created_at": 1687318121,
          "status": "processed",
          "status_details": null
        }
      ],
      "validation_files": [],
      "result_files": [
        {
          "object": "file",
          "id": "file-4lcuZduyevJH94dl7nkf4BPO",
          "purpose": "fine-tune-results",
          "filename": "compiled_results.csv",
          "bytes": 754,
          "created_at": 1687318420,
          "status": "processed",
          "status_details": null
        }
      ],
      "created_at": 1687318121,
      "updated_at": 1687318421,
      "status": "succeeded",
      "fine_tuned_model": "davinci:ft-personal-2023-06-21-03-33-39"  # New model ID
    }
  ]
}

# 5. Test the new model

root@c1ws-test:~/jupyter_workspace# export FINE_TUNED_MODEL="davinci:ft-personal-2023-06-21-03-33-39"
root@c1ws-test:~/jupyter_workspace# openai api completions.create -m $FINE_TUNED_MODEL -p "When do I have to start the heater?"
When do I have to start the heater?When do I have to turn off the A/C?How low can I


# 6. Analyze the new model

root@c1ws-test:~/jupyter_workspace# openai api fine_tunes.results -i "ft-vs2MUECcW4S92WGroV4xtfuc"
step,elapsed_tokens,elapsed_examples,training_loss,training_sequence_accuracy,training_token_accuracy
1,33,1,1.262732845058199,0.0,0.6
2,66,2,1.6095716556394473,0.0,0.17647058823529413
3,91,3,1.6150656826297443,0.0,0.45454545454545453
4,124,4,1.466446138578467,0.0,0.29411764705882354
5,149,5,1.484918415422241,0.0,0.45454545454545453
6,182,6,1.0398217318346723,0.0,0.55
7,215,7,1.0021467336965726,0.0,0.55
8,240,8,1.3121158500760792,0.0,0.45454545454545453
9,273,9,1.2723229727894068,0.0,0.4117647058823529
10,298,10,1.260173139721155,0.0,0.45454545454545453
11,331,11,1.2538491002004593,0.0,0.47058823529411764
12,364,12,0.9169519251119346,0.0,0.55
13,397,13,1.246762776928954,0.0,0.4117647058823529
14,422,14,1.242958545796573,0.0,0.45454545454545453


# 7. Delete the model
- CLI:
root@c1ws-test:~/jupyter_workspace# openai api models.delete -i "davinci:ft-personal-2023-06-21-03-33-39"
{
"id": "davinci:ft-personal-2023-06-21-03-33-39",
"object": "model",
"deleted": true
}

- Python
openai.Model.delete("davinci:ft-personal-2023-06-21-03-33-39")

- cURL
curl -X "DELETE" https://api.openai.com/v1/models/<FINE_TUNED_MODEL> -H "Authorization: Bearer $OPENAI_API_KEY"
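
The test in step 5 shows the completion running past the answer, because the CLI call set no max_tokens or stop sequence. Below is a minimal Python sketch for querying the same fine-tuned model with constrained output, assuming the openai 0.x SDK and the model ID produced above; the stop sequence is a best-effort guard, since the training completions had no explicit stop token:

import openai

openai.api_key = "YOUR_API_KEY"

response = openai.Completion.create(
    model="davinci:ft-personal-2023-06-21-03-33-39",
    prompt="When do I have to start the heater?",
    max_tokens=30,    # cap the answer length
    temperature=0,    # deterministic output
    stop=["\n"],      # stop at the first newline instead of rambling
)
print(response.choices[0].text.strip())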

What’s the Best Practice: Datasets, Prompts, and Completions

  • Each Prompt Should End With a Fixed Separator
  • Each Completion Should Start With a Whitespace
  • Each Completion Should End With a Fixed-Stop Sequence
  • Fine-Tuning Performs Better With More High-Quality Examples
  • For the dataset, converting the input data into natural language will likely result in better performance. This is even clearer when you are building a generative model.
  • Review Your Data for Offensive Content
  • Review the Type and Structure of Your Dataset
  • Analyze Your Model

The columns in the fine_tunes.results output shown above have the following meanings:

  1. “step”: This column shows the training step number, i.e., the number of iterations of the training process.
  2. “elapsed_tokens”: Shows the number of tokens processed by the training process so far. A token is a unit of text, such as a word or a punctuation mark.
  3. “elapsed_examples”: This is the number of examples (i.e., pieces of text) processed by the training process so far.
  4. “training_loss”: This number shows the value of the loss function during training. (The loss function is a measure of how well the model is performing, with lower values indicating better performance.)
  5. “training_sequence_accuracy”: The accuracy of the model in predicting the next sequence of tokens. (A sequence is a group of tokens that form a meaningful unit, such as a sentence or a paragraph.)
  6. “training_token_accuracy”: This value tells us about the accuracy of the model in predicting individual tokens.
    Some third-party tools, such as wandb, can also be used to analyze the results.
  • Use validation data if needed

openai api fine_tunes.create -t train_data.jsonl -v validation_data.jsonl -m <BASE_MODEL>

  • Tweak the Hyperparameters

    • n_epochs
    • batch_size
    • learning_rate_multiplier
  • Use Ada

    • When tackling classification problems, the Ada model is a good option. Once fine-tuned, it performs only slightly worse than Davinci, while being considerably faster and more affordable.
  • Use Single-Token Classes

    • Saves tokens
    • Much faster
      • Using “sports and entertainment” as the label:
      • {"prompt": "The Los Angeles Lakers won the NBA championship last year.", "completion": " sports and entertainment"}
      • Using 1 (for “sports and entertainment”) as a single-token label:
      • {"prompt": "The Los Angeles Lakers won the NBA championship last year.", "completion": " 1"}
  • Other Considerations for Classification

    • Ensure that the prompt and completion combined do not exceed 2048 tokens, including the separator.
    • Try to provide at least 100 examples for each class.
    • The separator should not appear within the prompt text itself; if it does, preprocess the prompt to remove it. For example, if your separator is !#!, strip any occurrence of !#! from the prompt text, as the sketch below does.
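
Below is a minimal sketch that checks a prepared JSONL dataset against these considerations, assuming tiktoken is installed (pip install tiktoken); the file name and the separator are placeholders for your own values:

import json
import tiktoken

SEPARATOR = "!#!"  # placeholder: use your own separator
enc = tiktoken.get_encoding("r50k_base")  # GPT-3-era tokenizer

with open("train_data.jsonl") as f:
    for i, line in enumerate(f, start=1):
        record = json.loads(line)
        prompt, completion = record["prompt"], record["completion"]
        # 1. Prompt + completion must fit in 2048 tokens, separator included
        if len(enc.encode(prompt + completion)) > 2048:
            print(f"line {i}: exceeds the 2048-token limit")
        # 2. The separator may only appear once, at the very end of the prompt
        if SEPARATOR in prompt[:-len(SEPARATOR)]:
            print(f"line {i}: separator occurs inside the prompt text")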

Advanced Fine Tuning: Drug Classification

Dataset Used in the Example

In this example, we are going to use a public dataset containing drug names and the corresponding malady, illness, or condition that they are used to treat.

We are going to create a model and “teach” it to predict the output based on user input.

The user input is the name of the drug and the output is the name of the malady.
The dataset is available at Kaggle.com; you will need to download it from the following URL:
https://www.kaggle.com/datasets/saratchendra/medicine-recommendation/download?datasetVersionNumber=1

  • Use Sheet1
  • Use the first and second columns: drug_name and reason
    • A CN Gel(Topical) 20gmA CN Soap 75gm ==> Acne
    • PPG Trio 1mg Tablet 10’SPPG Trio 2mg Tablet 10’S ==> Diabetes
    • Iveomezoled 200mg Injection 500ml ==> Fungal

Preparing the Data and Launching the Fine Tuning

  • We are going to use the following format:

    • {"prompt": "Drug: <DRUG NAME>\nMalady:", "completion": " <MALADY ID>"}
  • As you can see, we will be using \nMalady: as the separator.

The completion will also start with a whitespace. Remember to start each completion with a whitespace, due to tokenization (most words are tokenized with a preceding whitespace).
We have also learned that each completion should end with a fixed stop sequence to inform the model when the completion ends, for example \n, ###, END, or any other token that does not appear in the completion.
However, in our case this is not necessary, as we are going to use a single token for classification.
Basically, we are going to give each malady a unique identifier. For example:

Acne: 1
Allergies: 2
Alzheimer: 3
..etc

This way, the model will return a single token at inference time in all cases, which is why a stop sequence is not necessary.

To begin, install openpyxl (needed by Pandas to read the Excel file), then use Pandas to transform the data into the desired format:

pip install openpyxl
import pandas as pd

# read the first n rows
n = 2000

df = pd.read_excel('Medicine_description.xlsx', sheet_name='Sheet1', header=0, nrows=n)

# get the unique values in the Reason column
reasons = df["Reason"].unique()

# assign a number to each reason
reasons_dict = {reason: i for i, reason in enumerate(reasons)}

# turn each drug name into a prompt ending with the "\nMalady:" separator
df["Drug_Name"] = "Drug: " + df["Drug_Name"] + "\n" + "Malady:"

# map each reason to its numeric id, prefixed with a whitespace (the completion)
df["Reason"] = " " + df["Reason"].apply(lambda x: str(reasons_dict[x]))

# drop the Description column, which we don't need
df.drop(["Description"], axis=1, inplace=True)

# rename the columns to prompt/completion
df.rename(columns={"Drug_Name": "prompt", "Reason": "completion"}, inplace=True)

# convert the dataframe to JSONL format (one JSON record per line)
jsonl = df.to_json(orient="records", lines=True)

# write the JSONL to a file
with open("drug_malady_data.jsonl", "w") as f:
    f.write(jsonl)

A snippet of the generated file, drug_malady_data.jsonl:

{"prompt":"Drug: A CN Gel(Topical) 20gmA CN Soap 75gm\nMalady:","completion":" 0"}
{"prompt":"Drug: A Ret 0.05% Gel 20gmA Ret 0.1% Gel 20gmA Ret 0.025% Gel 20gm\nMalady:","completion":" 0"}
{"prompt":"Drug: ACGEL CL NANO Gel 15gm\nMalady:","completion":" 0"}
{"prompt":"Drug: ACGEL NANO Gel 15gm\nMalady:","completion":" 0"}
{"prompt":"Drug: Acleen 1% Lotion 25ml\nMalady:","completion":" 0"}
{"prompt":"Drug: Aclene 0.10% Gel 15gm\nMalady:","completion":" 0"}
{"prompt":"Drug: Acnay Gel 10gm\nMalady:","completion":" 0"}
{"prompt":"Drug: Acne Aid Bar 50gmAcne Aid Bar 100gm\nMalady:","completion":" 0"}
{"prompt":"Drug: Acne UV Gel 60gm\nMalady:","completion":" 0"}
{"prompt":"Drug: Acne UV SPF 30 Gel 30gm\nMalady:","completion":" 0"}
{"prompt":"Drug: Acnecure Gel 20gm\nMalady:","completion":" 0"}
{"prompt":"Drug: Acnedap Gel 15gm\nMalady:","completion":" 0"}
...

Now we need to fine-tune a new model based on the JSONL file. Here are the details:

root@c1ws-test:~/jupyter_workspace# openai tools fine_tunes.prepare_data -f drug_malady_data.jsonl
Analyzing...

- Your file contains 2000 prompt-completion pairs
- Based on your data it seems like you're trying to fine-tune a model for classification
- For classification, we recommend you try one of the faster and cheaper models, such as `ada`
- For classification, you can estimate the expected model performance by keeping a held out dataset, which is not used for training
- All prompts end with suffix `\nMalady:`
- All prompts start with prefix `Drug: `

No remediations found.
- [Recommended] Would you like to split into training and validation set? [Y/n]: Y


Your data will be written to a new JSONL file. Proceed [Y/n]: Y

Wrote modified files to `drug_malady_data_prepared_train.jsonl` and `drug_malady_data_prepared_valid.jsonl`
Feel free to take a look!

Now use that file when fine-tuning:
> openai api fine_tunes.create -t "drug_malady_data_prepared_train.jsonl" -v "drug_malady_data_prepared_valid.jsonl" --compute_classification_metrics --classification_n_classes 7

After you’ve fine-tuned a model, remember that your prompt has to end with the indicator string `\nMalady:` for the model to start generating completions, rather than continuing with the prompt.
Once your model starts training, it'll approximately take 50.33 minutes to train a `curie` model, and less for `ada` and `babbage`. Queue will approximately take half an hour per job ahead of you.
root@c1ws-test:~/jupyter_workspace# openai api fine_tunes.create \
> -t "drug_malady_data_prepared_train.jsonl" \
> -v "drug_malady_data_prepared_valid.jsonl" \
> --compute_classification_metrics \
> --classification_n_classes 7 \
> -m ada \
> --suffix "drug_malady_data"
Error: No API key provided. You can set your API key in code using 'openai.api_key = <API-KEY>', or you can set the environment variable OPENAI_API_KEY=<API-KEY>). If your API key is stored in a file, you can point the openai module at it with 'openai.api_key_path = <PATH>'. You can generate API keys in the OpenAI web interface. See https://platform.openai.com/account/api-keys for details.


root@c1ws-test:~/jupyter_workspace# export OPENAI_API_KEY="<your key>"
root@c1ws-test:~/jupyter_workspace# openai api fine_tunes.create -t "drug_malady_data_prepared_train.jsonl" -v "drug_malady_data_prepared_valid.jsonl" --compute_classification_metrics --classification_n_classes 7 -m ada --suffix "drug_malady_data"
Upload progress: 100%|██████████████████████████████████████████████████████████████████| 128k/128k [00:00<00:00, 187Mit/s]
Uploaded file from drug_malady_data_prepared_train.jsonl: file-0NFPDCuDyX1nfLv3dFuarPDr
Upload progress: 100%|███████████████████████████████████████████████████████████████| 32.0k/32.0k [00:00<00:00, 59.6Mit/s]
Uploaded file from drug_malady_data_prepared_valid.jsonl: file-kxzT0UhZZRwam7RfjYIFcuTF
Created fine-tune: ft-zk8Ojv3J1Nj1HpzRdicC2iPL
Streaming events until fine-tuning is complete...

(Ctrl-C will interrupt the stream, but not cancel the fine-tune)
[2023-06-21 08:54:19] Created fine-tune: ft-zk8Ojv3J1Nj1HpzRdicC2iPL

Stream interrupted (client disconnected).
To resume the stream, run:

openai api fine_tunes.follow -i ft-zk8Ojv3J1Nj1HpzRdicC2iPL

root@c1ws-test:~/jupyter_workspace# openai api fine_tunes.follow -i ft-zk8Ojv3J1Nj1HpzRdicC2iPL
[2023-06-21 08:54:19] Created fine-tune: ft-zk8Ojv3J1Nj1HpzRdicC2iPL
[2023-06-21 08:56:36] Fine-tune costs $0.05
[2023-06-21 08:56:36] Fine-tune enqueued. Queue number: 20


# It took more than 1 hour to complete the model training

root@c1ws-test:~/jupyter_workspace# openai api fine_tunes.follow -i ft-zk8Ojv3J1Nj1HpzRdicC2iPL
[2023-06-21 08:54:19] Created fine-tune: ft-zk8Ojv3J1Nj1HpzRdicC2iPL
[2023-06-21 08:56:36] Fine-tune costs $0.05
[2023-06-21 08:56:36] Fine-tune enqueued. Queue number: 20
[2023-06-21 09:06:35] Fine-tune is in the queue. Queue number: 19
[2023-06-21 09:06:39] Fine-tune is in the queue. Queue number: 18
[2023-06-21 09:07:09] Fine-tune is in the queue. Queue number: 17
[2023-06-21 09:08:12] Fine-tune is in the queue. Queue number: 16
[2023-06-21 09:09:27] Fine-tune is in the queue. Queue number: 15
[2023-06-21 09:10:00] Fine-tune is in the queue. Queue number: 14
[2023-06-21 09:10:55] Fine-tune is in the queue. Queue number: 13
[2023-06-21 09:11:40] Fine-tune is in the queue. Queue number: 12
[2023-06-21 09:13:14] Fine-tune is in the queue. Queue number: 11
[2023-06-21 09:13:48] Fine-tune is in the queue. Queue number: 10
[2023-06-21 09:15:02] Fine-tune is in the queue. Queue number: 9
[2023-06-21 09:15:08] Fine-tune is in the queue. Queue number: 8
[2023-06-21 09:18:55] Fine-tune is in the queue. Queue number: 7
[2023-06-21 09:19:27] Fine-tune is in the queue. Queue number: 6
[2023-06-21 09:30:57] Fine-tune is in the queue. Queue number: 5
[2023-06-21 09:32:08] Fine-tune is in the queue. Queue number: 4
[2023-06-21 09:32:29] Fine-tune is in the queue. Queue number: 3
[2023-06-21 09:33:03] Fine-tune is in the queue. Queue number: 2
[2023-06-21 09:33:54] Fine-tune is in the queue. Queue number: 1
[2023-06-21 09:37:01] Fine-tune is in the queue. Queue number: 0
[2023-06-21 09:37:21] Fine-tune started
[2023-06-21 09:42:31] Completed epoch 1/4
[2023-06-21 09:52:33] Completed epoch 3/4
[2023-06-21 09:57:33] Completed epoch 4/4
[2023-06-21 09:58:06] Uploaded model: ada:ft-personal:drug-malady-data-2023-06-21-09-58-05
[2023-06-21 09:58:07] Uploaded result file: file-tzqW6Vjcqg9thdkCBzgihH8M
[2023-06-21 09:58:07] Fine-tune succeeded

Job complete! Status: succeeded 🎉
Try out your fine-tuned model:

openai api completions.create -m ada:ft-personal:drug-malady-data-2023-06-21-09-58-05 -p <YOUR_PROMPT>


# List the generated new models
root@c1ws-test:~/jupyter_workspace# openai api fine_tunes.list
{
  "object": "list",
  "data": [
    {
      "object": "fine-tune",
      "id": "ft-zk8Ojv3J1Nj1HpzRdicC2iPL",
      "hyperparams": {
        "n_epochs": 4,
        "batch_size": 2,
        "prompt_loss_weight": 0.01,
        "classification_n_classes": 7,
        "learning_rate_multiplier": 0.1,
        "compute_classification_metrics": true
      },
      "organization_id": "org-1kpSRIJZFLe1LzYA6NIMCm20",
      "model": "ada",
      "training_files": [
        {
          "object": "file",
          "id": "file-0NFPDCuDyX1nfLv3dFuarPDr",
          "purpose": "fine-tune",
          "filename": "drug_malady_data_prepared_train.jsonl",
          "bytes": 128249,
          "created_at": 1687337657,
          "status": "processed",
          "status_details": null
        }
      ],
      "validation_files": [
        {
          "object": "file",
          "id": "file-kxzT0UhZZRwam7RfjYIFcuTF",
          "purpose": "fine-tune",
          "filename": "drug_malady_data_prepared_valid.jsonl",
          "bytes": 32007,
          "created_at": 1687337659,
          "status": "processed",
          "status_details": null
        }
      ],
      "result_files": [
        {
          "object": "file",
          "id": "file-tzqW6Vjcqg9thdkCBzgihH8M",
          "purpose": "fine-tune-results",
          "filename": "compiled_results.csv",
          "bytes": 168489,
          "created_at": 1687341486,
          "status": "processed",
          "status_details": null
        }
      ],
      "created_at": 1687337659,
      "updated_at": 1687341487,
      "status": "succeeded",
      "fine_tuned_model": "ada:ft-personal:drug-malady-data-2023-06-21-09-58-05"  ---> fine-tuned model ID
    }
  ]
}


# A simple test of the new model
root@c1ws-test:~/jupyter_workspace# openai api completions.create -m ada:ft-personal:drug-malady-data-2023-06-21-09-58-05 -p "What is 'A CN Gel(Topical) 20gmA CN Soap 75gm' used for?"
What is 'A CN Gel(Topical) 20gmA CN Soap 75gm' used for? 0.0% Gel(Topical) 60gm
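
Notice that the completion simply runs on: this CLI call set no max_tokens or stop sequence, so the model keeps generating after the label. The Python test in the next section constrains the output to a single token with max_tokens=1.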

Testing the Fine Tuned Model

When the model is ready, you can test it with the following code:

import os
import openai

def init_api():
    with open('.env') as env:
        for line in env:
            key, value = line.strip().split('=')
            os.environ[key] = value

    openai.api_key = os.environ.get("API_KEY")

init_api()

# Configure the model ID. Change this to your model ID.
model = "ada:ft-personal:drug-malady-data-2023-06-21-09-58-05"

# Let's use a drug from each class
drugs = [
    "A CN Gel(Topical) 20gmA CN Soap 75gm",  # Class 0
    "Addnok Tablet 20'S",                    # Class 1
    "ABICET M Tablet 10's",                  # Class 2
]

class_map = {
    0: "Acne",
    1: "Adhd",
    2: "Allergies",
    # ...
}

# Print a drug class for each drug
for drug_name in drugs:
    prompt = "Drug: {}\nMalady:".format(drug_name)
    response = openai.Completion.create(
        model=model,
        prompt=prompt,
        temperature=1,
        max_tokens=1,
    )
    response = response.choices[0].text
    try:
        print(drug_name + " is used for " + class_map[int(response)])
    except (ValueError, KeyError):
        print("I don't know what " + drug_name + " is used for.")
    print()
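
A note on the parameters: each class was encoded as a single token, so max_tokens=1 is enough to capture the label. For classification you may also prefer temperature=0, so that the model deterministically returns its most likely class instead of sampling.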

Advanced Fine Tuning: Creating a Chatbot Assistant

Building on the model fine-tuned previously, the goal of this chapter is to create a chatbot that handles clarification and interacts in a more human-friendly way.

Interactive Classification

Three functions are defined:

  • regular_discussion()
  • get_malady_name()
  • get_malady_description()

When the end user asks about a drug name, they will get the malady (from the fine-tuned model) and its description (from Davinci).

import os
import openai

def init_api():
    with open('.env') as env:
        for line in env:
            key, value = line.strip().split('=')
            os.environ[key] = value

    openai.api_key = os.environ.get("API_KEY")

init_api()

def regular_discussion(user_input):
    """
    params: user_input - a string
    Returns a response from the API using Davinci.
    If the user asks about a drug, the function will call get_malady_name().
    """

    prompt = """
The following is a conversation with an AI assistant. The assistant is helpful, creative, clever, very friendly, and careful with Human's health topics.
The AI assistant is not a doctor and does not diagnose or treat Human's medical conditions.
The AI assistant is not a pharmacist and does not dispense or recommend medications to Human.
The AI assistant does not provide medical advice to Human.
The AI assistant does not provide medical or health diagnoses to Human.
The AI assistant does not provide medical treatment to Human.
The AI assistant does not provide medical prescriptions to Human.
If Human writes the name of a drug, the assistant will reply with "######".

Human: Hi
AI: Hello Human. How are you? I'll be glad to help. Give me the name of a drug and I'll tell you what it's used for.
Human: Vitibex
AI: ######
Human: I'm fine. How are you?
AI: I am fine. Thank you for asking. I'll be glad to help. Give me the name of a drug and I'll tell you what it's used for.
Human: What is Chaos Engineering?
AI: I'm sorry, I am not qualified to do that. I'm only programmed to answer questions about drugs. Give me the name of a drug and I'll tell you what it's used for.
Human: Where is Carthage?
AI: I'm sorry, I am not qualified to do that. I'm only programmed to answer questions about drugs. Give me the name of a drug and I'll tell you what it's used for.
Human: What is Maxcet 5mg Tablet 10'S?
AI: ######
Human: What is Axepta?
AI: ######
Human: {}
AI:""".format(user_input)

    # Get the response from the API
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        max_tokens=100,
        stop=['\n', 'Human:', 'AI:'],
    )

    if response.choices[0].text.strip() == "######":
        # Pass the user's original input (the drug name), not the full template
        get_malady_name(user_input)
    else:
        final_response = response.choices[0].text.strip() + '\n'
        print("AI: {}".format(final_response))

def get_malady_name(drug_name):
    """
    params: drug_name - a string
    Returns the malady name that corresponds to a drug name, using the fine-tuned model.
    The function calls get_malady_description() to get a description of the malady.
    """

    # Configure the model ID. Change this to your model ID.
    model = "ada:ft-personal:drug-malady-data-2023-06-21-09-58-05"
    class_map = {
        0: "Acne",
        1: "Adhd",
        2: "Allergies",
        # ...
    }

    # Ask the fine-tuned model for the drug's class (a single token)
    prompt = "Drug: {}\nMalady:".format(drug_name)
    response = openai.Completion.create(
        model=model,
        prompt=prompt,
        temperature=1,
        max_tokens=1,
    )

    response = response.choices[0].text.strip()
    try:
        malady = class_map[int(response)]
        print("AI: This drug is used for {}.".format(malady))
        print(get_malady_description(malady))
    except (ValueError, KeyError):
        print("AI: I don't know what '" + drug_name + "' is used for.")

def get_malady_description(malady):
    """
    params: malady - a string
    Get a description of a malady from the API using Davinci.
    """

    prompt = """
The following is a conversation with an AI assistant. The assistant is helpful, creative, clever, and very friendly.
The assistant does not provide medical advice. It only defines a malady, a disease, or a condition.
If the assistant does not know the answer to a question, it will ask to rephrase it.

Q: What is {}?
A:""".format(malady)

    # Get the response from the API
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        max_tokens=100,
        stop=['\n', 'Q:', 'A:'],
    )
    return response.choices[0].text.strip()

if __name__ == '__main__':
    while True:
        regular_discussion(input("Human:"))
Human:What is Fantanyl
AI: This drug used for Allergies.
Allergies are a common condition caused by an overly sensitive immune system. Symptoms usually include sneezing, runny nose, itchy eyes, and skin rash. Allergies can be triggered by something in the environment, such as pollen, pet hair, dust, or certain foods.



(The loop runs until it is interrupted with Ctrl-C.)

Intelligent Speech Recognition Using OpenAI Whisper

