added more notes

2023-03-01 03:14:01 -06:00 · 2023-03-01 03:14:01 -06:00 · e69472061d
parent 19f5e680af
commit e69472061d
7 changed files with 227 additions and 8 deletions
--- a/README.md
+++ b/README.md
@ -47,6 +47,7 @@ The following are a set of guides on prompt engineering developed by us. Guides
 - [Prompt Engineering - Advanced Prompting](/guides/prompts-advanced-usage.md)
 - [Prompt Engineering - Applications](/guides/prompts-applications.md)
 - [Prompt Engineering - Adversarial Prompting](/guides/prompt-adversarial.md)
 - [Prompt Engineering - Reliability](/guides/prompts-reliability.md) 
 - [Prompt Engineering - Miscellaneous Topics](/guides/prompt-miscellaneous.md)
 ---
--- a/guides/README.md
+++ b/guides/README.md
@ -6,4 +6,5 @@ The following are a set of guides on prompt engineering developed by us (DAIR.AI
 - [Prompt Engineering - Advanced Prompting](/guides/prompts-advanced-usage.md)
 - [Prompt Engineering - Applications](/guides/prompts-applications.md)
 - [Prompt Engineering - Adversarial Prompting](/guides/prompt-adversarial.md)
 - [Prompt Engineering - Reliability](/guides/prompts-reliability.md)
 - [Prompt Engineering - Miscellaneous Topics](/guides/prompt-miscellaneous.md)
--- a/guides/prompt-miscellaneous.md
+++ b/guides/prompt-miscellaneous.md
@ -2,7 +2,7 @@
 In this section, we discuss other miscellaneous and uncategorized topics in prompt engineering. It includes relatively new ideas and approaches that will eventually be moved into the main guides as they become more widely adopted. This section of the guide is also useful to keep up with the latest research papers on prompt engineering.
-**Note that this section is under heavy construction.**
+**Note that this section is under heavy development.**
 Topic:
 - [Active Prompt](#active-prompt)
@ -11,6 +11,7 @@ Topic:
 - [ReAct](#react)
 - [Multimodal CoT Prompting](#multimodal-prompting)
 - [GraphPrompts](#graphprompts)
 - ...
 ---
--- a/guides/prompts-advanced-usage.md
+++ b/guides/prompts-advanced-usage.md
@ -81,9 +81,8 @@ That didn't work. It seems like basic standard prompting is not enough to get re
 Following the findings from [Min et al. (2022)](https://arxiv.org/abs/2202.12837), here are a few more tips about demonstrations/exemplars when doing few-shot:
- the label space and the distribution of the input text specified by the demonstrations are both key (regardless of whether the labels are correct
+- "the label space and the distribution of the input text specified by the demonstrations are both key (regardless of whether the labels are correct for individual inputs)"
-for individual inputs).
+- the format you use also plays a key role in performance, even if you just use random labels, this is much better than no labels at all.  
 - the format you use also plays a key role in performance; Even if you just use random labels, this is much better than no labels at all.  
 - additional results show that selecting random labels from a true distribution of labels (instead of a uniform distribution) also helps.
 Let's try out a few examples. Let's first try an example with random labels (meaning the labels Negative and Positive are randomly assigned to the inputs):
--- a/guides/prompts-applications.md
+++ b/guides/prompts-applications.md
@ -1,10 +1,59 @@
 # Prompt Applications
-In this guide we will cover some advanced ways we can use prompt engineering to solve more advanced tasks. 
+In this guide we will cover some advanced and interesting ways we can use prompt engineering to perform useful and more advanced tasks. 
 **Note that this section is under heavy development.**
 Topics:
- [PAL (Program-Aided Language Models): Code as Reasoning](#pal-program-aided-language-models-code-as-reasoning)
+- [Generating Data](#generating-data)
- More coming soon!
+- [Code as Reasoning](#pal-program-aided-language-models-code-as-reasoning)
 - ...
 ---
 ## Generating Data
 LLMs have strong capabilities to generate text. Using effective prompt strategies can steer the model to produce better, more consistent, and more factual responses. LLMs can also be especially useful for generating data, which helps when running all sorts of experiments. For example, we can use them to generate quick samples for a sentiment classifier like so:
 *Prompt:*
 ```
 Produce 10 exemplars for sentiment analysis. Examples are categorized as either positive or negative. Produce 2 negative examples and 8 positive examples. Use this format for the examples:
 Q: <sentence>
 A: <sentiment>
 ```
 *Output:*
 ```
 Q: I just got the best news ever!
 A: Positive
 Q: We just got a raise at work!
 A: Positive
 Q: I'm so proud of what I accomplished today.
 A: Positive
 Q: I'm having the best day ever!
 A: Positive
 Q: I'm really looking forward to the weekend.
 A: Positive
 Q: I just got the best present ever!
 A: Positive
 Q: I'm so happy right now.
 A: Positive
 Q: I'm so blessed to have such an amazing family.
 A: Positive
 Q: The weather outside is so gloomy.
 A: Negative
 Q: I just got some terrible news.
 A: Negative
 ```
 This is very useful. We actually use this example for a different test in another section of the guides.
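To reuse generated samples like these in an experiment, the raw text has to be turned into labeled pairs. The following is a minimal sketch, assuming the model follows the `Q:`/`A:` format requested in the prompt; the function name `parse_qa_examples` is hypothetical and not part of the guide.

```python
from typing import List, Tuple


def parse_qa_examples(text: str) -> List[Tuple[str, str]]:
    """Parse 'Q: <sentence>' / 'A: <sentiment>' lines into (sentence, label) pairs."""
    examples: List[Tuple[str, str]] = []
    question = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith("Q:"):
            # Remember the sentence until its answer line shows up.
            question = line[2:].strip()
        elif line.startswith("A:") and question is not None:
            examples.append((question, line[2:].strip()))
            question = None
    return examples


# Two of the generated samples from the output above.
sample_output = """
Q: I just got the best news ever!
A: Positive
Q: The weather outside is so gloomy.
A: Negative
"""

pairs = parse_qa_examples(sample_output)
print(pairs)
```

Lines that do not start with `Q:` or `A:` are ignored, so stray blank lines or commentary in the model output do not break the parse.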
 ---
--- a/guides/prompts-intro.md
+++ b/guides/prompts-intro.md
@ -158,7 +158,7 @@ Not all the components are required for a prompt and the format depends on the t
 Here are some tips to keep in mind while you are designing your prompts:
-## Start Simple
+### Start Simple
 As you get started with designing prompts, you should keep in mind that it is really an iterative process that requires a lot of experimentation to get optimal results. Using a simple playground like OpenAI's or Cohere's is a good starting point. 
 You can start with simple prompts and keep adding more elements and context as you aim for better results. Versioning your prompt along the way is vital for this reason. As you read the guide you will see many examples where specificity, simplicity, and conciseness will often give you better results.
--- a/guides/prompts-reliablity.md
+++ b/guides/prompts-reliablity.md
@ -0,0 +1,168 @@
 ## Reliability
 We have seen already how effective well-crafted prompts can be for various tasks using techniques like few-shot learning. As we think about building real-world applications on top of LLMs, it becomes crucial to think about the reliability of these language models. This guide focuses on demonstrating effective prompting techniques to improve the reliability of LLMs like GPT-3. Some topics of interest include generalizability, calibration, biases, social biases, and factuality to name a few.
 **Note that this section is under heavy development.**
 Topics:
 - [Biases](#biases)
 - [Factuality](#factuality)
 - ...
 ---
 ## Biases
 LLMs can produce problematic generations that can potentially be harmful and display biases that could deteriorate the performance of the model on downstream tasks. Some of these can be mitigated through effective prompting strategies, but others might require more advanced solutions like moderation and filtering. 
 ### Distribution
 When performing few-shot learning, does the distribution of the exemplars affect the performance of the model or bias the model in some way? We can perform a simple test here.
 *Prompt:*
 ```
 Q: I just got the best news ever!
 A: Positive
 Q: We just got a raise at work!
 A: Positive
 Q: I'm so proud of what I accomplished today.
 A: Positive
 Q: I'm having the best day ever!
 A: Positive
 Q: I'm really looking forward to the weekend.
 A: Positive
 Q: I just got the best present ever!
 A: Positive
 Q: I'm so happy right now.
 A: Positive
 Q: I'm so blessed to have such an amazing family.
 A: Positive
 Q: The weather outside is so gloomy.
 A: Negative
 Q: I just got some terrible news.
 A: Negative
 Q: That left a sour taste.
 A:
 ```
 *Output:*
 ```
 Negative
 ```
 In the example above, it seems that the distribution of exemplars doesn't bias the model. This is good. Let's try another example with a harder text to classify and let's see how the model does:
 *Prompt:*
 ```
 Q: The food here is delicious!
 A: Positive 
 Q: I'm so tired of this coursework.
 A: Negative
 Q: I can't believe I failed the exam.
 A: Negative
 Q: I had a great day today!
 A: Positive 
 Q: I hate this job.
 A: Negative
 Q: The service here is terrible.
 A: Negative
 Q: I'm so frustrated with my life.
 A: Negative
 Q: I never get a break.
 A: Negative
 Q: This meal tastes awful.
 A: Negative
 Q: I can't stand my boss.
 A: Negative
 Q: I feel something.
 A:
 ```
 *Output:*
 ```
 Negative
 ```
 While that last sentence is somewhat subjective, I flipped the distribution and instead used 8 positive examples and 2 negative examples and then tried the exact same sentence again. Guess what the model responded? It responded "Positive". The model might have a lot of knowledge about sentiment classification, so it will be hard to get it to display bias for this problem. The advice here is to avoid skewing the distribution and instead provide a more balanced number of examples for each label. For harder tasks that the model doesn't have much knowledge of, it will likely struggle more. 
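The advice about balancing labels can be checked mechanically before a prompt is sent. The sketch below is only an illustration: the helpers `label_distribution` and `balance_exemplars` are hypothetical names, and trimming every label down to the size of the rarest one is just one possible balancing strategy.

```python
from collections import Counter
from typing import Dict, List, Tuple

Exemplar = Tuple[str, str]  # (sentence, label)


def label_distribution(exemplars: List[Exemplar]) -> Dict[str, int]:
    """Count how many exemplars carry each label."""
    return dict(Counter(label for _, label in exemplars))


def balance_exemplars(exemplars: List[Exemplar]) -> List[Exemplar]:
    """Keep the same number of exemplars per label, preserving the original order."""
    counts = Counter(label for _, label in exemplars)
    if not counts:
        return []
    per_label = min(counts.values())  # size of the rarest label
    kept: Counter = Counter()
    balanced: List[Exemplar] = []
    for sentence, label in exemplars:
        if kept[label] < per_label:
            balanced.append((sentence, label))
            kept[label] += 1
    return balanced


# A skewed subset of the exemplars from the prompt above.
skewed = [
    ("The food here is delicious!", "Positive"),
    ("I'm so tired of this coursework.", "Negative"),
    ("I can't believe I failed the exam.", "Negative"),
    ("I had a great day today!", "Positive"),
    ("I hate this job.", "Negative"),
    ("The service here is terrible.", "Negative"),
]

print(label_distribution(skewed))
print(label_distribution(balance_exemplars(skewed)))
```

Dropping exemplars loses information, so adding more examples of the rare label is an equally valid way to even out the distribution.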
 ### Order
 When performing few-shot learning, does the order affect the performance of the model or bias the model in some way?
 You can try the above exemplars and see if you can get the model to be biased towards a label by changing the order. The advice is to randomly order exemplars. For example, avoid having all the positive examples first and then the negative examples last. This issue is further amplified if the distribution of labels is skewed. Always experiment a lot to reduce this type of bias.
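Random ordering is easy to apply when the prompt is assembled in code. Here is a minimal sketch, assuming exemplars are stored as (sentence, label) pairs and using the same `Q:`/`A:` layout as the prompts above; `build_few_shot_prompt` and its `seed` parameter are hypothetical names chosen for illustration.

```python
import random
from typing import List, Optional, Tuple


def build_few_shot_prompt(
    exemplars: List[Tuple[str, str]],
    query: str,
    seed: Optional[int] = None,
) -> str:
    """Shuffle the exemplars, then format them followed by the unanswered query."""
    rng = random.Random(seed)  # a fixed seed makes the ordering reproducible
    shuffled = list(exemplars)  # copy so the caller's list is left untouched
    rng.shuffle(shuffled)

    lines: List[str] = []
    for sentence, label in shuffled:
        lines.append(f"Q: {sentence}")
        lines.append(f"A: {label}")
    lines.append(f"Q: {query}")
    lines.append("A:")
    return "\n".join(lines)


exemplars = [
    ("The food here is delicious!", "Positive"),
    ("I had a great day today!", "Positive"),
    ("I hate this job.", "Negative"),
    ("The service here is terrible.", "Negative"),
]

print(build_few_shot_prompt(exemplars, "I feel something.", seed=0))
```

Running the same query with several different seeds is one way to see whether the predicted label depends on exemplar order.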
 ---
 ## Factuality
 LLMs have a tendency to generate responses that sound coherent and convincing but can sometimes be made up. Improving prompts can help the model generate more accurate/factual responses and reduce the likelihood of generating inconsistent and made-up responses. 
 Some solutions might include:
 - provide ground truth (e.g., related article paragraph or Wikipedia entry) as part of context to reduce the likelihood of the model producing made up text.
 - configure the model to produce less diverse responses by decreasing the probability parameters and instructing it to admit when it doesn't know the answer (e.g., by responding "I don't know"). 
 - provide in the prompt a combination of examples of questions and responses that it might know about and not know about
 Let's look at a simple example:
 *Prompt:*
 ```
 Q: What is an atom? 
 A: An atom is a tiny particle that makes up everything. 
 Q: Who is Alvan Muntz? 
 A: ? 
 Q: What is Kozar-09? 
 A: ? 
 Q: How many moons does Mars have? 
 A: Two, Phobos and Deimos. 
 Q: Who is Neto Beto Roberto? 
 ```
 *Output:*
 ```
 A: ?
 ```
 I made up the name "Neto Beto Roberto" so the model is correct in this instance. Try to change the question a bit and see if you can get it to work. There are different ways you can improve this further based on all that you have learned so far.
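A prompt like the one above can also be assembled programmatically, which makes it easier to vary the mix of known and unknown questions. The sketch below is one possible way to do it, not the guide's method: `build_factual_prompt`, the `Context:` prefix, and the instruction wording are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

UNKNOWN_MARKER = "?"  # the answer used for questions the model should not know


def build_factual_prompt(
    question: str,
    exemplars: List[Tuple[str, Optional[str]]],
    context: Optional[str] = None,
) -> str:
    """Format Q/A exemplars, using the unknown marker where no answer is given."""
    lines: List[str] = []
    if context:
        # Optional ground-truth passage placed ahead of the exemplars.
        lines.append(f"Context: {context}")
    lines.append(
        f'Answer the question. If you do not know the answer, reply "{UNKNOWN_MARKER}".'
    )
    for q, a in exemplars:
        lines.append(f"Q: {q}")
        lines.append(f"A: {a if a is not None else UNKNOWN_MARKER}")
    lines.append(f"Q: {question}")
    lines.append("A:")
    return "\n".join(lines)


# Same exemplars as the prompt above; None marks a made-up entity.
exemplars = [
    ("What is an atom?", "An atom is a tiny particle that makes up everything."),
    ("Who is Alvan Muntz?", None),
    ("What is Kozar-09?", None),
    ("How many moons does Mars have?", "Two, Phobos and Deimos."),
]

print(build_factual_prompt("Who is Neto Beto Roberto?", exemplars))
```

Keeping known and unknown questions interleaved, as in the original prompt, avoids teaching the model that the unknown marker always comes last.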
 ---
 Other potential topics:
 - Perturbations
 - Spurious Correlation
 - Domain Shift
 - Toxicity
 - Stereotypical bias 
 - Gender bias
 - Coming soon!
 - Red Teaming
 More coming soon!
 ---
 ## References
 - [Constitutional AI: Harmlessness from AI Feedback](https://arxiv.org/abs/2212.08073) (Dec 2022)
 - [Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?](https://arxiv.org/abs/2202.12837) (Oct 2022)
 - [Prompting GPT-3 To Be Reliable](https://arxiv.org/abs/2210.09150) (Oct 2022)
 - [On the Advance of Making Language Models Better Reasoners](https://arxiv.org/abs/2206.02336) (Jun 2022)
 - [Unsolved Problems in ML Safety](https://arxiv.org/abs/2109.13916) (Sep 2021)
 - [Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned](https://arxiv.org/abs/2209.07858) (Aug 2022)
 - [StereoSet: Measuring stereotypical bias in pretrained language models](https://aclanthology.org/2021.acl-long.416/) (Aug 2021)
 - [Calibrate Before Use: Improving Few-Shot Performance of Language Models](https://arxiv.org/abs/2102.09690v2) (Feb 2021)