diff --git a/pages/applications/_meta.en.json b/pages/applications/_meta.en.json
index e34b985..8c23b98 100644
--- a/pages/applications/_meta.en.json
+++ b/pages/applications/_meta.en.json
@@ -1,4 +1,5 @@
 {
   "pal": "Program-Aided Language Models",
-  "generating": "Generating Data"
-}
\ No newline at end of file
+  "generating": "Generating Data",
+  "workplace_casestudy": "Graduate Job Classification Case Study"
+}
diff --git a/pages/applications/workplace_casestudy.en.mdx b/pages/applications/workplace_casestudy.en.mdx
new file mode 100644
index 0000000..d5297af
--- /dev/null
+++ b/pages/applications/workplace_casestudy.en.mdx
@@ -0,0 +1,56 @@
+# Graduate Job Classification Case Study
+
+[Clavié et al., 2023](https://arxiv.org/abs/2303.07142) provide a case study on prompt engineering applied to a medium-scale text classification use case in a production system. Using the task of classifying whether a job is a true "entry-level job", suitable for a recent graduate, or not, they evaluated a series of prompt engineering techniques and report their results using GPT-3.5 (`gpt-3.5-turbo`).
+
+The work shows that LLMs outperform all other models tested, including an extremely strong baseline in DeBERTa-V3. `gpt-3.5-turbo` also noticeably outperforms older GPT-3 variants on all key metrics, but requires additional output parsing, as its ability to stick to a template appears to be worse than that of the other variants.
+
+The key findings of their prompt engineering approach are:
+
+- For tasks such as this one, where no expert knowledge is required, Few-shot CoT prompting performed worse than Zero-shot prompting in all experiments.
+- The impact of the prompt on eliciting the correct reasoning is massive. Simply asking the model to classify a given job results in an F1 score of 65.6, whereas the model after prompt engineering achieves an F1 score of 91.7.
+- Attempting to force the model to stick to a template lowers performance in all cases (this behaviour disappears in early testing with GPT-4, which postdates the paper).
+- Many small modifications have an outsized impact on performance.
+  - The tables below show the full set of modifications tested.
+  - Properly giving instructions and repeating the key points appears to be the biggest performance driver.
+  - Something as simple as giving the model a (human) name and referring to it as such increased the F1 score by 0.6 points.
+
+### Prompt Modifications Tested
+
+| Short name | Description                                                                 |
+|------------|-----------------------------------------------------------------------------|
+| Baseline   | Provide a job posting and ask if it is fit for a graduate.                   |
+| CoT        | Give a few examples of accurate classification before querying.              |
+| Zero-CoT   | Ask the model to reason step-by-step before providing its answer.            |
+| rawinst    | Give instructions about its role and the task by adding to the user msg.     |
+| sysinst    | Give instructions about its role and the task as a system msg.               |
+| bothinst   | Split instructions with role as a system msg and task as a user msg.         |
+| mock       | Give task instructions by mocking a discussion where it acknowledges them.   |
+| reit       | Reinforce key elements in the instructions by repeating them.                |
+| strict     | Ask the model to answer by strictly following a given template.              |
+| loose      | Ask for just the final answer to be given following a given template.        |
+| right      | Ask the model to reach the right conclusion.                                 |
+| info       | Provide additional information to address common reasoning failures.         |
+| name       | Give the model a name by which we refer to it in conversation.               |
+| pos        | Provide the model with positive feedback before querying it.                 |
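+
+To make the short names above concrete, below is a minimal sketch of how several of these modifications (bothinst, mock, reit, name, and pos) could be combined into a single request to `gpt-3.5-turbo`, written against the pre-1.0 `openai` Python client. The wording of the instructions, the model's name, and the YES/NO answer template are illustrative assumptions for this sketch, not the exact prompts used in the paper.
+
+```python
+# Sketch: role as a system msg (bothinst), task as a user msg with the key
+# criterion repeated (reit), a mocked acknowledgement (mock), a model name
+# (name), and positive feedback before the query (pos).
+# Assumes OPENAI_API_KEY is set in the environment; uses the openai<1.0 API.
+import openai
+
+job_posting = "..."  # the job posting text to classify
+
+messages = [
+    # bothinst + name: the role goes in the system message, and the model
+    # is given a (human) name it will be addressed by.
+    {"role": "system", "content": "You are Taylor, an AI expert in recruitment."},
+    # bothinst + reit: the task goes in a user message, with the key
+    # element of the instructions repeated.
+    {"role": "user", "content": (
+        "Taylor, your task is to decide whether a job posting is a true "
+        "entry-level job, suitable for a recent graduate. Remember: it only "
+        "qualifies if someone with no prior professional experience could do it. "
+        "Answer only with YES or NO."
+    )},
+    # mock: a mocked exchange in which the model acknowledges the instructions.
+    {"role": "assistant", "content": "Understood. I will answer only with YES or NO."},
+    # pos: positive feedback before the actual query.
+    {"role": "user", "content": (
+        "Great, thank you! Here is the job posting:\n\n"
+        f"{job_posting}\n\nIs this a true entry-level job?"
+    )},
+]
+
+response = openai.ChatCompletion.create(
+    model="gpt-3.5-turbo",
+    temperature=0,
+    messages=messages,
+)
+print(response["choices"][0]["message"]["content"])
+```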
+
+
+### Performance Impact of All Prompt Modifications
+
+|                                          | Precision | Recall | F1       | Template Stickiness |
+|------------------------------------------|-----------|--------|----------|----------------------|
+| _Baseline_                               | _61.2_    | _70.6_ | _65.6_   | _79%_                |
+| _CoT_                                    | _72.6_    | _85.1_ | _78.4_   | _87%_                |
+| _Zero-CoT_                               | _75.5_    | _88.3_ | _81.4_   | _65%_                |
+| _+rawinst_                               | _80_      | _92.4_ | _85.8_   | _68%_                |
+| _+sysinst_                               | _77.7_    | _90.9_ | _83.8_   | _69%_                |
+| _+bothinst_                              | _81.9_    | _93.9_ | _87.5_   | _71%_                |
+| +bothinst+mock                           | 83.3      | 95.1   | 88.8     | 74%                  |
+| +bothinst+mock+reit                      | 83.8      | 95.5   | 89.3     | 75%                  |
+| _+bothinst+mock+reit+strict_             | _79.9_    | _93.7_ | _86.3_   | _**98%**_            |
+| _+bothinst+mock+reit+loose_              | _80.5_    | _94.8_ | _87.1_   | _95%_                |
+| +bothinst+mock+reit+right                | 84        | 95.9   | 89.6     | 77%                  |
+| +bothinst+mock+reit+right+info           | 84.9      | 96.5   | 90.3     | 77%                  |
+| +bothinst+mock+reit+right+info+name      | 85.7      | 96.8   | 90.9     | 79%                  |
+| +bothinst+mock+reit+right+info+name+pos  | **86.9**  | **97** | **91.7** | 81%                  |
+
+Template stickiness refers to how frequently the model answers in the desired format.
diff --git a/pages/papers.en.mdx b/pages/papers.en.mdx
index 18fc9ab..06fdb35 100644
--- a/pages/papers.en.mdx
+++ b/pages/papers.en.mdx
@@ -133,6 +133,7 @@ The following are the latest papers (sorted by release date) on prompt engineeri
 - [Large Language Models and Simple, Stupid Bugs](https://arxiv.org/abs/2303.11455) (March 2023)
 - [Can Generative Pre-trained Transformers (GPT) Pass Assessments in Higher Education Programming Courses?](https://arxiv.org/abs/2303.09325) (Mar 2023)
 - [SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models](https://arxiv.org/abs/2303.08896) (Mar 2023)
+- [Large Language Models in the Workplace: A Case Study on Prompt Engineering for Job Type Classification](https://arxiv.org/abs/2303.07142) (March 2023)
 - [ICL-D3IE: In-Context Learning with Diverse Demonstrations Updating for Document Information Extraction](https://arxiv.org/abs/2303.05063) (March 2023)
 - [MathPrompter: Mathematical Reasoning using Large Language Models](https://arxiv.org/abs/2303.05398) (March 2023)
 - [Prompt-Based Learning for Thread Structure Prediction in Cybersecurity Forums](https://arxiv.org/abs/2303.05400) (March 2023)
@@ -170,4 +171,4 @@ The following are the latest papers (sorted by release date) on prompt engineeri
 - [Chain-of-Thought Papers](https://github.com/Timothyxxx/Chain-of-ThoughtsPapers)
 - [Papers with Code](https://paperswithcode.com/task/prompt-engineering)
-- [Prompt Papers](https://github.com/thunlp/PromptPapers#papers)
\ No newline at end of file
+- [Prompt Papers](https://github.com/thunlp/PromptPapers#papers)