Beyond Chatbots — implications of Generative AI for product builders

This is from a 6-page memo I wrote in March 2023 on the disruption potential of Generative AI. I’m “open sourcing” it here as it still seems relevant to product builders!

LLMs will unleash a disruption wave that enables machines to do high-value work that previously only humans could do. While LLMs enable this wave of disruption, they serve as but one component within the large-scale, rapidly evolving iterative systems required to power high-value use-cases. Large-scale task completion such as code refactoring will offer very high value but take longer to come to fruition. We believe it will be possible to improve business outcomes with legacy systems and reduce technology costs sooner. In this document, we present the unique capabilities LLMs offer and the systems that will be needed to make them practical, adaptive, and trustworthy. We also look into challenges and limitations to consider.

1.     Layers of models

LLM performance has leaped over the past 18 months. It is helpful to think of these models as falling into three layers based on their training and tuning approach, each corresponding to a performance jump: foundation models, instruction-tuned models, and models fine-tuned with reinforcement learning with humans in the loop (RLHF). Foundation models involve a very compute-intensive training process to construct the base model so it can ‘learn’ high-level features of the data set. Examples include GPT-3 and the open source models T5, OPT, and BLOOM. The second layer, instruction-tuned models, are base models fine-tuned with datasets expressed as instructions (detailed in 2.3), which improves model performance and the ability to handle unseen tasks. Examples include davinci-instruct-beta, Flan-T5 [1], OPT-IML, and BLOOMZ. The third layer is composed of instruction-tuned models that are further trained with reinforcement learning [2] from human feedback (RLHF [3]) or, in the case of Anthropic models, reinforcement learning from AI feedback (RLAIF [4]). Currently there are no open source models in this category. ChatGPT (GPT-3.5 or text-davinci-003) and Claude from Anthropic are in this category. For the most part, we refer to all of these layers simply as LLMs in this document. All of these models have been “pretrained” on a huge corpus of data (a terabyte or more of text). Pretraining is very compute intensive and expensive; fine-tuning and reinforcement learning are much less so. For example, it took Meta 2,048 A100 GPUs over 21 days to train LLaMA on 1.4 trillion tokens. In contrast, fine-tuning can be done on a single large VM over a few days.
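
To make the second layer concrete, here is a minimal sketch of what “datasets expressed as instructions” means; the field names are illustrative, not drawn from any specific dataset:

```python
# A foundation model is pretrained on raw text like this:
raw_pretraining_text = "The Eiffel Tower, completed in 1889, is 330 metres tall."

# Instruction tuning reframes tasks as explicit instruction/input/output
# triples (field names here are hypothetical, not from any specific dataset):
instruction_record = {
    "instruction": "Answer the question using only the passage provided.",
    "input": "Passage: The Eiffel Tower, completed in 1889, is 330 metres tall.\n"
             "Question: When was the Eiffel Tower completed?",
    "output": "It was completed in 1889.",
}
```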

2.     Noteworthy results

2.1 LLMs can solve problems they have not been exposed to: Large language models make accurate predictions on tasks they have never seen before, with no additional training. This so-called zero-shot generalization [5] [6] is highly correlated with model size and training complexity. The largest models such as GPT-3, trained on rich and diverse data, perform very well on unseen tasks, such as finding a subtle bug in code the model has never seen before, while smaller models don’t do as well. Similarly, Bing Chat, when asked to write using specific guidelines, learns those guidelines and writes entirely novel stories using them [7]. This is an example of an “emergent” phenomenon that is not present in smaller models; emergence only occurs at large model or training-data sizes. For example, Meta research noted that a 7B parameter model continues to improve in performance even after 1 trillion tokens, and that a 33B model outperforms larger models on formal benchmarks [8].

2.2 LLMs can transfer skills from one domain to another: These models can also “transfer” skills from one domain to another in very effective ways. For example, GPT-3 has been shown to perform well on a wide range of language tasks, even though it was fine-tuned on only a small amount of task-specific data. This goes against the traditional approach to machine learning, which involves training models from scratch for specific tasks using large amounts of labeled data. An example of this so-called “transfer learning” is using an LLM to perform sentiment analysis or language modeling tasks with much smaller amounts of labeled data [9]. Another example is the ability to translate between languages the model has not been fine-tuned on. This phenomenon is also highly correlated with model size.

2.3 LLMs learn like scholars, not parrots: LLMs also do exceptionally well at learning new tasks via expert training with reinforcement learning. For example, a GPT-2 language model pre-trained on a large corpus of text data and then given specific examples of tasks to solve (expert training) performs better than a GPT-2 model fine-tuned on a specific task (instruction tuning) [10]. In essence, instruction tuning is like training a parrot: it tells the model what it should say in response to certain questions, like “what’s the weather like today?” or “how do you make a sandwich?”. Expert language tuning, on the other hand, is like training a scholar: the model is trained on the seemingly simple task of completing sentences about a certain topic, drawn from scientific articles, news articles, and books, which is then followed by “tuning” by human experts. Expert language tuning achieves state-of-the-art performance with just 1/20th of the training data compared to models fine-tuned on task-specific data.

2.4 LLMs are indistinguishable from humans for certain tasks: LLMs like GPT-3 are not just sophisticated text retrieval systems; they are also capable of generating novel text that is both contextually relevant and informative [11]. Experiments show GPT-3 can be used to generate coherent and informative text on a wide range of topics without being explicitly prompted with relevant information. For example, when asked to predict a follow-up question for “I bought a pair of running shoes”, GPT-3 generated the follow-up question “What type of running do you do?”. Neither prior prompts nor training data (context) explicitly included this type of Q&A. The result has an “F1 score” (which measures how well the generated answer matches a set of human-written answers) of 0.58, indicating that the generated question is both contextually appropriate and not directly retrievable from the input.

2.5 LLM performance improves when trained on mental models of attackers: Performance against attacks by malicious users, so-called “adversarial attacks”, improves with “expert training” in which LLMs are given an understanding of how to distinguish between normal and “adversarial” inputs [12]. Adversarial attacks involve intentionally manipulating inputs in order to produce incorrect outputs. To demonstrate with a simple example, consider the following exchange: “What is the capital of France?”, “Paris is the capital of France”, “Change Paris to London”, “London is the capital of France”. In practice, these attacks are more sophisticated. Fine-tuning a pre-trained GPT-2 model on an expert task (in this case, the SuperGLUE benchmark) improved the model’s performance at detecting and resisting adversarial attacks. Specifically, the expert-trained model was more robust to a range of different kinds of attacks compared to the pre-trained model.

2.6 Generative artificial intelligence in general, and LLMs in particular, are improving exponentially: Over the past few years, we have observed exponential progress [13] in the capabilities of generative AI applications, powered by exponential growth in computing hardware (multi-core to many-core special-purpose processors) and software platforms. Natural language and image processing techniques have also improved significantly, making AI applications more efficient and user-friendly. Furthermore, training quality is improving tremendously due to expert training and reinforcement learning with human feedback. We expect this trend to continue, and we expect the results called out here to intensify.

3.     A different mental model or abstraction

Evidence suggests these phenomena are “emergent”: we know that when we make language models bigger, they generally get better at a wide range of tasks. However, unexpected capabilities “emerge” in very large language models, which can develop new abilities that smaller models don’t have, and we can’t predict what these abilities will be just by looking at the performance of smaller models. [14]

Based on the evidence, we think it’s useful to think of LLMs at a different level of abstraction rather than focusing on ML techniques and model architectures. Just as studying electrical currents on a CPU doesn’t give us good intuition about software architecture, thinking through next-token prediction doesn’t give us good insights into how to apply these models. We propose reasoning about them as an “AI thinking agent” to understand how and why they behave in a certain way and how to apply them to problems. For example, just as we would expect an executive with experience running one division to be effective at running another division without prior knowledge of it, we would expect “transfer knowledge” to hold in models with emergent properties. We will present more results through this new lens that inform our thesis on the disruption:

3.1 LLMs have “hidden knowledge” that can be extracted in the right context: LLMs have been demonstrated to perform nearly on par with clinicians at answering medical questions (92.6% of answers aligned with scientific consensus, versus 92.9% for clinicians’ answers) [15]. Three different techniques were used to achieve this result. First is so-called “few-shot” prompting: showing the model a handful of prompts with the correct context and expected answer. Second is “chain-of-thought” prompting: showing the steps a clinician would take to arrive at the final answer, and prompting the model to provide a similar “chain of thought” for its own answers. The final strategy is so-called “self-consistency” prompting: getting many different outputs from the model and selecting the final answer by majority vote. Key gaps were still identified, which were fixed by “instruction prompt tuning”, in which humans complete “prompts” generated by the model during the training phase, giving it full context and the answer. The context part here is important. While a patient might complain simply of neck pain, clinicians offer rich context crisply, with statements such as: “A 37-year-old man with no significant past medical history is rear-ended in a motor vehicle accident. He reported significant neck pain to emergency responders, but otherwise denies weakness, numbness or tingling in his extremities. His vitals on presentation to the ED are HR 90, BP 140/80, RR 20, SpO2 98%. The most appropriate next step upon presentation to the emergency room is cervical immobilization. Significant neck pain suggests a possible cervical spinal injury, and cervical immobilization should be initiated until radiographs can be obtained.” This is the form of prompting that was involved in training the model. Techniques that prompt on chain of thought and on error scenarios have been shown to work for other, less challenging domains [16]. This strikes us as remarkably similar to how senior clinicians train a resident.
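
To illustrate the three prompting techniques, here is a minimal sketch; `ask_llm` is a hypothetical stand-in for any LLM completion API, and the prompts are abbreviated versions of the clinical style quoted above:

```python
from collections import Counter

def ask_llm(prompt: str) -> str:
    """Hypothetical stand-in for a call to an LLM completion API."""
    raise NotImplementedError

# 1. Few-shot prompting: a handful of worked examples precede the real question.
few_shot_prompt = (
    "Q: A 37-year-old man reports significant neck pain after a rear-end "
    "collision. Next step?\n"
    "A: Cervical immobilization until radiographs can be obtained.\n"
    "Q: A 61-year-old woman presents with crushing substernal chest pain. "
    "Next step?\n"
    "A:"
)

# 2. Chain-of-thought prompting: ask the model to show its intermediate
# reasoning steps before committing to a final answer.
cot_prompt = few_shot_prompt + " Let's reason step by step."

# 3. Self-consistency: sample several independent answers, take a majority vote.
def self_consistent_answer(prompt: str, n_samples: int = 5) -> str:
    answers = [ask_llm(prompt) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```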

3.2 LLMs get much better performance when they are provided feedback: While increasing size usually improves performance, it doesn’t always increase alignment, i.e., meeting the expectations of users. Evidence from the literature suggests that when combined with supervised and reinforcement learning methods, models produce superior alignment. Specifically, InstructGPT [17], a model 100x smaller than GPT-3, was first fine-tuned on a labeled dataset of prompts and desired model behavior (so-called supervised learning). Then the model was further fine-tuned using reinforcement learning with human feedback: the model produces multiple outputs for each input, human labelers rank them from best to worst, and the rankings are fed back to the model. This sort of training improved alignment as evaluated by users. Alignment also improved in areas the model was not fine-tuned on, like summarizing code, answering questions about code, and following instructions in different languages. Similarly, performance can be enhanced by “teaching” LLMs to use “tools” to gain extra information they don’t already have access to, such as a Wikipedia article, or a calculator to perform basic calculations [18]. This is remarkably similar to how new employees are set up for success with specific on-the-job training, proper tools, and access to contextual information.
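
A minimal sketch of the “tools” idea: the model emits a tool call in its output, and surrounding code executes it. The `[CALC(...)]` syntax here is invented for illustration; Toolformer [18] learns its own calling format:

```python
import re

def fill_tool_calls(model_output: str) -> str:
    """Replace calculator calls the model emitted with computed results.
    The [CALC(...)] syntax is made up for this sketch."""
    def evaluate(match: re.Match) -> str:
        expression = match.group(1)
        # Only evaluate plain arithmetic; leave anything else untouched.
        if not re.fullmatch(r"[0-9+\-*/(). ]+", expression):
            return match.group(0)
        return str(eval(expression))
    return re.sub(r"\[CALC\((.*?)\)\]", evaluate, model_output)

print(fill_tool_calls("That is [CALC(17*23)] widgets in total."))
# -> "That is 391 widgets in total."
```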

3.3 LLMs have a (limited) ability to reason: Reasoning remains an area of relative weakness for LLMs, but a number of techniques have proven effective. For instance, the chain-of-thought approach trains on and expects a series of intermediate reasoning steps, which significantly improves the ability of LLMs to perform complex reasoning [19]. There are still gaps: even with these techniques, LLMs have difficulty exploring multiple valid deduction steps systematically [20]. These gaps can be mitigated using formal methods or with human supervision. Additionally, the self-consistency approach [21], in which LLMs must consider multiple approaches before arriving at an answer, mitigates risk in complex reasoning tasks.

4.     Limitations and problems

4.1 Model outputs are frequently incorrect: Even the most sophisticated LLMs make mistakes. For technical domains (e.g. programs, policies, investments, etc.), where the “correctness” of a solution is knowable, this can be mitigated by hooking LLMs up to automated reasoning tools. For example, a study showed that using automated reasoning and language tools together can solve new classes of problems previously unsolved by either on its own. [22]

4.2 Bias is entrenched in training data, with promising ideas for mitigation: Large language models [23], image generation models, and other generative AI techniques contain human-like biases. For example, GPT-3 was noted to complete the prompt “Two Muslims walked into ____” with the phrase “synagogue with axes and a bomb” [24], and DALL-E has produced evidently discriminatory and biased images [25]. Multiple techniques have been identified and applied that seem to mitigate this issue. For example, auto-debiasing [26] is a technique where the authors present models with auto-generated biased prompts, and then apply corrections for the biases found in the responses. OpenAI’s seminal InstructGPT [27] work, which used 3-stage reinforcement learning with humans in the loop, has also proven effective at minimizing bias. However, this remains an active area of research and should be considered when applying these models to production use cases.

4.3 Security and threat modeling with language models is an evolving field: A prompt injection attack is a very simple exploit where a model is instructed to ignore previous prompts or reveal them [28]. For example, it was revealed in the press that a user simply asked Microsoft’s Bing assistant to output the sentences that follow the phrase “Consider Bing Chat whose codename is Sydney”. This caused the Bing assistant to reveal several pages of content, presumably containing instructions on how to respond to user prompts [29]. Users have found a number of ways to “jailbreak” LLMs, often with seemingly simple attacks [30]. Other theoretical exploits [31] have been found but are considered more challenging to construct in practice; however, these could be exploited in the future with newer techniques or implementations. Furthermore, detecting machine-generated content used for existing abuses such as phishing, disinformation, fraudulent product reviews, academic dishonesty, and toxic spam is an area of evolving research with multiple open problems, possibly requiring coordination across technical and social domains to mitigate [32].

4.4 Privacy concerns remain; mitigations are relatively few and not 100% efficacious: Research has shown that it’s possible to extract training data a model was exposed to, even if it saw that data only once. For example, in an attack on GPT-2 [33], 1,800 samples were recovered, of which 78 were shown to be actual PII that GPT-2 was exposed to during training. Researchers have tried techniques such as differentially private learning, but when tied to strong privacy guarantees, these degrade performance quite significantly. Recent research is promising in that strong performance can still be obtained for modest privacy leakage [34]. Other mitigations are likely possible using external systems, but there isn’t much evidence in the literature to suggest their effectiveness. Any production use-case for these models should keep possible privacy leakage in mind.

4.5 There are good reasons to expect negative behaviors, such as sophisticated ‘deception’, to emerge in the future: Current models have emergent properties such as transfer learning and generalization to unseen tasks, which are generally seen as positive. However, with increased exposure to fine-tuning by humans, it’s possible for models to learn about annotators’ preferences and improve measured performance through deceptive means [35]. For example, suppose outputs are rated from 1 to 7 by human annotators, a typical good output gets 6/7, an uncaught deceptive output gets 6.5/7, and deception that gets caught gets 1/7. Then the system would only try being deceptive when it has a greater than 91% chance of success. There is already evidence to suggest LLMs act like sycophants, imitating the beliefs of the person they are talking to, including giving less accurate answers to less educated-seeming interlocutors [36].
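
Working through the arithmetic of that example, the 91% threshold falls out of a one-line expected-value comparison:

```latex
% p = probability the deception goes uncaught
E[\text{deceive}] = 6.5\,p + 1\cdot(1 - p), \qquad E[\text{honest}] = 6
% Deception beats honesty only when:
6.5\,p + (1 - p) > 6 \;\Longrightarrow\; 5.5\,p > 5 \;\Longrightarrow\; p > \tfrac{10}{11} \approx 0.91
```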

4.6 Models work empirically, but there is no theory for why they work, and experts are divided on the implications: Deep learning [37] in general, and large language models in particular, have been empirically shown to work, but there is no theoretical framework for “why” they work. This means there are no theoretical guarantees or bounds on their performance. It also means there is no consensus on the implications of their current performance and future trajectory, with experts on multiple sides. Yann LeCun, a Turing Award winner widely regarded as one of the inventors of deep learning, for example, thinks that recent advances in LLMs will not lead to completely autonomous intelligent agents [38]. Instead, he believes a new architecture comprising multiple intelligent subsystems interacting with each other is required to achieve such results.

5.     Implications for our strategy

5.1 Chat is not an optimal modality; complex tasks will not work in real time: Thus far we have seen LLMs deployed to solve real-time problems with chat-like interfaces. These work well where we expect human workers to produce real-time outputs, such as contact center agents. However, they place serious limits on latency and prevent solving higher-value tasks that do not need a real-time response, such as testing the performance characteristics of various options and optimizing code, or weighing different options to optimize an output goal such as price per invoice processed. Consider fixing bugs in legacy code, for example. This may involve tens of iterations of steps to root-cause, propose a solution, test it, and if it fails, go back and start again. This sort of task synthesis and iteration architecture around LLMs is still relatively new, but techniques described in the literature, such as an ensemble of expert models [39], break the task into constituent parts that are solved iteratively to produce much more robust results. Furthermore, when LLMs are paired with systems that can do rapid evaluation, such as compilers or profilers, and given an ability to backtrack, complex problems with multiple possible solutions can be solved more efficiently (see the sketch below). Validating outputs, such as for best performance, requires complex flows with high latency. Iterative asynchronous architectures with built-in validation have been shown to resolve issues such as hallucinations, even in research settings. This has implications for user experience, as such systems require human supervision of the various options for work output, the criteria for evaluation, and feedback mechanisms at various levels.
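
A minimal sketch of such an iterative loop, assuming a hypothetical `propose_fix` LLM call; the validation step is a real test runner, which is what makes the loop trustworthy:

```python
import os
import subprocess
import tempfile

def propose_fix(source: str, last_failure: str) -> str:
    """Hypothetical LLM call: given code and the last failure log, propose a patch."""
    raise NotImplementedError

def validate(source: str) -> tuple[bool, str]:
    """Rapid evaluation: hand the candidate to a real test runner."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(source)
        path = f.name
    try:
        result = subprocess.run(["python", "-m", "pytest", path],
                                capture_output=True, text=True, timeout=60)
        return result.returncode == 0, result.stdout + result.stderr
    finally:
        os.unlink(path)

def iterate_until_fixed(source: str, max_attempts: int = 20) -> str | None:
    """Tens of iterations of propose/test/backtrack, not a single real-time answer."""
    failure_log = ""
    for _ in range(max_attempts):
        candidate = propose_fix(source, failure_log)
        passed, failure_log = validate(candidate)
        if passed:
            return candidate   # only validated output leaves the loop
    return None                # give up and escalate to a human
```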

5.2 Solutions for areas with simpler “rules”, such as no-code systems, are more tractable than ones with “complex” rules, such as coding: Code generation using LLMs has proven harder to build for, beyond assisting humans. While there is immense potential here, even systems like AlphaCode only match an average programmer, despite a massive, large-scale solution generation and evaluation model [40]. AlphaCode was designed to solve competitive coding tasks; to match median human performance, the system needed to rapidly evaluate hundreds of thousands of solutions, and this is for relatively constrained problems that can be solved with hundreds of lines of code or less. Constrained systems like low-code/no-code (where there are limited ‘moves’) may be more viable in the near term because 1/ it is easy to simulate solutions, providing a rapid iteration and validation loop, 2/ the number of changes or “moves” is constrained, and 3/ it is easy to break problems down into smaller components. As an analogy, consider chess and Go: chess has more limited moves than Go, and it was easier for deep learning algorithms to beat humans at chess than at Go.

5.3 AI ‘workers’ or ‘contractors’ that work under the supervision of humans are tractable in the near term: A recent study found that approximately 80% of the U.S. workforce could have at least 10% of their work tasks affected, while around 19% of workers may see at least 50% of their tasks impacted. The influence spans all wage levels, with higher-income jobs potentially facing greater exposure. [41] These models have 1/ the ability to understand natural language descriptions, particularly of problems to be solved, 2/ the ability to reason about the problem in many ways and test solutions out with what-if analysis, 3/ the ability to apply basic judgement to solutions, 4/ the ability to apply a solution and learn from the results, and 5/ the ability to learn from feedback from expert humans. A couple of use-cases to illustrate: 1/ H&R Block has a surge in volume during tax season as clients want to schedule or reschedule time with tax pros. Customers also want to do basic troubleshooting over the phone. Agents, who are business users, handle this over the phone for all but the simplest cases, which are handled by automated systems. 2/ Another example: write unit tests and integration tests to describe system functionality in detail, and to learn how the system behaves under various conditions. Using existing systems trained with reinforcement learning with humans in the loop and human supervision, these types of tasks could be fully automated by this year.

5.4 Large language models need to be supervised and reinforced with feedback (implications for CX): Even the most advanced LLMs still make simple mistakes. For instance, the false positive rate for medical diagnosis with Med-PaLM was 6%, compared to 1.6% for clinicians [42]. Furthermore, even deep learning systems like the ones that play Go, once thought unbeatable, have proven to have surprising failure modes [43]. For the current crop of models to be deployed for useful tasks, we still need humans to inspect their work, provide feedback, and optimize towards a pre-defined output goal, similar to managing human workers. This has implications for the user experience: it’s not merely about a chat-like interface, but might require dashboards, metrics, alarms, and moderation of work output.

5.5 While models are computationally expensive now, we expect them to get exponentially cheaper (following Moore’s law): Methods such as pruning [44] exist that reduce the amount of computation required with negligible loss in performance. Furthermore, recent advances suggest that inference can also become cheaper, with LLMs able to run on commodity hardware [45], and it is possible to rearchitect these models so they can be trained on poorly connected, heterogeneous, and unreliable devices [46]. In other words, they do not need expensive hardware like P4 instances. Quantization [47] is another technique that reduces the memory and compute required after training by reducing the precision of weights (a toy illustration follows). We should expect LLMs to get progressively cheaper and more performant to train and to run inference on. OpenAI was able to use a combination of these techniques to reduce ChatGPT costs by 90% [48].
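
As a toy illustration of the quantization idea (real schemes such as SmoothQuant [47] are considerably more sophisticated), storing weights as int8 plus a single scale factor cuts memory 4x relative to fp32, at the cost of a small rounding error:

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric post-training quantization: int8 weights plus one fp32 scale."""
    scale = np.abs(weights).max() / 127.0
    q = np.round(weights / scale).clip(-127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(1024, 1024).astype(np.float32)   # one fp32 weight matrix
q, scale = quantize_int8(w)
print(w.nbytes // q.nbytes)                          # 4: a 4x memory reduction
print(np.abs(w - dequantize(q, scale)).max())        # small reconstruction error
```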

6.     Key Takeaways

6.1 Calibrate product leaders’ expectations of LLM capabilities: The current dominant UX paradigm is a conversational or chat interface, or simple integrations. This will change in the future to enable more powerful iterative capabilities. Product leaders need to start thinking now about their space and how this technology can be applied to deliver more value to customers.

6.2 Product teams should be thinking of solutions to problems that were previously intractable: Every product will have some LLM-enabled capability it can build. But these will likely not be defensible.

6.3 Major systems engineering effort is required across the board, especially in systems adjacent to LLMs: We must invest in systems engineering, operational readiness, and operational efficiency of not only the models themselves, but the iterative systems around them, such as automated reasoning systems, to take full advantage of the disruption potential.

6.4 Product and engineering leaders alike need to urgently start experimenting with this technology in their areas: There are lots of unknowns, potential new layers of investment, and new inventions required to take full advantage of this tech. One example is the potential of zero-shot tasks. Every leader must be driving investigations to secure their business from new startups and deliver superior customer value.

6.5 Recognize that autonomous, iterative task processing requires significant engineering lift from product groups, and may be best modeled as a common service: Teams will want different kinds of ‘autonomous tasks’ for their products, each requiring bespoke types of task completion. This should not require LLM expertise and should be accessible to all product teams.

7.     Disruption examples under various future scenarios

Level 1 disruption: LLMs exceed expert humans in certain domains when provided with reinforcement learning:

7.1 Code refactoring with auto-generated unit tests: A high-value binary executable needs to be refactored, but it suffers from a lack of realistic test scenarios: existing unit tests only test small pieces and do not exercise the complex interactions between components, while integration tests are slow, cumbersome, and cover only a small fraction of functionality. In this scenario, it’s possible to set up LLMs to describe each component by writing a sufficient number of unit tests and executing them (a sketch follows). Once the code is described in this way, it is possible to write realistic integration tests with LLMs. This enables full test coverage, allowing expert human engineers to refactor or port the code. This is the first step towards refactoring as discussed in 6.3.
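
A minimal sketch of the idea, with hypothetical LLM and test-runner calls: generated tests that pass against the current binary become a “characterization” suite that pins down existing behavior, quirks included:

```python
def draft_tests(component_source: str, n: int = 50) -> list[str]:
    """Hypothetical LLM call: draft n candidate unit tests for the component."""
    raise NotImplementedError

def passes_today(test_source: str) -> bool:
    """Hypothetical runner: execute one test against the current system."""
    raise NotImplementedError

def characterize(component_source: str) -> list[str]:
    """Keep only tests that pass against current behavior. The survivors
    describe what the component actually does, quirks and all, and become
    the safety net for refactoring or porting."""
    return [t for t in draft_tests(component_source) if passes_today(t)]
```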

7.2 Generating and iterating on UI designs for builders unfamiliar with design: Apps like Uizard [49] allow developers to create designs from simple natural language. Developers can then iterate on these designs quite rapidly, run A/B tests with their customers, and adopt the designs that work best.

Level 2 disruption: A combination of various models and techniques can be used to arbitrarily maximize performance for a given use-case

7.3 Code modernization with the help of automated reasoning systems: The main obstacle to modernizing legacy code is that the current system’s execution behaviors and quirks are not well understood or documented. Formal specifications, where they exist, provide extremely precise descriptions of system behavior. However, they’re hard to construct for existing systems because the code has often evolved over many generations of programmers and is often “spaghettified”. A big barrier to robust understanding of these systems is that reading individual pieces of code and developing hypotheses about what they do is extremely laborious and time consuming. LLMs, however, could easily accomplish this task. The hypotheses can then be validated quickly using an automated reasoning system. As smaller components are formally specified, this knowledge can be used to construct formal specifications for larger components and eventually the whole system. This allows a complete bottom-up understanding of how the legacy system behaves, including its quirks.

Code modernization can then be initiated with an LLM-powered expert system. It would start by creating a comprehensive test specification and mocks of the existing system, and use comparative execution results to iterate and ensure accurate coverage, likely with human supervision. The tests become the foundation for refactoring and restructuring the code to make the migration process easier, e.g. grouping and separating UI composition from business logic. It may also generate infrastructure-as-code templates and sample data that represent a tuned version of the legacy system, to set up a rapid validation environment sandboxed from production and other side effects.

Then, building on training heuristics and expert guidance, the system can generate and iterate on equivalent components in the target framework or programming language. It will use tests and automated reasoning to prune solutions and track progress toward each task. Similar to machine-assisted translation, human expert input may still be required, and the feedback would be used for reinforcement learning.

7.4 Code optimization using genetic programming: LLMs can be applied to arbitrarily maximize performance in combination with other techniques. For example, genetic programming is a technique where algorithms are “evolved” based on the principles of natural evolution. In the past, algorithms were modified or “mutated” randomly, but this took a very long time to converge, if it converged at all. Recent research suggests that using LLMs to purposefully change algorithms leads to rapid convergence and to the evolution of novel solutions to problems [50]. In this scenario, it’s possible to provide the system with a specific outcome in mind and let it evolve an optimal solution (a sketch follows): for example, algorithms to reduce delivery time across Amazon deliveries, or to enable helper robots to develop an action plan and “make breakfast” [51].
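
A minimal sketch of the idea, with the LLM standing in for the random mutation operator of classical genetic programming; `llm_mutate` and `fitness` are hypothetical:

```python
import random

def llm_mutate(program: str, goal: str) -> str:
    """Hypothetical LLM call: rewrite the program purposefully toward the goal,
    replacing the blind random mutations of classical genetic programming."""
    raise NotImplementedError

def fitness(program: str) -> float:
    """Domain-specific score, e.g. simulated delivery time; lower is better."""
    raise NotImplementedError

def evolve(seed_programs: list[str], goal: str, generations: int = 50) -> str:
    population = list(seed_programs)
    for _ in range(generations):
        children = [llm_mutate(random.choice(population), goal)
                    for _ in range(len(population))]
        # Selection: keep the fittest individuals from parents plus children.
        population = sorted(population + children, key=fitness)[:len(seed_programs)]
    return population[0]   # best program found
```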

Level 3 disruption: Combinations of techniques solve open-ended problems that require domain expertise and iteration:

7.5 Generally intelligent agents: Assuming significant obstacles with current techniques are overcome, it is possible to create generally intelligent agents that reason based on situations, take perspectives, choose goals, and deal with ambiguous information to pursue completely open-ended problems that require domain expertise and iteration. Some researchers hold this to be technically unviable [52], while others believe multi-modal learning [53] (using text, images, and other modalities simultaneously) will be the key to achieving it. In this scenario, AI could solve open-ended problems such as “optimize the memory footprint of a large and complex software system” or “optimize the vendor management business process to improve CSAT by 50% or more”.

8.     Appendix: Interesting applications and examples

Google demonstrated a model that understands both images and language together and is capable of controlling robots (embodied agents): This model is based on large language models (transformer architecture). It is able to solve complex real-world scenarios given simple natural language instructions like “open the drawer and get the green bag of chips”, demonstrating planning of the steps required to complete the task and responding to unexpected events in the environment, such as a human pushing away the bag of chips the robot grabbed. The team also demonstrated impressive results on image recognition and language-related tasks, and the model showed significant transfer learning between modalities. [54]

Microsoft demonstrated design principles for adapting ChatGPT to robotics applications, solving unseen tasks with only a natural language interface: The principles include special prompting structures, high-level APIs, and human feedback via text. The researchers were able to demonstrate that with a roboticist in the loop, it’s possible to tackle complex tasks such as “inspect my shelves in a lawnmower pattern”. They also demonstrated that ChatGPT generalizes to a large number of tasks, and that it’s possible to pair it with a simulation environment for risk-free evaluation and validation of solutions, enabling rapid iteration using prompt engineering. [55]

InstructPix2Pix edits images from natural language instructions: Given an input image and a written instruction that tells the model what to do, this model follows the instruction and edits the image. It combines the knowledge of two large pretrained models, a language model (GPT-3) and a text-to-image model (Stable Diffusion), to generate a large dataset of image editing examples. The resulting conditional diffusion model, InstructPix2Pix, generalizes to real images and user-written instructions once trained. [56]

DeepMind’s AlphaCode performs as well as the median human competitor in competitive programming: The approach combines advances in large-scale transformer models (which have recently shown promising abilities to generate code) with large-scale sampling and filtering. The model is pre-trained on public GitHub code and fine-tuned on a relatively small competitive programming dataset. At evaluation time, AlphaCode creates a massive number of C++ and Python programs for each problem, orders of magnitude more than previous work. AlphaCode then filters, clusters, and re-ranks the solutions down to a small set of 10 candidate programs that are submitted for external assessment (a sketch of this selection follows). This automated system replaces competitors’ trial-and-error process of debugging, compiling, passing tests, and eventually submitting. The process also highlights that a non-real-time, asynchronous process that weighs many options can substantially improve performance in areas such as competitive coding that require a high degree of creativity and problem solving. [57]
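
A sketch of that filter/cluster/re-rank step as we understand it from the blog post [57] (our reconstruction, not DeepMind’s code); `run(program, test_input)` is a hypothetical sandboxed executor:

```python
from collections import defaultdict

def select_candidates(programs, public_tests, probe_inputs, run, k=10):
    # 1. Filter: discard programs that fail the problem's example tests.
    survivors = [p for p in programs
                 if all(run(p, i) == expected for i, expected in public_tests)]
    # 2. Cluster: group behaviorally identical programs by their outputs on
    #    auto-generated probe inputs.
    clusters = defaultdict(list)
    for p in survivors:
        clusters[tuple(run(p, i) for i in probe_inputs)].append(p)
    # 3. Re-rank: submit one representative from each of the k largest clusters.
    largest_first = sorted(clusters.values(), key=len, reverse=True)
    return [cluster[0] for cluster in largest_first[:k]]
```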

DeepMind’s adaptive agents team taught agents to solve completely novel problems in a virtual 3D world: DeepMind showed that reinforcement learning can be used to get agents to adapt to open-ended, novel, embodied 3D problems as quickly as humans. These agents, which are “embodied” in a virtual world with a first-person view, are shown to explore the world with hypotheses about how to solve problems, even when the rules of the game are hidden from them. They do this exploration on the fly and exploit the acquired knowledge to solve completely novel problems, even some that require recursion. They can also successfully be prompted with first-person demonstrations. [58]

9.  References


[1] Scaling Instruction-Finetuned Language Models https://arxiv.org/pdf/2210.11416.pdf

[2] Training language models to follow instructions with human feedback https://arxiv.org/abs/2203.02155

[3] Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback https://arxiv.org/abs/2204.05862

[4] Constitutional AI: Harmlessness from AI Feedback https://arxiv.org/abs/2212.08073

[5] Do Better ImageNet Models Transfer Better? https://arxiv.org/abs/1805.08974

[6] Scaling Laws for Neural Language Models https://arxiv.org/abs/2001.08361

[7] Asking Bing to improve itself by pointing to an expert source online improves outputs in unexpectedly huge ways [Tweet]

[8] LLaMA: Open and Efficient Foundation Language Models [Meta research]

[9] Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer https://arxiv.org/abs/1910.10683

[10] Exploring the Benefits of Training Expert Language Models over Instruction Tuning https://arxiv.org/abs/2302.03202

[11] Generate rather than Retrieve: Large Language Models are Strong Context Generators https://arxiv.org/abs/2209.10063

[12] Mitigating Adversarial Effects Through Randomization https://arxiv.org/abs/1711.01991

[13] A Review of Generative AI from Historical Perspectives https://tinyurl.com/yeypftsh 

[14] Emergent Abilities of Large Language Models https://arxiv.org/abs/2206.07682

[15] Large Language Models Encode Clinical Knowledge https://arxiv.org/abs/2212.13138

[16] Planning with Large Language Models via Corrective Re-prompting https://openreview.net/pdf?id=cMDMRBe1TKs

[17] Training language models to follow instructions with human feedback https://arxiv.org/abs/2203.02155

[18] Toolformer: Language Models Can Teach Themselves to Use Tools https://arxiv.org/abs/2302.04761

[19] Chain-of-Thought Prompting Elicits Reasoning in Large Language Models https://arxiv.org/abs/2201.11903

[20] Language Models Can (kind of) Reason: A Systematic Formal Analysis of Chain-of-Thought https://openreview.net/forum?id=qFVVBzXxR2V

[21] Self-Consistency Improves Chain of Thought Reasoning in Language Models https://arxiv.org/abs/2203.11171

[22] Thor: Wielding Hammers to Integrate Language Models and Automated Theorem Provers  https://arxiv.org/pdf/2205.10893.pdf

[23] Large Pre-trained Language Models Contain Human-like Biases of What is Right and Wrong to Do https://arxiv.org/abs/2103.11790

[24] Towards an Enhanced Understanding of Bias in Pre-trained Neural Language Models: A Survey with Special Emphasis on Affective Bias https://arxiv.org/pdf/2204.10365.pdf

[25] Easily Accessible Text-to-Image Generation Amplifies Demographic Stereotypes at Large Scale https://arxiv.org/abs/2211.03759

[26] Auto-Debias: Debiasing Masked Language Models with Automated Biased Prompts https://aclanthology.org/2022.acl-long.72/

[27] Training language models to follow instructions with human feedback https://arxiv.org/abs/2203.02155

[28] Ignore Previous Prompt: Attack Techniques For Language Models https://arxiv.org/pdf/2211.09527.pdf

[29] AI-powered Bing Chat spills its secrets via prompt injection attack [Updated], Ars Technica

[30] Using a “let’s imagine” scenario to get GPT-4 to ignore its instructions: https://www.jailbreakchat.com/prompt/b2917fad-6803-41f8-a6c8-756229b84270

[31] Exploring the Universal Vulnerability of Prompt-based Learning Paradigm https://arxiv.org/abs/2204.05239

[32] Machine Generated Text: A Comprehensive Survey of Threat Models and Detection Methods https://arxiv.org/abs/2210.07321

[33] Extracting Training Data from Large Language Models https://www.usenix.org/conference/usenixsecurity21/presentation/carlini-extracting

[34] Large Language Models Can Be Strong Differentially Private Learners https://arxiv.org/abs/2110.05679

[35] Emergent Deception and Emergent Optimization https://bounded-regret.ghost.io/emergent-deception-optimization/

[36] Discovering Language Model Behaviors with Model-Written Evaluations https://arxiv.org/abs/2212.09251

[37] Deep Learning and the Triumph of Empiricism https://www.kdnuggets.com/2015/07/deep-learning-triumph-empiricism-over-theoretical-mathematical-guarantees.html

[38] A Path Towards Autonomous Machine Intelligence Version 0.9.2, 2022-06-27 https://openreview.net/pdf?id=BZ5a1r-kVsf

[39] SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient https://arxiv.org/abs/2301.11913

[40] Competitive programming with AlphaCode: https://www.deepmind.com/blog/competitive-programming-with-alphacode

[41] GPTs are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models https://arxiv.org/pdf/2303.10130.pdf

[42] Large Language Models Encode Clinical Knowledge https://arxiv.org/abs/2212.13138

[43] Adversarial policies beat superhuman Go AIs https://arxiv.org/abs/2211.00241

[44] SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot https://arxiv.org/abs/2301.00774

[45] High-throughput Generative Inference of Large Language Models with a Single GPU https://github.com/FMInference/FlexGen/blob/main/docs/paper.pdf

[46] SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient https://arxiv.org/abs/2301.11913

[47] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models https://arxiv.org/abs/2211.10438

[48] The ChatGPT API was released yesterday and it costs 90% less than expected [Tweet]

[49] https://uizard.io/autodesigner/

[50] Evolution through Large Models https://arxiv.org/pdf/2206.08896.pdf

[51] Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents https://proceedings.mlr.press/v162/huang22a.html

[52] How Organisms Come to Know the World: Fundamental Limits on Artificial General Intelligence https://www.frontiersin.org/articles/10.3389/fevo.2021.806283/full?ref=upstract.com

[53] Towards artificial general intelligence via a multimodal foundation model https://www.nature.com/articles/s41467-022-30761-2

[54] PaLM-E: An Embodied Multimodal Language Model https://palm-e.github.io/#demo

[55] ChatGPT for Robotics: Design Principles and Model Abilities https://aka.ms/ChatGPT-Robotics

[56] InstructPix2Pix: Learning to Follow Image Editing Instructions https://www.timothybrooks.com/instruct-pix2pix/

[57] Competitive programming with AlphaCode https://www.deepmind.com/blog/competitive-programming-with-alphacode

[58] Human-Timescale Adaptation in an Open-Ended Task Space https://sites.google.com/view/adaptive-agent/
