10 Brilliant Datasets based on ChatGPT Outputs
No one on the internet has been left untouched by ChatGPT, Silicon Valley's favourite chatbot, driven by OpenAI's GPT-3.5 and GPT-4 models. With over 100 million users already, the OpenAI chatbot has also captivated the research community.
Since the release of GPT-4, AI researchers have been using the model's outputs to build datasets and train their own language models, benchmarking the results against it. Here are 10 datasets built on ChatGPT and GPT-4 outputs, handpicked for GPT-4 enthusiasts!
LIMA
LIMA is a small dataset of 1,000 carefully curated examples (available on Hugging Face), released by the researchers behind “LIMA: Less Is More for Alignment.” The study suggests that LIMA can push forward research on developing proficient LLMs (large language models).
Notably, the researchers demonstrated that a 65B LLaMA model, fine-tuned only on these 1000 examples using a supervised approach, achieved competitive performance compared to ChatGPT.
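For anyone who wants to inspect the data, here is a minimal sketch of loading it with the Hugging Face datasets library; the repo id GAIR/lima and the conversations field name are assumptions based on the public release, and the dataset is gated behind a licence agreement.

```python
# Minimal sketch (assumptions: repo id "GAIR/lima", a "conversations" field,
# and that you have accepted the dataset's licence on the Hugging Face Hub).
from datasets import load_dataset

lima = load_dataset("GAIR/lima", split="train")
print(len(lima))                 # roughly 1,000 training examples
print(lima[0]["conversations"])  # each record holds a list of dialogue turns
```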
MiniGPT-4
Researchers from Vision-CAIR introduced MiniGPT-4, which pairs a pretrained visual encoder with the Vicuna language model; the updated variant aligned with Vicuna-7B cuts GPU memory consumption to as little as 12GB. The researchers also propose a novel approach in which the model itself, together with ChatGPT, generates high-quality image-text pairs, producing a compact yet superior dataset of 3,500 pairs in total.
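Below is a hedged sketch of iterating over those curated pairs, assuming the layout of the released cc_sbu_align archive (a filter_cap.json annotation file next to an image folder); adjust the paths and keys to whatever the download actually contains.

```python
# Hedged sketch: reading MiniGPT-4's curated image-text pairs. The file layout
# and keys ("annotations", "image_id", "caption") are assumptions about the
# released cc_sbu_align archive, not guarantees.
import json
from pathlib import Path

root = Path("cc_sbu_align")  # path to the downloaded, unzipped dataset
annotations = json.loads((root / "filter_cap.json").read_text())["annotations"]

for ann in annotations[:3]:
    image_path = root / "image" / f"{ann['image_id']}.jpg"
    print(image_path, "->", ann["caption"][:80])
```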
Find the GitHub repository here.
Dolly
Dolly, a groundbreaking open-source project by Databricks, shows that a pre-existing, older open-source LLM can be transformed into a ChatGPT-like instruction-following system. This is made possible by a mere 30-minute training run on a single machine, utilising high-quality training data.
Notably, the underlying model in Dolly has only 6 billion parameters, compared to other models with far larger parameter counts. The researchers later released a successor, Dolly 2.0, which was lauded by the open-source community.
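For a quick taste of the model family, the sketch below follows the usage pattern described on the Dolly 2.0 model cards; the databricks/dolly-v2-3b checkpoint and the generation settings are assumptions, and the original 6-billion-parameter Dolly 1.0 checkpoint is loaded differently.

```python
# Hedged sketch: querying a Dolly 2.0 checkpoint through the transformers
# pipeline. Model id and dtype are assumptions; the custom pipeline code shipped
# with the checkpoint requires trust_remote_code=True.
import torch
from transformers import pipeline

generate = pipeline(
    model="databricks/dolly-v2-3b",  # assumed checkpoint; larger variants exist
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)
print(generate("Explain instruction tuning in one sentence."))
```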
Find the GitHub repository here.
Code Alpaca
The Code Alpaca project aims to build and share an instruction-following model based on Meta AI’s LLaMA, designed specifically for code generation. The repository builds on Stanford’s Alpaca, with the only modification being the training data; the training method remains the same as the original approach.
The Code Alpaca models were created by fine-tuning 7B and 13B LLaMA models on a dataset of 20,000 instruction-following examples, generated with techniques inspired by the Self-Instruct paper, with certain adaptations for better outputs.
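The snippet below is a hedged sketch of what an Alpaca-style record looks like and how such a record is typically rendered into a training prompt; the example record and the exact prompt wording are illustrative rather than copied from the repository.

```python
# Illustrative Alpaca-style record: the {"instruction", "input", "output"} schema
# is inherited from Stanford Alpaca; this particular example is made up.
record = {
    "instruction": "Write a Python function that reverses a string.",
    "input": "",
    "output": "def reverse_string(s):\n    return s[::-1]",
}

def build_prompt(rec: dict) -> str:
    """Render a record into a training prompt (wording is illustrative)."""
    if rec["input"]:
        return (
            "Below is an instruction that describes a task, paired with an input "
            "that provides further context. Write a response that appropriately "
            f"completes the request.\n\n### Instruction:\n{rec['instruction']}\n\n"
            f"### Input:\n{rec['input']}\n\n### Response:\n"
        )
    return (
        "Below is an instruction that describes a task. Write a response that "
        "appropriately completes the request.\n\n"
        f"### Instruction:\n{rec['instruction']}\n\n### Response:\n"
    )

print(build_prompt(record) + record["output"])
```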
Find the GitHub repository here.
Instruction Tuning With GPT-4
The primary objective of GPT-4-LLM is to share data generated by GPT-4 that can be used to build instruction-following LLMs through supervised learning and reinforcement learning techniques.
The project pushes the boundaries of instruction tuning in the LLM world, as it is one of the first initiatives to leverage GPT-4’s capabilities for generating instruction-following data specifically tailored to LLM fine-tuning. Notably, the development holds the potential to advance the state of the art in language model training.
Find the GitHub repository here.
LLaVA-Instruct-150K
LLaVA Visual Instruct 150K is a collection of multimodal instruction-following data generated with GPT-4. The dataset is curated for visual instruction tuning, to support the development of large multimodal models with advanced vision and language capabilities, geared towards GPT-4-level vision/language capability. It holds great promise for research at the intersection of vision and language aimed at building capable multimodal models.
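A hedged way to peek at the data is to pull one of the annotation files straight from the Hugging Face Hub; the repo id liuhaotian/LLaVA-Instruct-150K and the file name below are assumptions based on the public release.

```python
# Hedged sketch: downloading one annotation file from the LLaVA-Instruct-150K
# dataset repo. Repo id and file name are assumptions; each record is expected
# to pair a COCO image reference with GPT-4-generated conversation turns.
import json
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="liuhaotian/LLaVA-Instruct-150K",
    filename="llava_instruct_150k.json",
    repo_type="dataset",
)
with open(path) as f:
    data = json.load(f)
print(len(data), data[0].keys())  # expect keys such as "image" and "conversations"
```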
Find the GitHub repository here.
UltraChat
UltraChat offers valuable open-source, large-scale, multi-round dialogue data generated with ChatGPT Turbo APIs. To prioritise privacy protection, the data collection process does not directly use any internet-sourced prompts. Furthermore, to maintain high standards of generation quality, a dual-API approach is used.
One API operates as the user, generating queries, while the other API assumes the role of generating responses. This approach ensures a reliable dialogue generation process, promoting advancements in conversational AI while also prioritising privacy and data integrity.
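The snippet below is a minimal illustration of that dual-API idea rather than the project's actual code; the model name, prompts, topic, and number of rounds are all placeholders.

```python
# Minimal illustration (not UltraChat's implementation): one chat-completion call
# role-plays the user and invents the next question, another answers it.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(system: str, messages: list) -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # placeholder model name
        messages=[{"role": "system", "content": system}] + messages,
    )
    return resp.choices[0].message.content

dialogue = []  # accumulated user/assistant turns
for _ in range(3):  # three simulated rounds
    # "User" API: role-play a curious user; the conversation so far is shown
    # with the roles flipped so the model speaks as the user.
    flipped = [
        {"role": "assistant" if m["role"] == "user" else "user", "content": m["content"]}
        for m in dialogue
    ]
    question = ask(
        "You are role-playing a curious user talking to an AI assistant about "
        "renewable energy storage. Reply only with your next question.",
        flipped,
    )
    # "Assistant" API: answer that question given the dialogue so far.
    answer = ask(
        "You are a helpful assistant.",
        dialogue + [{"role": "user", "content": question}],
    )
    dialogue += [
        {"role": "user", "content": question},
        {"role": "assistant", "content": answer},
    ]

print(dialogue)
```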
Find the GitHub repository here.
GPTeacher
GPTeacher is a compilation of modular datasets generated by GPT-4: General-Instruct, Roleplay-Instruct, Code-Instruct, and Toolformer.
Each dataset serves a specific purpose, and together they form a valuable resource for researchers. With GPT-4’s data generation prowess, these datasets showcase the model’s versatility and contribute to the landscape of language modelling.
Find the GitHub repository here.
ShareGPT
This collection of around 70k user-shared conversations, gathered through public APIs, served as the foundational dataset for Vicuna-13B, an open-source chatbot. The data comes from ShareGPT, an open-source Chrome extension that users relied on to share their ChatGPT conversations before OpenAI introduced a native sharing feature in the chatbot.
Find the Hugging Face repository here.
HC3
The HC3 (Human ChatGPT Comparison Corpus) dataset is an extensive collection of approximately 40k questions, each paired with responses from both human experts and ChatGPT.
The primary aim of this dataset is to analyse and compare ChatGPT’s responses against human-generated answers. The questions span several domains, including open-domain, financial, medical, legal, and psychological topics.
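Here is a hedged sketch of loading the corpus and printing a human answer next to a ChatGPT answer; the repo id Hello-SimpleAI/HC3, the all configuration, and the field names are assumptions based on the public release.

```python
# Hedged sketch: comparing a human answer with a ChatGPT answer from HC3.
# Repo id, configuration, and field names are assumptions; depending on your
# `datasets` version you may need to load the underlying jsonl files directly.
from datasets import load_dataset

hc3 = load_dataset("Hello-SimpleAI/HC3", "all", split="train")
row = hc3[0]
print("Q:", row["question"][:120])
print("Human:", (row["human_answers"] or [""])[0][:120])
print("ChatGPT:", (row["chatgpt_answers"] or [""])[0][:120])
```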
Find the Hugging Face repository here.