📚Academy
likeone
online

Data Preparation & Curation.

Your model is only as good as your data. This is where quality is won or lost.

After this lesson you'll know

  • How to format training data for instruction fine-tuning (chat and completion formats)
  • Quality filtering techniques that eliminate noisy examples
  • How to synthetically generate training data using stronger models
  • Dataset sizing: how many examples you actually need

Data Formats

Fine-tuning data must be formatted to match how the model will be used in production. Two primary formats dominate:

**Chat format (instruction tuning):** The standard for conversational models. Each example is a conversation with system, user, and assistant messages.

```json
{
  "messages": [
    {
      "role": "system",
      "content": "You are a legal document classifier. Respond with only the category name."
    },
    {
      "role": "user",
      "content": "Classify this document: 'The tenant agrees to pay monthly rent of $2,400 on the first of each month...'"
    },
    {
      "role": "assistant",
      "content": "Residential Lease Agreement"
    }
  ]
}
```

**Completion format (text generation):** For models that generate continuations rather than chat responses.

```json
{
  "prompt": "Summarize the following legal clause in plain English:\n\n'Notwithstanding the foregoing, the indemnifying party shall...'",
  "completion": "Despite what was said earlier, the party responsible for covering losses will..."
}
```

**Multi-turn chat format (for conversational fine-tuning):**

```json
{
  "messages": [
    {"role": "system", "content": "You are a technical support agent for CloudDB."},
    {"role": "user", "content": "My database is returning timeout errors."},
    {"role": "assistant", "content": "I can help with timeout errors. What is your database instance size and current connection count?"},
    {"role": "user", "content": "It is a db.t3.medium with about 200 connections."},
    {"role": "assistant", "content": "A db.t3.medium supports approximately 150 connections. You are exceeding the connection limit. I recommend either upgrading to db.t3.large (supports 300 connections) or implementing connection pooling with PgBouncer to reduce active connections."}
  ]
}
```
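To make these formats concrete, here is a minimal sketch of a structural check you might run over each record before training. The function name `validate_record` is hypothetical. The specific rules are assumptions for illustration, not requirements of any particular fine-tuning API: roles limited to system/user/assistant, a system message only in first position, and a conversation that ends with an assistant message.

```python
# Assumed role set; adjust to whatever your training framework accepts.
VALID_ROLES = {"system", "user", "assistant"}


def validate_record(record):
    """Return a list of problems found in one training record (empty if none)."""
    problems = []
    if "messages" in record:
        # Chat format (single-turn or multi-turn).
        messages = record["messages"]
        if not isinstance(messages, list) or not messages:
            return ["'messages' must be a non-empty list"]
        for i, msg in enumerate(messages):
            if not isinstance(msg, dict):
                problems.append(f"message {i} is not an object")
                continue
            role = msg.get("role")
            content = msg.get("content")
            if role not in VALID_ROLES:
                problems.append(f"message {i} has unknown role {role!r}")
            if not isinstance(content, str) or not content.strip():
                problems.append(f"message {i} has empty or non-string content")
            if role == "system" and i != 0:
                problems.append(f"message {i} is a system message but is not first")
        last = messages[-1]
        # Assumption: the final message is the target the model learns to produce.
        if isinstance(last, dict) and last.get("role") != "assistant":
            problems.append("conversation must end with an assistant message")
    elif "prompt" in record and "completion" in record:
        # Completion format.
        for key in ("prompt", "completion"):
            if not isinstance(record[key], str) or not record[key].strip():
                problems.append(f"'{key}' must be a non-empty string")
    else:
        problems.append("record is neither chat format nor completion format")
    return problems
```

In practice you would likely call this once per line of a JSONL file (after `json.loads`) and drop or repair any record that returns a non-empty list.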
Match your training format exactly to your inference format. If you will use a system prompt in production, include system prompts in training. If you will not use multi-turn conversations, do not train on multi-turn data. Format mismatch is one of the most common causes of poor fine-tuning results.
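One way to act on this advice is to scan a chat-format dataset for examples whose shape differs from how the model will be called. The sketch below is illustrative only: `find_format_mismatches` and its two flags, `uses_system_prompt` and `uses_multi_turn`, are hypothetical names describing your production setup, and "multi-turn" is assumed to mean more than one user message.

```python
def describe_example(record):
    """Return (has_system_prompt, is_multi_turn) for one chat-format record."""
    messages = record["messages"]
    has_system = any(m["role"] == "system" for m in messages)
    user_turns = sum(1 for m in messages if m["role"] == "user")
    return has_system, user_turns > 1


def find_format_mismatches(records, uses_system_prompt, uses_multi_turn):
    """List (index, reason) pairs where training data differs from production usage."""
    mismatches = []
    for index, record in enumerate(records):
        has_system, is_multi_turn = describe_example(record)
        if has_system != uses_system_prompt:
            mismatches.append((index, "system prompt presence differs from production"))
        if is_multi_turn and not uses_multi_turn:
            mismatches.append((index, "multi-turn example but production is single-turn"))
    return mismatches


# Hypothetical dataset: production uses a system prompt and single-turn requests.
records = [
    {"messages": [
        {"role": "system", "content": "You are a classifier."},
        {"role": "user", "content": "Classify this."},
        {"role": "assistant", "content": "Lease"},
    ]},
    {"messages": [
        {"role": "user", "content": "Classify this."},
        {"role": "assistant", "content": "Lease"},
    ]},
]
for index, reason in find_format_mismatches(records, uses_system_prompt=True, uses_multi_turn=False):
    print(f"example {index}: {reason}")
```

Here the second record would be flagged because it lacks the system prompt that production requests include.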