AI: Datasets
Muzukuke, Team! We’ve talked about training AI models, but where does the “learning” material come from? Today, we focus on the absolute foundation of any AI model: the Dataset.
Think of the dataset as the textbook, the examples, and the experiences you provide to your AI student. If the textbook is poor or biased, the student won’t learn well!


Part 1: What Makes a Dataset “Healthy”? (The Recipe for Good AI)
For your AI model (especially classification models) to learn effectively and fairly, the dataset you feed it needs several key qualities. Think of it like a recipe:
- LOTS of Data (Quantity): AI needs many examples to see patterns. The lesson suggests a minimum of 50 examples for each class (category) you want the AI to recognize. More is often better!

- Fair Portions (Balance): Try to have roughly the same number of examples for each class.
  - Example: If you’re teaching AI to classify market goods and you have 100 photos of matooke but only 10 photos of nakati, the AI might become biased and just guess “matooke” more often because it saw it more. Aim for balance, maybe 50 matooke photos and 50 nakati photos.
- Keep Some for the Exam (Test Data): This is crucial! You must set aside a portion of your data (around 10-20%) that the AI never sees during training. This is your test set. You use it after training to see how accurately your AI performs on new, unseen examples. It’s like giving the AI its final exam!
- Rich Variety (Diversity): This might be the most important ingredient for creating fair and robust AI! Your examples within each class need to be varied.
  - Lesson Example (Face Masks): If training AI to detect masks, you need pictures of: different mask types and colors; people of different genders, ages, and ethnicities (reflecting Uganda’s diversity!); different backgrounds and lighting (inside, outside, bright light, dim light); different head angles; and people close up and far away.
  - Local Example (Cassava Disease): If classifying cassava leaves as ‘Healthy’ vs. ‘Diseased’, you need photos of healthy leaves AND diseased leaves from: different plants, different angles, different times of day (lighting), maybe even different soil types or locations if possible.
- Why Diversity Matters (Avoiding Bias): If your training data isn’t diverse, your AI model will likely be biased. It will only be good at recognizing the kinds of examples it was trained on.
  - The mask example: A model trained only on light-skinned men with blue masks might fail completely for a dark-skinned woman with a kitenge mask. That’s not useful or fair!
  - Think about potential biases in Uganda: regional differences, urban vs. rural settings, different accents if using sound data, varying skin tones in image data. Your dataset MUST reflect the diversity of the people or situations your AI will encounter.
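The quantity, balance, and diversity checks above can be sketched in a few lines of Python. This is a minimal illustration, not part of the lesson: the 50-per-class minimum comes from the lesson, while the `MAX_IMBALANCE` threshold, the function names, and the `lighting` attribute are assumptions made for the example.

```python
from collections import Counter

MIN_PER_CLASS = 50    # minimum suggested in the lesson
MAX_IMBALANCE = 1.5   # assumption: largest class at most 1.5x the smallest


def check_dataset_health(labels, min_per_class=MIN_PER_CLASS,
                         max_imbalance=MAX_IMBALANCE):
    """Return a list of warnings about quantity and balance."""
    counts = Counter(labels)
    if not counts:
        return ["dataset is empty"]
    warnings = []
    # Quantity: every class needs enough examples.
    for name, n in sorted(counts.items()):
        if n < min_per_class:
            warnings.append(
                f"{name}: only {n} examples (need at least {min_per_class})")
    # Balance: compare the largest class with the smallest.
    smallest, largest = min(counts.values()), max(counts.values())
    if largest > max_imbalance * smallest:
        warnings.append(
            f"unbalanced: largest class has {largest}, smallest has {smallest}")
    return warnings


def coverage_by_class(examples, attribute):
    """Diversity: count examples per class for each value of one attribute."""
    table = {}
    for ex in examples:
        table.setdefault(ex["label"], Counter())[ex[attribute]] += 1
    return table


# The matooke/nakati situation described above.
labels = ["matooke"] * 100 + ["nakati"] * 10
for warning in check_dataset_health(labels):
    print(warning)

# Hypothetical cassava photos tagged with the time of day they were taken.
photos = [
    {"label": "Healthy", "lighting": "morning"},
    {"label": "Healthy", "lighting": "morning"},
    {"label": "Diseased", "lighting": "evening"},
]
print(coverage_by_class(photos, "lighting"))
```

If one class only ever appears under one lighting value, as in the tiny cassava sample, that is a hint to collect more varied photos before training.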
Part 2: What Kind of Data Does Your AI Eat? (Data Types)
What form will your data take? Choose the type that fits your project:
- Images: Photos, drawings, diagrams (e.g., classifying crop types, identifying plastic waste, recognizing landmarks).
- Text: Written words (e.g., analyzing sentiment in customer feedback SMS, classifying news headlines by topic, translating text).
- Sound: Audio recordings (e.g., recognizing simple voice commands in Luganda, identifying different bird calls, classifying cough sounds for health screening).
- Numbers: Numerical data (e.g., predicting rainfall based on sensor readings, classifying customers based on spending habits, analyzing survey results with number scales).
What data type makes sense for the AI component of your ICT Club project?
Part 3: Where to Find Your Data? (Gathering Methods)
Okay, you know what kind of data you need and that it needs to be plentiful, balanced, and diverse. But where do you get it?
- From Your Community: Collect data directly from the people or environment your project aims to help. This often leads to the most relevant data for local problems.
  - Methods:
    - Taking pictures or recording videos (e.g., photos of market produce, videos of traffic flow).
    - Recording sounds (e.g., local bird songs, spoken phrases in a local language).
    - Conducting surveys with specific questions (like your user research!).
    - Interviews to gather text data.
  - CRITICAL ETHICS: Informed Consent! If you collect any data from people (their photo, voice, opinions, personal info), you MUST get their clear permission first. Explain what data you’re collecting, how you’ll use it (for your ICT Club project), how you’ll protect it, and that they can refuse. Respect people’s privacy and rights. Get permission from parents for minors, and potentially community leaders (LCs, elders) for wider community activities.
- Public Datasets: Use existing datasets available online. Websites like Kaggle, Google Dataset Search, government portals (like the Uganda Bureau of Statistics, UBOS), or academic sites sometimes have data you can use.
  - Pros: Can be very large and save you collection time.
  - Cons: Might not be specific to your exact Ugandan context, and might have their own biases you need to be aware of. (The lesson links to a tutorial on using Kaggle.)
- Using Sensors: Collect data automatically using electronic sensors connected to a microcontroller board (like Arduino) or a small single-board computer (like Raspberry Pi).
  - Examples: A camera taking pictures automatically, a microphone recording ambient sound, a temperature/humidity sensor, an air quality sensor, a GPS module (like the Location Sensor in phones).
  - Note: This might require buying hardware and learning more about electronics, so it could be more advanced for some teams. (The lesson links to resources on sensors/microcontrollers.)
Which data gathering method seems most feasible and appropriate for your project? Often, it might be a combination!
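For data collected from people, the informed consent rule above can also be enforced when you organize your files. The sketch below is one possible way to do it, not a method from the lesson: the `Record` fields, the anonymous `person_id` values, and the file names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Record:
    file_name: str
    person_id: str                  # hypothetical anonymous ID, not a real name
    consent_given: bool
    is_minor: bool = False
    guardian_consent: bool = False  # permission from a parent or guardian


def usable(record):
    """A record is usable only with consent, plus guardian consent for minors."""
    if not record.consent_given:
        return False
    if record.is_minor and not record.guardian_consent:
        return False
    return True


# Made-up entries for illustration only.
records = [
    Record("rider_001.jpg", "P01", consent_given=True),
    Record("rider_002.jpg", "P02", consent_given=False),
    Record("rider_003.jpg", "P03", consent_given=True,
           is_minor=True, guardian_consent=True),
    Record("rider_004.jpg", "P04", consent_given=True, is_minor=True),
]

approved = [r.file_name for r in records if usable(r)]
print(approved)
```

Only the approved files would go into the dataset. The others should be deleted rather than kept "just in case".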
Part 4: Let’s Plan YOUR Dataset! (Activity – 45 mins)
Now, let’s get specific about your project’s AI data.
Your Goal: Use the Dataset Planning Worksheet from the lesson to outline your data strategy.
Tool: The Dataset Planning Worksheet (make a copy or use it as a guide).
Process – Discuss and Write Down:
- What data will you collect? (Be specific: “Photos of boda-boda riders with/without helmets,” “Audio recordings of ‘yes’/‘no’ in Lusoga,” “Text descriptions of symptoms”).
- What are your classes/labels? (The categories you want the AI to predict: e.g., “Helmet”, “No Helmet”; “Yes_Lusoga”, “No_Lusoga”; “Malaria_Symptoms”, “Flu_Symptoms”, “Other”).
- Where will you get the data? (Community, Public Dataset, Sensors? Be specific: “From boda stages in Jinja,” “Download from Kaggle dataset X,” “Use phone microphone”).
- How exactly will you collect it? (Describe the process: “Ask riders for permission, take photos with phone,” “Record teammates saying words,” “Extract text from online health forum”). Remember ethical considerations!
- How many examples per class? Aim for balance and at least the minimum of 50 per class (e.g., “Target 100 photos for ‘Helmet’, 100 for ‘No Helmet’”).
- Test Set Plan: How many examples will you reserve for testing? (e.g., “Keep 20 photos from each class aside for testing”).
Be realistic about what you can achieve in the ICT Club timeline, but ensure you plan for quantity, balance, and diversity!
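Once the worksheet is filled in, the numbers can be written down as a small data structure so the team can see how many examples go to training and how many are reserved for testing. This is only a sketch: the `ClassPlan` fields and the `summarize` helper are illustrative names, and the 20% test share is one value from the 10-20% range mentioned earlier.

```python
from dataclasses import dataclass

MIN_PER_CLASS = 50  # minimum suggested in the lesson


@dataclass
class ClassPlan:
    label: str            # the class the AI should predict
    source: str           # where the examples will come from
    target_examples: int  # how many examples the team aims to collect


def summarize(plan, test_percent=20):
    """Work out training and test counts for each planned class."""
    rows = []
    for c in plan:
        held_out = c.target_examples * test_percent // 100
        rows.append({
            "label": c.label,
            "train": c.target_examples - held_out,
            "test": held_out,
            "meets_minimum": c.target_examples >= MIN_PER_CLASS,
        })
    return rows


# The helmet example from the worksheet questions above.
plan = [
    ClassPlan("Helmet", "boda stages in Jinja", 100),
    ClassPlan("No Helmet", "boda stages in Jinja", 100),
]
for row in summarize(plan):
    print(row)
```

With 100 photos per class and a 20% test share, each class keeps 80 photos for training and 20 for testing, which matches the "keep 20 photos from each class aside" example.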
Part 5: Data is Power, Use it Responsibly!
Creating the dataset gives you, the developers, enormous influence over how your AI behaves. A well-planned, healthy dataset is the foundation for an accurate, fair, and useful AI model. A poor dataset leads to poor results – remember Garbage In, Garbage Out (GIGO).
Keep your collected data organized and safe. Label it clearly. And always keep your training data separate from your test data until it’s time for the final evaluation!
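One simple way to keep the two sets apart is to split the labelled examples once, class by class, and store the results separately. The sketch below uses only Python's standard library; the function name, the fixed `seed`, and the file names are assumptions for illustration. Splitting each class separately keeps the test set as balanced as the full dataset.

```python
import random
from collections import defaultdict


def split_by_class(examples, test_fraction=0.2, seed=42):
    """Split (item, label) pairs into train and test lists, class by class."""
    by_label = defaultdict(list)
    for item, label in examples:
        by_label[label].append(item)

    rng = random.Random(seed)  # fixed seed so the split is repeatable
    train, test = [], []
    for label in sorted(by_label):
        items = by_label[label][:]
        rng.shuffle(items)
        # Reserve at least one example per class for testing.
        n_test = max(1, round(len(items) * test_fraction))
        test += [(item, label) for item in items[:n_test]]
        train += [(item, label) for item in items[n_test:]]
    return train, test


# Hypothetical file names: 100 photos per class.
examples = [(f"helmet_{i:03d}.jpg", "Helmet") for i in range(100)]
examples += [(f"no_helmet_{i:03d}.jpg", "No Helmet") for i in range(100)]

train_set, test_set = split_by_class(examples)
print(len(train_set), len(test_set))
```

After the split, the test examples should stay untouched until the final evaluation.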
Part 6: Quick Review (Key Terms)
- Datasets: Collections of data (images, text, sound, numbers) used to train AI.
- Healthy Dataset: Has good Quantity, Balance, Diversity, and a separate Test Set.
- Sensor: A device that detects changes in its environment (light, sound, temperature, etc.).
- Microcontroller: A small computer (like Arduino) often used with sensors.
- Bias (in AI): When an AI model performs unfairly or inaccurately for certain groups due to unrepresentative training data.
- Informed Consent: Getting clear permission from people before collecting data about them, explaining how it will be used and protected.
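The definition of bias above can be turned into a quick check on the test set: measure accuracy separately for each group and compare. The sketch below is an illustration only. The group names and the results are made up, and `accuracy_by_group` is a hypothetical helper, not something from the lesson.

```python
from collections import defaultdict


def accuracy_by_group(results):
    """results: list of (group, true_label, predicted_label) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, truth, predicted in results:
        total[group] += 1
        if truth == predicted:
            correct[group] += 1
    return {group: correct[group] / total[group] for group in total}


# Invented test results, for illustration only.
results = [
    ("urban", "Helmet", "Helmet"),
    ("urban", "No Helmet", "No Helmet"),
    ("urban", "Helmet", "Helmet"),
    ("urban", "No Helmet", "No Helmet"),
    ("rural", "Helmet", "No Helmet"),
    ("rural", "No Helmet", "No Helmet"),
    ("rural", "Helmet", "No Helmet"),
    ("rural", "No Helmet", "Helmet"),
]

for group, accuracy in accuracy_by_group(results).items():
    print(group, accuracy)
```

A large gap between groups suggests the training data did not represent one of them well, so the fix usually starts with collecting more examples from that group.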
Part 7: More Resources
The lesson links to helpful resources:
- A Wikipedia list of sensors.
- A video about microcontrollers.
- A tutorial on using public datasets from Kaggle.
Conclusion
Webale Kunoonyereza! (Thank you for researching/planning!) Planning your dataset is a critical thinking exercise. It forces you to consider not just the technical aspects but also the ethical implications of building AI. A thoughtfully collected dataset is your best tool for creating an AI solution that genuinely helps your community in Uganda.
Now, complete that worksheet and start thinking about how you’ll carefully gather your data! Musigale bulungi! (Stay well!)