Do you know how large-scale blockbusters are made? The process includes carefully selected locations, professional equipment, actors, camera operators, lighting specialists, and an entire crew to recreate each scene precisely. In the world of AI, data creation works the same way. It mirrors this cinematic process, but instead of entertaining audiences, the goal is to produce the “frames” required for algorithms to learn effectively.
According to Cognilytica, 80% of AI development isn’t about the actual training but about data preparation: creating, collecting, annotating, and processing data. When real-world data is insufficient at one of these stages, data creation steps in. The more realistic and diverse the “scene,” the smarter the AI becomes.
Keymakr’s Head of Project Management, Dennis Sorokin, shares insights into the importance, process, challenges, and real-world applications of Data Creation.
What is Data Creation?
Data Creation is the process of generating custom image and video datasets tailored to specific project needs. These datasets should accurately reflect real-world scenarios. Data Creation is becoming increasingly popular due to rising demands for data quality and volume, especially in automotive, medicine, security systems, sports, and retail. Companies invest in data creation to improve model accuracy and performance.
Data Creation is typically used when real-world data is unavailable or insufficient. This process may include:
- Augmenting Existing Datasets: Modifying conditions, adding objects, or increasing variability. Companies can purchase existing datasets and have them annotated by specialized companies.
- Synthetic Data Generation: Using software tools to create images, texts, or videos for model training. For example, software can generate images or videos based on a given scenario. However, synthetic data has limitations: it is generated based on predefined parameters and lacks the natural variability of real data. As Dennis Sorokin explains, “In real-world tasks, especially when accuracy above 99% is required, synthetic data doesn’t provide the needed quality. A system with even a 0.1% error rate could misidentify hundreds of people in an airport or cause dangerous situations on the road. That’s why custom scenarios are crucial.”
- Creating Data for Edge Cases: Capturing images and videos in unique scenarios for model reliability. For complex tasks, real data is essential. For example, to train a model to recognize driver unconsciousness, at least 1,000 videos with different people simulating this condition are required. Participants are given simple instructions like “pretend to lose consciousness” without specifying how. One person might slump their head, another might close their eyes, and another might lean sideways. This natural variability makes real data incredibly valuable, significantly improving model training accuracy.
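The 0.1% error rate mentioned in the quote above can be made concrete with a little arithmetic. The sketch below is illustrative only: the figure of 200,000 identity checks per day is an assumption chosen for the example, not a number from Keymakr, and `expected_errors` is a hypothetical helper.

```python
def expected_errors(checks: int, error_rate: float) -> float:
    """Expected number of misidentifications for a given number of checks."""
    return checks * error_rate


# Assumption: a busy airport performs 200,000 identity checks per day.
# This figure is illustrative and does not come from the article.
DAILY_CHECKS = 200_000

for error_rate in (0.01, 0.001, 0.0001):
    errors = expected_errors(DAILY_CHECKS, error_rate)
    print(f"error rate {error_rate:.2%}: about {errors:.0f} misidentifications per day")
```

Under that assumption, a 0.1% error rate still means roughly 200 misidentified people every day, which is why accuracy requirements above 99% are common in such projects.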
Use Cases for Data Creation
Keymakr’s portfolio includes numerous shoots for diverse projects, each with unique requirements — from equipment and cameras to actors and locations across Europe, America, and Canada. “Understanding all project nuances is essential to deliver unique solutions. This process truly resembles directing a Hollywood film and is highly engaging. Any scenario is solvable as long as it aligns with ethical, moral, and legal standards,” says Sorokin.
In-Cabin Projects
One example is projects focused on detecting driver distractions. Keymakr has developed a range of scenarios to simulate common distraction behaviors, such as:
- Using mobile phones while driving
- Frequently checking the rear-view mirror instead of focusing on the road
- Lighting cigarettes or using lighters
- Drinking from bottles or through a straw
- Wearing hats that obscure their faces, making it difficult for models to identify them
These scenarios were modeled under controlled conditions with dozens of participants. For one project, over 5,000 short videos, each 1 to 5 minutes long, captured participants performing various distracting activities. This enabled the system to recognize behavioral patterns and respond appropriately to unusual situations.
Armed Attack Recognition
Data creation is often used for AI models focused on office security. One recent project involved scenarios simulating:
- The appearance of an armed person threatening hostages
- The transfer of weapons between individuals
- Shooting incidents and injured victims
Training the model required over 3,000 videos showcasing various combinations of aggressive behavior, group movements, and object handling.
Security Projects
Keymakr worked on projects for airport security cameras designed to replace border guards. The cameras needed to:
- Recognize faces and match them with passport data
- Automatically control access gates
The project required:
- Data from 5,000 individuals of diverse ethnic backgrounds
- Around 1,000 scenarios under different conditions (low lighting, direct light exposure, bad weather)
- Scenarios where participants covered their faces with their hands, wore glasses, hats, or hoods
A critical aspect was gathering data from specific demographics, such as African Americans over 50 or South Asian individuals. Such niche data isn’t publicly available, underscoring the need for custom Data Creation.
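One way to keep track of requirements like these is to enumerate every combination of capture condition and face covering as a shot list, then check which combinations have not been recorded yet. The sketch below is a minimal illustration, not Keymakr's actual tooling: the category values are taken from the lists above, while the "uncovered" baseline and the function names are assumptions.

```python
from itertools import product

# Values taken from the project requirements; "uncovered" is an assumed baseline.
CONDITIONS = ["low lighting", "direct light exposure", "bad weather"]
FACE_COVERINGS = ["uncovered", "hands", "glasses", "hat", "hood"]


def build_shot_list(conditions, coverings):
    """Return every (condition, covering) pair that should be filmed."""
    return list(product(conditions, coverings))


def missing_shots(shot_list, recorded):
    """Return the planned pairs that do not appear among the recorded ones."""
    recorded_set = set(recorded)
    return [shot for shot in shot_list if shot not in recorded_set]


shots = build_shot_list(CONDITIONS, FACE_COVERINGS)
# Hypothetical progress after the first day of filming.
done = [("low lighting", "hat"), ("bad weather", "hood")]
print(f"{len(missing_shots(shots, done))} of {len(shots)} combinations still to film")
```

The same idea extends to further axes, such as age group or ethnicity, by adding more lists to the product.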
Medical Data and Virtual Fitness Instructors
Keymakr also creates data for medical projects and virtual fitness instructor systems. While the latter is still emerging, demand is growing, especially with the rise of remote workouts and rehabilitation.
Similar to Xbox Kinect, these systems use sensors to track user movements in real time. Modern technology allows not just motion tracking but detailed analysis of exercise execution. For rehabilitation, precise movements are crucial, such as reaching a fingertip to the shoulder at a specific angle. The system provides feedback, corrects posture, highlights errors, and suggests adjustments.
For one project, Keymakr extensively filmed training sessions, including exercises like lunges, jumps, and leg raises. Around 60 participants performed exercises for 15 minutes each, with continuous recording to gather data for accurate motion annotation. The shoots were physically demanding, even for younger participants, due to repetitive, high-intensity activities.
Medical Studies: Pupil Reaction to Light
For a biometrics company project, Keymakr captured data on pupil reactions to light stimuli using specialized equipment resembling binoculars. The goal was to analyze pupil response times to changing light conditions.
About 200 participants took part. They were thoroughly briefed to ensure the procedure’s safety.
The experiment involved:
- Turning off the lights
- Waiting 30 seconds
- Gradually increasing light
- Analyzing pupil reactions
The study provided valuable data on eye response dynamics, aiding in diagnosing neurological and ocular conditions.
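To illustrate what analyzing pupil response times can look like, the sketch below estimates how long the pupil takes to constrict after the light stimulus begins. It is a simplified example under stated assumptions: the stimulus onset time is known, the pupil diameter has already been extracted for each sample, and a 5% drop from the pre-stimulus baseline counts as a response. The function name, the threshold, and the sample values are all hypothetical and are not data from the project.

```python
from typing import Optional, Sequence


def constriction_latency(
    times_s: Sequence[float],
    diameters_mm: Sequence[float],
    light_on_s: float,
    drop_fraction: float = 0.05,
) -> Optional[float]:
    """Seconds from light onset until the pupil shrinks by drop_fraction
    relative to the pre-stimulus baseline, or None if it never does."""
    baseline_samples = [d for t, d in zip(times_s, diameters_mm) if t < light_on_s]
    if not baseline_samples:
        return None  # no pre-stimulus data, so no baseline to compare against
    baseline = sum(baseline_samples) / len(baseline_samples)
    threshold = baseline * (1.0 - drop_fraction)
    for t, d in zip(times_s, diameters_mm):
        if t >= light_on_s and d <= threshold:
            return t - light_on_s
    return None


# Made-up samples for illustration only (not measured data).
times = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
diameters = [6.0, 6.0, 6.0, 5.9, 5.6, 5.0, 4.5]
latency = constriction_latency(times, diameters, light_on_s=0.2)
print(f"constriction latency: {latency:.2f} s")
```

A production analysis would likely smooth the signal and handle blinks; this version only shows the basic threshold idea.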
The Data Creation Process
Creating quality data is a multi-step process involving careful planning, collection, processing, and delivery. Depending on the task, this process can vary significantly.
Key stages include:
- Defining Objectives: Clarifying model requirements, scenarios, and expected outcomes. The scope of work includes:
- Required data types
- Shooting conditions (lighting, environment, angles)
- Participant demographics (age, gender, ethnicity)
- Equipment (cameras, sensors, devices)
- Annotation methods
- Organizing and Conducting Shoots: The process depends on data type:
- Medical research uses specialized sensors
- Motion analysis employs multi-camera setups
- In-car cameras capture driver/passenger behavior
Before shooting, equipment is checked, scenarios are tested, and participants are briefed. Special attention is paid to creating data in conditions that closely mimic real-world operations. For example, in driver fatigue analysis projects, conditions of long trips are simulated, while in motion sickness studies, passenger state changes are recorded under different movement conditions.
- Data Processing and Annotation: After shooting:
- Filter and select relevant footage
- Adjust image quality (color, lighting, sharpness)
- Annotate key points (eyes, lips, hands, body posture)
- Classify actions (head turns, blinking, phone use)
Both manual methods and automated tools are used for annotation. Sometimes, clients require specific details, such as tracking micro-eye movements in medical research or analyzing hundreds of driver behavior parameters.
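As an illustration of what an annotated frame might contain, the sketch below stores key points and action labels in a small record and serializes it to JSON. The field names, label strings, and coordinates are assumptions made for this example and do not represent Keymakr's delivery format.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Tuple


@dataclass
class FrameAnnotation:
    """Hypothetical per-frame annotation: key points plus action labels."""
    frame_index: int
    # Key point name -> (x, y) pixel coordinates.
    keypoints: Dict[str, Tuple[float, float]]
    # Action classes observed in this frame, e.g. "phone_use".
    actions: List[str] = field(default_factory=list)


record = FrameAnnotation(
    frame_index=120,
    keypoints={"left_eye": (412.0, 230.5), "right_eye": (468.0, 229.0)},
    actions=["phone_use", "head_turn"],
)

# JSON has no tuple type, so coordinates are written out as two-element lists.
payload = json.dumps(asdict(record), indent=2)
print(payload)
```

Keeping key points and action classes in one record per frame makes it straightforward to combine manual labels with output from automated tools.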
- Data Delivery: Final datasets are structured for client use, including:
- Annotated videos
- Labeled images
- Parameter tables with motion characteristics
Issues related to data storage and transfer are also considered. For example, the volume of 4K video from several hours of filming can reach several terabytes, which requires special servers or cloud solutions.
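A rough storage estimate follows directly from the recording bitrate, the duration, and the number of cameras. The sketch below assumes a bitrate of 100 Mbit/s for 4K footage, which is an illustrative value rather than a figure from the article; actual bitrates depend on the camera and codec. `footage_size_gb` is a hypothetical helper.

```python
def footage_size_gb(bitrate_mbps: float, hours: float, cameras: int = 1) -> float:
    """Approximate storage in gigabytes (decimal units: 1 GB = 1000 MB)."""
    seconds = hours * 3600
    total_megabits = bitrate_mbps * seconds * cameras
    total_megabytes = total_megabits / 8  # 8 bits per byte
    return total_megabytes / 1000


# Assumed bitrate for 4K recording; real values vary by camera and codec.
BITRATE_MBPS = 100

one_hour = footage_size_gb(BITRATE_MBPS, hours=1)
full_shoot = footage_size_gb(BITRATE_MBPS, hours=8, cameras=4)
print(f"1 camera, 1 hour: {one_hour:.0f} GB")
print(f"4 cameras, 8 hours: {full_shoot / 1000:.2f} TB")
```

Under these assumptions a single hour comes to tens of gigabytes, and a multi-camera shoot lasting several hours reaches the terabyte range, which matches the scale described above.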
Challenges in Data Creation
When providing data creation services, it’s essential to consider not only the technical limitations but also the legal and ethical aspects of working with data.
“In the world of data, where every detail matters, it’s not enough to just create data; it’s crucial to ensure its accuracy, diversity, and compliance with ethical standards. Without this, the entire process loses its value and risks distorting reality,” says Dennis Sorokin.
- Diversity of Participants
Depending on the project, participants may need to come from different age groups, genders, nationalities, and skin tones. In some cases, participants with specific characteristics are required, such as elderly individuals for medical studies, people showing various facial expressions for emotion analysis, or individuals with particular physiological traits for biometric systems.
Finding suitable participants in different regions can be challenging. Sometimes, the ‘casting’ process can take weeks or even months to secure enough participants from different communities to create truly varied datasets.
- Data Volume and Technical Limitations
Capturing high-quality video requires substantial storage and data transfer resources. For example, one hour of 4K video can take up several tens of gigabytes. Special cameras, such as infrared or thermal ones, can produce even more data. If multiple cameras are used in the project, the total data volume can increase to several terabytes. Organizing the workflow requires powerful equipment and carefully planned logistics, from efficient data transfer to annotation and delivery to clients.
- Ethical and Legal Challenges
Data creation raises several ethical and legal concerns, especially when it involves collecting information containing images of people, biometric data, or actions in public places. From an ethical perspective, all participants in the filming must provide informed consent for their data to be used by signing the necessary documents. Confidentiality also plays a pivotal role: people must not be identifiable when the client does not require identification, and data protection standards must be followed. Another pressing issue is data manipulation: artificial modeling or staged scenes must closely reflect reality to prevent information distortion and algorithmic bias.
From a legal standpoint, the primary challenge lies in protecting personal data. Regulations such as the GDPR in Europe and the CCPA in California set strict guidelines for data collection and processing, including participants’ rights to request the removal of their data. There are also restrictions on using collected data for commercial purposes: information gathered for one project cannot always be resold or used in other research without participants’ consent. Furthermore, laws around public filming differ from country to country: some places allow filming people without their consent, while others require specific permissions, especially when the data is used for commercial or research purposes. Adhering to ethical standards and legal requirements is a key aspect of data handling, helping to mitigate risks and ensuring that information is used appropriately and safely.
Conclusions
Dennis Sorokin believes that data creation remains a highly sought-after field, particularly in projects requiring specific video materials that cannot be found in the public domain. “Whether you’re training AI for next-gen transportation, analyzing consumer behavior in stores, or pushing the boundaries of medical research, the key is staying flexible, precise, and aligned with what clients need,” he says. Despite the challenges, this field continues to evolve, finding applications across various industries and gaining increasing attention and demand.