Bootcamp briefing 2023
Motivation
Rare events, high data-acquisition costs and privacy restrictions lead to data scarcity. Every machine learning tuning and optimisation approach eventually reaches its limits; in the end, the data matters most. AI practitioners now have powerful tools for synthesizing data. This task explores the potential of synthetic data: it is an opportunity to investigate how creating our own data can extend AI capabilities and potentially improve machine learning model performance.
Task
Explore the data-augmentation potential of synthetic data: create a mixed dataset (original data plus synthetic data) and measure how an existing, working downstream ML pipeline performs on it.
Approach
Go to Kaggle and find a competition with a tabular dataset.
Take one participant's solution that already includes a working downstream ML implementation. This is your baseline.
Take the data from the competition.
Train a CTGAN model on it and generate a synthetic dataset.
Blend the original data with the synthetic copy and use the result to train the solution you found on Kaggle.
Evaluate the retrained solution and compare it with the baseline.
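The blending step can be sketched in plain Python as follows. This is only an illustration under assumptions, not the example pipeline provided with the task: the function name `blend`, the `synthetic_fraction` parameter and the `is_synthetic` flag are hypothetical, and rows are represented as dictionaries rather than as a DataFrame. The synthetic rows are assumed to have been sampled from a trained CTGAN model beforehand.

```python
import random


def blend(real_rows, synthetic_rows, synthetic_fraction, seed=0):
    """Return all real rows plus enough synthetic rows so that synthetic
    data makes up `synthetic_fraction` of the blended dataset."""
    if not 0 <= synthetic_fraction < 1:
        raise ValueError("synthetic_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_syn / (n_real + n_syn) = f for n_syn.
    n_syn = round(synthetic_fraction * len(real_rows) / (1 - synthetic_fraction))
    if n_syn > len(synthetic_rows):
        raise ValueError("not enough synthetic rows were generated")
    picked = rng.sample(synthetic_rows, n_syn)
    # Keep a provenance flag for analysis (hypothetical column name);
    # drop it before passing the data to the downstream pipeline.
    blended = [dict(row, is_synthetic=False) for row in real_rows]
    blended += [dict(row, is_synthetic=True) for row in picked]
    rng.shuffle(blended)
    return blended
```

Varying `synthetic_fraction` and the seed gives a simple family of blends to compare against the baseline, while the baseline solution itself stays unchanged.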
Limitations
You cannot change the hyperparameters of the original solution.
You can change the hyperparameters of the CTGAN model.
The only thing that may be changed from the Kaggle competition is the dataset, which must be replaced with the "new dataset": a mix of real training data and synthetic data.
The tricky part is blending the real data with the synthetic copy smartly, so that the downstream pipeline produces better predictions.
Some hints for the experiment:
Find groups with few observations in the real data and add synthetic copies so that the downstream machine learning model can generalize better.
Train the CTGAN model with different hyperparameters.
Run several "data blending" approaches in parallel under different conditions.
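The first hint, adding synthetic rows only where real observations are scarce, might look like the sketch below. It is one possible reading of the hint, not a prescribed method: the function name `top_up_rare_classes`, the `label_key` argument and the `min_count` threshold are assumptions made for illustration, and "limited observations" is interpreted here as under-represented values of a single categorical column.

```python
from collections import Counter


def top_up_rare_classes(real_rows, synthetic_rows, label_key, min_count):
    """Add synthetic rows only for classes that have fewer than
    `min_count` real observations, topping each up to `min_count`."""
    counts = Counter(row[label_key] for row in real_rows)
    # Number of synthetic rows each under-represented class still needs.
    deficit = {
        label: min_count - n for label, n in counts.items() if n < min_count
    }
    added = []
    for row in synthetic_rows:
        label = row[label_key]
        if deficit.get(label, 0) > 0:
            added.append(row)
            deficit[label] -= 1
    # Real data is kept in full; only the selected synthetic rows are appended.
    return real_rows + added
```

Classes that never occur in the real data are ignored in this sketch, and a class stays below `min_count` if too few synthetic rows with that label were generated.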
Expected outcome
Methods of mixing original data with synthetic data that:
Have a negative impact on the performance of the downstream ML pipeline, with examples that can be replicated
Have a positive impact on the performance of the downstream ML pipeline, with examples that can be replicated
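To sort blending methods into these two groups, each retrained run can be compared with the baseline score. The helper below is a minimal sketch under assumptions: the name `classify_impact`, the `tolerance` band for treating small differences as neutral, and the `higher_is_better` switch are all hypothetical, and the scores would come from whatever metric the chosen Kaggle competition uses.

```python
def classify_impact(baseline_score, scores, higher_is_better=True, tolerance=0.0):
    """Group blending methods by whether they beat, match, or fall
    below the baseline score of the unchanged solution."""
    result = {"positive": [], "negative": [], "neutral": []}
    for name, score in scores.items():
        delta = score - baseline_score
        if not higher_is_better:
            # For error-style metrics, a lower score is an improvement.
            delta = -delta
        if delta > tolerance:
            result["positive"].append(name)
        elif delta < -tolerance:
            result["negative"].append(name)
        else:
            result["neutral"].append(name)
    return result
```

Because a single run can differ from the baseline by chance, repeating each configuration with several seeds before classifying it makes the reported examples easier to replicate.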
What is given
An example CTGAN pipeline on a tabular dataset that can be reused to generate a synthetic copy.
An example of an MLOps tool (WandB) for tracking the results of your experiments.
Contact
The main contact persons for the project are Max Fediushkin, Anna Chechulina and Mike Shubov from AITAU.
About AITAU
AITAU is a start-up working with artificial intelligence (AI), focusing on synthetic data. Synthetic data is a statistical twin of real data, mimicking its patterns while maintaining privacy and anonymity. Given the growing need for diverse and accurate data, we offer solutions to create, evaluate, and share synthetic data to drive the evolution of data-driven decisions and accelerate AI innovation worldwide. The ultimate goal is to unlock the full potential of synthetic data in data-driven industries.
More about AITAU: www.aitau.org