In theory, the test error cannot be pushed below what is called the Bayes error, which in fields where natural human perception is strong (such as NLP and Vision) is often approximated by human-level error. In Time Series, however, it is difficult to predict how far the gap between training and test error can actually be reduced. The following steps are what I suggest; they all come down to inspecting the model's bias & variance in each experiment and then applying techniques to improve the model:
0. Use an experiment tracking tool: Start by organizing all your experiments with MLOps tools such as WandB and MLflow, which let you log metadata (such as cross-validation results) and save models as artifacts. I prefer Weights & Biases, whose Sweeps feature lets you run many experiments with grid search or Bayesian optimization to maximize a defined metric on your cross-validation for HPO (see the sketch below). Note: do not waste your time over-tuning model parameters during HPO; it is usually wiser to invest in data-centric approaches instead.
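As a minimal sketch of this workflow (the project name, metric, and model settings below are just placeholders, not from the original post), you can log each experiment's cross-validation score to W&B so that a Sweep can later optimize it:

```python
import numpy as np
import wandb
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Hypothetical tabular time series features X and target y
X, y = np.random.rand(500, 10), np.random.rand(500)

run = wandb.init(project="ts-experiments",                # example project name
                 config={"n_estimators": 300, "max_depth": 6})

model = RandomForestRegressor(n_estimators=run.config.n_estimators,
                              max_depth=run.config.max_depth,
                              random_state=0)

# Time-ordered CV folds; log the mean score so a Sweep can maximize it
scores = cross_val_score(model, X, y,
                         cv=TimeSeriesSplit(n_splits=5),
                         scoring="neg_mean_absolute_error")
wandb.log({"cv_mae": -scores.mean()})
run.finish()
```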
1. Start with simple models: Avoid starting with irrelevant or overly complicated models. Begin with simple baselines and monitor their bias and variance. If you observe underfitting, move to models that can capture non-linear relationships and work well with tabular time series data, such as Random Forest and XGBoost (see the sketch below). Avoid jumping directly to complicated RNN models like LSTMs, which became popular mainly through NLP applications and have generally not performed well in tabular time series competitions.
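One practical way to read bias vs. variance is to compare training and validation error of a simple baseline against a stronger non-linear model. A sketch, assuming you already have a tabular lag-feature matrix X and target y (the random arrays below only stand in for real data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Hypothetical lag-feature matrix X and target y, already time-ordered
X, y = np.random.rand(1000, 12), np.random.rand(1000)
split = int(len(X) * 0.8)            # simple time-based split, no shuffling
X_tr, X_va, y_tr, y_va = X[:split], X[split:], y[:split], y[split:]

for name, model in [("linear baseline", LinearRegression()),
                    ("random forest", RandomForestRegressor(n_estimators=300, random_state=0))]:
    model.fit(X_tr, y_tr)
    tr_err = mean_absolute_error(y_tr, model.predict(X_tr))
    va_err = mean_absolute_error(y_va, model.predict(X_va))
    # High train error           -> high bias (underfitting)
    # Low train, high val error  -> high variance (overfitting)
    print(f"{name}: train MAE={tr_err:.3f}, val MAE={va_err:.3f}")
```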
2. Address overfitting: Once you solve the underfitting problem, you may reach a model that can learn non-linear relationships in the training data. At this point, it might exhibit high variance and overfit the training data. There are several ways to mitigate overfitting:
- Add more training data or use data augmentation techniques; for example, a 2017 Kaggle winning solution used a denoising autoencoder (DAE) for tabular data augmentation and representation learning.
- Apply regularization: L1 and L2 penalties (reg_alpha and reg_lambda in XGBoost, respectively) penalize large weights and coefficients; a sketch follows below.
- Early stopping, Dropout, and reducing the learning rate on plateau are other techniques commonly used for neural networks.
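To illustrate the regularization point, here is a minimal XGBoost sketch with L1/L2 penalties and early stopping on a time-ordered validation split; the parameter values are arbitrary, not tuned:

```python
import numpy as np
from xgboost import XGBRegressor

# Hypothetical tabular time series features; replace with your own lag features
X, y = np.random.rand(1000, 12), np.random.rand(1000)
split = int(len(X) * 0.8)
X_tr, X_va, y_tr, y_va = X[:split], X[split:], y[:split], y[split:]

model = XGBRegressor(
    n_estimators=2000,
    learning_rate=0.05,
    reg_alpha=0.5,             # L1 penalty
    reg_lambda=2.0,            # L2 penalty
    subsample=0.8,             # row subsampling also reduces variance
    colsample_bytree=0.8,
    early_stopping_rounds=50,  # recent xgboost versions take this in the constructor;
                               # older ones expect it in fit()
)
model.fit(X_tr, y_tr, eval_set=[(X_va, y_va)], verbose=False)
print("best iteration:", model.best_iteration)
```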
3. Use ensemble methods: Combine multiple models using techniques like soft voting.
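If your target is framed as classification (e.g., direction of the next move), soft voting simply averages the predicted probabilities of several base models. A minimal scikit-learn sketch with arbitrary base learners:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical tabular features and a binary target (e.g., "price goes up next step")
X, y = np.random.rand(500, 10), np.random.randint(0, 2, 500)

ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
        ("knn", KNeighborsClassifier()),
    ],
    voting="soft",   # average predicted probabilities instead of hard class votes
)
ensemble.fit(X, y)
proba = ensemble.predict_proba(X[:5])
```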
4. Blending & stacking: Implement blending and stacking techniques to leverage the strengths of different models.
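And a stacking sketch with scikit-learn's StackingRegressor; the base models and meta-model below are placeholders, and for time series you may want to pass an order-preserving splitter to cv instead of plain k-fold:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from xgboost import XGBRegressor

X, y = np.random.rand(500, 10), np.random.rand(500)

# Base learners produce out-of-fold predictions; a simple meta-model combines them
stack = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=200, random_state=0)),
        ("xgb", XGBRegressor(n_estimators=300, learning_rate=0.05)),
    ],
    final_estimator=Ridge(alpha=1.0),
    cv=5,   # consider an order-preserving splitter (e.g., TimeSeriesSplit) for time series
)
stack.fit(X, y)
```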
5. Advanced time series representations: Explore advanced methods such as signature kernels and wavelets to create better features and representations of your data.
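As one simplified example of a wavelet-based representation (the wavelet name, decomposition level, and summary statistics are arbitrary choices, and PyWavelets is assumed to be installed), each sliding window of a univariate series can be summarized by per-band energy features:

```python
import numpy as np
import pywt  # PyWavelets: pip install PyWavelets

def wavelet_features(window: np.ndarray, wavelet: str = "db4", level: int = 3) -> np.ndarray:
    """Summarize one window by the energy and std of each wavelet band."""
    coeffs = pywt.wavedec(window, wavelet, level=level)
    feats = []
    for c in coeffs:                    # [approximation, detail_level, ..., detail_1]
        feats += [np.sum(c ** 2), np.std(c)]
    return np.array(feats)

# Hypothetical sliding windows of a univariate series, shape (n_windows, window_len)
windows = np.random.rand(100, 64)
X_wavelet = np.vstack([wavelet_features(w) for w in windows])
print(X_wavelet.shape)   # (100, 2 * (level + 1)) features per window
```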
6. Advanced tabular ML models: Look into newer models like GRANDE, which combines the advantages of tree-based models and neural networks. Note that if you want to use models such as RF, XGBoost, or GRANDE on time series problems, you must first reshape the series into a tabular (samples × features) form, for example with sliding-window/lag features, as sketched below.
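A minimal sketch of that shape transform, i.e., a plain sliding-window/lag construction for a univariate series (lag count and horizon are arbitrary):

```python
import numpy as np

def sliding_window_tabular(series: np.ndarray, n_lags: int, horizon: int = 1):
    """Turn a univariate series into a (samples, n_lags) feature matrix and a target
    that lies `horizon` steps ahead of the last lag."""
    X, y = [], []
    for t in range(n_lags, len(series) - horizon + 1):
        X.append(series[t - n_lags:t])       # the last n_lags observations
        y.append(series[t + horizon - 1])    # the value to predict
    return np.array(X), np.array(y)

series = np.sin(np.linspace(0, 20, 500)) + 0.1 * np.random.randn(500)
X, y = sliding_window_tabular(series, n_lags=24, horizon=1)
print(X.shape, y.shape)   # (476, 24) (476,)
```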
7. Improved time-series CV: You can use more advanced time-series cross-validation techniques such as purging and embargoing, which are usually used in quantitative finance to avoid leakage between folds.
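A deliberately simplified sketch of the purge & embargo idea: drop training samples that fall within a gap immediately before and after each test fold so overlapping information cannot leak. Fold sizes and gap lengths are placeholders, and production versions in quantitative finance handle label horizons more carefully:

```python
import numpy as np

def purged_cv_splits(n_samples: int, n_splits: int = 5, purge: int = 10, embargo: int = 10):
    """Simplified purged & embargoed CV; indices are assumed to be time-ordered."""
    fold_size = n_samples // (n_splits + 1)
    for k in range(1, n_splits + 1):
        test_start = k * fold_size
        test_end = min(test_start + fold_size, n_samples)
        # train on data before the test fold minus a purge gap,
        # plus data after it minus an embargo gap
        train_idx = np.concatenate([
            np.arange(0, max(0, test_start - purge)),
            np.arange(min(n_samples, test_end + embargo), n_samples),
        ])
        test_idx = np.arange(test_start, test_end)
        yield train_idx, test_idx

for tr, te in purged_cv_splits(1000, n_splits=4):
    print(len(tr), len(te), te[0], te[-1])
```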
BY 🧑💻Cyber.vision🧑💻
