Automated ML regression within a Cloud Function
Inferring the Google Trends index for unemployment searches in Spain, 4 weeks ahead.
Updated weekly on Tuesdays. (Toy project, not a production-grade ML project.)
More GDELT projects and ETLs => https://github.com/albertovpd/analysing_world_news_with_Gdelt
LinkedIn => https://www.linkedin.com/in/alberto-vargas-pina/
Background by Lukas from Pexels
I have scheduled the gathering and processing of all the data in this other project.
Sample of the result (there are more than 120 columns):
Low-variance feature removal:
Columns that are mostly constant over time tend not to offer valuable information. These are the ones removed this week, according to their low variance:
After that, the data is standardized.
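A minimal sketch of these two steps with scikit-learn, assuming the weekly data arrives as a pandas DataFrame (the variance threshold here is illustrative, not the project's actual setting):

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler

def preprocess(df: pd.DataFrame, threshold: float = 0.01):
    """Drop near-constant columns, then standardize the survivors."""
    # Flag columns whose variance falls below the (illustrative) threshold.
    selector = VarianceThreshold(threshold=threshold)
    reduced = selector.fit_transform(df)
    kept_cols = df.columns[selector.get_support()]
    removed_cols = df.columns[~selector.get_support()]

    # Standardize the remaining features to zero mean and unit variance.
    scaled = StandardScaler().fit_transform(reduced)
    return pd.DataFrame(scaled, columns=kept_cols, index=df.index), list(removed_cols)
```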
Outlier removal.
The Cloud Function available in the repo loads data to Cloud Storage, and also uploads images to a bucket configured to serve public URLs.
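A rough sketch of that upload step with the google-cloud-storage client (the bucket name is a placeholder, not the project's real configuration):

```python
from google.cloud import storage

def upload_image(local_path: str, blob_name: str,
                 bucket_name: str = "my-public-plots") -> str:
    """Upload a plot image to Cloud Storage and return its public URL."""
    client = storage.Client()  # uses the Cloud Function's default credentials
    blob = client.bucket(bucket_name).blob(blob_name)
    blob.upload_from_filename(local_path)
    # public_url only resolves if the bucket grants public read access.
    return blob.public_url
```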
Outliers are detected with the z-score and then plotted together with non-outliers, thanks to dimensionality reduction (from ~120 dimensions down to 2 and 3).
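A minimal sketch of that idea, assuming standardized data in a NumPy array and using PCA for the reduction (the z-score cutoff of 3 is a common convention, not necessarily the project's choice):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.decomposition import PCA

def plot_outliers_2d(X: np.ndarray, z_cutoff: float = 3.0):
    """Flag rows with any |z-score| above the cutoff and plot them in 2D."""
    z = np.abs(stats.zscore(X))
    is_outlier = (z > z_cutoff).any(axis=1)

    # Project the ~120-dimensional data down to 2 components for plotting.
    X2 = PCA(n_components=2).fit_transform(X)
    plt.scatter(X2[~is_outlier, 0], X2[~is_outlier, 1], label="inliers")
    plt.scatter(X2[is_outlier, 0], X2[is_outlier, 1], label="outliers")
    plt.legend()
    plt.show()
```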
Each week, the ML algorithm selects the best features to work with and delivers the following information (see the sketch below):
Left: selected features (scored from 1 to 10).
Right: performance vs. number of chosen features.
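Both panels can be produced from a recursive feature elimination run; here is a sketch with scikit-learn's RFECV (the estimator, CV settings, and stand-in data are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_selection import RFECV
from sklearn.linear_model import Ridge  # illustrative estimator

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))                            # stand-in for the ~120 features
y = X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=100)   # stand-in target

rfecv = RFECV(estimator=Ridge(), step=1, cv=5, scoring="r2")
rfecv.fit(X, y)

# Left panel: feature ranking (1 = selected; higher = eliminated earlier).
print("Rankings:", rfecv.ranking_)

# Right panel: CV performance vs. number of selected features
# (older scikit-learn versions expose this as rfecv.grid_scores_).
scores = rfecv.cv_results_["mean_test_score"]
plt.plot(range(1, len(scores) + 1), scores)
plt.xlabel("Number of features")
plt.ylabel("Mean CV R²")
plt.show()
```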
Here are the results.
About these pictures:
They are overwritten each week, but Data Studio does not recognize the change; click the corresponding URLs to see the latest version.
- Starting axiom: the evolution of Google Trends keywords can be inferred from what was written in news media, and from other searched keywords, 4 weeks ago.
- Results:
Warning: Significant changes were made on 15th February.
- From that date on, RFECV is used to select the model with the best R² (coefficient of determination) and its chosen features (columns). The selected model is then run through cross_validate to generate the metrics below, with their standard deviation across the CV folds (using greater_is_better=False); see the sketch after this list.
- Before that date, I was studying the differences in measurements between RFECV scored with RMSE and cross_val_predict. Results were normalized.
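A sketch of that scoring step (the estimator, metrics, fold count, and stand-in data are illustrative; with greater_is_better=False, scikit-learn reports the errors as negative values):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import make_scorer, mean_absolute_error, mean_squared_error
from sklearn.model_selection import cross_validate

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))                  # stand-in features
y = X[:, 0] + rng.normal(scale=0.1, size=100)   # stand-in target

def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

# greater_is_better=False makes scikit-learn negate the errors internally.
scoring = {
    "rmse": make_scorer(rmse, greater_is_better=False),
    "mae": make_scorer(mean_absolute_error, greater_is_better=False),
}
cv_results = cross_validate(Ridge(), X, y, cv=5, scoring=scoring)

# Mean and standard deviation of each metric across the CV folds.
for name in scoring:
    folds = -cv_results[f"test_{name}"]
    print(f"{name}: {folds.mean():.3f} ± {folds.std():.3f}")
```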
I am well aware that errors of the same magnitude as the measurements are not something to celebrate. At the moment, these are my results:
A lot of reading!
Not a proper dashboard!