With the Bayesian padding, I think you’re missing something if you pull toward the group mean rather than toward the expectation given the sample size you have. This is particularly true for three-point shooting: a guy with zero attempts across a full season of minutes certainly wouldn’t be expected to be an average shooter.
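To make the concern concrete, here’s a toy beta-binomial-style sketch. All of the numbers and the attempt-rate-to-prior mapping are hypothetical illustrations, not your actual padding scheme; the point is just the contrast between shrinking toward the league mean and shrinking toward a prior conditioned on attempt volume:

```python
# Illustrative only: hypothetical numbers, not the model's actual padding scheme.
def padded_3pt(makes, attempts, prior_mean, prior_weight=50):
    """Beta-binomial-style padding: shrink observed 3P% toward a prior."""
    return (makes + prior_mean * prior_weight) / (attempts + prior_weight)

# Naive padding: everyone shrinks toward the league mean (~35%),
# so a 0-for-0 player comes out looking exactly average.
naive = padded_3pt(0, 0, prior_mean=0.35)  # -> 0.35

# Conditional padding: the prior reflects what 0 attempts in ~1000 minutes
# implies about shooting ability (toy mapping from attempt rate, made up here).
def prior_from_attempt_rate(attempts, minutes, league_mean=0.35):
    rate = attempts / max(minutes, 1)              # 3PA per minute
    return league_mean * min(1.0, 0.3 + 7.0 * rate)

cond = padded_3pt(0, 0, prior_mean=prior_from_attempt_rate(0, 1000))  # -> 0.105
```

Same padding machinery in both cases; the only difference is whether the prior mean is unconditional or a function of attempt volume.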
Also, honestly, I’m a bit confused by the choice of binary player-vs-player fitting. Obviously the margin of difference matters a ton.
Good stuff! A few questions/comments:
What are you using to define your draft prospect pool for each class? E.g., why those 110 players for 2026 and those 80 players for 2025?
Do you use exact birthdate/age in the model, or just class / years in college as a proxy for age? Burries, for example, is the age of an older sophomore despite being a freshman, and the exact age methodology will have an outsized impact on any model projection for players like him.
By "playtype frequencies" do you mean Synergy play types or something else? Does the model have access to Synergy data?
How are you defining BPM share when a team's minute-weighted average BPM is negative?
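For illustration, here’s a two-player toy case under one plausible reading of the definition (my guess, not necessarily yours) where a negative team total flips the signs:

```python
# Assumed definition (my guess): share_i = minutes_i * bpm_i / sum_j(minutes_j * bpm_j)
team = [("A", 1000, 4.0), ("B", 1000, -6.0)]   # (player, minutes, BPM)
denom = sum(m * b for _, m, b in team)          # -2000.0: negative team total
shares = {name: m * b / denom for name, m, b in team}
# The better player (A) gets share -2.0; the worse player (B) gets +3.0.
```

If that is roughly the definition, a negative denominator makes the shares uninterpretable without some extra handling.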
Do you have a version that does not include the scouting consensus features? That would be interesting to look at.
“Gradient boosting regression to a target like WAR has a few limitations, one of them being that the projections don’t exceed the training set ceiling.” - this is true for random forests but not for gradient boosting, right? Since the trees are fit sequentially to residuals, the sum of many trees is not constrained to the range of the training targets.
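A tiny contrived example (toy data I made up) where a two-stage boosted model predicts above the training maximum, while a random forest cannot:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

X = np.array([[0.0], [1.0], [2.0]])
y = np.array([0.0, 0.1, 10.0])

# Two depth-1 boosting stages: stage 1 fits the big split (x < 1.5), and
# stage 2's residual split (x < 0.5) lumps x=1 and x=2 together, adding a
# positive correction that pushes the x=2 prediction past y.max() = 10.
gbm = GradientBoostingRegressor(n_estimators=2, learning_rate=1.0, max_depth=1)
gbm_pred = gbm.fit(X, y).predict([[2.0]])[0]   # ~10.025 > 10.0

# Random forest predictions average leaf means of y, so they can never
# exceed the largest training target.
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf_pred_max = rf.fit(X, y).predict(X).max()    # <= 10.0
```

The overshoot here is small, but it shows the mechanism: each boosting stage adds a leaf-mean of residuals, and nothing caps the running sum at the training ceiling the way averaging does for a forest.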
I’m still unclear on why you took the pairwise comparison approach instead of predicting a target variable directly, especially since your pairwise comparisons were comparing 7-year EPM WAR between each pair of prospects anyway. Direct 7-year EPM WAR projections would be much more interpretable than the current formulation of PRISM scores.