Discussion about this post

Morgan

Back in the day, I led some research efforts on neural-network speech recognition. One approach that looked promising (mostly pursued in labs other than ours) was to train a large model on many hours of speech, then use that model to generate synthetic data for training a much smaller model. While the very "deep" and large models were better at learning, the smaller ones were nearly as good, could be quite shallow, and required much less computation at inference time. In our lab we also used such methods simply to improve our speech-recognition models with the same networks. So I can see how it could be helpful. But there is no question that more real data, especially in something as time-varying as the political landscape, would beat synthetic data, which probably just fills in a few holes in the models.
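The distillation recipe Morgan describes can be reduced to a toy sketch. Everything here is illustrative: the "teacher" is a made-up fixed function standing in for a large trained network, and the "student" is a single linear unit fit by plain gradient descent; real systems would use actual neural nets on real speech features.

```python
import random

# Hypothetical "teacher": stands in for a large model trained on many
# hours of speech. Here it is just a fixed nonlinear function.
def teacher(x):
    return 0.8 * x + 0.1 * x * x

# Step 1: have the teacher label synthetic inputs.
random.seed(0)
synthetic_x = [random.uniform(-1.0, 1.0) for _ in range(2000)]
synthetic_y = [teacher(x) for x in synthetic_x]

# Step 2: fit a much smaller "student" (a single linear unit, w*x + b)
# to the teacher's outputs by batch gradient descent on squared error.
w, b = 0.0, 0.0
lr = 0.1
n = len(synthetic_x)
for _ in range(500):
    gw = gb = 0.0
    for x, y in zip(synthetic_x, synthetic_y):
        err = (w * x + b) - y
        gw += err * x
        gb += err
    w -= lr * gw / n
    b -= lr * gb / n

# The student is far cheaper at inference time and tracks the teacher
# closely on the region the synthetic data covers.
```

The point of the sketch is the division of labor: the expensive model is queried offline to manufacture training targets, and only the cheap student runs at inference time.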

John

One way that these methodologies could exceed the performance of traditional polling not addressed here is in handling non-response bias. For example, while many demographics may be very hard to reach in a traditional poll, they may still have an observable footprint in other ways such as social media interaction, purchasing decisions, and news consumption. I find it quite plausible that a highly semantically aware model like an LLM could semi-reliably glean how people in the demographic might vote based on that data. It definitely won’t be perfect but I don’t find it hard to believe that it will be better than noise or no data at all.

So while I agree that claiming these models know people better than they know themselves is quite preposterous, and there will certainly be significant sources of error, I think it is entirely reasonable that these approaches can draw on far more data than traditional polling, which could result in superior performance. What are your thoughts on this?
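The non-response correction John gestures at can be illustrated with a toy post-stratification sketch. All numbers are invented: three demographic cells with assumed census shares, poll-based support estimates for the two reachable cells, and a model-imputed estimate (the stand-in for an LLM reading auxiliary signals) for the unreachable one.

```python
# Hypothetical population shares for three demographic cells.
census_share = {"cell_a": 0.5, "cell_b": 0.3, "cell_c": 0.2}

# Candidate-A support: polled cells come from respondents; cell_c is
# hard to reach, so its estimate is imputed from auxiliary signals
# (social media, purchases, news consumption) by some model.
polled = {"cell_a": 0.55, "cell_b": 0.48}
imputed = {"cell_c": 0.40}

# Post-stratified estimate: weight each cell by its population share,
# using the model's guess where the poll has a hole.
support = {**polled, **imputed}
estimate = sum(census_share[c] * support[c] for c in census_share)

# Naive alternative: drop the unreachable cell and renormalize,
# which biases the estimate toward the reachable cells.
naive = sum(census_share[c] * polled[c] for c in polled) / sum(
    census_share[c] for c in polled
)
```

Under these made-up numbers the naive estimate overstates support relative to the post-stratified one, which is the mechanism by which even a noisy model-based imputation can beat "no data at all", so long as its error is smaller than the bias from ignoring the cell.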

