I’ve been traveling a lot lately and have managed to catch up on a bit of reading while cruising at 30,000 feet. On my Nook right now is a fascinating book that all text miners should at least browse in a bookstore. It’s called “The Secret Life of Pronouns,” by James Pennebaker.
The premise of the book is that your social status, sex, personality, and secret intentions can be determined by analyzing pronouns (I, you, they), articles (a, an, the), and a few other function words. At the beginning of his research, James used the Linguistic Inquiry and Word Count (LIWC) program, but he appears to have since modified it with proprietary word dictionaries.
On the surface, LIWC looks similar to the word frequency counting that RapidMiner does in the Process Documents operator, but the team behind it went further and added a bit more “intelligence” to the analysis. What they did was roll out a fun service called Analyze Words. You just enter your Twitter handle, click the button, and it gives you a snapshot of your tweet sentiment.
So how does this work? I suspect that James and his team use their dictionaries to categorize the words in incoming text documents, then score the results for the author’s sex, social status, personality, and sentiment. I’m sure a lot of hard “up front” work went into building these dictionaries. That kind of up-front work is the norm in text mining, and if you try to take shortcuts, you’ll likely get crappy models.
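To make the dictionary idea concrete, here is a minimal Python sketch of the kind of category counting I suspect is going on. The category names and word lists below are made up for illustration; the real LIWC dictionaries are proprietary and far larger, and I don’t know how the actual scoring is done.

```python
import re
from collections import Counter

# Hypothetical mini-dictionaries, invented for this sketch.
# They are NOT the LIWC dictionaries.
CATEGORIES = {
    "first_person_singular": {"i", "me", "my", "mine", "myself"},
    "second_person": {"you", "your", "yours", "yourself"},
    "third_person_plural": {"they", "them", "their", "theirs"},
    "articles": {"a", "an", "the"},
}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def category_rates(text):
    """Return each category's share of all tokens in the text."""
    tokens = tokenize(text)
    total = len(tokens)
    counts = Counter()
    for token in tokens:
        for name, words in CATEGORIES.items():
            if token in words:
                counts[name] += 1
    # Guard against empty input so we never divide by zero.
    return {
        name: (counts[name] / total if total else 0.0)
        for name in CATEGORIES
    }


if __name__ == "__main__":
    sample = "I think the dog ate my homework and they saw it"
    for name, rate in category_rates(sample).items():
        print(f"{name}: {rate:.3f}")
```

This only produces rates per category. Turning those rates into a guess about personality or status would still need labeled examples to compare against, which is where the hard up-front work comes in.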
Stemming and removing stopwords go a long way here, and with pretrained word embeddings now available, a text model can be built fairly easily.
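As a rough illustration of that preprocessing step, here is a small sketch using only the Python standard library. The stopword list is a tiny made-up sample, and `crude_stem` is a deliberately naive suffix stripper that I wrote for this example; it is a stand-in for a real stemmer such as Porter’s, not an implementation of one.

```python
import re

# Tiny illustrative stopword list (an assumption for this sketch);
# real lists contain a few hundred words.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "was", "it"}

# Suffixes are checked in this order, longest first.
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word):
    """Strip one common suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Tokenize, drop stopwords, and stem whatever is left."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]


if __name__ == "__main__":
    print(preprocess("The miners were mining the documents"))
    # prints ['miner', 'were', 'min', 'document']
```

One thing to keep in mind: standard stopword lists include pronouns and articles, which are exactly the words Pennebaker’s analysis relies on. For a pronoun-style model you would want to trim the stopword list or skip that step for the function-word features.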
I think a model like his could be built fairly easily in RapidMiner or Python, especially if you set up a good crawling and sentiment system to test it against. All it requires is a bit of thought and the will to do it.
Isn’t the data-driven world we live in cool?