Trying to classify tweets containing the word "job" was probably more ambitious than I realized.
My project's objective is to create signals for trading algorithms. Unfortunately, "job" is one of those common words that can have multiple meanings. References to Steve Jobs or quotes from the Book of Job are not relevant for building early economic indicators. There are also references to paint jobs, and adult references like this one:
My grandma gave gary paulson a hand job, we got a ton of free books
— Phil (@BigCountryPhil) January 24, 2013
Even when "job" refers to employment, things can get complicated:
Blogging like it's my job
— Daniel Haran (@danielharan) June 23, 2013
Any Bayesian classifier will correctly treat "my" and the bigram "my job" as strong evidence that this tweet uses "job" in the employment sense rather than one of the meanings above. However, the author's job isn't actually the subject of the sentence; the tweet tells us nothing about how they feel about their employment.
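To be concrete, this is roughly the kind of classifier I mean. A minimal sketch using scikit-learn, where the toy tweets, labels, and feature settings are my own illustrations rather than my real data:

```python
# Minimal naive Bayes sketch (scikit-learn); the training tweets and
# labels below are toy examples, not my actual dataset.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_tweets = [
    "I love my job at the bakery",
    "hate my job, want to quit",
    "RIP Steve Jobs, a true visionary",
    "that paint job on his car is sick",
]
train_labels = ["employment", "employment", "other", "other"]

# Unigrams plus bigrams, so "my job" becomes a feature alongside "my" and "job".
model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2), lowercase=True),
    MultinomialNB(),
)
model.fit(train_tweets, train_labels)

# "my" and "my job" only occur in the employment class, so this comes
# back as 'employment' -- the right word sense, but the wrong subject.
print(model.predict(["Blogging like it's my job"]))
```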
To get a sense of people's sentiment towards their jobs, I need to parse each sentence to ensure "job" is the subject, and to verify that "job" means employment. As if that weren't complicated enough, sentiment analysis tools seem to be mostly lacking, perhaps because they generally apply naive Bayesian approaches to a problem that requires word sense disambiguation before classification.
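For illustration, here is a rough sketch of the subjecthood check I have in mind, using spaCy's dependency parser. spaCy and its small English model are stand-ins I'm using for the example, not a commitment to a particular tool:

```python
# Sketch of a subjecthood check using spaCy's dependency parser.
# spaCy and "en_core_web_sm" are illustrative choices, not my pipeline.
import spacy

nlp = spacy.load("en_core_web_sm")

def job_is_subject(text: str) -> bool:
    """Return True if 'job' appears as a nominal subject in the text."""
    doc = nlp(text)
    return any(
        tok.lemma_.lower() == "job" and tok.dep_ in ("nsubj", "nsubjpass")
        for tok in doc
    )

print(job_is_subject("My job is killing me"))       # expected True: 'job' parsed as nsubj
print(job_is_subject("Blogging like it's my job"))  # expected False: 'job' is not the subject
```

This only handles the syntactic half of the problem; deciding whether "job" means employment in the first place still needs a disambiguation step in front of it.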
Sentiment analysis can be comically broken. Consider this gem:
@JGend19 no I hate not doing shit with my life. I need a job Asap been home two days without a job and I already feel like a low life.
— trill*bish (@LMG_XXO) January 24, 2013
According to Datasift's sentiment analysis API, the above tweet is *neutral*. On some of my test subsets, error rates range from 50% to 70%.
My next step is assessing various natural language processing APIs to see which ones can properly classify my (hand-labelled) gold data.
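Roughly, that comparison will look like the loop below. Here `classify_fn` is a stand-in for whatever call each vendor's API actually exposes, and the CSV format is a placeholder of my own:

```python
# Sketch of the evaluation harness: score each API against the same
# hand-labelled gold set. `classify_fn` and the CSV layout are assumptions.
import csv

def error_rate(classify_fn, gold_path: str) -> float:
    """Fraction of hand-labelled tweets the classifier gets wrong."""
    total = wrong = 0
    with open(gold_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):  # expects 'text' and 'label' columns
            total += 1
            if classify_fn(row["text"]) != row["label"]:
                wrong += 1
    return wrong / total if total else 0.0

# Usage: one wrapper function per vendor (hypothetical names), all scored
# on the same gold file:
# for name, fn in {"datasift": datasift_sentiment, "other": other_api}.items():
#     print(name, error_rate(fn, "gold_tweets.csv"))
```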