Bastille Day

225 years ago, a few folk in Paris stormed a prison where the King could hold anyone he wished without any due process. They dismantled the building and ushered in a ghastly, bloody civil war.

The Statue of Liberty is today one reminder of those ideals, shared with the United States.

France is now on its 5th incarnation as a republic. I was born in one of its colonies, far enough away to be able to see it critically.

In the last few years, I've travelled a lot. I thought of my uncle who fought in "Indochine" while I was in a café in Saigon. I observed a father caring for his daughter and wondered how our countries ever went to war. How we ever justified killing a man like him, or how anyone in my country thought it right and proper to rule over him because of our skin's pigmentation.

In Thailand, a Victory Monument that neither French nor Thai could explain: a war so futile neither wants to teach its young.

I won't burn a French flag (or my French passport), nor will I only celebrate the country for its croissants. I know a lot of French people, and many of them are pretty awesome. I also owe a great debt to its writers and philosophers. Pasteur, too.

However, I can't go waving my flag. Atavistic identity displays terrify me. Nationalistic fervour was an essential element that allowed my uncle to go to war.

So, happy day, France. You're messed up and beautiful.

Effective Altruism

Effective altruism is a growing social movement based around the idea of using evidence and reason to work out how best to make the world a better place. About effective altruism

Some charities are far more effective than others. As Peter Singer says in his TED presentation, training a guide dog costs $40,000, while ensuring someone doesn't become blind because of a preventable disease costs as little as $20-50.

This kind of thinking makes people very uncomfortable, which leads to terrible critiques. One doesn't have to be Spock to think it more effective to save 800 to 2000 people from blindness than to train a single guide dog.
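
The 800 to 2000 range is just the guide dog budget divided by the two ends of the per-person cost. A minimal sketch of that arithmetic, using only the figures quoted above (the function and variable names are mine, for illustration):

```python
# Figures quoted above from Peter Singer's talk.
GUIDE_DOG_COST = 40_000        # dollars to train one guide dog
PREVENTION_COST = (20, 50)     # dollars to prevent one case of blindness


def people_helped(budget, cost_per_person):
    """Whole number of people a budget covers at a given unit cost."""
    return budget // cost_per_person


fewest = people_helped(GUIDE_DOG_COST, max(PREVENTION_COST))
most = people_helped(GUIDE_DOG_COST, min(PREVENTION_COST))
print(f"One guide dog's budget could protect {fewest} to {most} people.")
```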

EAs are not above critique, and good critiques should actually improve their behaviour, as you would expect of rationalists. Even non-Vulcan ones!

I have some cognitive dissonance with regard to EA. Take the Against Malaria Foundation. I thought it was shown that selling the nets was more cost-effective than giving them; but a market-driven approach doesn't feel very altruistic. It also seems as though there's too much emphasis on exploitation over exploration and an over-confidence in their estimates, suggesting a portfolio approach would be more effective.

A lack of emphasis on environmental issues is rather puzzling given how much EAs tend to worry about X-risks.

This suggests that the next frontier for EAs will be building better models. Accepting measures such as disability-adjusted life-years creates a de facto implicit model, one in which an intervention that gave you an extra 10 years with a disability would never be as good as another that extended your healthy life just a single day. Such a ridiculous utility function probably doesn't matter when making a first pass, but we get what we measure; if EAs succeed in shifting people's charitable giving habits, we could quickly be making these kinds of perverse optimizations.

Don't let those critiques deter you from learning more about giving more effectively and researching charities to see which ones do the most good.

Travel advice

A lot of people ask me about travelling, especially in South-East Asia. I shall dispense this advice forthwith.

Stay a bit longer

How much is it costing you to fly there? How much time will it take to get over jet lag? If you're flying to the other side of the world, try and make it worthwhile.

It's worth considering an extra 2 weeks of unpaid vacation and subletting your apartment while you're gone, or having your friends coordinate Airbnb bookings for you.

Getting around

Now that you're a bit less rushed, you can travel a bit more slowly. Take the night train between two cities on your itinerary. AirAsia is ridiculously cheap, but nothing beat the $14 bus ride from Phnom Penh to Ho Chi Minh City with a land crossing. We saw so much more of those countries than we would otherwise have, met interesting people and did I mention how cheap it was?

We also travelled on night buses in Vietnam, another crazy adventure.

Pack light

Every long-term traveller I've met downsized their packs. Having to carry a suitcase means you're a whole lot less mobile. Getting off a bus at 6AM for a day of sightseeing is a whole lot easier with a small pack.

There's almost nothing you absolutely need to bring with you. The only thing I wasn't able to buy everywhere is deodorant. Either pack your own, or do what the locals do and shower twice a day.

You can buy clothes once you get to your destination, and they will be better suited to the climate. It's also more fun to buy something at a market than visit yet another temple. Camping-style clothes that dry fast are worth it; you can hand wash them at night, hang dry and have them ready for the next day.

Electronics? An unlocked phone is a good idea. A small laptop or tablet if you're an addict like me.

Gifts and all that shit you bought? Mail it.

Visas

Not the credit cards (though do carry a spare, warn your bank and write down a number you can call collect *from the countries you'll be in*). Getting into countries can be a drag and depends on your nationality. If possible, get your visas ahead of time; otherwise you can't use some of the cheaper land transport options.

Vaccines and meds

You can get some good vaccines cheaply in Bangkok, and medications are available just about everywhere in SEA without a prescription (there are even pills for erectile function sold on the streets, near prostitutes). It's a good idea to have medications for most common problems (food poisoning, diarrhea, dehydration), or just their names so you can buy them there if needed.

Stay flexible - only book your return ticket

On my last trip I bought a one-way ticket. If you know when you need to be back, buy a return ticket. Regional tickets are usually cheap, even at the last minute.

While we were in Cambodia, the King Father died, and all bars were closed for the national mourning period. People even looked at us funny for daring to smile, so we booked the next bus ticket to Ho Chi Minh City, then found out we didn't have the right visa, rescheduled the bus ticket and went to the embassy in person for a rush visa that cost several times more than the bus ticket. We crossed the border less than 24 hours after our decision to leave.

Floods, revolutions, strikes, outbreaks and typhoons can all happen. Most of the time it's not dangerous, though family and friends might worry. I was in Seoul while media were announcing that nuclear war was imminent, and everyone there knew it was fine.

Sometimes it can be something as simple as rain on the days you had planned to be on the beach in Nha Trang, so you leave early to spend more time in Hoi An where you'll get some cheap shirts tailored.

What to see

There are plenty of websites covering each destination. My only advice is to go to a market, buy all the weird fruits and eat in places outside the banana pancake trail.

Tech meetups: enough with the pizza and beer

Is there any other profession that routinely asks for sponsors to give them beer and pizza money? Can we picture lawyers, doctors or other engineers doing this?

I don't understand why software folks do this. Every damned meetup and hackathon does it. Sometimes we get real fancy and have Subway cater the event.

We can't increase diversity by giving off such a frat house vibe, nor can we market ourselves as competent, highly-paid professionals when we can be so cheaply bribed.

So what should we do about it? 

If you're running a meetup, either stop providing food or ask sponsors for enough cash to afford better snacks. Sponsors can get much better publicity from providing good food; those that have catering for their lunches could have them make extra for the evening's event and get an opportunity to brag about their perks.

As an attendee, ask your meetup organizers and sponsors to get us better food.

Hacking Bixi to save Montreal taxpayers' money

Here are two ideas to optimize Bixi that could save Montreal tens of millions of dollars.

Open a competition to optimize redistribution

Two "bike depots" and half a dozen trucks criss-cross Montreal to move Bixis from full to empty stations. Yet we still see plenty of both full and empty stations, which results in fewer trips and fewer people buying and renewing subscriptions.

Let's call in the artificial intelligence talent that works in gaming companies or universities, and the operations research people that know how to find elegant solutions to this problem. Their goal would be to ensure the highest possible availability: whenever possible, there should be a bike or free dock in every station.

What needs to be opened up? Data about bike trips, without user identifiers. Basically a file with a long list of lines in the form of

"station X, 10:30AM, station Y, 10:48AM"

We also need some cost values for the redistribution, e.g. how long it takes to load and unload bikes.

Specifically, the goal should be software that handles these inputs and produces real-time redistribution recommendations.
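
To make the inputs concrete, here is a minimal sketch of what a first pass over such a file could look like. It parses lines in the format above, works out which stations gained or lost bikes, and greedily pairs them up. Everything here is an assumption for illustration: the function names and sample trips are made up, station names are assumed to contain no commas, and truck travel time and loading costs are ignored, so a real entry to the competition would need far more than this.

```python
from collections import Counter
from datetime import datetime

TIME_FORMAT = "%I:%M%p"  # matches times like 10:30AM


def parse_trip(line):
    """Split 'origin, start, destination, end' into typed fields."""
    origin, start, destination, end = [field.strip() for field in line.split(",")]
    return (origin, datetime.strptime(start, TIME_FORMAT),
            destination, datetime.strptime(end, TIME_FORMAT))


def net_flow(trips):
    """Bikes gained (+) or lost (-) by each station over the given trips."""
    flow = Counter()
    for origin, _, destination, _ in trips:
        flow[origin] -= 1
        flow[destination] += 1
    return flow


def suggest_moves(flow):
    """Greedily pair stations that gained bikes with stations that lost them."""
    surplus = sorted([station, n] for station, n in flow.items() if n > 0)
    deficit = sorted([station, -n] for station, n in flow.items() if n < 0)
    moves = []
    i = j = 0
    while i < len(surplus) and j < len(deficit):
        count = min(surplus[i][1], deficit[j][1])
        moves.append((surplus[i][0], deficit[j][0], count))
        surplus[i][1] -= count
        deficit[j][1] -= count
        if surplus[i][1] == 0:
            i += 1
        if deficit[j][1] == 0:
            j += 1
    return moves


# Made-up sample data in the format described above.
lines = [
    "station A, 10:30AM, station B, 10:48AM",
    "station A, 10:35AM, station B, 10:50AM",
    "station C, 11:00AM, station B, 11:20AM",
    "station B, 11:30AM, station A, 11:45AM",
]
trips = [parse_trip(line) for line in lines]
flow = net_flow(trips)
moves = suggest_moves(flow)
for source, target, count in moves:
    print(f"move {count} bike(s) from {source} to {target}")
```

The interesting work starts where this sketch stops: forecasting demand from the timestamps and weighing each move against the cost values mentioned above.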

Ask mobile users to help

For any trip, there can be several stations near the origin and the destination. A mobile application can highlight the stations where picking up or docking a bike would most reduce the need for redistribution.

How this saves money for Montreal

In 2011, Montreal's city council approved a $108-million bailout package for Société de vélos en libre-service, which administers Bixi and sells the system to other cities. Without government backing, that corporation would be bankrupt. Not only are they not making payments right now, we still don't have audited statements from *last year*.

Our next mayor must enforce some transparency and accountability, and replace some of the leaders at SVLS. I hope the ideas I presented will be considered by city council or in job interviews for new leaders.

Fixing and optimizing Bixi has a leverage effect on our ability to sell the system to more cities, as well as on our local budget. Unfortunately SVLS has been downright hostile to developers who would love nothing more than to make this project succeed.

SVLS estimated Montreal's bike share could break even with 50,000 subscribers. Redistribution is not only an important cost driver, it's also a reason people don't renew their subscriptions.

Finance Post-mortem

TL;DR

  • Abandoning this project
  • Suggest you stay out of the market until P/E ratios fall a bit
  • Some funds are going to blow up

I'm no longer pursuing the creation of signals for trading algorithms, and have come to be very cynical about the idea of mining social signals.

There are a lot of people claiming to have found signal, *even after their pet theory has been debunked*. Every single pattern that has been announced was probably garbage. Practitioners seem to keep making the same classes of mistakes, over-fitting and cherry-picking.

Smart hedge fund types stay as far away as possible from the hucksters. So even if I had something legitimate, they would lump me in with all the chartists and assorted loons.

If some people found something useful, they are not talking about it and are trying to use the advantage in-house. Any edge will wear off with competition, and if it was chance they might even blow up. Odds are high we will see a few hedge funds go under in the next 2 years.

Socially-mined signals may not correlate well with financial metrics because the real economy is increasingly disconnected from our financial system. Our market P/E ratios are quite high by historical standards, so staying out or shorting the entire market is probably the best idea, without requiring sentiment analysis voodoo. Of course, fund managers won't tell you this.

Although I wanted to assess various NLP APIs, I decided to stop this project shortly afterwards and didn't do a thorough job of it. Early results were abysmal.

Avoid Hadoop: a beginner's checklist for big data reporting

A lot of companies are trying to get value out of "big data". Most go through a period of panic and flailing around using ill-adapted tools. In consulting engagements or during the sales process, the same themes tend to come back.

You might have an idea something's wrong when your web logs are filling up your database, reports patched together from SQL queries take minutes to generate and your team is now considering a web-scale NoSQL data store.

It's common for great developers to waste weeks or months on a flawed approach. I'm hoping this list helps some of my readers avoid that fate.

(If all of this already sounds old hat to you, go read Memory Architecture Hacks, from which I've taken the Hadoop criteria below.)

0: Avoid or delay using Hadoop

Hadoop is only useful when you don't need real-time results, data is too big to ever fit in memory and the map-reduce algorithm is well-suited to the task, e.g. the output of the map is not bigger than the size of input. Unfortunately many people spend weeks learning this new framework, without realizing it forces them to solve their problems in unnatural ways.

One special case that must be mentioned is search. If MySQL or Postgres isn't up to the task, that's no reason to reach for Hadoop when Lucene will do the trick. And by "do the trick" I mean it will sip resources on a single machine, returning results faster than a dozen machines painstakingly configured with HDFS and Hadoop.

There's a similar rush towards using various NoSQL data stores even though flat files can be perfectly adequate.

1: Shrink it, Cache it, Fudge it

Ask how accurate the reports need to be.

Sample: sometimes 5-10% of the data can get a good approximate result.
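
A minimal sketch of the sampling idea, assuming the report is a simple count and that a uniform random sample is acceptable. The function name and the synthetic log are made up for illustration; in practice you would usually sample when the data is written rather than when it is read.

```python
import random


def estimate_count(records, predicate, rate=0.1, seed=0):
    """Estimate how many records match `predicate` by testing only a sample."""
    rng = random.Random(seed)
    hits = sum(1 for record in records if rng.random() < rate and predicate(record))
    return hits / rate  # scale the sampled count back up


# Synthetic log: one HTTP status per request, one error in every five requests.
status_codes = [500 if i % 5 == 0 else 200 for i in range(100_000)]

exact = sum(1 for code in status_codes if code == 500)
approx = estimate_count(status_codes, lambda code: code == 500, rate=0.1)
print(f"exact: {exact}, estimated from a 10% sample: {approx:.0f}")
```

The estimate gets noisier as the thing you are counting gets rarer, which is why the first question is how accurate the report needs to be.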

Save computations: one of my first reports was total visits to a website. An extra table saved totals for arbitrary ranges, replacing an expensive sum with a handful of selects and a bit of code overhead.
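
One way such a totals table can work is with running totals, sketched below. This is an assumption about the general technique, not a description of the table I actually built: the daily counts are made up and a plain list stands in for the extra table.

```python
from itertools import accumulate

daily_visits = [120, 95, 143, 210, 180, 75, 60]  # made-up counts, one per day

# Built once and extended as new days arrive:
# running[i] holds the total visits for days 0 .. i-1.
running = [0] + list(accumulate(daily_visits))


def visits_between(first_day, last_day):
    """Total visits for first_day..last_day inclusive, using two lookups."""
    return running[last_day + 1] - running[first_day]


print(visits_between(2, 4))  # same value as sum(daily_visits[2:5])
```

Any range now costs two lookups no matter how many days it spans, at the price of keeping one extra number per day.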

Fudge it: while visits are easy to sum, uniques seemed to be a harder problem. Fortunately Bloom filters are useful when counting set intersections. You can set the trade-off between accuracy and size.
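
For the curious, here is a minimal Bloom filter sketch used to count uniques approximately. It is illustrative only: the class and function names are mine, the sizing uses the standard textbook formulas, and the visitor IDs are made up. Because a Bloom filter can report false positives but never false negatives, the count can come out slightly low but never high.

```python
import hashlib
import math


class BloomFilter:
    def __init__(self, expected_items, error_rate):
        # Standard sizing: number of bits and number of hash functions.
        self.size = math.ceil(-expected_items * math.log(error_rate) / math.log(2) ** 2)
        self.hashes = max(1, round(self.size / expected_items * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, item):
        # Derive all bit positions from one digest (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.size for i in range(self.hashes)]

    def add(self, item):
        """Insert item; return True if it was (probably) already present."""
        seen = True
        for position in self._positions(item):
            byte, bit = divmod(position, 8)
            if not self.bits[byte] & (1 << bit):
                seen = False
                self.bits[byte] |= 1 << bit
        return seen


def count_uniques(visitor_ids, expected_items, error_rate=0.01):
    """Approximate number of distinct IDs, without storing the IDs themselves."""
    seen = BloomFilter(expected_items, error_rate)
    return sum(1 for visitor in visitor_ids if not seen.add(visitor))


# 5000 visits from 1000 distinct (made-up) visitors.
visits = [f"user{i % 1000}" for i in range(5000)]
uniques = count_uniques(visits, expected_items=1000)
print(uniques)
```

Lowering `error_rate` buys accuracy with more bits, which is the accuracy versus size trade-off mentioned above.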

2: Buy more RAM

It's often cheaper to buy more RAM than to spend engineering cycles. $200-1000 sometimes speeds up your reports by keeping your entire data set in memory instead of doing disk IO. In-memory databases are much faster than disk-based ones.

3: Avoid or speed up IO

Store the memory-mapped representation of your problem. Create smaller pre-processed records by stripping fields you won't use. Gzip the files. Save indexes.

On the hardware side, there are SSDs, RAID arrays and networked RAM.

4: Cram it into memory

Remove unused fields, replace large strings with indexes, or use sparse matrices - whatever is needed to fit everything into RAM.
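
As an illustration of replacing large strings with indexes, here is a minimal sketch of a column that stores each distinct string once and keeps only a small integer per row. The class name and sample values are made up, and it assumes the column has few distinct values relative to its number of rows (user agents, country names, URLs).

```python
from array import array


class StringColumn:
    """Stores a column of repetitive strings as small integer codes."""

    def __init__(self):
        self.codes = array("I")  # one unsigned int per row
        self.values = []         # code -> string, each distinct string stored once
        self._lookup = {}        # string -> code

    def append(self, value):
        code = self._lookup.get(value)
        if code is None:
            code = len(self.values)
            self._lookup[value] = code
            self.values.append(value)
        self.codes.append(code)

    def __getitem__(self, row):
        return self.values[self.codes[row]]

    def __len__(self):
        return len(self.codes)


agents = StringColumn()
for agent in ["Mozilla/5.0 (X11; Linux x86_64)", "curl/7.30.0"] * 50_000:
    agents.append(agent)

print(len(agents), "rows,", len(agents.values), "distinct strings")
print(agents[1])
```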


The above aren't universally applicable, although they should solve the majority of the pain most teams encounter when first venturing into the world of big data.

Quoting Steve Jobs like it's my job

Trying to classify tweets containing the word "job" was probably more ambitious than I realized. 

My project's objective is to create signals for trading algorithms. Unfortunately, "job" is one of those common words that can have multiple meanings. References to Steve Jobs or quotes from the Book of Job are not relevant to building early economic indicators. There are also references to paint jobs, and adult references like these:

Even when job refers to employment, things can get complicated:

Any Bayesian classifier will correctly consider "my" and the bigram "my job" as strong evidence that this tweet is talking about employment rather than one of the above meanings. However, it is not the subject of the phrase.

To get a sense of people's sentiment towards their jobs, I need to parse sentences to ensure job is the subject and verify that job means employment. As if that wasn't complicated enough, sentiment analysis tools seem to be mostly lacking, maybe because they are generally using naive Bayesian approaches for something that requires word sense disambiguation prior to classification.

Sentiment analysis can be comically broken. Consider this gem:

According to DataSift's sentiment analysis APIs, the above tweet is *neutral*. On some of my test subsets, error rates range from 50 to 70%.

My next step is assessing various natural language processing APIs to see which ones can properly classify my (hand-labelled) gold data.

Twitter "jobs"

People tweet about how much they love or hate their job, or how they're looking for or need a second job. Basically, a lot of data that could be part of a consumer confidence index or an early indicator for the job market.

Unfortunately, tweets containing the word "job" are mostly job offers written by bots. To tease out the signal in this stream, I first used a naive Bayesian classifier to separate out bots from humans. As expected, there is a pattern to the tweeting: humans don't talk much about their jobs on weekends.
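
For readers who haven't met one, here is a minimal sketch of a naive Bayesian text classifier with add-one smoothing. It is not the code I ran: the class is a bare-bones illustration, the tokenizer just splits on whitespace and the training tweets are made up.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of tweets
        self.vocabulary = set()

    @staticmethod
    def tokenize(text):
        return text.lower().split()

    def train(self, text, label):
        self.doc_counts[label] += 1
        for word in self.tokenize(text):
            self.word_counts[label][word] += 1
            self.vocabulary.add(word)

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            # log P(label) + sum of log P(word | label), with add-one smoothing
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in self.tokenize(text):
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(self.vocabulary)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


classifier = NaiveBayes()
# Made-up training tweets, hand-labelled.
classifier.train("Job opening software engineer in Austin apply now http://example.com", "bot")
classifier.train("Hiring nurse job in Boston apply now http://example.com", "bot")
classifier.train("New job posting sales manager apply today", "bot")
classifier.train("I love my job so much today", "human")
classifier.train("ugh I hate my job", "human")
classifier.train("need a second job to pay rent", "human")

print(classifier.classify("I really hate my job today"))
print(classifier.classify("Hiring engineer job in Boston apply now"))
```

With real data you would want thousands of labelled tweets and a held-out set to measure the error rate.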

Since bots are trying to get clicks, I conclude that people are least likely to click on job ads on a Saturday and least likely to be job hunting on a Thursday.

Here are some pretty graphs, courtesy of Google Docs:

If you want access to the raw data or have ideas for how to use Twitter to build indicators, please get in touch.

Next up: teasing out the different meanings of "Job" and "Jobs". Steve Jobs, Book of Job, attaboys... few occurrences of "job" are actually employment-related.

March, April: dancing in Malaysia, Thailand, Vietnam, China, Japan

10 countries in 9 months have left me a bit overwhelmed and ready to go back "home".

Nestled in the high mountains of rural Japan, I'm taking the time to record some of the craziness of the last 5 countries in 2 months.

SEA JAM in Kuala Lumpur was fantastic, and reminded me of how much I love Lindy Hop. This event was followed by a stop in Thailand, then off to Nha Trang for Vietnam Lindy Exchange, Seoul for Camp Swing It and Beijing's Great Wall Swing Out.

72 hours in China cured me of any desire to learn Chinese. The remaining choices are Korean and Japanese; basically a choice between living in a place with great dancing or another with a fascinating culture.

Learning

I couldn't keep up the Coursera classes in places with bad bandwidth, and got very frustrated with the deadlines and tests. Just like cars were initially horseless carriages, online classes are university classes without accreditation yet with all the trappings of the old system.

If there's no accreditation, why am I getting a mark? Why all these exams? It's natural to have test problems to verify one's understanding. I should be able to do 1 problem, see if I got the answer right, and get a chance to try a similar problem if I didn't solve it. Right now Intro to Finance has 10 questions in a test, and you don't get the right answers. This keeps the cheaters out, but what's the point of cheating in an online class if there are invigilated exams for those wanting verifiable credentials?

Fitness

Too much moving around means no steady access to a gym. When I did have access, they didn't have a squat rack, so I tried the leg press, which is a lot easier.

Software

Just a little bit of coding gave me an important insight: one leitmotif has been wanting tools to handle data streams on the web as easily as on a command line.

May

I'll fly from Tokyo to San Francisco on the 1st, then to Montreal on the 8th. Besides a conference on May 3rd, the only plan is to reconnect with friends and network my way back into my field after a long absence.