Google IO Leadership Talk



I attended this talk at the Google IO conference. The speakers delivered a very insightful and even entertaining talk to a packed room.


See the live wave discussion here.


    Leadership talk

    The speakers built the Subversion SCC.

    Leaders naturally emerge. Plus, it has to be done and people give them that role.

    Manager –> leader. Management is left over from 19th century industrial revolution. The wrong model for creative environments.

    Not desirable for engineers – you feel like you do nothing. Human router – just talk to people.

    So what is a ‘leader’? A leader is there to serve. Element of trust. Promote technical and social health – need to balance both. But you can have more impact through the people you lead.


    Be everyone’s friend.

    Be involved, be social, have lunch with the team. You also have to deliver bad news and push in another direction.

    Treat reports like children

    People will behave the way you treat them. IT departments often treat people as if they had to be caged, to prevent them from doing wrong.


    It is basically about distrust. People will get used to it and expect to be told every detail, and otherwise do nothing.

    Hire pushovers

    You need a team that can pull itself.

    Compromise when hiring

    Ignore low performers

    Nothing more aggravating. It will cause the best people in the team to leave. Don’t wait too long. If you step in early you can manage them up – or out.

    Ignore human problems

    People problems are the hardest. Compilers are so much more consistent. Simple empathy is sometimes all that’s needed. Pay attention to the happiness level of your team. Are they working too much? Are they successful?

    "The compliment sandwich": it’s a lie. Give constructive feedback, but be tactful.

    Temporary lapse of integrity – means you have no integrity.

    Be a wise zen master

    Sometimes people don’t want to hear your solution. Get them emotionally more focused.

    Lose the ego.

    Don’t be egotistical. Trust your team. Listen.

    Appreciate inquiry

    If someone is questioning what you do – listen. Answer. They are probably not questioning your abilities but are just interested.


    Be human – we make mistakes. It increases respect. But don’t be a doormat. It’s a fine line.

    Get your hands dirty


    Ask people to help you.

    Seek to replace yourself

    Automate. Empower the team, make yourself unnecessary.

    Make waves

    Sometimes you think that a problem will just go away. Sometimes ignoring it is unproductive.

    Shield your team

    Allow them to focus on what they need to get done.

    Succeed and fail as a team

    Don’t blame Joe – take it as a team. If someone messes up you know, the team knows, you talk about it in private and prevent it in the future. Praise publicly and criticize privately. Be tactful.

    Be a catalyst

    Make things happen. Know when to take risks, and be there to take responsibility for failures.

    Be a teacher and a mentor

    Don’t do it for them. Failure is an option, just not over and over again. Example: a manager loses $10m; the CEO says – I just spent $10m training you…

    Set clear goals

    Focus the mind. You’ll find out that people think different things. Write up a mission statement. Have the conversation. Helps to handle distractions.

    Track happiness and careers

    Ask employees if they are happy – they might just tell you. "What do you need from me?"


    Your job is to find out what people need. Water, sunlight and bullshit.



    Self directed

    Sweet spot


    Dan Pink – carrot/stick vs intrinsic motivation. Get them to care about what they are working on.

    Intrinsic motivation:

    Autonomy: flexibility, let them get their work done, e.g. Google 20% time.

    Mastery: opportunity to learn and grow. The same thing over and over is frustrating. Don’t burn people out by having them do the same thing over and over.

    Sense of purpose: sense of ownership, stake. Let them have a voice.

    Managing your manager – help me help you. How to make yourself easier to manage.

    Act like a grownup – don’t expect to be micro-managed. Get work done on time. Don’t expect to be nagged to get your work done.

    Pursue responsibility – step out of your comfort zone.

    Allow mistakes – write post-mortem, learn. Move on.

    Talk – let people know what you think. Don’t go away and be frustrated in your cube.

    Point out obstacles – tell mgr about problems you see.

    Argue! Find what the problems are. Don’t be a yes man.

    3 concepts:

  1. Serve people
  2. Mutual respect
  3. Motivations (not carrot/stick)
Posted in Uncategorized | Leave a comment

Dare 2b Digital Debrief


82 parents

40 speakers

20 staff

242 young women (some name tags were left). High satisfaction rate.


Two negative reviews: class/instructor went too fast. No one helped when they needed assistance. The problem might have been proctor training.


Too much waiting in line for the keynote.

Girls did not like food, parents did. Potato chips were expired. Fruit was missing.

Kodu had computer issues.

Some girls went into workshops that had filled up. Giveaways ran out (mostly at Pixar).

Lunch – girls wanted to explore campus, get out of the lunchroom.

A lot said workshops too short. Want more workshops.

Many want to come back next year.

First keynote too long. Interviewer questions were a bit disconnected. Parents didn’t like keynote. High-school girls more positive.

Raffle too long and some felt it was unfair. Parents’ names were in there too.

Last keynote too complicated. Sound quality not good. Too NASA/science heavy.

Most want to go next year.

Volunteer Feedback

Volunteers needed badges to be identifiable.

Parking signs weren’t up early enough.

Attendee lists were not sorted correctly.

Shirts: get extra shirts next time. Especially for volunteers. Nina mostly had issues with speaker shirts. There were not enough.

Plastic print on back caused sweat-stain.

Orientation with balloons: make a main street and mark it with one color. Put other colors on hot spots to the side and mark everything on the map.

Getting kids to their groups after keynote was a problem. It was hard for parents to find their groups.

The parents’ classes were very cookie-cutter, the same old.

Tammy: need better way to map class name to room number. It was hard to match them.

Software loading: 45% failure rate. Didn’t install. Incorrectly installed. In the end presenters didn’t use what they requested. It is important for the presenters to test it.

The cameras weren’t compatible with the computers, plug-wise. Victor saved the day by hooking it up to the TV.

It was very hard to get requirements from the speakers.

Parents Feedback

Want counselors from high school. Want for girls and boys.

Financial workshop targeted for high school. Overview was missing.

Couldn’t find restrooms. They were removed from map.

Better advertising to schools needed. Was not advertised properly. –> we need a network and more pre-work.

There was a good spread of schools. Space was scarce in the end. 

How to keep girls in the loop for follow-up and next year?

Comments on the web site and fan page.

Post Facts and numbers.

Mercury News was on the way but got called to an earthquake. Radio interview with Anne & Ruth. Bloggers at the conference.

Re-list sponsors and give them feedback and thanks.

Interest from women in Toronto, CAN. Elizabeth Vanderbelt.

Have to send out a short email to maintain the connection. Come back to work as proctor.

Following up with parents would be a good idea. Maybe better to use the Facebook fan page. Allow them to post code, robot pictures, etc.

FB web page is mostly adults.  Next time need to prepare post-conference social group ahead of time.

Future events: Invent your future – conference. Has student track. Kenny can organize event with a week’s notice.

Next time

There is definitely a demand.

Where? SF, SAC, Southern CA, Israel.

Not-for-profit? Different legal structure. Ruth thinking about it. Gather information.


C-Media CMI8788 Windows 7 Driver for Razer Barracuda AC-1

I have one of those nice Razer audio cards that I bought off of woot a while ago for real cheap. Razer has been really slow in coming out with Windows 7 drivers. The Vista ones work, but I like to have the latest.

I saw that other vendors had come out with their drivers, but they don’t install with the Razer card. The reason is that the INF files that help install the driver do not contain the necessary hardware IDs. My AC-1 has the ID VEN_13F6&DEV_8788&SUBSYS_09101A58. By googling around I found the hardware vendor’s generic drivers labelled CMI8788 driver for Windows 7 version 8.17.77 date 2010-02-09 that support a list of cards, but not the AC-1.

I looked at the supplied INF files and found in SoftwareDriver/Driver/CMPCIP0.INF the hardware IDs of the other cards supported. I just added the AC-1’s IDs here like:

%CMI8788.DeviceDesc%=CMPCI,    PCI\VEN_13F6&DEV_8788&SUBSYS_878813F6
%CMI8788.DeviceDesc%=CMPCI,    PCI\VEN_13F6&DEV_8788&SUBSYS_000113F6
%CMI8788.DeviceDesc%=CMPCI,    PCI\VEN_13F6&DEV_8788&SUBSYS_001013F6
%CMI8788.DeviceDesc%=CMPCI,    PCI\VEN_13F6&DEV_8788&SUBSYS_021610B0
%CMI8788.DeviceDesc%=CMPCI,    PCI\VEN_13F6&DEV_8788&SUBSYS_021810B0
%CMI8788.DeviceDesc%=CMPCI,    PCI\VEN_13F6&DEV_8788&SUBSYS_021910B0
%CMI8788.DeviceDesc%=CMPCI,    PCI\VEN_13F6&DEV_8788&SUBSYS_A017147A
%CMI8788.DeviceDesc%=CMPCI,    PCI\VEN_13F6&DEV_8788&SUBSYS_09101A58

%CMI8788.DeviceDesc%=CMPCIX64,    PCI\VEN_13F6&DEV_8788&SUBSYS_878813F6
%CMI8788.DeviceDesc%=CMPCIX64,    PCI\VEN_13F6&DEV_8788&SUBSYS_000113F6
%CMI8788.DeviceDesc%=CMPCIX64,    PCI\VEN_13F6&DEV_8788&SUBSYS_001013F6
%CMI8788.DeviceDesc%=CMPCIX64,    PCI\VEN_13F6&DEV_8788&SUBSYS_021610B0
%CMI8788.DeviceDesc%=CMPCIX64,    PCI\VEN_13F6&DEV_8788&SUBSYS_021810B0
%CMI8788.DeviceDesc%=CMPCIX64,    PCI\VEN_13F6&DEV_8788&SUBSYS_021910B0
%CMI8788.DeviceDesc%=CMPCIX64,    PCI\VEN_13F6&DEV_8788&SUBSYS_A017147A
%CMI8788.DeviceDesc%=CMPCIX64,    PCI\VEN_13F6&DEV_8788&SUBSYS_09101A58

Each of these entries seems to point to one of the other CMPCIP?.INF files. I copied the last one to CMPCIP7.ini and patched the hardware IDs there to match mine:


%CMI8788.DeviceDesc%=CMPCI,    PCI\VEN_13F6&DEV_8788&SUBSYS_09101A58

%CMI8788.DeviceDesc%=CMPCIX64,    PCI\VEN_13F6&DEV_8788&SUBSYS_09101A58

I uninstalled the Vista drivers, rebooted, installed the new ones with the supplied setup.exe and rebooted again. The machine came up with sound working and the CMI sound panel.
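The manual edit above can also be scripted. This is only a minimal sketch, assuming the INF has already been read as plain text and that every supported card shows up as a `%CMI8788.DeviceDesc%=<section>, PCI\...` line like the ones listed above. The function name `add_hardware_id` is made up for illustration and is not part of any driver tooling.

```python
import re

# Matches lines such as:
#   %CMI8788.DeviceDesc%=CMPCI,    PCI\VEN_13F6&DEV_8788&SUBSYS_878813F6
# group(1) is the prefix up to the hardware ID, group(2) the install section name.
ENTRY = re.compile(
    r"^(%CMI8788\.DeviceDesc%=(\w+),\s*)PCI\\VEN_13F6&DEV_8788&SUBSYS_\w+\s*$"
)


def add_hardware_id(inf_text, new_id):
    """Append an entry for new_id after the last entry of each install section.

    Hypothetical helper: it only understands the entry layout shown in this post.
    Running it twice does not add the ID a second time.
    """
    lines = inf_text.splitlines()
    out = []
    for i, line in enumerate(lines):
        out.append(line)
        match = ENTRY.match(line)
        if not match:
            continue
        following = ENTRY.match(lines[i + 1]) if i + 1 < len(lines) else None
        if following and following.group(2) == match.group(2):
            continue  # not yet the last entry of this section's block
        new_line = match.group(1) + new_id
        if new_line not in lines:
            out.append(new_line)
    return "\n".join(out) + "\n"
```

For the AC-1 the call would be `add_hardware_id(text, r"PCI\VEN_13F6&DEV_8788&SUBSYS_09101A58")`. Reading and writing the file with the right encoding is left to the caller.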



Cloud computing makes computing resources available just like utilities provide resources. It scales with the demand up and down. Physical computers become less important and act like a cloud. It is possible to give more power to an application by adding CPUs or storage. Failure is accepted as normal and handled.

Storage is available as NoSQL, classic relational on block storage or some relational database. Messaging and authorization services are also available.

Middleware includes provisioning, cache and grid computing resources.

A cloud API would also support monitoring/management.


Scale. Illusion of infinite capacity and pay only for what you need.  You can ‘burst’; i.e. acquire many more computing resources with ease.

Low cost of entry. Start small and grow into your needs.

It is easier to manage risk.


Performance will be impacted. Use the right middleware.

Keep data close to computing – expensive to copy.

Managing SLAs can be a problem depending on your neighbors.

Security – exposed on a public cloud.  Q really?

Lock-in? What if you want to move away from your provider?

It may be hard to take a legacy app and move it.

It is important to automate deployment.

Expect failure – design for it!
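To make "expect failure" a bit more concrete, here is a minimal sketch of one common way to design for it: retrying a call with exponential backoff. Nothing here comes from the talk. The helper name `call_with_retry` and its parameters are my own assumptions.

```python
import time


def call_with_retry(operation, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call operation(); on failure wait base_delay * 2**n seconds and try again.

    Hypothetical helper for illustration. The last failure is re-raised so the
    caller still sees it when all attempts are used up.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter exists only so the waiting can be replaced in tests. In real use you would probably catch a narrower exception type than `Exception`.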

EHCache & Terracotta

Commercial open source.

Q use only from Java? via REST API from others.


Used in Hibernate. Accessible as a REST-service.

Deployed as jar and available in a couple of flavors for various purposes.

Ehcache on its own: memory and disk store. 64 bit can store a lot. 6-8GB is comfortable. Limiting factor is GC, can be configured to be < 100ms. On-disk store is 20GB, up to 100GB.

Usually accessed via RMI. JMS is rather slow.

Wasn’t transactional or HA.

After being acquired by Terracotta a number of features were added. HA, coherence.

Coherence – read-lock has to be acquired, no writes possible until all read locks released. Newest version has 2 phase-commit using TC technology.

Ehcache holds a subset of the Terracotta cache, analogous to the L1/L2 CPU cache architecture. Possible to configure for various scenarios.

You can configure coherence and locking for more consistency or performance, e.g. for batch runs during the night you can turn coherence off.

Copy-on-read for isolation is possible (OSGi apps need that).

There is an eventing API if the storage gets disconnected, default action is to disable the cache.

Q managing API? JMX? Monitoring?

Performance

Used to be a synchronized HashMap. Then problems cropped up with apps waiting for the cache.

Reworked beyond Java 1.5 – metadata determines eviction policy via plug-in algorithms. Probabilistic eviction.
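As I understood it, probabilistic eviction means the cache does not keep a perfectly ordered list of entries. It looks at a random sample and evicts the worst candidate according to the policy and the per-entry metadata. Below is a minimal Python sketch of that idea with a least-recently-used policy. It is my own illustration, not Ehcache code. The class name, the sample size and the logical clock are all assumptions.

```python
import itertools
import random


class SampledLruCache:
    """Toy cache that evicts the least recently used entry of a random sample."""

    def __init__(self, capacity, sample_size=5, rng=None):
        self.capacity = capacity
        self.sample_size = sample_size
        self.rng = rng or random.Random()
        self._clock = itertools.count()  # logical time instead of wall-clock time
        self._data = {}                  # key -> value
        self._last_used = {}             # key -> logical timestamp (the metadata)

    def get(self, key):
        if key not in self._data:
            return None
        self._last_used[key] = next(self._clock)
        return self._data[key]

    def put(self, key, value):
        if key not in self._data and len(self._data) >= self.capacity:
            self._evict_one()
        self._data[key] = value
        self._last_used[key] = next(self._clock)

    def _evict_one(self):
        keys = list(self._data)
        sample = self.rng.sample(keys, min(self.sample_size, len(keys)))
        # The eviction policy is just the key function; swap it to plug in another one.
        victim = min(sample, key=self._last_used.__getitem__)
        del self._data[victim]
        del self._last_used[victim]

    def __len__(self):
        return len(self._data)
```

With a sample as large as the cache this degenerates into exact LRU. A smaller sample trades accuracy for not having to look at every entry on each eviction.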

Concurrent Hashmap –> selective concurrent hashmap.

Performance vs memcache – it is on the network, so there is a lot of overhead. Ehcache is in-process and 1000x faster; on disk still 2x faster than memcache.

On Hibernate Compared to MySQL, memcache. Coherent.

30-95% database load reduction.

80 times read-only performance of MySQL. Lower latency.

Coherence makes sure that what’s in the cache is also in the database.

Writing is harder – the mechanisms may be hard to configure.

People keep transitional data in the cache (disk-based) for up to a week.

It is easier to change a key-value store programming model than a database.


Best practices:

Set often-used reference data to ‘eternal’.

n* problem: if no coherence between distributed caches the work has to be done n times. With coherence you have to do the work only once.

ORM – cache entities, collections and queries. Have to turn that on individually in Hibernate. Turn on showSQL to see what SQL gets generated.

Consider cache above the DB layer.

People over-provision for VMWare – too many instances on the hardware. Start with one, add later. You still have advantages when running only one image per server.
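The "cache above the DB layer" advice from the best practices above boils down to a read-through cache. Here is a minimal sketch in Python, assuming a `load_from_db` callable that stands in for whatever query the application would otherwise run. The class and its counters are hypothetical and only show where the database load reduction comes from.

```python
class ReadThroughCache:
    """Toy read-through cache: ask the database only on a miss."""

    def __init__(self, load_from_db):
        self._load = load_from_db  # assumed callable: key -> value
        self._entries = {}
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._entries:
            self.hits += 1
            return self._entries[key]
        self.misses += 1
        value = self._load(key)
        self._entries[key] = value
        return value
```

There is no invalidation here, so this only fits data that rarely changes, like the reference data mentioned above.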


GAE – ehcache 1.6+ works on it.

Why? will improve GAE responsiveness. Familiar API.

GAE gives you max 100MB but JVM reservation coming 2010, because frameworks get started up when hit which is slow.

Amazon EC2.

Ehcache is monitored by JMX, which needs portmap (many ports) – bad for security. New Terracotta tools allow secure monitoring.


Example app, Examinator


Configuring an E92 BMW

The shop did not know how to reset my 325ix service reminder after an oil change. The oil-reset tool they had bought for hundreds of dollars did not work. The new one they ordered didn’t either. I searched the message boards using my phone and found this:

Step 1
Insert key into slot
Step 2
Press start/stop button without depressing clutch/brake and wait for the service reminder to disappear.
Step 3
Immediately after service reminder goes out, press and hold the odometer reset. After 3 secs a warning triangle will appear, keep the odo. button pressed and after another 2-3 secs the Oil can will appear. If you hold it too long (10 secs in total) you will overshoot the runway and some German writing will appear giving the software level/CAN bus etc. of the car. Go back to the beginning and start again.
Step 4
You are now in the service menu, use the toggle switch on indicator stalk to scroll up/down through the various service items.
Step 5
When you have the item you want to reset showing, press the BC button on end of indicator stalk. Reset should now appear in the display. Press in and hold the BC button for 2-3 secs and a clock will start whirling around and hey presto, it’s done.
Step 6
You can now either scroll up/down to select another item (as in step 4) or you can press the start/stop button to exit.


Language Shootout – SDForum

Steve Mezak – what language for web app?

Clojure – Amit Rathore – Runa, Thoughtworks, IBM

Writing the Manning book ‘Clojure in Action’.

“Is a Lisp” on the JVM. Has [] too. Has higher-order functions (map, reduce, filter – they accept other functions). Data structures are immutable like in Erlang – that eliminates a whole range of defects. Lazy sequences are like Python generator functions, which makes it easy to process lists.
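Since the comparison was to Python generators, here is what that looks like on the Python side. This is a small sketch of my own, not code from the talk: an infinite sequence is fed through filter and map, and only the first five results are ever computed. The function names are made up.

```python
from itertools import islice


def naturals():
    """Infinite lazy sequence 0, 1, 2, ... (nothing is computed until asked for)."""
    n = 0
    while True:
        yield n
        n += 1


def squares_of_evens(numbers):
    """Higher-order functions over a lazy sequence; filter and map are lazy too."""
    evens = filter(lambda n: n % 2 == 0, numbers)
    return map(lambda n: n * n, evens)


first_five = list(islice(squares_of_evens(naturals()), 5))
print(first_five)  # [0, 4, 16, 36, 64]
```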

Writing multi-threaded code is made easier by distinguishing between identity and state. In most languages you change objects by mutating their state in place. In Clojure state is copied – the old state stays around, and new users get the new data. Change has to be managed specially and inside of transactions (STM system). Has all the common data structures. Other buzzwords: agents, atoms, vars.

JVM gives access to a lot of existing (Java) libraries with an easy to use calling method. () = list; [] = vector. Code is data. Macro feature allows you to change the language by inserting code in the middle of the compilation process. It becomes a programmable programming language.

Runa models low-level attributes and tries to combine them into executable functionality that can be processed with a DSL for the subject experts.

Rapid prototyping through Emacs plugin.

OO? Is supported.

It is very easy to expose functions on web services.

Is fast and has powerful features.

Ruby – Evan Phoenix – Engine Yard

Everything is an object and you call methods on it. Principle of least surprise – if it looks like it should work it does.

Works on Rubinius, a new Ruby runtime. Ruby has multiple implementations, such as JRuby on top of the JVM.

Multi-threading is supported, varies between implementations. No language facilities for multi-threading, uses a lock library.

Compared against Smalltalk? Used to be a mix between Smalltalk and Lisp and evolved more towards Perl.

Used as an all-purpose language, not just for web apps. Rails functions as a gateway drug.

Compared to Python? Only one way, Guido’s way. Ruby is more flexible.

Scala – David Pollak  –

Hybrid functional-OO language. Runs on the JVM. Is compiled. Designed by Odersky. Used by LinkedIn, Twitter, Foursquare, Office Depot, eBay, SAP, etc. Why did they switch? Scala gives you all the benefits of Ruby with the benefits of Java.

Has a type system that doesn’t get in your way, thanks to type inference.

Wrote the book Beginning Scala. Started Goat Rodeo: distributed programming model, actor model.

Allows writing very concise code.

Concurrent. Parallel collections. Library, but feels like built in. Like Erlang.

You can write type checking code.

Has a great community, and the 2.8 release has IDE support.

Lift: web framework like Rails.

Compared to Groovy? Groovy is slow and has a bad type system.


Go – Robert Griesemer – Google

Conceived by Google, still very experimental. Allows control over memory layout, for systems programming. Could implement an OS with it. Want to replace C++ with it.

A Go program is a collection of packages. You can import other packages. Looks like C without semicolon.

Motivation is frustration with current state-of-the-art build systems. A lot of typing in the code today, as well as dependency problems. Some problems are solved by dynamic languages. Shows a C++ error message of 40 lines caused by a simple omission of the const keyword.

Focus is on edit-compile cycle.

Concurrency is built in via the ‘go’ statement. Communication happens through ‘channels’ which connect “go-routines” and make it easy to communicate. 100k channels built up, used and taken down, in 1s on a laptop.
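To get a feel for the channel idea without writing Go, here is a rough Python analogue of my own. It uses threads instead of go-routines and `queue.Queue` instead of channels, so it shows only the communication pattern and says nothing about Go's syntax or performance. All names, including the `DONE` sentinel that plays the role of closing a channel, are assumptions.

```python
import queue
import threading

DONE = object()  # sentinel standing in for "channel closed"


def producer(out, items):
    """Send every item down the channel, then signal that nothing more comes."""
    for item in items:
        out.put(item)
    out.put(DONE)


def squarer(inp, out):
    """Receive numbers from one channel and send their squares down another."""
    while True:
        item = inp.get()
        if item is DONE:
            out.put(DONE)
            return
        out.put(item * item)


def run_pipeline(items):
    numbers, squares = queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=producer, args=(numbers, items)),
        threading.Thread(target=squarer, args=(numbers, squares)),
    ]
    for thread in threads:
        thread.start()
    results = []
    while True:
        item = squares.get()
        if item is DONE:
            break
        results.append(item)
    for thread in threads:
        thread.join()
    return results
```

The workers share no state. Everything they know about each other goes through the two queues, which is the point of the channel model.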

Open source, portable, many platforms.

Next steps: generics, exception handling.

Not ready for prime time – 1/2 year?

No VM? Directly to machine. Language supports garbage collection, doesn’t need a VM. Systems language with nothing underneath.

Does type system allow generic map? Not currently but clearly something they are working on. sort library expects a container that implements three methods. Like pre-generics Java.



The Goto statement is in Go as well as in PHP. Someone implemented goto in Scala using continuations.

Test-driven development?

Clojure comes with and supports many frameworks for TDD, including everything Java has.

Ruby has TestUnit, RSpec. Ruby community embraces testing.

Scala also has everything that Java has. Also can test Java. Comes with two test frameworks. Scalacheck calculates what possible parameters are for methods and exercises them.

Go has early-stage easy-to-use test frameworks.

How do go-routines interact with IO? They have their own stacks.

IDE support?

Clojure. REPL dev style. Lisp-like dev style.  Grow a function and turn it into a test.

Ruby? NetBeans has Ruby support, as do Eclipse and RubyMine; TextMate has completions.

Scala, REPL like Clojure. In 2.8 reworked the compiler to be incremental, dump more info out.

Go. Go-mode for emacs, Xcode. IDE – godoc extracts documentation into HTML.

Learning curve

Clojure – a couple of days to get going once you’re over the parentheses.

Ruby – pretty simple.

Scala – similar to Ruby and Python, less complex than Java.

Go – if you’ve done C before you’ll pick it up quickly.


Log Management

In today’s internet services you usually have an ever growing zoo of machines, populated by many, wildly different services. This poses a problem for an administrator that needs to keep all these machines and services running properly, and for the developers that need to find causes for problems that inevitably crop up.

The usual solution is to log major activities that each service performs, especially problems that occur. This often involves more than one service, communicating across several machines. If you are lucky the system assigned a ‘magic cookie’ to each user action that is kept as the action triggers services handling the request. In any case the engineer looking into the problem has to collect evidence from many log files. These logs need to be parsed and filtered to display relevant information. The size of the logs can easily reach terabytes.
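Collecting the evidence for one user action could look roughly like this. It is a minimal sketch that assumes every log line has the made-up format `<ISO timestamp> <service> req=<cookie> <message>`. Real logs will differ, and at terabyte size you would not do this in memory.

```python
import re
from collections import defaultdict

# Assumed line format (hypothetical): "<ISO timestamp> <service> req=<cookie> <message>"
LINE = re.compile(r"^(\S+) (\S+) req=(\S+) (.*)$")


def trace_requests(log_lines):
    """Group log records from many services by their 'magic cookie'.

    Returns {cookie: [(timestamp, service, message), ...]} with each list sorted
    by timestamp. Lines that do not match the assumed format are skipped.
    """
    by_cookie = defaultdict(list)
    for line in log_lines:
        match = LINE.match(line)
        if match:
            timestamp, service, cookie, message = match.groups()
            by_cookie[cookie].append((timestamp, service, message))
    for events in by_cookie.values():
        events.sort()  # ISO timestamps sort correctly as plain strings
    return dict(by_cookie)
```

Feeding it the concatenated lines of several log files yields, per cookie, the path of one request through the services.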

Another use for the logs is to quickly understand the health of the system – how many problems did we have over the last 24h? In addition, data like the least active customers over the last week could provide important information for service renewal efforts.

How does one get all this information out of the logs and displayed on a nice dashboard in a timely manner?

Stefan Groschupf says the best way is to just copy (rsync) files to a central location and index them using Lucene.

Scribe (Facebook) takes a lot of effort to get implemented, ported and running. Maybe CPU/network intensive. Hooks into syslogd. Pushes every 64MB of log records to a collector.

SOLR (Apache) for log indexing with REST interface. Easy to use, but hits a performance wall at some point due to REST access. This only happens when the volume of indexed logs nears the terabyte range.

Katta builds a distributed index. Used for terabyte-size logs. More involved than SOLR but relatively straightforward. Can handle hundreds of machines with logs.

Splunk is pretty expensive. Cool, easy to use. They index local logs and have a great UI. $10k/month.

Putting logs on a message bus is not such a great idea – unknown how much load on network/CPU is generated. EMI tried that, but lots of overhead for one message per log record. Expensive guarantees provided by the message bus are not needed.

Can put together script that collects logs every hour. This only gets problematic when too many machines (1000+).

Some questions that need answering:

Critical time monitoring: what is time span until we need to see the alert? It is probably better to do alerting in the application logging framework.

Q: Size of log per customer? How many?


The problem of personal space and Skiing

I hate standing in line. Especially when it is cold and I could be moving, but I have to wait because some people feel uncomfortable sitting together on a bench. In an American lift line, this is the standard modus operandi. If no one forces 4 people to sit on one four-person chair, they will use two or more chairs. Why? Because in this culture people do not like to be closer than arm’s length to each other. So when sitting on a chair-lift where you end up shoulder to shoulder they feel uncomfortable and would rather let everyone behind them wait.

This gets even worse on 6-person chairs. You would think people cannot read. Actually they can’t count to six, because there are perfect pictograms with six people next to each other on the chair.

I often think about worst-case scenarios – what if the lift breaks and you are forced to wait a couple of hours for rescue? As a lone rider you freeze, but four people can entertain and heat each other up.


Hudson talk SF Java JUG on February 9, 2010

Kohsuke Kawaguchi from Sun introduced us to Hudson, an open-source continuous integration (CI) system.

Why do you need CI? A developer forgets to add a class to the SCC, and everybody else’s build breaks after checking out the latest. Hudson catches problems like this. Since it can also run tests a development team is always assured that things still work. Often regression tests are forgotten after the 1st week of excitement. Hudson will send email to the developer who caused the problem. It is like driving with seatbelts.

You run Hudson from the command line: java -jar hudson.war. It also has OS-specific packages. Hundreds of plugins exist that integrate lots of things. 200+ developers! And many corporate users.

Every Friday a new release. For enterprises there is a bug-fix only release available that comes out every 18 months. The core development team is small, and there is a plugin architecture that makes it easy to extend the system. This way big architecture changes can be introduced with a small, agile team.

Functionality is basically listening to source code control system, trigger build, and record and publish results. It also keeps build results and reports available for later use.
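Stripped of everything that makes Hudson nice, that loop can be sketched in a few lines. This is my own toy illustration in Python, not how Hudson is implemented. The function names are made up, the revision IDs are assumed to come from whatever the SCC reports, and the build is just an arbitrary command.

```python
import subprocess


def run_build(revision, command):
    """Run the build command and record the outcome for this revision."""
    proc = subprocess.run(command, capture_output=True, text=True)
    return {
        "revision": revision,
        "success": proc.returncode == 0,
        "log": proc.stdout + proc.stderr,
    }


def poll_once(last_seen, latest_revision, command, history):
    """Trigger a build only if the SCC reports a revision we have not built yet.

    Returns the revision that is now considered 'seen'. The history list plays
    the role of the published build results.
    """
    if latest_revision == last_seen:
        return last_seen
    history.append(run_build(latest_revision, command))
    return latest_revision
```

A real setup would call `poll_once` on a timer, or better, be notified by a commit hook, and would mail whoever committed the failing revision.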

Configuration can be done via a web server run from the .war file using a quite comfortable UI. It is also possible to trigger builds and test runs from there. Cool. There is a source code browser, and a nice build report.

Hudson is aware of Maven – you point to the repository URL, the target and Hudson kicks off a Maven build. Very nice.

Hudson can also run tests under a number of different environments (e.g. varying JDK and library versions), and catch incompatibilities that way.

There is a view where you can see individual builds, with the change lists from the SCC also displayed. Very useful for looking through a project progress over time.

It is possible to set up ‘jobs’ that can be correlated – and then builds can be ‘promoted’ based on test outcomes. The libraries used are also tracked – that feature is called ‘fingerprint’. Only programs can provide that accuracy; humans are not detail-oriented enough. This makes it unnecessary to rely on one ‘build expert’ who is the only person in the company who knows how to build the thing.
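The fingerprint idea is simple enough to sketch: hash every artifact and remember which builds produced or consumed that exact hash. As far as I know Hudson uses an MD5 checksum of the file for this, but treat that as my assumption. The class below is a hypothetical illustration, not Hudson's data model.

```python
import hashlib
from collections import defaultdict


def fingerprint(data: bytes) -> str:
    """Checksum identifying one exact version of a file (MD5 assumed here)."""
    return hashlib.md5(data).hexdigest()


class FingerprintIndex:
    """Remembers which builds used which exact artifact."""

    def __init__(self):
        self._usages = defaultdict(set)  # fingerprint -> set of build names

    def record(self, build, artifact_bytes):
        fp = fingerprint(artifact_bytes)
        self._usages[fp].add(build)
        return fp

    def builds_using(self, artifact_bytes):
        return sorted(self._usages.get(fingerprint(artifact_bytes), ()))
```

With this you can answer "which build of the app contains the library built in job X, build 12?" without trusting file names or version numbers.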

Teams working together on components that rely on each other can use build promotion to build, run tests and promote a binary to the next dependent group.

In summary, automation allows you to do more, more frequently. The web UI makes things transparent. Fewer people are involved in the build and test process. Lengthy tests can be run on servers, keeping users’ machines free.

Distributed builds – it is possible to install new slave servers automatically using the PXE plugin to boot a server from an ISO image hosted on a main server. Boot computers from the network (using PXE) and install “OS for Hudson”. There is a master server and an “unlimited” number of slaves. Hudson includes an SSH client that can transfer and install Hudson and even the JRE. For Windows they do that via DCOM! If the slaves are behind a firewall the slaves initiate the connection by using Java Web Start to download a Java program and initiating the connection from there. There is also a JDK installation option that automatically selects the right JDK.

It is then possible to run concurrent builds.

Hudson can also clean up daemon processes left behind.

Load statistics are kept.

It is possible to deploy onto Amazon EC2 on demand (based on the current load).

There are lots of other features like IDE integration.

Video and slides are now available:



Posted in Computers and Internet | Leave a comment

ACM Local Chapter Hadoop Talk

Several (~5) companies announced hiring. Data mining appears to be a growth area.

Amr Awadallah, CTO Cloudera. Introductory lecture on Hadoop.

Storage systems can’t compute, they just store. They tried to store 20TB per day which was not a problem, querying it was because loading the data from the storage farm into the RDBMS was too slow. The foundation of Hadoop was storing and querying massive amounts of data at the same time.

In addition, ad-hoc querying and data mining were impossible to do on the storage farm. The goal for Yahoo mail was to query for which user sent which mail from which IP address. Mining the data approached 24h per day and new functionality could not be scaled into the existing infrastructure.

Hadoop introduces an additional layer that performs both the storage and computation functions in addition to the relational database. It is a scalable, fault-tolerant grid operating system for data storage and processing. The name was invented by a kid who named his plush toy elephant Hadoop. The kid declined to give a reason but the inventor liked it. It is a distributed file system and a job distribution engine.

You can load anything into Hadoop, the job distribution engine will apply the jobs to the data, which can be structured or unstructured. Learn the details at Cloudera and Apache.
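The talk did not show code, so here is my own minimal sketch of the MapReduce programming model that the job distribution engine runs, written as plain single-process Python. Real Hadoop jobs are written against its Java API and run distributed. The function names are assumptions chosen for illustration.

```python
from collections import defaultdict


def map_phase(document):
    """Map: turn one input record into (key, value) pairs."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all values by key (the framework does this between phases)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values of one key into a result."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

Because each map call sees only one record and each reduce call only one key, the framework is free to run them on whichever machine already holds the data.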

History: Nutch was a system for finding words (?). In 2003 and 2004 Google came out with the GFS and MapReduce papers. D. Cutting and M. Cafarella added DFS and MR support to Nutch; in 2006 Yahoo hired Cutting and Hadoop was spun out of Nutch. Yahoo made the efforts available as open source to blunt Google’s “unfair advantage” with their MapReduce implementation. In 2007 the NY Times used Hadoop and Amazon EC2 to process 4TB of archives. In 2008 a web-size deployment occurred at Yahoo, and a race in sorting terabytes of data started with Google. Yahoo sorted a petabyte in 16h. The Hadoop summit attracted 750 people last year.

Design Axioms: 1) embrace failure 2) heal thyself 3) scale easily up and down 4) scale linearly 5) computation occurs near the data 6) be a simple operating system

HDFS works with 64MB block sizes (compare to 4kB on “normal” OSes) and replicates data. This allows you to have multiple workers process the same data. The cost per GB is a few cents only. When nodes fail the replication is restored. There is a processing language to operate on data. Even failures like “slower than normal” are supported. Since the data is in more than one place, it is easy to schedule redundant processing – the first finisher wins.
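The block and replica bookkeeping can be illustrated with a little arithmetic. This is a toy sketch of my own. The replication factor of 3 and the round-robin placement are assumptions for illustration, since the real HDFS placement policy is more involved.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64MB blocks, as mentioned in the talk


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file of file_size bytes is cut into."""
    sizes = []
    remaining = file_size
    while remaining > 0:
        sizes.append(min(block_size, remaining))
        remaining -= sizes[-1]
    return sizes


def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes (simple round-robin)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def blocks_to_repair(placement, failed_node):
    """Blocks that lost a replica and must be copied to another node."""
    return [block for block, holders in placement.items() if failed_node in holders]
```

A 200MB file becomes four blocks (three full ones and an 8MB tail), and every block can be processed by any of the nodes holding a copy.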

There are other subsystems as well that extend capabilities, for example Zookeeper, Avro, HBase (a key/value store), Pig, Hive, Sqoop, BI reporting, ETL tools – they make it easier to work with Hadoop because programming Hadoop is done in Java, which is very general-purpose.

Relational databases are like a fast sports car, they support multi-step transactions and are expensive.

Hadoop is like a freight train – affordable and very effective when it has enough freight cars.

Currently the cost is $250/TB which is up to 20x less than RDBMS.

Training is available at
