Wednesday, September 30, 2009

Low-level virtual machine subtleties

Virtual machines, such as the Java Virtual Machine, are astoundingly complex.

My friend Andrew sent me a link the other day to this great puzzle from the C# world. Similar things arise in Java, although the situation isn't identical, since in Java you have to explicitly invoke the equals() method to get logical equality (the == operator always performs reference equality).
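
Here's a trivial illustration of that distinction (my own example, not from Lippert's post):


String a = new String("puzzle");
String b = new String("puzzle");
System.out.println(a == b);       // false: two distinct String objects (reference equality)
System.out.println(a.equals(b));  // true: the same character sequence (logical equality)
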

Lippert's blog post is great, and very clearly written, even if I have no idea whatsoever how to approach the interview question that he parenthetically notes ("invent a performant algorithm for determining intransitivities in a simplified version of the 'better method' algorithm").

The discussion of string interning in C# reminded me, somewhat, of a fascinating recent performance optimization that Knut Anders Hatlen investigated in Derby: DERBY-3883. In this very specialized case, converting a logical equality test into a more sophisticated "check the hash equality first, then check the logical equality" implementation brought surprising benefits. It's worth reading the issue comments thoroughly for some clear explanations from Knut about why and when the performance optimization is useful.
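
The gist of the trick, as I understand it, is something like this little sketch (not Knut's actual patch; it leans on the fact that String.hashCode() caches its value after the first call, so the hash comparison is nearly free):


static boolean hashFirstEquals(String a, String b)
{
    if (a.hashCode() != b.hashCode())
        return false;       // different hashes: the strings cannot possibly be equal
    return a.equals(b);     // same hash: they might still differ, so compare fully
}
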

It's intriguing how often Java performance behavior turns out to be surprising. A decade ago, optimizing Java program performance seemed substantially easier than it is nowadays. Partly this is because I find myself squeezing further performance out of code bases that have already been thoroughly examined, but partly it is also because modern VMs and modern hardware architectures have made performance analysis very tricky.

There is a wonderful set of notes that Joshua Bloch put together from the recent JVM Languages Summit, which shows some of the depths of this problem and highlights the increasing frequency with which these problems arise. As the paper by Lea, Bacon, and Grove observes:


The performance of two similar programs under varying mappings might differ by many orders of magnitude depending on issues such as branch prediction, cache misses, cache contention, the placement of memory fences, mappings of threads to cores or processors, OS scheduling policies, memory footprint, resolution of virtual dispatches, loop parallelization, SIMD or GPU parallelism, inlining, boxing of primitives, garbage collection policies, interpretation vs machine code compilation, reflection, opportunistic synchronization elimination, contention retry policies, remote messaging, load balancing, memory, file, and network layout, and so on.


I guess I should be pleased that (most of) the topics raised in this laundry list are familiar to me, but mostly I'm just reminded of how complicated modern systems software is, and how tricky it is to get it just right.

Speaking of GPU parallelism, I happened to see that NVidia has released its latest OpenCL libraries as part of the CUDA effort. I'm not sure how I feel about taking on yet another "C-like language", since C itself is hard enough, but I may have a go at learning more about some of the Java-related CUDA APIs.

Tuesday, September 29, 2009

The mystery of the extra database read

I find that I simultaneously both love and hate being stumped.

My object-relational library at work has been acting up. We have an internal protocol for handling object modifications, which is designed to maintain the clarity of the object cache: when an object is updated, we move it from the primary cache into a per-transaction lookaside cache for the duration of the modification process; at commit time, the updated objects are restored to the main cache.
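
In grossly simplified form, the protocol looks something like the sketch below (the class and method names here are hypothetical; the real library is far more involved):


import java.util.HashMap;
import java.util.Map;

public class ObjectCache
{
    private final Map<Long, Object> primary = new HashMap<Long, Object>();
    private final Map<Long, Object> lookaside = new HashMap<Long, Object>();

    // When an object is about to be modified, move it from the primary cache
    // into the per-transaction lookaside cache.
    public synchronized void beginModification(Long id)
    {
        Object o = primary.remove(id);
        if (o != null)
            lookaside.put(id, o);
    }

    // At commit time, restore the updated objects to the primary cache. If this
    // step is missed for some object, the only symptom is an extra database read
    // the next time that object is referenced.
    public synchronized void commit()
    {
        primary.putAll(lookaside);
        lookaside.clear();
    }
}
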

Unfortunately, once in a blue moon, under quite high load, the cache loses track of an object. Something goes wrong during the commit process, and at least one object fails to be restored to the cache.

Because of the way the cache works, the symptoms of this are almost unnoticeable, as the library code simply reloads the object from the database the next time it is referenced. So for the longest time all we noticed was that we were doing slightly more database access than we expected. Eventually we noticed in our database traces that we were occasionally reading objects which ought to have been found in the cache, and slowly over time I've been trying to track down why that is.

I've been generating theories about how this could be, and constructing experiments or building assertions and scaffolding into the code, to prove or disprove the theories.

So far, all the theories have been disproven.

I call this the "Sherlock Holmes" technique of debugging: "when you have eliminated the impossible, whatever remains, however improbable, must be the truth." Unfortunately, I still have more impossibilities to eliminate, and I haven't yet found the truth.

My friend Walt says: "if you aren't struggling, you aren't learning", and he's absolutely right. It takes a challenging and infuriating problem to drag you deeply enough into the code to reach that breakthrough of discovery and insight.

As a believer in rationality and logic, I know that eventually, this problem too will succumb, and I'll arrive at a full explanation of the behavior I'm seeing.

But until then, I'm stumped.

Tuesday, September 22, 2009

Drag and Drop -- still a non-starter in HTML 5?

Why is drag and drop such a drag?

Several years ago I looked into using drag-and-drop in an internal web application that I support at work. I was using YUI, and YUI has considerable drag-and-drop support. I didn't take this very far, but it did what I needed it to do.

Then today I stumbled across an entire rat's nest full of rants, flames, and painful war stories about drag-and-drop in HTML 5.

Sounds like a major ouch!

This is a shame, a big shame, for several reasons:

  • HTML 5 is one of the most important languages of the future. Many very significant applications will be written using it. People have been sweating the HTML 5 details for a decade. We want it to be the best language it can be.

  • Drag-and-drop is a very important UI technique, and it's long past time that web developers be able to use it easily and naturally in simple standards-compliant HTML.

  • Up to this point, HTML 5 has been a nearly uninterrupted series of success stories, so to have such a big black eye is unfortunate.



Maybe I'm giving too much weight to the nay-sayers. However, PPK and Hixie are as respected as they come in web-developer-land, so if they say it, it's certainly so.

On the plus side, I'm pleased to have stumbled across this link to Francisco Tolmasky's web log. The 280Slides work is extremely impressive -- subscribed!

Monday, September 21, 2009

Netflix Prize

The Netflix Prize has been awarded!

I haven't been following the prize competition all that closely, but I'm intending to read through some of the technical descriptions and see what I can learn.

In the end, there were apparently two successful approaches: BellKor's Pragmatic Chaos, and The Ensemble.

The Ensemble apparently got a better score on the test, but BellKor's Pragmatic Chaos got a winning score earlier.

Both groups have made detailed information about their algorithms and techniques available, which is wonderful!

It's interesting that both these teams were collaborative efforts of multiple smaller teams. This seems to be increasingly common in modern Big Science. As one of the researchers noted:


"Some people on our team think this is the way that problems will be solved in the future," said Chris Hefele (updated) of The Ensemble, one of the two qualifying teams. "Large problems, with large teams all across the globe collaborating by internet."


The field of so-called "artificial intelligence" has been around for a long time, and goes in and out of favor. Some of the modern techniques, such as neural networks, Bayesian statistics, etc., are now extremely well developed, and obviously quite successful. Perhaps this success will lead to both renewed interest and renewed respectability for these fields.

Meanwhile, Netflix is so pleased with the results that they're already planning Prize 2!

Friday, September 18, 2009

Year 0, Month 0, Day 0

Come with me, if you so desire, for a deep dive into Java date/time manipulation.

Due to a perhaps fortunate accident, we found ourselves looking at a strange behavior in a Derby test suite, which ended up boiling down to the following question: how should this code behave?


System.out.println(Timestamp.valueOf("0000-00-00 15:47:28.0"));


Now, there is no such thing as Year 0, nor Day 0, nor Month 0, so you might think that this code would throw an exception and reject the date as illegal. But it doesn't! Instead, it prints:

0002-11-30 15:47:28.0

Now, this is strange in several ways:

  • Why is the year printed as year 2?

  • Why are the month and day printed as November 30?



Now, it turns out that the java.sql.Timestamp class is built atop java.util.Date, whose deprecated year/month/day constructors follow the same conversion rules as java.util.Calendar. So now we might wonder how this code uses Calendar, and, in particular, what the following code might do:


Calendar c = Calendar.getInstance();
c.set(0, 0, 0, 12, 13, 14);
System.out.println(c.getTime());


It turns out that this code prints:

Wed Dec 31 12:13:14 PST 0002

Well, that is very interesting!

It seems somewhat logical to say that day 0 of month 0 might be December 31 of the previous year. But why did the code print year 2?

It turns out that this is a known feature of the Calendar implementation:


GregorianCalendar represents a date with ERA and YEAR. 0 and negative year values are converted to (1 - year) with an ERA change to support the Julian calendar year numbering.


So:

  • Setting day 0 of month 0 rolls the date back to December 31 of the previous year, which is year -1 (there really was no year 0 in this numbering: the year before 1 AD was 1 BC)

  • GregorianCalendar then stores year -1 as (1 - year) with an ERA change to BC, which is (1 - -1), which is 2



OK, so that explains the year 2 business. But how did we get November 30 in our output? Shouldn't we have had December 31?

Well, it turns out that Timestamp.valueOf() uses the (now deprecated) Timestamp constructor, which takes separate year, month, day, hour, minute, second, and nanosecond values.

This constructor, however, expects the day of the month to count from 1, and thus be in the range 1-31, but the month to count from 0, and thus be in the range 0-11.

So Timestamp.valueOf takes the month portion of the string that it was provided and subtracts 1 from it, so that, e.g., 2009-09-18 gets converted into month 8, date 18.
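
Roughly speaking (this is a sketch, not the actual JDK source), the parse amounts to a call to the deprecated constructor, which also expects the year to be offset by 1900:


// Approximately what Timestamp.valueOf("0000-00-00 15:47:28.0") builds internally:
// the year "0000" becomes 0 - 1900, and the month "00" becomes 0 - 1, i.e. -1.
Timestamp ts = new Timestamp(0 - 1900, 0 - 1, 0, 15, 47, 28, 0);
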

So that meant that by the time we got to building a Calendar value, internally, we had a month value of -1, not 0, and so we were effectively running this code:


c.set(0, -1, 0, 12, 13, 14);
System.out.println(c.getTime());


And this code does indeed print:

Sun Nov 30 12:13:14 PST 0002

Here's the entire program: run it yourself on your copy of Java and see what it does!


import java.util.Calendar;
import java.sql.Timestamp;

public class Test
{
    public static void main(String[] args)
    {
        System.out.println(Timestamp.valueOf("0000-00-00 15:47:28.0"));

        Calendar c = Calendar.getInstance();
        c.set(0, 0, 0, 12, 13, 14);
        System.out.println(c.getTime());

        c = Calendar.getInstance();
        c.set(0, -1, 0, 12, 13, 14);
        System.out.println(c.getTime());
    }
}


We now return you to your normal day, month, and year.

Wednesday, September 16, 2009

Using Markov chains for test data generation

(I know, doesn't that blog title just put you to sleep?)

Here's a nice essay about some folks who used Markov chains to write a nifty test data generator.

The article does a good job of explaining the underlying theory of Markov chains, and it also covers the implementation details they had to work through: how they used a reference implementation as an oracle to decide whether their code had passed a particular test, and how they trained the test data generator by feeding it samples of the test cases they already had.
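
Here's a tiny sketch of the core idea (my own toy version, nothing like their actual generator): record which token tends to follow which in the training samples, then take a random walk over those transitions to produce new data.


import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

public class MarkovSketch
{
    private final Map<String, List<String>> followers = new HashMap<String, List<String>>();
    private final Random random = new Random();

    // Record, for each token in the sample, which token followed it.
    public void train(String sample)
    {
        String[] tokens = sample.split("\\s+");
        for (int i = 0; i + 1 < tokens.length; i++)
        {
            List<String> next = followers.get(tokens[i]);
            if (next == null)
            {
                next = new ArrayList<String>();
                followers.put(tokens[i], next);
            }
            next.add(tokens[i + 1]);
        }
    }

    // Starting from a seed token, repeatedly pick a random recorded follower.
    public String generate(String seed, int maxTokens)
    {
        StringBuilder out = new StringBuilder(seed);
        String current = seed;
        for (int i = 0; i < maxTokens; i++)
        {
            List<String> next = followers.get(current);
            if (next == null || next.isEmpty())
                break;
            current = next.get(random.nextInt(next.size()));
            out.append(' ').append(current);
        }
        return out.toString();
    }
}
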

I also like the fact that they didn't turn to this technique until they had already developed a large testbed of test cases via other mechanisms:
  • internally-developed test cases
  • test cases contributed by customers
  • test cases constructed by harvesting data off the internet
These sorts of test case generation techniques seem likely to bear fruit first; it makes sense that only after they felt they had exhausted the reach of those test cases did they get more innovative and turn to the Markov chain technique to generate more.

You're never done testing. When you think you are, consider a new technique.

Tuesday, September 15, 2009

Eclipse Memory Analyzer rocks!

They say it's a poor craftsman who blames his tools, but jhat just about did me in.

I was trying to diagnose a complicated out-of-memory situation in a very busy application server instance. We had run the server with the -XX:+HeapDumpOnOutOfMemoryError flag, so we had a full dump of memory, and I was trying to figure out what was going wrong. jhat showed me that there were 185,000 instances of class A, and I knew that each instance of class B has two class A pointers, but jhat said that there were only 203 instances of class B. What gives?

I think that I was running into the problem encountered by this user: jhat just doesn't handle multiple class loaders very well. In my case, one class loader did indeed have 185,000 instances of class A, and a different class loader did indeed have 203 instances of class B, but I was interested in following the reference relationships between all the objects in the first class loader, and jhat wasn't helping.

So, I downloaded and installed the Eclipse Memory Analyzer.

This tool is great!

Apart from one small installation snafu (which, to give them credit, is documented), download and installation were a breeze. And when I pointed it at my heap dump, it read the heap dump, and showed me just what I was expecting: 92,500 instances of class B.

Moreover, with just a couple clicks on 'Immediate Dominators' and 'Merge shortest paths to root', I was able to see the ArrayList instance in a background thread which was holding on to almost all of the 92,500 instances. Wonderful tool!

Unfortunately, the documentation seems rather chaotic, being organized primarily as a webcast and several years of blog posts.

But that's OK: for a tool this good, I'll take the time to read the old blog posts and watch the web cast, because having a solid, reliable, and high-performing memory analysis tool is about as good as it gets, in the programmer's world of development tools.

P.S. Markus, I know that you pointed me at the M.A.T. over two months ago, in a comment to an earlier post of mine. Thanks very much, and I'm just sorry it took me so long to follow up on your suggestion!

Friday, September 11, 2009

Reader-Writer locks and fair scheduling

We take a brief detour from kitchen remodeling.

I've been considering using a reader-writer lock in a project I'm doing at the office, but I'm concerned about the fair scheduling problem. So let's define those terms, briefly:
  • A reader-writer lock is a synchronization primitive which is more sophisticated than the simple mutual-exclusion lock. With a mutual-exclusion lock, only one thread at a time is allowed to enter the critical section. With a reader-writer lock, callers must identify themselves as readers or writers; multiple readers may enter the critical section simultaneously, but a writer may only enter the critical section alone.
  • Fair scheduling ensures that waiting threads aren't starved of access to the critical section. Many naive implementations of the reader-writer lock suffer from write starvation, in which a writer may wait an arbitrarily long time if readers continue to arrive at the critical section while the writer is waiting.
There are a variety of possible scheduling policies that can be used once threads begin to contend for a reader-writer lock.

It's interesting that the JDK 1.6 ReentrantReadWriteLock provides a fairness mode, selected when the lock is constructed:

When constructed as fair, threads contend for entry using an approximately arrival-order policy. When the currently held lock is released either the longest-waiting single writer thread will be assigned the write lock, or if there is a group of reader threads waiting longer than all waiting writer threads, that group will be assigned the read lock.

The Sun JDK documentation for the ReadWriteLock class has a nice summary of the design issues to consider when investigating the use of a reader-writer lock for concurrency.
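
Here's a minimal sketch of how the fair mode is requested (just an illustration; the class and its fields are made up):


import java.util.concurrent.locks.ReentrantReadWriteLock;

public class FairCache
{
    // Passing true selects the fair, approximately arrival-order policy.
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock(true);
    private int value;

    public int read()
    {
        lock.readLock().lock();
        try { return value; }
        finally { lock.readLock().unlock(); }
    }

    public void write(int v)
    {
        lock.writeLock().lock();
        try { value = v; }
        finally { lock.writeLock().unlock(); }
    }
}
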

Kitchens are like software: a good design is crucial

We needed to design our new kitchen.

Design is a process of making choices, removing ambiguity, and getting specific. Which cabinets would we use, where would they go, and how would they fit with the appliances? What sort of doors, handles, shelves, knobs, and racks would go in the cabinets? What material should be used for the countertop (we chose a lovely piece of "Tropical Green" granite), and what size and color should it be?

Several early design decisions were crucial:
  • Since the granite countertop was the most important part of the upgrade, everything else had to work with it. The counter was to be large and striking, the centerpiece of the new kitchen.
  • To increase overall storage space, we decided to make the counter extra-large, and install a second set of base cabinets on the "far side" of the counter, with an overhang so people didn't bump their feet on them when sitting at the counter.
  • We would use IKEA cabinets.
IKEA provides software for kitchen planning, downloadable from their web site. It's quite powerful, with tools for selecting cabinets in different styles, arranging them into position, and presenting a 3-D view of the resulting kitchen, including a fly-through feature so you can zoom in and out and walk around your new simulated kitchen. Unfortunately the software crashed routinely on my main computer, but it worked reliably on another computer, so we used it there.

There are lots of hidden features within the IKEA planning software, which the IKEA employees are all aware of. So we took advantage of the feature that let us save our in-progress design to the IKEA web site, then we travelled to the store and opened the design at the IKEA planning booth inside the store, and various IKEA employees helped us improve and refine our choices.

We spent many, many hours with that IKEA planning software. I have no idea how hard it would have been to communicate our design desires to our contractors without it.

After we reached a design we were comfortable with, the planning software printed out a complete bill of materials, with every last panel, door, handle, and accessory listed, and gave us an extremely detailed price quote that we were able to use for our final budgeting and adjustment. And when we actually ordered and paid for the cabinets, we simply opened our saved kitchen design at the IKEA planning booth, and the IKEA employee activated the order.

Software design has lots of tools, too: word processors, sketching and storyboarding tools, diagramming tools, rapid prototyping systems, and so forth. I use these tools quite frequently, but I find that a major part of software design is the identification and clarification of constraints. There are usually a few major constraints that govern the implementation of a software system, and elaborating them and reaching agreement on them is a crucial component of design. Constraints are typically restrictions on the software's behavior, for example:
  • All data must be tracked in such a way that a historical audit trail, showing all incremental changes over time, can be produced on demand,
  • The system must be deployable on a low-end laptop with 1 GB of physical memory, for demonstrations and early use, but must scale up to fully utilize 32-way server systems with 64 GB of memory.
As the design starts to emerge, I try to capture the most important choices and decisions and invariants in writing, so that we can all agree on them.

It would be great if there were an IKEA planning tool for server software design; maybe one day I'll see such a thing!

Wednesday, September 9, 2009

Kitchens are like software: requirements are unclear

The first decision we faced in our kitchen remodelling was a requirements question.

Kitchens have a vast number of different aspects:

  • cabinets

  • countertops

  • appliances

  • lighting

  • plumbing, electricity, ventilation

  • flooring


and so forth. The question was: which aspects were most important for our particular kitchen?

It didn't take very long to come to an agreement about this. Our previous kitchen's countertop was a 30-year-old white ceramic tile counter; the tiles were cracked and chipped, and the dark grout lines between them made us feel like the counter was always dirty, even when it was spotlessly clean.

So we decided to put the bulk of our budget into a custom granite countertop, 4 feet wide and 10 feet long, with a deep sunken sink along one edge.

Software projects are constantly running afoul of requirements issues. For one thing, engineers are always eager to get on with the coding, which provides an under-the-surface pressure to finish the requirements discussions prematurely.

For another, software can be quite abstract and nebulous, which makes it hard to have specific discussions about requirements.

And, of course, there is that classic software engineering chestnut: fast, cheap, reliable; choose two. (Or more likely, choose one, if you're even that lucky!)

I'm working on a new project at work. To start, I picked up a bug, and tried to fix it. After several weeks, it's become clear that this "bug" is actually a requirements problem: nobody was sure what a certain feature was intended to do, and so whatever it does, it doesn't really seem to be doing the right thing.

In the end, I'll make that feature do something, but for the next release, I'll try to build a different feature which gets closer (I hope) to what the customer actually wanted.

Tuesday, September 8, 2009

Why kitchen remodelling is like a software project


We've been extremely busy over the last two months, remodelling our kitchen. We didn't really intend to remodel the kitchen this summer, but our dishwasher broke and leaked water under the floorboards, ruining the flooring and cabinets and making the decision for us.

It's been interesting work, and it's caused me to think a lot about how remodelling a kitchen has many similarities to running a software project. I'll be writing more about some of these ideas in future posts, but here's a simple summary:


  • Requirements are unclear

  • Design is challenging

  • Communication is crucial

  • Teams must form and operate effectively

  • You have to take things apart to understand how they are built

  • Infrastructure is crucial

  • Every job is custom, even though we use many components

  • Scheduling is complex

  • You're never really done


Wednesday, September 2, 2009

What's in those files in my Derby DB?

If you look inside your Derby database on disk, you'll see a number of files:


-bash-2.05b$ ls db/seg0
c101.dat c1b1.dat c260.dat c31.dat c3d1.dat c71.dat c8f1.dat ca1.dat
c10.dat c1c0.dat c271.dat c321.dat c3e1.dat c81.dat c901.dat cb1.dat
c111.dat c1d1.dat c281.dat c331.dat c3f1.dat c850.dat c90.dat cc0.dat
c121.dat c1e0.dat c290.dat c340.dat c400.dat c861.dat c911.dat cd1.dat
c130.dat c1f1.dat c2a1.dat c351.dat c411.dat c871.dat c921.dat ce1.dat
c141.dat c200.dat c2b1.dat c361.dat c41.dat c881.dat c930.dat cf0.dat
c150.dat c20.dat c2c1.dat c371.dat c421.dat c891.dat c941.dat
c161.dat c211.dat c2d0.dat c380.dat c430.dat c8a1.dat c951.dat
c171.dat c221.dat c2e1.dat c391.dat c441.dat c8b1.dat c960.dat
c180.dat c230.dat c2f0.dat c3a1.dat c451.dat c8c1.dat c971.dat
c191.dat c241.dat c300.dat c3b1.dat c51.dat c8d1.dat c981.dat
c1a1.dat c251.dat c311.dat c3c0.dat c60.dat c8e1.dat c990.dat


What are these files?

The Derby storage layer uses the concept of a conglomerate, which is a collection of related data stored in a single file. Each table in your database is stored in a conglomerate, and each index of each table is stored in its own conglomerate, so you have one file per table-or-index in your database.

Note that sometimes indexes are created explicitly, via CREATE INDEX, and sometimes indexes are created automatically, such as when you declare a UNIQUE constraint or a PRIMARY KEY constraint.

Derby tracks each conglomerate separately, by its assigned conglomerate number, and derives the file name from that number: the number is converted to hex, giving cHHH.dat, where HHH is the conglomerate number in hex.
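
For example, conglomerate number 2128 maps to the file c850.dat, since 2128 in hex is 850:


long conglomerateNumber = 2128;
String fileName = "c" + Long.toHexString(conglomerateNumber) + ".dat";   // "c850.dat"
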

Derby records information about each conglomerate in its system catalogs, so you can simply run a query to find out which conglomerate is used by which table:


select c.conglomeratenumber, c.isindex, t.tablename
from sys.sysconglomerates c,
sys.systables t
where c.tableid=t.tableid;


So, for example, if you're wondering what is being stored in file c850.dat, you can simply convert 850-hex to 2128-decimal, and then run:


select c.conglomeratenumber, c.isindex, t.tablename
from sys.sysconglomerates c,
sys.systables t
where c.tableid=t.tableid
and c.conglomeratenumber=2128;


Or, if you're wondering which files are used by table EMPLOYEES:


select c.conglomeratenumber, c.isindex, t.tablename
from sys.sysconglomerates c,
sys.systables t
where c.tableid=t.tableid
and t.tablename='EMPLOYEES';


Note that about 60 of the files are used for the system catalogs themselves, so every Derby database will have at least 60 files in its storage directory.

Tuesday, September 1, 2009

Advice to a new Perforce administrator

I was mouthing off about Perforce administration the other day, and thought this might be worth sharing with others.

  • Write to Perforce support directly. They're very helpful, even when you haven't yet bought the software. They will be willing to arrange an evaluation license so that you can install and try out the software on your own machines ahead of time.

  • Make sure you use a Linux or Solaris server; do NOT use a Windows server! Perforce handles Windows clients just fine, but the server does vastly better on Linux.

  • Perforce loves physical memory. We have 32 GB of memory on our Perforce server, and when we upgraded it from 16 GB a few years back it made a BIG difference. I have read that large sites routinely run Perforce on machines with 256 GB of memory, or even more. You can increase the memory over time, though: when we started out we had just 4 GB, and for the first 18 months that was fine. And memory is cheap.

  • Perforce doesn't need much special disk or CPU hardware. You'll need enough disk to hold all the data that you want to store in Perforce, plus some extra to take backups, etc. You can add disk space over time without any special procedures needed. Use a RAID configuration if you can to add an extra level of protection. Perforce is VERY CPU efficient. Give the machine a good network hookup if you can as it spends a lot of time transmitting data to the clients. I think we have about 100 GB of disk on our machine.

  • I'm very attentive to the backups. Currently we back up once a week, although for years we backed up daily. Have a good close look at http://perforce.com/perforce/doc.091/manuals/p4sag/02_backup.html#1043336. All I did was implement that procedure. The duration of the backup basically depends on the amount of data you have, so it tends to grow over time; when you start out, the backup won't take much time at all, so you can run it routinely. Oh, and test your backup (do a test restore) before you go live! You can also find a bunch of other contributed backup scripts in the Perforce public depot.

  • Browse through the user papers from the Perforce user conferences at http://www.perforce.com/perforce/conferences/. There are lots of good 'real world experience' presentations there about what various Perforce sites have experienced.


Good luck, and let me know how Perforce worked for you!