Monday, May 18, 2009

Build Farm Part 3: The core query

Returning to the topic of the Build Farm, recall that the core mission of the Build Farm is to determine the best job that a machine can run. We managed to distill that mission into a single query; this query is executed when a machine asks for a job to do, and the point of the query is to fetch the "best" available job. There may be no jobs currently available to run, or there may be several that the machine could run, and we want to find the "best".

Here's the query:


select wq.id, p.name as project_name, p.depot_root, p.label_name,
i.sync_at_start, wq.label_override,
i.operation, wq.item_id, i.max_duration,
wq.detail,i.status_token,i.working_dir,wq.group_inst_id,
i.failure_token,i.min_reqd_tests
from apt_work_queue wq join apt_item_desc i on wq.item_id = i.id
left outer join apt_projects p on wq.project_id = p.id
where wq.started_time is null and
wq.item_id not in ( select distinct item_id from apt_item_cap_map
where capability_id not in
(select capability_id from apt_machine_desc,
apt_machine_cap_map
where apt_machine_desc.id =
apt_machine_cap_map.machine_id and
apt_machine_desc.name = 'MACHINE-NAME' ))
and wq.project_id not in
( select distinct project_id from
apt_project_cap_map
where capability_id not in
(select capability_id from apt_machine_desc,
apt_machine_cap_map
where apt_machine_desc.id =
apt_machine_cap_map.machine_id and
apt_machine_desc.name = 'MACHINE-NAME' ))
and (wq.machine_id is null or
wq.machine_id=(select id from apt_machine_desc
where name='MACHINE-NAME'))
order by wq.priority desc


Wow, that's a mouthful! Let's try to break it down into smaller parts.

Firstly, for understanding the query, it doesn't really matter what particular columns are retrieved, so let's just collapse the columns together, and call them "STUFF":


select STUFF
from apt_work_queue wq join apt_item_desc i on wq.item_id = i.id
left outer join apt_projects p on wq.project_id = p.id
where wq.started_time is null and
wq.item_id not in ( select distinct item_id from apt_item_cap_map
where capability_id not in
(select capability_id from apt_machine_desc,
apt_machine_cap_map
where apt_machine_desc.id =
apt_machine_cap_map.machine_id and
apt_machine_desc.name = 'MACHINE-NAME' ))
and wq.project_id not in ( select distinct project_id from
apt_project_cap_map
where capability_id not in
(select capability_id from apt_machine_desc,
apt_machine_cap_map
where apt_machine_desc.id =
apt_machine_cap_map.machine_id and
apt_machine_desc.name = 'MACHINE-NAME' ))
and (wq.machine_id is null or wq.machine_id=(select id from apt_machine_desc
where name='MACHINE-NAME'))
order by wq.priority desc


Secondly, the condition wq.started_time is null means "job hasn't started yet", and the condition:

and (wq.machine_id is null or
wq.machine_id=(select id from apt_machine_desc
where name='MACHINE-NAME'))


says "the job either doesn't care what machine it runs on, or it has specifically requested this machine". Let's abbreviate that to "and ok-to-run-on-this-machine".

Thirdly, there's a sub-query which occurs twice in this larger query. This query says: "get this machine's capabilities":


select capability_id from apt_machine_desc, apt_machine_cap_map
where apt_machine_desc.id = apt_machine_cap_map.machine_id and
apt_machine_desc.name = 'MACHINE-NAME'


Each capability_id for a particular machine describes some bit of its configuration in a way which is important to the Build Farm scheduler. So for example, one our our build machines might have a capability for "OS=Windows", a capability for "Java=JDK 1.5", and a capability for "DBMS=Oracle 10g", meaning that this machine is running Windows, and has JDK 1.5 and Oracle 10g installed (and so can run jobs which require those resources). So we can replace this sub-query with "this machine's capabilities" and the query now looks like this:



select STUFF
from apt_work_queue wq join apt_item_desc i on wq.item_id = i.id
left outer join apt_projects p on wq.project_id = p.id
where job-hasn't-started-yet
and wq.item_id not in ( select distinct item_id from apt_item_cap_map
where capability_id not in (this-machine's-capabilities))
and wq.project_id not in ( select distinct project_id from apt_project_cap_map
where capability_id not in (this-machine's-capabilities))
and ok-to-run-on-this-machine
order by wq.priority desc


That's good, the query is starting to get a bit easier to read. Let's keep going.

The first remaining complex condition in the WHERE clause:


and wq.item_id not in ( select distinct item_id from apt_item_cap_map
where capability_id not in (this-machine's-capabilities))


means "job is not one that requires a capability this machine doesn't have", and the other, similar, condition on the job's project means "job's project is not one that requires a capability this machine doesn't have."

So now we have:


select STUFF
from apt_work_queue wq join apt_item_desc i on wq.item_id = i.id
left outer join apt_projects p on wq.project_id = p.id
where job-hasn't-started-yet
and job-doesn't-require-a-capability-this-machine-doesn't-have
and job's-project-doesn't-require-a-capability-this-machine-doesn't-have
and ok-to-run-on-this-machine
order by wq.priority desc


That's much better. Breaking down the "NOT IN ... NOT IN ..." double negatives into English hopefully makes them easier to understand, but they are still double negatives, unfortunately, which means you have to read the query forwards and backwards and let it bounce around in your head for a while.

Lastly, the ORDER BY clause arranges all the legitimate candidate jobs in order by priority, descending, so that the first row in the result is the best job for this machine to process.

Hopefully the query makes sense now!

No comments:

Post a Comment