# Orca: Differential Bug Localization in Large-Scale Services

- differential code analysis
- build provenance graph
- build commit risk prediction

A *ring* is a set of machines that run the same build. A small ring is a few
thousand machines...
 

![](orcaArch.png)

The way this works:
- assumes that meaningful terms are common to both the symptom and the cause. ?
- testing and anomaly detection do not always catch a problem immediately
- a build may have hundreds of commits (one bug might have required 200 commits
  to be analysed)
  - uses prediction of risk, based on machine learning, several ad hoc
    approaches
  - thousands of probes in the system; failures/exceptions are logged

![](orcaProvanence.png)

- abstract syntax trees to get *difference set*
- build provenance graph over *time* and *ring*
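
A rough sketch of what an AST-based *difference set* could look like. This is not Orca's implementation: it uses Python's standard `ast` module as a stand-in for whatever parser the real system uses, and the helper names `entity_asts` and `difference_set` are made up for illustration. The idea is to compare code entities structurally, so that comment and whitespace changes do not count as differences.

```python
import ast


def entity_asts(source: str) -> dict[str, str]:
    """Map each function/class name to a structural dump of its AST.

    ast.dump() omits line/column attributes by default, so comments and
    formatting do not affect the result. Simplification: entities are keyed
    by bare name, so same-named methods in different classes would collide.
    """
    entities = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            entities[node.name] = ast.dump(node)
    return entities


def difference_set(old_source: str, new_source: str) -> set[str]:
    """Names of entities that were added, removed, or structurally changed."""
    old = entity_asts(old_source)
    new = entity_asts(new_source)
    return {name for name in old.keys() | new.keys() if old.get(name) != new.get(name)}


old_version = "def a():\n    return 1\n\ndef b():\n    return 2\n"
new_version = "def a():\n    return 1  # comment only\n\ndef b():\n    return 3\n\ndef c():\n    pass\n"
changed = difference_set(old_version, new_version)
```

Here `changed` contains `b` (body changed) and `c` (added), but not `a`, whose only change is a comment.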

Search algorithm:
- query preprocessing: tokenization, stemming, stop-word removal, IQF value
  per token
- build graph traversal
- token-matching in code using differential code analysis
  - For a commit *C*, file *f*, and token *t* in system, get relevance for
    that commit (*R = n_t · iq_t*, where *n_t* is the number of times the token
    appears in the difference for that commit, and *iq_t* is presumably the
    token's IQF value from preprocessing),
  - then sum over all files and tokens for the entire commit.
- ranking: rank the commits by the above, plus *commit risk* from regression analysis using:
  - developer experience (how many years)
  - code ownership (fewer better)
  - code hotspots (found through unnamed features)
  - complexity: LOC, num files, volume of reviewer comments
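
The search steps above can be sketched roughly as follows. Several things here are assumptions rather than details from the paper: the stop-word list is a toy one, stemming is skipped, the IQF values and commit risk scores are passed in as precomputed dictionaries, and commit risk is used only as a tie-breaker after relevance (the notes do not say how the two are combined). All function and variable names are hypothetical.

```python
import re
from collections import Counter

# Toy stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "in", "of", "to", "is", "and", "on", "for"}


def tokenize(text: str) -> list[str]:
    """Lowercase, split on non-alphanumerics, drop stop words (no stemming)."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]


def commit_relevance(query_tokens, diffs_by_file, iqf) -> float:
    """Sum n * IQF over every file diff in the commit and every query token.

    n is the number of times the token occurs in that file's diff text;
    tokens without an IQF value contribute nothing.
    """
    score = 0.0
    for diff_text in diffs_by_file.values():
        counts = Counter(tokenize(diff_text))
        for token in set(query_tokens):
            score += counts[token] * iqf.get(token, 0.0)
    return score


def rank_commits(query, commits, iqf, risk) -> list[str]:
    """Order commit ids by relevance, then by risk (assumed tie-breaker)."""
    tokens = tokenize(query)
    scored = [
        (commit_relevance(tokens, diffs, iqf), risk.get(commit_id, 0.0), commit_id)
        for commit_id, diffs in commits.items()
    ]
    scored.sort(key=lambda s: (s[0], s[1]), reverse=True)
    return [commit_id for _, _, commit_id in scored]
```

For example, given the symptom text "Mailbox timeout error in the sync probe", a commit whose diff mentions `mailbox` twice and `timeout` once would outrank one that mentions each only once, provided both tokens have positive IQF values.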
