Why GitHub's Search API Misleads Your PR Review Metrics & How GraphQL Delivers Accuracy
Unpacking GitHub PR Review Metrics: Why Your Search API Queries Might Be Misleading
Accurately tracking development quality metrics, such as the rate of Pull Request (PR) reviews before merge, is crucial for any engineering team. It provides invaluable insights for product/project managers to gauge delivery health, for delivery managers to optimize workflows, and for CTOs to ensure overall technical excellence. However, developers often encounter discrepancies when attempting to gather these vital software project statistics directly from GitHub's Search API. A recent discussion in the GitHub Community highlights a common pitfall and offers a robust solution for accurate data collection.
The Problem with GitHub's Search API for PR Review Status
A developer, amitschang, sought to analyze PR review rates across their organization using the GitHub Search API. Their approach involved two seemingly logical queries:
- is:pr is:merged review:approved org:{org} (intended for merged PRs with an approval)
- is:pr is:merged -review:approved org:{org} (intended for merged PRs without an approval)
The expectation was clear: combining these would yield the total merged PRs, and the first query would accurately identify all approved ones. However, spot checks using the GitHub web interface revealed a critical flaw: PRs that clearly had approvals were sometimes appearing in the 'unapproved' list. This inconsistency led to incorrect development quality metrics, undermining the reliability of their analysis.
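The sanity check described here can be sketched in Python. This is a minimal sketch, not the script used in the discussion: the helper names `build_queries` and `search_count` are illustrative, it assumes a personal access token with read access to the organization, and it adds a third query for all merged PRs so the two counts can be compared against a total. It calls the REST search endpoint `GET /search/issues` and reads the `total_count` field of the response.

```python
import json
import urllib.parse
import urllib.request

SEARCH_URL = "https://api.github.com/search/issues"


def build_queries(org: str) -> dict:
    """Build the two queries from the discussion plus a total for comparison."""
    base = f"is:pr is:merged org:{org}"
    return {
        "total": base,
        "approved": f"is:pr is:merged review:approved org:{org}",
        "unapproved": f"is:pr is:merged -review:approved org:{org}",
    }


def search_count(query: str, token: str) -> int:
    """Return total_count for one search query (performs a network call)."""
    params = urllib.parse.urlencode({"q": query, "per_page": 1})
    request = urllib.request.Request(
        f"{SEARCH_URL}?{params}",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["total_count"]


def counts_add_up(counts: dict) -> bool:
    """True when approved + unapproved equals the total of merged PRs."""
    return counts["approved"] + counts["unapproved"] == counts["total"]
```

Even when the counts add up, individual PRs can still land in the wrong bucket, which is why the spot checks against the web interface mattered.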
Why the Search API Falls Short for Reliable Metrics
As community experts abbosaliboev and Gecko51 astutely explained, the core issue lies in the fundamental nature of the Search API's indexing. It's an "eventually consistent" system, meaning there's an inherent delay between an event occurring (like a PR being approved or merged) and that event being fully indexed and searchable. While this lag can explain some discrepancies, amitschang noted the behavior even for very old PRs, ruling out simple synchronization issues.
The more critical point is how the review:approved filter in the Search API operates. It reflects the *indexed review state*, which can be stale or inconsistent, particularly for older PRs. It does not consistently mean "had an approval at merge time" or "had an approval at any point in its history." Consider these scenarios:
- Dismissed Approvals: If a PR was approved, but then new commits were pushed (which often dismisses existing approvals based on branch protection rules), and then the PR was merged, the Search API's index might not accurately reflect the historical fact of an approval.
- State vs. Index: The Search API's index can lag behind the actual database state. What the GitHub UI shows might be the real-time state, while the Search API is still catching up or reflecting an older, cached state.
- Manual Overrides: Manual approvals (like an "LGTM" comment without an official review) or admin bypasses are not captured by review:approved, further skewing results.
This fundamental mismatch makes the Search API an unreliable source for precise software project statistics, especially when granular historical data like "was this PR ever approved?" is required.
The Solution: Leveraging the GitHub GraphQL API for Accuracy
For precise, current, and historically accurate development quality metrics, the GitHub GraphQL API is the more reliable choice. Unlike the Search API, GraphQL queries the database state directly rather than a search index, so its results do not depend on when or how a PR was last indexed.
Initially, abbosaliboev suggested using the reviewDecision field in GraphQL:
```graphql
{
  organization(login: "your-org") {
    repositories(first: 50) {
      nodes {
        pullRequests(states: MERGED, last: 100) {
          nodes {
            reviewDecision # APPROVED, CHANGES_REQUESTED, REVIEW_REQUIRED, or null
          }
        }
      }
    }
  }
}
```
While reviewDecision is useful, Gecko51 provided a crucial clarification: reviewDecision returns the *current* decision based on branch protection rules, not whether *any* approval review exists in the PR's history. For amitschang's specific need – whether a merged PR had an approval at all, regardless of timing or subsequent changes – the correct field is reviews(states: APPROVED).
Here's the refined GraphQL query to achieve this:
```graphql
{
  repository(owner: "your-org", name: "your-repo") {
    # "CURSOR" is a placeholder: omit `after` on the first page,
    # then pass the previous page's endCursor.
    pullRequests(states: MERGED, first: 100, after: "CURSOR") {
      pageInfo {
        hasNextPage
        endCursor
      }
      nodes {
        number
        mergedAt
        reviews(states: APPROVED, first: 1) {
          totalCount
        }
      }
    }
  }
}
```
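With this query, if reviews.totalCount > 0, the PR has at least one review in the APPROVED state. Because the result comes from the PR's own review records rather than the search index, it is not affected by indexing delays. One caveat: GitHub reports a dismissed review with the state DISMISSED, so an approval that was later dismissed is not counted by states: APPROVED. If dismissed approvals should count toward your metric, fetch DISMISSED reviews as well and inspect them, keeping in mind that dismissed change requests share that state.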
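To turn the query's response into a metric, the `nodes` list can be reduced to a few counts. The following is a minimal sketch that assumes each node has the `number` and `reviews.totalCount` fields requested by the query above; the function name `approval_stats` is illustrative.

```python
from typing import Optional


def approval_stats(nodes: list) -> dict:
    """Summarize merged PR nodes by whether they have an APPROVED review.

    Each node is expected to look like:
    {"number": 12, "mergedAt": "...", "reviews": {"totalCount": 1}}
    """
    total = len(nodes)
    unapproved_numbers = [
        pr["number"] for pr in nodes if pr["reviews"]["totalCount"] == 0
    ]
    approved = total - len(unapproved_numbers)
    # The rate is undefined when there are no merged PRs at all.
    rate: Optional[float] = approved / total if total else None
    return {
        "total": total,
        "approved": approved,
        "unapproved_numbers": unapproved_numbers,
        "approval_rate": rate,
    }
```

Returning the numbers of the unapproved PRs, not just a count, makes it easy to spot-check them in the web interface.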
Implementing GraphQL for Comprehensive Org-Wide Metrics
While GraphQL offers superior accuracy, gathering org-wide software project statistics requires careful implementation. You'll need to:
- Paginate Repositories: Iterate through all repositories within your organization.
- Paginate Pull Requests: For each repository, paginate through its merged pull requests using the after: "CURSOR" mechanism.
- Manage Rate Limits: GraphQL has its own rate limits and a "node cost" system. Be mindful of these to avoid hitting API limits, especially for organizations with many repositories and high PR volumes. Monitor your rate limit headers.
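The pull request pagination and rate-limit steps can be sketched as follows. This is a minimal sketch under stated assumptions: the names `post_graphql` and `iter_merged_prs` and the `min_remaining` threshold are illustrative, the cursor is passed as a GraphQL variable instead of an inline string, and the query also requests the `rateLimit` object so the remaining budget can be checked from the response body. Repository pagination for the organization would follow the same cursor pattern and is left out to keep the example short.

```python
import json
import urllib.request
from typing import Callable, Iterator, Optional

GRAPHQL_URL = "https://api.github.com/graphql"

MERGED_PRS_QUERY = """
query($owner: String!, $name: String!, $cursor: String) {
  rateLimit { cost remaining resetAt }
  repository(owner: $owner, name: $name) {
    pullRequests(states: MERGED, first: 100, after: $cursor) {
      pageInfo { hasNextPage endCursor }
      nodes {
        number
        mergedAt
        reviews(states: APPROVED, first: 1) { totalCount }
      }
    }
  }
}
"""


def post_graphql(query: str, variables: dict, token: str) -> dict:
    """Send one GraphQL request and return its data (performs a network call)."""
    body = json.dumps({"query": query, "variables": variables}).encode("utf-8")
    request = urllib.request.Request(
        GRAPHQL_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    if "errors" in payload:
        raise RuntimeError(payload["errors"])
    return payload["data"]


def iter_merged_prs(
    owner: str,
    name: str,
    run_query: Callable[[str, dict], dict],
    min_remaining: int = 50,
) -> Iterator[dict]:
    """Yield every merged PR node of one repository, page by page."""
    cursor: Optional[str] = None  # None means "start from the first page"
    while True:
        variables = {"owner": owner, "name": name, "cursor": cursor}
        data = run_query(MERGED_PRS_QUERY, variables)
        page = data["repository"]["pullRequests"]
        yield from page["nodes"]
        if not page["pageInfo"]["hasNextPage"]:
            return
        # Stop before the next request if the rate-limit budget is nearly spent.
        if data["rateLimit"]["remaining"] < min_remaining:
            raise RuntimeError(
                f"Rate limit nearly exhausted; resets at {data['rateLimit']['resetAt']}"
            )
        cursor = page["pageInfo"]["endCursor"]
```

A caller would typically bind the token once, for example `run_query = lambda q, v: post_graphql(q, v, token)`, and pass that to `iter_merged_prs`. Injecting `run_query` also lets the pagination logic be exercised with canned responses instead of live API calls.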
Building a custom script or application to aggregate this data is essential for a complete picture. The investment in tooling pays off by giving engineering leaders data they can act on. Accurate data from GraphQL can feed internal dashboards and inform team retrospectives, so that decisions rest on the actual review record rather than on an eventually consistent search index.
Conclusion: Prioritize Accuracy for Actionable Insights
In the quest for meaningful development quality metrics, the choice of API matters. While GitHub's Search API offers convenience for quick searches, its eventual consistency and specific indexing behaviors make it unsuitable for precise historical analysis of PR review status. For engineering teams, product managers, and CTOs focused on optimizing productivity and delivery, the GitHub GraphQL API provides the necessary accuracy and reliability.
By shifting from the Search API to GraphQL, you move beyond mere approximations to concrete, verifiable software project statistics. This allows you to truly understand your team's review processes, identify bottlenecks, and make data-driven decisions that genuinely improve your software development lifecycle. Don't let misleading data obscure your path to better engineering outcomes.
