In re: Zoloft (Sertraline Hydrochloride) Products Liability Litigation, 858 F.3d 787 (3d Cir. 2017)


PRECEDENTIAL
    UNITED STATES COURT OF APPEALS
    FOR THE THIRD CIRCUIT
    ____________
    No. 16-2247
    ____________
    IN RE: ZOLOFT (SERTRALINE HYDROCHLORIDE)
    PRODUCTS LIABILITY LITIGATION
Jennifer Adams, et al., Plaintiffs appealing dismissal
    by order entered April 5, 2016,
    Appellants
    On Appeal from the United States District Court
    for the Eastern District of Pennsylvania
    (D. C. Civil Action No. 2-12-md-02342)
    District Judge: Honorable Cynthia M. Rufe
    Argued on January 25, 2017
    Before: CHAGARES, RESTREPO and ROTH, Circuit
    Judges
    (Opinion filed: June 2, 2017)
    David C. Frederick [Argued]
    Derek T. Ho
    Kellogg Hansen Todd Figel & Frederick
    1615 M Street, N.W.
    Suite 400
    Washington, DC 20036
    Dianne M. Nast
    NastLaw
    1101 Market Street
    Suite 2801
    Philadelphia, PA 19107
    Mark P. Robinson, Jr.
    Robinson Calcagnie Robinson Shapiro Davis
    19 Corporate Plaza Drive
    Newport Beach, CA 92660
    Counsel for Appellants
    Sheila L. Birnbaum
    Mark S. Cheffo     [Argued]
    Quinn Emanuel Urquhart & Sullivan
    51 Madison Avenue
    22nd Floor
    New York, NY 10010
    Robert C. Heim
    Judy L. Leone
    Dechert
    2929 Arch Street
    18th Floor, Cira Centre
    Philadelphia, PA 19104
    Counsel for Appellees
    Cory L. Andrews
    Washington Legal Foundation
    2009 Massachusetts Avenue, N.W.
    Washington, DC 20036
    Counsel for Amicus Washington Legal
    Foundation
    Brian D. Boone
    Alston & Bird
    101 South Tryon Street
    Suite 4000
    Charlotte, NC 28280
    David R. Venderbush
    Alston & Bird
    90 Park Avenue
    15th Floor
    New York, NY 10016
Counsel for Amicus Chamber of
    Commerce of the United States
    Joe G. Hollingsworth
    Hollingsworth
    1350 I Street, N.W.
    Washington, DC 20005
    Counsel for Amicus American Tort
    Reform Association and Pharmaceutical
    Research and Manufacturers of America
    OPINION
    ROTH, Circuit Judge:
    This case involves allegations that the anti-depressant
    drug Zoloft, manufactured by Pfizer, causes cardiac birth
    defects when taken during early pregnancy. In support of
    their position, plaintiffs, through a Plaintiffs’ Steering
    Committee (PSC), depended upon the testimony of Dr.
    Nicholas Jewell, Ph.D. Dr. Jewell used the “Bradford Hill”
    criteria 1 to analyze existing literature on the causal connection
    between Zoloft and birth defects. The District Court
    excluded this testimony and granted summary judgment to
    defendants. The PSC now appeals these orders, alleging that
    1) the District Court erroneously held that an expert opinion
    on general causation must be supported by replicated
    observational studies reporting a statistically significant
    association between the drug and the adverse effect, and 2) it
    was an abuse of discretion to exclude Dr. Jewell’s testimony.
    Because we find that the District Court did not establish such
    a legal standard and did not abuse its discretion in excluding
    Dr. Jewell’s testimony, we will affirm the District Court’s
    orders.
    I.
    This case arises from multi-district litigation involving
    315 product liability claims against Pfizer, alleging that
    Zoloft, a selective serotonin reuptake inhibitor (SSRI), causes
    cardiac birth defects. The PSC introduced a number of
    experts in order to establish causation. The testimony of each
    of these experts was excluded in whole or in part. In
    particular, the court excluded all of the testimony of Dr.
    Anick Bérard (an epidemiologist), which relied on the “novel
    1
    See Section II.B infra.
    technique of drawing conclusions by examining ‘trends’
    (often statistically non-significant) across selected studies.” 2
    The PSC filed a motion for partial reconsideration of the
    decision to exclude the testimony of Dr. Bérard, which the
    District Court denied. The PSC then moved to admit Dr.
    Jewell (a statistician) as a general causation witness. Pfizer
    filed a motion to exclude Dr. Jewell, and the District Court
    conducted a Daubert 3 hearing.
    The District Court considered Dr. Jewell’s application
    of various methodologies, reviewing his expert report,
    rebuttal reports, party briefs, and oral testimony. The District
    Court first examined how Dr. Jewell applied the traditional
    methodology of analyzing replicated, significant results.
    While Dr. Jewell discussed many groupings of cardiac birth
    defects, he focused on the significant findings for all cardiac
    defects and septal defects. Dr. Jewell presented two studies
    reporting a significant association between Zoloft and all
    cardiac defects (Kornum (2010) 4 and Jimenez-Solem
    (2012) 5).   He also presented five studies reporting a
    2
In re Zoloft (Sertraline Hydrochloride) Prods. Liab. Litig. (Zoloft I), 26 F. Supp. 3d 449, 465 (E.D. Pa. 2014). Since Dr.
    Jewell seems to provide similar testimony, we take into
    account the District Court’s rationale in excluding Dr. Bérard.
    3
Daubert v. Merrell Dow Pharm., Inc., 509 U.S. 579 (1993).
    4
    JA 1059-67. Jette B. Kornum, et al., Use of Selective
    Serotonin-Reuptake Inhibitors During Early Pregnancy and
    Risk of Congenital Malformations: Updated Analysis, 2 Clin.
    Epidemiol. 29 (2010).
    5
JA 1040-51. Espen Jimenez-Solem, et al., Exposure to Selective Serotonin Reuptake Inhibitors and the Risk of Congenital Malformations: A Nationwide Cohort Study, 2 British Med. J. Open 1148 (May 2012).
    significant association between Zoloft and septal defects
    (Kornum (2010), Jimenez-Solem (2012), Louik (2007), 6
    Pedersen (2009), 7 and Bérard (2015) 8). After excluding two
    studies from its consideration, 9 the District Court expressed
    two concerns with the remaining studies: Jimenez-Solem
    (2012), Kornum (2010), and Pedersen (2009). First, despite
    the fact that the remaining studies produced consistent results,
    the District Court did not consider them to be independent
    replications because they used overlapping Danish
    6
    JA 5622-34. Carol Louik, et al., First-Trimester Use of
    Selective Serotonin-Reuptake Inhibitors and the Risk of Birth
    Defects, 356 N. Eng. J. Med. 2675 (June 2007).
    7
    JA 1030-39. Lars H. Pedersen, et al., Selective Serotonin
    Reuptake Inhibitors in Pregnancy and Congenital
    Malformations: Population Based Cohort Study, 339 British
    Med. J. 3569 (Sept. 2009).
    8
JA 5987-99. Anick Bérard, Sertraline Use During
    Pregnancy and the Risk of Major Malformations, 212 Am. J.
    Obstet. Gynecol. 795 (2015).
    9
    The District Court noted that during the trial, a transcription
    error was found in Louik (2007), which led to a significant
    result for septal defects being reclassified as insignificant. JA
    65. The New England Journal of Medicine (NEJM) required
    the author to revise his discussion in light of this change.
    Additionally, multiple people tried to replicate the results in
    Bérard (2015)—including Dr. Jewell, a member of the PSC’s
    legal team, and Pfizer’s experts—and failed. The District
    Court did not allow Dr. Jewell to rely on Bérard (2015) after
    Dr. Jewell consequently “expressed a lack of confidence”
    about its reliability on cross-examination. JA 64-65.
    populations. Second, a larger study, Furu (2015), 10 included
    almost all the data from Jimenez-Solem (2012), Kornum
    (2010), and Pedersen (2009) and did not replicate the findings
    of those studies. Dr. Jewell did not explain the reasons why
    this attempted replication produced different results or why
    the new study did not contradict his opinion.
    The court then examined Dr. Jewell’s reliance on
    insignificant results, noting that it was very similar to Dr.
    Bérard’s methodology. The court noted that Dr. Jewell did
    not provide any evidence that the epidemiology or
teratology 11 communities value statistical significance 12 any
    10
    JA 4395-4404. Kari Furu, et al., Selective Serotonin
    Reuptake Inhibitors and Venlafaxine in Early Pregnancy and
    Risk of Birth Defects: Population Based Cohort Study and
    Sibling Design, 350 British Med. J. 1798 (Mar. 2015). This
    study was not available to Dr. Jewell when he prepared his
    report, but the District Court noted that Dr. Jewell testified
    that he was familiar with it. JA 63, 7297-327.
    11
    As the District Court noted, “[t]eratology is the scientific
    field which deals with the cause and prevention of birth
    defects. . . . [Where a drug is alleged to be] a teratogen, it is
    common to put forth experts whose opinions are based on
    epidemiological evidence.” JA 52.
    12
    The findings in these studies are often expressed in terms of
    “odds ratios.” Odds ratios are merely “a measure of
    association.” JA 2446. An odds ratio of 1, in the context of
    these studies, generally means that there is no observed
    association between taking Zoloft and experiencing a cardiac
    birth defect. Since these odds ratios are just estimates, a
    confidence interval is used to show the precision of the
estimate. JA 2439-40. If the confidence interval contains the odds ratio of 1, the risk of cardiac birth defects while taking Zoloft is not considered “significantly” greater than the risk while not taking Zoloft.
    less than it has traditionally been understood. 13 The court
    also expressed concern that Dr. Jewell inconsistently applied
    his “technique” of multiplying p-values 14 and his trend
    analysis.
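For illustration only (the counts below are hypothetical and are not drawn from any study in the record), the following sketch shows how the odds ratios, confidence intervals, and p-values described in footnotes 12 and 14 fit together:

```python
# Hypothetical 2x2 table; purely illustrative, not data from the record.
import math

a, b = 30, 9970      # exposed pregnancies: with defect, without defect
c, d = 200, 99800    # unexposed pregnancies: with defect, without defect

odds_ratio = (a * d) / (b * c)                 # cross-product ratio
log_or = math.log(odds_ratio)
se_log_or = math.sqrt(1/a + 1/b + 1/c + 1/d)   # standard error of ln(OR)

# 95% confidence interval on the log scale, then back-transformed
lower = math.exp(log_or - 1.96 * se_log_or)
upper = math.exp(log_or + 1.96 * se_log_or)

# Two-sided p-value for the null hypothesis OR = 1 (normal approximation)
z = log_or / se_log_or
p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

print(f"OR = {odds_ratio:.2f}, 95% CI ({lower:.2f}, {upper:.2f}), p = {p_value:.3f}")
# The association is "statistically significant" at the 5% level only if the
# interval excludes 1 (equivalently, only if p < 0.05).
```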
    The District Court critiqued several other techniques
    Dr. Jewell used in analyzing the evidence. First, Dr. Jewell
    rejected meta-analyses on which he had previously relied in a
    lawsuit against another SSRI, Prozac. The meta-analyses
    reported insignificant associations with birth defects for
    Zoloft but not for Prozac. Dr. Jewell rationalized his decision
    to ignore these meta-analyses because the “heterogeneity” 15
    within its Zoloft studies was significant; the District Court
    13
    The District Court instead noted that the NEJM’s treatment
    of the Louik (2007) transcription error suggests that the
    epidemiology and teratology communities still strongly value
    significance. JA 67.
    14
    A “p-value” indicates the likelihood that the difference
    between the observed and the expected value (based on the
    null hypothesis) of a parameter occurs purely by chance. JA
    2396. In this context, the null hypothesis is that the odds ratio
    is one; rejecting the null hypothesis suggests there is a
    significant association between Zoloft and cardiac birth
    defects.
    15
    The District Court quoted Dr. Jewell in defining
    heterogeneity as “the measure of the variation among the
    effect sizes reported in [various] studies [and] . . . where
    heterogeneity is significant, the source of variation should be
    investigated and discussed.” JA 70.
    accepted this explanation but questioned why Dr. Jewell
    “fails to statistically calculate the heterogeneity” across other
    studies instead of relying on trends. 16 Second, Dr. Jewell
    reanalyzed two studies, Jimenez-Solem (2012) and
    Huybrechts (2014), 17 both of which had originally concluded
that there was no significant effect attributable to Zoloft. 18
    The District Court questioned his rationale for conducting,
    and tactics for implementing, this reanalysis. Finally, Dr.
    Jewell conducted a meta-analysis with Huybrechts (2014) and
    Jimenez-Solem (2012). The District Court questioned why he
    used only those particular studies. 19
    Based on this analysis, the District Court found that
    Dr. Jewell, tasked with explaining his opinion about Zoloft’s
    effect on birth defects and reconciling contrary studies,
    16
    JA 72.
    17
JA 4256-67. Krista F. Huybrechts, et al., Antidepressant Use in Pregnancy and the Risk of Cardiac Defects, 370 N. Eng. J. Med. 2397 (2014).
    18
    Jimenez-Solem (2012) found that both current Zoloft users
    and SSRI users who “paused” their use during pregnancy had
    elevated risks of birth defects; this study concluded that the
    increased risk resulted from a confounding factor. JA 1044,
    1047-48. Huybrechts (2014) found the increase in the risk of
    cardiac birth defects from taking Zoloft to be insignificant.
    JA 4257-67.
    19
    Additionally, the District Court found that Dr. Jewell may
    have relied on a Periodic Safety Update Report, which
    contains literature reviews, and email correspondence
    summarizing a literature review. The District Court excluded
    this testimony because this is not the type of information
    statisticians generally rely on. This exclusion is not contested
    here.
    “failed to consistently apply the scientific methods he
    articulates, has deviated from or downplayed certain well-
    established principles of his field, and has inconsistently
    applied methods and standards to the data so as to support his
    a priori opinion.” 20 For this reason, on December 2, 2015,
    the District Court entered an order, excluding Dr. Jewell’s
    testimony, and on April 5, 2016, the court granted Pfizer’s
    motion for summary judgment. The PSC appeals the
    exclusion of Dr. Jewell and the grant of summary judgment. 21
    20
    JA 82.
    21
    The PSC concedes that if the exclusion of Dr. Jewell was
    proper, it is unable to establish general causation and
summary judgment was properly granted. Oral Argument Recording at 13:30-13:59, http://www2.ca3.uscourts.gov/oralargument/audio/16-2247In%20Re%20Zoloft.mp3.
    II. 22
    In general, courts serve as gatekeepers for expert
    witness testimony. “A witness who is qualified as an expert
    by knowledge, skill, experience, training, or education may
    testify in the form of an opinion or otherwise if,” inter alia,
    “the testimony is the product of reliable principles and
    methods[] and . . . the expert has reliably applied the
    principles and methods to the facts of the case.” 23 In
    determining the reliability of novel scientific methodology,
    courts can consider multiple factors, including the testability
    of the hypothesis, whether it has been peer reviewed or
    published, the error rate, whether standards controlling the
    technique’s operation exist, and whether the methodology is
    22
The District Court had jurisdiction over this claim under 28 U.S.C. § 1332 and 28 U.S.C. § 1407(a). We have jurisdiction under 28 U.S.C. § 1291. We review questions of law de novo, and questions of fact for clear error. Ragen Corp. v. Kearney & Trecker Corp., 912 F.2d 619, 626 (3d Cir. 1990) (citations omitted). We review the decision to exclude expert testimony for abuse of discretion. In re Paoli R.R. Yard PCB Litig. (In re Paoli), 35 F.3d 717, 749 (3d Cir. 1994). However, when the exclusion of such evidence results in a summary judgment, we perform a “hard look” analysis to determine if a district court has abused its discretion. Id. at 750. An abuse of discretion occurs when a court’s decision “rests upon a clearly erroneous finding of fact, an errant conclusion of law or an improper application of law to fact” or “when no reasonable person would adopt the district court's view.” Oddi v. Ford Motor Co., 234 F.3d 136, 146 (3d Cir. 2000) (internal quotation marks and citation omitted).
    23
    Fed. R. Evid. 702.
    generally accepted. 24 Both an expert’s methodology and the
    application of that methodology must be reviewed for
    reliability. 25 A court should not, however, usurp the role of
    the fact-finder; instead, an expert should only be excluded if
    “the flaw is large enough that the expert lacks the ‘good
    grounds’ for his or her conclusions.” 26
    Central to this case is the question of whether
    statistical significance is necessary to prove causality. We
    decline to state a bright-line rule. Instead, we reiterate that
    plaintiffs ultimately must prove a causal connection between
    Zoloft and birth defects. A causal connection may exist
    despite the lack of significant findings, due to issues such as
    random misclassification or insufficient power. 27 Conversely,
    a causal connection may not exist despite the presence of
    significant findings. If a causal connection does not actually
    exist, significant findings can still occur due to, inter alia,
    inability to control for a confounding effect or detection bias.
    A standard based on replication of statistically significant
    24
In re Paoli, 35 F.3d at 742.
25
Id. at 745 (“However, after Daubert [v. Merrell Dow Pharm., Inc., 509 U.S. 579 (1993)], we no longer think that the distinction between a methodology and its application is viable.”).
    26
In re TMI Litig., 193 F.3d 613, 665 (3d Cir. 1999), amended, 199 F.3d 158 (3d Cir. 2000) (internal quotation marks and citation omitted).
    27
    Power is “the chance that a statistical test will declare an
    effect when there is an effect to be declared. This chance
    depends on the size of the effect and the size of the sample.
    Discerning subtle differences requires large samples; small
    samples may fail to detect substantial differences.” JA 2409.
    findings obscures the essential issue: a causal connection.
    Given this, the requisite proof necessary to establish causation
    will vary greatly case by case. This is not to suggest,
    however, that statistical significance is irrelevant. Despite the
    problems with treating statistical significance as a magic
    criterion, it remains an important metric to distinguish
    between results supporting a true association and those
    resulting from mere chance. Discussions of statistical
    significance should thus not understate or overstate its
    importance.
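To make the role of power concrete, the following back-of-the-envelope calculation (hypothetical exposure rates and cohort sizes, not figures from the record) estimates how often studies of different sizes would detect a true odds ratio of roughly 1.5 at the conventional 5% level:

```python
# A rough, purely illustrative power calculation with hypothetical rates and
# sizes: the same true odds ratio that a large study will almost always flag
# as significant is routinely missed by a small one.
import math

def approx_power(n_exposed, n_unexposed, p_exposed, p_unexposed, z_crit=1.96):
    """Approximate power of the Wald test of OR = 1, using expected cell counts."""
    a = n_exposed * p_exposed          # expected exposed cases
    b = n_exposed - a
    c = n_unexposed * p_unexposed      # expected unexposed cases
    d = n_unexposed - c
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1/a + 1/b + 1/c + 1/d)
    z = abs(log_or) / se - z_crit      # distance of the effect beyond the cutoff
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))   # standard normal CDF

# Hypothetical true effect: roughly 1.5-fold odds of a defect among the exposed
for n in (500, 5_000, 50_000):
    pw = approx_power(n, 10 * n, p_exposed=0.015, p_unexposed=0.010)
    print(f"{n:>6} exposed pregnancies: power ~ {pw:.2f}")
```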
    With this in mind, we proceed to the issues at hand.
    The PSC raises two issues on appeal: 1) whether the District
    Court erroneously concluded that reliability requires
    replicated, statistically significant findings, and 2) whether
    Dr. Jewell’s testimony was properly excluded.
    A.
    The PSC argues that the District Court erroneously
    held that replicated, statistically significant findings are
    necessary to satisfy reliability. This argument seems to have
    been originally raised in the motion for reconsideration of Dr.
    Bérard’s exclusion. Explaining its decision to exclude Dr.
    Bérard, the District Court cited a previous case, Wade-Greaux
    v. Whitehall Labs, Inc., for the proposition that the teratology
    community generally requires replicated, significant
    epidemiological results before inferring causality. 28 The PSC
    28
Zoloft I, 26 F. Supp. 3d at 454 n.13 (citing Wade-Greaux v. Whitehall Labs., Inc., 874 F. Supp. 1441, 1453 (D.V.I. 1994), aff'd, 46 F.3d 1120 (3d Cir. 1994), for text, see No. 94-7199, 1994 WL 16973481 (3d Cir. Dec. 15, 1994)).
    claims that in so doing, the District Court was asserting a
    legal standard that required replicated, significant findings for
    reliability. 29 Pfizer contends that the District Court merely
    made a factual finding about what the teratology community
    generally accepts.
    Upon review, it is clear that the District Court was not
    creating a legal standard, but merely making a factual finding.
    The PSC argues that the District Court must have created a
    legal standard because it did not cite any sources other than
    Wade-Greaux to support its assertion that the teratology
    community generally requires replicated, significant
    epidemiological findings. However, in its initial exclusion of
    Dr. Bérard, the District Court noted that it looked to the
    standards adopted by “other epidemiologists, even the very
    researchers [Dr. Bérard] cites in her report.” 30 Similarly, in
    29
    Relatedly, the PSC claims that the District Court made a
    legal standard that “it was not reliable for Dr. Jewell to
    invoke studies observing non-statistically significant positive
    associations.” However, the language cited does not support
    this conclusion: The District Court merely asserts that
    “experts may use congruent but non-significant data to bolster
    inferences drawn from replicated, statistically significant
    data. However, in this case . . . three of the studies Dr. Jewell
    relies upon to show replication use overlapping data . . . [and]
    have not been replicated by later, well-powered studies which
    attempt to control for various confounding factors and
    biases.” JA 67-68.
    30
    Zoloft I, 26 F. Supp. 3d at 456 (“There exists a well-
    established methodology used by scientists in her field of
    epidemiology, and Dr. Bérard herself has utilized it in her
    published, peer-reviewed work. The ‘evolution’ in thinking
    its order denying general reconsideration of Dr. Bérard’s
    exclusion, the District Court clarified that it “made this
    factual finding after review of the published literature relied
    upon by Dr. Bérard and other experts, as well as its review of
    the reports and testimony of both parties” 31 and merely used
    this factual finding as part of its FRE 702 analysis. 32 While
    the District Court does cite Wade-Greaux, 33 it uses it merely
    to show “that other courts have made similar findings
    regarding the prevailing standards for scientists in Dr.
    Bérard’s field.” 34
    about the importance of statistical significance Dr. Bérard
    refers to does not appear to have been adopted by other
    epidemiologists, even the very researchers she cites in her
    report.”).
    31
In re Zoloft (Sertraline Hydrochloride) Prod. Liab. Litig. (Zoloft II), No. 12-2342, 2015 WL 314149, at *2 (E.D. Pa. Jan. 23, 2015); see, e.g., JA 3962, 3971-72.
    32
    While general acceptance by the scientific community is no
    longer dispositive in the Rule 702 analysis, it remains a factor
that a court may consider. Daubert, 509 U.S. at 594 (“[A] known technique which has been able to attract only minimal support within the community may properly be viewed with skepticism.”) (internal quotation marks and internal citation omitted).
    33
Wade-Greaux, 874 F. Supp. at 1453 (noting that “[a]bsent
    consistent, repeated human epidemiological studies showing a
    statistically significant increased risk of particular birth
    defects associated with exposure to a specific agent, the
    community of teratologists does not conclude that the agent is
    a human teratogen.”).
    34
Zoloft II, 2015 WL 314149, at *2.
Second, the course of the proceedings makes clear that
    the replication of significant results was not dispositive in
    establishing whether the testimony of either Dr. Bérard or Dr.
    Jewell was reliable. In fact, the District Court expressly
    rejected Pfizer’s argument that the existence of a statistically
    significant, replicated result is a threshold issue before an
    expert can conduct the Bradford-Hill analysis. 35 In doing so,
    the District Court was clear that it was not requiring a
    threshold showing of statistical significance. Similarly, the
    District Court did not end its inquiry after analyzing whether
    there were replicated, significant results. Instead, the District
    Court examined other techniques of general trend analysis,
    reanalysis of other studies, and meta-analysis. Even though it
    ultimately rejected the application of these techniques as
    unreliable, it did not categorically reject alternative
    techniques, suggesting that it did not make a legal standard
    requiring replicated, significant results.
    For these reasons, we find that the District Court did
    not require replication of significant results to establish
    reliability. Instead, it merely made a factual finding that
    teratologists generally require replication of significant
    results, and this factual finding did not prevent it from
    considering other evidence of reliability. 36
35
Id. (“In so doing, the Court rejected Pfizer's argument that the Court could exclude Dr. Bérard's opinion without even reaching her Bradford–Hill analysis, because the Bradford–Hill criteria should only be applied after an association is well established”); see also Zoloft I, 26 F. Supp. 3d at 462.
    36
    The PSC also argues that the District Court did not discuss
    one study providing a significant, positive association
between Zoloft and birth defects, Wemakor (2015). The PSC claims this is “reversible error because it inaccurately depicted Dr. Jewell’s opinion as unsupported by replicated, non-overlapping data.” Pfizer argues that the District Court did not have to mention each study and that Wemakor is unreliable, as the authors themselves admit that their findings are “compatible with confounding by depression as indication or other associated factors/exposures.” We conclude that this was not an error because it is clear the District Court considered Wemakor in the Daubert hearing. Even if the District Court had failed to consider Wemakor, we would find no error because it did not require replicated, statistically significant findings as a legal requirement.
    B.
    The second issue on appeal is whether it was an abuse
    of discretion for the District Court to exclude Dr. Jewell’s
    testimony. Dr. Jewell utilized a combination of two methods:
    the “weight of the evidence” analysis and the Bradford Hill
    criteria. The “weight of the evidence” analysis involves a
    series of logical steps used to “infer[] to the best
    explanation[.]” 37 The Bradford Hill criteria are metrics that
    epidemiologists use to distinguish a causal connection from a
    mere association. These metrics include strength of the
    association, consistency, specificity, temporality, coherence,
    biological gradient, plausibility, experimental evidence, and
    analogy. 38 In his expert report, Dr. Jewell seems to utilize
    numerous “techniques” in implementing the weight of the
    evidence methodology. Dr. Jewell discusses whether the
    37
Milward v. Acuity Specialty Prods. Grp., Inc., 639 F.3d 11, 17 (1st Cir. 2011) (internal quotation marks and citation omitted).
    38
    JA 5652-56.
    conclusions drawn from these techniques satisfy the Bradford
    Hill criteria and support the existence of a causal
    connection. 39
    Pfizer does not seem to contest the reliability of the
    Bradford Hill criteria or weight of the evidence analysis
    generally; the dispute centers on whether the specific
    methodology implemented by Dr. Jewell is reliable. Flexible
    methodologies, such as the “weight of the evidence,” can be
    implemented in multiple ways; despite the fact that the
    methodology is generally reliable, each application is distinct
    and should be analyzed for reliability. In In re Paoli R.R.
    Yard PCB Litigation, this Circuit noted that while differential
    diagnosis—also a flexible methodology—is generally
    accepted, “no particular combination of techniques chosen by
    a doctor to assess an individual patient is likely to have been
generally accepted.” 40 Accordingly, we subjected the
    expert’s specific differential diagnosis process to a Daubert
    inquiry. 41 We noted that “to the extent that a doctor utilizes
    standard diagnostic techniques in gathering this information,
    the more likely we are to find that the doctor’s methodology
    is reliable.” 42 While we did not require the expert to run
    specific tests or ascertain full information in order for the
    differential diagnosis to be reliable, we did require him to
    explain why his conclusion remained reliable in the face of
    39
    Pfizer argues that PSC did not previously use the “weight
    of the evidence” terminology for the method followed by Dr.
    Jewell. We assume for the sake of argument that this was the
    purported methodology all along.
    40
In re Paoli, 35 F.3d 717, 758 (3d Cir. 1994).
41
Id.
42
Id.
    alternate causes. 43
    This standard, while articulated with respect to
    differential diagnoses, applies to the weight of the evidence
    analysis. We have briefly encountered the Bradford Hill
    criteria/weight of the evidence methodology in Magistrini v.
    One Hour Martinizing Dry Cleaning, a nonprecedential
    affirmance of the District of New Jersey’s exclusion of an
    expert. 44 The expert followed the weight of the evidence
    methodology, including epidemiological findings assessed
    using the Bradford Hill criteria.        The District Court
    acknowledged that although the weight of the evidence
    methodology was generally reliable, “[t]he particular
    combination of evidence considered and weighed here has not
    been subjected to peer review.” 45 Similar concerns are
    arguably present for the Bradford Hill criteria, which are
43
Id. at 760 (“[T]he district court abused its discretion in
    excluding that opinion under Rule 702 unless either (1) Dr.
    Sherman or DiGregorio engaged in very few standard
    diagnostic techniques by which doctors normally rule out
    alternative causes and the doctor offered no good explanation
    as to why his or her conclusion remained reliable, or (2) the
    defendants pointed to some likely cause of the plaintiff's
    illness other than the defendants’ actions and Dr. Sherman or
    DiGregorio offered no reasonable explanation as to why he or
    she still believed that the defendants' actions were a
    substantial factor in bringing about that illness.”).
    44
    Magistrini v. One Hour Martinizing Dry Cleaning, 68 F.
    App’x 356 (3d Cir. 2003).
    45
Magistrini v. One Hour Martinizing Dry Cleaning, 180 F. Supp. 2d 584, 602 (D.N.J. 2002).
    neither an exhaustive nor a necessary list. 46 An expert can
    theoretically assign the most weight to only a few factors, or
    draw conclusions about one factor based on a particular
    combination of evidence. The specific way an expert
    conducts such an analysis must be reliable; “all of the
    relevant evidence must be gathered, and the assessment or
    weighing of that evidence must not be arbitrary, but must
    itself be based on methods of science.” 47 To ensure that the
    Bradford Hill/weight of the evidence criteria “is truly a
    methodology, rather than a mere conclusion-oriented
    selection process . . . there must be a scientific method of
    weighting that is used and explained.” 48 For this reason, the
    specific techniques by which the weight of the
    evidence/Bradford Hill methodology is conducted must
    themselves be reliable according to the principles articulated
    in Daubert. 49
    In short, despite the fact that both the Bradford Hill
    and the weight of the evidence analyses are generally reliable,
    46
Milward, 639 F.3d at 17.
    47
Magistrini, 180 F. Supp. 2d at 602.
48
Id. at 607.
    49
    There has been very little circuit authority regarding the
    application of the Bradford Hill criteria in the weight of the
    evidence analysis. The First Circuit has warned against
    “treat[ing] the separate evidentiary components of [the]
    analysis atomistically, as though [the] ultimate opinion was
independently supported by each.” Milward, 639 F.3d at 23. In contrast, the Tenth Circuit briefly discussed the Bradford Hill criteria, and then separately conducted a Daubert analysis for each body of evidence. Hollander v. Sandoz Pharm. Corp., 289 F.3d 1193, 1204-13 (10th Cir. 2002).
    the “techniques” used to implement the analysis must be 1)
    reliable and 2) reliably applied. In discussing the conclusions
    produced by such techniques in light of the Bradford Hill
    criteria, an expert must explain 1) how conclusions are drawn
    for each Bradford Hill criterion and 2) how the criteria are
    weighed relative to one another. Here, we accept that the
    Bradford Hill and weight of the evidence analyses are
    generally reliable. We also assume that the “techniques” used
    to implement the analysis (here, meta-analysis, trend analysis,
    and reanalysis) are themselves reliable. However, we find
    that Dr. Jewell did not 1) reliably apply the “techniques” to
    the body of evidence or 2) adequately explain how this
    analysis supports specified Bradford Hill criteria. Because
    “any step that renders the analysis unreliable under the
Daubert factors renders the expert’s testimony
    inadmissible,” 50 this is sufficient to show that the District
    Court did not abuse its discretion in excluding Dr. Jewell’s
    testimony.
    1.
    It was not an abuse of discretion for the District Court
    to find Dr. Jewell’s application of trend analysis, reanalysis,
    and meta-analysis to the body of evidence to be unreliable.
    Here, we assume the techniques listed are generally reliable
    and rest on the fact that they were unreliably applied. As
stated in In re Paoli, use of standard techniques bolsters the
    inference of reliability; 51 nonstandard techniques need to be
    well-explained. Additionally, if an expert applies certain
    techniques to a subset of the body of evidence and other
    50
In re Paoli, 35 F.3d at 745.
51
Id. at 758.
    techniques to another subset without explanation, this raises
    an inference of unreliable application of methodology. 52
    First, we find no abuse of discretion in the District
    Court’s determination that Dr. Jewell unreliably analyzed the
    trend in insignificant results.      Dr. Jewell applied this
    technique by qualitatively discussing the probative value of
    multiple positive, insignificant results. In justifying this
    approach, he relied on a quantitative method by which one
    can calculate the likelihood of seeing multiple positive but
insignificant results if there were actually no true effect. 53
    However, after alluding to this presumably reliable
    mathematical calculation technique for analyzing trends in
    even insignificant results, Dr. Jewell did not actually
    implement it; instead he qualitatively discussed the general
    trend in the data. In light of the opportunity to actually
    conduct such quantitative analysis, his refusal to do so—
    without explanation—suggests that he did not reliably apply
    his stated methodology. 54
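For reference, a minimal sketch of Fisher's combined probability test, the technique footnote 53 indicates the PSC identified on appeal; the p-values below are hypothetical placeholders, not results from the studies at issue:

```python
# Fisher's combined probability test: combine independent p-values into one
# overall test of "no effect anywhere." Inputs are hypothetical.
import math

def fisher_combined(p_values):
    """X^2 = -2 * sum(ln p) on 2k degrees of freedom for k independent p-values."""
    k = len(p_values)
    x2 = -2.0 * sum(math.log(p) for p in p_values)
    # The chi-square survival function has a closed form for even df = 2k.
    half = x2 / 2.0
    combined_p = math.exp(-half) * sum(half**i / math.factorial(i) for i in range(k))
    return x2, combined_p

# Several individually non-significant (but consistently positive) results...
p_values = [0.09, 0.20, 0.12, 0.30, 0.15]
x2, combined_p = fisher_combined(p_values)
print(f"X^2 = {x2:.2f} on {2 * len(p_values)} df, combined p = {combined_p:.4f}")
# ...can be jointly unlikely under the null hypothesis. The method assumes the
# studies are independent, the very assumption the overlapping Danish cohorts
# call into question.
```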
    Even assuming the reliability of Dr. Jewell’s version of
    52
See Magistrini, 180 F. Supp. 2d at 607 (noting that a
    scientific method of weighting must be explained to prevent a
    “conclusion-oriented selection process.”).
    53
    Dr. Jewell used this as an illustrative example in his report
and at the Daubert hearing, but on appeal the PSC identifies this
    technique as Fisher’s combined probability test. Insofar as
    this is part of a meta-analysis or is sensitive to the same
    heterogeneity issues articulated by Dr. Jewell, we reiterate
    our concerns below.
    54
    JA 69 (“[T]he Court finds Dr. Jewell’s failure to apply the
    methodology he outlined to the studies he reviewed
    problematic.”).
    trend analysis, Dr. Jewell identified trends and interpreted
    insignificant results differently based on the outcome of the
    study.     The District Court concluded that Dr. Jewell
    “selectively emphasize[d] observed consistency . . . only
    when the consistent studies support his opinion.” 55 Dr. Jewell
    emphasized the insignificance of results reporting odds ratios
    below 1 but not the insignificance of those reporting odds
    ratios above 1. He also paid attention to the upper bounds of
    the confidence intervals associated with odds ratios below 1,
    but not to the lower bounds.
    Second, we interpret the District Court’s discussion of
    heterogeneity as raising the concern that Dr. Jewell
    selectively used meta-analyses. He did this in two ways:
    First, without explanation, Dr. Jewell performed a meta-
    analysis on two studies but not on any of the other studies.
    The District Court questioned why Dr. Jewell did not conduct
    a meta-analysis on the remaining studies instead of using the
    qualitative general trend analysis. While Dr. Jewell was not
    required to do specific tests, the lack of explanation made his
    inconsistent application of meta-analysis to certain studies
    unreliable. 56 Second, when he did perform a meta-analysis,
    Dr. Jewell only included two studies utilizing “exposed” and
    “paused” groups even though each had a different definition
    55
    JA 69.
    56
    Dr. Jewell admitted that he did not “attempt to do a meta-
    analysis where [he] defined an a priori – an a priori
    inclusion/exclusion set of criteria, generated a return set of
    studies, assessed heterogeneity and then considered whether
    by further adjustment or accommodation, [he] could come up
    with a meaningful set of statistics.” He cryptically claimed
    that he “determined you couldn’t.” JA 4898.
    of “paused,” without an adequate explanation for why these
    studies can be lumped together. He also inexplicably
    excluded another study (Kornum (2010)) utilizing similar
    methodology. Again, while there may have been legitimate
    reasons for these inconsistencies, the fact that he did not give
    an adequate explanation for doing so makes his testimony
    unreliable.
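By way of comparison, a minimal sketch of an explicit inverse-variance meta-analysis with a heterogeneity check (Cochran's Q and I-squared); the odds ratios and confidence intervals are hypothetical, not taken from the record:

```python
# Inverse-variance (fixed-effect) pooling of log odds ratios, with Cochran's Q
# and I^2 as a heterogeneity check. Inputs below are hypothetical.
import math

studies = [            # (odds ratio, lower 95% CI, upper 95% CI) - hypothetical
    (1.40, 0.90, 2.18),
    (1.10, 0.85, 1.42),
    (1.60, 1.05, 2.44),
]

log_ors, weights = [], []
for or_, lo, hi in studies:
    se = (math.log(hi) - math.log(lo)) / (2 * 1.96)   # SE recovered from the CI
    log_ors.append(math.log(or_))
    weights.append(1 / se**2)                         # inverse-variance weight

pooled = sum(w * y for w, y in zip(weights, log_ors)) / sum(weights)
pooled_se = math.sqrt(1 / sum(weights))

q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, log_ors))  # Cochran's Q
df = len(studies) - 1
i2 = max(0.0, (q - df) / q) if q > 0 else 0.0   # share of variation beyond chance

print(f"Pooled OR = {math.exp(pooled):.2f} "
      f"(95% CI {math.exp(pooled - 1.96 * pooled_se):.2f}-"
      f"{math.exp(pooled + 1.96 * pooled_se):.2f}), Q = {q:.2f}, I^2 = {i2:.0%}")
```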
    Finally, Dr. Jewell reanalyzed two studies to control
    for confounding by indication. The need for conducting this
    reanalysis on Huybrechts (2014) was unclear. Dr. Jewell said
    that he wanted to control for indication by comparing the
    outcomes for “paused” Zoloft users to “exposed” Zoloft
    users; however, the study already controlled for indication. If
    Dr. Jewell wanted to correct for misclassification, the original
    study already controlled for that as well through extensive
    sensitivity analyses. 57   Given that the study originally
    concluded that Zoloft was not associated with a statistically
    significant increase in the likelihood of birth defects, this
    reanalysis seems conclusion-driven.
    Ultimately, the fact that Dr. Jewell applied these
    techniques inconsistently, without explanation, to different
    subsets of the body of evidence raises real issues of
    reliability.   Conclusions drawn from such unreliable
    application are themselves questionable.
    57
    It is true that these sensitivity analyses had less power
    because they involved looking at a subset of the population,
    making them less likely to find a significant difference;
    however, we could not find that Dr. Jewell has raised this
    point as a reason for reanalysis.
    2.
    Using the techniques discussed above, Dr. Jewell went
    on to evaluate the Bradford Hill criteria. While Dr. Jewell did
    discuss the applicable Bradford Hill criteria and how he
    weighed the factors together, he did not explain how he drew
    conclusions for certain criteria, namely the strength of
    association and consistency.
    Dr. Jewell concluded that the strength of association
    weighs in favor of causality. In doing so, he focused on
    studies reporting odds ratios between two and three (Colvin
    (2011), 58 Jimenez-Solem (2012), Malm (2011), 59 Pedersen
    (2009), and Louik (2007)). He rationalized that such a large
    association is unlikely to be associated with confounding
    alone. 60 He later bolstered this argument by estimating the
    percent of the effect generally attributable to confounding by
    indication. He estimated this percent by observing the
    percent decrease in odds ratios after controlling for indication
    over a few studies. When pressed by counsel at the Daubert
    hearing, Dr. Jewell admitted that this was not a scientifically
    58
    JA 6011-28. Lyn Colvin, et al., Dispensing Patterns and
    Pregnancy Outcomes for Women Dispensed Selective
    Serotonin Reuptake Inhibitors in Pregnancy, 91 Birth Defects
    Res. A Clin. Mol. Teratol. 142 (2011).
    59
JA 7697-7707. Heli Malm, et al., Selective Serotonin
    Reuptake Inhibitors and Risk for Major Congenital
    Anomalies, 118 Obstetrics & Gynecology 111 (2011).
    60
Dr. Jewell also notes that the absence of a link between depression and cardiac defects undercuts the confounding by indication argument. JA 7468-69.
    rigorous adjustment. 61 Such reliance on ad hoc adjustments
    supports the District Court’s decision to exclude Dr. Jewell’s
    testimony.
    Similarly, while Dr. Jewell found that the causal effect
    of Zoloft on cardiac birth defects is consistent, it is not clear
    how he drew this conclusion. As noted above, Dr. Jewell
    classified insignificant odds ratios above one as supporting a
    “consistent” causality result, downplaying the possibility that
    they support no association between Zoloft use and cardiac
    birth defects. While an insignificant result may be consistent
    with a causal effect, Dr. Jewell’s discussion is too far-
    reaching, sometimes understating the importance of statistical
    significance. For example, Furu (2015)—a study that
    incorporated almost all the data in Pedersen (2009), Jimenez-
Solem (2012), and Kornum (2010)—included a larger
    sample but, unlike the former three studies, reported no
    significant association between Zoloft and cardiac birth
    defects. Insignificant results can occur merely because a
    study lacks power to produce a significant result, and, all else
    being equal, a larger sample size increases the power of a
    test. 62 Unless there are other significant differences, we
    61
    JA 7470-71 (“I said, I didn't put that in my report. I put in
    that if you wanted as a statistician, if somebody came to me
    now as you're sort of hinting at and said [Colvin] didn’t adjust
    for confounding, well, that could make a big impact, I agree,
    it could, just if I knew nothing else. . . . [A] statistician knows
    from doing simulations and computation that we alluded to
    yesterday how much of an impact could you take -- get from
    adjusting for confounding even though in this particular
    population we [aren’t] able to do it. It’s not a definitive
    result.”)
    62
Insofar as Dr. Jewell finds Furu to be less powerful than the previous studies based on factors other than sample size, he has not articulated this argument.
    would expect Furu to be better able to capture a true effect
    than the preceding three studies. While an insignificant result
    from a low-powered study does not necessarily undermine a
    statistically significant result from a higher-powered study,
    the opposite argument (i.e., that an insignificant finding from
    a presumably better-powered study is evidence of consistency
    with significant findings from lower-powered studies)
    requires further explanation. 63 While there may be a reason
    that such a result could be consistent with the past significant
    effects, Dr. Jewell did not meaningfully discuss why this may
    be. 64      Without adequate explanation, this argument
    understates the importance of statistical significance. Like
    the expert in Magistrini, Dr. Jewell should have “sufficiently
    discredit[ed] other studies that found no association or a
    negative association with much more precise confidence
    intervals, [or] sufficiently explain[ed] why he did not accord
    weight to those studies.” 65 Claiming a consistent result
    without meaningfully addressing these alternate explanations,
    as noted in In re Paoli, undermines reliability. 66
    63
    For example, Dr. Jewell could have argued that, despite
    having a larger sample, Furu (2015) was not better powered
    for other reasons or utilized flawed methodology.
    64
    In fact, upon appeal, the PSC argues that Furu (2015) is
    consistent with Dr. Jewell’s causal result merely because it
    reports odds ratios above one (1.05 and 1.13).
    65
Magistrini, 180 F. Supp. 2d at 607 (emphasis added).
    66
In re Paoli, 35 F.3d at 760 (noting the importance of
    explaining why a conclusion remains reliable in the face of
    alternate explanations).
    For these reasons, the District Court determined that
    Dr. Jewell did not consistently assess the evidence supporting
    each criterion or explain his method for doing so. Thus, it
    was not an abuse of discretion to find that Dr. Jewell’s
    application of the Bradford Hill criteria was unreliable.
    This is not to suggest that all of the District Court’s
    criticisms were necessarily justified. For example, the fact
    that in his reanalysis Dr. Jewell drew a different conclusion
    from a study than its authors did is not necessarily a problem.
    Similarly, his imposition of a different assumption about the
    “exposed” group in Huybrechts (2014) did not require expert
    knowledge about psychology; he was merely testing the
    robustness of the results to Huybrechts’ original assumption.
    Similarly, the District Court credited the claim that
    overlapping samples did not provide replicated results,
    despite the fact that Dr. Jewell claimed it provided some
    informational value. 67 These inquiries are more appropriately
    left to the jury.
    On the whole, however, the District Court did not
    improperly usurp the jury’s role in assessing Dr. Jewell’s
    credibility. There is sufficient reason to find Dr. Jewell’s
    testimony was unreliable. Indeed, “any step that renders the
    analysis unreliable under the Daubert factors renders the
    expert’s testimony inadmissible.” 68 The fact that Dr. Jewell
    unreliably applied the techniques underlying the weight of the
    evidence analysis and the factors of the Bradford Hill analysis
    satisfies this standard for inadmissibility.
    67
    JA 7164 (noting that overlapping analysis still “provides a
    modicum of replication”).
    68
In re Paoli, 35 F.3d at 745.
    III.
    This case involves complicated facts, statistical
    methodology, and competing claims of appropriate standards
    for assessing causality from observational epidemiological
studies. Ultimately, however, the issue is quite clear. As gatekeepers, courts are supposed to ensure that the testimony
    given to the jury is reliable and will be more informative than
    confusing. Dr. Jewell’s application of his purported methods
    does not satisfy this standard.         By applying different
    techniques to subsets of the data and inconsistently discussing
    statistical significance, Dr. Jewell does not reliably analyze
    the weight of the evidence. Selecting these conclusions to
    discuss certain Bradford Hill factors also contributes to the
    unreliability. While the District Court may have flagged a
    few issues that are not necessarily indicative of an unreliable
    application of methods, there is certainly sufficient evidence
    on the record to suggest that the court did not abuse its
    discretion in excluding Dr. Jewell as an expert on the basis of
    the unreliability of his methods. For these reasons, we will
    affirm the orders of the District Court, excluding the
    testimony of Dr. Jewell and granting summary judgment in
    favor of Pfizer.