Easily phylotyping E. coli via the EzClermont web app and command-line tool

The Clermont PCR method of phylotyping Escherichia coli has remained a useful classification scheme despite the proliferation of higher-resolution sequence typing schemes. We have implemented an in silico Clermont PCR method as both a web app and a command-line tool so that researchers can easily apply this phylotyping scheme to genome assemblies. Availability and Implementation: EzClermont is available as a web app at https://nickp60.pythonanywhere.com. For local use, EzClermont can be installed with pip or from the source code at https://github.com/nickp60/ezclermont. All analysis was done with version 0.4.0. Contact: n.waters4@nuigalway.ie. Supplementary information: Table S1: test dataset; Table S2: validation dataset; Table S3: results.

All these methods classify E. coli with greater accuracy and granularity than phylotyping, but at the cost of interpretability. The Clermont 2013 phylotyping scheme remains a regularly used tool for classifying E. coli.
We developed EzClermont to provide a simple implementation of the Clermont phylotyping algorithm for genome assemblies. For researchers unfamiliar with command-line tools, we have implemented the software as a web application; for those needing to process large numbers of assemblies, a command-line interface can be installed via pip.
In short, the software uses constrained string matching as an in silico PCR to determine the presence or absence of the alleles that define the phylotype. As assemblies may contain alleles interrupted by breaks between contigs, the user can opt to allow partial matches (i.e., one of the two primers matched, but the expected position of the other primer fell beyond the end of the sequence).
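The matching logic described above can be illustrated with a minimal sketch. This is not the EzClermont source code: the function names, the `tolerance` parameter, the return values, and the toy sequences are all assumptions made for illustration, and the "constraint" is assumed here to be the expected amplicon length. Only one strand is searched; a caller would repeat the search on the reverse complement of the contig.

```python
def revcomp(seq):
    """Reverse complement of an unambiguous DNA string."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def find_all(text, pattern):
    """Start indices of every (possibly overlapping) exact occurrence."""
    hits, start = [], text.find(pattern)
    while start != -1:
        hits.append(start)
        start = text.find(pattern, start + 1)
    return hits


def in_silico_pcr(contig, fwd, rev, expected_len, tolerance=10,
                  allow_partial=False):
    """Hypothetical helper: return "full", "partial", or None.

    "full":    both primers found, amplicon length within `tolerance`
               of `expected_len`.
    "partial": one primer found, and the expected position of the other
               falls beyond the contig boundary (only if allow_partial).
    """
    contig = contig.upper()
    rev_site = revcomp(rev)  # reverse primer as seen on the forward strand
    fwd_hits = find_all(contig, fwd.upper())
    rev_hits = find_all(contig, rev_site)

    for f in fwd_hits:
        for r in rev_hits:
            amplicon_len = r + len(rev_site) - f
            if abs(amplicon_len - expected_len) <= tolerance:
                return "full"

    if allow_partial:
        # Forward primer found, amplicon would run past the contig end.
        if any(f + expected_len > len(contig) for f in fwd_hits):
            return "partial"
        # Reverse primer found, amplicon would start before the contig.
        if any(r + len(rev_site) - expected_len < 0 for r in rev_hits):
            return "partial"
    return None


# Toy data (not real Clermont primers): a 32 bp "amplicon" inside a contig.
fwd, rev = "ACGTAC", "CCGTTA"
contig = "GG" + fwd + "T" * 20 + revcomp(rev) + "CC"
print(in_silico_pcr(contig, fwd, rev, 32, tolerance=0))
print(in_silico_pcr(contig[:20], fwd, rev, 32, allow_partial=True))
```

Truncating the toy contig to its first 20 bases mimics an allele interrupted by a contig break: the forward primer is still found, but the reverse priming site would lie past the end of the sequence, so the hit is only reported when partial matches are allowed.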
As PCR primers do not necessarily need 100% sequence identity to function, we determined the variability at the priming sites in 523 strains. To do this, we downloaded the genome assemblies from NCBI BioProjects PRJNA218110, PRJNA231221, and PRJNA352562. From each assembly, we extracted the 7 regions matching the theoretical amplicons of the quadruplex, E-specific, C-specific, and E/C control primer sets from Clermont 2013. Any differences between a sequence and the primer sequence reported in Clermont 2013 were incorporated into the search query, except for differences in the last 5 nucleotides at the 3' end (as those can be used to differentiate alleles) [14].
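One way to picture how observed priming-site variation could be folded into a search query is the sketch below. It is an illustration under stated assumptions rather than the procedure used in EzClermont: the function name `build_primer_pattern`, the use of regular-expression character classes, and the toy sequences are all hypothetical. Primers are assumed to be written 5' to 3', so the last five characters of the string are the protected 3' end.

```python
import re


def build_primer_pattern(primer, observed_sites, protected_3prime=5):
    """Hypothetical helper: turn a primer into a regex that tolerates
    observed substitutions everywhere except the 3' end.

    primer:           canonical primer, written 5' -> 3'
    observed_sites:   priming-site sequences seen in other strains,
                      same orientation and length as the primer
    protected_3prime: number of 3' bases kept exactly as in the primer
    """
    primer = primer.upper()
    n = len(primer)
    allowed = [{base} for base in primer]

    for site in observed_sites:
        site = site.upper()
        if len(site) != n:
            continue  # this sketch ignores sites with insertions/deletions
        for i, base in enumerate(site):
            # Only positions upstream of the protected 3' bases may vary.
            if i < n - protected_3prime and base in "ACGT":
                allowed[i].add(base)

    parts = []
    for bases in allowed:
        if len(bases) == 1:
            parts.append(next(iter(bases)))
        else:
            parts.append("[" + "".join(sorted(bases)) + "]")
    return "".join(parts)


# Toy data (not a real Clermont primer): one variant at position 2,
# and one variant inside the protected 3' end that must be ignored.
pattern = build_primer_pattern("ACGTACGTAC", ["ATGTACGTAC", "ACGTACGTTC"])
print(pattern)
print(bool(re.fullmatch(pattern, "ATGTACGTAC")))
print(bool(re.fullmatch(pattern, "ACGTACGTTC")))
```

Keeping the 3' bases exact reflects the rationale given above: mismatches there are what allow allele-specific primers to discriminate, so relaxing them would blur the distinction the scheme relies on.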
To assess the performance of EzClermont, we selected a test dataset and a validation dataset. Additionally, the strains from Figure 1 of Clermont 2013 are used as unit tests in the package.
As a test set, we used the strains listed in Sims and Kim 2011 [13] (Table S1); the validation set of 95 strains was the genomes from Clermont 2015 [5] (Table S2; 6 of the 101 total strains were omitted as no genome assembly was available).
In the test set, three strains did not agree, but two of those (IAI39, SMS-3-5) were shown by other works to have the phylotype that EzClermont predicted (see Table 1). The one strain that typed differently (APEC01) was examined and was found to have the ArpA allele that is not normally detected in B2 strains.
In the validation set, [...] of the 95 strains' classifications matched. To determine whether the inconsistent phylogroup assignments matched the phylogeny, we generated a parsimony tree using kSNP3 [7] and plotted it with ggtree [17]. This revealed that the EzClermont classification of ECOR46 (similar to IAI39 and SMS-3-5) appears to match the true phylogeny, as opposed to the phylogroup reported in the literature (Figure 1). Of those that did not match, all detected at least one theoretical amplicon that was not reported to be present (Table S3).
Considering both the test and validation datasets (114 strains), EzClermont has an accuracy of 94%. Given the ease of use of the web app for simple queries, and the speed of execution for larger batches, we hope that EzClermont will be of use to the community.