Examples
import pandas as pd
pd.set_option("display.max_rows", 10)
Higher level interface
Work on DataFrames using classes.
from pycdhit import read_fasta, read_clstr, CDHIT
Read a FASTA file of amino acid sequences into a DataFrame. (The data are a subset of the Antimicrobial Peptide Database.)
df_in = read_fasta("apd.fasta")
df_in
| | identifier | sequence |
|---|---|---|
| 0 | 00001 | GLWSKIKEVGKEAAKAAAKAAGKAALGAVSEAV |
| 1 | 00002 | YVPLPNVPQPGRRPFPTFPGQGPFNPKIKWPQGY |
| 2 | 00003 | DGVKLCDVPSGTWSGHCGSSSKCSQQCKDREHFAYGGACHYQFPSV... |
| 3 | 00004 | NLCERASLTWTGNCGNTGHCDTQCRNWESAKHGACHKRGNWKCFCYFDC |
| 4 | 00005 | VFIDILDKVENAIHNAAQVGIGFAKPFEKLINPK |
| ... | ... | ... |
| 45 | 00046 | GPLSCRRNGGVCIPIRCPGPMRQIGTCFGRPVKCCRSW |
| 46 | 00047 | GPLSCGRNGGVCIPIRCPVPMRQIGTCFGRPVKCCRSW |
| 47 | 00048 | SGISGPLSCGRNGGVCIPIRCPVPMRQIGTCFGRPVKCCRSW |
| 48 | 00049 | GIGALSAKGALKGLAKGLAEHFAN |
| 49 | 00050 | GIGASILSAGKSALKGLAKGLAEHFAN |
50 rows × 2 columns
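Because the result is an ordinary DataFrame, the usual pandas operations apply to it. The following is a minimal sketch, assuming the two-column layout shown above. The frame is rebuilt by hand from four of the displayed rows so that the snippet runs without the FASTA file, and the names df_sample and length are illustrative only.

```python
import pandas as pd

# Rows copied by hand from the table above; nothing is read from apd.fasta here.
df_sample = pd.DataFrame(
    {
        "identifier": ["00001", "00002", "00004", "00005"],
        "sequence": [
            "GLWSKIKEVGKEAAKAAAKAAGKAALGAVSEAV",
            "YVPLPNVPQPGRRPFPTFPGQGPFNPKIKWPQGY",
            "NLCERASLTWTGNCGNTGHCDTQCRNWESAKHGACHKRGNWKCFCYFDC",
            "VFIDILDKVENAIHNAAQVGIGFAKPFEKLINPK",
        ],
    }
)

# Sequence length in residues, computed with the vectorized string accessor.
df_sample["length"] = df_sample["sequence"].str.len()
lengths = df_sample["length"].tolist()
print(lengths)  # [33, 34, 49, 34]
```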
Initialize a command object, and show the help message.
cdhit = CDHIT(prog="cd-hit", path="~/cd-hit")
cdhit.help()
Set options and run the command to cluster the sequences.
df_out, df_clstr = cdhit.set_options(c=0.7, d=0, sc=1).cluster(df_in)
df_clstr
| | identifier | cluster | size | is_representative | identity |
|---|---|---|---|---|---|
| 0 | 00037 | 0 | 40 | False | 100.00 |
| 1 | 00038 | 0 | 42 | True | 100.00 |
| 2 | 00041 | 0 | 42 | False | 85.71 |
| 3 | 00042 | 0 | 40 | False | 90.00 |
| 4 | 00043 | 0 | 38 | False | 86.84 |
| ... | ... | ... | ... | ... | ... |
| 44 | 00001 | 24 | 33 | True | 100.00 |
| 45 | 00035 | 25 | 26 | True | 100.00 |
| 46 | 00045 | 26 | 40 | True | 100.00 |
| 47 | 00036 | 27 | 38 | True | 100.00 |
| 48 | 00002 | 28 | 34 | True | 100.00 |
49 rows × 5 columns
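The cluster table can be queried like any other DataFrame. In the output above, the size column appears to hold the sequence length rather than the number of cluster members (for example, 33 for 00001, which matches the length of that sequence in the input table). The sketch below assumes the column layout shown above and rebuilds the first five displayed rows by hand. The counts it produces therefore cover only this sample, not the full cluster.

```python
import pandas as pd

# First five rows of the cluster table above, copied by hand.
df_clstr_sample = pd.DataFrame(
    {
        "identifier": ["00037", "00038", "00041", "00042", "00043"],
        "cluster": [0, 0, 0, 0, 0],
        "size": [40, 42, 42, 40, 38],
        "is_representative": [False, True, False, False, False],
        "identity": [100.00, 100.00, 85.71, 90.00, 86.84],
    }
)

# Keep only the rows flagged as cluster representatives.
representatives = df_clstr_sample.loc[
    df_clstr_sample["is_representative"], ["cluster", "identifier"]
]

# Count member sequences per cluster (here: within the five sample rows only).
members_per_cluster = df_clstr_sample.groupby("cluster")["identifier"].count()

print(representatives)
print(members_per_cluster)
```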
The stdout of the finished program can also be retrieved.
print(cdhit.subprocess.stdout)
Lower level interface
Work on files using functions.
from pycdhit import cd_hit, read_clstr
Set the path of the installed CD-HIT in the environment variable CD_HIT_DIR if it is not already on PATH.
import os
os.environ["CD_HIT_DIR"] = "~/cd-hit"
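Whether a leading ~ in CD_HIT_DIR gets expanded depends on how the library resolves the path, which this page does not state. A cautious sketch, assuming the same install location as above, expands the home directory up front with the standard library. The variable name cd_hit_dir is illustrative only.

```python
import os
from pathlib import Path

# Expand "~" to the current user's home directory before exporting the variable,
# so the value is usable even if nothing downstream performs the expansion.
cd_hit_dir = Path("~/cd-hit").expanduser()
os.environ["CD_HIT_DIR"] = str(cd_hit_dir)
```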
Cluster the sequences using cd-hit. Path objects can also be used in arguments.
res = cd_hit(
i="apd.fasta",
o="out",
c=0.7,
d=0,
sc=1,
bak=1,
)
The output files are a FASTA file of representative sequences (out) and a text file of clusters (out.clstr). Read the clstr file.
df_clstr = read_clstr("out.clstr")
df_clstr
| | identifier | cluster | size | is_representative | identity |
|---|---|---|---|---|---|
| 0 | 00037 | 0 | 40 | False | 100.00 |
| 1 | 00038 | 0 | 42 | True | 100.00 |
| 2 | 00041 | 0 | 42 | False | 85.71 |
| 3 | 00042 | 0 | 40 | False | 90.00 |
| 4 | 00043 | 0 | 38 | False | 86.84 |
| ... | ... | ... | ... | ... | ... |
| 44 | 00001 | 24 | 33 | True | 100.00 |
| 45 | 00035 | 25 | 26 | True | 100.00 |
| 46 | 00045 | 26 | 40 | True | 100.00 |
| 47 | 00036 | 27 | 38 | True | 100.00 |
| 48 | 00002 | 28 | 34 | True | 100.00 |
49 rows × 5 columns
Read the out.bak.clstr file, which is produced because bak=1 was set. Note that its cluster IDs differ from those in out.clstr.
df_bak_clstr = read_clstr("out.bak.clstr")
df_bak_clstr
| | identifier | cluster | size | is_representative | identity |
|---|---|---|---|---|---|
| 0 | 00001 | 14 | 33 | True | 100.00 |
| 1 | 00002 | 12 | 34 | True | 100.00 |
| 2 | 00003 | 2 | 54 | True | 100.00 |
| 3 | 00004 | 3 | 49 | True | 100.00 |
| 4 | 00005 | 13 | 34 | True | 100.00 |
| ... | ... | ... | ... | ... | ... |
| 44 | 00046 | 8 | 38 | False | 94.74 |
| 45 | 00047 | 8 | 38 | False | 100.00 |
| 46 | 00048 | 8 | 42 | True | 100.00 |
| 47 | 00049 | 15 | 24 | False | 95.83 |
| 48 | 00050 | 15 | 27 | True | 100.00 |
49 rows × 5 columns
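Since the two files label clusters differently, comparing the cluster column directly is not meaningful. One way to check whether two tables describe the same grouping is to compare the sets of identifiers per cluster and ignore the labels. The sketch below uses made-up toy tables: the identifiers a to d and the helper name partition are illustrative only and are not part of pycdhit.

```python
import pandas as pd


def partition(df: pd.DataFrame) -> set:
    """Return the clustering as a set of identifier sets, ignoring cluster IDs."""
    return {frozenset(ids) for _, ids in df.groupby("cluster")["identifier"]}


# Toy data (hypothetical): same grouping, different cluster labels and row order.
df_a = pd.DataFrame({"identifier": ["a", "b", "c", "d"], "cluster": [0, 0, 1, 2]})
df_b = pd.DataFrame({"identifier": ["c", "a", "d", "b"], "cluster": [5, 7, 1, 7]})

# Toy data (hypothetical): "b" has moved into the cluster of "c".
df_c = pd.DataFrame({"identifier": ["a", "b", "c", "d"], "cluster": [0, 1, 1, 2]})

same_grouping = partition(df_a) == partition(df_b)
different_grouping = partition(df_a) == partition(df_c)
print(same_grouping, different_grouping)  # True False
```

With real output, the same helper could be applied to the two DataFrames returned by read_clstr, provided both have the identifier and cluster columns shown above.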
The stdout of the finished program can also be retrieved.
print(res.stdout)