Examples

import pandas as pd
pd.set_option("display.max_rows", 10)

Higher level interface

Work with DataFrames through the CDHIT class.

from pycdhit import read_fasta, read_clstr, CDHIT

Read a FASTA file of amino acid sequences into a DataFrame. (The data are a subset of the Antimicrobial Peptide Database.)

df_in = read_fasta("apd.fasta")
df_in
    identifier  sequence
0   00001       GLWSKIKEVGKEAAKAAAKAAGKAALGAVSEAV
1   00002       YVPLPNVPQPGRRPFPTFPGQGPFNPKIKWPQGY
2   00003       DGVKLCDVPSGTWSGHCGSSSKCSQQCKDREHFAYGGACHYQFPSV...
3   00004       NLCERASLTWTGNCGNTGHCDTQCRNWESAKHGACHKRGNWKCFCYFDC
4   00005       VFIDILDKVENAIHNAAQVGIGFAKPFEKLINPK
... ...         ...
45  00046       GPLSCRRNGGVCIPIRCPGPMRQIGTCFGRPVKCCRSW
46  00047       GPLSCGRNGGVCIPIRCPVPMRQIGTCFGRPVKCCRSW
47  00048       SGISGPLSCGRNGGVCIPIRCPVPMRQIGTCFGRPVKCCRSW
48  00049       GIGALSAKGALKGLAKGLAEHFAN
49  00050       GIGASILSAGKSALKGLAKGLAEHFAN

50 rows × 2 columns
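To make the identifier and sequence columns concrete, here is a minimal sketch of how a FASTA file maps onto such records. It is not pycdhit's implementation: parse_fasta is a hypothetical helper, and the second record is wrapped over two lines purely for illustration.

```python
import io


def parse_fasta(handle):
    """Yield (identifier, sequence) pairs from a FASTA-formatted text stream."""
    identifier, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if identifier is not None:
                yield identifier, "".join(chunks)
            # Keep only the first whitespace-delimited token of the header.
            tokens = line[1:].split()
            identifier, chunks = (tokens[0] if tokens else ""), []
        else:
            # Sequence lines may be wrapped, so collect and join them.
            chunks.append(line)
    if identifier is not None:
        yield identifier, "".join(chunks)


# Two records taken from the table above (assumed file layout).
example = io.StringIO(
    ">00001\n"
    "GLWSKIKEVGKEAAKAAAKAAGKAALGAVSEAV\n"
    ">00049\n"
    "GIGALSAKGALK\n"
    "GLAKGLAEHFAN\n"
)
records = list(parse_fasta(example))
for identifier, sequence in records:
    print(identifier, len(sequence), sequence)
```

Each pair corresponds to one row of the DataFrame returned by read_fasta.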

Initialize a command object and show the help message. Here prog selects the CD-HIT program to run, and path is the directory where CD-HIT is installed.

cdhit = CDHIT(prog="cd-hit", path="~/cd-hit")
cdhit.help()

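Set options and run the command to cluster the sequences. The option names match CD-HIT's command-line flags: c is the sequence identity threshold, d is the length of the description kept in the .clstr file, and sc sorts the clusters by size.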

df_out, df_clstr = cdhit.set_options(c=0.7, d=0, sc=1).cluster(df_in)
df_clstr
    identifier  cluster  size  is_representative  identity
0   00037       0        40    False              100.00
1   00038       0        42    True               100.00
2   00041       0        42    False              85.71
3   00042       0        40    False              90.00
4   00043       0        38    False              86.84
... ...         ...      ...   ...                ...
44  00001       24       33    True               100.00
45  00035       25       26    True               100.00
46  00045       26       40    True               100.00
47  00036       27       38    True               100.00
48  00002       28       34    True               100.00

49 rows × 5 columns
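The keyword options used above share their names with CD-HIT's command-line flags. As a rough illustration of that correspondence, the sketch below turns keyword arguments into an argument list. build_cd_hit_args is a hypothetical helper written for this example; how pycdhit builds its command internally may differ.

```python
def build_cd_hit_args(prog="cd-hit", **options):
    """Turn keyword options into a command-line argument list.

    Hypothetical helper: each keyword `name=value` becomes `-name value`.
    """
    args = [prog]
    for name, value in options.items():
        args += [f"-{name}", str(value)]
    return args


# Keyword order is preserved, so the flags appear in the order given.
cmd = build_cd_hit_args(i="apd.fasta", o="out", c=0.7, d=0, sc=1)
print(" ".join(cmd))
```

Seeing the equivalent command line can help when looking an option up in the help message printed by cdhit.help().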

The stdout of the finished program can also be retrieved.

print(cdhit.subprocess.stdout)

Lower level interface

Work directly with files through functions.

from pycdhit import cd_hit, read_clstr

If the CD-HIT programs are not on PATH, set the environment variable CD_HIT_DIR to the directory where CD-HIT is installed.

import os
# Expand "~" explicitly; environment variables are not shell-expanded.
os.environ["CD_HIT_DIR"] = os.path.expanduser("~/cd-hit")
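Before running anything, it can be useful to confirm that the executable is reachable. The following is a minimal sketch of such a check, assuming the lookup order "CD_HIT_DIR first, then PATH". find_program is a hypothetical helper, and pycdhit's own lookup logic is not documented here and may differ.

```python
import os
import shutil


def find_program(prog, env_var="CD_HIT_DIR"):
    """Return the full path of `prog`, or None if it cannot be found."""
    directory = os.environ.get(env_var)
    if directory:
        # "~" is stored literally in environment variables, so expand it here.
        candidate = os.path.join(os.path.expanduser(directory), prog)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    # Fall back to an ordinary PATH lookup.
    return shutil.which(prog)


# Prints the resolved path, or None if cd-hit is not installed.
print(find_program("cd-hit"))
```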

Cluster the sequences with cd-hit. Path objects can also be passed as arguments.

res = cd_hit(
    i="apd.fasta",
    o="out",
    c=0.7,
    d=0,
    sc=1,
    bak=1,
)

cd-hit writes two output files: a FASTA file of representative sequences (out) and a text file describing the clusters (out.clstr). Read the .clstr file.

df_clstr = read_clstr("out.clstr")
df_clstr
    identifier  cluster  size  is_representative  identity
0   00037       0        40    False              100.00
1   00038       0        42    True               100.00
2   00041       0        42    False              85.71
3   00042       0        40    False              90.00
4   00043       0        38    False              86.84
... ...         ...      ...   ...                ...
44  00001       24       33    True               100.00
45  00035       25       26    True               100.00
46  00045       26       40    True               100.00
47  00036       27       38    True               100.00
48  00002       28       34    True               100.00

49 rows × 5 columns
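For readers unfamiliar with the .clstr text format, the sketch below shows how its lines relate to the columns above. parse_clstr is a hypothetical helper and not pycdhit's read_clstr. The sample text is reconstructed from a few rows of the table and is not copied from an actual out.clstr file.

```python
import re

# Member line, e.g. "0<TAB>40aa, >00037... at 100.00%" or "1<TAB>42aa, >00038... *"
MEMBER = re.compile(r"^\d+\s+(\d+)(?:aa|nt), >(.+?)\.\.\. (\*|at .+)$")


def parse_clstr(text):
    """Return one dict per sequence listed in .clstr-formatted text."""
    rows, cluster = [], None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">Cluster"):
            cluster = int(line.split()[1])
            continue
        match = MEMBER.match(line)
        if match is None:
            continue
        size, identifier, tail = match.groups()
        # The representative is marked with "*"; other members carry "at NN.NN%".
        is_rep = tail == "*"
        identity = 100.0 if is_rep else float(re.search(r"([\d.]+)%", tail).group(1))
        rows.append({
            "identifier": identifier,
            "cluster": cluster,
            "size": int(size),
            "is_representative": is_rep,
            "identity": identity,
        })
    return rows


sample = """\
>Cluster 0
0\t40aa, >00037... at 100.00%
1\t42aa, >00038... *
2\t42aa, >00041... at 85.71%
>Cluster 24
0\t33aa, >00001... *
"""
rows = parse_clstr(sample)
for row in rows:
    print(row)
```

The sketch reports an identity of 100.0 for representatives, mirroring the table above.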

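Because bak=1 was set, cd-hit also writes a backup cluster file, out.bak.clstr. Read it as well. Note that its cluster IDs differ from those in out.clstr, and that the rows appear in input order.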

df_bak_clstr = read_clstr("out.bak.clstr")
df_bak_clstr
    identifier  cluster  size  is_representative  identity
0   00001       14       33    True               100.00
1   00002       12       34    True               100.00
2   00003       2        54    True               100.00
3   00004       3        49    True               100.00
4   00005       13       34    True               100.00
... ...         ...      ...   ...                ...
44  00046       8        38    False              94.74
45  00047       8        38    False              100.00
46  00048       8        42    True               100.00
47  00049       15       24    False              95.83
48  00050       15       27    True               100.00

49 rows × 5 columns
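Since the cluster IDs differ between the two tables, they cannot be compared ID by ID. One way to check whether two cluster tables describe the same grouping is to compare the sets of members instead, ignoring the labels. The sketch below uses toy data rather than the real output, and partition is a hypothetical helper.

```python
from collections import defaultdict


def partition(pairs):
    """Group identifiers by cluster ID and drop the IDs themselves."""
    groups = defaultdict(set)
    for identifier, cluster in pairs:
        groups[cluster].add(identifier)
    return {frozenset(members) for members in groups.values()}


# Toy data: the same grouping with different cluster numbering.
sorted_pairs = [("a", 0), ("b", 0), ("c", 0), ("d", 1), ("e", 1), ("f", 2)]
backup_pairs = [("a", 2), ("b", 2), ("c", 2), ("d", 0), ("e", 0), ("f", 1)]

same_grouping = partition(sorted_pairs) == partition(backup_pairs)
print(same_grouping)
```

With the DataFrames above, the pairs could be built with zip(df_clstr["identifier"], df_clstr["cluster"]) and likewise for df_bak_clstr.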

The stdout of the finished program can also be retrieved.

print(res.stdout)
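The stdout attribute follows the pattern of Python's standard subprocess module. The sketch below shows that pattern with a stand-in command. It assumes, without confirmation from this page, that res behaves like a subprocess.CompletedProcess. The child command here is a placeholder, not cd-hit.

```python
import subprocess
import sys

# Run a stand-in child process and capture its output as text.
completed = subprocess.run(
    [sys.executable, "-c", "print('clustering finished')"],
    capture_output=True,  # collect stdout and stderr instead of printing them
    text=True,            # decode bytes to str
    check=True,           # raise CalledProcessError on a non-zero exit code
)
print(completed.stdout)
print(completed.returncode)
```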