Examples

import pandas as pd
pd.set_option("display.max_rows", 10)

Higher level interface

Work on DataFrames using classes.

from pycdhit import read_fasta, read_clstr, CDHIT

Read a FASTA file into a DataFrame of amino acid sequences. (The data are a subset of the Antimicrobial Peptide Database.)

df_in = read_fasta("apd.fasta")
df_in

	identifier	sequence
0	00001	GLWSKIKEVGKEAAKAAAKAAGKAALGAVSEAV
1	00002	YVPLPNVPQPGRRPFPTFPGQGPFNPKIKWPQGY
2	00003	DGVKLCDVPSGTWSGHCGSSSKCSQQCKDREHFAYGGACHYQFPSV...
3	00004	NLCERASLTWTGNCGNTGHCDTQCRNWESAKHGACHKRGNWKCFCYFDC
4	00005	VFIDILDKVENAIHNAAQVGIGFAKPFEKLINPK
...	...	...
45	00046	GPLSCRRNGGVCIPIRCPGPMRQIGTCFGRPVKCCRSW
46	00047	GPLSCGRNGGVCIPIRCPVPMRQIGTCFGRPVKCCRSW
47	00048	SGISGPLSCGRNGGVCIPIRCPVPMRQIGTCFGRPVKCCRSW
48	00049	GIGALSAKGALKGLAKGLAEHFAN
49	00050	GIGASILSAGKSALKGLAKGLAEHFAN

50 rows × 2 columns
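To make the two columns concrete, here is a minimal sketch of how FASTA text maps to (identifier, sequence) pairs. It uses only the standard library and is not how `read_fasta` is implemented; the helper name `iter_fasta` is hypothetical, and the second record is wrapped over two lines purely for illustration.

```python
from io import StringIO
from typing import Iterator, TextIO, Tuple


def iter_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (identifier, sequence) pairs from FASTA-formatted text."""
    identifier = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if identifier is not None:
                yield identifier, "".join(chunks)
            identifier = line[1:]
            chunks = []
        elif identifier is not None:
            # Sequence lines may be wrapped; collect and join them.
            chunks.append(line)
    if identifier is not None:
        yield identifier, "".join(chunks)


# The first two records from the table above.
text = """>00001
GLWSKIKEVGKEAAKAAAKAAGKAALGAVSEAV
>00002
YVPLPNVPQPGRRPFPTFPGQ
GPFNPKIKWPQGY
"""
records = list(iter_fasta(StringIO(text)))
for identifier, sequence in records:
    print(identifier, len(sequence))
```

Each header line becomes one row, and wrapped sequence lines are joined into a single string.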

Initialize a command object, and show the help message.

cdhit = CDHIT(prog="cd-hit", path="~/cd-hit")
cdhit.help()

Set options and run the command to cluster the sequences.

df_out, df_clstr = cdhit.set_options(c=0.7, d=0, sc=1).cluster(df_in)
df_clstr

	identifier	cluster	size	is_representative	identity
0	00037	0	40	False	100.00
1	00038	0	42	True	100.00
2	00041	0	42	False	85.71
3	00042	0	40	False	90.00
4	00043	0	38	False	86.84
...	...	...	...	...	...
44	00001	24	33	True	100.00
45	00035	25	26	True	100.00
46	00045	26	40	True	100.00
47	00036	27	38	True	100.00
48	00002	28	34	True	100.00

49 rows × 5 columns
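As a rough illustration of how the cluster table can be read, the sketch below groups a few rows by cluster and looks up each representative. The rows are copied from the visible part of the table above (the elided rows are not included, so the member counts are partial), and plain tuples stand in for the DataFrame. Note that `size` looks like the sequence length rather than the cluster size: 00001 has 33 residues in the input table and a `size` of 33 here.

```python
from collections import defaultdict

# (identifier, cluster, size, is_representative, identity)
rows = [
    ("00037", 0, 40, False, 100.00),
    ("00038", 0, 42, True, 100.00),
    ("00041", 0, 42, False, 85.71),
    ("00042", 0, 40, False, 90.00),
    ("00043", 0, 38, False, 86.84),
    ("00001", 24, 33, True, 100.00),
    ("00035", 25, 26, True, 100.00),
]

members = defaultdict(list)   # cluster ID -> identifiers in that cluster
representative = {}           # cluster ID -> identifier of the representative
for identifier, cluster, size, is_rep, identity in rows:
    members[cluster].append(identifier)
    if is_rep:
        representative[cluster] = identifier

for cluster in sorted(members):
    print(cluster, representative[cluster], len(members[cluster]))
```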

The stdout of the finished program can also be retrieved.

print(cdhit.subprocess.stdout)

Work on files using functions.

from pycdhit import cd_hit

If the CD-HIT installation directory is not on PATH, set it in the environment variable CD_HIT_DIR.

import os
os.environ["CD_HIT_DIR"] = os.path.expanduser("~/cd-hit")
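One detail worth checking when setting such a directory is that `~` is shell syntax and is not expanded by the operating system. The sketch below shows one way an executable could be located from a directory or from PATH. The helper `find_program` and its lookup order (CD_HIT_DIR first, then PATH) are assumptions for illustration, not the package's documented behaviour.

```python
import os
import shutil
from typing import Optional


def find_program(prog: str, directory: Optional[str] = None) -> Optional[str]:
    """Return the full path of an executable, or None if it is not found."""
    if directory is None:
        # Assumed lookup order: CD_HIT_DIR first, then PATH.
        directory = os.environ.get("CD_HIT_DIR")
    if directory is None:
        return shutil.which(prog)
    # Expand "~" explicitly before searching the directory.
    expanded = os.path.expanduser(directory)
    return shutil.which(prog, path=expanded)


print(os.path.expanduser("~/cd-hit"))  # absolute path under the home directory
print(find_program("cd-hit"))          # None if cd-hit cannot be found
```

A `None` result before running anything is a quick hint that the directory or PATH needs fixing.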

Cluster the sequences using cd-hit. Path objects can also be used in arguments.

res = cd_hit(
    i="apd.fasta",
    o="out",
    c=0.7,
    d=0,
    sc=1,
    bak=1,
)
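The keyword arguments mirror CD-HIT's command-line flags (`-i`, `-o`, `-c`, `-d`, `-sc`, `-bak`). As a sketch of that correspondence, the hypothetical helper below turns keyword options into an argument list. It only illustrates the mapping and is not the package's actual implementation.

```python
from typing import List


def build_command(prog: str, **options) -> List[str]:
    """Turn keyword options into a flat argument list, e.g. c=0.7 -> ["-c", "0.7"]."""
    command = [prog]
    for name, value in options.items():
        # str() also handles pathlib.Path values.
        command.extend([f"-{name}", str(value)])
    return command


command = build_command("cd-hit", i="apd.fasta", o="out", c=0.7, d=0, sc=1, bak=1)
print(" ".join(command))
# cd-hit -i apd.fasta -o out -c 0.7 -d 0 -sc 1 -bak 1
```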

The output files are a fasta file of representative sequences and a text file of clusters. Read the clstr file.

df_clstr = read_clstr("out.clstr")
df_clstr

	identifier	cluster	size	is_representative	identity
0	00037	0	40	False	100.00
1	00038	0	42	True	100.00
2	00041	0	42	False	85.71
3	00042	0	40	False	90.00
4	00043	0	38	False	86.84
...	...	...	...	...	...
44	00001	24	33	True	100.00
45	00035	25	26	True	100.00
46	00045	26	40	True	100.00
47	00036	27	38	True	100.00
48	00002	28	34	True	100.00

49 rows × 5 columns
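To show where these columns come from, here is a minimal sketch of parsing `.clstr` text with the standard library. It assumes the usual CD-HIT layout (a `>Cluster N` header followed by member lines, with `*` marking the representative). The sample text is reconstructed from the table above rather than copied from `out.clstr`, and `parse_clstr` is a hypothetical helper, not the code behind `read_clstr`.

```python
import re
from typing import Dict, List

# A member line looks like "0\t40aa, >00037... at 100.00%";
# the representative ends with "*" instead of an identity.
MEMBER = re.compile(r"^\d+\s+(\d+)(?:aa|nt), >(.+?)\.\.\. (?:\*|at .*?([\d.]+)%)$")


def parse_clstr(text: str) -> List[Dict]:
    """Parse .clstr text into one dict per sequence."""
    rows = []
    cluster = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">Cluster"):
            cluster = int(line.split()[1])
            continue
        match = MEMBER.match(line)
        if match is None:
            continue  # ignore lines that do not look like members
        size, identifier, identity = match.groups()
        rows.append({
            "identifier": identifier,
            "cluster": cluster,
            "size": int(size),
            "is_representative": identity is None,
            # Mirror the table above, where representatives show 100.00.
            "identity": 100.0 if identity is None else float(identity),
        })
    return rows


sample = (
    ">Cluster 0\n"
    "0\t40aa, >00037... at 100.00%\n"
    "1\t42aa, >00038... *\n"
    "2\t42aa, >00041... at 85.71%\n"
    ">Cluster 24\n"
    "0\t33aa, >00001... *\n"
)
rows = parse_clstr(sample)
for row in rows:
    print(row)
```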

Read the bak.clstr file. Note that the cluster IDs are different.

df_bak_clstr = read_clstr("out.bak.clstr")
df_bak_clstr

	identifier	cluster	size	is_representative	identity
0	00001	14	33	True	100.00
1	00002	12	34	True	100.00
2	00003	2	54	True	100.00
3	00004	3	49	True	100.00
4	00005	13	34	True	100.00
...	...	...	...	...	...
44	00046	8	38	False	94.74
45	00047	8	38	False	100.00
46	00048	8	42	True	100.00
47	00049	15	24	False	95.83
48	00050	15	27	True	100.00

49 rows × 5 columns
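Because only the labels differ, the two files can still be compared by checking that they group the sequences the same way. The sketch below does this with sets of members, ignoring the cluster IDs. The IDs for 00001 and the backup IDs for 00046 to 00048 come from the tables above. The sorted ID `3` for 00046 to 00048 is a made-up value for illustration, since those rows are elided in the sorted table.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Set


def as_partition(assignment: Dict[str, int]) -> Set[FrozenSet[str]]:
    """Convert {identifier: cluster ID} into a set of member groups."""
    groups = defaultdict(set)
    for identifier, cluster in assignment.items():
        groups[cluster].add(identifier)
    return {frozenset(group) for group in groups.values()}


# Cluster ID 3 below is hypothetical; the other IDs appear in the tables.
sorted_ids = {"00046": 3, "00047": 3, "00048": 3, "00001": 24}
backup_ids = {"00046": 8, "00047": 8, "00048": 8, "00001": 14}

same_grouping = as_partition(sorted_ids) == as_partition(backup_ids)
print(same_grouping)  # True: same groups, different labels
```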

As with the class interface, the stdout of the finished program can be retrieved, here from the returned object.

print(res.stdout)