Allan Espinosa

Splitting bioinformatics FASTA files

2009 October 8

I keep forgetting where I put my scripts in my home directories. Below is my Ruby script to split a large FASTA [1] file into smaller files of N sequences each (N is hard-coded to 2 here):

[sourcecode language="ruby"]
#!/usr/bin/env ruby
#
# Script: dumpseq.rb
# Description: Parses a BLAST FASTA file and dumps its sequences into
#              files, 2 sequences per file.
# Usage: dumpseq.rb [fasta_file]

require 'fileutils'

fasta_db = File.new(ARGV[0])

sno = 0 # sequences seen so far
d = 0   # output files created so far

file = nil

# A line starting with ">" begins a new sequence record.
fasta_db.each_line do |line|
  if line.start_with?(">")
    if sno % 2 == 0 # 2 seqs per query
      file.close if file != nil
      dir = sprintf("D%04d000", d / 1000)
      FileUtils.mkdir_p dir
      # short filenames
      fname = sprintf "SEQ%07d.fasta", d
      d += 1
      file = File.new("#{dir}/#{fname}", "w")
    end
    sno += 1
  end
  file << line if file # skip anything before the first header
end

file.close if file
fasta_db.close
[/sourcecode]
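The directory and file naming is easier to see in isolation. Here is a minimal sketch, assuming two hypothetical helpers (chunk_path and path_for_sequence, neither of which is in the script above) that reuse the script's two format strings:

```ruby
# Hypothetical helper (not part of dumpseq.rb): builds the output path for
# the d-th output file using the same format strings as the script.
def chunk_path(d)
  dir = sprintf("D%04d000", d / 1000) # integer division: 1000 files per directory
  fname = sprintf("SEQ%07d.fasta", d)
  "#{dir}/#{fname}"
end

# Hypothetical helper: with seqs_per_file sequences per file, the sequence
# with zero-based number sno ends up in output file sno / seqs_per_file.
def path_for_sequence(sno, seqs_per_file = 2)
  chunk_path(sno / seqs_per_file)
end

puts chunk_path(0)           # D0000000/SEQ0000000.fasta
puts chunk_path(999)         # D0000000/SEQ0000999.fasta
puts chunk_path(1000)        # D0001000/SEQ0001000.fasta
puts path_for_sequence(2001) # D0001000/SEQ0001000.fasta
```

So each directory holds at most 1000 files, and the directory name is just the first file index it contains.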

It's pretty hackish-looking. But then I found out that BioRuby [2] has wrappers for parsing FASTA files.

[1] http://en.wikipedia.org/wiki/Fasta
[2] http://www.bioruby.org