Table of Contents
Preface v
Chapter 1 Preparing Your Computing Environment 1
1.1 Buying Your Own Computer 1
1.2 Setting up a Computing Server 4
1.3 Establishing a Remote Connection to a Server 6
Chapter 2 Learning Basic Linux Commands 17
2.1 No Need to be a Linux Guru to use Linux Effectively 17
2.2 Folder (Directory) Operations 18
Controlling your command prompt 25
2.3 File Operations 32
2.4 Assignment of Permissions 39
The path 46
2.5 Understanding System Status 47
UNIX redirection and pipes 52
2.6 Other Useful Commands 54
Chapter 3 Checking Sequence Quality 59
3.1 Basic High-throughput Sequencing 59
3.2 Challenges of High-throughput Genome Sequencing 61
3.3 Standards of Quality Score 62
3.4 Quality Check 65
FastQC 65
FASTX-Toolkit 72
Chapter 4 Sequence Alignment 81
4.1 The Purpose of Sequence Alignment 81
Sequence assembly 83
4.2 Selection of the Sequence Alignment Tools 83
Burrows-Wheeler 85
The BWT encoding-decoding algorithm 88
4.3 Actual Operation of the Sequence Alignment 90
Download and installation of Bowtie 90
Executing sequence alignment 96
4.4 Sequence Alignment Results File Conversion 99
Downloading SAMtools 99
4.5 Using the Genome Browser 108
Chapter 5 Speeding-up with GPUs 117
5.1 Computational Advantages of the Graphics Card 117
5.2 Industry Standards and Usage of GPU Computing 119
5.3 Practical CUDA Applications in Bioinformatics 138
Preparing the reference sequence 140
Alignment with CUSHAW2-GPU 143
5.4 The Reason for the Limited Success of GPUs 145
Chapter 6 Establishing a Research Workflow Pipeline 147
6.1 Automating Your Computational Workflow 147
6.2 Scripting Language 148
Script command 150
6.3 Testing and Debugging 157
Keeping track of the current project 158
Complementing tests of code blocks 159
Calculating the execution time 160
6.4 Implementation Case Studies 162
6.5 Case Study of Common Mistakes 170
Mistake 1 Confusing mess of relative paths 170
Mistake 2 Failure to change the necessary permissions 172
Mistake 3 The disk becomes full during execution 172
Mistake 4 Ignoring cross-platform shell portability considerations 174
Chapter 7 Using a Bioinformatics Cloud Computing Platform 177
7.1 Simple Introduction to the Cloud Computing Platform 177
7.2 Amazon Web Services 178
7.3 Bioinformatics Cloud Computing Platforms 182
Logging in to use Galaxy services 184
Uploading sequence data 187
Sequence quality testing 195
Execution of sequence alignment 202
Selecting other Galaxy servers 205
Design and use of research workflows 207
Establishing new research workflows 207
Sharing and publishing process 209
Execution of research workflows 212
Downloading or exporting research workflows 212
Importing research workflows 214
7.4 Installing and Setting up your Own Galaxy Server 215
Downloading the latest version of Galaxy 216
Starting your Galaxy server 217
Allowing external execution 220
Installation of bioinformatics tools 220
Adding new reference sequences 229
Appendix Learning Regular Expressions through Practising Simple Data Processing 235
Regular Expressions 236
One character pattern match 236
Numbering a file and printing line number of a hit 236
Counting number of grepped hits 237
UNIX redirection using pipes 237
Grep and output several lines of context around the hit 237
Grepping for non-matching lines 238
Grepping for unwanted characters 238
Mistake of logic 238
Egrep or grep -E: extended regular expression grep 239
Egrep and the character class 239
Egrep character class negation 240
Regular expression: Beginning of line anchor ^ 241
Case-sensitive and case-insensitive grep 241
Regular expression: End of line anchor 242
More about regular expressions 242
Even more regular expressions 244
Substitution with sed, awk and Perl 244
Using Excel to do data processing 250
Index 259