```
#CHROM  POS     ID          REF  ALT  QUAL     FILTER   INFO           FORMAT          NA12878
1       877664  rs3828047   A    G    3931.66  PASS     [ANNOTATIONS]  GT:AD:DP:GQ:PL  1/1:0,105:94:99:255,255,0
1       899282  rs28548431  C    T    71.77    PASS     [ANNOTATIONS]  GT:AD:DP:GQ:PL  0/1:1,3:4:26:103,0,26
1       974165  rs9442391   T    C    29.84    LowQual  [ANNOTATIONS]  GT:AD:DP:GQ:PL  0/1:14,4:14:61:61,0,255
```

Today at work, I heard something that seemed weird to me: Python will do some implicit work in order to compare data of different types with each other.

After digging a little bit on Stack Overflow, in CPython's documentation and in CPython's source on GitHub, here's what I have found:

Python will try hard to compare your objects.
1) If both objects have the same type and that type defines a tp_compare function (a type-specific comparison function), we call it and return its result. Old-style instances are special: their tp_compare wraps __cmp__, so we use the result only if __cmp__ is actually defined and gives an answer; otherwise we fall through to the next step. In this case, the comparison is done with an object-specific comparison function.
2) If this fails, we try the function try_rich_to_3way_compare
3) If this fails, we try the function try_3way_compare
4) If this fails too, we return the result of the default comparison function.

Here is the default comparison function:
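Since I can't paste the whole C function, here is a rough Python 2 paraphrase of its logic (a sketch, not the actual Objects/object.c source):

```python
# A rough Python 2 paraphrase of CPython 2.x's default_3way_compare
# (Objects/object.c); a sketch of its logic, not the actual C code.
def default_3way_compare(v, w):
    if type(v) is type(w):
        # Same type but no comparison defined: compare the objects'
        # memory addresses (id() stands in for the C pointers).
        return cmp(id(v), id(w))
    if v is None:                     # None is smaller than anything
        return -1
    if w is None:
        return 1
    # Different types: compare the type *names* alphabetically, with
    # every numeric type treated as an empty name, so that numbers
    # sort before everything else.
    vname = "" if isinstance(v, (int, long, float, complex)) else type(v).__name__
    wname = "" if isinstance(w, (int, long, float, complex)) else type(w).__name__
    if vname != wname:
        return cmp(vname, wname)
    # Same type name (e.g. two incomparable numeric types): last
    # resort, compare the addresses of the type objects themselves.
    return cmp(id(type(v)), id(type(w)))
```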

As you can see, by the time we arrive there, we are already in trouble. I expected this function to disappear in Python 3.x, since comparisons between different types fail there, as demonstrated below:
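A quick demonstration (the exact TypeError wording varies across Python 3 versions; this is the 3.x message up to 3.5):

```python
# Python 2.7: cross-type comparisons silently "work"...
>>> "" > 0
True
>>> [] > float("inf")
True

# Python 3.x: ...while the very same comparison raises an error.
>>> "" > 0
Traceback (most recent call last):
  ...
TypeError: unorderable types: str() > int()
```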

Here's what I found about this on Stack Overflow, in an answer by Alex Martelli (who is kind of a famous Python programmer):

The original design motivation for allowing order-comparisons of arbitrary objects was to allow sorting of heterogeneous lists — usefully, that would put all strings next to each other in alphabetical order, and all numbers next to each other in numerical order, although which of the two blocks came first was not guaranteed by the language. For example, this allowed getting only unique items in any list (even one with non-hashable items) in O(N log N) worst-case time.

Over the years, this pragmatical arrangement was eroded. The first crack was when the ability to order-compare complex numbers was taken away, quite a few versions ago: suddenly, the ability to sort any list disappeared (did not apply any more if the list contained complex numbers, possibly together with items of other types). Then Guido started disliking heterogeneous lists more generally, and thus thinking that it didn’t really matter if such lists could be usefully sorted or not… because such lists should not exist in the first place, according to his new thinking. He didn’t do anything to forbid them, but was not inclined to accept any compromises to support them either.

Note that both changes move the balance a little bit away from the “practicality beats purity” item of the Zen of Python (which was written earlier, back when complex numbers still could be order-compared;-) — a bit more purity, a bit less practicality.

Nevertheless the ability to order-compare two arbitrary objects (as long as neither was a complex number;-) remained for a long time, because around that same time Guido started really insisting on maintaining strong backwards compatibility (a shift that’s both practical and pure;-).

So, it’s only in Python 3, which explicitly and deliberately removed the constraint of strong backwards compatibility to allow some long-desired but backwards incompatible enhancements (especially simplifications and removal of obsolete, redundant ways to perform certain tasks), that order comparison of instances of different types became an error.

So this historical and philosophical treatise is basically the only way to truly respond to your “why” question…!-)

http://stackoverflow.com/questions/2384078/why-is-0-true-in-python

With this answer, everything’s clearer! Thank you, Alex.

Resources:

CPython object.c on GitHub
Stack Overflow: Why is '' > 0 True in Python?
CPython 2.7 manual: comparisons
CPython 2.7 C API manual

I made a small program last week that takes as input a file containing genomic regions (a lot of them) and a pileup file. The intent was to output a file binned over the genomic regions given as input.

Thing is, I used to store my genomic regions in a defaultdict, using as key the chromosome they were on. So for each position in my pileup, Python would scan the chromosome's list, find the right region and return it (if the position fell inside one at all).
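Here is a minimal sketch of that naive structure (illustrative names and coordinates, not the original code):

```python
from collections import defaultdict

# Regions stored per chromosome: chrom -> [(start, end), ...]
regions = defaultdict(list)
regions["1"].append((877000, 900000))
regions["1"].append((974000, 975000))

def find_region(chrom, pos):
    # Linear scan over the chromosome's regions for every single
    # pileup position: this is the slow part.
    for start, end in regions[chrom]:
        if start <= pos <= end:
            return (start, end)
    return None

print(find_region("1", 899282))   # (877000, 900000)
```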


Hi,

This is a post that is meant to be updated many times. I will explain how to transform your genomic data in order to import it into cBioPortal.

Here is a description of the different data formats: https://github.com/cBioPortal/cbioportal/wiki/File-Formats

  1. Mutation files (VCF format)

To begin with, you need to transform your VCF files (which I generated using MuTect 1.6) into MAF files using vcf2maf.pl (NB: it currently supports VEP 83; I've sent a pull request with modifications for VEP 84).
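A typical invocation looks something like this (paths and sample IDs are placeholders; check the script's own help for the full option list):

```bash
perl vcf2maf.pl \
    --input-vcf  sample.vcf \
    --output-maf sample.maf \
    --tumor-id   TUMOR_01 \
    --normal-id  NORMAL_01
```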

If you need to batch-transform a whole lot of them, have a look at my GitHub page and get some inspiration from my (tiny) script (getMaf.sh).


I recently wanted to switch two of my 1 TB hard drives over to a 2 TB RAID 1 array. Since I couldn't just use mdadm to convert the two 1 TB drives into a RAID 1 array in place without any data loss (it seems I had misconfigured them in the past: each was recognized as a RAID 0 array with one missing drive), I had to transfer the files. And in order not to suffer too much interruption of service, they had to be synced.

To copy a partition, one might use dd, which copies the entire structure bit by bit. That is not a good solution here, since the partition sizes and file systems differ.

I decided to use rsync in the following way:
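In my case, the command looked something like this (the mount points are examples, not my actual paths):

```bash
rsync -auv --progress --stats /mnt/old-disk/ /mnt/raid1/
```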

The --progress, --stats and -v flags are there for detailed logging. The -u (update) flag copies only what is needed: no point in copying again files that were already transferred in a previous run. By default, -u relies on modification times; rsync can also compare files by checksum (the --checksum option), but that is much slower. The -a (archive) flag preserves a whole lot of metadata, such as ownership, permissions, modification times and more.

I strongly advise you not to turn on checksumming or compression, since you are working locally, and especially if you're dealing with very large files. It is a waste of time, memory, energy and, in the end, money.

Before running the rsync, shut down the daemons that use your old partition, so they won't produce any new files. Launch the rsync. Update their configuration so that they use the new path to your RAID array. Fire the daemons back up. You're good to go!

If you're a little bit paranoid like me, you like your network well secured, but you like it even more when there are several layers.
To secure SSH access from the outside, and in addition to the usual authentication methods, you can use port knocking (https://en.wikipedia.org/wiki/Port_knocking) to make a potential attacker's life a little bit harder.
The principle is simple: you define a secret sequence of packets that your server must receive in order to open a specific port for a short amount of time.
For instance, say I want to connect to my server from my phone, but I would still like the SSH port to appear closed whenever I am not trying to connect. I can define a specific temporal sequence, like one TCP SYN packet on port 46289, one UDP datagram on port 7392, one TCP packet on port 2736, a 10-second pause, and a last TCP SYN packet on port 8921, that the server will recognize without any permanent change to my iptables configuration. Once that sequence is respected, TCP port 22 will accept incoming connections from my IP address for 10 seconds. Since my firewall is configured to authorize already-established connections, this gives me access to my server without exposing any open port.

To set up this solution on Linux, you can use the daemon knockd. On Debian, install it with "apt-get install knockd". Then you will need to edit /etc/knockd.conf. The config file is pretty straightforward, and a lot of documentation is available.
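Here is a minimal sketch of what the configuration could look like for the SSH scenario above (the iptables commands are illustrative and must match your own firewall; note that, as far as I know, knockd cannot enforce the mid-sequence pause itself, only an overall seq_timeout):

```ini
# /etc/knockd.conf : a minimal sketch; adapt the iptables commands to
# your own firewall setup.
[options]
    UseSyslog

[openSSH]
    sequence      = 46289:tcp,7392:udp,2736:tcp,8921:tcp
    seq_timeout   = 30
    tcpflags      = syn
    # Open port 22 for the knocking IP, then close it 10 seconds later.
    start_command = /sbin/iptables -I INPUT -s %IP% -p tcp --dport 22 -j ACCEPT
    cmd_timeout   = 10
    stop_command  = /sbin/iptables -D INPUT -s %IP% -p tcp --dport 22 -j ACCEPT
```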

You could also add another layer of security by blacklisting, using logging and fail2ban. For instance, you could ban any IP that tries several wrong port-knocking combinations by pointing fail2ban at knockd's log file.

If you want config examples, I will gladly post some if you ask for it.

In bash, you cannot pass an array to a function the way you would in most other languages. You have to use the following trick: expand the array into the function's arguments, as if it were one big string of words.
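A minimal sketch of the trick (a lookup function; names are illustrative, not the original snippet):

```bash
#!/bin/bash
# Pass the array by expanding it into the function's argument list,
# then rebuild it from "$@" inside the function.
contains() {
    local needle=$1
    shift
    local haystack=("$@")            # rebuild the array from the args
    local item
    for item in "${haystack[@]}"; do
        if [[ $item == "$needle" ]]; then
            echo "True"
            return 0
        fi
    done
    echo "False"
    return 1
}

fruits=("apple" "banana" "cherry")
contains "banana" "${fruits[@]}"     # prints: True
```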

From there, it is easy to return something other than True and False, like an array index, or 0/1.

For my NAS server, I wanted to create RAID 1 arrays (i.e. every piece of data is copied onto two hard drives, or more, but in my case two, since storage devices are kind of expensive).

To create, remove, and manipulate logical RAID arrays, you need the mdadm tool. cfdisk can be useful if you'd rather not touch your partition table without at least a minimal GUI. I often use it instead of fdisk when I mess with my disks; it gives me the little feeling of comfort that I always lack when a single mistyped letter could lose me all my data.

On Debian: sudo apt-get install mdadm cfdisk

  1. Preparing the disks

Once those are installed, you need to locate your two newly added disks. I usually infer them from the output of a few commands:
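For instance:

```bash
# Any of these will do; look for the drives with the right size and no
# existing partitions (device names will differ on your machine).
lsblk            # block devices, with sizes and mount points
sudo fdisk -l    # partition tables
dmesg | tail     # kernel messages from freshly plugged disks
```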

Let's say those are /dev/sdc and /dev/sde in my case (which they really were, so you can see how tricky this is: choose wrongly, and you will almost certainly feel bad about having messed up your system).


Here you will find a small callback function I just wrote for my pipelines in bash. It watches processes and calls the function you pass as an argument once all the processes whose PIDs are stored in the $pids array are over (meaning ps $pid returns 1 instead of 0).
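A sketch of the function (names are illustrative):

```bash
#!/bin/bash
# Call $1 (a function name) once every process whose PID is stored in
# the global $pids array has exited, i.e. once ps returns 1 for all
# of them.
wait_and_callback() {
    local callback=$1
    local pid
    for pid in "${pids[@]}"; do
        while ps -p "$pid" > /dev/null 2>&1; do
            sleep 1                  # still running: poll again
        done
    done
    "$callback"                      # everything is done: fire it
}

all_done() { echo "all pipeline steps finished"; }

pids=()
sleep 2 & pids+=($!)
sleep 3 & pids+=($!)
wait_and_callback all_done
```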

Seeking the driver in tumours with apparent normal molecular profile on comparative genomic hybridization and targeted gene panel sequencing: what is the added value of whole exome sequencing?


Background Molecular tumour profiling technologies have become increasingly important in the era of precision medicine, but their routine use is limited by their accessibility, cost, and tumour material availability. It is therefore crucial to assess their relative added value to optimize the sequence and combination of such technologies.

Patients and methods Within the MOSCATO-01 trial, we investigated the added value of whole exome sequencing (WES) in patients who did not present any molecular abnormality on array comparative genomic hybridization (aCGH) and targeted gene panel sequencing (TGPS) using cancer-specific panels. The pathogenicity potential and actionability of mutations detected on WES was assessed.

Results Among 420 patients enrolled between December 2011 and December 2013, 283 (67%) were analysed with both TGPS and aCGH. The tumour samples of 25 of them (8.8%) presented a flat (or low-dynamic) aCGH profile and no pathogenic mutation on TGPS. We selected the first 10 eligible samples, corresponding to a heterogeneous cohort of different tumour types, to perform WES. This allowed identifying eight mutations of interest in two patients: FGFR3, PDGFRB, and CREBBP missense single-nucleotide variants (SNVs) in an urothelial carcinoma; FGFR2, FBXW7, TP53, and MLH1 missense SNVs as well as an ATM frameshift mutation in a squamous cell carcinoma of the tongue. The FGFR3 alteration had been previously described as an actionable activating mutation and might have resulted in treatment with an FGFR inhibitor. The CREBBP and ATM alterations might also have suggested a therapeutic orientation towards epigenetic modifiers and ataxia-telangiectasia and Rad3-related inhibitors, respectively.

Conclusion The therapeutic added value of performing WES on tumour samples that do not harbour any genetic abnormality on TGPS and aCGH might be limited and variable according to the histotype. Alternative techniques, including RNASeq and methylome analysis, might be more informative in selected cases.

Link: http://annonc.oxfordjournals.org/content/early/2015/12/28/annonc.mdv570

Recently, I had to fix an old SUN server (SF 880) that went down unexpectedly. The corresponding KVM switch wouldn’t display anything, so our sysadmin suggested I use the RSC console.

RSC console is a remote administration console similar to Dell’s iDRAC. Except it’s old and can be accessed via telnet. If you cannot manage to get your SUN server running anymore, you might be able to fix it using the RSC console.

By typing "console" into the RSC interface, you get a minimal shell that allows you to run some utilities, such as fsck. For me, that did the trick: repairing the file systems with fsck allowed me to bring the server back up.

RSC also allows you to display some information about your server (using the command "environment"), such as power status or system temperature.

This page helped me a lot: https://docs.oracle.com/cd/E19713-01/816-3314-12/ucm_shell_chap.html

Here you'll find a file that might be useful to you. It summarizes, in an INI-like format that is easily parsable, genes involved in carcinogenesis or cancer progression.

The data was extracted from KEGG this year (2016) and contains well-known pathways such as MAPK or BER (base-excision repair), processes (apoptosis and the cell cycle), and pathology-associated genes (melanoma, bladder cancer, …). It also contains two generic classes: miRNA in cancer, and cancer-associated.

You can add to this list by using the following format:
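Something along these lines (an illustrative sketch: one bracketed section per pathway or category, one gene symbol per line; the file itself is the reference for the exact layout):

```ini
[MAPK signaling pathway]
KRAS
BRAF
MAP2K1
```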

Here is the file: Cancer_genes_KEGG

If you have any other cancer-related list of genes, I would be glad to hear about it!

I needed a data structure that would act like a dictionary with parental inheritance for my config object in Pypes.

Here's the code, in case you ever need something like that. I will add autovivification later!
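Here is the idea in a minimal sketch (illustrative names, much shorter than the real Pypes structure):

```python
class InheritDict(dict):
    """A dict that falls back to a parent dict on missing keys."""
    def __init__(self, parent=None, *args, **kwargs):
        dict.__init__(self, *args, **kwargs)
        self.parent = parent

    def __missing__(self, key):
        # Called by dict.__getitem__ when the key isn't found here:
        # delegate the lookup to the parent, recursively.
        if self.parent is not None:
            return self.parent[key]
        raise KeyError(key)

defaults = InheritDict(None, {'threads': 1, 'genome': 'hg19'})
step = InheritDict(defaults, {'threads': 8})
print(step['threads'])   # 8       (overridden locally)
print(step['genome'])    # 'hg19'  (inherited from the parent)
```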

Sometimes, it is faster to tackle the problem yourself than to try to reuse what has already been done.

I first tried to make defaultdict aware of a defaultdict parent; then I tried to make Raymond Hettinger's solution work for me (very nice, by the way: it inspired me a lot, and a good part of the structure's code is actually his); but I had to stop trying to reuse everything and start coding by myself to really get what I needed. I also tried to improve on my PipelineTree structure with a different tree walking, but I had a hard time making it work.


If you want to:

  • count the occurrences of lines with ‘x’ occurrences of a character,
  • extract the memory required by a Java program,
  • stall a Java program until you get enough memory, using bash, awk and perl,

then the sketches below may give you a starting point.
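These reconstruct the first and third items (file names and the memory threshold are examples):

```bash
# First item: count the lines containing exactly 3 commas
# (NF - 1 is the number of field separators on a line).
awk -F',' 'NF - 1 == 3 { n++ } END { print n + 0 }' file.txt

# Third item: stall until enough memory is free, then launch the JVM
# (threshold and jar name are examples).
needed_mb=4096
while [ "$(free -m | awk '/^Mem:/ { print $4 }')" -lt "$needed_mb" ]; do
    sleep 10
done
java -Xmx"${needed_mb}m" -jar myprogram.jar
```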

If you want to see some standard python library source code, go to https://hg.python.org/cpython/file/<python-version>/Lib/

For instance, with python 3.5: https://hg.python.org/cpython/file/3.5/Lib/

If you ever get stuck on a problem with some part of the Python standard library, like I was, go look at the source code there. Today was my first time reading it, and it is really great and inspiring to see!

You also have a local copy, the one Python itself is using; for me (Debian sid) it is located at: /usr/lib/<python-version>/
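For example, you can ask Python itself where a standard library module's source lives:

```python
import inspect
import json

# Prints something like /usr/lib/python3.5/json/__init__.py
print(inspect.getsourcefile(json))
```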

Here is the online documentation: https://docs.python.org/3.5/

Hi,

In bioinformatics, you often need to make a lot of tools work together as a pipeline.

If you decide to go for shell scripts, you're fine, there are no real complications. But you will not get all the good things objects can bring you, and you will probably find yourself limited by the language.

So I decided to write yet another pipeline, one that would replace them all, in Python 2.7. One thing a pipeline should do well is launch multiple tasks asynchronously. Another thing a pipeline should handle is event signals. (NB: there's a pretty good module that provides signal handling: PyDispatcher.)

Here, I will present my solution for asynchronous task launching. It also features some pretty basic "design by contract" checks. I designed the system so that Python newbies could add new tools to the pipeline too, without caring too much about the technical part. It turned out to double as a decorator course for my co-workers.
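To give the flavour, here is a minimal sketch (illustrative and much simpler than the real thing): one decorator checks a simple precondition, which is the design-by-contract part, and another launches the wrapped step in its own process.

```python
import os
from multiprocessing import Process

def async_task(func):
    """Launch the wrapped pipeline step in its own process."""
    def wrapper(*args, **kwargs):
        proc = Process(target=func, args=args, kwargs=kwargs)
        proc.start()
        return proc                      # the caller can join() it later
    return wrapper

def requires_files(*paths):
    """Contract: refuse to launch the step if an input file is missing."""
    def decorator(func):
        def wrapper(*args, **kwargs):
            for path in paths:
                if not os.path.exists(path):
                    raise IOError("missing input: %s" % path)
            return func(*args, **kwargs)
        return wrapper
    return decorator

@requires_files("sample.bam")            # hypothetical input file
@async_task
def index_bam():
    os.system("samtools index sample.bam")   # example tool invocation

if __name__ == "__main__":
    proc = index_bam()                   # raises before launching if the
    proc.join()                          # input is missing; waits otherwise
```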


I was looking for a tool that would let me draw UML diagrams. After looking at ArgoUML, which does not support all kinds of UML diagrams (e.g. component diagrams), I found a really nice and promising tool: draw.io. I suggest you look it up if you do not need advanced features like automatic code generation. I will post some diagrams made with it once they're done!

Here’s an example of things that can be done with it:

[Figure: Component Diagram]

This post describes the procedure to follow in order to create encrypted logical RAID 1 arrays using mdadm and cryptsetup (LUKS).

  • First, create a partition on each of the two disks.

I created a whole-disk partition of type "FD" (Linux raid autodetect) on each:
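Device names below are examples; double-check yours.

```bash
# cfdisk gives you a minimal GUI: create one whole-disk partition of
# type FD on each drive.
cfdisk /dev/sdc
cfdisk /dev/sdd
```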

  • Then, create the RAID array using mdadm.
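A single mdadm command does it (device names are examples):

```bash
# Build a RAID 1 array out of the two new partitions.
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdc1 /dev/sdd1
```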

  • Now for the encryption part, using LUKS. LUKS allows you to encrypt a partition or even a whole disk.
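The sequence looks like this (the mapping name cryptraid and the mount point are examples):

```bash
cryptsetup luksFormat /dev/md0            # asks for the passphrase
cryptsetup luksOpen /dev/md0 cryptraid    # creates /dev/mapper/cryptraid
mkfs.ext4 /dev/mapper/cryptraid           # put a filesystem on it
mount /dev/mapper/cryptraid /mnt/raid
```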

At first, I was thinking about using USB flash drives to hold the keys, and erasing the one that stays plugged in, in case of emergency. But this article made me understand that you cannot securely remove a file from a flash drive. Flash drives should not be considered safe for another reason either: data loss. Consider the situation: if the key data gets corrupted, you might not be able to unlock your LUKS partition anymore.

I would advise using a strong passphrase as the key when cryptsetup asks for one, or looking into professional solutions such as hardware security tokens if you are willing to pay to be safe.

To finish, here are two small scripts that will let you mount and unmount your newly created RAID/LUKS volume:
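Sketches only; device, mapping and mount-point names are examples:

```bash
#!/bin/bash
# mount_raid.sh
cryptsetup luksOpen /dev/md0 cryptraid
mount /dev/mapper/cryptraid /mnt/raid
```

```bash
#!/bin/bash
# umount_raid.sh
umount /mnt/raid
cryptsetup luksClose cryptraid
```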

When I get the time, I will post some information about the attack vectors I am aware of against this kind of setup.

Hope this will help some people!