裏 RjpWiki

Julia ときどき R, Python によるコンピュータプログラム,コンピュータ・サイエンス,統計学

Julia で VolcanoPlot

2021年07月15日 | ブログラミング

一部でよく使われている散布図の一種ということで,R で ggplot を使ったり,Python でも書かれたりしているので,Julia でもやって見ようと思う次第。

以下が,特に参考になった。

19.11 Volcano plots
https://biocorecrg.github.io/CRG_RIntroduction/volcano-plots.html

順をおって Julia に書き換えていく。

# The RDS format is used to save a single R object to a file, and to restore it.
# Extract that object in the current session:
# tmp <- readRDS("de_df_for_volcano.rds")

# import Pkg; Pkg.add("RData")
using RData # for load()
tmp = load("de_df_for_volcano.rds"); # read *.rds file

typeof(tmp) # DataFrames.DataFrame
size(tmp)   # (21276, 3)

first(tmp, 5);
#=
5 rows × 3 columns

 gene_symbol log2FoldChange pvalue
 String Float64 Float64⍰
1 Ndufa9 -0.876454 3.38515e-34
2 Rpp21 -0.976487 1.56044e-26
3 Amotl1 -0.758269 2.16481e-24
4 Gm44064 -0.848502 5.12505e-20
5 Hcfc1r1 -0.602129 3.7995e-17
=#

last(tmp, 5);
#=
5 rows × 3 columns

 gene_symbol log2FoldChange pvalue
 String Float64 Float64⍰
1 Ccny 0.0210335 missing
2 Ccne2 -0.00542747 missing
3 Ptk2 -0.00718793 missing
4 Gm26762 -0.0127217 missing
5 Smg1 -0.00845303 missing
=#

# remove rows that contain 'missing' (NA in R) values
# de <- tmp[complete.cases(tmp), ]

using DataFrames
de = dropmissing(tmp);
size(de) # (13421, 3)
first(de, 5);
#=
5 rows × 3 columns

 gene_symbol log2FoldChange pvalue
 String Float64 Float64
1 Ndufa9 -0.876454 3.38515e-34
2 Rpp21 -0.976487 1.56044e-26
3 Amotl1 -0.758269 2.16481e-24
4 Gm44064 -0.848502 5.12505e-20
5 Hcfc1r1 -0.602129 3.7995e-17
=#
tail(de, 5);
#=
5 rows × 3 columns

 gene_symbol log2FoldChange pvalue
 String Float64 Float64
1 Gm28370 -0.0237516 0.999997
2 Shfm1 -0.0267844 0.999997
3 Epc1 0.0161881 0.999997
4 Star -0.0688607 0.999997
5 Gm14137 0.0368722 0.999997
=#

# The basic scatter plot: x is "log2FoldChange", y is "pvalue"
# ggplot(data=de, aes(x=log2FoldChange, y=pvalue)) + geom_point()

using Plots
pyplot(grid=false,
       size=(400, 400), label="")
plt = scatter(de.log2FoldChange, de.pvalue, markerstrokewidth=0,
              alpha=0.4, tick_direction=:out, grid=false,
              xlabel="log2FoldChange", ylabel="pvalue", label="")
savefig("fig1")

ここまでの図

# Convert the p-value into a -log10(p-value)
plt = scatter(de.log2FoldChange, -log10.(de.pvalue),
          markersize=2, markerstrokewidth=0, alpha=0.4,
          tick_direction=:out, tickfontsize=6,
          xlabel="log2FoldChange", ylabel="-log10(pvalue)",
          guidefontsize=8, label="")
# Add vertical lines for log2FoldChange thresholds, and one horizontal line for the p-value threshold
vline!([-0.6, 0.6], color=:red, linewidth=1, linestyle=:dot)
hline!([-log10(0.05)], color=:red, linewidth=1, linestyle=:dot)

# add a column of NAs
# de$diffexpressed <- "NO"
# if log2Foldchange > 0.6 and pvalue < 0.05, set as "UP"
# de$diffexpressed[de$log2FoldChange > 0.6 & de$pvalue < 0.05] <- "UP"
# if log2Foldchange < -0.6 and pvalue < 0.05, set as "DOWN"
# de$diffexpressed[de$log2FoldChange < -0.6 & de$pvalue < 0.05] <- "DOWN"
insertcols!(de, 4, :diffexpressed => "NO");
de.diffexpressed[(de.log2FoldChange .> 0.6) .& (de.pvalue .< 0.05)] .= "UP";
de.diffexpressed[(de.log2FoldChange .< -0.6) .& (de.pvalue .< 0.05)] .= "DOWN";
savefig("fig2.png")

ここまでの図

# Re-plot but this time color the points with "diffexpressed"
# p <- ggplot(data=de, aes(x=log2FoldChange, y=-log10(pvalue), col=diffexpressed)) + geom_point() + theme_minimal()

assigncolor(x) = x == "NO" ? :black : x == "DOWN" ? :blue : :red
plt2 = scatter(de.log2FoldChange, -log10.(de.pvalue),
          color=assigncolor.(de.diffexpressed),
          markersize=3, markerstrokewidth=0, alpha=0.5,
          tick_direction=:out, tickfontsize=6,
          xlabel="log2FoldChange", ylabel="-log10(pvalue)",
          guidefontsize=8, label="")
vline!([-0.6, 0.6], color=:red, linewidth=1, linestyle=:dot)
hline!([-log10(0.05)], color=:red, linewidth=1, linestyle=:dot)
ypos = (maximum(-log10.(de.pvalue))-log10.(0.05)) / 1.5
annotate!(-0.6, ypos, text(" DOWN", 8, :blue, :left))
annotate!( 0.6, ypos, text("UP ",   8, :red,  :right))
savefig("fig3.png")

ここまでの図

# Now write down the name of genes beside the points...
# Create a new column "delabel" to de, that will contain the name of genes differentially expressed (NA in case they are not)

今までのにプラスして,

for i = 1:size(de, 1)
      x = de[i, :log2FoldChange]
      y = -log10(de[i, :pvalue])
      d = de[i, :diffexpressed]
      if d == "DOWN"
            annotate!(x, y,
                  text(" $(de.gene_symbol[i]) ", 6, assigncolor(d),
                  y > 15 ? :left : :right))
      elseif d == "UP"
            annotate!(x, y,
                  text(" $(de.gene_symbol[i]) ", 6, assigncolor(d),
                  y > 15 ? :right : :left))
      end
end
savefig("fig4.png")

最終の図

ただし,ラベルがグッチャリとしているのは解消できず。

このあと,一般化した関数を提示する予定。

 

コメント
  • X
  • Facebookでシェアする
  • はてなブックマークに追加する
  • LINEでシェアする

PVアクセスランキング にほんブログ村

PVアクセスランキング にほんブログ村