
Recipe: NetSurfP 1.0

created by LociOiling

Profile


Name
NetSurfP 1.0
ID
102278
Shared with
Public
Parent
None
Children
Created on
January 08, 2017 at 23:27 PM UTC
Updated on
January 08, 2017 at 23:27 PM UTC
Description

Convert NetSurfP webpage output into secondary structure prediction and copy-and-paste spreadsheet format.

Best for


Code


--[[
    NetSurfP - convert NetSurfP format

    NetSurfP, www.cbs.dtu.dk/services/NetSurfP/, takes a primary sequence
    and outputs predicted surface accessibility and secondary structure.

    "Surface accessibility" seems to be more or less the inverse of
    what's called "predicted residue burial" in Foldit. It's the chances
    that a given residue will be on the outside of a protein.

    NetSurfP outputs most of its values as probabilities, and it uses a
    columnar format. Unfortunately, the formatting used on the output page
    does not lend itself to being copied and pasted into a spreadsheet.

    This recipe converts the columnar format of NetSurfP output to a
    tab-delimited format which can be copied and pasted into a spreadsheet.

    The recipe also attempts to create a Foldit secondary structure string
    from the NetSurfP probabilities.

    version 1.0 -- 2017/01/08 -- LociOiling
    * new recipe
]]--
--
--  Globals
--
Recipe = "NetSurfP"
Version = "1.0"
ReVersion = Recipe .. " v." .. Version
--
--  end of globals section
--
function NSPReader ( nspentry )
    local linecnt = 0
    local comments = 0
    local unknown = 0
    local louts = ""    -- whole shebang in spreadsheet format
    local ssp = ""      -- ss prediction
    --
    --  manifest constants to column positions,
    --  change these if NetSurfP format changes
    --
    local PHELIX = 8
    local PSHEET = 9
    local PLOOP = 10
    --
    --  column header, also needs attention in input format changes
    --
    local cHead = "\"burial\"\t\"aa\"\t\"seqnam\"\t\"segnum\"\t\"rsa\"\t\"absaccess\"\t\"zFit\"\t\"pHelix\"\t\"pSheet\"\t\"pLoop\"\n"

    for line in nspentry:gmatch ( "(.-)[\n*\r*]" ) do
        if line ~= nil then
            local pHelix = 0
            local pSheet = 0
            local pLoop = 0
            if line:match ( "#" ) then
                comments = comments + 1
            else
                local lout = ""
                local col = 0
                for toke in line:gmatch ( "[%S]+" ) do
                    if lout:len () > 0 then
                        lout = lout .. "\t"
                    end
                    if tonumber ( toke ) == nil then
                        lout = lout .. "\"" .. toke .. "\""
                    else
                        lout = lout .. toke
                    end
                    col = col + 1
                    --
                    --  convert to numbers so the comparisons below are
                    --  numeric and never mix strings with numbers
                    --
                    if col == PHELIX then
                        pHelix = tonumber ( toke ) or 0
                    elseif col == PSHEET then
                        pSheet = tonumber ( toke ) or 0
                    elseif col == PLOOP then
                        pLoop = tonumber ( toke ) or 0
                    end
                end
                --
                --  skip empty lines (for example, the gap between CR and LF),
                --  so they don't add blank rows or spurious "L" predictions
                --
                if col > 0 then
                    lout = lout .. "\n"
                    if louts:len () == 0 then
                        louts = louts .. cHead
                    end
                    louts = louts .. lout
                    --
                    --  pick highest probability for secondary structure
                    --
                    local pred = "L"
                    if pHelix > pLoop then
                        if pHelix > pSheet then
                            pred = "H"
                        else
                            pred = "E"
                        end
                    else
                        if pSheet > pLoop then
                            pred = "E"
                        end
                    end
                    ssp = ssp .. pred
                end
            end
        end
        linecnt = linecnt + 1
    end
    print ( "number of lines = " .. linecnt )
    print ( "number of comments = " .. comments )
    return louts, ssp
end

function GetNetSurfP ()
    local dlog = dialog.CreateDialog ( ReVersion )
    local tab = ""
    dlog.tab0 = dialog.AddLabel ( "NetSurfP Output" )
    dlog.tab = dialog.AddTextbox ( "output", tab )
    dlog.u0 = dialog.AddLabel ( "" )
    dlog.u1 = dialog.AddLabel ( "Usage: copy the NetSurfP output for a sequence" )
    dlog.u2 = dialog.AddLabel ( "and paste into the output box" )
    dlog.w0 = dialog.AddLabel ( "" )
    dlog.ok = dialog.AddButton ( "OK" , 1 )
    dlog.exit = dialog.AddButton ( "Exit" , 0 )
    if ( dialog.Show ( dlog ) > 0 ) then
        tab = dlog.tab.value
        return tab
    else
        return ""
    end
    return tab
end

function ShowResults ( csv, ssp )
    local dlog = dialog.CreateDialog ( ReVersion )
    dlog.tab0 = dialog.AddLabel ( "NetSurfP Reformatted Output" )
    dlog.csv = dialog.AddTextbox ( "csv", csv )
    dlog.ssp = dialog.AddTextbox ( "SS pred", ssp )
    dlog.u0 = dialog.AddLabel ( "" )
    dlog.u1 = dialog.AddLabel ( "csv is \"comma separated values\" for spreadsheet" )
    dlog.u2 = dialog.AddLabel ( "SS pred is secondary structure prediction" )
    dlog.v0 = dialog.AddLabel ( "" )
    --
    --  each label needs its own key; reusing u1, u2, and w0 here
    --  would overwrite the entries assigned above
    --
    dlog.u3 = dialog.AddLabel ( "Usage: use select all and copy, cut, or paste" )
    dlog.u4 = dialog.AddLabel ( "to save or change secondary structure" )
    dlog.w0 = dialog.AddLabel ( "" )
    dlog.w1 = dialog.AddLabel ( "Windows: ctrl + a = select all" )
    dlog.w2 = dialog.AddLabel ( "Windows: ctrl + x = cut" )
    dlog.w3 = dialog.AddLabel ( "Windows: ctrl + c = copy" )
    dlog.w4 = dialog.AddLabel ( "Windows: ctrl + v = paste" )
    dlog.z0 = dialog.AddLabel ( "" )
    dlog.ok = dialog.AddButton ( "OK" , 1 )
    dialog.Show ( dlog )
end

function main ()
    print ( ReVersion )
    print ( "Puzzle: " .. puzzle.GetName () )
    print ( "Track: " .. ui.GetTrackName () )

    local changeNum = 0
    local nsp = ""
    nsp = GetNetSurfP ()
    if nsp:len () > 0 then
        local csv = ""
        local ssp = ""
        csv, ssp = NSPReader ( nsp )
        if csv ~= nil and csv:len () > 0
        and ssp ~= nil and ssp:len () > 0 then
            ShowResults ( csv, ssp )
            print ( "run #" .. changeNum )
            print ( "---spreadsheet format---" )
            print ( csv )
            print ( "---secondary structure prediction---" )
            print ( ssp )
        else
            print ( "no results, input format may be wrong" )
        end
    end
    cleanup ()
end

function cleanup ( errmsg )
    --
    --  optionally, do not loop if cleanup causes an error
    --  (any loop here is automatically terminated after a few iterations, however)
    --
    if CLEANUPENTRY ~= nil then
        return
    end
    CLEANUPENTRY = true
    print ( "---" )
    --
    --  model 100 - print recipe name, puzzle, track, time, score, and gain
    --
    local reason
    local start, stop, line, msg
    if errmsg == nil then
        reason = "complete"
    else
        --
        --  model 120 - civilized errmsg reporting,
        --  thanks to Bruno K. and Jean-Bob
        --
        start, stop, line, msg = errmsg:find ( ":(%d+):%s()" )
        if msg ~= nil then
            errmsg = errmsg:sub ( msg, #errmsg )
        end
        if errmsg:find ( "Cancelled" ) ~= nil then
            reason = "cancelled"
        else
            reason = "error"
        end
    end
    print ( ReVersion .. " " .. reason )
    print ( "Puzzle: " .. puzzle.GetName () )
    print ( "Track: " .. ui.GetTrackName () )
    if reason == "error" then
        print ( "Unexpected error detected" )
        -- line is nil when the message has no ":<number>: " prefix
        print ( "Error line: " .. tostring ( line ) )
        print ( "Error: \"" .. errmsg .. "\"" )
    end
end

xpcall ( main, cleanup )

Comments


LociOiling Lv 1

NetSurfP is yet another web-based secondary structure prediction service. (JPred is another.)

NetSurfP outputs its results in a columnar format. The predictions for helix, sheet, and loop are expressed as probabilities.

(NetSurfP also predicts the "surface accessibility" of a given residue, which seems to be more or less the inverse of the likelihood that the residue is buried in the hydrophobic core.)

The formatting for the NetSurfP results doesn't lend itself to being pasted directly into a spreadsheet.

This recipe does two things. First, it converts the NetSurfP output to a tab-delimited format that can be pasted into a spreadsheet. Second, it creates a secondary structure string. Both formats can be found in print protein 2.4.
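
As a rough illustration of the first step, here is a minimal standalone sketch of the column-to-tab conversion for a single data line. The name `ToTabRow` is hypothetical and is not part of the recipe; the sketch assumes fields are separated by whitespace and that any field Lua cannot read as a number should be quoted, which mirrors what the recipe appears to do.

```lua
-- Hypothetical helper (not in the recipe): turn one whitespace-delimited
-- NetSurfP data line into a tab-delimited row, quoting non-numeric fields.
function ToTabRow ( line )
    local fields = {}
    for toke in line:gmatch ( "%S+" ) do
        if tonumber ( toke ) == nil then
            fields [ #fields + 1 ] = "\"" .. toke .. "\""   -- text field
        else
            fields [ #fields + 1 ] = toke                   -- numeric field
        end
    end
    return table.concat ( fields, "\t" )
end

-- first data line of the sample NetSurfP output
sampleLine = "E T  Sequence               1    0.865 120.003   0.476   0.003   0.003   0.994"
print ( ToTabRow ( sampleLine ) )
```

Quoting the text fields keeps a spreadsheet from trying to interpret values such as the one-letter amino acid codes.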

To use this recipe, run a NetSurfP prediction, then copy the output to the clipboard. (It's not necessary to include the comment lines, but you can if you wish.)

Run the recipe, and paste the NetSurfP output into the textbox on the first screen. When you click OK, the second screen displays two text boxes, one containing the spreadsheet output, and one containing the secondary structure string.

The secondary structure string is created by picking the secondary structure type with the highest probability for each segment. The picking logic is quite simple, and doesn't worry about ties or close finishes.
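
The picking logic can be sketched on its own as follows. The function name `PickSS` and the `rows` table are hypothetical, introduced only for illustration; the comparisons follow the same order as the recipe, and the probabilities are the helix, sheet, and coil columns from the first ten lines of the sample output.

```lua
-- Hypothetical helper (not in the recipe): choose "H", "E", or "L" from the
-- three probabilities. Comparisons are strict, so ties are not treated
-- specially: they simply fall through to the later branch.
function PickSS ( pHelix, pSheet, pLoop )
    local pred = "L"
    if pHelix > pLoop then
        if pHelix > pSheet then
            pred = "H"
        else
            pred = "E"
        end
    elseif pSheet > pLoop then
        pred = "E"
    end
    return pred
end

-- pHelix, pSheet, pLoop for the first ten residues of the sample output
rows = {
    { 0.003, 0.003, 0.994 }, { 0.694, 0.003, 0.303 },
    { 0.782, 0.003, 0.216 }, { 0.858, 0.002, 0.139 },
    { 0.923, 0.002, 0.076 }, { 0.938, 0.007, 0.055 },
    { 0.970, 0.001, 0.030 }, { 0.970, 0.001, 0.030 },
    { 0.970, 0.001, 0.030 }, { 0.970, 0.001, 0.030 },
}

samplePrediction = ""
for i = 1, #rows do
    local row = rows [ i ]
    samplePrediction = samplePrediction .. PickSS ( row [ 1 ], row [ 2 ], row [ 3 ] )
end
print ( samplePrediction )
```

With this ordering, a helix/sheet tie that beats loop comes out as "E", and a helix/loop tie comes out as "L", which is what "doesn't worry about ties" means in practice.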

This recipe depends heavily on the NetSurfP output format. Any changes to NetSurfP output will require revisions to the recipe.

Sample NetSurfP output:

# For publication of results, please cite:
# A generic method for assignment of reliability scores applied to solvent accessibility predictions.
# Bent Petersen, Thomas Nordahl Petersen, Pernille Andersen, Morten Nielsen and Claus Lundegaard
# BMC Structural Biology 2009, 9:51 doi:10.1186/1472-6807-9-51
#
# Column 1: Class assignment - B for buried or E for Exposed - Threshold: 25% exposure, but not based on RSA
# Column 2: Amino acid
# Column 3: Sequence name
# Column 4: Amino acid number
# Column 5: Relative Surface Accessibility - RSA
# Column 6: Absolute Surface Accessibility
# Column 7: Z-fit score for RSA prediction
# Column 8: Probability for Alpha-Helix
# Column 9: Probability for Beta-strand
# Column 10: Probability for Coil
E T  Sequence               1    0.865 120.003   0.476   0.003   0.003   0.994
E E  Sequence               2    0.758 132.423   0.324   0.694   0.003   0.303
E E  Sequence               3    0.741 129.488   0.588   0.782   0.003   0.216
E R  Sequence               4    0.409  93.707   0.281   0.858   0.002   0.139
E K  Sequence               5    0.380  78.063   0.459   0.923   0.002   0.076
E K  Sequence               6    0.597 122.844   1.114   0.938   0.007   0.055
E E  Sequence               7    0.609 106.340   1.316   0.970   0.001   0.030
B I  Sequence               8    0.066  12.284  -0.022   0.970   0.001   0.030
E Q  Sequence               9    0.436  77.941   0.867   0.970   0.001   0.030
E K  Sequence              10    0.613 126.012   1.216   0.970   0.001   0.030
...

LociOiling Lv 1

I had to kludge a bit to get the line breaks on Windows correct. When pasted into a Foldit textbox on Windows, the lines apparently retain the CRLF format (0x0d0a). The Lua pattern used to read the pasted lines allows for a carriage return.

On Mac and Linux, the pasted lines will probably have just a newline (0x0a). Hopefully, the code allows for this, but it's not been tested on other platforms.
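
For anyone who wants to experiment with the line handling outside Foldit, here is one possible way to split pasted text that tolerates both CRLF and bare LF. This is a sketch, not the pattern the recipe uses: `SplitLines` is a hypothetical name, and it assumes lines end in either CRLF or LF (a lone carriage return is not treated as a line break).

```lua
-- Hypothetical helper (not in the recipe): split text into lines, accepting
-- either CRLF (0x0d0a) or LF (0x0a) as the line ending.
function SplitLines ( text )
    -- make sure the final line has a terminator so the pattern sees it
    if not text:match ( "\n$" ) then
        text = text .. "\n"
    end
    local lines = {}
    -- capture everything up to the line ending; the optional \r is dropped
    for line in text:gmatch ( "([^\r\n]*)\r?\n" ) do
        lines [ #lines + 1 ] = line
    end
    return lines
end

windowsText = "E T  Sequence 1\r\nE E  Sequence 2\r\n"
unixText = "E T  Sequence 1\nE E  Sequence 2"
print ( #SplitLines ( windowsText ), #SplitLines ( unixText ) )
```

Both inputs give the same two lines, with no trailing carriage return left on either one.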

Please send a PM if there are problems.

BarrySampson Lv 1

I may be missing something, but I can't work out how to save the output from this routine. My computer won't let me select the output in the text boxes.